
My Philosophy on Alerting

based on my observations while I was a Site Reliability Engineer at Google

Author: Rob Ewaschuk <rob@infinitepigeons.org>

 

Summary

When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:

      Pages should be urgent, important, actionable, and real.
      They should represent either ongoing or imminent problems with your service.
      Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
      You should almost always be able to classify the problem into one of: availability & basic functionality; latency; correctness (completeness, freshness and durability of data); and feature-specific problems.
      Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.
      Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
      The further up your serving stack you go, the more distinct problems you catch in a single rule.  But don't go so far you can't sufficiently distinguish what's going on.
      If you want a quiet oncall rotation, it's imperative to have a system for dealing with things that need timely response, but are not imminently critical.

Introduction

After seven years of being oncall for a variety of different services, including both massive- and small-scale, fast-moving products, and several parts of core infrastructure, I have developed a philosophy on monitoring and alerting.  It reflects my fundamental view on pages and pagers:

      Every time my pager goes off, I should be able to react with a sense of urgency.  I can only do this a few times a day before I get fatigued.
      Every page should be actionable; simply noting "this paged again" is not an action.
      Every page should require intelligence to deal with: no robotic, scriptable responses.

Overall, it's a bit aspirational, but it guided me when I wrote or reviewed paging rules in the monitoring systems.  These are some questions I ask when writing or reviewing a new rule that might result in a page:

      Does it detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?  Note that "N+0" zero-redundancy situations count as imminent, as do "nearly full and getting fuller" parts of your service, like storage being maxed out.
      Will I ever be able to ignore this rule, knowing it's benign?  When and why, and can I refine the rule to avoid this situation?
      Is it identifying a situation which is definitely (going to be) hurting users?  Are there detectable cases where it doesn't hurt users that should be filtered out?  Think things like server clusters with test-traffic, etc.
      Can I take action in response to this alert? Is that action urgent, or could it wait until after I wake up or the end of the weekend or next quarter?
      Are other people getting paged at the same time?  Are they going to fix the problem?  Or maybe I'm going to fix the problem for someone else?  Can we tie these things together?  Can my version of the rule wait a bit for them to try to fix it?

The ideas below are certainly aspirational—no pager rotation of a growing, changing service is ever as clean as it could be—but there are some tricks that get you much closer.

Vernacular

This document uses the following terms:
      page: anything that tries to urgently and actively get the attention of a specific human (e.g. via a pager or cell phone going beep beep beep)
      rule: any kind of logic for detecting some interesting condition in any monitoring system.
      alert: a manifestation of a rule that (intends to) reach a human, e.g. as a page, an email, a message in an IRC channel, an auto-filed ticket, etc.

Monitor for your users

I call this "symptom-based monitoring," in contrast to "cause-based monitoring". Do your users care if your MySQL servers are down?  No, they care if their queries are failing. (Perhaps you're cringing already, in love with your Nagios rules for MySQL servers? Your users don't even know your MySQL servers exist!)  Do your users care if a support (i.e. non-serving-path) binary is in a restart-loop?  No, they care if their features are failing.  Do they care if your data push is failing?  No, they care about whether their results are fresh.

Users, in general, care about a small number of things:

      Basic availability and correctness.  No "Oops!", no 500s, no hung requests or half-loaded pages or missing Javascript or CSS or images or videos.  Anything that breaks the core service in some way should be considered unavailability.
      Latency.  Fast.  Fast.  Fast.  Also, fast.
      Completeness/freshness/durability.  Your users' data should be safe, should come back when you ask, and search indices should be up-to-date.  Even if data is temporarily unavailable, users should have complete faith that it's coming back uncorrupted.
      Features. Your users care that all the features of the service work—you should be monitoring for anything that is an important aspect of your service even if it's not core functionality/availability (e.g. the Calculator and stock ticker showing up in search results).

That's pretty much it.  There's a subtle but important difference between database servers being unavailable and user data being unavailable.  The former is a proximate cause, the latter is a symptom.  You can't always cleanly distinguish these things, particularly when you don't have a way to mimic the client's perspective (e.g. a blackbox probe or monitoring their perspective directly). But when you can, you should.
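
To make that concrete, here's a minimal sketch of the two kinds of rule (the metric names and the toy numbers are made up; in practice the values would come from your monitoring system):

# A toy snapshot of monitoring data; in practice these values would come from
# your monitoring system (the metric names here are made up).
metrics = {
    "mysql.tasks_up": 2,
    "mysql.tasks_configured": 3,
    "frontend.responses_5xx_last_10m": 12,
    "frontend.responses_total_last_10m": 48000,
}

def cause_based_rule(m):
    # Fires whenever a MySQL task is down, even if users never notice
    # (failover worked, the instance was being drained, and so on).
    return m["mysql.tasks_up"] < m["mysql.tasks_configured"]

def symptom_based_rule(m):
    # Fires only when users are actually seeing errors, whatever the cause:
    # a dead database, a bad push, a full disk, a network partition, ...
    total = m["frontend.responses_total_last_10m"]
    return total > 0 and m["frontend.responses_5xx_last_10m"] / total > 0.01

print(cause_based_rule(metrics))    # True: a server is down...
print(symptom_based_rule(metrics))  # False: ...but users are fine, so no page.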

Cause-based alerts are bad (but sometimes necessary)

"But," you might say, "I know that unreachable database servers result in user data unavailability."  That's fine.  Alert on the data unavailability.  Alert on the symptom: the 500, the Oops!, the whitebox metric that indicates that not all servers were reached from the database's client.  Why?

      You're going to have to catch the symptom anyway.  Maybe it can happen because of network disconnection, or CPU contention, or myriad other problems you haven't thought of yet.  So you have to catch the symptom.
      Once you catch the symptom and the cause, you have redundant alerts; these need separate tuning, and result in either duplication or complicated dependency trees.
      The allegedly inevitable result is not always inevitable: maybe your database servers are unavailable because you're turning up a new instance or turning down an old one.  Or maybe a feature was added to do fast-failover of requests, and so you don't care anymore about a single server's availability.  Sure, you can catch all these cases with increasingly complicated rules, but why bother?  The failure mode is more bogus pages, more confusion, and more tuning, with no gain, and less time spent on fixing the alerts that matter.

But sometimes they're necessary.  There are (often) no symptoms to "almost" running out of quota or memory or disk I/O, etc., so you want rules to know you're walking towards a cliff.  Use these sparingly; don't write cause-based paging rules for symptoms you can catch otherwise.

Alerting from the spout (or beyond!)

The best alerts in a layered client/server system come from the client's perspective:

      The client sees the results of retries and the network latency between client & server, and has a better perspective on the user-facing latency and errors than the server does.
      In many cases the client (e.g. a mixer or application server) is aggregating responses from many backends, like caching services, databases, account management/authorization services, query shards, etc.  Your monitoring is more robust to changes in underlying infrastructure (and in application-level failover and retries) if you see what the client actually does.
      In many cases, the client can present a simpler view of the world than the backends. For example, if a request fans out to hundreds of query-servers, each query server has too limited a view of the world to be a useful source of alerting.

For many services, this means alerting on what your front-most load-balancers see in terms of latency, errors, etc.  This way you only see the result of broken servers if those results are making it to the user.  Conversely, you're seeing a bigger class of problems than you can see from your servers: if they're all down, or serving out uncounted 500s, or dropping 10% of connections on the floor, your load balancer knows but your server might not.
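
As a rough sketch of what a load-balancer-level rule might look like (the sample data and thresholds are invented for illustration; real numbers would come from your load balancers' stats or logs):

# Toy request samples as the load balancer sees them, already aggregated
# across every backend behind it: (HTTP status, end-to-end latency in ms).
samples = [(200, 45), (200, 60), (502, 3000), (200, 52), (200, 2400), (200, 48)]

def error_ratio(samples):
    return sum(1 for status, _ in samples if status >= 500) / len(samples)

def latency_percentile(samples, pct):
    # Crude nearest-rank percentile over the load balancer's samples.
    ordered = sorted(ms for _, ms in samples)
    return ordered[min(len(ordered) - 1, int(pct / 100 * len(ordered)))]

# One rule at the load balancer covers every backend behind it: a dead server,
# a slow shard or dropped connections all show up here, and only if users see them.
page = error_ratio(samples) > 0.05 or latency_percentile(samples, 99) > 1000
print(page)  # True: 1 of 6 requests failed and tail latency is terrible.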

Note that going too far can introduce agents that are beyond your control and responsibility.  If you can reliably capture a view of exactly what your users see (e.g. via browser-side instrumentation), that's great!  But remember that signal is full of noise—their ISP, browser, client-side load and performance—so it probably shouldn't be the only way you see the world.  It may also be lossy, if your external monitoring can't always contact you.  Taken to this kind of extreme, it's still a useful signal but maybe not one you want to page on.

Causes are still useful

Cause-based rules can still be useful.  In particular, they can help you jump quickly to a known deficiency in your production system.

If you gain a lot of value in automatically tying symptoms back to causes, perhaps because there are causes that are outside of your control to eliminate, I advocate this technique:

  1. When you write (or discover) a rule that represents a cause, check that the symptom is also caught.  If not, make it so.
  2. Print a terse summary of all of your cause-based rules that are firing in every page that you send out.  A quick skim by a human can identify whether the symptom they just got paged for has an already-identified cause.  This might look like:
TooMany500StatusCodes
Served 10.7% 5xx results in the last 3 minutes!
Also firing:
      JanitorProcessNotKeepingUp
      UserDatabaseShardDown
      FreshnessIndexBehind   
In this case it's clear that the most likely source of 500s is a database problem; if instead the firing symptom had been that a disk was getting full, or that result pages were coming back empty or stale, the other two causes might have been interesting.
  3. Remove or tune cause-based rules that are noisy or persistent or otherwise low-value.

Using this approach, the mental burden of the mistuned, noisy rules has been changed from a pager beep & ack (and investigation, and followup, and so on) to a single line of text to be skimmed over.  Finally, since you need clear debugging dashboards anyway (for problems that don't start with an alert), this is another good place to expose cause-based rules.
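
A minimal sketch of what step 2 could look like, reusing the rule names from the example above (how firing rules are stored and tagged is an assumption about your monitoring system):

# Rules currently firing, tagged as symptom- or cause-based; this dict is a toy
# stand-in for whatever state your monitoring system exposes.
firing = {
    "TooMany500StatusCodes": {"kind": "symptom",
                              "text": "Served 10.7% 5xx results in the last 3 minutes!"},
    "JanitorProcessNotKeepingUp": {"kind": "cause"},
    "UserDatabaseShardDown": {"kind": "cause"},
    "FreshnessIndexBehind": {"kind": "cause"},
}

def compose_page(rule_name, firing):
    # The page leads with the symptom, then appends a terse list of every
    # cause-based rule that is also firing, for a quick human skim.
    lines = [rule_name, firing[rule_name]["text"], "Also firing:"]
    lines += ["      " + name for name, rule in firing.items() if rule["kind"] == "cause"]
    return "\n".join(lines)

print(compose_page("TooMany500StatusCodes", firing))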

That said, if your debugging dashboards let you move quickly enough from symptom to cause to amelioration, you don't need to spend time on cause-based rules anyway.

Tickets, Reports and Email

One way or another, you have some alerts that need attention soon, but not right now.   I call these "sub-critical alerts".

      Bug or ticket-tracking systems can be useful.  Having alerts open a bug can work out great, as long as multiple firings of the same alert get correctly threaded into a single ticket/bug (a sketch of this threading follows this list).  This system fails if there's no accountability for triaging and closing bugs; if the alert-opened bugs might go unseen for weeks, it clearly fails as a way of dealing with sub-critical alerts before they become critical!  It also fails if your team is simply overloaded or is not assigning enough people to deal with followup; you need to be honest about how much time this is consuming, or you'll fall further and further behind.
      A daily (or more frequent) report can work too.  One way this can work is to write sub-critical rules that are long-lived (e.g. "the database is over 90% full" or "we've served over 1000 very slow requests in the last day"), and send out a report periodically that shows all currently-firing rules.  Again, without a system of accountability this amounts to less-spammy email alerts, so make sure the oncall person (or someone else) is designated to triage these every day (or every shift hand-off, or whatever works).
      Every alert should be tracked through a workflow system; don't just dump them into an email list or IRC channel.  In general, that quickly turns into specialized "foo-alerts" mailing lists or channels that exist to be summarily ignored.  Except as a brief period (usually days, at most weeks) to vet that a new rule won't page too often, this is almost always a bad idea.  It's also easy to lose track of the volume of these alerts, until some old, mis-tuned rule is suddenly firing every minute for all of your thousand application servers, clogging up mailboxes.  Oops.
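
Here's a minimal sketch of that threading behaviour (the ticket store is an in-memory stand-in; a real tracker would have its own API):

# A toy in-memory ticket store; in a real system this would be calls to your
# bug or ticket tracker's API (the fields here are made up).
open_tickets = {}   # alert name -> ticket dict

def file_or_thread(alert_name, detail):
    # Repeated firings of the same sub-critical alert are threaded into one
    # open ticket instead of opening a fresh ticket every time.
    ticket = open_tickets.get(alert_name)
    if ticket is None:
        ticket = {"alert": alert_name, "firings": [], "status": "open"}
        open_tickets[alert_name] = ticket
    ticket["firings"].append(detail)
    return ticket

file_or_thread("DatabaseOver90PercentFull", "shard-7 at 91%")
file_or_thread("DatabaseOver90PercentFull", "shard-7 at 93%")
print(len(open_tickets))                                          # 1 ticket, not 2
print(len(open_tickets["DatabaseOver90PercentFull"]["firings"]))  # 2 firings recorded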

The underlying point is to create a system that still has accountability for responsiveness, but doesn't have the high cost of waking someone up, interrupting their dinner, or preventing snuggling with a significant other.

Playbooks

Playbooks (or runbooks) are an important part of an alerting system; it's best to have an entry for each alert or family of alerts that catch a symptom, which can further explain what the alert means and how it might be addressed.

In general, if your playbook has a long detailed flow chart, you're potentially spending too much time documenting what could be wrong and too little time fixing it—unless the root causes are completely out of your control or fundamentally require human intervention (like calling a vendor).  The best playbooks I've seen have a few notes about exactly what the alert means, and what's currently interesting about an alert ("We've had a spate of power outages from our widgets from VendorX; if you find this, please add it to Bug 12345 where we're tracking things for patterns.")  Most such notes should be ephemeral, so a wiki or similar is a great tool.

Tracking & Accountability

Track your pages, and all your other alerts.  If a page is firing and people just say "I looked, nothing was wrong", that's a pretty strong sign that you need to remove the paging rule, or demote it or collect data in some other way.  Alerts that are less than 50% accurate are broken; even those that are false positives 10% of the time merit more consideration.
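
Here's a minimal sketch of the bookkeeping that makes the 50% question answerable (the log format is an assumption; the data would come from whatever you record during page reviews):

from collections import Counter

# Each tracked page: (rule name, whether the responder found real user impact).
page_log = [
    ("TooMany500StatusCodes", True),
    ("TooMany500StatusCodes", True),
    ("JanitorProcessNotKeepingUp", False),
    ("JanitorProcessNotKeepingUp", False),
    ("JanitorProcessNotKeepingUp", True),
]

def accuracy_by_rule(log):
    fired, real = Counter(), Counter()
    for rule, was_real in log:
        fired[rule] += 1
        real[rule] += was_real
    return {rule: real[rule] / fired[rule] for rule in fired}

for rule, acc in accuracy_by_rule(page_log).items():
    if acc < 0.5:
        print(f"{rule}: {acc:.0%} accurate; consider removing or demoting it")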

Having a system in place (e.g. a weekly review of all pages, and quarterly statistics) can help keep a handle on the big picture of what's going on, and tease out patterns that are lost when the pager is handed from one human to the next.

You're being naïve!

Yup, though I prefer the term "aspirational".  Here are some great reasons to break the above guidelines:

      You have a known cause that actually sits below the noise in your symptoms.  For example, if your service has 99.99% availability, but you have a common event that causes 0.001% of requests to fail, you can't alert on it as a symptom (because it's in the noise) but you can catch the causing event.  It might be worth trying to trickle this information up the stack, but maybe it really is simplest just to alert on the cause.  Caveat oncaller.
      You can't monitor at the spout, because you lose data resolution.  For example, maybe you tolerate some handlers/endpoints/backends/URLs being pretty slow (like a credit card validation compared to browsing items for sale) or low-availability (like a background refresh of an inbox).  At your load balancers, this distinction may be lost.  Walk down the stack and alert from the highest place where you have the distinction.
      Your symptoms don't appear until it's too late, like you've run out of quota. Of course, you need to page before it's too late, and sometimes that means finding a cause to page on (e.g. usage > 80% and will run out in < 4h at the growth rate of the last 1h).  But if you can do that, you should also be able to find a similar cause that's less urgent (e.g. quota > 90% and will run out in < 4d at the growth rate of the last 1d) that will catch most cases, and deal with that as a ticket or email alert or daily problem report, rather than the last-ditch escalation that a page represents.  A sketch of this kind of rule follows this list.
      Your alert setup sounds more complex than the problems it's trying to detect. Sometimes it will be.  The goal should be to tend towards simple, robust, self-protecting systems (how did you not notice that you were running out of quota? Why can't that data go somewhere else?).  In the long term, your rules should trend towards simplicity, but at any given time the local optimum may be a relatively complex set of rules that keeps things quiet and accurate.
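
As a toy example of the quota-style rules above (the numbers are made up, and a single growth rate stands in for the 1h and 1d windows in the text):

# A toy version of the quota rules above.
def hours_until_full(capacity, used, growth_per_hour):
    if growth_per_hour <= 0:
        return float("inf")
    return (capacity - used) / growth_per_hour

capacity_gb, used_gb, growth_gb_per_hour = 1000, 920, 5
left_h = hours_until_full(capacity_gb, used_gb, growth_gb_per_hour)   # 16 hours

page   = used_gb / capacity_gb > 0.80 and left_h < 4        # last-ditch escalation
ticket = used_gb / capacity_gb > 0.90 and left_h < 4 * 24   # sub-critical, handled in hours

print(page, ticket)   # False True: someone deals with it during working hours,
                      # long before anyone has to be woken up.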



May the queries flow, and your pagers be quiet.
