Musings about "On Call"
In the cloud-enabled world where everything runs 24/7, someone behind the scenes is usually on the hook to make sure that when things break off hours, a human reacts. On call is often a team-based rotation of a few days to a week, with varying degrees of guarantees. Some teams have SLAs that require an immediate reaction; others can tolerate more graceful degradation. In any case, at some point someone gets the dreaded page from PagerDuty.
I’ve been on call thousands of times in my career, with varying degrees of annoyance, and I got to thinking about how I view on call: what it is, what the responsibility is, and how to improve it when it’s painful.
What is it?
The on call person is a first responder. In my mind, their only job is to be around off hours to react when something horribly breaks. The entire team should react to things that go down during business hours; engineering is a team sport. Off hours, if paged, the on call person should be able to quickly check some dashboards, graphs, and logs, determine whether something is on fire, and decide whether they need to escalate to others for help or perform some obvious manual intervention to get the system stable.
That’s basically it.
What it isn’t
Defining on call is easy, but defining what on call isn’t requires a bit more exposition.
First, the on call person isn’t a dumping ground for “keep the lights on” tasks, and they certainly aren’t responsible for fixing the broken thing. Their only job is triage: provide an immediate response if the fix is obvious, and otherwise just ACK and raise the alarm.
Some teams like to have the on call person be the point of contact for team questions during the week, and that’s mostly fine, but again, those questions shouldn’t be too invasive. As with code, you can minimize interruptions by staying DRY: create FAQs, document your systems, and invest time into self-serve solutions. Being pestered all day is just as bad as being paged all day.
In fact, pages should be rare. For about four years I was the primary on call at a small startup (by nature of being the founding engineer on a very small team), and I was paged only a handful of times in that entire period. This was an ideal on call scenario! It’s not that we ignored pages; it’s that we prioritized making sure things didn’t page in the first place. That means making systems resilient and acting on every page to ensure it doesn’t happen again (unless things really are critically on fire).
The on call person should mostly be able to live their life without even realizing they are on call. It should not be impactful, and people should certainly feel empowered to go to the gym, shop, or eat out during their on call time. Paired with a secondary backup and a tertiary manager-based escalation, it’s very rare for all three people to be unavailable at once.
I have been on teams where people are terrified to be away from a laptop for more than a few minutes, and their lives grind to a halt during their on call time. I think it’s annoying and a sign of poor team health when people are coordinating minute to minute: “I will be AFK from 2:15 to 2:32.” This is not healthy, and is in fact extraordinarily toxic, because it self-selects for people without families, hobbies, caretaking responsibilities, etc. It doesn’t have to be this way!
Making on call better
What do you do if your on call is a dumpster fire? There are a handful of ways to improve the process, and if you have executive and leadership buy-in, it is absolutely possible to turn things around.1
The rule for improving on call long term is to act on every page. An ignored page is one that will come back. There are four actions that should be considered every single time a page occurs:
Tune - If the alert is too sensitive, tune it. Should the alert page you at all? Is it actually important that someone gets up at 2am to look at this? Should you adopt a multi-tiered paging model that distinguishes business-hours paging from off-hours paging? I discuss this topic in depth in my book “Building a Startup - A primer for the individual contributor” (due to be released this spring).
Remove - Ask yourself, does this alert even need to exist? If you just ack the alert and it goes away, or you mute it and move on, then delete it. Be ruthless in your decision making.
Fix - Regardless of whether the alert was real, is there a way to fix the underlying issue? If a queue is backing up and you are alerting on it, a fix could be to auto-scale workers based on queue size. If a third-party provider is failing with 429s (rate limited), can you slow down your request rate programmatically (see the sketch after this list)? What other ways can you find to make your system more resilient? Can you detect errors ahead of time and react to them? Can you repair broken data automatically? Can you use Twilio to call the support line of a bad banking processor partner2?
Escalate - If the world is actually on fire, the on call person is not going to fix it themselves and needs to put up the bat-signal. In my mind, this is the only acceptable time something should page someone.
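To make the “Fix” idea concrete, here’s a minimal sketch of slowing down programmatically when a third-party provider starts returning 429s, so the system rides out rate limiting on its own instead of paging a human. The endpoint, attempt count, and delays are made up for illustration.

```python
import time

import requests


def call_partner_api(url: str, max_attempts: int = 5) -> requests.Response:
    """Call a third-party endpoint, backing off automatically on 429s."""
    delay = 1.0
    for _ in range(max_attempts):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Respect the provider's Retry-After header when present
        # (assuming it's given in seconds, the common case),
        # otherwise fall back to exponential backoff.
        retry_after = resp.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

None of this removes the need for monitoring, but it turns a 2am page into a metric someone can review in the morning.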
There is no world where a page shouldn’t result in one of these actions. Teams that don’t take action are doomed to have tough on call rotations, and that breeds resentment and burnout. Teams that ignore pages or follow an “ack and wait” model because things “tend to resolve themselves” are basically telling each other “we don’t value your time”.
Alerts that actually do fire should include information about why they were set up in the first place: what are they monitoring, and why are they monitoring it?
Runbooks3 are commonly included on alerts, but in my mind the runbook should not be “here is how you fix it”; instead, it should be “here is how you validate the severity of this and get more details”. If a runbook describes a step-by-step procedure to resolve a page, can it be enforced programmatically? Obviously, something like “the MySQL instance needs to be restarted because this rare scenario has occurred” is not worth automating, but that alert should also be extraordinarily rare!
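As one possible shape for a “validate the severity” runbook, here’s a hedged sketch that pulls a couple of signals from a Prometheus server and prints them for the on call person. The Prometheus address, metric names, and thresholds are all assumptions for illustration, not anything from a real system.

```python
import requests

PROMETHEUS = "http://prometheus.internal:9090"  # hypothetical address


def instant_query(promql: str) -> float:
    """Run an instant PromQL query and return the first result's value (or 0)."""
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query", params={"query": promql}, timeout=5
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    # Hypothetical signals an on call person might glance at before deciding
    # whether to escalate or go back to bed.
    queue_depth = instant_query('sum(queue_depth{service="billing"})')
    error_rate = instant_query('sum(rate(http_requests_total{status=~"5.."}[5m]))')
    print(f"billing queue depth : {queue_depth:.0f}")
    print(f"5xx rate (req/s)    : {error_rate:.2f}")
    if queue_depth > 10_000 or error_rate > 1.0:
        print("looks severe -> escalate")
    else:
        print("not obviously on fire -> ack, note it, follow up in the morning")
```

Even a script this small beats a wall of prose in a wiki, because the on call person gets numbers instead of instructions at 2am.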
Reacting to pages takes a lot of work, especially if they’ve been neglected for a long time. But it is possible to make improvements; the team just needs to be militant about responding to each and every one. Avoid the temptation to carve out dedicated time to improve things; that never works. When you fall into the trap of “let’s have a sprint where we just squash bugs”, you’re telling the team that reacting continuously doesn’t matter and that it’s ok to ignore things when it’s not part of the regularly scheduled programming.
To me, off-hours pages should be treated like a catastrophe. If a page happened overnight and the entire world isn’t actually melting, it’s all hands on deck the next morning. Every person on the team should be clamoring to make improvements, because it could very well happen on their watch next time!
If an incident does happen, the team needs to do the work to make sure that the cause of that incident can never happen again. I like to frame a lot of my architectural decisions around how much I hate getting paged. I mull over every failure mode and whether the system is resilient enough to handle it. In the beginning it feels a little like whack-a-mole, plugging holes as you find them. But over time, systems become hardened, almost bulletproof, if teams actually follow through.
With practice this mindset becomes part of every system design, so you build resilient systems first instead of building reactive systems that need to become resilient later.
Conclusion
Investing time and energy into improving the on call process can massively improve the morale of a team. Not to mention, it’s a huge selling point in recruiting - if you care about the on call process, people will clamor to join your team, and you can encourage the best and brightest to do their best work with fewer distractions.
A large part of the on call process is cultural - what does your team care about, and how do they prioritize their work? Teams that don’t iterate on and act on the on call process tend to sink under their own weight, and it becomes a self-fulfilling cycle of failure: pages get ignored, then real issues get missed, incidents take longer to resolve, leadership gets frustrated, the team feels more pressure, goto 1.
With focus and determination, the on call process can be easy and forgettable!
1. If you don’t have buy-in, then leave. You must have support to make improvements, otherwise things cannot improve. It’s sad to say, but it’s well known that most people leave bosses, not jobs.
2. I actually did this. They kept telling us to call their support anytime they had an outage, but why should I wake up at 2am to call them? Computers are cool as hell, use them!
3. A runbook is just a description of steps.