Firefighter

Decision

Every two weeks, a different team is on firefighter duty. The firefighter is responsible for monitoring Slack, both for messages from our monitoring systems and for @firefighter mentions from colleagues. The firefighting team is responsible for ensuring response times below 1 hour during work hours, from 9am to 6pm. It also needs to ensure coverage when, for example, someone in the team gets sick or the team is on an offsite.
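The response-time target above can be sketched as a small check. This is only an illustration, not part of our tooling: the function names are hypothetical, and the sketch assumes that work hours apply Monday to Friday and that messages received outside work hours are not covered by the target. Neither assumption is specified in this ADR.

```python
from datetime import datetime, time, timedelta

WORK_START = time(9, 0)    # 9am, as stated in the decision
WORK_END = time(18, 0)     # 6pm, as stated in the decision
RESPONSE_TARGET = timedelta(hours=1)


def received_in_work_hours(received: datetime) -> bool:
    """Return True if the message arrived during work hours.

    Assumption: work hours apply Monday to Friday only.
    """
    is_weekday = received.weekday() < 5
    return is_weekday and WORK_START <= received.time() < WORK_END


def met_response_target(received: datetime, responded: datetime) -> bool:
    """Return True if the response satisfies the one-hour target.

    Assumption: messages received outside work hours are not covered
    by the target, so they always count as satisfied here.
    """
    if not received_in_work_hours(received):
        return True
    return responded - received < RESPONSE_TARGET
```

For example, a message received on a Monday at 10:00 and answered at 10:45 meets the target, while the same message answered at 11:30 does not.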

Each team can decide on their own how they split up the firefighting work internally.

We also keep a Confluence page up to date with more details on how to firefight.

The firefighter role is also used to rotate other duties, like facilitating meetings, so that everyone only needs to remember which team is currently firefighter, not multiple changing responsibilities.

Problems

Neither our existing systems nor our newly written ones always do what we expect. They might also not be as self-explanatory to their users as we would like. Therefore we need to monitor the systems and react to requests coming from our users, both internal and external.

Internal users of our systems need to know who to contact if they have questions or problems with our systems. This needs to be easy, without requiring them to understand which system is affected or who is responsible for it.

At the same time, there are external changes which affect our systems, like deprecations of libraries or outages of services we rely on. We need to be able to react to those as well.

Context

We've already used the firefighter process described here for almost two years and it has worked reasonably well.

Options

  1. Rotate firefighting duty between teams
  2. Have all teams do firefighting for their own services
  3. Have one or a few centrally assigned firefighters
  4. Don't have dedicated firefighters

Reasoning

We rotate firefighters between teams to share knowledge, to reduce the impact on individual teams by sharing the burden, and to ensure that the people who write the systems are the same people who suffer if the systems cause trouble.

We assign only one team at a time to allow the other teams to fully focus instead of having to check Slack regularly.

Consequences

How do we implement this change?

The process is already used in practice. There are calendar entries to remind teams to hand over, and to make it visible which team is currently firefighter.
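The two-week rotation behind those calendar entries can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how the calendar entries are actually generated: the team names, the rotation start date, and the function name are all hypothetical.

```python
from datetime import date

ROTATION_LENGTH_DAYS = 14  # "every two weeks", as stated in the decision


def team_on_duty(teams: list[str], rotation_start: date, today: date) -> str:
    """Return the team that is firefighter on the given day.

    Assumption: teams take turns in a fixed order, starting with
    teams[0] on rotation_start, and each shift lasts exactly 14 days.
    """
    if not teams:
        raise ValueError("at least one team is required")
    if today < rotation_start:
        raise ValueError("today is before the rotation started")
    shifts_elapsed = (today - rotation_start).days // ROTATION_LENGTH_DAYS
    return teams[shifts_elapsed % len(teams)]


# Hypothetical team names and start date, for illustration only.
example_teams = ["team-a", "team-b", "team-c"]
example_start = date(2024, 1, 1)
current_team = team_on_duty(example_teams, example_start, date(2024, 1, 15))
```

With these example values, the first team covers the first 14 days, the second team takes over on day 15, and the order repeats once every team has had a shift.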

Who will implement the change?

The process is already used in practice.

How do we teach this change?

Teams are encouraged to use pair and ensemble work to share their knowledge with new joiners. We share learnings across teams in the cross-team retro.

What could go wrong?

Splitting the work of firefighter across multiple people could lead to a firefighter missing knowledge that other firefighters have.

Sharing the role might force people who don't enjoy firefighting work to do it.

What do we do if something goes wrong?

Since this is a process that only impacts one team at a time, it is easy to experiment with changes and then roll them out team by team.

What is still unclear?

How do we ensure that firefighters can maintain new systems that they haven't written? How do we unify logging, observability, metrics and dashboards to make it easier to understand what is going on?

Related ADRs