- being on-call
- alerting principles
- before an incident
- what is an incident
- severity levels
- different roles for incidents
- incident call etiquette
- complex incidents
- during an incident
- security incident response
- after an incident
- postmortem process
- follow-ups (finish the high-priority ones on time)
- practice
```
Roles:
Incident commander -> coordinates the response
Scribe -> keeps the timeline and records decisions
Drivers -> engineers involved in fixing the issue
Incident management lifecycle:
1. start comms (identify level, assign roles)
2. contain and fix (find ways to restore/roll back, keep evidence)
3. resolve and review (verify, inform stakeholders, post-mortem)
Tools:
slack (incident-bot)
attribution (owner: slack channel, slack_aliases, emails, app_id, radar, pagerduty_service_id, runbook, stakeholders)
```
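The lifecycle and the attribution fields noted above can be sketched as a small data model. This is only an illustration under assumptions, not the incident-bot's real schema: the class names (`Phase`, `Owner`, `Incident`), the `advance` helper, the field types, and every example value are hypothetical. Only the three phases and the attribution field names come from the notes.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """The three lifecycle phases from the notes, in order."""
    START_COMMS = 1         # identify level, assign roles
    CONTAIN_AND_FIX = 2     # restore or roll back, keep evidence
    RESOLVE_AND_REVIEW = 3  # verify, inform stakeholders, post-mortem


@dataclass
class Owner:
    """Attribution record. Field names follow the notes; types are assumed."""
    slack_channel: str
    slack_aliases: list[str] = field(default_factory=list)
    emails: list[str] = field(default_factory=list)
    app_id: str = ""
    radar: str = ""
    pagerduty_service_id: str = ""
    runbook: str = ""
    stakeholders: list[str] = field(default_factory=list)


@dataclass
class Incident:
    """Hypothetical incident record that tracks its current phase."""
    title: str
    owner: Owner
    phase: Phase = Phase.START_COMMS

    def advance(self) -> Phase:
        """Move to the next phase; the last phase is terminal."""
        if self.phase is Phase.RESOLVE_AND_REVIEW:
            raise ValueError("incident is already in its final phase")
        self.phase = Phase(self.phase.value + 1)
        return self.phase


# Example values below are placeholders, not real channels or services.
owner = Owner(slack_channel="#team-example", emails=["team@example.invalid"])
incident = Incident(title="checkout errors", owner=owner)
incident.advance()
print(incident.phase.name)
```

Keeping the phases in an ordered enum makes it hard to skip "contain and fix" by accident; whether a real tool should enforce the order this strictly is a design choice, not something the notes prescribe.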
in real life
on-call -> first responder
firefighter -> incident channel, live call, rollout blocks (notify the leader of the next action)
comms -> sends periodic updates, escalates, brings in more firefighters
scribe -> documents the incident and records it for future reference
leader -> should not be a firefighter; coordinates, keeps people focused, looks after participants (e.g. anyone tired or not in a good state), makes decisions, keeps the high-level view
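The one hard rule in the role list above, that the leader should not also be a firefighter, is easy to check mechanically. A minimal sketch, assuming roles are tracked as plain names; `IncidentRoles` and `validate_roles` are hypothetical names, not part of any tool mentioned in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentRoles:
    """Who holds which role; role names follow the notes."""
    leader: str
    comms: str
    scribe: str
    firefighters: list[str] = field(default_factory=list)


def validate_roles(roles: IncidentRoles) -> list[str]:
    """Return a list of problems; an empty list means no rule is violated."""
    problems = []
    # From the notes: the leader coordinates and should not be firefighting.
    if roles.leader in roles.firefighters:
        problems.append(f"leader {roles.leader!r} is also a firefighter")
    return problems


# Placeholder names for illustration only.
roles = IncidentRoles(
    leader="alice", comms="bob", scribe="carol", firefighters=["alice", "dave"]
)
for problem in validate_roles(roles):
    print(problem)
```

Returning a list of problems instead of raising keeps the check usable as a warning during a live incident, when blocking people is rarely helpful.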
slow-burning incidents
Don’t start new features without sufficient alerts