personal experience
triage -> where is it? what’s the impact? incident level and workflow
root cause -> history, metrics/logs, recent changes, machine/process metrics, is it the cause or the effect?
mitigate -> revert/rollback, extend, simple fix (avoid huge changes)
resolve (software way) -> issue sample (logs, evidence), reproduce (state -> input + logic -> new state), code and test
war room -> update progress
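
A rough sketch of the "reproduce (state -> input + logic -> new state)" idea, assuming the faulty logic can be isolated as a function of the previous state and one input. The names `replay` and `apply_delta` and the sample values are made up for illustration; they stand in for a state and inputs captured from logs.

```python
from typing import Callable, Iterable, List, TypeVar

S = TypeVar("S")  # state type
I = TypeVar("I")  # input type


def replay(transition: Callable[[S, I], S], state: S, inputs: Iterable[I]) -> List[S]:
    """Feed captured inputs through the logic, recording every intermediate state."""
    states = [state]
    for item in inputs:
        state = transition(state, item)
        states.append(state)
    return states


# Hypothetical logic under suspicion: stock should never drop below zero.
def apply_delta(stock: int, delta: int) -> int:
    return stock + delta  # suspected bug: no floor at zero


# Hypothetical issue sample: starting state and inputs taken from the evidence.
captured_state = 3
captured_inputs = [-2, -2, 5]

states = replay(apply_delta, captured_state, captured_inputs)
# Index of the first state that violates the invariant (stock >= 0).
first_bad = next(i for i, s in enumerate(states) if s < 0)
```

Once the replay fails the same way every time, the same sample can be kept as a regression test for the fix.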
- Identify the problem
  a. gather information
  b. reproduce the problem
  c. determine recent changes
- Establish a theory of root cause
  a. bottom-up
  b. top-down
- Attempt a fix based on findings
  a. make one change at a time and test
  b. have a rollback plan
- Document the solution
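
A minimal sketch of "make one change at a time and test" with a rollback point, assuming the system under repair can be modeled as a plain config dict and checked by a health function. Everything here (`apply_one_at_a_time`, `raise_timeout`, `drop_workers`, `is_healthy`, the config keys) is hypothetical. In a real system the rollback step would undo side effects rather than just restore a copy.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]
Change = Tuple[str, Callable[[Config], Config]]


def apply_one_at_a_time(
    config: Config,
    changes: List[Change],
    is_healthy: Callable[[Config], bool],
) -> Tuple[Config, List[str], List[str]]:
    """Apply changes one by one, test after each, roll back any that fail."""
    current = dict(config)
    applied: List[str] = []
    rolled_back: List[str] = []
    for name, change in changes:
        rollback_point = dict(current)     # snapshot taken before the change
        candidate = change(dict(current))  # exactly one change per step
        if is_healthy(candidate):
            current = candidate
            applied.append(name)
        else:
            current = rollback_point       # restore the last known-good state
            rolled_back.append(name)
    return current, applied, rolled_back


# Hypothetical changes and health check, for illustration only.
def raise_timeout(cfg: Config) -> Config:
    cfg["timeout_s"] = 30
    return cfg


def drop_workers(cfg: Config) -> Config:
    cfg["workers"] = 0
    return cfg


def is_healthy(cfg: Config) -> bool:
    return cfg["workers"] > 0


final, applied, rolled_back = apply_one_at_a_time(
    {"timeout_s": 5, "workers": 4},
    [("raise_timeout", raise_timeout), ("drop_workers", drop_workers)],
    is_healthy,
)
```

Testing after every single change makes it clear which change helped and which one broke something, and the `applied` / `rolled_back` lists can go straight into the documented solution.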