At 3am, half-asleep, with a page in your hand and 200 lines of logs scrolling past, you do not need AI to be clever. You need AI to be a structured second brain that asks the right triage question, ranks three hypotheses by likelihood, and pulls you back from the impulse to “just restart the service and see.” This workflow is the version I wish I had during my first on-call rotation: short steps, hard limits, and a capture pass that turns one bad night into a runbook entry so the next on-call does not relive it.
What this covers
A page-to-fix workflow that uses AI for triage and hypothesis ranking, with explicit guardrails for the high-stress decision points (when to wake up a teammate, when to revert vs investigate, when to stop and accept a longer outage in exchange for a safer fix). Plus the post-incident capture pass that converts the AI chat log into a runbook entry.
Who this is for
Engineers in any on-call rotation, especially first-time on-callers and teams without a strong runbook tradition. Senior on-call engineers using AI to keep cognitive load down during long incidents. Tech leads who want their team to handle pages more consistently.
When to reach for it
Production page, you are the responder, the system is in a degraded or unknown state. New alerts you have not seen before — the AI is mostly useful for the first 10 minutes of an unfamiliar incident. Cascading alerts where you need to figure out which one is the root.
When this is NOT the right tool
Pages with a known fix written into the alert (“DB connection pool exhausted — run X”). Just do the runbook. Security incidents — escalate, do not chat. Pages during a known maintenance window — check the maintenance schedule first. Pages where the answer is obviously “the deploy from 20 minutes ago broke it” — revert first, AI later.
Before you start
- Have AI accessible from your on-call setup. Phone, laptop, terminal — whichever you reach for first. Friction at 3am kills the workflow.
- Have a paste buffer ready. You will be copy-pasting alert text, log lines, and metric snapshots. The whole workflow rides on giving AI enough context to be useful.
- Know your service map at a vague level. AI cannot debug what you cannot describe. “The auth service talks to Redis and to the user database” is enough scaffolding to start.
- Pre-commit to your calm-down protocol. Before you type any command into production, take a breath and tell AI what you are about to do. The 5 seconds catch most “fat-finger” outages.
Step by step
- Triage. Paste the alert text and the most recent 50-200 lines of relevant logs into AI. Ask: “Given this alert and these logs, what are the top 3 hypotheses for the root cause, ranked by likelihood, with one specific check for each? Do NOT suggest fixes yet.”
    # Quick log grab on a typical Linux service
    journalctl -u myservice --since "10 min ago" | tail -n 200
    # Or for k8s
    kubectl logs deployment/myservice --tail=200 --since=10m
- Check the recent change. Before you go down the hypothesis tree, ask: “Was there a deploy or config change in the last hour?” If yes, the answer is usually revert. Stop, revert, observe. AI’s role here is to remind you that this is the cheapest check.
- Run the hypothesis checks in order of likelihood. Report results back to AI as you go. “Check 1 (Redis memory) passes: used 30%. Check 2 (slow query) confirmed: SELECT on orders running 15s.” Now AI’s context grows with real data, not guesses.
- Calm-down protocol before any write action in production. Before running anything destructive (kubectl delete, DROP, kill, restart), paste the exact command into AI and ask: “I am about to run this in production. What could go wrong? What is the recovery if it makes things worse?” It takes 5 seconds and often saves a second outage.
- Decide: investigate or revert. If the root cause is unclear after 15 minutes of investigation and the system is degrading, revert is usually correct. AI can help frame this trade-off: “Symptom is X, cost of revert is Y, cost of continued investigation is Z. Should I revert?”
- Apply fix or revert. Run the action. Observe the system. Wait for the recovery signal (alerts clearing, metrics returning to baseline) — do not declare resolved until the metrics agree.
- Wake up a teammate when: 30 minutes in with no root cause, multiple services affected, you are about to take a destructive action you have not done before, or you are too tired to think clearly. AI can suggest “consider escalating” but the call is yours.
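The time limits in the steps above can be written down as a small decision helper so you do not have to re-derive them at 3am. This is a minimal sketch, not a policy: the function name `next_move`, its argument order, and its output words are assumptions made for illustration, and it only encodes the 15-minute and 30-minute thresholds from the list (the other escalation triggers, such as multiple affected services or being too tired, stay a human judgment). Where both thresholds apply, the sketch assumes escalation wins.

```bash
# Hypothetical helper encoding the time-box rules from the steps above.
# Usage: next_move MINUTES_ELAPSED CAUSE_CONFIRMED(yes|no) DEGRADING(yes|no)
next_move() (
  minutes=$1
  confirmed=$2
  degrading=$3

  if [ "$confirmed" = "yes" ]; then
    echo "fix"          # cause confirmed: apply the fix (calm-down prompt first)
  elif [ "$minutes" -ge 30 ]; then
    echo "escalate"     # 30 minutes in with no root cause: wake up a teammate
  elif [ "$minutes" -ge 15 ] && [ "$degrading" = "yes" ]; then
    echo "revert"       # 15 minutes, cause unclear, still degrading: revert
  else
    echo "investigate"  # keep running hypothesis checks in order of likelihood
  fi
)
```

For example, `next_move 16 no yes` prints `revert`, while `next_move 16 no no` prints `investigate` because the system is not getting worse.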
A triage prompt that holds up at 3am
On-call page. I need triage.
Alert: {paste alert text}
Recent logs (last 10 min):
{paste 50-200 lines}
Service context: {1-2 sentences: what the service does, key
dependencies}
Recent activity (deploys / config changes in last hour, if known):
{paste or "none known"}
Produce:
1. Top 3 hypotheses for root cause, ranked by likelihood. For each,
give ONE specific diagnostic check (a command, a metric to look at,
a log search). Do not write fixes.
2. One sentence: "If you have not yet checked recent deploys, do that
first."
3. One sentence: "If you have been investigating more than 15 min
without convergence, consider reverting the most recent change."
Do not soften. Do not propose three "could also be" causes I should
ignore. The 3 hypotheses must be your actual best guesses.
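One way to get the prompt above out in a few seconds is a shell function that fills in the placeholders from files. The sketch below assumes you have already saved the alert text and a log excerpt to files; the function name `triage_prompt` and its argument layout are hypothetical, not part of any tool.

```bash
# Hypothetical helper: fills the triage template from an alert file and a log file.
# Usage: triage_prompt ALERT_FILE LOG_FILE "SERVICE CONTEXT" ["RECENT ACTIVITY"]
triage_prompt() (
  alert_file=$1
  log_file=$2
  context=$3
  recent="${4:-none known}"   # default when no deploys or config changes are known

  printf 'On-call page. I need triage.\n'
  printf 'Alert: %s\n' "$(cat "$alert_file")"
  printf 'Recent logs (last 10 min):\n'
  tail -n 200 "$log_file"     # cap at 200 lines, the upper bound in the template
  printf 'Service context: %s\n' "$context"
  printf 'Recent activity (deploys / config changes in last hour, if known):\n'
  printf '%s\n' "$recent"
  cat <<'EOF'
Produce:
1. Top 3 hypotheses for root cause, ranked by likelihood. For each,
give ONE specific diagnostic check (a command, a metric to look at,
a log search). Do not write fixes.
2. One sentence: "If you have not yet checked recent deploys, do that
first."
3. One sentence: "If you have been investigating more than 15 min
without convergence, consider reverting the most recent change."
Do not soften. Do not propose three "could also be" causes I should
ignore. The 3 hypotheses must be your actual best guesses.
EOF
)
```

The output can be piped into whatever clipboard tool your machine has, for instance `pbcopy` on macOS or `xclip -selection clipboard` on an X11 desktop, and then pasted into the chat.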
Quality check
- You ran the diagnostic check before applying a fix. The most common 3am mistake is jumping to a fix that addresses a hypothesis you never confirmed.
- For any destructive command, you ran the calm-down prompt first. No exceptions, even if you have done the command before.
- You checked recent deploys / config changes in the first 5 minutes. Most pages trace back to “something changed.”
- The recovery is verified by metrics returning to baseline, not by “the alert went away” — alerts can clear momentarily for the wrong reasons.
- You escalated within the time / scope thresholds you set. Hero on-call is how 30-minute outages become 3-hour outages.
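The "metrics back to baseline" check can be made mechanical: require several consecutive samples at or below a threshold instead of trusting a single good reading. The sketch below assumes you can get the metric as one number per line, oldest first, from whatever source you use; the function name `recovered` and the threshold are illustrative assumptions.

```bash
# Hypothetical helper: reads numeric samples on stdin, one per line, oldest first.
# Succeeds (exit 0) only if at least N samples exist and the last N are all
# at or below THRESHOLD.
# Usage: <metric source> | recovered THRESHOLD N
recovered() (
  threshold=$1
  needed=$2
  tail -n "$needed" | awk -v t="$threshold" -v n="$needed" '
    $1 + 0 > t + 0 { bad = 1 }              # any recent sample above baseline
    END { exit (NR < n || bad) ? 1 : 0 }    # too few samples also counts as "not yet"
  '
)
```

With a 150 ms latency baseline, `printf '900\n850\n120\n110\n115\n' | recovered 150 3` succeeds, while a series whose last three samples include a spike fails, which is the "alert cleared momentarily" case the check is meant to catch.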
How to reuse this workflow
- After the page resolves, ask AI to summarize the chat into a runbook entry. “Take our conversation and produce a runbook section: symptom, diagnostic steps in order, fix, rollback. Plain language.” Tighten and paste into the team runbook.
- Save the triage prompt as a snippet you can paste into AI in under 10 seconds. Friction at 3am is the enemy.
- Build a personal “things I check first” list — the top 3-5 hypotheses for your service. AI is general; your service has specific recurring causes. After a few rotations, your list beats AI’s general ranking for your service.
- Review at the next team meeting: which pages were AI-useful, which were not? AI is most useful for unfamiliar alerts and least useful for repeat alerts with known runbooks.
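To keep captured entries consistent, it can help to start each one from the same skeleton, using the four sections named in the capture prompt above. This is a sketch only: the function name `runbook_stub` and the header lines are assumptions, and your team runbook may use a different layout.

```bash
# Hypothetical helper: prints an empty runbook entry with the four sections
# from the capture prompt (symptom, diagnostic steps, fix, rollback).
# Usage: runbook_stub "ALERT NAME" [DATE]
runbook_stub() (
  alert=$1
  day="${2:-$(date +%Y-%m-%d)}"   # defaults to today
  cat <<EOF
Runbook entry: $alert
Last seen: $day
Symptom:
Diagnostic steps (in order):
Fix:
Rollback:
EOF
)
```

Something like `runbook_stub "HighErrorRate on auth" >> runbook.txt` (file name assumed) appends a blank entry that you then fill with the tightened AI summary.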
Recommended workflow
Page → paste alert + logs → AI gives 3 ranked hypotheses → check recent deploys → run hypothesis check 1 → report back → confirm cause → calm-down prompt before any destructive command → apply fix or revert → verify with metrics → capture chat as runbook entry. Typical page: 15-30 minutes from page to resolved when the workflow holds.
Common mistakes
- Pasting only the alert text, no logs. AI cannot triage without data.
- Skipping the “recent deploys” check. Most pages are caused by something that shipped in the last hour.
- Letting AI propose a fix before you confirm the root cause. “Increase the memory limit” is a fine suggestion for the wrong cause.
- Running destructive commands without the calm-down prompt. The cost is 5 seconds; the payoff is occasionally a second outage avoided.
- Refusing to revert because “we should understand the bug first.” During an active outage, understand later. Restore service first.
- Not waking up a teammate when you should. The team has a rotation specifically so one person does not do hour 3 of a hard incident alone.
- Not capturing the chat as a runbook entry. The same alert will fire again; the next on-call deserves your notes.
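The calm-down mistake is the easiest one to guard against with a little tooling. The sketch below flags command lines containing the destructive verbs named in the workflow and prints the calm-down prompt for them. Both function names and the pattern list are assumptions for illustration; the match is deliberately crude and will miss variants such as `pkill`, so treat it as a reminder and not a safety net.

```bash
# Hypothetical helper: succeeds if the command line contains one of the
# destructive verbs named in this workflow (kubectl delete, DROP, kill, restart).
is_destructive() {
  printf '%s\n' "$1" | grep -Eiq \
    '(kubectl[[:space:]]+delete|(^|[^[:alnum:]_])(drop|kill|restart)([^[:alnum:]_]|$))'
}

# Hypothetical helper: prints the calm-down prompt for a command, ready to paste.
calm_down_prompt() {
  printf 'I am about to run this in production:\n  %s\n' "$1"
  printf 'What could go wrong? What is the recovery if it makes things worse?\n'
}
```

A typical use is `cmd='kubectl delete pod auth-7f9'; is_destructive "$cmd" && calm_down_prompt "$cmd"`, which prints the prompt without running anything. The command itself still gets typed by you after the answer comes back.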
FAQ
- What if AI’s hypotheses are all wrong? Give it negative evidence and re-prompt. “It is NOT a memory issue (checked), NOT a recent deploy (checked). Re-rank.” AI converges faster with negative info than with vague positive prompts.
- Can AI access my logs / metrics directly? With agentic setups (Claude Code with MCP, or similar) you can wire it to log search and dashboards. Useful, but verify the AI is reading actual data and not hallucinating. For high-stakes calls, paste data manually.
- Should I let AI take actions on production? No. The AI suggests, you act. The 5-second pause between suggestion and action is what catches the bad ideas.
- What if I am too tired to think? Wake up your teammate. AI is a thinking aid, not a substitute for a human who is awake.
- How do I avoid AI rabbit-holing me? Set a 15-minute timer at the start. If the system is still degrading at 15 minutes and you have no confirmed cause, revert and investigate later.
- What about post-page debrief? Use the postmortem workflow. The on-call chat is one of the best inputs to a postmortem, so keep it.
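The negative-evidence re-prompt from the first answer is easy to generate from the list of causes you have already ruled out. A minimal sketch, assuming a hypothetical function name `rerank_prompt` that takes each ruled-out cause as one argument:

```bash
# Hypothetical helper: builds the negative-evidence re-prompt from ruled-out causes.
# Usage: rerank_prompt "a memory issue" "a recent deploy"
rerank_prompt() (
  [ "$#" -gt 0 ] || { echo "usage: rerank_prompt RULED_OUT..." >&2; exit 1; }
  line="It is"
  sep=" "
  for cause in "$@"; do
    line="${line}${sep}NOT ${cause} (checked)"
    sep=", "                      # later causes are comma-separated
  done
  printf '%s. Re-rank.\n' "$line"
)
```

Called as in the usage comment, it prints the same sentence quoted in the FAQ, so the ruled-out list stays explicit as the incident goes on.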
Related
- AI incident postmortem workflow
- AI rollback workflow
- AI debug workflow
- Feed project reports to agents