AI On-Call Debugging: From Page to Fix Without Panic

Get paged at 3am? A tested AI workflow for triage, hypothesis ranking, a calm-down protocol before destructive commands, and turning the chat into a runbook.

Published: May 24, 2026 Updated: Jun 04, 2026 Author: AI Productivity Guide Team

At 3am, half-asleep, with a page in your hand and 200 lines of logs scrolling past, you do not need AI to be clever. You need it to be a structured second brain: ask the right triage question, rank three hypotheses by likelihood, and pull you back from the impulse to “just restart the service and see.” This is the workflow I wish I had on my first on-call rotation. Short steps, hard limits, and a capture pass that turns one bad night into a runbook entry so the next responder does not relive it.

TL;DR

Paste the alert plus 50-200 log lines into AI and ask for the top 3 ranked hypotheses with one diagnostic check each — no fixes yet. The “no fixes” constraint is what keeps you from chasing an unconfirmed cause.
Check recent deploys in the first 5 minutes. Most pages trace back to “something shipped in the last hour.” If so, the answer is revert, not investigation.
Before any destructive command (kubectl delete, DROP, kill, restart), run the calm-down prompt: paste the exact command and ask what could go wrong and how to recover. It takes five seconds and often saves a second outage.
If you have agentic access, the official Grafana, PagerDuty, and Datadog MCP servers let Claude Code or Cursor query logs and metrics directly. Useful, but verify it read real data — for high-stakes calls, paste manually.
AI suggests, you act. The pause between suggestion and action is the safety mechanism. Capture the resolved chat as a runbook entry before you go back to bed.

When AI helps on a page — and when it does not

This workflow earns its keep on unfamiliar alerts — the first 10 minutes of an incident you have not seen before, or cascading alerts where you cannot tell which one is the root. AI is a fast, tireless triage partner that never panics.

It is the wrong tool when:

The alert already names the fix (“DB connection pool exhausted — run X”). Run the runbook.
It is a security incident. Escalate, do not chat.
The page fired during a known maintenance window. Check the schedule first.
The cause is obviously “the deploy from 20 minutes ago.” Revert first, debug later.

This matches what incident-management teams (incident.io, PagerDuty) now treat as orthodoxy: during an active incident, mitigation comes before root cause. Restore service, then understand it.

Before the page: 4 things to set up

Make AI reachable from your on-call setup — phone, laptop, terminal, whichever you grab first. Friction at 3am kills the workflow.
Have a paste buffer ready. You will copy alert text, log lines, and metric snapshots constantly. The whole workflow rides on feeding AI enough context.
Know your service map at a vague level. AI cannot debug what you cannot describe. “The auth service talks to Redis and the user database” is enough scaffolding to start.
Pre-commit to the calm-down protocol. Before you type any command into production, take a breath and tell AI what you are about to do. Those five seconds catch many fat-finger outages.
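
One way to make that pre-commitment concrete is a small shell helper that spots a destructive command line and prints the calm-down prompt for you to paste into the chat. This is a minimal sketch, not a safety net: the function names `is_destructive` and `calm_down_prompt` are hypothetical, and the pattern list only mirrors the examples used in this article (kubectl delete, DROP, kill, restart), so it will miss plenty of other dangerous commands.

```shell
# Hypothetical helpers; the pattern list mirrors this article's examples
# (kubectl delete, DROP, kill, restart) and is not exhaustive.

# Exit status 0 if the command line looks destructive, 1 otherwise.
is_destructive() (
  printf '%s\n' "$1" | grep -Eiq \
    '(^|[^[:alnum:]_])(kubectl[[:space:]]+delete|drop|kill|restart)([^[:alnum:]_]|$)'
)

# Print the calm-down prompt for a destructive command, ready to paste
# into the AI chat. Prints nothing for commands that look read-only.
calm_down_prompt() (
  if is_destructive "$1"; then
    printf 'I am about to run this in production:\n\n    %s\n\n' "$1"
    printf 'What could go wrong? What is the recovery if it makes things worse?\n'
  fi
)
```

The helper only prints text. It never runs the command it is given, which keeps the decision to act with you.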

Optional: wire AI to your observability stack (MCP)

If you want AI reading logs and metrics instead of you pasting them, the major observability vendors shipped official Model Context Protocol (MCP) servers through early 2026. As of June 2026 these connect to Claude Code, Cursor, and other MCP clients:

MCP server	What the agent can do	Setup note
Grafana (official)	PromQL queries against Prometheus, LogQL log + metric queries against Loki, dashboards, alerting, Grafana OnCall, Sift investigations	Needs `GRAFANA_URL` + `GRAFANA_SERVICE_ACCOUNT_TOKEN`; Grafana 9.0+
PagerDuty (official, GA ~Mar 2026)	60+ tools, including full incident read/write APIs, on-call schedules, escalation policies	OAuth or API token
Datadog (official)	Monitors, metrics, logs, traces, incidents	API + app key
Splunk (v1.1.0 GA), Honeycomb (hosted)	Search, query, dashboards	Vendor token

# List what is already wired up
claude mcp list

# Add the Grafana server (env vars hold the credentials)
claude mcp add grafana -- npx -y @grafana/mcp-grafana

# Show the details of a single configured server
claude mcp get grafana

Two cautions. First, scope the token read-only for on-call — you do not want a 3am hallucination silencing alerts or acking incidents you have not triaged. Second, always confirm the agent quoted real numbers from a real query and is not summarizing from memory. For any destructive decision, fall back to pasting the data yourself.
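
Because the Grafana server takes its credentials from the environment (the `GRAFANA_URL` and `GRAFANA_SERVICE_ACCOUNT_TOKEN` variables in the setup note), a preflight check before adding the server can catch an empty or unexported token early. The sketch below assumes a hypothetical helper named `check_mcp_env`. It only confirms that each variable is exported and non-empty. It cannot tell whether the token is read-only, since that depends on the role you assign to the service account in Grafana.

```shell
# Hypothetical preflight: report which required variables are missing.
# Prints one line per problem; exit status is 1 if any are missing.
check_mcp_env() (
  status=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what a child
    # process such as the MCP server would inherit.
    if [ -z "$(printenv "$name")" ]; then
      printf 'missing or not exported: %s\n' "$name"
      status=1
    fi
  done
  exit "$status"   # exits the function's subshell, not your terminal
)
```

Chained with the add command above, that might look like `check_mcp_env GRAFANA_URL GRAFANA_SERVICE_ACCOUNT_TOKEN && claude mcp add grafana -- npx -y @grafana/mcp-grafana`.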

Note that Claude Code runs Anthropic models only (Opus 4.7 or Sonnet 4.6, both 1M-token context as of June 2026), so a long incident chat is unlikely to overflow the window. Cursor can run those plus GPT-5.5 and Gemini 3.1 Pro against the same MCP servers if you prefer.

The page-to-fix loop

Triage. Paste the alert text and the first 50-200 relevant log lines into AI. Ask for the top 3 hypotheses ranked by likelihood, each with one specific check, and explicitly no fixes yet.

# Quick log grab on a typical Linux service
journalctl -u myservice --since "10 min ago" | tail -200
# Or for Kubernetes
kubectl logs deployment/myservice --tail=200 --since=10m

Check the recent change. Before going down the hypothesis tree, ask: “Was there a deploy or config change in the last hour?” If yes, the answer is usually revert. Stop, revert, observe. AI’s job here is to remind you this is the cheapest check — most pages are caused by something that just shipped.
Run the checks in likelihood order and report results back as you go: “Check 1 (Redis memory) passes — 30% used. Check 2 (slow query) confirmed — SELECT on orders running 15s.” Now AI’s context grows with real data, not guesses.
Calm-down protocol before any write action. Before anything destructive (kubectl delete, DROP, kill, restart), paste the exact command and ask: “I am about to run this in production. What could go wrong? What is the recovery if it makes things worse?” It takes five seconds and often saves a second outage.
Decide: investigate or revert. If the root cause is unclear after 15 minutes and the system is degrading, revert is usually correct. AI can frame the trade-off: “Symptom is X, cost of revert is Y, cost of continued investigation is Z. Should I revert?”
Apply the fix or revert, then verify. Run the action and watch the system. Wait for the recovery signal — alerts clearing and metrics returning to baseline. Do not declare resolved until the metrics agree; alerts can clear momentarily for the wrong reasons.
Wake up a teammate when: 30 minutes in with no root cause, multiple services affected, you are about to take a destructive action you have never done, or you are too tired to think clearly. AI can say “consider escalating,” but the call is yours.
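
The two time limits in the loop (revert at 15 minutes without a confirmed cause, wake a teammate at 30 minutes without a root cause) are easier to honor when something other than a tired brain tracks them. The following is a minimal sketch with hypothetical helper names, `next_step` and `minutes_since`. It assumes the system is still degrading and encodes only the elapsed-time thresholds, not the other escalation triggers (multiple services affected, an unfamiliar destructive action, fatigue).

```shell
# Hypothetical helper encoding the two time thresholds from the loop.
# Usage: next_step <elapsed_minutes> <cause_confirmed: yes|no>
next_step() (
  elapsed=$1
  confirmed=$2
  if [ "$confirmed" = "yes" ]; then
    echo "apply fix, then verify against metrics"
  elif [ "$elapsed" -ge 30 ]; then
    echo "wake a teammate"
  elif [ "$elapsed" -ge 15 ]; then
    echo "revert the most recent change, investigate later"
  else
    echo "keep running checks in likelihood order"
  fi
)

# Minutes since the page fired, given its start time in epoch seconds.
minutes_since() (
  now=$(date +%s)
  echo $(( (now - $1) / 60 ))
)
```

Record the start once with `PAGE_START=$(date +%s)` when the page fires, then run `next_step "$(minutes_since "$PAGE_START")" no` whenever you feel a rabbit hole opening.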

A triage prompt that holds up at 3am

On-call page. I need triage.

Alert: [paste alert text]

Recent logs (last 10 min):
[paste 50-200 lines]

Service context: [1-2 sentences — what the service does, key
dependencies]

Recent activity (deploys / config changes in last hour, if known):
[paste or "none known"]

Produce:

1. Top 3 hypotheses for root cause, ranked by likelihood. For each,
   give ONE specific diagnostic check (a command, a metric to look at,
   a log search). Do not write fixes.

2. One sentence: "If you have not yet checked recent deploys, do that
   first."

3. One sentence: "If you have been investigating more than 15 min
   without convergence, consider reverting the most recent change."

Do not soften. Do not propose three "could also be" causes I should
ignore. The 3 hypotheses must be your actual best guesses.
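
To get the prompt assembled quickly, a shell helper can fill in the template from an alert file and a log capture. This is a sketch under stated assumptions: `triage_prompt` is a hypothetical name, the file paths are whatever you saved the alert and logs to, and the closing instructions are an abbreviated form of the full "Produce" section above. It caps the logs at the last 200 lines, the upper end of the 50-200 guideline.

```shell
# Hypothetical helper: fill in the triage template from files.
# Usage: triage_prompt <alert_file> <log_file> "<service context>" ["<recent activity>"]
triage_prompt() (
  alert_file=$1
  log_file=$2
  context=$3
  activity=${4:-none known}
  printf 'On-call page. I need triage.\n\n'
  printf 'Alert: %s\n\n' "$(cat "$alert_file")"
  printf 'Recent logs (last 10 min):\n'
  tail -n 200 "$log_file"   # cap at the upper end of the 50-200 line guideline
  printf '\nService context: %s\n\n' "$context"
  printf 'Recent activity (deploys / config changes in last hour, if known):\n'
  printf '%s\n\n' "$activity"
  printf 'Produce the top 3 hypotheses for root cause, ranked by likelihood,\n'
  printf 'each with ONE specific diagnostic check. Do not write fixes.\n'
)
```

Pipe the output into whatever clipboard tool your machine has, or straight into the chat client if it accepts standard input.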

Quality check before you call it resolved

You ran the diagnostic check before applying a fix. The most common 3am mistake is jumping to a fix for a hypothesis you never confirmed.
For every destructive command, you ran the calm-down prompt — no exceptions, even for commands you have run before.
You checked recent deploys and config changes in the first 5 minutes.
Recovery is verified by metrics returning to baseline, not by “the alert went away.”
You escalated within the time and scope thresholds you set in advance. Hero on-call is how a 30-minute outage becomes a 3-hour one.
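
The "metrics, not alerts" item can be made mechanical: require the last few samples of a metric to all sit within a tolerance of baseline before you declare recovery. This is a minimal sketch with a hypothetical helper named `recovered`. It reads one numeric sample per line on standard input and assumes a positive baseline. The baseline, tolerance, and window size are values you would pick for your own service.

```shell
# Hypothetical check: succeed only if the last <window> samples on stdin
# are all within <tolerance_pct> percent of <baseline>.
# Usage: recovered <baseline> <tolerance_pct> <window> < samples.txt
recovered() (
  tail -n "$3" | awk -v base="$1" -v tol="$2" -v want="$3" '
    {
      n++
      diff = $1 - base
      if (diff < 0) diff = -diff
      if (diff > base * tol / 100) bad++
    }
    END {
      # Fail if there are too few samples or any sample is out of range.
      if (n == want && bad == 0) exit 0
      exit 1
    }
  '
)
```

A single good sample between two bad ones fails the check, which is the point: a momentary dip that clears the alert does not count as recovery.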

Turn the night into a runbook (and beat AI next time)

After it resolves, ask AI to summarize the chat: “Take our conversation and produce a runbook section: symptom, diagnostic steps in order, fix, rollback. Plain language.” Tighten it and paste into the team runbook.
Save the triage prompt as a snippet you can paste in under 10 seconds.
Build a personal “things I check first” list — the top 3-5 recurring causes for your service. AI’s ranking is general; after a few rotations your service-specific list beats it. This also attacks the metric that matters more than MTTR: repeat-incident rate. A team that resolves fast but keeps re-firing the same page is not actually winning.
At the next team meeting, review which pages AI helped with and which it did not. It shines on unfamiliar alerts and adds little to repeat alerts that already have runbooks.

Common mistakes

Pasting only the alert text, no logs. AI cannot triage without data.
Skipping the recent-deploys check.
Letting AI propose a fix before the root cause is confirmed. “Increase the memory limit” is a fine suggestion for the wrong cause.
Running destructive commands without the calm-down prompt.
Refusing to revert because “we should understand the bug first.” During an active outage, restore service first and understand later.
Not waking a teammate when you should. The rotation exists so one person does not solo hour 3 of a hard incident.
Giving an MCP-connected agent a write-scoped token and trusting it to ack or resolve on its own.
Not capturing the chat as a runbook entry.

FAQ

What if AI’s hypotheses are all wrong? Give it negative evidence and re-prompt: “It is NOT memory (checked), NOT a recent deploy (checked). Re-rank.” AI converges faster when you rule causes out explicitly than when you prompt it vaguely to try again.
Can AI read my logs and metrics directly? Yes, with the official Grafana, PagerDuty, Datadog, Splunk, or Honeycomb MCP servers wired into Claude Code or Cursor. Keep the token read-only for on-call and verify the agent quoted a real query, not a memory. For high-stakes calls, paste data manually.
Should I let AI take actions in production? No. AI suggests, you act. The pause between suggestion and action is what catches the bad ideas — and the 2026 incident-response guidance still puts a human on the trigger for mitigation.
What if I am too tired to think? Wake your teammate. AI is a thinking aid, not a substitute for a human who is awake.
How do I stop AI from rabbit-holing me? Set a 15-minute timer at the start. If the system is still degrading at 15 minutes with no confirmed cause, revert and investigate later.
What about the post-page debrief? Use the [AI incident postmortem workflow](/en/articles/ai-incident-postmortem-workflow/). The on-call chat is one of the best inputs to a postmortem, so keep it.
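
The negative-evidence re-prompt from the first answer is easy to build from the list of causes you have already ruled out. This is a small sketch with a hypothetical helper named `rerank_prompt`. The wording follows the example in that answer, with a slightly longer closing instruction.

```shell
# Hypothetical helper: build the negative-evidence re-prompt from the
# hypotheses already ruled out, one per argument.
rerank_prompt() (
  [ "$#" -gt 0 ] || exit 1   # nothing ruled out yet, nothing to print
  sep=''
  printf 'It is '
  for cause in "$@"; do
    printf '%sNOT %s (checked)' "$sep" "$cause"
    sep=', '
  done
  printf '. Re-rank the remaining hypotheses.\n'
)
```

For example, `rerank_prompt memory "a recent deploy"` produces a one-line prompt you can paste as the next chat message.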

Tags: #AI coding #Workflow
