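Build Failure Triage Prompts: 12 Templates for Investigating Red CI

Stop guessing at red CI. Twelve prompt templates that narrow a build or test failure down by environment, cache, dependencies, flakes, and execution order.

Published: 2026/05/19 Updated: 2026/06/04 Author: AI Productivity Guide Team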

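"CI is red" is not a problem description. A useful build-failure prompt names the failing job, the diff that caused it, and the environment it ran in, then bisects between code / dependency / cache / flake. Anything else is throwing wrenches at the machine.

This costs more than it looks. A 2026 empirical study of GitHub Actions found that 3.2% of builds get rerun, and 67.7% of those reruns turn green with no code change at all. In other words, most "just hit re-run" failures are flakes, not real signal (arXiv 2602.02307). The prompts below exist to separate the two quickly, before you burn an evening or wave through a real bug because you assumed it was a flake.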

TL;DR

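Always read the first error in the log, not the last. Downstream noise hides the root cause.
Give the model the full log, the diff, and the CI environment (Node / OS / lockfile sha) together. Without all three it can only guess.
About two thirds of rerun builds go green with no code change, so most "passes on retry" failures are flakes. A pass on retry does not prove it, though: real timing bugs also pass intermittently, and a retry hides them. Use Prompt 3 to decide.
Set a timebox: if 15 minutes of investigation gets you nowhere, revert and fix in a branch.
These paste directly into Claude Code, Cursor, ChatGPT, or Gemini. Any model with a 1M-token context (Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro, GPT-5.5) can hold a full CI log plus the diff in one pass.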

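Who this is for

Engineers staring at a red main branch, release captains with blocked merges, and solo developers who lost an evening to a pipeline that went from green to red.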

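Which model to paste into

CI logs are long, so context window matters more here than raw reasoning ability. As of June 2026:

Tool / model	Context	Why it suits CI logs
Claude Code (Opus 4.7 / Sonnet 4.6)	1M tokens	Reads the repo, the workflow YAML, and the logs together, and can run confirmation commands for you
Cursor (Sonnet 4.6, GPT-5.5, Gemini 3.1 Pro)	Up to 1M	Lives in the editor, so the failing test file and the log can sit side by side
Gemini 3.1 Pro	1M tokens	Cheapest for pasting logs in bulk, at $2 input / $12 output per 1M tokens
ChatGPT Plus (GPT-5.5)	About 320 pages in the app	Enough for a single job log; the full 1M requires the $200 Pro tier

Prefer a 1M-context tool when the logs are large and span multiple jobs. For a single failing step, any of them will do.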

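Prompt structure

Every build-failure prompt should carry these six elements:

Role: who the AI plays (SRE / Release Captain / staff engineer / QA Lead).
Context: tech stack / branch / failure log / diff / dashboard URL.
Goal: one concrete deliverable, such as a root cause, checklist, plan, ticket list, or runbook.
Constraints: what the AI must not do (no auto-fixing, no invented file paths).
Output format: numbered list, markdown table, JSON, unified diff, or runnable code.
Signal: one or two examples of good output, or one counterexample.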
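
One way to keep all six elements honest is to assemble the prompt from named fields and refuse to send it when one is empty. The sketch below is only an illustration: the class name, field names, and sample values are made up for this example, not part of any tool.

```python
from dataclasses import dataclass, fields


@dataclass
class BuildFailurePrompt:
    """The six elements listed above. All names here are illustrative."""
    role: str
    context: str
    goal: str
    constraints: str
    output_format: str
    signal: str

    def render(self) -> str:
        # Refuse to render a prompt that is missing one of the six elements.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError("missing prompt elements: " + ", ".join(missing))
        # One labelled line per element, in declaration order.
        return "\n".join(
            f"{f.name.replace('_', ' ').capitalize()}: {getattr(self, f.name)}"
            for f in fields(self)
        )


# Sample values are hypothetical.
prompt = BuildFailurePrompt(
    role="You are an SRE triaging a failed CI job.",
    context="Branch main, failing job log and diff pasted below.",
    goal="Name the root cause.",
    constraints="Do not auto-fix. Do not invent file paths.",
    output_format="Numbered list.",
    signal="Good output quotes the exact error line from the log.",
)
print(prompt.render())
```

The empty-field check is the useful part: a prompt with no constraints or no output format is the one most likely to come back as confident guessing.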

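When not to use these

Don't reach for these on a failure whose log you haven't read; paste the log first. And don't use them to "fix" CI by silencing the failing job. That isn't a fix, it's planting a future incident.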

12 copy-paste prompt templates

Before sending, replace the [bracketed] placeholders with your real values.

1. Read this log and find the root cause

Here is the failing CI log: [log]. Identify the FIRST real error (not symptoms downstream). Output: (1) the error line, (2) the most likely cause among code bug / dep mismatch / cache poison / env diff / flake, (3) the next 1-2 commands to confirm. No speculation — only what the log supports.

Replace: [log] = the full log of the failing job
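
Before pasting a very long log, it can help to find the first error yourself so you know what the model should land on. This is a minimal sketch, assuming errors are marked by a handful of common words; the pattern and the sample log are invented for illustration, and real tools may need their own patterns.

```python
import re

# Assumed marker words; adjust for the tools in your pipeline.
ERROR_PATTERN = re.compile(r"\b(error|failed|fatal|exception)\b", re.IGNORECASE)


def first_error(log: str, context: int = 2):
    """Return (1-based line number, surrounding lines) for the first match, or None."""
    lines = log.splitlines()
    for index, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, index - context)
            return index + 1, lines[start:index + context + 1]
    return None


# Hypothetical log: the compiler error on line 4 is the root,
# the two lines after it are downstream symptoms.
sample_log = """\
npm ci
added 812 packages in 14s
npm run build
src/app.ts(14,3): error TS2304: Cannot find name 'foo'.
npm ERR! code ELIFECYCLE
Error: Process completed with exit code 1.
"""

line_number, excerpt = first_error(sample_log)
print(line_number)
print("\n".join(excerpt))
```

The scan stops at the first hit on purpose. The exit-code line at the bottom is what most people read, and it says nothing about the cause.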

2. Local vs. CI environment differences

I cannot reproduce a CI failure locally. Compare these envs: local node [nodeLocal], CI node [nodeCI], OS, env vars, lockfile drift, cached vs fresh install. Output the 5 most likely diffs and one command each to verify.

Replace: [nodeLocal], [nodeCI]

3. Flake or real failure

A test failed once on CI but passes on retry. Decide flake vs real: (1) Look at the diff that landed before the run — does it touch the test's subject? (2) Check the failure frequency over the last 7 days. (3) Inspect the error for non-deterministic terms (timeout / Date.now / random / port / order). Output: probability of flake (0-1), reasoning, and the next action.
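
The three checks in this prompt can be roughed out in code to pre-screen failures before asking a model. The sketch below is an assumption-heavy heuristic: the term list comes from the prompt, but the regexes, the "more than one failure in 7 days" threshold, and the labels are made up here, and it returns a label rather than the 0-1 probability the prompt asks for.

```python
import re

# Terms taken from the prompt above; the regexes themselves are assumptions.
FLAKE_TERMS = {
    "timeout": r"time(d)?\s?out",
    "Date.now": r"Date\.now",
    "random": r"\brandom",
    "port": r"\bport\b|EADDRINUSE",
    "order": r"\border\b",
}


def flake_signals(error_text: str) -> list[str]:
    """Return the non-deterministic terms that appear in the error text."""
    return [
        name
        for name, pattern in FLAKE_TERMS.items()
        if re.search(pattern, error_text, re.IGNORECASE)
    ]


def triage(error_text: str, diff_touches_subject: bool, failures_last_7_days: int) -> str:
    """Combine the prompt's three checks into a rough label (thresholds are arbitrary)."""
    signals = flake_signals(error_text)
    if diff_touches_subject:
        # A diff that touches the code under test outranks any flake signature.
        return "treat as real: the diff touches the code under test"
    if signals and failures_last_7_days > 1:
        return "likely flake: " + ", ".join(signals)
    return "unclear: gather more runs"


print(triage("Error: Timeout of 5000ms exceeded", False, 3))
```

The ordering is the design choice worth keeping: a flake-looking error on a test whose subject just changed is treated as real until shown otherwise.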

4. Cache poisoning diagnosis

CI succeeded last run and fails now on the same diff. Suspect cache. Check: (1) Last cache key change, (2) Lockfile changes that altered hoist resolution, (3) Postinstall scripts that read env, (4) Restore-time anomalies. Output: the most likely cache layer plus the single cache key to flush.

5. Locating dependency drift

The lockfile changed in this commit. Find the upgraded package(s) most likely to have broken CI: list each upgrade with old -> new version, type (direct / transitive), and changelog highlights for the version jump. Don't propose downgrades — propose investigation order.
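
The "old -> new version, direct / transitive" list can be produced mechanically and pasted into the prompt in place of a raw lockfile diff. This sketch assumes an npm package-lock.json with a top-level "packages" map (lockfileVersion 2 or 3) that has already been parsed into dicts; the function name and the package names in the sample are hypothetical.

```python
def lockfile_upgrades(old_lock: dict, new_lock: dict) -> list[dict]:
    """List packages whose version differs between two parsed package-lock.json files."""
    old_packages = old_lock.get("packages", {})
    new_packages = new_lock.get("packages", {})
    # The "" entry describes the root project and names its direct dependencies.
    root = new_packages.get("", {})
    direct_names = set(root.get("dependencies", {})) | set(root.get("devDependencies", {}))

    upgrades = []
    for path, meta in new_packages.items():
        before = old_packages.get(path)
        if not path or before is None or before.get("version") == meta.get("version"):
            continue
        name = path.rsplit("node_modules/", 1)[-1]
        # A nested copy (node_modules/a/node_modules/b) is transitive even if b is also direct.
        is_direct = path == f"node_modules/{name}" and name in direct_names
        upgrades.append({
            "name": name,
            "old": before.get("version"),
            "new": meta.get("version"),
            "type": "direct" if is_direct else "transitive",
        })
    # Direct upgrades first, then alphabetical: an investigation order, not a verdict.
    return sorted(upgrades, key=lambda u: (u["type"] != "direct", u["name"]))


# Hypothetical lockfiles with made-up package names.
old = {"packages": {
    "": {"dependencies": {"zeta-lib": "^1.0.0"}},
    "node_modules/zeta-lib": {"version": "1.0.0"},
    "node_modules/alpha-util": {"version": "2.3.0"},
}}
new = {"packages": {
    "": {"dependencies": {"zeta-lib": "^1.0.0"}},
    "node_modules/zeta-lib": {"version": "1.4.0"},
    "node_modules/alpha-util": {"version": "2.3.1"},
}}

for upgrade in lockfile_upgrades(old, new):
    print(f'{upgrade["name"]}: {upgrade["old"]} -> {upgrade["new"]} ({upgrade["type"]})')
```

In practice the two dicts would come from json.load on the lockfile at the parent commit and at the failing commit. Newly added or removed packages are ignored here, which is a simplification.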

6. Execution-order problems

Build passes locally and on CI in isolation but fails when run after another job. Trace likely sources: (1) shared cache between jobs, (2) env vars set by a previous job, (3) DB state not reset, (4) file artifacts leaked. Output: 4 checks ordered cheapest first.

7. "Cannot find module" specifically

CI fails with `Cannot find module [modName]`. Identify the cause: (1) dep mis-listed (in devDependencies but needed at build), (2) workspace package not built first, (3) case-sensitivity (works on macOS, fails on Linux), (4) path alias not resolved. Output: probable cause plus fix.

Replace: [modName]

8. OOM diagnosis
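
Cause (3), case sensitivity, is the one you can check without a model. A minimal sketch, assuming you already have the list of tracked file paths (for example from git ls-files) and the path the import resolves to; the function name and sample paths are illustrative.

```python
from typing import Optional


def case_mismatch(import_path: str, repo_files: list[str]) -> Optional[str]:
    """Return the real path if it matches import_path only when case is ignored."""
    if import_path in repo_files:
        return None  # Exact match: case is not the problem.
    wanted = import_path.lower()
    for real_path in repo_files:
        if real_path.lower() == wanted:
            return real_path
    return None


# Hypothetical repo contents.
files = ["src/components/Button.tsx", "src/utils/date.ts"]
print(case_mismatch("src/components/button.tsx", files))
```

A non-None result means the import works on a case-insensitive filesystem such as the macOS default and fails on Linux runners, which is exactly the split this prompt describes.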

CI failed with OOM (heap out of memory). Decide: (1) Is the build itself heavier (new deps, larger bundle)? (2) Is a test leaking memory? (3) Is concurrency too high for the runner? Output: the cheapest experiment first — usually setting NODE_OPTIONS=--max-old-space-size=4096 to confirm whether it's a memory ceiling or runaway growth.

9. Timeout vs. hang

A CI step hit its timeout. Distinguish slow from hung: (1) If the last log line is right after a network call, it's hung on a dependency. (2) If it's mid-test, it's hung on an unresolved async. (3) If logs progressed steadily until cutoff, it's just slow. Pick one diagnosis and cite the evidence line.
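
The "logs progressed steadily until cutoff" test can be approximated from log timestamps alone. This is a rough sketch, not part of the prompt: it assumes you can extract a time offset for each log line, and the stall factor of 10 and the one-second floor are arbitrary choices for illustration.

```python
from statistics import median


def classify_timeout(line_times: list[float], cutoff: float, stall_factor: float = 10.0) -> str:
    """line_times: seconds since step start for each log line, ascending.
    cutoff: seconds since step start when the step was killed."""
    if len(line_times) < 3:
        return "not enough log lines to judge"
    gaps = [later - earlier for earlier, later in zip(line_times, line_times[1:])]
    typical_gap = median(gaps)
    final_silence = cutoff - line_times[-1]
    # A final silence far longer than the usual gap suggests a hang, not slowness.
    if final_silence > stall_factor * max(typical_gap, 1.0):
        return "hung"
    return "slow"


print(classify_timeout([0, 5, 10, 15, 20], cutoff=600))
print(classify_timeout([0, 60, 120, 180, 240, 300, 360, 420, 480, 540], cutoff=600))
```

This only separates "hung" from "slow". Telling a hung network call from an unresolved async still needs the content of the last log line, which is what the prompt asks the model to cite.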

10. Revert vs. fix forward

Main is broken. Decide revert vs fix-forward: (1) If a fix-forward takes under 15 min and the cause is known, fix forward. (2) Otherwise revert, restore green main, and re-attempt in a branch. Confirm which criterion applies, then output the exact commands (git revert SHA, then open the PR).
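
The two criteria reduce to a small decision rule, sketched here so the threshold is explicit. The function name and return shape are made up for illustration, and the SHA is left as a placeholder for a human to fill in and run.

```python
def revert_or_fix_forward(cause_known: bool, fix_estimate_minutes: float) -> dict:
    """Apply the prompt's rule: fix forward only when the cause is known and the fix is under 15 minutes."""
    if cause_known and fix_estimate_minutes < 15:
        return {"decision": "fix-forward", "commands": []}
    # Everything else reverts. The command is printed for a person to review, never executed here.
    return {"decision": "revert", "commands": ["git revert <SHA>"]}


print(revert_or_fix_forward(cause_known=True, fix_estimate_minutes=10))
print(revert_or_fix_forward(cause_known=False, fix_estimate_minutes=5))
```

Note that an unknown cause forces a revert even when the fix "feels" quick. A five-minute estimate for a cause you have not confirmed is a guess.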

11. Bisecting inside a PR

A PR has 8 commits; the last fails CI. Bisect to the offending commit without running 8 builds: (1) Group commits by file area, (2) Identify the most-likely commit by static reasoning (which one introduces the imports or config the failure touches), (3) Run CI on that commit only to confirm.
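
Steps (1) and (2) can be approximated by scoring each commit on how closely its changed files overlap the paths named in the failure. A minimal sketch under stated assumptions: the 2-to-1 weighting of exact matches over same-directory matches is arbitrary, and the commit ids and paths are hypothetical.

```python
def rank_commits(commits: dict[str, list[str]], failing_paths: list[str]) -> list[tuple[str, int]]:
    """commits maps commit id -> changed files; failing_paths are paths named in the CI error."""

    def area(path: str) -> str:
        # The directory a file lives in serves as its "file area".
        return path.rsplit("/", 1)[0] if "/" in path else ""

    failing_areas = {area(path) for path in failing_paths}
    scored = []
    for commit_id, changed in commits.items():
        exact = sum(1 for path in changed if path in failing_paths)
        nearby = sum(
            1 for path in changed
            if path not in failing_paths and area(path) in failing_areas
        )
        scored.append((commit_id, 2 * exact + nearby))
    # Highest score first; ties keep the original commit order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical PR with three commits.
pr_commits = {
    "c1": ["README.md"],
    "c2": ["src/build/config.ts"],
    "c3": ["src/build/alias.ts", "src/ui/button.tsx"],
}
print(rank_commits(pr_commits, ["src/build/alias.ts"]))
```

The top-ranked commit is the one to run CI on first. If it passes, move down the list rather than falling back to running every commit.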

12. CI incident post-mortem

CI was red for 4 hours today. Write a brief blameless post-mortem: (1) Trigger commit, (2) Time to detect, (3) Time to revert, (4) Why the bad commit landed (test gap / missing CI check / skipped review), (5) One follow-up action with an owner. 200 words max.

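Common pitfalls

Reading only the last error instead of the first: downstream noise hides the root cause.
Adding retries to "fix" a flake: this hides real timing and ordering bugs.
Bumping dependency versions "just to try": every upgrade adds another variable.
Reverting without locating the cause: the next person hits the same bug.
Flushing every cache: the build passes once and the real problem can never be found again.
Disabling the failing test: this just pushes the bug into production.
Skipping the post-mortem after a long incident: the same incident returns two months later.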

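Optimization tips

Capture the CI environment (Node version, OS, lockfile sha) before you touch anything.
Flakes have signatures: timeout, Date.now, random, port conflicts, test order, network. Real bugs come with a stack trace pointing at your own code.
Cache the global store, not node_modules. The actions/cache v4 guidance is to cache ~/.npm and let npm ci rebuild from a cache keyed on the lockfile, which stays reproducible across Node versions.
When stuck, turn on deep logging: set the repository secrets ACTIONS_STEP_DEBUG and ACTIONS_RUNNER_DEBUG to true and rerun the failing run once.
Catch workflow errors before they reach CI: validate locally with actionlint and dry-run jobs with nektos/act.
Always use npm ci in CI, never npm install, so the lockfile has the final say.
After every long incident, add one CI check that would have caught it.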
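
The "lockfile as the cache key" idea, and the lockfile sha worth capturing before an investigation, both come down to hashing the lockfile. The sketch below shows the idea in plain Python. In a GitHub Actions workflow the equivalent is usually written with the hashFiles expression, and the key layout used here is only an assumption.

```python
import hashlib
import platform
import tempfile
from pathlib import Path


def npm_cache_key(lockfile: Path, prefix: str = "npm") -> str:
    """Build a cache key that changes whenever the lockfile's bytes change."""
    digest = hashlib.sha256(lockfile.read_bytes()).hexdigest()
    return f"{platform.system()}-{prefix}-{digest}"


# Demonstration with a throwaway lockfile: any edit produces a new key.
with tempfile.TemporaryDirectory() as tmp:
    lock = Path(tmp) / "package-lock.json"
    lock.write_text('{"lockfileVersion": 3}')
    key_before = npm_cache_key(lock)
    lock.write_text('{"lockfileVersion": 3, "packages": {}}')
    key_after = npm_cache_key(lock)

print(key_before != key_after)
```

Because the key is derived from the lockfile, a poisoned entry is tied to one specific key, which is what makes flushing "only that key" possible.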

FAQ

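Can the AI revert automatically? No. When the signal is unambiguous (main goes red right after a merge) it can make the call automatically, but actually running git revert needs a human sign-off.
Can the AI pinpoint which dependency broke things? Often, by reading the changelog alongside the lockfile diff. Then pin that one package back to its previous version to confirm.
Does "retry once" count as a fix? Only as a stopgap while you mark the test as flaky. About two thirds of rerun builds pass with no code change, but a blind retry also hides the real timing and ordering bugs that happen to pass the second time. Removing the source of non-determinism is the actual fix.
How long before reverting? 15 minutes. Beyond that you are spending the whole team's time.
Should red CI block every PR? Block merges to main, not work on PR branches, or everyone is stuck.
How do I handle cache poisoning? Find the poisoned key, flush only that key (not the whole cache), and record it in the post-mortem.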