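"CI is red" is not a problem description. A good build-failure prompt names the failing job, the diff that triggered it, and the environment it ran in, then bisects among code, dependencies, cache, and flake. Anything less is throwing wrenches at the machine.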
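Who this is for
Engineers staring at a red main branch, release captains blocked from merging, and solo developers who lost an evening to a CI run that went from green to red.
When not to use these prompts
Don't use them before you have read the log. And don't use AI to "fix" CI: silencing a failing job is not a fix.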
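Prompt structure formula
Every build-failure prompt should carry these six elements:
- Role: who the AI plays (SRE / Release Captain / staff engineer / QA Lead).
- Context: tech stack / branch / failure log / diff / dashboard URL.
- Goal: one concrete deliverable, such as a root cause, checklist, plan, ticket list, or runbook.
- Constraints: what the AI must not do (no automatic fixes, no invented file paths).
- Output format: numbered list, markdown table, JSON, unified diff, or runnable code.
- Examples / signals: one or two examples of good output, or a counterexample.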
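One way to keep all six elements present is to assemble the prompt from named fields rather than free text. The sketch below is only an illustration: the `BuildFailurePrompt` class and its field names are hypothetical and not part of the templates.

```python
from dataclasses import dataclass


@dataclass
class BuildFailurePrompt:
    """Hypothetical container for the six elements listed above."""
    role: str
    context: str
    goal: str
    constraints: str
    output_format: str
    examples: str

    def render(self) -> str:
        parts = [
            ("Role", self.role),
            ("Context", self.context),
            ("Goal", self.goal),
            ("Constraints", self.constraints),
            ("Output format", self.output_format),
            ("Examples", self.examples),
        ]
        # Refuse to render if any element was left blank.
        missing = [label for label, value in parts if not value.strip()]
        if missing:
            raise ValueError("missing prompt elements: " + ", ".join(missing))
        return "\n".join(f"{label}: {value.strip()}" for label, value in parts)
```

Raising on a blank field is a deliberate choice here: a prompt without constraints or an output format tends to fail quietly, so it is cheaper to fail before sending it.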
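Where these prompts fit
- Narrowing a red CI run down to a specific diff
- Telling a flake from a real failure
- Differences between the CI environment and local
- Investigating cache poisoning
- Deciding between rollback and fix-forward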
12 copy-paste prompt templates
1. Read this log and find the root cause
Here is the failing CI log: {log}. Identify the FIRST real error (not symptoms downstream). Output: (1) the error line, (2) the most likely cause among: code bug / dep mismatch / cache poison / env diff / flake, (3) the next 1-2 commands to confirm. No speculation — only what the log supports.
Variables: log (the full failure log)
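Before pasting a long log into this template, it can help to pre-locate the first line that looks like an error, so that you and the model start reading from the same place. A minimal sketch, assuming errors are flagged by the words in `ERROR_WORDS`. That word list is an illustrative guess; real toolchains differ, and a line such as "0 errors" would be a false positive.

```python
import re
from typing import List, Optional, Tuple

# Assumption: these words mark an error line. Tune them for your toolchain.
ERROR_WORDS = re.compile(r"error|fatal|exception|failed", re.IGNORECASE)


def first_error(log: str, context: int = 2) -> Optional[Tuple[int, List[str]]]:
    """Return (1-based line number, surrounding lines) of the first match."""
    lines = log.splitlines()
    for i, line in enumerate(lines):
        if ERROR_WORDS.search(line):
            start = max(0, i - context)
            return i + 1, lines[start:i + context + 1]
    return None  # nothing in the log matched the word list
```

The function stops at the first match on purpose: later matches are usually downstream symptoms of that first failure.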
2. Local vs CI environment differences
I cannot reproduce a CI failure locally. Compare these envs: local node {nodeLocal}, CI node {nodeCI}, OS, env vars, lockfile drift, cached vs fresh install. Output the 5 most likely diffs and one command each to verify.
Variables: nodeLocal, nodeCI
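The comparison this template asks for is easier to feed to a model once both environments are reduced to key/value pairs. The sketch below shows one way to list only the keys that differ. The `env_diff` helper and the `<unset>` placeholder are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple


def env_diff(local: Dict[str, str], ci: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """List (key, local value, CI value) for every key whose values differ."""
    rows = []
    for key in sorted(set(local) | set(ci)):
        if local.get(key) != ci.get(key):
            # "<unset>" marks a key that exists on only one side.
            rows.append((key, local.get(key, "<unset>"), ci.get(key, "<unset>")))
    return rows


if __name__ == "__main__":
    local = {"node": "v20.11.0", "TZ": "Europe/Berlin"}
    ci = {"node": "v18.19.0", "TZ": "Europe/Berlin", "CI": "true"}
    for key, local_value, ci_value in env_diff(local, ci):
        print(f"{key}: local={local_value} ci={ci_value}")
```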
3. Flake or real failure
A test failed once on CI but passes on retry. Decide flake vs real: (1) Look at the diff that landed before the run — does it touch the test's subject? (2) Check the failure frequency last 7 days, (3) Inspect the error for non-deterministic terms (timeout / Date.now / random). Output: probability flake (0-1), reasoning, and what to do next.
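Step (3) of this template can be pre-checked mechanically before asking the model for a probability. A rough sketch, assuming a plain substring search is good enough. The signature list extends the terms in the prompt with "network", and a match is a hint, not a verdict.

```python
from typing import List

# Assumption: these substrings suggest non-determinism in an error message.
FLAKE_SIGNATURES = ("timeout", "date.now", "random", "network")


def flake_signals(error_text: str) -> List[str]:
    """Return the signatures found in the error text, in list order."""
    lowered = error_text.lower()
    return [sig for sig in FLAKE_SIGNATURES if sig in lowered]
```

An empty result does not prove the failure is real; it only means none of the listed terms appear, so steps (1) and (2) still apply.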
4. Cache poisoning diagnosis
CI succeeded last run, fails now, same diff. Suspect cache. Check: (1) Last cache key change, (2) Lockfile changes that altered hoist resolution, (3) Postinstall scripts that read env, (4) Restore time anomalies. Output: most likely cache layer + the one cache flush command to try.
5. Locating dependency drift
Lockfile changed in this commit. Find the actual upgraded package(s) most likely to have broken CI: list each upgrade with old → new version, type (direct / transitive), and changelog highlights for the version jump. Don't propose downgrades — propose investigation order.
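The "old → new version, direct / transitive" list this template asks for can be drafted from the two lockfiles before the model reads any changelog. The sketch below assumes npm's `package-lock.json` with `lockfileVersion` 2 or 3, where a top-level `packages` object maps install paths to entries with a `version` field and the `""` entry describes the root project. The `upgraded_packages` helper is hypothetical; other package managers use different layouts.

```python
from typing import Any, Dict, List, Tuple


def upgraded_packages(
    old_lock: Dict[str, Any], new_lock: Dict[str, Any]
) -> List[Tuple[str, str, str, str]]:
    """Return (name, old version, new version, direct/transitive) per change."""
    old_pkgs = old_lock.get("packages", {})
    new_pkgs = new_lock.get("packages", {})
    root = new_pkgs.get("", {})
    direct = set(root.get("dependencies", {})) | set(root.get("devDependencies", {}))

    changes = []
    for path, meta in sorted(new_pkgs.items()):
        if not path:
            continue  # skip the root project entry
        old_version = old_pkgs.get(path, {}).get("version")
        new_version = meta.get("version")
        if old_version and new_version and old_version != new_version:
            name = path.split("node_modules/")[-1]
            # Only a top-level install that the root declares counts as direct.
            is_direct = path == f"node_modules/{name}" and name in direct
            changes.append((name, old_version, new_version,
                            "direct" if is_direct else "transitive"))
    return changes
```

Packages that were newly added or removed are ignored here; the sketch only reports version changes at an install path present in both lockfiles.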
6. Execution-order problems
Build passes locally and on CI in isolation, fails when run after another job. Trace likely sources: (1) shared cache between jobs, (2) env vars set by previous job, (3) DB state not reset, (4) file artifacts leaked. Output: 4 checks ordered cheapest first.
7. "Cannot find module" special case
CI fails with `Cannot find module {modName}`. Identify the cause: (1) deps mis-listed (in devDeps but needed at build), (2) workspace package not built first, (3) case-sensitivity (works on Mac / fails on Linux), (4) path alias not resolved. Output: probable cause + fix.
Variables: modName
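Cause (3), case sensitivity, is the one item in this template that can be checked without a model. A minimal sketch, assuming you already have the list of imported paths and the list of files in the repository. The `case_mismatches` helper and its inputs are hypothetical.

```python
from typing import List, Tuple


def case_mismatches(imported: List[str], on_disk: List[str]) -> List[Tuple[str, str]]:
    """Pair each imported path with the on-disk path that differs only by case."""
    by_lower = {path.lower(): path for path in on_disk}
    mismatches = []
    for path in imported:
        actual = by_lower.get(path.lower())
        if actual is not None and actual != path:
            mismatches.append((path, actual))
    return mismatches
```

A hit here explains why the import resolves on a case-insensitive Mac filesystem and fails on Linux.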
8. OOM diagnosis
CI failed with OOM. Decide: (1) Is the build itself heavier (new deps, larger bundle)? (2) Is a test leaking memory? (3) Is concurrency too high? Output: cheapest experiment first — usually `NODE_OPTIONS=--max-old-space-size=4096` to confirm if it's memory or runaway.
9. timeout vs hang
A CI step hit timeout. Distinguish slow vs hung: (1) Show the last log line — if it's after a network call, hang on dep. (2) If it's mid-test, hang on async. (3) If logs progress steadily, slow. Pick one diagnosis with evidence.
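The slow-versus-hung distinction can be approximated from log timestamps alone: a step that kept logging until the timeout was probably slow, and one that went silent long before it was probably hung. The sketch below assumes timestamps are available as seconds. The 300-second `quiet_threshold` is an arbitrary illustrative default, not a recommendation from the template.

```python
from typing import List


def classify_timeout(
    log_times: List[float], timeout_at: float, quiet_threshold: float = 300.0
) -> str:
    """Label a timed-out step "hung" or "slow" from its log timestamps."""
    if not log_times:
        return "hung"  # no output at all before the timeout
    silence = timeout_at - max(log_times)
    # A long silent gap before the timeout suggests the step stopped progressing.
    return "hung" if silence >= quiet_threshold else "slow"
```

The label only narrows the search; the last log line still decides whether the hang is on a network call or inside a test.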
10. Revert vs fix forward
Main is broken. Decide revert vs fix-forward: (1) Time to fix forward < 15 min? Fix forward, (2) Else revert, restore green main, re-attempt in a branch. Confirm criteria, then output the exact commands (revert SHA + open PR).
11. Bisecting within a PR
A PR has 8 commits. Last commit fails CI. Bisect to find the offending commit without running 8 builds: (1) Group commits by file area, (2) Identify the most-likely commit by static reasoning (which one introduces the imports / config touched by the failure), (3) Run CI on that commit only.
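The static reasoning in step (2) can start from a simple overlap count: which commit touches the most files named in the failure. This is a sketch under that assumption only. The `rank_commits` helper is hypothetical, and the file lists would come from something like `git show --name-only` for each commit.

```python
from typing import Dict, List, Tuple


def rank_commits(
    commit_files: Dict[str, List[str]], failure_paths: List[str]
) -> List[Tuple[str, int]]:
    """Rank commits by how many files they share with the failure output."""
    suspects = set(failure_paths)
    scored = [(sha, len(suspects & set(files))) for sha, files in commit_files.items()]
    # sorted() is stable, so commits with equal scores keep their input order.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The top-ranked commit is the one to run CI on first; a score of zero everywhere means the failure is probably not tied to a touched file, which points back toward environment or cache.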
12. Brief CI incident post-mortem
CI was red for 4 hours today. Generate a brief post-mortem: (1) Trigger commit, (2) Time to detect, (3) Time to revert, (4) Why the bad commit landed (test gap / missing CI check), (5) One follow-up. 200 words max. No blame.
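Common pitfalls
- Reading only the last error: downstream noise hides the root cause.
- Adding retries to "fix" a flake: it masks timing bugs.
- Bumping a dependency version "just to try": it adds variables.
- Reverting without locating the cause: the next person walks into the same bug.
- Clearing every cache: the build passes once and the real problem can no longer be found.
- Disabling the failing test: it pushes the bug into production.
- Skipping the post-mortem after a long incident: it happens again in two months.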
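Optimization tips
- Always start reading from the first error.
- Before touching anything, capture the CI environment (node, OS, lockfile sha).
- Flakes have signatures (timeout / Date.now / random / network); real bugs have a stack trace through your own code.
- Set a timebox: if you have no lead after 15 minutes, revert and fix in a branch.
- Make CI logs carry context, not just `exit 1`.
- Add one new CI check after every long incident.
- On CI, always run `npm ci`, never `npm install`.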
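The advice to capture the CI environment (node, OS, lockfile sha) can be scripted so the same snapshot is taken locally and on CI. A minimal sketch, assuming `node` is on `PATH` and the lockfile is `package-lock.json`. The `capture_env` helper and its output keys are hypothetical.

```python
import hashlib
import platform
import shutil
import subprocess
from pathlib import Path
from typing import Dict, Optional


def capture_env(lockfile: str = "package-lock.json") -> Dict[str, Optional[str]]:
    """Snapshot node version, OS description, and lockfile hash."""
    node_version = None
    node = shutil.which("node")
    if node:
        try:
            result = subprocess.run(
                [node, "--version"], capture_output=True, text=True, check=False
            )
            node_version = result.stdout.strip() or None
        except OSError:
            node_version = None  # node was found but could not be executed

    lock = Path(lockfile)
    # None means the lockfile was not found at the given path.
    lock_sha = hashlib.sha256(lock.read_bytes()).hexdigest() if lock.is_file() else None

    return {"node": node_version, "os": platform.platform(), "lockfile_sha256": lock_sha}
```

Printing this dictionary as the first step of a job, and once on your own machine, gives template 2 concrete values to compare.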
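Going deeper in practice
When you use these prompts, don't swap in a single topic word and ship the result. For "Build failure triage prompts: 12 templates for investigating a red CI", first fill in the audience, channel, length, tone, reference examples, forbidden styles, and success criteria, then have the model produce two different versions and compare them side by side. A good result is one another person can reuse directly, not one that merely reads smoothly while saying little.
If the output reads like a generic template, add one real scenario, one counterexample, and one checkable metric in the next round, such as click-through rate, a conversion action, word count, platform limits, or brand restrictions. Content revised this way is closer to a usable asset than to a one-off draft.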
FAQ
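- Can AI roll back automatically?: When the signal is clear (main goes red right after a merge) the rollback can be automated, but a human still has to sign off.
- Can AI pinpoint which dependency broke things?: Yes, by reading the changelog and the lockfile diff. Then confirm by pinning a single version.
- Does "retry once" count as a fix?: Only as a temporary measure for marking a test flaky. The real fix removes the source of non-determinism.
- How long before rolling back?: 15 minutes. Beyond that you are spending other people's time.
- Should a red CI block all PRs?: Block merges into main, not PR branches; otherwise everyone is stuck.
- How do I handle cache poisoning?: Locate the poisoned key, clear only that key, and record it in the post-mortem.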