You deploy three instances of a LangGraph code-review agent across different environments. Production is on version 1.2, staging is on 1.3, and a canary instance is on 1.4. Each version has slightly different system prompts — version 1.2 checks for security issues, 1.3 added performance checks but accidentally dropped the formatting guidelines, 1.4 restored formatting but changed the output schema field names. Users see wildly inconsistent review quality depending on which instance handles their PR. Worse, the inconsistency is hard to diagnose because no one tracked what changed in the prompts between versions — the prompts live in Python strings scattered across the codebase.
Common causes
1. Prompts stored as inline strings with no version tracking
The system prompt is defined as a triple-quoted string in agents/reviewer.py. When a developer edits it, the change lands in the same commit as the surrounding code. The prompt has no history of its own and no changelog, and its diff is easy to miss in review because it sits among ordinary code changes in the agent file.
How to spot it: Search for triple-quoted strings longer than 3 lines in agent files. Any multi-line string in a .py file that starts with “You are” or contains role/task instructions is a prompt living without version tracking.
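The search described above can be automated. The following is a minimal sketch, assuming a heuristic of "four or more lines, or starts with You are"; the function name find_inline_prompts and the threshold are illustrative, not part of any existing tool. Long docstrings will also be flagged, so treat the result as a list of candidates to review.

```python
import ast


def find_inline_prompts(source: str, min_lines: int = 4) -> list[tuple[int, str]]:
    """Return (line_number, first_line) for string literals that look like prompts.

    Heuristic (an assumption, tune for your codebase): a literal is flagged if it
    spans at least `min_lines` lines or starts with "You are".
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            text = node.value.strip()
            is_long = len(text.splitlines()) >= min_lines
            looks_like_prompt = text.startswith("You are")
            if is_long or looks_like_prompt:
                hits.append((node.lineno, text.splitlines()[0]))
    return sorted(hits)
```

Running it over each file under the agent directory (for example with pathlib.Path.rglob("*.py")) gives a worklist of strings to move into versioned files.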
2. Prompts assembled dynamically from multiple sources without a manifest
The final prompt is assembled from a base template, a task-specific snippet, and a format instruction block. Each piece comes from a different file or function. There is no manifest that records which combination of pieces was used for any given run. Two versions differ in one piece, and the difference is invisible without comparing all three sources.
How to spot it: Search for string concatenation or f-string building of prompts (system_prompt = base + task_prompt + format_block). If the final assembled prompt is not logged as a whole, the specific combination used for each run is unknown.
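A manifest for an assembled prompt might look like the sketch below. It assumes the pieces are passed as a name-to-text mapping and joined with blank lines; assemble_prompt and the manifest keys are hypothetical names chosen for illustration.

```python
import hashlib


def _digest(text: str) -> str:
    """Short SHA-256 fingerprint of a piece of prompt text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def assemble_prompt(parts: dict[str, str]) -> tuple[str, dict]:
    """Join named prompt pieces in insertion order and return (prompt, manifest).

    The manifest records one hash per piece plus one for the assembled whole,
    so two runs can be compared piece by piece.
    """
    prompt = "\n\n".join(parts.values())
    manifest = {
        "parts": {name: _digest(text) for name, text in parts.items()},
        "assembled": _digest(prompt),
    }
    return prompt, manifest
```

Logging the manifest with each run makes it possible to see which of the three sources differs between two versions without re-reading every file.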
3. Prompt is fetched from a remote config store with no pinning
The agent fetches the system prompt from a config service (GET /prompts/reviewer/latest) on startup. “Latest” changes when someone updates the config. Two agent instances that started at different times have different prompts. There is no pinning, no version field in the fetch URL, and no record of which version each instance loaded.
How to spot it: Check every GET /prompts/ call in the agent code. If the URL contains latest or carries no version parameter, different instances will load different prompts depending on when they started.
4. Prompt interpolation changes field names between versions
Version 1.3 renamed {output_format} to {response_schema} in the template, but two call sites were not updated. They still pass output_format, which the new template has no placeholder for, and nothing supplies response_schema. With a strict renderer such as str.format this raises KeyError; with a lenient one (Jinja2's default undefined handling, or a wrapper that substitutes an empty string for missing keys) the format section silently renders blank. The agent then receives a prompt with no format instructions and produces free-form output.
How to spot it: Compare the placeholder variables in every prompt template against the keyword arguments passed to template.format(...) or Template.substitute(...). Both calls ignore extra kwargs and raise KeyError on missing ones, whereas Template.safe_substitute(...) leaves the placeholder text in place and lenient renderers emit an empty string. Any mismatch is a drift indicator.
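How a mismatch surfaces depends on the renderer. The sketch below reproduces the rename scenario with three standard-library approaches; the template text and the BlankDefault helper are made up for illustration.

```python
from string import Template

new_template = "Review the diff. Respond using: {response_schema}"
stale_kwargs = {"output_format": "JSON with keys file, line, comment"}

# str.format: the extra kwarg is ignored, the missing one raises KeyError.
try:
    new_template.format(**stale_kwargs)
    format_result = "rendered"
except KeyError as exc:
    format_result = f"KeyError: {exc}"

# Template.safe_substitute: the missing name is left in place and no error is raised.
dollar_template = Template("Review the diff. Respond using: $response_schema")
safe_result = dollar_template.safe_substitute(**stale_kwargs)


class BlankDefault(dict):
    """Hypothetical lenient mapping that returns "" for any missing key."""

    def __missing__(self, key):
        return ""


# A lenient wrapper like this is what produces the silently blank format section.
lenient_result = new_template.format_map(BlankDefault(stale_kwargs))
```

Only the strict path fails loudly. The other two hand the model a prompt that looks valid but has lost its format instructions.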
5. A/B test framework changed prompts without recording which version was active per run
The pipeline uses an A/B test framework to experiment with prompt variants. The framework changes which variant an agent receives based on a runtime flag. Run logs don’t record which variant was active. When debugging a bad output, you don’t know whether the agent was on variant A or B.
How to spot it: Check whether the A/B framework’s variant assignment is recorded in the run’s metadata. If run logs have no prompt_variant field, you cannot correlate output quality with the active variant.
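One way to make the variant impossible to omit is to build every run record through a helper that refuses to emit a record without it. This is a sketch under the assumption that run logs are JSON lines; build_run_record and the field names other than prompt_variant are hypothetical.

```python
import json

REQUIRED_FIELDS = ("run_id", "prompt_version", "prompt_variant")


def build_run_record(run_id: str, prompt_version: str, prompt_variant: str, **extra) -> str:
    """Return a JSON log line that always carries the active prompt variant."""
    record = {
        "run_id": run_id,
        "prompt_version": prompt_version,
        "prompt_variant": prompt_variant,
    }
    record.update(extra)
    # Refuse to log a run whose variant (or other required field) is empty.
    missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
    if missing:
        raise ValueError(f"Run record is missing required fields: {missing}")
    return json.dumps(record, sort_keys=True)
```

The variant value itself would come from whatever the A/B framework reports for that run.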
6. Prompt evolution is not covered by regression tests
Developers change prompt wording to improve one scenario, which regresses another. There are no golden-output tests for the agent. The regression is discovered in production when a user reports different behavior, not before deployment.
How to spot it: Check the test suite for any test that runs the agent with specific inputs and asserts on the output quality or format. If there are zero such tests, prompt regressions are not caught before deployment.
Shortest path to fix
Step 1: Move prompts to versioned files and track them in git
agents/
  prompts/
    reviewer/
      v1.2.0.txt
      v1.3.0.txt
      v1.4.0.txt
      CHANGELOG.md  # what changed in each version
Load prompts by version:
from pathlib import Path

def load_prompt(agent: str, version: str) -> str:
    """Read the prompt text for one agent at one explicit version."""
    path = Path(f"agents/prompts/{agent}/{version}.txt")
    if not path.exists():
        raise FileNotFoundError(f"Prompt not found: {path}")
    return path.read_text(encoding="utf-8")

REVIEWER_PROMPT = load_prompt("reviewer", "v1.4.0")
Git history on the prompt files is now explicit, searchable, and attributable.
Step 2: Log the prompt version and hash with every run
import hashlib
import logging

logger = logging.getLogger(__name__)

def get_prompt_with_metadata(agent: str, version: str) -> dict:
    content = load_prompt(agent, version)
    return {
        "content": content,
        "version": version,
        # First 16 hex characters of the SHA-256 digest: short, but enough to tell versions apart.
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest()[:16],
    }

# In every agent invocation (REVIEWER_VERSION and run_id are supplied by the caller):
prompt_meta = get_prompt_with_metadata("reviewer", REVIEWER_VERSION)
logger.info(
    "Agent invocation: agent=reviewer prompt_version=%s prompt_hash=%s run_id=%s",
    prompt_meta["version"], prompt_meta["sha256"], run_id,
)
When debugging an inconsistency, compare the prompt_hash values across instances.
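The comparison can be as simple as grouping instances by the hash they logged. A minimal sketch, assuming each log record has been parsed into a dict with hypothetical instance and prompt_hash keys:

```python
from collections import defaultdict


def group_instances_by_hash(records: list[dict]) -> dict[str, list[str]]:
    """Map each prompt hash to the sorted list of instances that logged it."""
    groups = defaultdict(list)
    for record in records:
        groups[record["prompt_hash"]].append(record["instance"])
    return {prompt_hash: sorted(names) for prompt_hash, names in groups.items()}
```

More than one key in the result means the instances are not running the same prompt.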
Step 3: Pin prompt versions in deployment config — no “latest”
# deployment/config.yaml
agents:
  reviewer:
    prompt_version: "v1.4.0"  # explicit pin, not "latest"
  coder:
    prompt_version: "v2.1.0"

# Agent startup (Python). load_deployment_config is the project's own loader and is not shown here.
config = load_deployment_config("deployment/config.yaml")
REVIEWER_PROMPT = load_prompt("reviewer", config["agents"]["reviewer"]["prompt_version"])
Updating the prompt requires a config change, which goes through code review and deployment like any other change.
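A pin is only useful if floating values are rejected. The sketch below assumes version strings follow the vMAJOR.MINOR.PATCH naming used for the prompt files in Step 1; require_pinned_version is a hypothetical helper.

```python
import re

# Accepts "v1.4.0" style pins only; "latest", "", and "v1" are rejected.
PINNED_VERSION = re.compile(r"v\d+\.\d+\.\d+")


def require_pinned_version(agent: str, version: str) -> str:
    """Return the version unchanged if it is an explicit pin, else raise."""
    if not PINNED_VERSION.fullmatch(version):
        raise ValueError(
            f"Agent {agent!r} has unpinned prompt version {version!r}; "
            "expected the form vMAJOR.MINOR.PATCH"
        )
    return version
```

Calling it on every prompt_version entry while loading the deployment config turns a stray "latest" into a startup failure.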
Step 4: Validate all prompt placeholder variables at load time
import logging
import string

logger = logging.getLogger(__name__)

class PromptConfigError(ValueError):
    """Raised when a prompt template and its arguments disagree."""

def validate_prompt_placeholders(template: str, kwargs: dict) -> str:
    formatter = string.Formatter()
    # Collect named fields; skip literal-only segments (None) and positional "{}" (empty name).
    template_fields = {
        field_name
        for _, field_name, _, _ in formatter.parse(template)
        if field_name
    }
    provided = set(kwargs.keys())
    missing = template_fields - provided
    extra = provided - template_fields
    if missing:
        raise PromptConfigError(f"Missing kwargs for prompt template fields: {missing}")
    if extra:
        logger.warning("Extra kwargs not in prompt template: %s", extra)
    return template.format(**kwargs)
Run this at agent startup, not only on each invocation, so that a template mismatch fails fast as a configuration error instead of silently degrading output at runtime.
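At startup the real argument values are usually not available yet, but the argument names each call site passes are. A sketch of a names-only check, assuming call sites are declared in a mapping; template_fields, check_call_sites, and the call-site names are hypothetical.

```python
import string


def template_fields(template: str) -> set[str]:
    """Return the named placeholders in a str.format-style template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def check_call_sites(template: str, call_sites: dict[str, set[str]]) -> dict[str, dict[str, set[str]]]:
    """Return {site: {"missing": ..., "extra": ...}} for every mismatched call site."""
    expected = template_fields(template)
    problems = {}
    for site, provided in call_sites.items():
        missing = expected - provided
        extra = provided - expected
        if missing or extra:
            problems[site] = {"missing": missing, "extra": extra}
    return problems
```

An empty result means every declared call site matches the template; anything else can be raised as a startup error.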
Step 5: Write golden-output regression tests for critical agent behaviors
GOLDEN_TESTS = [
    {
        "input": "Review this code: def login(user, pw): return db.query(f'SELECT * FROM users WHERE pw={pw}')",
        "must_contain": ["SQL injection", "parameterized"],
        "must_not_contain": ["looks good", "no issues"],
        "prompt_version": "v1.4.0",
    },
]

def test_reviewer_golden_outputs():
    for test in GOLDEN_TESTS:
        output = run_reviewer_agent(test["input"], prompt_version=test["prompt_version"])
        for phrase in test["must_contain"]:
            assert phrase.lower() in output.lower(), (
                f"Expected '{phrase}' in output but got: {output[:200]}"
            )
        for phrase in test["must_not_contain"]:
            assert phrase.lower() not in output.lower(), (
                f"Found banned phrase '{phrase}' in output: {output[:200]}"
            )
Run these in CI before promoting any prompt version change to staging.
Prevention
- Store all prompt templates as versioned text files under source control — never as inline Python strings in agent code.
- Log the exact prompt version and SHA-256 hash with every agent run so you can tell exactly which prompt any past run used.
- Pin prompt versions explicitly in deployment configuration; disallow “latest” or unversioned prompt fetches.
- Validate all template placeholder variables at agent startup — fail loudly on variable mismatch rather than silently rendering empty fields.
- Write golden-output regression tests covering the most important agent behaviors; run them in CI for every prompt change.
- Maintain a CHANGELOG.md for each agent’s prompts documenting what each version added, removed, or changed.
- When A/B testing prompts, record the variant in run metadata and build a dashboard correlating variant with output quality metrics.
- Require code review for prompt changes just as for code changes — prompt text is production logic, not documentation.
FAQ
Q: Should prompts be versioned separately from the agent code that uses them? A: Yes. Prompts change for different reasons (quality tuning, new requirements, bug fixes in instructions) than code does (new features, refactoring). Separate versioning allows you to roll back a prompt regression without rolling back code changes, and vice versa.
Q: How do I manage prompts for a large fleet of 50+ agents? A: Use a prompt registry service that stores prompts with version, agent ID, environment tag, and SHA-256. Agents fetch their pinned version at startup and cache it. The registry provides a diff endpoint between any two versions. Hosted tools such as PromptLayer or the LangSmith prompt hub offer prompt versioning along these lines.
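The diff a registry exposes between two versions can be approximated locally with the standard library. A minimal sketch; diff_prompts is a hypothetical helper and the labels are whatever version names you pass in.

```python
import difflib


def diff_prompts(old: str, new: str, old_label: str, new_label: str) -> str:
    """Return a unified diff between two prompt texts, or "" if they are identical."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(lines)
```

Pairing this with the load_prompt function from Step 1 gives a quick way to see what changed between, say, v1.2.0 and v1.3.0.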
Q: What’s the right cadence for prompt version bumps? A: Bump the version on every substantive change (wording, added instructions, removed sections, changed field names). Use semantic versioning: patch for typo fixes, minor for added capabilities, major for breaking output schema changes. Never edit a prompt in-place without bumping the version.
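The bump rule can be encoded so that the next version is computed from the kind of change. A sketch assuming the vMAJOR.MINOR.PATCH naming used earlier; bump_version and the change labels are hypothetical.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next version for a 'patch', 'minor', or 'major' change."""
    major, minor, patch = (int(part) for part in version.removeprefix("v").split("."))
    if change == "major":      # breaking output schema change
        return f"v{major + 1}.0.0"
    if change == "minor":      # added capability
        return f"v{major}.{minor + 1}.0"
    if change == "patch":      # typo fix
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown change type: {change!r}")
```

Note that str.removeprefix requires Python 3.9 or later.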
Q: How do I evaluate whether a new prompt version is better than the old one? A: Create an evaluation dataset of 20-50 inputs with human-rated expected outputs. Run both versions on the dataset and compare scores. Only promote the new version if it improves average score without regressing any individual test case below the previous version’s score.
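The promotion rule in that answer can be written as a small gate. A sketch assuming each version's results are a mapping from test-case ID to a numeric human rating; should_promote is a hypothetical name.

```python
from statistics import mean


def should_promote(old_scores: dict[str, float], new_scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (promote, regressed_case_ids) for a candidate prompt version.

    Promote only if the average score improves and no individual case
    scores lower than it did under the previous version.
    """
    if old_scores.keys() != new_scores.keys():
        raise ValueError("Both versions must be scored on the same test cases")
    regressions = sorted(case for case in old_scores if new_scores[case] < old_scores[case])
    improved = mean(new_scores.values()) > mean(old_scores.values())
    return improved and not regressions, regressions
```

Returning the regressed case IDs alongside the decision shows which inputs to inspect before revising the prompt.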