You deploy a LangGraph infrastructure-provisioning agent that is supposed to verify AWS credentials, check that the target environment is not production, and confirm resource quotas before making any Terraform calls. On a particularly long task, the orchestrator’s pre-flight step times out and the pipeline’s fallback logic marks it as “skipped (non-fatal)” and continues. The Terraform agent proceeds, exhausts the region’s EC2 quota, and leaves a half-provisioned environment that takes hours to clean up. Pre-flight checks are only valuable if they are unconditionally enforced — any code path that allows them to be skipped will eventually be triggered.
Common causes
1. Pre-flight result is treated as optional in the orchestration logic
The orchestration code does preflight_result = run_preflight(task) but only checks the result with if preflight_result.passed: log_success(). It never blocks execution when preflight_result.passed is False or when preflight_result is None (timeout or error). Execution always continues.
How to spot it: Find the code that consumes the pre-flight result. If execution proceeds regardless of the result — including on None, error, or False — the check is advisory, not mandatory.
2. Pre-flight check times out and the timeout is caught as “ok to skip”
The pre-flight check calls an external validation service that is slow. After 30 seconds, the check times out. The exception handler catches TimeoutError and sets preflight_result = PreflightResult(passed=True, skipped=True) to “be safe and not block the user.” This is backwards — a timeout should block, not allow.
How to spot it: Find the except TimeoutError or equivalent handler for the pre-flight step. If it sets passed=True or falls through to execution, it converts a timeout into an automatic approval.
3. Pre-flight is defined in the prompt but not enforced in code
The system prompt says “Before doing anything, always check X, Y, and Z.” The agent sometimes skips step Z because it decides Z is “not relevant” to this particular task. Prompt-based pre-flight is unreliable — the model can always rationalize skipping a step.
How to spot it: Compare the pre-flight steps listed in the prompt against the tool calls actually made at the start of each run. If Z appears in the prompt but is absent from tool call traces on some runs, it is prompt-only and not code-enforced.
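The comparison above can be automated against whatever trace format the pipeline records. The following is a minimal sketch, assuming each run is available as an ordered list of tool-call names. The tool names and the helpers `missing_preflight_calls` and `audit_runs` are hypothetical and not part of any framework.

```python
# Hypothetical tool names; substitute the ones your agent actually exposes.
REQUIRED_PREFLIGHT_TOOLS = ["verify_aws_credentials", "check_environment", "check_quota"]


def missing_preflight_calls(tool_calls: list[str], first_side_effect: str) -> list[str]:
    """Return required pre-flight tools that were not called before the first
    side-effecting tool call (or at all, if that call never happened)."""
    if first_side_effect in tool_calls:
        prefix = tool_calls[: tool_calls.index(first_side_effect)]
    else:
        prefix = tool_calls
    return [tool for tool in REQUIRED_PREFLIGHT_TOOLS if tool not in prefix]


def audit_runs(runs: dict[str, list[str]], first_side_effect: str) -> dict[str, list[str]]:
    """Map run id -> missing pre-flight tools, only for runs that skipped any."""
    report = {}
    for run_id, calls in runs.items():
        missing = missing_preflight_calls(calls, first_side_effect)
        if missing:
            report[run_id] = missing
    return report
```

A check that was called only after the first side-effecting call counts as missing, because it could not have gated that call.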
4. Pre-flight is hardcoded to skip in certain modes
A “fast mode” flag was added for development velocity, implemented as if fast_mode: skip_preflight(). The flag was read from an environment variable, and someone deployed to staging with FAST_MODE=true without realizing that it disabled pre-flight. The flag was never cleaned up.
How to spot it: Search for skip_preflight, FAST_MODE, bypass_checks, or similar in the codebase. If any flag can disable pre-flight checks, check whether it can be set in non-development environments.
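The search can be scripted so that it runs in CI rather than once by hand. The following is a rough sketch using only the standard library. The flag names come from the text above; the function name `find_skip_flags` and the list of file suffixes are assumptions to adapt to your repository.

```python
import re
from pathlib import Path

# Flag names taken from the text above; extend with your own.
SKIP_FLAG_PATTERN = re.compile(r"skip_preflight|FAST_MODE|bypass_checks", re.IGNORECASE)


def find_skip_flags(
    root: str, suffixes: tuple[str, ...] = (".py", ".yaml", ".yml", ".sh")
) -> list[tuple[str, int, str]]:
    """Return (path, line number, line text) for every line mentioning a skip flag."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if SKIP_FLAG_PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Each hit still needs a manual review: the script only locates candidate flags, it does not tell you whether a flag can be set outside development.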
5. New task types were added but pre-flight was not updated for them
The original pre-flight checked “is AWS credential valid?” and “is environment non-production?” A new task type — database migration — was added to the pipeline. Database tasks need additional pre-flight checks (backup exists, migration is idempotent, rollback plan is ready). Nobody updated pre-flight for the new task type.
How to spot it: List every task type the pipeline handles and the pre-flight checks required for each. Any task type without a complete pre-flight specification is under-checked.
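Once the list exists, the gap analysis is a set difference. The following is a small sketch, assuming the required and currently implemented checks are both written down as sets of check names per task type. The function name `find_underchecked` and the data shapes are hypothetical.

```python
def find_underchecked(
    required: dict[str, set[str]], registered: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Map each task type to the required check names it is missing.

    A task type absent from `registered` is reported as missing everything.
    """
    gaps = {}
    for task_type, needed in required.items():
        missing = needed - registered.get(task_type, set())
        if missing:
            gaps[task_type] = missing
    return gaps


if __name__ == "__main__":
    required = {
        "terraform_provision": {"aws_credentials", "non_production"},
        "database_migration": {"backup_exists", "idempotent", "rollback_plan"},
    }
    registered = {"terraform_provision": {"aws_credentials", "non_production"}}
    print(find_underchecked(required, registered))
```

In this example the database migration type is reported with all three checks missing, which is the situation described above.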
6. Pre-flight runs but the agent doesn’t wait for the result
In an async pipeline, asyncio.create_task(run_preflight(task)) fires the check in the background. The main flow immediately continues to execution without awaiting. The pre-flight check runs and may fail — but by the time the failure is recorded, the execution is already underway.
How to spot it: Check whether pre-flight is awaited (await run_preflight(task)) or fired as a background task. Background pre-flight is effectively no pre-flight.
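The difference is easy to reproduce in isolation. The sketch below uses a stand-in `run_preflight` that always fails after a short delay; `background_pipeline`, `gated_pipeline`, and the `log` list are illustrative names, not part of any pipeline described here.

```python
import asyncio


class PreflightError(Exception):
    """Stand-in for the pipeline's pre-flight failure exception."""


async def run_preflight(task: dict) -> None:
    await asyncio.sleep(0.01)  # simulate a slow validation call
    raise PreflightError("quota exhausted")


async def background_pipeline(task: dict, log: list[str]) -> None:
    # Anti-pattern: the check is scheduled but nothing waits for it.
    pending = asyncio.create_task(run_preflight(task))
    log.append("executed")  # runs before the check has finished
    try:
        await pending  # the failure surfaces only after execution
    except PreflightError:
        log.append("preflight failed (too late)")


async def gated_pipeline(task: dict, log: list[str]) -> None:
    # Gate: execution is unreachable unless the awaited check returns.
    await run_preflight(task)
    log.append("executed")
```

Running `background_pipeline` records "executed" before the failure is seen, while `gated_pipeline` raises `PreflightError` and never reaches execution.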
Shortest path to fix
Step 1: Make pre-flight a blocking gate, not an advisory step
import logging

logger = logging.getLogger(__name__)


class PreflightError(Exception):
    """Raised when a required pre-flight check fails. Execution must stop."""


def require_preflight(task: dict) -> None:
    """
    Must be called before any execution. Raises PreflightError on any failure.
    Returns normally only when every check passed. Never swallows exceptions.
    """
    checks = get_required_checks(task["type"])
    results = []
    for check in checks:
        try:
            result = check.run(task, timeout=30)
        except TimeoutError as e:
            # A check that could not run has not passed: block, never skip.
            raise PreflightError(
                f"Pre-flight check '{check.name}' timed out; execution blocked. "
                "Fix the check or resolve the underlying connectivity issue."
            ) from e
        except Exception as e:
            raise PreflightError(
                f"Pre-flight check '{check.name}' raised an error: {e}"
            ) from e
        if not result.passed:
            raise PreflightError(
                f"Pre-flight check '{check.name}' failed: {result.reason}"
            )
        results.append(result)
    logger.info("All %d pre-flight checks passed for task %s", len(results), task["id"])
Step 2: Define required checks per task type in a registry
PREFLIGHT_REGISTRY: dict[str, list[PreflightCheck]] = {
    "terraform_provision": [
        AWSCredentialsCheck(),
        NonProductionEnvironmentCheck(),
        ResourceQuotaCheck(min_remaining={"ec2": 10, "vpc": 2}),
        TerraformValidateCheck(),
    ],
    "database_migration": [
        DatabaseConnectionCheck(),
        BackupExistsCheck(max_age_hours=24),
        MigrationIdempotencyCheck(),
        RollbackPlanCheck(),
    ],
    "code_deploy": [
        GitBranchCheck(allowed_branches=["main", "release/*"]),
        TestSuitePassCheck(),
        SecretScanCheck(),
    ],
}


def get_required_checks(task_type: str) -> list[PreflightCheck]:
    if task_type not in PREFLIGHT_REGISTRY:
        raise ValueError(
            f"No pre-flight checks defined for task type '{task_type}'. "
            "Add an entry to PREFLIGHT_REGISTRY before adding new task types."
        )
    return PREFLIGHT_REGISTRY[task_type]
The ValueError raised for an unregistered task type forces explicit pre-flight registration: a new task type cannot run until its checks are defined.
Step 3: Remove all “skip” flags or restrict them to local-only mode
import os


class ConfigurationError(Exception):
    """Raised when a setting is not allowed in the current environment."""


def require_preflight_with_env_guard(task: dict) -> None:
    if os.environ.get("SKIP_PREFLIGHT") == "true":
        env = os.environ.get("ENVIRONMENT", "unknown")
        if env not in ("local", "test"):
            raise ConfigurationError(
                "SKIP_PREFLIGHT=true is not allowed in non-local environments. "
                f"Current environment: {env}"
            )
        logger.warning("SKIP_PREFLIGHT=true: only allowed in local/test environments")
        return
    require_preflight(task)
Hard-block the skip flag in staging and production.
Step 4: Enforce pre-flight in the graph definition — not just in code
# LangGraph — first node is ALWAYS preflight; no way to bypass
graph.set_entry_point("preflight")
graph.add_edge("preflight", "execute")
# No conditional edge from start to execute; must go through preflight
If using Temporal, make pre-flight its own activity with no retry:
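from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class ProvisionWorkflow:
    @workflow.run
    async def run(self, task: dict) -> None:
        # Pre-flight is the mandatory first activity, with no retry on failure.
        await workflow.execute_activity(
            run_preflight_activity,
            task,
            retry_policy=RetryPolicy(maximum_attempts=1),  # fail fast: don't retry pre-flight errors
            start_to_close_timeout=timedelta(seconds=60),
        )
        # Temporal requires a timeout on every activity; 30 minutes is a placeholder.
        await workflow.execute_activity(
            execute_activity,
            task,
            start_to_close_timeout=timedelta(minutes=30),
        )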
Step 5: Add a CI test for every pre-flight check
from unittest.mock import patch

import pytest


def test_preflight_blocks_on_missing_backup():
    task = {"type": "database_migration", "db": "prod_replica"}
    # Simulate no backup in the last 24h
    with patch("checks.backup_check.get_latest_backup", return_value=None):
        # The error message quotes the check name, so the pattern must too.
        with pytest.raises(PreflightError, match="'BackupExistsCheck' failed"):
            require_preflight(task)


def test_preflight_blocks_on_timeout():
    task = {"type": "terraform_provision", "env": "staging"}
    with patch("checks.aws_check.validate_credentials", side_effect=TimeoutError):
        with pytest.raises(PreflightError, match="timed out"):
            require_preflight(task)
Prevention
- Implement pre-flight as a mandatory blocking function that raises an exception on any failure and never returns a result that can be ignored.
- Register required pre-flight checks per task type; a missing registration for a new task type should fail loudly, not silently pass.
- Never allow SKIP_PREFLIGHT or equivalent in non-local environments; guard with an explicit environment check.
- Make pre-flight the unconditional first node in your workflow graph; remove any path that can reach execution without passing through pre-flight.
- Treat pre-flight timeouts as blocking failures, not pass-throughs — a check you cannot run is a check that has not passed.
- Write CI tests that verify each pre-flight check blocks execution on every failure mode it is designed to catch.
- Review and update pre-flight checks whenever a new task type or new capability is added to the pipeline.
- Log pre-flight results with timestamps and outcomes for every run — auditors need to verify that checks were run, not just that they passed.
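For the audit-logging point, one structured record per check is usually enough. The following is a minimal sketch, assuming JSON lines written through the standard `logging` module. The logger name `preflight.audit`, the field names, and the helpers `audit_record` and `log_check` are illustrative choices rather than an established schema.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("preflight.audit")


def audit_record(task_id: str, check_name: str, passed: bool, reason: str = "") -> str:
    """Build one JSON line describing a single pre-flight check outcome."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task_id": task_id,
        "check": check_name,
        "outcome": "passed" if passed else "failed",
        "reason": reason,
    }
    return json.dumps(record, sort_keys=True)


def log_check(task_id: str, check_name: str, passed: bool, reason: str = "") -> None:
    # Log failures as well as passes so the trail shows every check that ran.
    audit_logger.info(audit_record(task_id, check_name, passed, reason))
```

Calling `log_check` once per check, before raising on failure, leaves a trail that shows which checks ran and not only the final verdict.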
FAQ
Q: How do I avoid making pre-flight so strict that it blocks legitimate work?
A: Design each check to fail only on conditions that genuinely block safe execution, not on conditions that are merely suboptimal. “No backup in 24h” blocks a migration; “backup is 6h old (within 24h limit)” does not. Calibrate thresholds based on real incidents, not theoretical maximums.
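The backup example from this answer can be written as a check with an explicit threshold. This is a sketch only: `CheckResult` and `backup_age_check` are hypothetical names, and the 24-hour default simply mirrors the number used in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class CheckResult:
    passed: bool
    reason: str = ""


def backup_age_check(
    latest_backup: Optional[datetime], now: datetime, max_age_hours: int = 24
) -> CheckResult:
    """Fail only when no backup exists or the newest one exceeds the limit."""
    if latest_backup is None:
        return CheckResult(False, "no backup found")
    age = now - latest_backup
    if age > timedelta(hours=max_age_hours):
        return CheckResult(False, f"latest backup is {age} old, limit is {max_age_hours}h")
    return CheckResult(True)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(backup_age_check(now - timedelta(hours=6), now))   # within the limit
    print(backup_age_check(now - timedelta(hours=30), now))  # blocks the migration
```

Keeping the threshold as a named parameter makes it easy to adjust after an incident review without touching the check logic.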
Q: Should pre-flight checks ever be retried?
A: Checks for external resources (API reachability, credential validity) can retry once with a short delay. Checks for logical preconditions (backup exists, environment is non-production) should not retry: if the condition is not met, retrying won’t change it. Fail fast and require a human to fix the precondition.
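That retry rule can be kept in one place and not scattered across individual checks. The following is a minimal sketch, assuming each check is a zero-argument callable returning True on success. `run_with_policy`, the `external` flag, and the two-second default delay are assumptions for illustration.

```python
import time
from typing import Callable


def run_with_policy(
    check: Callable[[], bool], external: bool, delay_seconds: float = 2.0
) -> bool:
    """Run a check; retry once after a short delay only if it is external."""
    attempts = 2 if external else 1
    for attempt in range(attempts):
        if check():
            return True
        if attempt + 1 < attempts:
            # Only external checks reach this branch; logical checks fail fast.
            time.sleep(delay_seconds)
    return False
```

A False return still has to be turned into a blocking failure by the caller; the helper only decides how many attempts a check gets.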
Q: How do I share pre-flight check results with the main agent so it doesn’t re-derive them?
A: Include the pre-flight results in the handoff context passed to the execution agent. The execution agent reads “backup verified at 2026-05-25T14:30:00Z, quota remaining: 15 EC2 instances” from the context and can reference it in its reasoning without re-checking.
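One possible shape for that handoff context is sketched below. `VerifiedFact`, `build_handoff_context`, and the dictionary layout are hypothetical; the point is only that each finding travels with the timestamp at which it was verified.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VerifiedFact:
    check: str
    detail: str
    verified_at: str  # ISO 8601 timestamp recorded when the check ran


def build_handoff_context(task: dict, facts: list[VerifiedFact]) -> dict:
    """Attach pre-flight findings to the task passed to the execution agent."""
    summary = "; ".join(f"{f.detail} (verified at {f.verified_at})" for f in facts)
    return {
        "task": task,
        "preflight": {
            "facts": [asdict(f) for f in facts],  # machine-readable form
            "summary": summary,  # text form for the agent's prompt
        },
    }
```

The structured list serves downstream code and audits, while the summary string is what would be placed in the execution agent's prompt.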
Q: What if a pre-flight check requires LLM reasoning (e.g., “is this migration safe?”)?
A: LLM-based pre-flight checks are acceptable for qualitative assessments, but pair them with a deterministic safety net. For “Is this migration safe?”, combine the LLM review with a --dry-run execution and a schema diff comparison. Never rely solely on LLM judgment for a gate that blocks production work.
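The pairing can be expressed as a gate in which the LLM verdict is one input among several and cannot override a deterministic failure. In this sketch the three callables stand in for the LLM review, the dry run, and the schema diff; `migration_gate` and its signature are assumptions, not an existing API.

```python
from typing import Callable


def migration_gate(
    llm_review: Callable[[str], bool],
    dry_run: Callable[[str], bool],
    schema_diff_ok: Callable[[str], bool],
    migration_sql: str,
) -> tuple[bool, str]:
    """Pass only if the deterministic checks and the LLM review all approve."""
    # Deterministic checks run first: an LLM approval can never override
    # their failure, and the LLM call is skipped when they already reject.
    if not dry_run(migration_sql):
        return False, "dry run failed"
    if not schema_diff_ok(migration_sql):
        return False, "schema diff rejected"
    if not llm_review(migration_sql):
        return False, "LLM review rejected"
    return True, "all checks approved"
```

The LLM can only add a rejection here, never remove one, which keeps the gate's worst case bounded by the deterministic checks.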