Agent Skipped a Pre-flight Check It Was Supposed to Run

Your agent proceeded without running required pre-flight checks, causing avoidable failures downstream. Here's how to enforce mandatory checks before execution.

You deploy a LangGraph infrastructure-provisioning agent that is supposed to verify AWS credentials, check that the target environment is not production, and confirm resource quotas before making any Terraform calls. On a particularly long task, the orchestrator’s pre-flight step times out and the pipeline’s fallback logic marks it as “skipped (non-fatal)” and continues. The Terraform agent proceeds, exhausts the region’s EC2 quota, and leaves a half-provisioned environment that takes hours to clean up. Pre-flight checks are only valuable if they are unconditionally enforced — any code path that allows them to be skipped will eventually be triggered.

Common causes

1. Pre-flight result is treated as optional in the orchestration logic

The orchestration code does preflight_result = run_preflight(task) but only checks the result with if preflight_result.passed: log_success(). It never blocks execution when preflight_result.passed is False or when preflight_result is None (timeout or error). Execution always continues.

How to spot it: Find the code that consumes the pre-flight result. If execution proceeds regardless of the result — including on None, error, or False — the check is advisory, not mandatory.
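
To make the contrast concrete, here is a minimal sketch of a consumer that lets execution continue only on an explicit pass. The PreflightResult shape and the gate function are assumptions for illustration, not part of any library; the exception name mirrors the one used in the fix section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreflightResult:
    """Hypothetical result shape; only `passed` and `reason` are used here."""
    passed: bool
    reason: str = ""


class PreflightError(Exception):
    """Raised whenever the gate does not see an explicit pass."""


def gate(result: Optional[PreflightResult]) -> None:
    """Allow execution only on an explicit pass; None and False both block."""
    if result is None:
        raise PreflightError("pre-flight produced no result (timeout or error)")
    if result.passed is not True:
        raise PreflightError(f"pre-flight failed: {result.reason or 'no reason given'}")
```

If your orchestration code cannot be reduced to something with this shape, where every non-pass outcome raises, the check is still advisory.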

2. Pre-flight check times out and the timeout is caught as “ok to skip”

The pre-flight check calls an external validation service that is slow. After 30 seconds, the check times out. The exception handler catches TimeoutError and sets preflight_result = PreflightResult(passed=True, skipped=True) to “be safe and not block the user.” This is backwards — a timeout should block, not allow.

How to spot it: Find the except TimeoutError or equivalent handler for the pre-flight step. If it sets passed=True or falls through to execution, it converts a timeout into an automatic approval.

3. Pre-flight is defined in the prompt but not enforced in code

The system prompt says “Before doing anything, always check X, Y, and Z.” The agent sometimes skips step Z because it decides Z is “not relevant” to this particular task. Prompt-based pre-flight is unreliable — the model can always rationalize skipping a step.

How to spot it: Compare the pre-flight steps listed in the prompt against the tool calls actually made at the start of each run. If Z appears in the prompt but is absent from tool call traces on some runs, it is prompt-only and not code-enforced.
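
The comparison can be automated once you can export the ordered tool calls for each run. The sketch below assumes each trace is a plain list of tool names; the tool names (check_x, check_y, check_z, terraform_apply) are hypothetical placeholders for your own.

```python
REQUIRED_PREFLIGHT_TOOLS = ["check_x", "check_y", "check_z"]  # hypothetical tool names


def missing_preflight_calls(trace: list[str], first_action: str) -> list[str]:
    """Return required pre-flight tools that were not called before `first_action`.

    `trace` is the ordered list of tool names the agent called in one run.
    If `first_action` never appears, the whole trace is inspected.
    """
    cutoff = trace.index(first_action) if first_action in trace else len(trace)
    called_before = set(trace[:cutoff])
    return [tool for tool in REQUIRED_PREFLIGHT_TOOLS if tool not in called_before]


runs = {
    "run-1": ["check_x", "check_y", "check_z", "terraform_apply"],
    "run-2": ["check_x", "check_y", "terraform_apply", "check_z"],  # Z ran too late
}
for run_id, trace in runs.items():
    print(run_id, missing_preflight_calls(trace, "terraform_apply"))
```

A check that runs after the first side-effecting call is counted as missing, because it could not have gated that call.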

4. Pre-flight is hardcoded to skip in certain modes

A “fast mode” flag was added for development velocity: if fast_mode: skip_preflight(). The flag is read from an environment variable, and someone deployed to staging with FAST_MODE=true without realizing that it disabled pre-flight. The flag was never cleaned up.

How to spot it: Search for skip_preflight, FAST_MODE, bypass_checks, or similar in the codebase. If any flag can disable pre-flight checks, check whether it can be set in non-development environments.
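
If you prefer a scripted search over a one-off grep, a small scanner along these lines can run in CI. It is a rough sketch: the pattern covers only the three names mentioned above, and the set of file types scanned is an assumption you should adapt to your repository.

```python
import re
from pathlib import Path

# Names taken from the list above; extend with your project's own flags.
SKIP_PATTERN = re.compile(r"skip_preflight|FAST_MODE|bypass_checks", re.IGNORECASE)


def find_skip_flags(text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that mention a skip-style flag."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if SKIP_PATTERN.search(line)
    ]


def scan_tree(root: str, suffixes: tuple[str, ...] = (".py", ".yaml", ".yml", ".toml")) -> dict:
    """Scan source, config, and .env* files under `root`; map each path to its hits."""
    hits = {}
    for path in Path(root).rglob("*"):
        wanted = path.suffix in suffixes or path.name.startswith(".env")
        if path.is_file() and wanted:
            found = find_skip_flags(path.read_text(errors="ignore"))
            if found:
                hits[str(path)] = found
    return hits
```

Every hit deserves a manual look: the question is not whether the flag exists, but whether anything stops it from being set outside development.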

5. New task types were added but pre-flight was not updated for them

The original pre-flight checked “is AWS credential valid?” and “is environment non-production?” A new task type — database migration — was added to the pipeline. Database tasks need additional pre-flight checks (backup exists, migration is idempotent, rollback plan is ready). Nobody updated pre-flight for the new task type.

How to spot it: List every task type the pipeline handles and the pre-flight checks required for each. Any task type without a complete pre-flight specification is under-checked.
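
The listing exercise can be turned into a small audit. The sketch below assumes you can express the required and the registered checks as sets of names per task type; the check names used here are illustrative.

```python
def audit_preflight_coverage(
    handled_task_types: set[str],
    required_checks: dict[str, set[str]],
    registered_checks: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Map each under-checked task type to what it is missing.

    An empty dict means every handled task type has all of its required checks.
    """
    gaps = {}
    for task_type in sorted(handled_task_types):
        required = required_checks.get(task_type)
        if required is None:
            # Nobody has written down what this task type needs
            gaps[task_type] = {"<no pre-flight specification>"}
            continue
        missing = required - registered_checks.get(task_type, set())
        if missing:
            gaps[task_type] = missing
    return gaps


gaps = audit_preflight_coverage(
    handled_task_types={"terraform_provision", "database_migration"},
    required_checks={"terraform_provision": {"aws_credentials", "non_production"}},
    registered_checks={"terraform_provision": {"aws_credentials"}},
)
print(gaps)
```

Here database_migration is flagged because it has no specification at all, and terraform_provision because one required check was never registered.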

6. Pre-flight runs but the agent doesn’t wait for the result

In an async pipeline, asyncio.create_task(run_preflight(task)) fires the check in the background. The main flow immediately continues to execution without awaiting. The pre-flight check runs and may fail — but by the time the failure is recorded, the execution is already underway.

How to spot it: Check whether pre-flight is awaited (await run_preflight(task)) or fired as a background task. Background pre-flight is effectively no pre-flight.
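
The ordering problem is easy to reproduce with plain asyncio. In this sketch run_preflight and execute are stand-ins that only record when they ran; the point is the order of events, not the checks themselves.

```python
import asyncio

events: list[str] = []


async def run_preflight(task: dict) -> bool:
    await asyncio.sleep(0.01)  # stands in for a slow validation call
    events.append("preflight finished")
    return False  # the check fails


async def execute(task: dict) -> None:
    events.append("execution started")


async def fire_and_forget(task: dict) -> None:
    background = asyncio.create_task(run_preflight(task))  # scheduled, not awaited
    await execute(task)  # runs before the check has produced a result
    await background  # awaited here only so the demo can record the late result


async def awaited_gate(task: dict) -> None:
    if not await run_preflight(task):
        events.append("execution blocked")
        return
    await execute(task)


def ordering(flow) -> list[str]:
    """Run one flow from a clean slate and return the recorded event order."""
    events.clear()
    asyncio.run(flow({"type": "demo"}))
    return list(events)
```

Calling ordering(fire_and_forget) shows execution starting before the failed check is recorded, while ordering(awaited_gate) shows the check finishing first and execution never starting.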

Shortest path to fix

Step 1: Make pre-flight a blocking gate, not an advisory step

import logging

logger = logging.getLogger(__name__)

class PreflightError(Exception):
    """Raised when a required pre-flight check fails. Execution must stop."""

def require_preflight(task: dict) -> None:
    """
    Must be called before any execution. Raises PreflightError on any failure
    (failed check, timeout, or unexpected error) and returns normally only
    when every required check has passed. Never swallows exceptions.
    """
    checks = get_required_checks(task["type"])
    results = []
    for check in checks:
        try:
            result = check.run(task, timeout=30)
        except TimeoutError as e:
            # A check that could not finish has not passed
            raise PreflightError(
                f"Pre-flight check '{check.name}' timed out; execution blocked. "
                "Fix the check or resolve the underlying connectivity issue."
            ) from e
        except Exception as e:
            raise PreflightError(
                f"Pre-flight check '{check.name}' raised an error: {e}"
            ) from e
        if not result.passed:
            raise PreflightError(
                f"Pre-flight check '{check.name}' failed: {result.reason}"
            )
        results.append(result)
    # The id is only used for logging, so a task without one must not crash the gate
    logger.info("All %d pre-flight checks passed for task %s", len(results), task.get("id", "<no id>"))
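
The gate above relies on check.name, check.run(task, timeout=...), result.passed, and result.reason, none of which are defined in this article. One possible shape for them is sketched below. The CheckResult dataclass, the PreflightCheck protocol, and this particular implementation of NonProductionEnvironmentCheck (including the "env" task field) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one check; `reason` explains a failure."""
    passed: bool
    reason: str = ""


class PreflightCheck(Protocol):
    """Anything with a name and a run() method can act as a check."""
    name: str

    def run(self, task: dict, timeout: float) -> CheckResult: ...


class NonProductionEnvironmentCheck:
    """Passes only when the task names a non-production environment."""
    name = "NonProductionEnvironmentCheck"

    def run(self, task: dict, timeout: float) -> CheckResult:
        env = task.get("env")
        if env is None:
            # An undeclared environment is treated as a failure, not a pass
            return CheckResult(False, "task does not declare an environment")
        if env.lower() in {"prod", "production"}:
            return CheckResult(False, f"target environment is '{env}'")
        return CheckResult(True)
```

Note that a missing environment fails the check. Defaulting to "probably not production" would reintroduce the skip path this gate exists to remove.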

Step 2: Define required checks per task type in a registry

PREFLIGHT_REGISTRY: dict[str, list[PreflightCheck]] = {
    "terraform_provision": [
        AWSCredentialsCheck(),
        NonProductionEnvironmentCheck(),
        ResourceQuotaCheck(min_remaining={"ec2": 10, "vpc": 2}),
        TerraformValidateCheck(),
    ],
    "database_migration": [
        DatabaseConnectionCheck(),
        BackupExistsCheck(max_age_hours=24),
        MigrationIdempotencyCheck(),
        RollbackPlanCheck(),
    ],
    "code_deploy": [
        GitBranchCheck(allowed_branches=["main", "release/*"]),
        TestSuitePassCheck(),
        SecretScanCheck(),
    ],
}

def get_required_checks(task_type: str) -> list[PreflightCheck]:
    if task_type not in PREFLIGHT_REGISTRY:
        raise ValueError(
            f"No pre-flight checks defined for task type '{task_type}'. "
            "Add an entry to PREFLIGHT_REGISTRY before adding new task types."
        )
    return PREFLIGHT_REGISTRY[task_type]

Raising a ValueError for an unregistered task type forces explicit pre-flight registration: a new task type cannot run until its checks are defined.

Step 3: Remove all “skip” flags or restrict them to local-only mode

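import logging
import os

logger = logging.getLogger(__name__)

class ConfigurationError(Exception):
    """Raised when a setting is not allowed in the current environment."""

def require_preflight_with_env_guard(task: dict) -> None:
    if os.environ.get("SKIP_PREFLIGHT") == "true":
        # An unset ENVIRONMENT becomes "unknown", which is not allow-listed, so the skip is refused
        env = os.environ.get("ENVIRONMENT", "unknown")
        if env not in ("local", "test"):
            raise ConfigurationError(
                "SKIP_PREFLIGHT=true is not allowed in non-local environments. "
                f"Current environment: {env}"
            )
        logger.warning("SKIP_PREFLIGHT=true is set; only allowed in local/test environments")
        return
    require_preflight(task)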

Hard-block the skip flag in staging and production.

Step 4: Enforce pre-flight in the graph definition, not just in application code

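# LangGraph: `graph` is assumed to be a StateGraph with "preflight" and "execute" nodes already added,
# and the "preflight" node is assumed to call require_preflight() so that a failure aborts the run.
graph.set_entry_point("preflight")  # first node is ALWAYS preflight
graph.add_edge("preflight", "execute")
# No edge from the start to "execute"; every run must go through preflight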

If using Temporal, make pre-flight its own activity with no retry:

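from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# run_preflight_activity and execute_activity are your own activity functions, defined elsewhere

@workflow.defn
class ProvisionWorkflow:
    @workflow.run
    async def run(self, task: dict) -> None:
        # Pre-flight is the mandatory first activity, with no retry on failure
        await workflow.execute_activity(
            run_preflight_activity,
            task,
            retry_policy=RetryPolicy(maximum_attempts=1),  # fail fast; don't retry pre-flight errors
            start_to_close_timeout=timedelta(seconds=60),
        )
        # Every activity call needs a start-to-close or schedule-to-close timeout;
        # 30 minutes is a placeholder, size it to your provisioning runs
        await workflow.execute_activity(
            execute_activity,
            task,
            start_to_close_timeout=timedelta(minutes=30),
        )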

Step 5: Add a CI test for every pre-flight check

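from unittest.mock import patch

import pytest

# Assumes PreflightError and require_preflight are imported from your pre-flight module,
# and that each check's name attribute is its class name.

def test_preflight_blocks_on_missing_backup():
    task = {"id": "t-1", "type": "database_migration", "db": "prod_replica"}
    # Simulate no backup in the last 24h
    with patch("checks.backup_check.get_latest_backup", return_value=None):
        # The error message quotes the check name: "Pre-flight check 'BackupExistsCheck' failed: ..."
        with pytest.raises(PreflightError, match="'BackupExistsCheck' failed"):
            require_preflight(task)

def test_preflight_blocks_on_timeout():
    task = {"id": "t-2", "type": "terraform_provision", "env": "staging"}
    with patch("checks.aws_check.validate_credentials", side_effect=TimeoutError):
        with pytest.raises(PreflightError, match="timed out"):
            require_preflight(task)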

Prevention

  • Implement pre-flight as a mandatory blocking function that raises an exception on any failure — never returns a result that can be ignored.
  • Register required pre-flight checks per task type; a missing registration for a new task type should fail loudly, not silently pass.
  • Never allow SKIP_PREFLIGHT or equivalent in non-local environments; guard with an explicit environment check.
  • Make pre-flight the unconditional first node in your workflow graph; remove any path that can reach execution without passing through pre-flight.
  • Treat pre-flight timeouts as blocking failures, not pass-throughs — a check you cannot run is a check that has not passed.
  • Write CI tests that verify each pre-flight check blocks execution on every failure mode it is designed to catch.
  • Review and update pre-flight checks whenever a new task type or new capability is added to the pipeline.
  • Log pre-flight results with timestamps and outcomes for every run — auditors need to verify that checks were run, not just that they passed.
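
For the audit-logging point, one option is to emit a single JSON line per check run, including runs that timed out or errored. This is a sketch under assumptions: the field names, the outcome vocabulary, and the "preflight.audit" logger name are illustrative choices, not a standard.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("preflight.audit")

ALLOWED_OUTCOMES = {"passed", "failed", "timeout", "error"}


def audit_record(task_id: str, check_name: str, outcome: str, detail: str = "") -> str:
    """Build one JSON line describing a single check run."""
    if outcome not in ALLOWED_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    return json.dumps(
        {
            "ts": datetime.now(timezone.utc).isoformat(),  # UTC timestamp of the record
            "task_id": task_id,
            "check": check_name,
            "outcome": outcome,
            "detail": detail,
        },
        sort_keys=True,
    )


def log_check(task_id: str, check_name: str, outcome: str, detail: str = "") -> None:
    """Write the record to the audit logger at INFO level."""
    audit_logger.info(audit_record(task_id, check_name, outcome, detail))
```

Recording timeouts and errors as their own outcomes lets an auditor distinguish "the check ran and passed" from "the check never completed".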

FAQ

Q: How do I avoid making pre-flight so strict that it blocks legitimate work? A: Design each check to fail only on conditions that genuinely block safe execution, not on conditions that are merely suboptimal. “No backup in 24h” blocks a migration; “backup is 6h old (within 24h limit)” does not. Calibrate thresholds based on real incidents, not theoretical maximums.
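
As an illustration of a threshold that blocks only on the unsafe condition, the backup example can be written as a pure function. The function name and signature are hypothetical; the 24-hour default comes from the example above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_fresh(latest_backup: Optional[datetime], now: datetime, max_age_hours: int = 24) -> bool:
    """True when a backup exists and is no older than `max_age_hours`."""
    if latest_backup is None:
        return False  # no backup at all always blocks
    return now - latest_backup <= timedelta(hours=max_age_hours)


now = datetime(2026, 5, 25, 14, 30, tzinfo=timezone.utc)
print(backup_is_fresh(now - timedelta(hours=6), now))   # within the limit
print(backup_is_fresh(now - timedelta(hours=30), now))  # too old
print(backup_is_fresh(None, now))                       # missing
```

Keeping the threshold as a parameter makes it easy to adjust after an incident review without touching the gate logic.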

Q: Should pre-flight checks ever be retried? A: Checks for external resources (API reachability, credential validity) can retry once with a short delay. Checks for logical preconditions (backup exists, environment is non-production) should not retry — if the condition is not met, retrying won’t change it. Fail fast and require a human to fix the precondition.
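
One way to encode that split is a per-check flag that decides whether a single retry is allowed. The sketch below is an assumption-laden example: it treats only ConnectionError as transient, and the run_with_policy helper is hypothetical.

```python
import time
from typing import Callable


def run_with_policy(check: Callable[[], bool], retryable: bool, delay_seconds: float = 1.0) -> bool:
    """Run `check`; allow one retry after a short delay only if `retryable`.

    A ConnectionError on the final attempt propagates, so the caller's gate
    still sees the failure instead of a silent pass.
    """
    attempts = 2 if retryable else 1
    for attempt in range(1, attempts + 1):
        is_last = attempt == attempts
        try:
            if check():
                return True
        except ConnectionError:
            if is_last:
                raise
        if not is_last:
            time.sleep(delay_seconds)
    return False
```

A credential-validity check would be called with retryable=True, while a "backup exists" check would use retryable=False and fail on the first negative answer.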

Q: How do I share pre-flight check results with the main agent so it doesn’t re-derive them? A: Include the pre-flight results in the handoff context passed to the execution agent. The execution agent reads “backup verified at 2026-05-25T14:30:00Z, quota remaining: 15 EC2 instances” from the context and can reference it in its reasoning without re-checking.
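
A handoff context of this kind could look like the sketch below. The VerifiedFact structure, the "preflight_verified" key, and the rendered wording are all assumptions; adapt them to whatever context format your execution agent already receives.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VerifiedFact:
    check: str
    verified_at: str  # ISO 8601 timestamp recorded when the check ran
    detail: str


def build_handoff_context(task: dict, facts: list[VerifiedFact]) -> dict:
    """Bundle the task with pre-flight facts so the execution agent can cite them."""
    return {"task": task, "preflight_verified": [asdict(fact) for fact in facts]}


def render_for_prompt(context: dict) -> str:
    """Render the verified facts as plain text for inclusion in the agent's context."""
    lines = [
        f"- {fact['check']}: {fact['detail']} (verified at {fact['verified_at']})"
        for fact in context["preflight_verified"]
    ]
    return "Pre-flight results (already verified):\n" + "\n".join(lines)


context = build_handoff_context(
    {"type": "database_migration"},
    [VerifiedFact("BackupExistsCheck", "2026-05-25T14:30:00Z", "backup verified")],
)
print(render_for_prompt(context))
```

Including the timestamp lets the execution agent, and anyone reading the trace later, judge how stale a verified fact is.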

Q: What if a pre-flight check requires LLM reasoning (e.g., “is this migration safe?”)? A: LLM-based pre-flight checks are acceptable for qualitative assessments, but pair them with a deterministic safety net. “Is this migration safe?” → LLM review + --dry-run execution + schema diff comparison. Never rely solely on LLM judgment for a gate that blocks production work.
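
The pairing can be expressed so that the LLM verdict can only veto, never approve on its own. This is a minimal sketch; the function name, the boolean inputs, and the "safe" verdict string are hypothetical.

```python
def migration_gate(dry_run_ok: bool, schema_diff_ok: bool, llm_verdict: str) -> bool:
    """Pass only when the deterministic checks pass AND the LLM review says 'safe'.

    The LLM verdict cannot override a failed dry run or schema diff.
    """
    deterministic_ok = dry_run_ok and schema_diff_ok
    llm_ok = llm_verdict.strip().lower() == "safe"
    return deterministic_ok and llm_ok
```

Any verdict other than an exact "safe", including an empty or malformed response, blocks the migration.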

Tags: #AI coding #Agents #Troubleshooting