Agent Skipped a Pre-flight Check It Was Supposed to Run

Your agent ran without its required pre-flight checks and failed hours later on a problem it could have caught in seconds. Here is how to make those checks a blocking gate that no code path can skip.

Published: May 25, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team 🌐 查看中文版本

You deploy a LangGraph infrastructure-provisioning agent that is supposed to verify AWS credentials, check that the target environment is not production, and confirm resource quotas before making any Terraform calls. On a particularly long task, the orchestrator’s pre-flight step times out and the pipeline’s fallback logic marks it as “skipped (non-fatal)” and continues. The Terraform agent proceeds, exhausts the region’s EC2 quota, and leaves a half-provisioned environment that takes hours to clean up. Pre-flight checks are only valuable if they are unconditionally enforced. Any code path that allows them to be skipped will eventually be triggered.

Fastest fix: make pre-flight a function that raises on any failure (including timeout and “no checks defined”) and call it unconditionally on the very first line of your task entry point and your retry path. If pre-flight is structurally able to return a value the caller can ignore, it is advisory, not mandatory. Everything below hardens that one idea: per-task registries, env-flag guards, graph-level enforcement, and CI tests.

Which bucket are you in?

Trace one real run that skipped its check and match the symptom to the cause. Most “skipped pre-flight” incidents are one of these six.

Symptom in the trace	Likely cause	Jump to fix
Pre-flight ran, logged a failure, execution still continued	Result is consumed as advisory (`if passed: log()`)	Step 1
Pre-flight logged a timeout, then run proceeded	`except TimeoutError` sets `passed=True`	Step 1
Check appears in the system prompt but not in tool-call traces on some runs	Prompt-only, not code-enforced	Step 4
Skipped only in staging / on certain machines	A `FAST_MODE` / `SKIP_PREFLIGHT` flag is set there	Step 3
New task type runs with no checks at all	Registry never updated for the new type	Step 2
Pre-flight “ran” but failure recorded after execution started	Fired with `create_task` and never awaited	Step 1

Common causes

1. Pre-flight result is treated as optional in the orchestration logic

The orchestration code does preflight_result = run_preflight(task) but only checks the result with if preflight_result.passed: log_success(). It never blocks execution when preflight_result.passed is False or when preflight_result is None (timeout or error). Execution always continues.

How to spot it: Find the code that consumes the pre-flight result. If execution proceeds regardless of the result (including on None, error, or False), the check is advisory, not mandatory.

2. Pre-flight check times out and the timeout is caught as “ok to skip”

The pre-flight check calls an external validation service that is slow. After 30 seconds, the check times out. The exception handler catches TimeoutError and sets preflight_result = PreflightResult(passed=True, skipped=True) to “be safe and not block the user.” This is backwards: a timeout should block, not allow. A check you could not run is a check that has not passed.

How to spot it: Find the except TimeoutError or equivalent handler for the pre-flight step. If it sets passed=True or falls through to execution, it converts a timeout into an automatic approval.

3. Pre-flight is defined in the prompt but not enforced in code

The system prompt says “Before doing anything, always check X, Y, and Z.” The agent sometimes skips step Z because it decides Z is “not relevant” to this particular task. Prompt-based pre-flight is unreliable: the model can always rationalize skipping a step.

How to spot it: Compare the pre-flight steps listed in the prompt against the tool calls actually made at the start of each run. If Z appears in the prompt but is absent from tool call traces on some runs, it is prompt-only and not code-enforced.

4. Pre-flight is hardcoded to skip in certain modes

A “fast mode” flag was added for development velocity. if fast_mode: skip_preflight(). The fast mode flag was set via an environment variable and someone deployed to staging with FAST_MODE=true without realizing pre-flight was disabled. The flag was never cleaned up.

How to spot it: Search for skip_preflight, FAST_MODE, bypass_checks, or similar in the codebase. If any flag can disable pre-flight checks, check whether it can be set in non-development environments.

5. New task types were added but pre-flight was not updated for them

The original pre-flight checked “is AWS credential valid?” and “is environment non-production?” A new task type (database migration) was added to the pipeline. Database tasks need additional pre-flight checks (backup exists, migration is idempotent, rollback plan is ready). Nobody updated pre-flight for the new task type.

How to spot it: List every task type the pipeline handles and the pre-flight checks required for each. Any task type without a complete pre-flight specification is under-checked.

6. Pre-flight runs but the agent doesn’t wait for the result

In an async pipeline, asyncio.create_task(run_preflight(task)) fires the check in the background. The main flow immediately continues to execution without awaiting. The pre-flight check runs and may fail, but by the time the failure is recorded, the execution is already underway.

How to spot it: Check whether pre-flight is awaited (await run_preflight(task)) or fired as a background task. Background pre-flight is effectively no pre-flight.

Shortest path to fix

Step 1: Make pre-flight a blocking gate, not an advisory step

class PreflightError(Exception):
    """Raised when a required pre-flight check fails. Execution must stop."""

def require_preflight(task: dict) -> None:
    """
    Must be called before any execution. Raises PreflightError on any failure.
    Never returns None. Never swallows exceptions.
    """
    checks = get_required_checks(task["type"])
    results = []
    for check in checks:
        try:
            result = check.run(task, timeout=30)
        except TimeoutError:
            raise PreflightError(
                f"Pre-flight check '{check.name}' timed out — execution blocked. "
                "Fix the check or resolve the underlying connectivity issue."
            )
        except Exception as e:
            raise PreflightError(
                f"Pre-flight check '{check.name}' raised an error: {e}"
            ) from e
        if not result.passed:
            raise PreflightError(
                f"Pre-flight check '{check.name}' failed: {result.reason}"
            )
        results.append(result)
    logger.info("All %d pre-flight checks passed for task %s", len(results), task["id"])

Step 2: Define required checks per task type in a registry

PREFLIGHT_REGISTRY: dict[str, list[PreflightCheck]] = {
    "terraform_provision": [
        AWSCredentialsCheck(),
        NonProductionEnvironmentCheck(),
        ResourceQuotaCheck(min_remaining={"ec2": 10, "vpc": 2}),
        TerraformValidateCheck(),
    ],
    "database_migration": [
        DatabaseConnectionCheck(),
        BackupExistsCheck(max_age_hours=24),
        MigrationIdempotencyCheck(),
        RollbackPlanCheck(),
    ],
    "code_deploy": [
        GitBranchCheck(allowed_branches=["main", "release/*"]),
        TestSuitePassCheck(),
        SecretScanCheck(),
    ],
}

def get_required_checks(task_type: str) -> list[PreflightCheck]:
    if task_type not in PREFLIGHT_REGISTRY:
        raise ValueError(
            f"No pre-flight checks defined for task type '{task_type}'. "
            "Add an entry to PREFLIGHT_REGISTRY before adding new task types."
        )
    return PREFLIGHT_REGISTRY[task_type]

A KeyError on a new task type forces explicit pre-flight registration, so you cannot forget to add checks.

Step 3: Remove all “skip” flags or restrict them to local-only mode

import os

def require_preflight_with_env_guard(task: dict) -> None:
    if os.environ.get("SKIP_PREFLIGHT") == "true":
        env = os.environ.get("ENVIRONMENT", "unknown")
        if env not in ("local", "test"):
            raise ConfigurationError(
                "SKIP_PREFLIGHT=true is not allowed in non-local environments. "
                f"Current environment: {env}"
            )
        logger.warning("SKIP_PREFLIGHT=true — only allowed in local/test environments")
        return
    require_preflight(task)

Hard-block the skip flag in staging and production.

Step 4: Enforce pre-flight in the graph definition, not just in code

In LangGraph v1.0 (stable since October 2025), set_entry_point() is deprecated in favor of an explicit edge from the START sentinel. Wire the entry point so preflight is the only node reachable from the start, and route to END if it fails:

from langgraph.graph import StateGraph, START, END

graph = StateGraph(State)
graph.add_node("preflight", preflight_node)
graph.add_node("execute", execute_node)

# preflight is the ONLY node reachable from START
graph.add_edge(START, "preflight")
# branch on the preflight outcome; never let START reach execute directly
graph.add_conditional_edges(
    "preflight",
    lambda s: "execute" if s["preflight_passed"] else "abort",
    {"execute": "execute", "abort": END},
)

Have preflight_node set preflight_passed=False (or raise) on any failure or timeout. Because there is no edge from START to execute, no path reaches execution without passing through preflight.

If using Temporal, make pre-flight its own activity that fails fast. As of June 2026, maximum_attempts=1 gives a single attempt with no retries, but the cleaner pattern is to raise a non-retryable ApplicationError from inside the activity so the orchestrator stops immediately instead of waiting out a retry budget on an unrecoverable precondition:

from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ApplicationError

@activity.defn
async def run_preflight_activity(task: dict) -> None:
    failures = run_required_checks(task)  # your check runner; raises or collects failures
    if failures:
        # non_retryable: a missing backup or expired token won't fix itself on retry
        raise ApplicationError(
            f"Pre-flight failed: {failures}", type="PreflightFailed", non_retryable=True
        )

@workflow.defn
class ProvisionWorkflow:
    @workflow.run
    async def run(self, task: dict):
        # mandatory first activity; bounded by start_to_close_timeout
        await workflow.execute_activity(
            run_preflight_activity,
            task,
            retry_policy=RetryPolicy(maximum_attempts=1),
            start_to_close_timeout=timedelta(seconds=60),
        )
        await workflow.execute_activity(
            execute_activity, task, start_to_close_timeout=timedelta(minutes=30)
        )

Step 5: Add a CI test for every pre-flight check

def test_preflight_blocks_on_missing_backup():
    task = {"type": "database_migration", "db": "prod_replica"}
    # Simulate no backup in the last 24h
    with patch("checks.backup_check.get_latest_backup", return_value=None):
        with pytest.raises(PreflightError, match="BackupExistsCheck failed"):
            require_preflight(task)

def test_preflight_blocks_on_timeout():
    task = {"type": "terraform_provision", "env": "staging"}
    with patch("checks.aws_check.validate_credentials", side_effect=TimeoutError):
        with pytest.raises(PreflightError, match="timed out"):
            require_preflight(task)

How to confirm it’s fixed

Run these three checks before you trust the gate:

Negative test passes. Force one real failure mode (revoke the token, delete the backup, point at a production env) and confirm the run aborts before any side-effecting tool call. The execute step’s logs should never appear.
No bypass path survives. Grep for skip_preflight, FAST_MODE, SKIP_PREFLIGHT, bypass, and if .*preflight across the repo. Every hit should either be removed or guarded by an environment check that hard-blocks non-local use.
Retry path is covered too. Trigger a retry of a previously-passing run with a now-broken precondition. The retry must re-run pre-flight and abort, not resume on stale results.

In your run logs you should now see a PreflightError (or non-retryable PreflightFailed) with the failing check’s name and reason on every blocked run, and a single “all N checks passed” line on every successful one. If a run reaches execution with neither line, a code path is still skipping the gate.

Prevention

Implement pre-flight as a mandatory blocking function that raises an exception on any failure; it should never return a result that can be ignored.
Register required pre-flight checks per task type; a missing registration for a new task type should fail loudly, not silently pass.
Never allow SKIP_PREFLIGHT or equivalent in non-local environments; guard with an explicit environment check.
Make pre-flight the unconditional first node in your workflow graph; remove any path that can reach execution without passing through pre-flight.
Treat pre-flight timeouts as blocking failures, not pass-throughs. A check you cannot run is a check that has not passed.
Write CI tests that verify each pre-flight check blocks execution on every failure mode it is designed to catch.
Review and update pre-flight checks whenever a new task type or new capability is added to the pipeline.
Log pre-flight results with timestamps and outcomes for every run; auditors need to verify that checks were run, not just that they passed.

FAQ

Q: How do I avoid making pre-flight so strict that it blocks legitimate work? A: Design each check to fail only on conditions that genuinely block safe execution, not on conditions that are merely suboptimal. “No backup in 24h” blocks a migration; “backup is 6h old (within 24h limit)” does not. Calibrate thresholds based on real incidents, not theoretical maximums.

Q: Should pre-flight checks ever be retried? A: Checks for external resources (API reachability, credential validity) can retry once with a short delay. Checks for logical preconditions (backup exists, environment is non-production) should not retry: if the condition is not met, retrying won’t change it. Fail fast and require a human to fix the precondition.

Q: How do I share pre-flight check results with the main agent so it doesn’t re-derive them? A: Include the pre-flight results in the handoff context passed to the execution agent. The execution agent reads “backup verified at 2026-05-25T14:30:00Z, quota remaining: 15 EC2 instances” from the context and can reference it in its reasoning without re-checking.

Q: What if a pre-flight check requires LLM reasoning (e.g., “is this migration safe?”)? A: LLM-based pre-flight checks are acceptable for qualitative assessments, but pair them with a deterministic safety net. “Is this migration safe?” becomes LLM review plus --dry-run execution plus a schema diff comparison. Never rely solely on LLM judgment for a gate that blocks production work.

Q: My orchestrator already retries failed activities. Why does my pre-flight failure still take minutes to surface? A: Because a default retry policy keeps retrying. A precondition failure (expired token, missing backup) will never succeed on retry, so it burns the entire retry budget before failing the run. Mark pre-flight failures as terminal: in Temporal raise ApplicationError(..., non_retryable=True) or set maximum_attempts=1; in LangGraph route the failed preflight node straight to END. See Temporal’s retry policy docs for the non-retryable mechanism.

Tags: #AI coding #Agents #Troubleshooting

Which bucket are you in?

Common causes

1. Pre-flight result is treated as optional in the orchestration logic

2. Pre-flight check times out and the timeout is caught as “ok to skip”

3. Pre-flight is defined in the prompt but not enforced in code

4. Pre-flight is hardcoded to skip in certain modes

5. New task types were added but pre-flight was not updated for them

6. Pre-flight runs but the agent doesn’t wait for the result

Shortest path to fix

Step 1: Make pre-flight a blocking gate, not an advisory step

Step 2: Define required checks per task type in a registry

Step 3: Remove all “skip” flags or restrict them to local-only mode

Step 4: Enforce pre-flight in the graph definition, not just in code

Step 5: Add a CI test for every pre-flight check

How to confirm it’s fixed

Prevention

FAQ

Related

Related Articles

Agent Budget Exhausted Halfway Through the Task

Restored Agent Checkpoint Is Corrupted

Cost Tracking Misses Sub-Agent Usage

Cycle in Agent Call Graph Goes Undetected

Agent Handoff Loses Context Between Steps

Agent Orchestrator Deadlocks Waiting on Each Other