GitHub Actions Deploy Step Times Out After 6 Hours

Your GitHub Actions deploy step hangs and gets cancelled at the 6-hour job limit — usually a wait-for-deployment poll, network egress block, or a deploy CLI waiting on missing input.

A GitHub Actions workflow that normally finishes in 4 minutes is sitting on the deploy step for 6 hours, then gets killed with The job running on runner X has exceeded the maximum execution time of 360 minutes. The actual vercel deploy --prod or firebase deploy runs fine when you trigger it locally. Cancelling and re-running sometimes succeeds, sometimes hangs again. This is almost always a deploy CLI silently waiting on something that will never arrive: a missing interactive prompt response, a webhook that needs a public callback URL the runner cannot expose, a wait-for-deployment polling action stuck because the deployment ID was never set, or an SSH-based deploy hung on a host key prompt. The 6-hour limit is the job timeout, not the deploy timeout — what you actually need is a step-level timeout AND fixing the real hang.

Common causes

Ordered by what we see most often.

1. Deploy CLI waiting on an interactive prompt

firebase deploy will prompt ? You're about to deploy a function in region X. Continue? (Y/n) if it detects a config change. In CI with no stdin, it sits silent forever.

How to spot it: Step log ends mid-deploy with no error. Last line is something like Detected target change. Continue? or just hangs after Starting deploy....

2. A wait-for-deployment action poll never resolves

Actions like bobheadxi/deployments or custom gh api polling loops wait for a deployment status update. If the deployment never reports back (because the deployer used a different SHA, or the status webhook failed silently), the poll runs forever.

How to spot it: Workflow YAML has a with: timeout: 600 or while gh api ...; sleep 30; done loop. Step log shows repeated polling lines with no terminal state.

3. SSH deploy hung on host key prompt

ssh user@host first time prompts Are you sure you want to continue connecting (yes/no)?. CI runners do not have known_hosts entries, so the prompt waits.

How to spot it: Step uses ssh, scp, rsync, or a deploy action like appleboy/ssh-action. Log stops with no output after the SSH connection attempt.

4. Network egress blocked / rate limited mid-deploy

Cloud upload (S3, GCS, Cloudflare R2) can stall if the runner hits an egress throttle or a corporate proxy mid-transfer. The TCP connection doesn’t error — it just sits open with no progress.

How to spot it: Log shows transfer progress for a while, then stops at e.g. “uploaded 47/120 assets” with no error.

5. Step missing timeout-minutes on a workflow with timeout-minutes: 360 job

GitHub Actions jobs default to 360 minutes (6 hours). Without timeout-minutes at the step level, a single hung step eats the whole job budget.

How to spot it: Workflow YAML’s deploy step has no timeout-minutes: key. Job-level default kicks in.

6. Deploy hook fires but target service rejected silently

A webhook-style deploy (Render, Railway, Fly with hook URLs) returns 200 OK to the runner but never starts a build. The poll-for-completion step waits forever for a deploy that never began.

How to spot it: Hook step succeeds. Polling step hangs. Target service dashboard shows no deploy from the expected time.

7. Cache restore hangs on a corrupted blob

actions/cache@v3 restoring a 5 GB build cache can hang if the cache server returns a stuck stream. No error, just no progress.

How to spot it: Step Run actions/cache shows Downloading cache... with no completion line. Newer cache versions are better but still hit this occasionally.

Before you start

  • Capture the workflow YAML for the failing job.
  • Identify which step the hang is on by reading the job log linearly.
  • Note whether this hangs deterministically or intermittently.
  • Have access to the deploy target’s own dashboard to cross-check what actually happened on their side.
  • Confirm runner type (ubuntu-latest, self-hosted, GitHub-hosted larger runner) — egress / DNS behavior differs.

Information to collect

  • The full deploy step’s YAML including with:, env:, and command.
  • Last 50 lines of step log before the timeout.
  • Deploy target’s dashboard log for the same time window (Vercel deployments, Firebase function logs, etc.).
  • Whether the workflow uses any third-party deploy actions and their versions.
  • Whether secrets and environment vars expected by the deploy CLI are actually set in the workflow.
  • Output of gh run view <run-id> --log for cleaner log scraping than the web UI.

Step-by-step fix

Ordered: stop the bleeding first, then fix the underlying cause.

Step 1: Add a step-level timeout-minutes

Immediately cap any deploy step:

- name: Deploy to production
  timeout-minutes: 15
  run: vercel deploy --prod --token=${{ secrets.VERCEL_TOKEN }}

The job now fails fast instead of burning 6 hours of CI minutes. You will catch hangs in minutes, not hours. Apply this to every deploy / wait step in your repo.

Step 2: Force non-interactive mode on deploy CLIs

For Firebase:

- run: firebase deploy --non-interactive --force --project=prod

For Vercel:

- run: vercel deploy --prod --yes --token=${{ secrets.VERCEL_TOKEN }}

For npm publish:

- run: npm publish --provenance --access public
  env:
    NPM_CONFIG_YES: "true"

Any prompt now auto-accepts the default; no silent wait.

Step 3: Skip SSH host key prompts safely

- name: Add SSH known_hosts
  run: |
    mkdir -p ~/.ssh
    ssh-keyscan -H ${{ secrets.DEPLOY_HOST }} >> ~/.ssh/known_hosts
    chmod 600 ~/.ssh/known_hosts

- name: Deploy via SSH
  timeout-minutes: 10
  uses: appleboy/ssh-action@v1
  with:
    host: ${{ secrets.DEPLOY_HOST }}
    username: deploy
    key: ${{ secrets.DEPLOY_KEY }}
    script: |
      cd /var/www && git pull && pnpm install --prod && systemctl reload app

Pre-populating known_hosts eliminates the first-connect prompt. Do not use StrictHostKeyChecking=no — that disables MITM protection.

Step 4: Fix wait-for-deployment polling to fail fast

Add a hard cap and a heartbeat log:

- name: Wait for Vercel deployment
  timeout-minutes: 10
  run: |
    DEPLOY_ID="${{ steps.deploy.outputs.id }}"
    if [ -z "$DEPLOY_ID" ]; then
      echo "ERROR: deployment id missing" >&2
      exit 1
    fi
    for i in $(seq 1 60); do
      STATE=$(vercel inspect "$DEPLOY_ID" --token=${{ secrets.VERCEL_TOKEN }} | grep -E "^\s+state" | awk '{print $2}')
      echo "[poll $i] state=$STATE"
      [ "$STATE" = "READY" ] && exit 0
      [ "$STATE" = "ERROR" ] && exit 1
      sleep 10
    done
    echo "ERROR: deploy did not reach READY in 10m" >&2
    exit 1

A missing deployment ID now fails immediately. A stuck poll fails at 10 minutes instead of 6 hours.

Step 5: Add a watchdog for upload progress

For uploads that can stall silently:

- name: Upload artifacts
  timeout-minutes: 20
  run: |
    aws s3 sync ./dist s3://my-bucket --delete --no-progress \
      | tee upload.log &
    UP_PID=$!
    while kill -0 $UP_PID 2>/dev/null; do
      sleep 30
      if [ -z "$(find upload.log -newer /tmp/_lastsize 2>/dev/null)" ]; then
        echo "no progress in 30s" >&2
        kill $UP_PID
        exit 1
      fi
      touch /tmp/_lastsize
    done
    wait $UP_PID

A stalled transfer dies in 30 seconds instead of hanging until job timeout.

Step 6: Cross-check the deploy target’s own log

Some hangs are not really hangs — the workflow correctly waits while the target service silently rejected the deploy:

- name: Check Vercel deployment exists
  run: |
    vercel ls --token=${{ secrets.VERCEL_TOKEN }} | head -5

If the target dashboard shows no recent deploy, the trigger step failed silently. Log its full response:

- name: Trigger deploy
  run: |
    RESPONSE=$(curl -fsSL -X POST "$DEPLOY_HOOK_URL")
    echo "deploy hook response: $RESPONSE"

See firebase deploy permission denied for adjacent silent-deploy failure patterns.

Step 7: Pin third-party action versions and audit cache restore

Lock action versions to a commit SHA to avoid regressions:

- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f9 # v3.3.2
  with:
    path: ~/.pnpm-store
    key: pnpm-${{ hashFiles('**/pnpm-lock.yaml') }}
    enableCrossOsArchive: false

Floating tags like @v3 can pull regressions. Use SHA pins for any action involved in your deploy path.

Verify

  • Re-run the workflow. Total time matches normal (within 1.5x of last green run).
  • Step-level timeout-minutes are set on every deploy / wait step.
  • Hanging is now a 10-15 minute failure with a clear error, not a 6-hour cancellation.
  • Deploy CLI commands include their non-interactive flag everywhere.
  • The wait-for-deployment step exits non-zero if the deployment never appears.

Long-term prevention

  • Mandate timeout-minutes at the step level on every CI workflow; lint for its absence in PRs.
  • Always pass non-interactive flags explicitly to deploy CLIs in CI even if the default seems fine.
  • Pin third-party actions to commit SHAs; review CHANGELOG before bumping.
  • Cross-check deploy success by querying the target service’s API or CLI, not just the trigger step’s exit code.
  • Keep a .github/workflows/CHECKS.md listing each step’s worst-case timeout and your runbook for hangs.
  • For self-hosted runners, monitor disk + memory; cache restore hangs sometimes correlate with disk pressure.

Common pitfalls

  • Setting timeout-minutes: 360 thinking it “fixed” the hang — that is the default. You need a much lower step-level cap.
  • Using StrictHostKeyChecking=no “to fix SSH prompts” — this disables host verification and is a security hole. Use ssh-keyscan instead.
  • Catching the timeout with continue-on-error: true so the workflow looks green while the deploy never happened. See vercel build failed for adjacent silent-deploy patterns.
  • Adding if: always() to a notification step that pings Slack deploy succeeded even when the deploy step timed out. Check steps.deploy.outcome == 'success' instead.
  • Bumping actions/cache major versions in a deploy workflow without testing — cache action regressions have caused multi-hour CI hangs across thousands of repos.

FAQ

Q: Can I raise the job timeout above 360 minutes?

Yes, set timeout-minutes: 1440 (24h max) at the job level. But if you legitimately need that, your deploy probably needs to be split, not extended. Step-level timeouts are the right tool.

Q: My workflow fails but the deploy actually succeeded. What now?

This is the worst failure mode because rollback is hard. Check the deploy target’s dashboard — if the deploy went through, manually promote / approve there, then investigate why the workflow’s exit code was wrong (usually a downstream step like a smoke test).

Q: Should I split deploy + smoke test into separate jobs?

Yes — deploy job posts the URL as an output, smoke test job consumes it. The deploy job ends in minutes (clean state), smoke tests can take longer without blocking the deploy log.

Q: My deploy works on ubuntu-22.04 but hangs on ubuntu-latest after the runner image bumped.

Possible — image changes occasionally affect default ~/.ssh/config or installed Node versions. Pin to a specific runner image (ubuntu-22.04) in production deploy workflows. See vercel build failed for related runner-environment debugging.

Tags: #Troubleshooting #GitHub Actions #CI #deploy #timeout