GitHub Actions Deploy Step Hangs Until the 6-Hour Job Limit

Your GitHub Actions deploy step hangs and gets cancelled at the 360-minute job limit. The fastest fix is a shell timeout wrapper plus non-interactive deploy flags. Full diagnosis and fixes inside.

Published: May 24, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

A GitHub Actions workflow that normally finishes in 4 minutes sits on the deploy step for 6 hours, then dies with "The job running on runner X has exceeded the maximum execution time of 360 minutes. The job is canceled." The same `vercel deploy --prod` or `firebase deploy` runs fine when you trigger it locally. Cancelling and re-running sometimes succeeds, sometimes hangs again.

This is almost always a deploy CLI silently waiting on something that will never arrive: a confirmation prompt with no stdin to answer it, a wait-for-deployment poll stuck because the deployment ID was never set, an SSH deploy hung on a first-connect host-key prompt, or a stalled cloud upload. The 360-minute limit is the job timeout, not your deploy timeout.

Fastest fix: wrap the deploy command in the shell timeout utility and add the CLI’s non-interactive flag, so a hang fails in minutes with a clear error instead of burning 6 hours of CI minutes:

- name: Deploy to production
  timeout-minutes: 15        # step-level safety net
  run: timeout 600 vercel deploy --prod --yes
  env:
    VERCEL_TOKEN: ${{ secrets.VERCEL_TOKEN }}

Then fix the underlying hang using the diagnosis below.

Which bucket are you in?

Find your symptom, jump to the fix.

Last log line before the hang	Likely cause	Go to
`Continue? (Y/n)` or stops after `Starting deploy...`	CLI waiting on a prompt, no stdin	Step 2
Repeated `[poll N]` / `state=BUILDING` lines, no end	`wait-for-deployment` never resolves	Step 4
Stops right after an `ssh`/`scp`/`rsync` line	SSH first-connect host-key prompt	Step 3
`uploaded 47/120 assets` then silence	Egress throttle / proxy stalls the transfer	Step 5
`Downloading cache...` with no completion line	Old `actions/cache` version, backend changed	Step 7
Hook step is green, target dashboard shows no build	Deploy hook accepted but never started	Step 6
No `timeout-minutes`, step just ran to 360 min	No step cap; job default absorbed the hang	Step 1

Before you start

Capture the workflow YAML for the failing job (the full deploy step including with:, env:, and the command).
Identify which step hangs by reading the job log linearly. gh run view <run-id> --log scrapes cleaner than the web UI.
Note whether it hangs deterministically or intermittently.
Open the deploy target’s own dashboard (Vercel deployments, Firebase function logs, Render/Railway activity) to cross-check what happened on their side.
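Confirm the runner type (ubuntu-latest, GitHub-hosted larger runner, or self-hosted): egress and DNS behavior differ, and the 360-minute figure is a hard ceiling only on GitHub-hosted runners. On self-hosted runners it is just the default job timeout, which a job-level timeout-minutes can raise; check GitHub's current usage limits for the self-hosted maximum.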

Step-by-step fix

Ordered: stop the bleeding first, then fix the root cause.

Step 1: Bound the step so a hang fails in minutes

Set timeout-minutes on every deploy and wait step. The job-level default is 360 minutes (as of June 2026), and a step timeout can never exceed the job timeout — it only ever lowers it.

- name: Deploy to production
  timeout-minutes: 15
  run: vercel deploy --prod --yes
  env:
    VERCEL_TOKEN: ${{ secrets.VERCEL_TOKEN }}

One important caveat: step timeout-minutes cancels the step, but on uses:-based actions and some Docker/child-process hangs the runner cannot always interrupt the wedged subprocess cleanly. For anything stubborn, wrap the actual command in the GNU timeout utility so the process is killed at the OS level:

- name: Deploy to production
  timeout-minutes: 15            # outer safety net
  # SIGTERM at 600 s; SIGKILL 30 s later if the process ignores SIGTERM
  run: timeout --signal=TERM --kill-after=30 600 ./deploy.sh

timeout 600 ... exits with code 124 if the command runs past 600 seconds (or 137 if it had to escalate to SIGKILL), giving you a deterministic, fast failure that the step timeout alone may not.
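
You can check the exit-code behavior locally before relying on it in CI. The following is a minimal sketch, assuming GNU coreutils `timeout` is available (as on Ubuntu runners); `run_bounded` is a hypothetical helper name, not part of any tool, and `sleep 5` stands in for a hung deploy.

```shell
#!/usr/bin/env bash
# run_bounded LIMIT CMD...: hypothetical helper that runs CMD under GNU timeout
# and reports why it stopped. Sends SIGTERM at LIMIT seconds, SIGKILL 5 s later.
run_bounded() {
  local limit="$1"; shift
  local status=0
  # "|| status=$?" keeps the script alive under "bash -e", the Actions default.
  timeout --signal=TERM --kill-after=5 "$limit" "$@" || status=$?
  case "$status" in
    0)   echo "finished within ${limit}s" >&2 ;;
    124) echo "timed out after ${limit}s (SIGTERM sent)" >&2 ;;
    137) echo "killed with SIGKILL, usually because SIGTERM was ignored" >&2 ;;
    *)   echo "command failed on its own with exit code $status" >&2 ;;
  esac
  return "$status"
}

# Demo with a stand-in for a hung deploy: sleep 5 is cut off after 1 second.
demo_status=0
run_bounded 1 sleep 5 || demo_status=$?
echo "exit code: $demo_status"
```

The demo reports exit code 124, which is the value to look for in a failed step's log when deciding whether the command hung or failed on its own.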

Step 2: Force non-interactive mode on deploy CLIs

This is the single most common cause: the CLI detected a change and is waiting for a Y/n that will never come.

For Vercel (pass the token via env, not --token; the --token argument can leak into logs and, on Vercel CLI 50.16.0–52.0.0, into a structured JSON error payload — CVE-2026-44479, May 2026; upgrade to 52.0.1+):

- run: vercel deploy --prod --yes
  env:
    VERCEL_TOKEN: ${{ secrets.VERCEL_TOKEN }}

For Firebase:

- run: firebase deploy --non-interactive --force --project=prod

Known gap (as of June 2026): --non-interactive --force still does not auto-answer every prompt in firebase-tools. Firestore index field-override deletions and certain min-instances / min-bill confirmations on firebase deploy --only firestore or --only functions can still block. If a Firebase deploy hangs even with both flags, scope the deploy (--only hosting, --only functions:myFn) so the blocking resource is not in the same command, and check the firebase-tools issue tracker for your specific prompt.

For npm publish:

- run: npm publish --provenance --access public
  env:
    NPM_CONFIG_YES: "true"
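
When a CLI has no documented non-interactive flag, or a flag misses one prompt, a generic fallback is to close stdin so that any prompt reading from it hits end-of-file at once instead of waiting. The sketch below illustrates the idea only: `fake_deploy` is a hypothetical stand-in for a prompting CLI, and tools that read the terminal device directly rather than stdin are not affected by this trick.

```shell
#!/usr/bin/env bash
# fake_deploy: hypothetical stand-in for a deploy CLI that asks for confirmation.
fake_deploy() {
  printf 'Continue? (Y/n) ' >&2
  local answer
  if read -r answer; then
    echo "deploying (answer: ${answer:-Y})"
  else
    # read returns non-zero on end-of-file: nobody can answer the prompt.
    echo "prompt could not be answered; aborting" >&2
    return 1
  fi
}

# Closed stdin: the prompt fails immediately instead of blocking the step.
prompt_status=0
fake_deploy < /dev/null || prompt_status=$?
echo "exit code with stdin closed: $prompt_status"

# Piped answer: the same prompt is satisfied without a human.
echo "y" | fake_deploy
```

In a workflow the same idea looks like `run: timeout 600 ./deploy.sh < /dev/null`, where `./deploy.sh` is whatever command your step already runs.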

Step 3: Skip the SSH host-key prompt safely

ssh user@host on a fresh runner prompts Are you sure you want to continue connecting (yes/no/[fingerprint])? because there is no known_hosts entry. Pre-populate it:

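- name: Add SSH known_hosts
  run: |
    mkdir -p ~/.ssh && chmod 700 ~/.ssh
    # Pass the host through env instead of interpolating the secret into the script
    ssh-keyscan -H "$DEPLOY_HOST" >> ~/.ssh/known_hosts
    chmod 600 ~/.ssh/known_hosts
  env:
    DEPLOY_HOST: ${{ secrets.DEPLOY_HOST }}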

- name: Deploy via SSH
  timeout-minutes: 10
  uses: appleboy/ssh-action@v1
  with:
    host: ${{ secrets.DEPLOY_HOST }}
    username: deploy
    key: ${{ secrets.DEPLOY_KEY }}
    script: |
      cd /var/www && git pull && pnpm install --prod && systemctl reload app

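Do not “fix” this with StrictHostKeyChecking=no, which disables host-key verification and with it MITM protection. Keep in mind that ssh-keyscan run on the runner trusts whatever key the network returns at that moment; for real pinning, run it once from a trusted machine and store the resulting known_hosts line in a secret. The known_hosts file is read by ssh, scp, and rsync invoked from run: steps; an action that bundles its own SSH client may ignore it, so check that action's inputs for a host-fingerprint option.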
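
For `ssh`, `scp`, or `rsync` called directly from a `run:` step, OpenSSH client options can make a missing host key or a dead connection fail fast instead of waiting. This is a minimal sketch assuming the OpenSSH client; `ssh_deploy` is a hypothetical helper name, and the `deploy` user, `DEPLOY_HOST` variable, and remote command are placeholders borrowed from the example above.

```shell
#!/usr/bin/env bash
# Options that make the OpenSSH client fail instead of waiting for input.
SSH_OPTS=(
  -o BatchMode=yes              # never prompt for passwords or passphrases
  -o StrictHostKeyChecking=yes  # unknown host key is an error, not a question
  -o ConnectTimeout=10          # give up on an unreachable host after 10 s
  -o ServerAliveInterval=15     # probe the server after 15 s of silence...
  -o ServerAliveCountMax=3      # ...and disconnect after 3 missed replies
)

# ssh_deploy HOST CMD...: hypothetical wrapper; the outer timeout is the
# last line of defense if the remote command itself hangs.
ssh_deploy() {
  local host="$1"; shift
  timeout 600 ssh "${SSH_OPTS[@]}" "deploy@${host}" "$@"
}

# Print the command that would run, without connecting anywhere.
echo "ssh ${SSH_OPTS[*]} deploy@${DEPLOY_HOST:-example.invalid} 'cd /var/www && git pull'"
```

With `StrictHostKeyChecking=yes`, a run that forgot to populate known_hosts fails with a host-key error in seconds rather than sitting on a prompt.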

Step 4: Make `wait-for-deployment` polling fail fast

Actions like bobheadxi/deployments or a hand-rolled gh api loop wait for a status update. If the deployment never reports back (deployer used a different SHA, status webhook silently failed), the poll runs forever. Add a hard cap, a heartbeat log, and a guard for a missing ID:

- name: Wait for Vercel deployment
  timeout-minutes: 12            # a little above the loop's 60 x 10 s budget, so the loop's own error is reported first
  run: |
    DEPLOY_ID="${{ steps.deploy.outputs.id }}"
    if [ -z "$DEPLOY_ID" ]; then
      echo "ERROR: deployment id missing — trigger step never set an output" >&2
      exit 1
    fi
    for i in $(seq 1 60); do
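      # vercel inspect may print to stderr and its layout can change between CLI versions; adjust the parsing to match yours
      STATE=$(vercel inspect "$DEPLOY_ID" 2>&1 | grep -E "^\s+status" | awk '{print $2}')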
      echo "[poll $i] status=$STATE"
      [ "$STATE" = "Ready" ] && exit 0
      [ "$STATE" = "Error" ] && exit 1
      sleep 10
    done
    echo "ERROR: deploy did not reach Ready in 10m" >&2
    exit 1
  env:
    VERCEL_TOKEN: ${{ secrets.VERCEL_TOKEN }}

A missing deployment ID now fails immediately; a genuinely stuck poll fails at 10 minutes, not 6 hours.
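
The same bounded-poll pattern can be factored into a function so every wait step shares one cap and one heartbeat format. This is a sketch under assumptions: `poll_until`, its arguments, and the `fake_status` stand-in are hypothetical names for illustration, and the `Ready` / `Error` strings mirror the loop above rather than any guaranteed CLI output.

```shell
#!/usr/bin/env bash
# poll_until MAX_TRIES INTERVAL CMD...: run CMD until it prints "Ready".
# Returns 0 on Ready, 1 on Error, 2 if MAX_TRIES is exhausted.
poll_until() {
  local max_tries="$1" interval="$2"; shift 2
  local i state
  for ((i = 1; i <= max_tries; i++)); do
    # A failing status command counts as "unknown", not as a reason to abort.
    state="$("$@" 2>/dev/null || true)"
    echo "[poll $i/$max_tries] status=${state:-unknown}" >&2
    case "$state" in
      Ready) return 0 ;;
      Error) return 1 ;;
    esac
    sleep "$interval"
  done
  echo "ERROR: not Ready after $max_tries polls" >&2
  return 2
}

# Hypothetical status command: reports Building twice, then Ready.
# The counter lives in a file because the command runs in a subshell.
COUNT_FILE="$(mktemp)"
echo 0 > "$COUNT_FILE"
fake_status() {
  local n
  n=$(( $(cat "$COUNT_FILE") + 1 ))
  echo "$n" > "$COUNT_FILE"
  if [ "$n" -ge 3 ]; then echo "Ready"; else echo "Building"; fi
}

poll_until 5 1 fake_status && echo "deployment ready"
```

Distinct return codes for "failed" and "gave up" let a later step decide whether to retry the wait or roll back.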

Step 5: Add a progress watchdog to uploads

Cloud uploads (S3, GCS, Cloudflare R2) can stall on an egress throttle or corporate proxy mid-transfer. The TCP connection does not error; it just stops making progress. Kill it when the log goes quiet:

- name: Upload artifacts
  timeout-minutes: 20
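  run: |
    # Background aws itself (not a pipeline) so $! is the PID of the transfer.
    aws s3 sync ./dist s3://my-bucket --delete --no-progress > upload.log 2>&1 &
    UP_PID=$!
    touch /tmp/_lastcheck
    while kill -0 "$UP_PID" 2>/dev/null; do
      sleep 30
      kill -0 "$UP_PID" 2>/dev/null || break    # finished during the sleep
      # With --no-progress the log grows once per completed file; widen the
      # 30 s window if single files take longer than that to upload.
      if [ -z "$(find upload.log -newer /tmp/_lastcheck 2>/dev/null)" ]; then
        echo "no upload progress in 30s; killing transfer" >&2
        kill "$UP_PID" 2>/dev/null || true
        cat upload.log
        exit 1
      fi
      touch /tmp/_lastcheck
    done
    STATUS=0
    wait "$UP_PID" || STATUS=$?    # exit code of aws itself
    cat upload.log
    exit "$STATUS"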

A stalled transfer is killed within about a minute (one or two 30-second checks) instead of hanging until the job timeout.
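
The watchdog can also be written as a function that wraps any command, which makes it easy to test outside CI. This is a minimal sketch, not a drop-in replacement: `run_with_watchdog`, its quiet-window and log-path arguments, and the demo command are hypothetical, and it watches log size rather than file timestamps.

```shell
#!/usr/bin/env bash
# run_with_watchdog QUIET_SECS LOG CMD...: run CMD with output sent to LOG and
# kill it if LOG does not grow for QUIET_SECS. Returns CMD's exit code, or 1
# after a watchdog kill.
run_with_watchdog() {
  local quiet="$1" log="$2"; shift 2
  : > "$log"
  "$@" > "$log" 2>&1 &
  local pid=$!            # PID of the command itself, not of a pipeline tail
  local last_size=0 size idle=0
  while kill -0 "$pid" 2>/dev/null; do
    sleep 1
    size=$(wc -c < "$log")
    if [ "$size" -gt "$last_size" ]; then
      last_size=$size; idle=0
    else
      idle=$((idle + 1))
    fi
    if [ "$idle" -ge "$quiet" ] && kill -0 "$pid" 2>/dev/null; then
      echo "no output for ${quiet}s; killing pid $pid" >&2
      kill "$pid" 2>/dev/null || true
      wait "$pid" 2>/dev/null || true
      return 1
    fi
  done
  local status=0
  wait "$pid" || status=$?
  return "$status"
}

# Stand-in for a stalled upload: prints one line, then goes silent.
watchdog_status=0
run_with_watchdog 2 /tmp/upload-demo.log sh -c 'echo "uploaded 1/3"; sleep 30' || watchdog_status=$?
echo "exit code: $watchdog_status"
```

Pick the quiet window from how often the wrapped command normally writes output; a window shorter than the slowest legitimate gap will kill healthy transfers.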

Step 6: Cross-check the deploy target’s own log

Some hangs are not really hangs — the workflow correctly waits while the target service silently rejected or never started the deploy. A webhook deploy (Render, Railway, Fly hook URLs) can return 200 OK to the runner and never begin a build.

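- name: Trigger deploy
  run: |
    # --max-time bounds the request itself so the trigger cannot hang
    RESPONSE=$(curl -fsSL --max-time 30 -X POST "$DEPLOY_HOOK_URL")
    echo "deploy hook response: $RESPONSE"
  env:
    DEPLOY_HOOK_URL: ${{ secrets.DEPLOY_HOOK_URL }}   # assumed secret name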

- name: Sanity-check the deploy exists
  run: vercel ls | head -5
  env:
    VERCEL_TOKEN: ${{ secrets.VERCEL_TOKEN }}

If the target dashboard shows no deploy from this run’s timestamp, the trigger step failed silently — log its full response and exit code. See firebase deploy permission denied for adjacent silent-deploy patterns.
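
Logging the HTTP status code alongside the body makes a rejected hook visible in the job log. The sketch below assumes `curl` is installed; `trigger_hook` is a hypothetical helper name and `DEPLOY_HOOK_URL` is an assumed variable. A 2xx status only shows the hook was accepted, not that a build started, so the dashboard cross-check still applies.

```shell
#!/usr/bin/env bash
# trigger_hook URL: POST to a deploy hook, log status and body, and fail
# unless the response status is 2xx.
trigger_hook() {
  local url="$1" body_file code
  body_file="$(mktemp)"
  # --write-out prints the status code; curl reports 000 when no response arrived.
  code="$(curl --silent --show-error --connect-timeout 5 --max-time 30 \
            -X POST --output "$body_file" --write-out '%{http_code}' "$url")" \
    || code="000"
  echo "deploy hook HTTP status: $code" >&2
  echo "deploy hook response body: $(cat "$body_file" 2>/dev/null)" >&2
  rm -f "$body_file"
  case "$code" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

# Only fire the hook when a URL is actually configured.
if [ -n "${DEPLOY_HOOK_URL:-}" ]; then
  trigger_hook "$DEPLOY_HOOK_URL"
fi
```

Because the function returns non-zero on anything other than 2xx, a refused or timed-out hook fails the step instead of leaving it green.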

Step 7: Upgrade `actions/cache` — old pins now fail, not just hang

This section changed materially. GitHub shut down the legacy cache service backend on February 1, 2025. Any workflow still pinned to actions/cache@v1, @v2, or pre-3.4.0 / pre-4.2.0 releases (the old v3.3.2 SHA many repos copied around) no longer talks to a working backend — these now fail or stall on Downloading cache... rather than just occasionally hanging.

As of June 2026 the current line is actions/cache@v5 (Node.js 24 runtime, requires runner 2.327.1+). Move to v5, or pin v4.2.0+ / v3.4.0+ if you cannot bump the runtime:

- uses: actions/cache@v5
  with:
    path: ~/.pnpm-store
    key: pnpm-${{ hashFiles('**/pnpm-lock.yaml') }}

If you pin to a commit SHA for supply-chain safety, use the SHA of a v5, v4.2.0+, or v3.4.0+ release — never the older v3.3.x SHAs. Floating major tags can also pull regressions, so review the release notes before bumping.
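
A quick scan can flag workflows that still reference the retired versions by tag. This is a rough sketch, assuming a grep that supports `-r` and `--include` (GNU grep does); `find_old_cache_pins` is a hypothetical helper name, and it cannot judge SHA pins, which still need a manual check against the release list.

```shell
#!/usr/bin/env bash
# find_old_cache_pins DIR: list workflow lines that reference actions/cache
# by a tag older than v3.4.0 / v4.2.0 (v1.x, v2.x, v3.0-v3.3, v4.0-v4.1).
find_old_cache_pins() {
  local dir="${1:-.github/workflows}"
  grep -rnE --include='*.yml' --include='*.yaml' \
    'uses:[[:space:]]*actions/cache(/restore|/save)?@v([12]|3\.[0-3]|4\.[01])([.[:space:]]|$)' \
    "$dir" || true
}

# Demo against throwaway files: two old pins and one current one.
demo_dir="$(mktemp -d)"
printf '      - uses: actions/cache@v2\n'     > "$demo_dir/old.yml"
printf '      - uses: actions/cache@v3.3.2\n' > "$demo_dir/old-minor.yml"
printf '      - uses: actions/cache@v5\n'     > "$demo_dir/new.yml"
find_old_cache_pins "$demo_dir"
```

Run with no argument from the repository root, it scans `.github/workflows`; failing the check when the output is non-empty turns it into a PR gate.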

How to confirm it’s fixed

Re-run the workflow. Total time is back within ~1.5x of your last green run.
Every deploy and wait step has a timeout-minutes (and a timeout N ... wrapper on stubborn uses:/script steps).
A deliberate hang now fails in 10–15 minutes with a clear error, not a 6-hour cancellation. Quick test: temporarily point a deploy at a non-existent host and confirm it fails fast.
Every deploy CLI invocation carries its non-interactive flag (--yes, --non-interactive --force, NPM_CONFIG_YES).
The wait step exits non-zero when the deployment never appears.
actions/cache is on v5 / v4.2.0+ / v3.4.0+; no Downloading cache... line sits without a completion line.

Long-term prevention

Mandate timeout-minutes at the step level on every CI workflow; lint for its absence in PRs.
Always pass non-interactive flags explicitly to deploy CLIs, even when the default seems fine — defaults change between CLI versions.
Pass tokens via env: (VERCEL_TOKEN, FIREBASE_TOKEN / GOOGLE_APPLICATION_CREDENTIALS), not as command-line --token arguments that can land in logs.
Confirm deploy success by querying the target service’s API or CLI, not just the trigger step’s exit code.
Keep a .github/workflows/CHECKS.md listing each step’s worst-case timeout and your hang runbook.
On self-hosted runners, monitor disk and memory; cache restore stalls often correlate with disk pressure.
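
The lint for a missing timeout-minutes can start as a coarse file-level check. A minimal sketch, assuming a grep with `-r`, `-L`, and `--include`; `workflows_without_timeout` is a hypothetical helper name, and it only reports files that never mention the key, so it does not prove that every deploy step is bounded.

```shell
#!/usr/bin/env bash
# workflows_without_timeout DIR: print workflow files that contain no
# timeout-minutes key at all (grep -L lists files without a match).
workflows_without_timeout() {
  local dir="${1:-.github/workflows}"
  grep -rL --include='*.yml' --include='*.yaml' \
    -E '^[[:space:]-]*timeout-minutes:' "$dir" || true
}

# Demo against throwaway files: one unbounded workflow, one bounded.
lint_dir="$(mktemp -d)"
printf 'jobs:\n  deploy:\n    steps:\n      - run: ./deploy.sh\n' \
  > "$lint_dir/unbounded.yml"
printf 'jobs:\n  deploy:\n    steps:\n      - run: ./deploy.sh\n        timeout-minutes: 15\n' \
  > "$lint_dir/bounded.yml"
workflows_without_timeout "$lint_dir"
```

A per-step check needs a real YAML parser; this version is only meant to catch workflows where nobody thought about timeouts at all.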

Common pitfalls

Setting timeout-minutes: 360 and thinking it “fixed” the hang — that is the default. You need a much lower step-level cap.
Relying on step timeout-minutes alone for uses: actions — for stubborn child-process hangs, also wrap the command in the shell timeout utility.
Using StrictHostKeyChecking=no “to fix SSH prompts” — it disables host verification and is a security hole. Use ssh-keyscan instead.
Adding continue-on-error: true so the workflow looks green while the deploy never happened. See vercel build failed for related silent-deploy patterns.
An if: always() Slack step that posts deploy succeeded even when the deploy timed out. Gate it on steps.deploy.outcome == 'success'.
Leaving actions/cache pinned to a pre-2025 SHA — those versions point at a backend that no longer exists and now fail outright.

FAQ

Can I just raise the job timeout above 360 minutes?

On GitHub-hosted runners, no: 360 minutes is a hard ceiling per job. On self-hosted runners 360 minutes is only the default, and a job-level timeout-minutes can raise it up to the self-hosted limit listed in GitHub's usage-limits documentation. If you legitimately need more than 6 hours, split the deploy rather than extend it. Step-level timeouts are the right tool for a hang.

My step has timeout-minutes set but the step still ran to the full 360 minutes. Why?

Step timeout-minutes can fail to interrupt a wedged child process inside a uses: action or a Docker step, so the job-level limit absorbs it instead. Wrap the actual command in timeout 600 ... inside a run: step so the OS kills the process directly.

My workflow fails but the deploy actually succeeded. What now?

This is the worst failure mode because rollback is hard. Check the target dashboard — if the deploy went through, promote/approve it there, then investigate why the workflow’s exit code was wrong (usually a downstream step like a smoke test).

Should I split deploy and smoke test into separate jobs?

Yes. The deploy job posts the URL as an output and ends in minutes (clean state); the smoke-test job consumes that output and can run longer without blocking the deploy log.

It works on ubuntu-22.04 but hangs on ubuntu-latest after the runner image bumped.

Possible — image changes occasionally shift the default ~/.ssh/config, the installed Node version, or a preinstalled CLI version. Pin production deploy workflows to a specific runner image (ubuntu-22.04 or ubuntu-24.04) instead of ubuntu-latest. See vercel build failed for related runner-environment debugging.

Tags: #Troubleshooting #GitHub Actions #CI #deploy #timeout
