A GitHub Actions workflow that normally finishes in 4 minutes sits on the deploy step for 6 hours, then gets killed with “The job running on runner X has exceeded the maximum execution time of 360 minutes.” The same vercel deploy --prod or firebase deploy runs fine when you trigger it locally. Cancelling and re-running sometimes succeeds, sometimes hangs again.
This is almost always a deploy CLI silently waiting on something that will never arrive: an interactive prompt that nobody can answer, a webhook that needs a public callback URL the runner cannot expose, a wait-for-deployment polling action stuck because the deployment ID was never set, or an SSH-based deploy hung on a host key prompt. The 6-hour limit is the job timeout, not a deploy timeout. What you actually need is a step-level timeout and a fix for the real hang.
Common causes
Ordered by what we see most often.
1. Deploy CLI waiting on an interactive prompt
firebase deploy will prompt “? You're about to deploy a function in region X. Continue? (Y/n)” if it detects a config change. In CI with no stdin, it sits silent forever.
How to spot it: Step log ends mid-deploy with no error. Last line is something like “Detected target change. Continue?” or the log just stops after “Starting deploy...”.
2. A wait-for-deployment action poll never resolves
Actions like bobheadxi/deployments or custom gh api polling loops wait for a deployment status update. If the deployment never reports back (because the deployer used a different SHA, or the status webhook failed silently), the poll runs forever.
How to spot it: Workflow YAML has a with: timeout: 600 or while gh api ...; sleep 30; done loop. Step log shows repeated polling lines with no terminal state.
3. SSH deploy hung on host key prompt
The first time you run ssh user@host, it prompts “Are you sure you want to continue connecting (yes/no)?”. A fresh CI runner has no known_hosts entry for your server, so the prompt waits.
How to spot it: Step uses ssh, scp, rsync, or a deploy action like appleboy/ssh-action. Log stops with no output after the SSH connection attempt.
4. Network egress blocked / rate limited mid-deploy
Cloud upload (S3, GCS, Cloudflare R2) can stall if the runner hits an egress throttle or a corporate proxy mid-transfer. The TCP connection doesn’t error — it just sits open with no progress.
How to spot it: Log shows transfer progress for a while, then stops at e.g. “uploaded 47/120 assets” with no error.
5. Deploy step has no timeout-minutes, so the job’s 360-minute default applies
GitHub Actions jobs default to 360 minutes (6 hours). Without timeout-minutes at the step level, a single hung step eats the whole job budget.
How to spot it: Workflow YAML’s deploy step has no timeout-minutes: key. Job-level default kicks in.
6. Deploy hook fires but target service rejected silently
A webhook-style deploy (Render, Railway, Fly with hook URLs) returns 200 OK to the runner but never starts a build. The poll-for-completion step waits forever for a deploy that never began.
How to spot it: Hook step succeeds. Polling step hangs. Target service dashboard shows no deploy from the expected time.
7. Cache restore hangs on a corrupted blob
actions/cache@v3 restoring a 5 GB build cache can hang if the cache server returns a stuck stream. No error, just no progress.
How to spot it: Step Run actions/cache shows Downloading cache... with no completion line. Newer cache versions are better but still hit this occasionally.
Before you start
- Capture the workflow YAML for the failing job.
- Identify which step the hang is on by reading the job log linearly.
- Note whether this hangs deterministically or intermittently.
- Have access to the deploy target’s own dashboard to cross-check what actually happened on their side.
- Confirm runner type (ubuntu-latest, self-hosted, GitHub-hosted larger runner) — egress / DNS behavior differs.
Information to collect
- The full deploy step’s YAML, including with:, env:, and the command.
- Last 50 lines of step log before the timeout.
- Deploy target’s dashboard log for the same time window (Vercel deployments, Firebase function logs, etc.).
- Whether the workflow uses any third-party deploy actions and their versions.
- Whether secrets and environment vars expected by the deploy CLI are actually set in the workflow.
- Output of gh run view <run-id> --log, which is easier to scrape than the web UI.
Step-by-step fix
Ordered: stop the bleeding first, then fix the underlying cause.
Step 1: Add a step-level timeout-minutes
Immediately cap any deploy step:
- name: Deploy to production
  timeout-minutes: 15
  run: vercel deploy --prod --token=${{ secrets.VERCEL_TOKEN }}
The job now fails fast instead of burning 6 hours of CI minutes. You will catch hangs in minutes, not hours. Apply this to every deploy / wait step in your repo.
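timeout-minutes covers the whole step. When one run block chains several commands, a per-command deadline shows which of them hung. The sketch below assumes the coreutils timeout command is available, as it is on Ubuntu-based runners; run_with_deadline is a hypothetical helper name, not part of any deploy CLI.

```shell
# run_with_deadline SECONDS CMD [ARGS...]
# Runs CMD and terminates it if it is still running after SECONDS.
# Exit code 124 means the deadline fired and CMD stopped on SIGTERM;
# 137 means CMD ignored SIGTERM and was killed 10 seconds later.
run_with_deadline() {
  local limit="$1"
  shift
  local rc=0
  timeout -k 10 "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
    echo "ERROR: '$1' exceeded ${limit}s and was terminated" >&2
  fi
  return "$rc"
}
```

Inside a run block this could look like run_with_deadline 900 vercel deploy --prod --yes --token="$VERCEL_TOKEN", where VERCEL_TOKEN is assumed to be exported through the step's env: mapping. Keep timeout-minutes on the step as well; the shell deadline is an addition, not a replacement.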
Step 2: Force non-interactive mode on deploy CLIs
For Firebase:
- run: firebase deploy --non-interactive --force --project=prod
For Vercel:
- run: vercel deploy --prod --yes --token=${{ secrets.VERCEL_TOKEN }}
For npm publish:
- run: npm publish --provenance --access public
  env:
    NPM_CONFIG_YES: "true"
Prompts are now either auto-accepted or turned into an immediate error, so nothing waits silently.
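For a CLI that has no non-interactive flag, a fallback is to attach its stdin to /dev/null so that any read returns end-of-file at once instead of blocking. This is a minimal sketch: no_stdin is a hypothetical helper, how a given tool reacts to end-of-file (accept the default, or exit with an error) is something to verify per tool, and programs that open /dev/tty directly are not affected by it.

```shell
# no_stdin CMD [ARGS...]
# Runs CMD with stdin attached to /dev/null, so a prompt that reads
# from stdin gets end-of-file immediately rather than waiting.
no_stdin() {
  "$@" < /dev/null
}

# Demonstration with the shell's own `read`, which fails at once on EOF.
no_stdin sh -c 'if read -r answer; then echo "got: $answer"; else echo "no input available"; fi'
# prints: no input available
```

Either outcome (default accepted, or an error exit) is better in CI than a silent wait, because the step ends and the log shows where.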
Step 3: Skip SSH host key prompts safely
- name: Add SSH known_hosts
  run: |
    mkdir -p ~/.ssh
    ssh-keyscan -H ${{ secrets.DEPLOY_HOST }} >> ~/.ssh/known_hosts
    chmod 600 ~/.ssh/known_hosts
- name: Deploy via SSH
  timeout-minutes: 10
  uses: appleboy/ssh-action@v1
  with:
    host: ${{ secrets.DEPLOY_HOST }}
    username: deploy
    key: ${{ secrets.DEPLOY_KEY }}
    script: |
      cd /var/www && git pull && pnpm install --prod && systemctl reload app
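Pre-populating known_hosts eliminates the first-connect prompt. Do not use StrictHostKeyChecking=no, which disables host verification and with it MITM protection. Keep in mind that ssh-keyscan at run time trusts whichever host answers; for stronger protection, store the server’s known_hosts line in a secret and write that instead.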
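A sketch of two related hardening moves for steps that call the ssh binary directly: write a pinned known_hosts line (for example from a secret; DEPLOY_KNOWN_HOSTS in the usage note is a hypothetical secret name) and pass OpenSSH options that make ssh fail instead of waiting for input. install_known_hosts and ssh_ci are illustrative helper names. Actions that bundle their own SSH client may not read ~/.ssh/known_hosts at all, so check their documentation for an equivalent setting.

```shell
# install_known_hosts ENTRY [FILE]
# Appends one pinned known_hosts line to FILE (default ~/.ssh/known_hosts).
install_known_hosts() {
  local entry="$1"
  local file="${2:-$HOME/.ssh/known_hosts}"
  if [ -z "$entry" ]; then
    echo "ERROR: pinned host key entry is empty" >&2
    return 1
  fi
  mkdir -p "$(dirname "$file")"
  printf '%s\n' "$entry" >> "$file"
  chmod 600 "$file"
}

# ssh_ci [SSH ARGS...]
# BatchMode=yes             never prompt (passwords, host key confirmation)
# StrictHostKeyChecking=yes refuse hosts that are not in known_hosts
# ConnectTimeout=10         give up on an unreachable host after 10 seconds
ssh_ci() {
  ssh -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=10 "$@"
}
```

In a workflow this could be called as install_known_hosts "$DEPLOY_KNOWN_HOSTS" followed by ssh_ci deploy@"$DEPLOY_HOST" 'systemctl reload app'. An unknown or changed host key then produces an immediate error in the log rather than a hidden prompt.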
Step 4: Fix wait-for-deployment polling to fail fast
Add a hard cap and a heartbeat log:
- name: Wait for Vercel deployment
  timeout-minutes: 15  # backstop; the loop below normally fails first with a clearer message
  run: |
    DEPLOY_ID="${{ steps.deploy.outputs.id }}"
    if [ -z "$DEPLOY_ID" ]; then
      echo "ERROR: deployment id missing" >&2
      exit 1
    fi
    # 60 polls with a 10-second sleep: roughly 10 minutes plus the time each inspect call takes
    for i in $(seq 1 60); do
      STATE=$(vercel inspect "$DEPLOY_ID" --token=${{ secrets.VERCEL_TOKEN }} | grep -E "^\s+state" | awk '{print $2}')
      echo "[poll $i] state=$STATE"
      [ "$STATE" = "READY" ] && exit 0
      [ "$STATE" = "ERROR" ] && exit 1
      sleep 10
    done
    echo "ERROR: deploy did not reach READY in 10m" >&2
    exit 1
A missing deployment ID now fails immediately. A stuck poll fails at 10 minutes instead of 6 hours.
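The same bounded-poll pattern can be factored into a helper so that every wait step shares one cap and one log format. This is a minimal sketch: poll_state is a hypothetical function, and it assumes the command you pass prints a single state word, with READY and ERROR as the terminal states (the two used in the step above).

```shell
# poll_state MAX_ATTEMPTS INTERVAL_SECONDS CMD [ARGS...]
# Runs CMD up to MAX_ATTEMPTS times and reads one state word from its stdout.
# Returns 0 on READY, 1 on ERROR, 2 if no terminal state was seen in time.
poll_state() {
  local max="$1"
  local interval="$2"
  shift 2
  local i=1
  local state
  while [ "$i" -le "$max" ]; do
    state="$("$@" 2>/dev/null || true)"
    # Heartbeat line: shows the poll is alive and what it last saw.
    echo "[poll $i/$max] state=${state:-unknown}"
    case "$state" in
      READY) return 0 ;;
      ERROR) return 1 ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  echo "ERROR: no terminal state after $max attempts" >&2
  return 2
}
```

A distinct exit code for "never reached a terminal state" lets a later step tell a failed deploy apart from a deploy that never reported back.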
Step 5: Add a watchdog for upload progress
For uploads that can stall silently:
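- name: Upload artifacts
  timeout-minutes: 20
  run: |
    # Write to a file instead of piping through tee, so $! is the aws process itself.
    aws s3 sync ./dist s3://my-bucket --delete --no-progress > upload.log 2>&1 &
    UP_PID=$!
    # Baseline marker; without it the first freshness check always fails.
    touch /tmp/_lastsize
    while sleep 30 && kill -0 "$UP_PID" 2>/dev/null; do
      if [ -z "$(find upload.log -newer /tmp/_lastsize 2>/dev/null)" ]; then
        echo "no progress in 30s" >&2
        cat upload.log
        kill "$UP_PID"
        exit 1
      fi
      touch /tmp/_lastsize
    done
    cat upload.log
    # Propagate the exit status of aws s3 sync.
    wait "$UP_PID"
A stalled transfer dies within about a minute of its last output instead of hanging until the job timeout. With --no-progress a single large file prints nothing while it uploads, so raise the 30-second interval if your assets are large.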
Step 6: Cross-check the deploy target’s own log
Some hangs are not really hangs — the workflow correctly waits while the target service silently rejected the deploy:
- name: Check Vercel deployment exists
  run: |
    vercel ls --token=${{ secrets.VERCEL_TOKEN }} | head -5
If the target dashboard shows no recent deploy, the trigger step failed silently. Log its full response:
- name: Trigger deploy
  run: |
    RESPONSE=$(curl -fsSL -X POST "$DEPLOY_HOOK_URL")
    echo "deploy hook response: $RESPONSE"
See firebase deploy permission denied for adjacent silent-deploy failure patterns.
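With -f, curl discards the response body on HTTP errors, and the body is often the only place a service explains a rejection. A hedged alternative is to capture the status code and the body separately, log both, and fail on anything outside 2xx. trigger_hook is a hypothetical helper and the 30-second cap is an arbitrary choice.

```shell
# trigger_hook URL
# POSTs to a deploy hook, logs the HTTP status and the full body,
# and returns non-zero unless the status is 2xx.
trigger_hook() {
  local url="$1"
  local body
  local code
  body="$(mktemp)"
  # -o writes the body to a file; -w prints only the status code to stdout.
  code="$(curl -sS --max-time 30 -X POST -o "$body" -w '%{http_code}' "$url")" || {
    echo "ERROR: hook request failed before an HTTP response arrived" >&2
    rm -f "$body"
    return 1
  }
  echo "deploy hook HTTP $code: $(cat "$body")"
  rm -f "$body"
  case "$code" in
    2??) return 0 ;;
    *)   echo "ERROR: hook returned HTTP $code" >&2; return 1 ;;
  esac
}
```

A 2xx here still only proves the hook was accepted. Confirming that a build actually started remains a job for the target service's dashboard, API, or CLI.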
Step 7: Pin third-party action versions and audit cache restore
Lock action versions to a commit SHA to avoid regressions:
- uses: actions/cache@0c45773b623bea8c8e75f6c82b208c3cf94ea4f9 # v4.0.2
  with:
    path: ~/.pnpm-store
    key: pnpm-${{ hashFiles('**/pnpm-lock.yaml') }}
    enableCrossOsArchive: false
Floating tags like @v3 can pull regressions. Use SHA pins for any action involved in your deploy path.
Verify
- Re-run the workflow. Total time matches normal (within 1.5x of last green run).
- Step-level timeout-minutes is set on every deploy / wait step.
- Hanging is now a 10-15 minute failure with a clear error, not a 6-hour cancellation.
- Deploy CLI commands include their non-interactive flag everywhere.
- The wait-for-deployment step exits non-zero if the deployment never appears.
Long-term prevention
- Mandate timeout-minutes at the step level on every CI workflow; lint for its absence in PRs.
- Always pass non-interactive flags explicitly to deploy CLIs in CI even if the default seems fine.
- Pin third-party actions to commit SHAs; review CHANGELOG before bumping.
- Cross-check deploy success by querying the target service’s API or CLI, not just the trigger step’s exit code.
- Keep a .github/workflows/CHECKS.md listing each step’s worst-case timeout and your runbook for hangs.
- For self-hosted runners, monitor disk + memory; cache restore hangs sometimes correlate with disk pressure.
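The lint for missing timeouts can start very small. The sketch below is a coarse, file-level check: it lists workflow files that never mention timeout-minutes at all. workflows_without_timeout is a hypothetical helper name, and a true per-step check would need a real YAML parser rather than grep.

```shell
# workflows_without_timeout [DIR]
# Prints every .yml/.yaml file under DIR (default .github/workflows)
# that contains no timeout-minutes key anywhere.
workflows_without_timeout() {
  local dir="${1:-.github/workflows}"
  find "$dir" -type f \( -name '*.yml' -o -name '*.yaml' \) | sort |
    while IFS= read -r file; do
      grep -q 'timeout-minutes:' "$file" || echo "$file"
    done
}
```

In a PR check, a non-empty result can fail the job, for example: missing=$(workflows_without_timeout); [ -z "$missing" ] || { echo "$missing"; exit 1; }.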
Common pitfalls
- Setting timeout-minutes: 360 thinking it “fixed” the hang. That is the default. You need a much lower step-level cap.
- Using StrictHostKeyChecking=no “to fix SSH prompts”. This disables host verification and is a security hole. Use ssh-keyscan instead.
- Catching the timeout with continue-on-error: true so the workflow looks green while the deploy never happened. See vercel build failed for adjacent silent-deploy patterns.
- Adding if: always() to a notification step that pings Slack with “deploy succeeded” even when the deploy step timed out. Check steps.deploy.outcome == 'success' instead.
- Bumping actions/cache major versions in a deploy workflow without testing. Cache action regressions have caused multi-hour CI hangs across thousands of repos.
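One way to keep a notification honest is to pass the deploy step's outcome into the script and derive the message from it, so that "succeeded" is only ever sent for an actual success. In this sketch deploy_message is a hypothetical helper, and DEPLOY_OUTCOME is a hypothetical environment variable you would set from steps.deploy.outcome in the step's env: mapping. The four values handled are the ones GitHub documents for outcome.

```shell
# deploy_message OUTCOME
# Maps a step outcome to a notification text. Anything unrecognised,
# including an empty value, is reported as unknown and returns non-zero.
deploy_message() {
  case "$1" in
    success)   echo "deploy succeeded" ;;
    failure)   echo "deploy FAILED" ;;
    cancelled) echo "deploy cancelled" ;;
    skipped)   echo "deploy skipped" ;;
    *)         echo "deploy outcome unknown: ${1:-<empty>}"; return 1 ;;
  esac
}
```

A notification step that runs under if: always() could then send the output of deploy_message "${DEPLOY_OUTCOME:-}" instead of a hard-coded success text.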
FAQ
Q: Can I raise the job timeout above 360 minutes?
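Not on GitHub-hosted runners: 6 hours is a hard cap on job execution time there, and timeout-minutes can only lower it. Self-hosted runners allow jobs of up to 5 days, so a higher job-level timeout-minutes works on those. But if you legitimately need that, your deploy probably needs to be split, not extended. Step-level timeouts are the right tool.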
Q: My workflow fails but the deploy actually succeeded. What now?
This is the worst failure mode because rollback is hard. Check the deploy target’s dashboard — if the deploy went through, manually promote / approve there, then investigate why the workflow’s exit code was wrong (usually a downstream step like a smoke test).
Q: Should I split deploy + smoke test into separate jobs?
Yes — deploy job posts the URL as an output, smoke test job consumes it. The deploy job ends in minutes (clean state), smoke tests can take longer without blocking the deploy log.
Q: My deploy works on ubuntu-22.04 but hangs on ubuntu-latest after the runner image bumped.
Possible — image changes occasionally affect default ~/.ssh/config or installed Node versions. Pin to a specific runner image (ubuntu-22.04) in production deploy workflows. See vercel build failed for related runner-environment debugging.