Agent's Subprocess Orphaned After the Agent Exits

An AI agent launches a dev server or watcher, then exits without killing it. The process keeps holding a port and burning CPU. Here is how to track every subprocess and guarantee it dies.

Published: May 25, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

A Claude Code, LangGraph, or Temporal-based agent launches npm run dev to test its own output, then crashes on a budget limit before it can shut the server down. The Node process keeps running, port 3000 stays occupied, and the next run fails immediately with Error: listen EADDRINUSE: address already in use :::3000. Or a build agent fires tsc --watch and exits without killing it, so the watcher keeps writing output in the background. Orphaned subprocesses stay invisible until they cause a visible failure on the next run.

Fastest fix (do this first): Find and kill the leftover process right now, then patch the launch code so it cannot recur. Kill what is holding the port: on macOS or Linux run kill -9 $(lsof -t -i :3000). Then wrap every subprocess launch in a context manager that puts the child in its own session (start_new_session=True) and kills the whole group in a finally block. The full pattern is in Step 1 below.

Which bucket are you in

Symptom	Most likely cause	Jump to
`EADDRINUSE` on the next run, old process still in `ps`	Cleanup never ran (no `finally`, or agent was SIGKILLed)	Causes 1-3, Step 1-2
You called `.terminate()` but children survived	`shell=True`, or no process group	Cause 4, Step 3
Process is gone but port still busy for ~30-120s	TCP `TIME_WAIT`, not an orphan	Cause 5, Step 4
Children linger only after `docker stop`	PID 1 / signal-forwarding gap	Cause 6, Step 5
`database is locked` on next run	Orphan still holds the SQLite write lock	FAQ

Common causes

1. Agent exits without calling cleanup — no signal is sent

The agent calls subprocess.Popen(["npm", "run", "dev"]) and stores the Popen object. If the agent then exits for any reason (budget hit, task complete, unhandled exception) without calling .terminate() on that object, the Python process dies and the child is reparented to PID 1 (the init system). No signal reaches it; it just keeps running.

How to spot it: Grep the agent for subprocess.Popen and asyncio.create_subprocess_exec. For each one, check whether the return value is stored and whether .terminate() or .kill() runs in a finally block or a context manager.
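As a rough aid for that audit, the sketch below uses Python's ast module to list subprocess launch calls in a source string. The function name find_subprocess_launches and the set of call names are illustrative assumptions, not part of any existing tool. It matches by call name only, so it shows where to look, not whether cleanup exists.

```python
import ast

# Call names treated as subprocess launches (an illustrative, non-exhaustive set).
LAUNCH_CALLS = {"Popen", "create_subprocess_exec", "create_subprocess_shell"}

def find_subprocess_launches(source: str) -> list[tuple[int, str]]:
    """Return (line_number, call_name) for every launch call found in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # subprocess.Popen(...) is an Attribute; a bare Popen(...) is a Name.
        if isinstance(func, ast.Attribute):
            name = func.attr
        elif isinstance(func, ast.Name):
            name = func.id
        else:
            continue
        if name in LAUNCH_CALLS:
            hits.append((node.lineno, name))
    return sorted(hits)
```

Feed it the text of each agent module, then review every reported line by hand for a matching finally block or context manager.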

2. Exception path skips cleanup

The agent starts a subprocess, then an exception fires between the launch and the cleanup call. In a pattern like try: run_server() followed by except: raise, the cleanup call sits after the try block rather than in finally, so any raised exception skips it. This is the single most common path to orphans in practice.

How to spot it: Look at every subprocess launch in a try block. If the terminate() call is not in finally, any exception between Popen() and terminate() orphans the child.
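A minimal sketch of the safe shape, assuming a placeholder child process in place of a real server. The helper name run_with_cleanup and the work callback are hypothetical; the point is only that the terminate call sits in finally.

```python
import subprocess
import sys

def run_with_cleanup(proc: subprocess.Popen, work) -> None:
    """Run work() and always stop proc afterwards, even if work() raises."""
    try:
        work()
    finally:
        # Reached on success and on exception alike.
        if proc.poll() is None:
            proc.terminate()
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate if SIGTERM was ignored
                proc.wait()

def start_placeholder_child() -> subprocess.Popen:
    # Stand-in for a dev server: a Python child that only sleeps.
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
```

This version signals only the direct child. Reaching grandchildren still needs the process-group handling from Step 1.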

3. Agent is SIGKILLed before cleanup — no process-group tracking

The agent itself is killed by SIGKILL (OOM killer, watchdog, container eviction). SIGKILL cannot be caught, so atexit hooks and finally blocks never run. The children are reparented to PID 1 and keep running unless they were in the same process group and the group received the kill. You can confirm a SIGKILL after the fact: a process killed by signal 9 exits with code 137 (128 + 9).
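The 137 convention comes from shells and container runtimes. Python's own subprocess module instead reports death by signal as a negative returncode, so -9 means SIGKILL. The helper below is a small sketch that decodes both forms; describe_exit is a hypothetical name, and a code above 128 is only probably a signal, since a program can also exit with that number on purpose.

```python
import signal

def _signal_name(number: int):
    """Return the signal's name, or None if the number is not a known signal."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return None

def describe_exit(returncode: int) -> str:
    """Describe an exit status from Popen.returncode or from a shell."""
    if returncode < 0:
        # subprocess convention: negated signal number
        name = _signal_name(-returncode)
        return f"killed by {name or 'signal ' + str(-returncode)}"
    if returncode > 128:
        # shell and Docker convention: 128 + signal number
        name = _signal_name(returncode - 128)
        if name:
            return f"probably killed by {name} (shell-style {returncode})"
    return f"exited with code {returncode}"
```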

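How to spot it: Check whether subprocesses are started in a new session (start_new_session=True, equivalent to the older preexec_fn=os.setsid but thread-safe). If they share the parent’s group, a group-wide kill such as kill -9 -<pgid> takes parent and children out together, but a SIGKILL aimed at the parent’s PID alone still orphans them. Children in their own session survive both, so something outside the agent has to track and kill their group (Step 5).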

4. `shell=True` hides the real process tree

subprocess.Popen("npm run dev", shell=True) spawns a shell that forks npm, which forks node. Calling .terminate() on the Popen object sends SIGTERM to the shell only, not to npm or node. The shell exits and the children survive.

How to spot it: Grep for shell=True. Any long-lived subprocess started that way needs explicit process-group management to terminate the full tree.

5. Process is gone but the port is still held (TIME_WAIT)

This one is not an orphan. After a server closes its connections, the kernel can keep sockets on that port in TIME_WAIT for twice the maximum segment lifetime, typically 30 to 120 seconds depending on the OS. On Linux the duration is fixed at 60 seconds; net.ipv4.tcp_fin_timeout governs FIN_WAIT_2, not TIME_WAIT. A run that restarts inside that window can get EADDRINUSE even though no process is holding the port. Check with lsof -i :3000: if no process is listed but bind still fails, it is TIME_WAIT, not a leak.

How to spot it: ss -tan | grep :3000 (Linux) or netstat -an | grep 3000 shows the socket in TIME_WAIT. The fix is to set SO_REUSEADDR on the listener (Node does this by default, as does Python’s http.server.HTTPServer; a plain socketserver.TCPServer or a raw socket does not) or pick an ephemeral port (PORT=0) for throwaway test servers.
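For a listener you write yourself in Python, both fixes are a few lines. The sketch below assumes a raw TCP socket bound to loopback; open_test_listener is a hypothetical helper name. Passing port 0 asks the kernel for any free ephemeral port, and getsockname() reports which one it chose.

```python
import socket

def open_test_listener(port: int = 0, host: str = "127.0.0.1") -> socket.socket:
    """Open a listening TCP socket that tolerates lingering TIME_WAIT entries."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Must be set before bind() to have any effect on the bind.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))  # port 0: let the kernel pick a free port
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock

def bound_port(sock: socket.socket) -> int:
    """Return the port the socket actually bound to."""
    return sock.getsockname()[1]
```

With an ephemeral port, the test harness has to read the chosen port back and pass it to whatever connects, rather than hard-coding 3000.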

6. Container stop does not propagate to subprocesses

In Docker, docker stop sends SIGTERM to PID 1, waits 10 seconds, then sends SIGKILL. If your entrypoint is a naive CMD ["python", "agent.py"], the agent runs as PID 1 but a bare Python process does not forward signals to grandchildren or reap zombies, so children spawned by the agent can linger or pile up as zombies.

How to spot it: Run docker stop <container>, then docker exec <container> ps aux (or check before it dies). If child processes linger or accumulate in Z (zombie) state after the stop, you have a signal-forwarding gap.

Shortest path to fix

Step 1: Wrap every launch in a context manager that kills the group

This is the core fix. start_new_session=True puts the child in its own process group; os.killpg then takes out the child and everything it spawned, in a finally block that runs on success, exception, and normal exit alike.

import subprocess
import signal
import os
from contextlib import contextmanager

@contextmanager
def managed_subprocess(args: list, **kwargs):
    """Start a subprocess and guarantee its whole group dies when the context exits."""
    proc = subprocess.Popen(
        args,
        start_new_session=True,  # child leads its own process group
        **kwargs
    )
    try:
        yield proc
    finally:
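        if proc.poll() is None:  # still running
            try:
                # start_new_session=True makes the child a group leader, so pgid == pid
                os.killpg(proc.pid, signal.SIGTERM)  # ask the whole group to stop
                try:
                    proc.wait(timeout=5)
                except subprocess.TimeoutExpired:
                    os.killpg(proc.pid, signal.SIGKILL)  # force-kill the group
                    proc.wait()
            except ProcessLookupError:
                pass  # the group exited between poll() and killpg()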

# Usage:
with managed_subprocess(["npm", "run", "dev"]) as server:
    run_tests_against_server(server)
# Server (and node, npm, any children) are dead here, even on exception

Note: start_new_session=True and os.killpg are POSIX-only. On Windows, pass creationflags=subprocess.CREATE_NEW_PROCESS_GROUP and terminate with proc.send_signal(signal.CTRL_BREAK_EVENT) or taskkill /T /F /PID <pid> to reach the tree.
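Agents that launch children with asyncio.create_subprocess_exec can use the same idea. The following is a sketch of an async counterpart, under the assumption that the extra keyword arguments are forwarded to the underlying Popen, which is how start_new_session is accepted. The names managed_async_subprocess and demo are illustrative, and the child is a sleeping Python process rather than a real server. Like the synchronous version, it is POSIX-only.

```python
import asyncio
import os
import signal
import sys
from contextlib import asynccontextmanager

@asynccontextmanager
async def managed_async_subprocess(*args, **kwargs):
    """Start a child in its own session and kill its whole group on exit."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        start_new_session=True,  # child leads its own group, so pgid == pid
        **kwargs,
    )
    try:
        yield proc
    finally:
        if proc.returncode is None:  # still running
            try:
                os.killpg(proc.pid, signal.SIGTERM)
                try:
                    await asyncio.wait_for(proc.wait(), timeout=5)
                except asyncio.TimeoutError:
                    os.killpg(proc.pid, signal.SIGKILL)
                    await proc.wait()
            except ProcessLookupError:
                pass  # group already gone

async def demo() -> int:
    # Placeholder child: a Python process that sleeps instead of a real server.
    async with managed_async_subprocess(
        sys.executable, "-c", "import time; time.sleep(60)"
    ) as proc:
        await asyncio.sleep(0.1)  # stand-in for the agent's real work
    return proc.returncode
```

Running asyncio.run(demo()) should return a negative code, because the child is stopped by a signal rather than exiting on its own.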

Step 2: Track every PID in a registry and clean up on agent exit

The context manager covers normal exits and exceptions raised inside the with block. To survive a SIGTERM (the signal docker stop and most schedulers send first), register a handler that drains a registry of every child you started. SIGKILL still cannot be caught; that is what Step 5 is for.

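import atexit
import os
import signal
import subprocess

# Every Popen stored here must have been started with start_new_session=True;
# otherwise killpg() below would signal the agent's own process group.
_subprocess_registry: list[subprocess.Popen] = []

def register_subprocess(proc: subprocess.Popen) -> subprocess.Popen:
    _subprocess_registry.append(proc)
    return proc

def cleanup_all_subprocesses():
    for proc in _subprocess_registry:
        if proc.poll() is not None:
            continue  # already exited
        try:
            os.killpg(proc.pid, signal.SIGTERM)  # pgid == pid for a session leader
            try:
                proc.wait(timeout=3)
            except subprocess.TimeoutExpired:
                os.killpg(proc.pid, signal.SIGKILL)
                proc.wait()
        except ProcessLookupError:
            pass  # group vanished between poll() and killpg()

def _handle_sigterm(signum, frame):
    cleanup_all_subprocesses()
    os._exit(0)  # skip atexit hooks; cleanup has already run

atexit.register(cleanup_all_subprocesses)
signal.signal(signal.SIGTERM, _handle_sigterm)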

Step 3: Use exec-form instead of `shell=True`

# WRONG — shell=True: .terminate() only kills the shell, node survives
proc = subprocess.Popen("npm run dev", shell=True)

# CORRECT — exec form + own session, so killpg reaches the whole tree
proc = subprocess.Popen(["npm", "run", "dev"], start_new_session=True)
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)

If you must use a shell (for a pipe or glob), keep start_new_session=True and always kill by group, never by proc.pid alone.
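Here is a sketch of that shell case, assuming a throwaway pipeline (sleep 60 | cat) in place of a real command. The helper names stop_group and run_pipeline_demo are hypothetical. The helper relies on the child having been started with start_new_session=True, so that its PID doubles as the process-group ID.

```python
import os
import signal
import subprocess

def stop_group(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """SIGTERM the child's whole process group, escalating to SIGKILL."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)  # pgid == pid for a session leader
    except ProcessLookupError:
        return proc.wait()  # group already gone; just collect the status
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()

def run_pipeline_demo() -> int:
    # A pipe needs a shell; the shell, sleep, and cat all share one group.
    proc = subprocess.Popen("sleep 60 | cat", shell=True, start_new_session=True)
    return stop_group(proc)
```

Calling proc.terminate() here instead would stop only the shell and leave sleep and cat running.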

Step 4: Confirm the port is actually free before exiting

After terminating the server, do not assume the port is instantly reusable. A slow graceful shutdown can hold it for seconds, and TIME_WAIT (Cause 5) can hold it for 30 to 120 seconds. Poll for it, and for throwaway test servers prefer SO_REUSEADDR or an ephemeral port so the next run never collides.

import socket
import time

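def is_port_in_use(port: int) -> bool:
    # A successful connect means something is still listening on the port.
    # This does not detect TIME_WAIT, where no listener exists (see Cause 5).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("localhost", port)) == 0

def wait_for_port_free(port: int, timeout: float = 10.0):
    deadline = time.monotonic() + timeout  # monotonic clock ignores wall-clock jumps
    while time.monotonic() < deadline:
        if not is_port_in_use(port):
            return
        time.sleep(0.5)
    raise RuntimeError(f"Port {port} still occupied after {timeout}s of cleanup")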

with managed_subprocess(["npm", "run", "dev"]) as server:
    run_tests()
wait_for_port_free(3000, timeout=10)

Step 5: For SIGKILL-proofing, hand long-lived processes to a supervisor

atexit and signal handlers do nothing against SIGKILL or a hard container eviction. For any subprocess that must outlive a single agent invocation, let a supervisor own the lifecycle and have the agent only signal it (“start server for run-42”).

; supervisord program for the test server
[program:test_server]
command=npm run dev
autostart=false
autorestart=false
stopwaitsecs=5
killasgroup=true   ; if the stop times out, SIGKILL the whole group
stopasgroup=true   ; send the stop signal to the whole group, so node and its children all go down

In Docker, add a real init so PID 1 forwards signals and reaps zombies: run with docker run --init (which injects the bundled tini as PID 1, making your agent PID 2), or set init: true in your Compose service. Without it, a bare process at PID 1 will not reap the zombies your subprocesses leave behind.

How to confirm it is fixed

Run a deliberate kill test, since the failure only shows up when cleanup is skipped:

1. Note the port your agent uses, then start a real run.
2. While it is mid-task, hard-kill the agent: kill -9 <agent_pid> (simulates OOM/eviction).
3. After ~10 seconds, check for survivors: lsof -i :3000 and pgrep -f "npm run dev" should both return nothing.
4. Start the next run; it should bind the port with no EADDRINUSE.

If survivors remain after a SIGKILL of the agent, you are relying on atexit/SIGTERM alone and need the supervisor approach in Step 5.

Prevention

Wrap every subprocess.Popen in a context manager that terminates the group in finally.
Always launch with start_new_session=True and clean up with os.killpg(os.getpgid(pid), ...), not proc.terminate() alone.
Never use shell=True for a long-lived process unless you still kill by group.
Register atexit plus a SIGTERM handler that drains a subprocess registry.
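Tag each child with an AGENT_TASK_ID env var so a post-run sweep can find strays. On Linux, ps -e selects all processes rather than printing their environment, so search /proc instead: grep -l "AGENT_TASK_ID=<id>" /proc/[0-9]*/environ 2>/dev/null.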
Treat EADDRINUSE after a clean exit as possible TIME_WAIT, not always a leak — set SO_REUSEADDR or use an ephemeral port for test servers.
In Docker, always run with docker run --init (or Compose init: true) so PID 1 forwards signals and reaps zombies.
Add the SIGKILL smoke test above to CI so a regression in cleanup fails the build.
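The tagging and sweep from the list above can be sketched as follows. This is Linux-only, because it reads /proc/<pid>/environ, and it sees only processes the current user is allowed to inspect. The names launch_tagged and find_tagged_pids are hypothetical, and the environ file reflects the environment at exec time, not later changes.

```python
import os
import subprocess

def launch_tagged(args, task_id: str, **kwargs) -> subprocess.Popen:
    """Start a child in its own session with AGENT_TASK_ID set in its environment."""
    env = dict(os.environ, AGENT_TASK_ID=task_id)
    return subprocess.Popen(args, env=env, start_new_session=True, **kwargs)

def find_tagged_pids(task_id: str) -> list[int]:
    """Return PIDs whose initial environment contains AGENT_TASK_ID=<task_id>."""
    if not os.path.isdir("/proc"):
        return []  # no procfs (for example macOS): this sweep does not apply
    needle = f"AGENT_TASK_ID={task_id}".encode()
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/environ", "rb") as f:
                variables = f.read().split(b"\0")  # NUL-separated KEY=VALUE pairs
        except OSError:
            continue  # process exited, or we lack permission to read it
        if needle in variables:
            found.append(int(entry))
    return sorted(found)
```

A post-run sweep would call find_tagged_pids with the finished run's ID and kill each surviving group. Children inherit the tag unless they rebuild their environment from scratch.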

FAQ

Q: How do I find orphaned subprocesses from a previous agent run? A: List port holders with lsof -i :<port> and kill them with kill -9 $(lsof -t -i :<port>). To find processes that were reparented to init (true orphans): on Linux use ps --ppid 1 -u <agent_user>; on macOS the BSD ps has no --ppid, so use ps -eo pid,ppid,command | awk '$2 == 1'. If you tagged children with AGENT_TASK_ID, grep for that env var instead.

Q: Is it safe to send SIGKILL immediately without SIGTERM first? A: For test servers and --watch compilers, yes — they have nothing important to flush. For anything that writes to disk (databases, log rotators), always send SIGTERM first and allow 3-5 seconds; a SIGKILL mid-write can corrupt files or leave a stale lock.

Q: The process is gone but I still get EADDRINUSE. Why? A: That is TCP TIME_WAIT, not an orphan. The kernel holds the port for 30-120 seconds after a clean close. Confirm with lsof -i :<port> showing no process while ss -tan | grep :<port> shows TIME_WAIT. Set SO_REUSEADDR on the listener, or bind to port 0 for throwaway servers.

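Q: An orphan is holding a SQLite lock and the next run reports database is locked. How do I clear it? A: Find the holder with lsof <db_file>, then kill -9 <pid>. SQLite's file locks belong to the holding process, so they are released as soon as it dies. For a WAL-mode database, do not delete the leftover -wal and -shm sidecar files: the -wal file can contain committed transactions that have not been checkpointed yet, and removing it loses them. The next connection that opens the database recovers the WAL automatically.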

Q: Temporal or LangGraph — can subprocesses outlive the workflow? A: Yes. Activities and nodes run inside your worker process, so if the worker is killed its subprocesses are orphaned. The framework does not manage processes you launch yourself. Apply the same context-manager plus atexit/SIGTERM cleanup in the worker, and for anything long-lived use a supervisor (Step 5).

Q: My subprocess ignores SIGTERM. What now? A: After ~5 seconds with no exit, send SIGKILL to the group: os.killpg(pgid, signal.SIGKILL). If it survives even SIGKILL, it is stuck in D (uninterruptible sleep) on hung I/O, usually an NFS or device-driver issue. The signal stays pending until the blocked I/O call returns, so the process goes away only once the underlying I/O recovers (for example, the NFS server comes back) or the host is rebooted.

Tags: #AI coding #Agents #Troubleshooting
