Agent's Subprocess Orphaned After Agent Exits

Subprocesses launched by your agent keep running after the agent exits, consuming resources and causing side effects. Here's how to track and clean them up.

A Claude Code or LangGraph execution agent launches an npm run dev server to test its own output, then crashes on a budget limit before it can shut the server down. The Node.js process keeps running: port 3000 stays occupied, CPU and memory are consumed, and the next agent run fails immediately with EADDRINUSE: address already in use :::3000. Or a code-compilation agent starts a long-running tsc --watch process and exits without killing it. The watcher keeps accumulating file change events in the background and eventually writes corrupted output files. Orphaned subprocesses stay invisible until they cause a visible failure.

Common causes

1. Agent exits without calling cleanup — no signal sent to subprocesses

The agent uses subprocess.Popen(["npm", "run", "dev"]) and stores the Popen object. If the agent exits for any reason (budget hit, task complete, unhandled exception) without calling .terminate() on that object, the Python process exits and the subprocess is orphaned: the kernel reparents it to PID 1 (the init system) or to the nearest subreaper process, and it keeps running. Python sends it no signal on exit.

How to spot it: Search for subprocess.Popen and asyncio.create_subprocess_exec in the agent codebase. For each one, check whether the return value is stored and whether .terminate() or .kill() is called in a finally block or context manager.
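
The search described above can be partly automated. The following is a minimal sketch, assuming the agent is written in Python; the function name find_subprocess_launches and the LAUNCHERS set are hypothetical, and the script only lists launch sites. Whether each one is covered by a finally block or context manager still needs a manual look.

```python
import ast

# Call names treated as subprocess launches (an assumption; extend as needed).
LAUNCHERS = {"Popen", "create_subprocess_exec", "create_subprocess_shell"}


def find_subprocess_launches(source: str, filename: str = "<string>") -> list[tuple[int, str]]:
    """Return (line number, call name) for every call to a known launcher."""
    hits = []
    for node in ast.walk(ast.parse(source, filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # subprocess.Popen(...) is an Attribute; a bare Popen(...) is a Name.
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)
        if name in LAUNCHERS:
            hits.append((node.lineno, name))
    return sorted(hits)
```

Feed it the text of each source file (for example pathlib.Path(path).read_text()) and review every reported line.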

2. Exception path skips cleanup

The agent starts a subprocess, then an exception is raised between the start and the cleanup call. In a pattern like try: run_server(); except: raise, the cleanup sits outside any finally block, so the exception bypasses it. In practice this is one of the most common paths to orphaned processes.

How to spot it: Look at every subprocess launch in a try block. If the terminate() call is not in finally, any exception between Popen() and terminate() will orphan the process.
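
As an illustration of the difference, here is a minimal sketch of a launch whose cleanup lives in finally. The helper name run_with_cleanup and the work callback are hypothetical; the 5-second grace period is an arbitrary choice.

```python
import subprocess


def run_with_cleanup(cmd: list[str], work):
    """Start cmd, call work(proc), and always stop the subprocess afterwards."""
    proc = subprocess.Popen(cmd)
    try:
        return work(proc)  # any exception raised here still reaches finally
    finally:
        if proc.poll() is None:  # still running
            proc.terminate()
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
```

The exception still propagates to the caller; the only change is that the subprocess is gone by the time it does.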

3. Agent crashes before cleanup — no process group tracking

The agent process is killed by SIGKILL (OOM, watchdog, container eviction). SIGKILL cannot be caught — there is no cleanup code that runs. Subprocesses are reparented to PID 1. They continue running unless they are in the same process group as the agent and the group received the kill signal.

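How to spot it: Check how the agent gets killed and which process group each subprocess is in (ps -eo pid,pgid,args). A subprocess that shares the agent's group dies with it only when the signal is sent to the whole group, for example kill -KILL -- -<pgid>; a SIGKILL aimed at the agent's PID alone still orphans it. A subprocess started with os.setsid() or start_new_session=True is in its own group and always survives a parent SIGKILL unless something outside the agent kills that group.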
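
The group membership can also be checked from inside the agent. This is a small sketch for POSIX systems; shares_parent_group is a hypothetical helper name.

```python
import os
import subprocess


def shares_parent_group(proc: subprocess.Popen) -> bool:
    """True if the subprocess is in the same process group as the current process."""
    # os.getpgid(0) returns the process group of the calling process.
    return os.getpgid(proc.pid) == os.getpgid(0)
```

A subprocess started with plain Popen(...) normally reports True; one started with start_new_session=True reports False, and its group ID equals its own PID.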

4. Agent uses shell=True — subprocess tree is not tracked

subprocess.Popen("npm run dev", shell=True) creates a shell process that forks npm, which forks node. Calling .terminate() on the Popen object sends SIGTERM to the shell — not to npm or node. The shell exits but the children survive.

How to spot it: Search for shell=True in subprocess calls. Any subprocess started with shell=True requires explicit process group management to terminate the full tree.
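
When a shell=True call only passes a plain command line, converting it to exec form is mostly mechanical. The sketch below uses shlex.split from the standard library; to_exec_form is a hypothetical helper, and the character check is a deliberately conservative assumption that rejects anything that looks like it needs a real shell (pipes, redirects, globs, variables).

```python
import shlex

# Characters that only mean something to a shell. Their presence suggests the
# command cannot be converted to exec form without changing its behaviour.
_SHELL_ONLY = set("|&;<>$`*?~")


def to_exec_form(command: str) -> list[str]:
    """Split a simple shell command string into an argument list for Popen."""
    if _SHELL_ONLY & set(command):
        raise ValueError(f"command appears to need a shell: {command!r}")
    return shlex.split(command)
```

For example, to_exec_form("npm run dev") gives ["npm", "run", "dev"], which can be passed to subprocess.Popen without shell=True.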

5. Port/resource hold persists even after process is “cleaned up”

The agent calls .terminate() on the subprocess, but the subprocess catches SIGTERM and does a long graceful shutdown that holds the port. The agent exits without waiting. The port stays occupied until the shutdown completes, which can take 30 seconds. The next run starts within that window and gets EADDRINUSE.

How to spot it: Time how long the port is occupied after calling .terminate(). If it’s more than 2 seconds, the process is doing a long graceful shutdown. Use .kill() (SIGKILL) after a short timeout if the port must be released immediately.
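
One way to take that measurement and apply the escalation in the same place is sketched below. stop_with_deadline is a hypothetical helper; the 2-second default mirrors the threshold mentioned above and is otherwise arbitrary.

```python
import subprocess
import time


def stop_with_deadline(proc: subprocess.Popen, grace: float = 2.0) -> float:
    """Send SIGTERM, escalate to SIGKILL after `grace` seconds.

    Returns how long the process took to exit, in seconds.
    """
    start = time.monotonic()
    proc.terminate()  # SIGTERM: lets the process shut down gracefully
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught, so the port is released
        proc.wait()
    return time.monotonic() - start
```

Logging the returned duration per subprocess shows which ones routinely need the SIGKILL fallback.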

6. Container orchestration doesn’t propagate signals to subprocesses

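With a naive entrypoint (CMD ["python", "agent.py"]), the agent runs as PID 1 in the container. docker stop sends SIGTERM to PID 1 only, then SIGKILL once the grace period (10 seconds by default) expires. Subprocesses the agent spawned never see that SIGTERM unless PID 1 forwards it, so they get no chance to shut down gracefully. PID 1 also gets special treatment from the kernel: a signal whose disposition is left at the default is ignored, so an agent that installs no SIGTERM handler keeps running until the SIGKILL arrives. Without an init process in front of the agent, signal forwarding to children and grandchildren is unreliable.

How to spot it: Run docker stop <container> and, while it is stopping, check from a second terminal with docker top <container> (or ps aux inside the container) whether child processes exit immediately or linger. Processes that linger until the container is force-killed, or a docker stop that takes the full grace period, indicate signal propagation issues.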

Shortest path to fix

Step 1: Use a context manager to guarantee subprocess cleanup

import subprocess
import signal
import os
from contextlib import contextmanager

@contextmanager
def managed_subprocess(args: list, **kwargs):
    """Start a subprocess and guarantee it is terminated when the context exits."""
    proc = subprocess.Popen(
        args,
        start_new_session=True,  # new process group
        **kwargs
    )
    try:
        yield proc
    finally:
        if proc.poll() is None:  # still running
            # Try graceful shutdown first
            os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                # Force kill the entire process group
                os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
                proc.wait()

# Usage:
with managed_subprocess(["npm", "run", "dev"]) as server:
    run_tests_against_server(server)
# Server is killed here, even if an exception occurred
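
For agents that launch subprocesses with asyncio.create_subprocess_exec, the same idea can be sketched as an async context manager. The name managed_async_subprocess and the grace parameter are hypothetical; this is an illustration for POSIX systems under those assumptions, not a drop-in replacement for the function above.

```python
import asyncio
import os
import signal
from contextlib import asynccontextmanager


@asynccontextmanager
async def managed_async_subprocess(*args: str, grace: float = 5.0, **kwargs):
    """Start a subprocess in its own session and stop its whole group on exit."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        start_new_session=True,  # child becomes leader of a new process group
        **kwargs,
    )
    try:
        yield proc
    finally:
        if proc.returncode is None:  # not yet reaped, so the PID is still valid
            try:
                # With start_new_session=True the group ID equals the child's PID.
                os.killpg(proc.pid, signal.SIGTERM)
                await asyncio.wait_for(proc.wait(), timeout=grace)
            except asyncio.TimeoutError:
                os.killpg(proc.pid, signal.SIGKILL)
                await proc.wait()
            except ProcessLookupError:
                pass  # the group disappeared on its own
```

Usage follows the synchronous version: async with managed_async_subprocess("npm", "run", "dev") as server: ...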

Step 2: Track all subprocess PIDs in a registry and clean up on agent exit

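import atexit
import os
import signal
import subprocess
import sys

_subprocess_registry: list[subprocess.Popen] = []

def register_subprocess(proc: subprocess.Popen) -> subprocess.Popen:
    """Track a subprocess. It must have been started with start_new_session=True."""
    _subprocess_registry.append(proc)
    return proc

def cleanup_all_subprocesses():
    for proc in _subprocess_registry:
        if proc.poll() is None:
            try:
                # Signals the child's own group. Without start_new_session=True
                # this would be the agent's group, agent included.
                os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
                proc.wait(timeout=3)
            except Exception:
                try:
                    os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
                except Exception:
                    pass

def _handle_sigterm(signum, frame):
    # sys.exit() raises SystemExit, which runs the atexit handler and then
    # ends the agent. A handler that only cleans up would leave the agent running.
    sys.exit(128 + signum)

# Register cleanup on normal exit and SIGTERM.
# Neither hook runs if the agent is killed with SIGKILL.
atexit.register(cleanup_all_subprocesses)
signal.signal(signal.SIGTERM, _handle_sigterm)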

Step 3: Avoid shell=True — use exec-form with process group

# WRONG — shell=True, hard to kill the full tree
proc = subprocess.Popen("npm run dev", shell=True)

# CORRECT — exec form, full process group management
proc = subprocess.Popen(
    ["npm", "run", "dev"],
    start_new_session=True,  # creates a new process group
)
# Kill the entire group:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)

Step 4: Wait for ports to be released before the agent exits

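import socket
import time

def is_port_in_use(port: int) -> bool:
    """Return True if something accepts TCP connections on localhost:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)  # never block the agent on a hung listener
        return s.connect_ex(('localhost', port)) == 0

def wait_for_port_free(port: int, timeout: float = 10.0):
    # Monotonic clock: unaffected by wall-clock adjustments during the wait
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_port_in_use(port):
            return
        time.sleep(0.5)
    raise RuntimeError(f"Port {port} still occupied after {timeout}s cleanup")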

# After terminating the server subprocess:
with managed_subprocess(["npm", "run", "dev"]) as server:
    run_tests()
wait_for_port_free(3000, timeout=10)

Step 5: Use a process supervisor for long-running subprocesses

For subprocesses that must outlive any single agent invocation, use a supervisor (systemd, supervisord, or Docker’s restart policy) rather than managing them from within the agent. The agent signals the supervisor (“start server for run-42”) and the supervisor manages the lifecycle independently.

# supervisord config for the test server
[program:test_server]
command=npm run dev
autostart=false
autorestart=false
stopwaitsecs=5
killasgroup=true
stopasgroup=true
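
On the agent side, "signalling the supervisor" can be as small as shelling out to supervisorctl. The sketch below assumes supervisord is already running with the config above and that supervisorctl is on PATH; the helper names are hypothetical.

```python
import subprocess

ALLOWED_ACTIONS = {"start", "stop"}


def supervisorctl_command(action: str, program: str = "test_server") -> list[str]:
    """Build the supervisorctl argument list for a start or stop request."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return ["supervisorctl", action, program]


def control_test_server(action: str) -> None:
    """Ask supervisord to start or stop the test server; raises on failure."""
    subprocess.run(supervisorctl_command(action), check=True, timeout=30)
```

The agent never holds a Popen object for the server, so an agent crash cannot orphan it; the supervisor stops the whole group when asked (stopasgroup, killasgroup).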

Prevention

  • Wrap every subprocess.Popen call in a context manager that guarantees termination in a finally block.
  • Use start_new_session=True to put subprocesses in their own process group; use os.killpg() to kill the entire group at cleanup.
  • Avoid shell=True for any subprocess that runs long-lived processes; exec-form gives you direct control over the process group.
  • Register a cleanup handler with atexit and signal.SIGTERM that terminates all registered subprocesses.
  • After terminating a subprocess that holds a port, wait for the port to be free before exiting — do not assume instant cleanup.
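  • Write cleanup smoke tests: launch the agent, kill it mid-run, and verify no subprocesses survive after 10 seconds. Use SIGTERM to exercise the agent's own handlers, and SIGKILL to exercise whatever outside the agent (container teardown, supervisor) is expected to reap the process group, since no in-process handler runs on SIGKILL.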
  • In Docker, use tini as the init process (docker run --init) to ensure signals are correctly forwarded to subprocess trees.
  • Log subprocess PID at launch and confirm termination in logs; any PID that was launched but never confirmed terminated is a candidate orphan.
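
The PID logging in the last item could look like the following minimal sketch. The logger name and the helpers launch_logged and confirm_terminated are hypothetical; the log format is only a suggestion.

```python
import logging
import subprocess

log = logging.getLogger("agent.subprocess")


def launch_logged(args: list[str], **kwargs) -> subprocess.Popen:
    """Start a subprocess in its own process group and log its PID."""
    proc = subprocess.Popen(args, start_new_session=True, **kwargs)
    log.info("subprocess launched pid=%d cmd=%s", proc.pid, args)
    return proc


def confirm_terminated(proc: subprocess.Popen, timeout: float = 5.0) -> bool:
    """Log whether the subprocess has exited; return False if it is still running."""
    try:
        code = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        log.warning("subprocess still running pid=%d (candidate orphan)", proc.pid)
        return False
    log.info("subprocess terminated pid=%d returncode=%s", proc.pid, code)
    return True
```

Any "launched" line without a matching "terminated" line for the same PID points at a candidate orphan.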

FAQ

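Q: How do I find orphaned subprocesses from a previous agent run? A: Use ps aux | grep <process_name> to list candidates. For port-specific orphans: lsof -i :<port> (macOS/Linux) or netstat -tlnp | grep :<port> (Linux). To find processes owned by the agent's user that have been reparented to PID 1 (orphans): ps -eo pid,ppid,user,args | awk '$2 == 1 && $3 == "<agent_user>"'. Passing --ppid 1 and -u <agent_user> to a single ps call does not narrow the result, because ps selection options are additive and the output lists every process matching either filter.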
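
The same check can be scripted from the agent's startup code. This is a sketch that assumes a ps supporting -A and -o with header-suppressing "=" suffixes (as procps on Linux does); parse_orphans and list_orphans are hypothetical names, and long user names may be shown truncated or as numeric IDs by ps.

```python
import subprocess


def parse_orphans(ps_output: str, user: str) -> list[tuple[int, str]]:
    """Return (pid, command line) for processes owned by `user` whose parent is PID 1.

    Expects lines of the form "PID PPID USER ARGS..." without a header row.
    """
    orphans = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        pid, ppid, owner, args = parts
        if ppid == "1" and owner == user:
            orphans.append((int(pid), args.strip()))
    return orphans


def list_orphans(user: str) -> list[tuple[int, str]]:
    """Run ps and return the current orphan candidates for `user`."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid=,ppid=,user=,args="],
        capture_output=True, text=True, check=True,
    )
    return parse_orphans(result.stdout, user)
```

Not every child of PID 1 is an orphan (daemons are started that way on purpose), so treat the result as a list of candidates to compare against what the agent is known to launch.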

Q: Is it safe to use SIGKILL immediately without SIGTERM first? A: For test servers and compilation watchers, yes — they have no important cleanup to do. For processes that write to files (databases, log rotators), always try SIGTERM first and give them 3-5 seconds to flush. SIGKILL on a mid-write process can corrupt files.

Q: Temporal workflows — can subprocesses outlive the workflow? A: Yes. Temporal activities run in your worker process. If a Temporal worker crashes, its subprocesses are orphaned. Apply the same atexit + signal.SIGTERM cleanup in the worker process. Temporal does not manage subprocesses launched within activities.

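Q: How do I handle a subprocess that ignores SIGTERM? A: If it has not exited 5 seconds after SIGTERM, send SIGKILL to the process group. SIGKILL cannot be caught or ignored, so a process that is still listed afterwards is in one of two states. A zombie (Z) has already died and holds no ports or memory; it disappears once its parent reaps it with wait() or the parent itself exits. A process in D (uninterruptible sleep) state is blocked inside the kernel, typically because of hung I/O or a kernel bug; killing the container or rebooting the host is the only option in that case.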
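
To tell those states apart programmatically on Linux, the state letter can be read from /proc/<pid>/stat. The sketch below is Linux-only; parse_proc_state and process_state are hypothetical helper names.

```python
def parse_proc_state(stat_line: str) -> str:
    """Extract the one-letter state (R, S, D, Z, ...) from a /proc/<pid>/stat line.

    The line looks like "4242 (node) S 1 ...". The command name is wrapped in
    parentheses and may itself contain spaces or parentheses, so split on the
    last ")" rather than on whitespace.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid: int) -> str:
    """Return the current state letter of `pid` (Linux only)."""
    with open(f"/proc/{pid}/stat", encoding="utf-8", errors="replace") as f:
        return parse_proc_state(f.read())
```

A result of "Z" means the process only needs reaping; "D" means it is stuck in the kernel and no signal will help until the blocking call returns.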
