Docker Container Randomly Restarts With Exit Code 137 (OOM Killed)

Containers restart with exit 137. The OOM killer hit your --memory limit. Find the leak, profile heap, set sensible limits, and stop the bleeding.

The container has been up for two days, then bam: it restarts with exit code 137. The logs end mid-sentence, with no exception and no stack trace. Exit 137 is 128 + 9, that is, death by signal 9 (SIGKILL), which for a containerized process almost always means the kernel OOM killer enforced your --memory cap. Either the limit is too tight for normal traffic, or the process is leaking and would eventually OOM at any size. The fix: confirm it was the OOM killer, find the actual leak with a heap profile, raise the limit only after profiling, and put a guardrail in place so it does not recur silently.
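The 128 + 9 arithmetic can be checked mechanically. A minimal sketch, assuming a POSIX host (where Python's signal module knows SIGKILL); the helper name describe_exit_code is made up for this example:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a readable description.

    By shell convention, codes above 128 mean "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # 137 - 128 = 9
print(describe_exit_code(143))  # 143 - 128 = 15
print(describe_exit_code(1))
```

On Linux, 137 maps to SIGKILL and 143 to SIGTERM. SIGKILL cannot be caught, which is why the logs stop mid-sentence.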

Common causes

Ordered by hit rate.

1. Memory limit set too low for normal peak

You set --memory=256m (or the equivalent limit in a copy-pasted Compose file or Helm chart). Real peak working set is 400 MB. Every traffic spike triggers OOM.

How to spot it: docker inspect shows "Memory": 268435456 (256 MiB) under HostConfig. RSS during traffic peaks brushes the limit.
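To relate the flag to the number docker inspect prints: Docker treats the k/m/g suffixes as powers of 1024. A rough sketch that handles only single-letter suffixes (the function names are invented for this example, not part of any Docker tooling):

```python
# Single-letter suffixes only; Docker interprets them as binary (1024-based) units.
UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def docker_memory_to_bytes(value: str) -> int:
    """Convert a --memory style string such as '256m' to bytes."""
    value = value.strip().lower()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)  # a bare number is already in bytes


def bytes_to_mib(n: int) -> float:
    return n / 1024**2


limit = docker_memory_to_bytes("256m")
print(limit)                # 268435456, the value docker inspect reports
print(bytes_to_mib(limit))  # 256.0
```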

2. Slow leak in app code

Process RSS grows monotonically over hours or days. Eventually hits the limit. Restart cycle becomes predictable (e.g. every 36 hours).

How to spot it: docker stats over time shows monotone climb.
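If you are collecting those docker stats readings anyway, a least-squares slope turns "looks like it is climbing" into a number. A sketch under the assumption that you have (hours, MiB) samples; the sample data and function names here are illustrative only:

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours_since_start, rss_mib) samples, in MiB/hour."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def hours_until_limit(samples, limit_mib):
    """Extrapolate the trend; None means usage is flat or falling."""
    slope = slope_per_hour(samples)
    if slope <= 0:
        return None
    _, latest = samples[-1]
    return (limit_mib - latest) / slope


# Hypothetical readings: RSS grows 10 MiB per hour
readings = [(0, 100), (1, 110), (2, 120), (3, 130)]
print(slope_per_hour(readings))          # 10.0
print(hours_until_limit(readings, 230))  # 10.0
```

A steady positive slope that ignores traffic patterns points to a leak; a slope that tracks load points back to cause 1.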

3. Unbounded in-process cache

Map/dict used as a cache with no LRU. Every unique key adds an entry. Memory grows forever.

How to spot it: Heap dump shows one Map instance with hundreds of thousands of entries.

4. Connection pool with no upper bound

ORM creates a new connection per request without releasing it. Each connection holds buffers and driver state, often megabytes apiece, so memory climbs with concurrency.

How to spot it: Pool size in metrics exceeds documented max; heap shows many DB driver objects.
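For contrast with the failure mode above, here is roughly what a bounded pool looks like. This is a standard-library sketch, not how any particular ORM implements pooling; BoundedPool and fake_connect are hypothetical names, and a real pool would also validate connections before reuse.

```python
import queue
from contextlib import contextmanager


class BoundedPool:
    """Hands out at most max_size connections; callers block when it is exhausted."""

    def __init__(self, factory, max_size):
        self._factory = factory
        self._slots = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._slots.put(None)  # empty slot: connection is created lazily

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._slots.get(timeout=timeout)  # raises queue.Empty when exhausted
        try:
            if conn is None:
                conn = self._factory()
            yield conn
        finally:
            # Runs on success and on error paths, so the slot is always returned
            self._slots.put(conn)


created = []


def fake_connect():
    """Stand-in for a real driver's connect() call."""
    created.append(object())
    return created[-1]


pool = BoundedPool(fake_connect, max_size=2)
for _ in range(10):
    try:
        with pool.connection():
            raise RuntimeError("query failed")
    except RuntimeError:
        pass

print(len(created))  # 2: ten failing requests never create more than max_size connections
```

The important part is the finally clause: a pool with a maximum still leaks if error paths skip the release.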

5. Native code allocations not visible to the heap profiler

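Node Buffer, Python NumPy arrays, and Go cgo allocations live outside the language-managed heap, so they can be invisible to language-level profilers but very visible to the OS.

How to spot it: Language heap looks healthy, but RSS is far larger (for example, double) than the heap the profiler reports.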
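One way to quantify that gap in Python is to compare what tracemalloc has traced against the RSS the kernel reports. A Linux-only sketch (it relies on the VmRSS field of /proc/<pid>/status); the function names are made up for this example, and the interpreter's own baseline means the difference is never zero:

```python
import tracemalloc


def rss_bytes_from_status(status_text: str) -> int:
    """Parse VmRSS (reported in kB) out of /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1]) * 1024
    raise ValueError("VmRSS not found")


def untracked_bytes(status_text: str, traced_bytes: int) -> int:
    """Memory the OS sees that the Python-level tracer does not."""
    return rss_bytes_from_status(status_text) - traced_bytes


def report_current_process() -> int:
    """Linux only. tracemalloc must already be running to have traced anything."""
    with open("/proc/self/status") as f:
        status = f.read()
    traced, _peak = tracemalloc.get_traced_memory()
    return untracked_bytes(status, traced)


# Worked example on a captured status snippet: 200 MiB RSS, 100 MiB traced
sample = "Name:\tpython\nVmRSS:\t  204800 kB\nVmSwap:\t       0 kB\n"
print(untracked_bytes(sample, 100 * 1024**2))  # 104857600 bytes (100 MiB) unaccounted for
```

Call report_current_process() periodically inside the service; a growing result while the traced heap stays flat suggests native allocations.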

Shortest path to fix

Step 1: Confirm it was the OOM killer

docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' my-container
# Expect: true 137

# Kernel log (run on the Docker host, not inside the container; may need sudo)
dmesg | grep -iE 'oom|killed process'
# Look for: "Memory cgroup out of memory: Killed process 1234 (node)"

If OOMKilled is false but exit is 137, something else sent SIGKILL: an orchestrator acting on a failed liveness probe, docker stop escalating after its grace period, a manual kill, or the kernel for other reasons.

Step 2: See what limit you actually have and what was used

docker stats --no-stream my-container
# MEM USAGE / LIMIT, e.g. 245.3MiB / 256MiB

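# Inside the container
cat /sys/fs/cgroup/memory.max          # cgroup v2 limit ("max" means unlimited)
cat /sys/fs/cgroup/memory.current      # cgroup v2 current usage
# cgroup v1 equivalents
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/memory.usage_in_bytes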

For Kubernetes:

kubectl top pod my-pod
kubectl describe pod my-pod | grep -A2 -i memory
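The cgroup v2 files from this step can also be read by the process itself, which is handy for logging utilization next to request metrics. A sketch assuming cgroup v2 paths; the helper names are invented for this example:

```python
from pathlib import Path


def parse_cgroup_limit(raw: str):
    """memory.max holds either a byte count or the literal 'max' (no limit)."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def memory_utilization(current_raw: str, max_raw: str):
    """Fraction of the limit in use, or None when the cgroup is unlimited."""
    limit = parse_cgroup_limit(max_raw)
    if limit is None:
        return None
    return int(current_raw.strip()) / limit


def read_cgroup_v2(root: str = "/sys/fs/cgroup"):
    """Read live values; only works inside a cgroup v2 container or host."""
    base = Path(root)
    return memory_utilization(
        (base / "memory.current").read_text(),
        (base / "memory.max").read_text(),
    )


# Worked example with captured file contents: 192 MiB used of a 256 MiB limit
print(memory_utilization("201326592\n", "268435456\n"))  # 0.75
print(memory_utilization("201326592\n", "max\n"))        # None
```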

Step 3: Profile the heap

Node.js — generate a heap snapshot and inspect in Chrome DevTools.

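# Start the process with --inspect (publish port 9229 only to a trusted network; the inspector can run arbitrary code)
node --inspect=0.0.0.0:9229 server.js

# Or have the running process write a snapshot when signalled
# (running writeHeapSnapshot via a separate `node -e` would only snapshot that new, empty process)
node --heapsnapshot-signal=SIGUSR2 server.js
docker kill --signal=SIGUSR2 my-container   # writes a Heap.*.heapsnapshot file to the process working directory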

Look for retained-size leaders. Common offenders: caches, EventEmitter listeners not removed, closures keeping large arrays alive.

Python — use tracemalloc:

import tracemalloc
tracemalloc.start(25)

# ... let it run ...

snap = tracemalloc.take_snapshot()
for stat in snap.statistics('lineno')[:20]:
    print(stat)
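A single snapshot shows what is big; comparing two snapshots shows what is growing, which is usually the more useful question for a leak. A self-contained sketch where leaky_cache and handle_request are stand-ins for your own code:

```python
import tracemalloc

leaky_cache = {}


def handle_request(i):
    # Hypothetical leak: one ~1 KiB entry per request, never evicted
    leaky_cache[i] = bytearray(1024)


tracemalloc.start(25)
before = tracemalloc.take_snapshot()

for i in range(10_000):
    handle_request(i)

after = tracemalloc.take_snapshot()

# StatisticDiff entries come back sorted by the largest change first
top = after.compare_to(before, "lineno")
for stat in top[:5]:
    print(stat)
```

The first line printed should point at the bytearray allocation inside handle_request. In a real service, take the two snapshots some minutes apart under steady traffic.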

Or memray for an ergonomic flamegraph:

pip install memray
memray run -o out.bin my_app.py
memray flamegraph out.bin

Go — pprof:

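// In the service: the blank import registers /debug/pprof/* on the default mux
import "net/http"
import _ "net/http/pprof"
go func() { _ = http.ListenAndServe(":6060", nil) }() // start this inside main()
# From a shell that can reach port 6060
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap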

Step 4: Set sensible limits with a buffer

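Aim for limit = p99_RSS * 1.5. For Node, also cap the V8 old-space heap (--max-old-space-size, in MiB) below the container limit, so V8 collects garbage harder and fails with a visible JavaScript heap out-of-memory error before the kernel kills the process. Leave the gap for buffers, stacks, and other off-heap memory.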

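# docker-compose.yml (Compose v2 applies deploy.resources.limits; legacy docker-compose v1 needs --compatibility, or use mem_limit)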
services:
  api:
    image: my/api
    deploy:
      resources:
        limits:
          memory: 768M
    environment:
      NODE_OPTIONS: "--max-old-space-size=512"

For Kubernetes, set both request and limit:

resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "768Mi"

Requests drive scheduler placement; limits set the OOM cutoff. Keep limits within 1.5x of requests to avoid noisy-neighbor surprises.
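The sizing rule at the top of this step (limit = p99_RSS * 1.5) is simple enough to script against exported RSS samples. A sketch using a nearest-rank percentile; the function names and sample data are illustrative, not from any monitoring tool:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_limit_mib(rss_samples_mib, headroom=1.5):
    """Apply the p99 * headroom rule and round up to a whole MiB."""
    return math.ceil(percentile(rss_samples_mib, 99) * headroom)


# Hypothetical week of samples: mostly ~300 MiB with peaks at 512 MiB
samples = [300] * 95 + [512] * 5
print(percentile(samples, 99))     # 512
print(suggest_limit_mib(samples))  # 768
```

With a p99 of 512 MiB the rule lands on 768 MiB, matching the request/limit pair in the example above.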

Step 5: Fix the leak, not just the limit

If RSS is monotone over hours regardless of traffic, raising the limit only delays the next OOM. Common patches:

  • Replace unbounded Map cache with an LRU (lru-cache in Node, cachetools.LRUCache in Python).
  • Add an upper bound to the DB pool and verify connections are released on error paths.
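  • Remove EventEmitter listeners when their owner is done with them, not only on shutdown; setMaxListeners only changes the warning threshold, so treat MaxListenersExceededWarning on long-lived sockets as a leak signal rather than raising it.
  • Reuse preallocated buffers (or stream the data) instead of allocating a large Buffer per request.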
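To make the first patch concrete, here is a minimal LRU built on the standard library. It is a sketch of the idea rather than a replacement for cachetools.LRUCache, and the LRUCache class below is this example's own:

```python
from collections import OrderedDict


class LRUCache:
    """Dict-like cache that evicts the least recently used entry past max_entries."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._data)


cache = LRUCache(max_entries=1000)
for i in range(100_000):  # 100k unique keys, as an unbounded Map would see
    cache.put(i, i)

print(len(cache))         # 1000
print(cache.get(99_999))  # 99999
print(cache.get(0))       # None, evicted long ago
```

Memory is now bounded by max_entries times the entry size, whatever the key cardinality.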

Step 6: Add a guardrail

Alert when the working set stays above 85 percent of the limit so you find leaks before they OOM.

# Prometheus rule
- alert: ContainerNearOOM
  expr: (container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0)) > 0.85
  for: 10m
  labels:
    severity: warning

Restart-count alert catches the silent OOM-restart loop:

- alert: PodRestartLoop
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 0m
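The for: 10m clause is what keeps a brief spike from paging anyone: the condition has to hold continuously. A rough model of that behaviour, assuming one utilization sample per minute (the function is this example's own, not a Prometheus API):

```python
def sustained_breach(ratios, threshold=0.85, min_samples=10):
    """True only if the most recent min_samples readings all exceed the threshold."""
    if len(ratios) < min_samples:
        return False
    return all(r > threshold for r in ratios[-min_samples:])


climbing = [0.80, 0.84] + [0.90] * 10  # leak: stays above the line
spike = [0.70] * 9 + [0.95]            # one-off burst

print(sustained_breach(climbing))  # True
print(sustained_breach(spike))     # False
```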

Prevention

  • Every container has a memory limit; alert at 85 percent of it.
  • p99 RSS is observed for at least a week before the limit is set.
  • Caches and pools have explicit upper bounds.
  • Heap profiling tooling shipped in the image (or attachable via sidecar) so you can dump in prod.
  • Track restart count per service; investigate every OOM rather than swallowing it.
