Docker Container Restarts With Exit Code 137 (OOM Killed): Fix It

Container exits 137 with no stack trace. That's SIGKILL from the OOM killer hitting your --memory cap. Confirm it, find the leak with a heap dump, set a sane limit, and add a guardrail.

Published: May 24, 2026 | Updated: Jun 18, 2026 | Author: AI Productivity Guide Team

The container is up for two days, then bam: restart with exit code 137. Logs end mid-sentence, no exception, no stack trace. Exit 137 is 128 + 9 = SIGKILL, and on a containerized process that almost always means the kernel OOM killer hit your --memory cap.

Fastest path: run docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' my-container. If it prints true 137, the kernel killed your process for crossing its memory limit. From there it is one of two things: the limit is too tight for normal peak (raise it after measuring), or the process is leaking and would OOM at any size (find the leak, do not just raise the limit). The rest of this page confirms which one you have, finds the leak with a heap dump, and adds an alert so it never fails silently again.
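The triage above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_exit` and its verdict strings are made up for this example, and it assumes the input is the exact "OOMKilled ExitCode" pair printed by the docker inspect command above.

```python
import signal

def classify_exit(inspect_output):
    """Interpret the 'OOMKilled ExitCode' pair printed by docker inspect."""
    oom_flag, code_text = inspect_output.split()
    exit_code = int(code_text)

    # Exit codes above 128 mean "terminated by signal number (code - 128)".
    if exit_code <= 128:
        return f"exit code {exit_code}: not killed by a signal"

    sig_name = signal.Signals(exit_code - 128).name
    if sig_name != "SIGKILL":
        return f"killed by {sig_name}, not the OOM killer"
    if oom_flag == "true":
        return "SIGKILL: container memory limit hit"
    return "SIGKILL: host OOM, liveness probe, or manual kill"

print(classify_exit("true 137"))
print(classify_exit("false 137"))
```

The second call is the case covered in the Step 1 note: the signal is the same, but the flag says the container's own limit was not the trigger.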

Which bucket are you in?

Symptom	Likely cause	Go to
OOMs on traffic spikes, RSS stable between spikes	Limit too tight for peak	Step 4
RSS climbs monotonically over hours/days regardless of traffic	Leak	Step 3, then Step 5
Predictable restart cadence (e.g. every ~36h)	Slow leak	Step 3, then Step 5
Language heap looks healthy but RSS is ~2x	Native/off-heap allocations	Step 3 (note below)
`OOMKilled: false` but exit `137`	Host-level OOM, liveness probe, or manual kill	See note in Step 1

Common causes

Ordered by hit rate.

1. Memory limit set too low for normal peak

You copied a 256 MB limit from somewhere else (--memory=256m on the command line, or limits.memory: 256Mi in a copy-pasted Helm chart). Real peak working set is 400 MB. Every traffic spike triggers OOM.

How to spot it: docker inspect shows "Memory": 268435456. RSS during traffic peaks brushes the limit.

2. Slow leak in app code

Process RSS grows monotonically over hours or days, then hits the limit. The restart cycle becomes predictable (for example, every 36 hours).

How to spot it: docker stats over time shows a monotone climb that does not fall back after traffic drops.
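One rough way to tell a monotone climb from spiky-but-stable usage is to look at the floor of RSS rather than the peaks. The sketch below assumes you have already collected RSS samples (in MiB) at a fixed interval, for example by logging docker stats; the helper name `looks_like_leak` and the window size are illustrative choices, not a standard tool.

```python
def looks_like_leak(rss_mib, window=6):
    """Return True when the floor of RSS keeps rising from window to window.

    Spikes raise the maximum of a window but not its minimum, so a floor
    that only ever goes up suggests memory that is never given back.
    """
    floors = [
        min(rss_mib[i:i + window])
        for i in range(0, len(rss_mib) - window + 1, window)
    ]
    # Require at least three windows so one noisy step is not called a leak.
    return len(floors) >= 3 and all(b > a for a, b in zip(floors, floors[1:]))

# Traffic spikes that fall back to roughly 200 MiB each time.
spiky = [200, 210, 380, 205, 200, 390, 202, 208, 385,
         201, 204, 395, 200, 206, 370, 203, 207, 380]
# A slow climb of 5 MiB per sample that never falls back.
leaky = [200 + 5 * i for i in range(18)]

print(looks_like_leak(spiky), looks_like_leak(leaky))
```

Real data is noisier than this, so treat the result as a hint for which step to go to next, not as proof.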

3. Unbounded in-process cache

A Map/dict used as a cache with no eviction. Every unique key adds an entry, so memory grows forever.

How to spot it: a heap snapshot shows one Map (or dict) instance with hundreds of thousands of entries.

4. Connection pool with no upper bound

An ORM creates a new connection per request without releasing it. Each open connection holds megabytes of buffers and driver state, so memory climbs with concurrency.

How to spot it: pool size in metrics exceeds the documented max; the heap shows many DB-driver objects.
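To make the failure mode concrete, here is a minimal sketch of a pool with a hard cap that also releases connections on error paths. `BoundedPool` and `fake_connect` are hypothetical names for this example; a real ORM or driver has its own pool settings, which you should prefer over hand-rolling one.

```python
import queue
from contextlib import contextmanager

class BoundedPool:
    """Hands out at most max_size connections; callers wait at the cap."""

    def __init__(self, factory, max_size):
        self._factory = factory
        self._slots = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._slots.put(None)  # None = free slot, connection not created yet

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks while the pool is exhausted, then raises queue.Empty.
        conn = self._slots.get(timeout=timeout)
        try:
            if conn is None:
                conn = self._factory()
            yield conn
        finally:
            # Runs on success and on exceptions, so the slot always comes back.
            self._slots.put(conn)

created = []

def fake_connect():
    """Stand-in for a real driver's connect(); records each connection made."""
    created.append(object())
    return created[-1]

pool = BoundedPool(fake_connect, max_size=2)
for _ in range(10):
    try:
        with pool.connection() as conn:
            raise RuntimeError("query failed")  # simulate an error path
    except RuntimeError:
        pass

print(len(created))  # stays at or below max_size
```

Ten failing requests reuse the same two connections instead of leaving ten open ones behind.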

5. Native allocations the heap profiler cannot see

Node Buffers, Python NumPy arrays, Go cgo allocations, and JVM/.NET off-heap memory live outside the language heap. The OOM killer charges them against your cgroup, but language-level profilers do not show them.

How to spot it: the language heap looks healthy but the cgroup RSS is roughly double. Compare cat /sys/fs/cgroup/memory.current (what the kernel charges) against your runtime’s reported heap.

Shortest path to fix

Step 1: Confirm it was the OOM killer

docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' my-container
# Expect: true 137

# Kernel log (root namespace; run on the host)
dmesg | grep -i 'oom\|killed process'
# or, with journald:
sudo journalctl -k | grep -i 'oom\|killed process'
# Look for: "Memory cgroup out of memory: Killed process 1234 (node)"

A cgroup-scoped kill prints Memory cgroup out of memory. A whole-host kill prints Out of memory: Killed process ... without the cgroup prefix.

If OOMKilled is false but exit is 137, the SIGKILL came from somewhere other than your container’s own limit:

Host ran out of RAM. Even with no --memory set, the kernel OOM killer can pick your process when the whole box is out of memory. docker inspect reports OOMKilled: false because it was a host-level, not cgroup-limit, event. Check dmesg for Out of memory without the cgroup prefix, and free -m on the host.
Orchestrator liveness probe killed an unresponsive container (Kubernetes will record this in kubectl describe pod).
Manual docker kill or a deploy that sent SIGKILL.
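If you are scanning a lot of kernel log output, the cgroup-versus-host distinction can be pulled out with a small parser. This is a sketch under the assumption that the kill lines look like the two formats quoted above; the sample lines and the name `classify_kernel_line` are illustrative, and other kernel versions may word the message differently.

```python
import re

# Matches the two kill messages described above and captures PID and command.
KILL_LINE = re.compile(
    r"(Memory cgroup out of memory|Out of memory): Killed process (\d+) \(([^)]+)\)"
)

def classify_kernel_line(line):
    """Return (scope, pid, command) for an OOM kill line, else None."""
    match = KILL_LINE.search(line)
    if match is None:
        return None
    scope = "cgroup" if match.group(1).startswith("Memory cgroup") else "host"
    return scope, int(match.group(2)), match.group(3)

# Sample lines modeled on the messages quoted in this step.
samples = [
    "[12345.678] Memory cgroup out of memory: Killed process 1234 (node)",
    "[22345.678] Out of memory: Killed process 4321 (python3)",
    "[32345.678] eth0: link up",
]
for line in samples:
    print(classify_kernel_line(line))
```

A "cgroup" result points at the container's own limit; a "host" result points at the first bullet above.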

Step 2: See what limit you actually have and what was used

docker stats --no-stream my-container
# MEM USAGE / LIMIT, e.g. 245.3MiB / 256MiB

# Inside the container (cgroup v2 — default on modern hosts)
cat /sys/fs/cgroup/memory.max          # the hard limit ("max" means unlimited)
cat /sys/fs/cgroup/memory.current      # current charged usage

From the host with the systemd cgroup driver (cgroup v2), the limit file is namespaced by container scope:

cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.max

Caveat on docker stats: the MEM USAGE column subtracts reclaimable page cache (inactive_file) from the cgroup's charged usage to approximate the working set, and it is a point-in-time sample, so it can read lower than what the kernel had charged at the moment of the kill. Native/off-heap memory is part of the cgroup charge either way; what hides it is your runtime's own heap metrics, not the kernel's accounting. If docker stats looks fine but you still OOM, trust memory.current and dmesg over the stats column.
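For logging the kernel's own numbers from inside the container, something like the following may help. It is a minimal sketch that assumes the cgroup v2 files named above (memory.current and memory.max); the function name is made up, and the demo runs against a throwaway directory with made-up values so it works on any machine.

```python
import tempfile
from pathlib import Path

def cgroup_memory_usage(cgroup_dir="/sys/fs/cgroup"):
    """Return (current_bytes, limit_bytes or None, usage ratio or None)."""
    root = Path(cgroup_dir)
    current = int((root / "memory.current").read_text().strip())
    raw_max = (root / "memory.max").read_text().strip()
    limit = None if raw_max == "max" else int(raw_max)  # "max" means unlimited
    ratio = None if limit is None else current / limit
    return current, limit, ratio

# Demo with fake files; inside a real container, call cgroup_memory_usage().
with tempfile.TemporaryDirectory() as fake_cgroup:
    Path(fake_cgroup, "memory.current").write_text("257215693\n")
    Path(fake_cgroup, "memory.max").write_text("268435456\n")
    current, limit, ratio = cgroup_memory_usage(fake_cgroup)
    print(f"{current / 2**20:.1f} MiB of {limit / 2**20:.0f} MiB ({ratio:.0%})")
```

Logging this ratio periodically gives you a trail that survives the kill, which docker stats does not.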

For Kubernetes:

kubectl top pod my-pod
kubectl describe pod my-pod | grep -A3 -iE 'memory|OOMKilled|Last State'

kubectl describe shows Last State: Terminated, Reason: OOMKilled and the restart count, which confirms the loop. Note that Kubernetes can record Reason: OOMKilled for two distinct events: a container-limit OOM (a container crossed its own limits.memory) and a node-level OOM (the node ran out of RAM and the kernel killed a process in your pod even though it was under its own limit). A kubelet eviction under node memory pressure is a third case, and it shows up on the pod as Reason: Evicted rather than OOMKilled. If kubectl describe pod shows the container under its limit but still OOMKilled, check node pressure with kubectl describe node | grep -A5 MemoryPressure and reduce overcommit by raising requests.memory.

Step 3: Profile the heap

Take two snapshots (baseline, then after load) and compare. One snapshot is a photo; two is a story. Sort by Retained Size, not Shallow Size: retained is the memory that would be freed if the object were collected, so leaks float to the top.

Node.js — the container-friendly way is a signal trigger, so you do not have to expose a debug port. Start the process with --heapsnapshot-signal=SIGUSR2 (available since Node 12; see the Node.js heap snapshot guide), then send the signal to write a .heapsnapshot into the working directory. Use SIGUSR2, not SIGUSR1 — Node reserves SIGUSR1 to open the inspector, so sending USR1 will not produce a snapshot:

node --heapsnapshot-signal=SIGUSR2 server.js
# later, from inside the container:
kill -USR2 1            # PID 1 if the app is the entrypoint
# pull the file out and open it in Chrome DevTools > Memory
docker cp my-container:/app/Heap.<timestamp>.heapsnapshot ./

Or trigger it in-process (for example, from an admin route):

require('v8').writeHeapSnapshot('/tmp/heap.heapsnapshot');

Caution in production: writing a snapshot pauses the main thread (it can take a minute on a large heap) and is built in memory, so it can briefly double RSS and tip a tight container into another OOM. Take snapshots on a replica or with headroom. Common offenders: caches, EventEmitter listeners never removed, closures keeping large arrays alive.

Python — tracemalloc for line-level attribution:

import tracemalloc
tracemalloc.start(25)

# ... let it run under load ...

snap = tracemalloc.take_snapshot()
for stat in snap.statistics('lineno')[:20]:
    print(stat)
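The two-snapshot comparison recommended at the start of this step maps onto tracemalloc's `compare_to`. The sketch below fakes a leak with a module-level list; `leaky_cache` and `handle_request` are hypothetical stand-ins for your own code.

```python
import tracemalloc

leaky_cache = []  # hypothetical leak: grows on every request, never trimmed

def handle_request():
    leaky_cache.append(bytearray(10_000))

tracemalloc.start(25)
baseline = tracemalloc.take_snapshot()

for _ in range(200):  # stand-in for "let it run under load"
    handle_request()

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Entries are ordered by how much each source line grew between snapshots.
top = after.compare_to(baseline, "lineno")
for stat in top[:5]:
    print(stat)
```

The line that allocates the bytearray should lead the list, because its size difference is far larger than anything else that changed between the two snapshots.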

Or memray for a flamegraph:

pip install memray
memray run -o out.bin my_app.py
memray flamegraph out.bin

Go — pprof:

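import "net/http"
import _ "net/http/pprof" // side-effect import: registers /debug/pprof/* on the default mux

// inside main(), alongside the app's own server:
go func() { http.ListenAndServe(":6060", nil) }()

go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap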

Step 4: Set a sensible limit with a buffer

Aim for limit = p99_RSS * 1.5, measured over at least a week of real traffic. For Node, also constrain V8’s heap so the runtime garbage-collects before the kernel kills it.

# docker-compose.yml
services:
  api:
    image: my/api
    deploy:
      resources:
        limits:
          memory: 768M
    environment:
      NODE_OPTIONS: "--max-old-space-size=512"

Keep --max-old-space-size comfortably below the container limit (here 512 MB heap inside a 768 MB cap) so native memory and other allocations have room. For Kubernetes, set both request and limit:

resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "768Mi"

requests drives scheduler placement; limits is the OOM cutoff. Keep limits within ~1.5x of requests to avoid noisy-neighbor surprises.
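The sizing rule from the start of this step is simple arithmetic, sketched here in Python. The nearest-rank percentile, the helper names, and the 70 percent heap share are assumptions chosen for illustration (the example config above uses 512 inside 768, which is closer to two thirds).

```python
import math

def p99(samples):
    """99th percentile by the nearest-rank method, using integer arithmetic."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) without floats
    return ordered[rank - 1]

def suggest_limits(rss_mib_samples, buffer=1.5, heap_percent=70):
    """Apply limit = p99_RSS * buffer, then carve out a V8 heap share."""
    peak = p99(rss_mib_samples)
    limit = math.ceil(peak * buffer)
    return {
        "p99_rss_mib": peak,
        "limit_mib": limit,
        "max_old_space_size": limit * heap_percent // 100,
    }

# 100 made-up RSS samples in MiB: mostly idle, one busy reading, one outlier.
week_of_rss = [300] * 98 + [512, 900]
print(suggest_limits(week_of_rss))
```

Using p99 rather than the maximum keeps a single outlier (the 900 MiB sample here) from inflating the limit.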

Old-runtime trap: older runtime builds read cgroup v1 files only and miss the v2 limit, so they see the host’s total RAM and size their heap far too large, then get OOM-killed as soon as the heap grows past the container limit. This bites hard after a host OS or node-image upgrade (for example moving to Amazon Linux 2023, RHEL 9, or Ubuntu 22.04+, all of which default to cgroup v2), because the cgroup version is a property of the host, not of the container image. The runtime floors that read cgroup v2 correctly, as of June 2026: OpenJDK 8u372+, 11.0.16+, or 17+ (older 8 and pre-11.0.16 builds silently fall back to host RAM), Node 12.17+ / 16+ (Node takes its container limit from the bundled libuv, so confirm the floor against the libuv version in your release line), and recent .NET. The JVM is container-aware by default (-XX:+UseContainerSupport) once you are on a supporting build; you usually do not need to set -Xmx manually, but you can override the 25 percent default with -XX:MaxRAMPercentage.
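Before blaming the runtime, it can help to confirm which cgroup version the host actually exposes. The check below is a sketch: it relies on the cgroup v2 unified hierarchy exposing a cgroup.controllers file at its root and on cgroup v1 mounting the memory controller under a memory/ directory. The function name is made up, and hybrid setups may need a closer look.

```python
import tempfile
from pathlib import Path

def cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 or 1 for the detected cgroup layout, or None if unknown."""
    base = Path(root)
    if (base / "cgroup.controllers").exists():
        return 2  # unified hierarchy (cgroup v2)
    if (base / "memory" / "memory.limit_in_bytes").exists():
        return 1  # per-controller hierarchy (cgroup v1)
    return None

# Demo against a fake v2 root; on a real host, call cgroup_version().
with tempfile.TemporaryDirectory() as fake_root:
    Path(fake_root, "cgroup.controllers").write_text("cpu memory pids\n")
    print(cgroup_version(fake_root))
```

If this reports 2 and the runtime is older than the floors listed above, an explicit heap cap (-Xmx or --max-old-space-size) is the safer stopgap until you can upgrade.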

Step 5: Fix the leak, not just the limit

If RSS is monotone over hours regardless of traffic, raising the limit only delays the next OOM. Common patches:

Replace an unbounded Map cache with an LRU (lru-cache in Node, cachetools.LRUCache in Python).
Add an upper bound to the DB pool and verify connections are released on error paths, not just the happy path.
Remove EventEmitter listeners on shutdown; for long-lived sockets, cap setMaxListeners.
Reuse Buffer.alloc() pools instead of allocating a large buffer per request.
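The first patch in the list above, sketched with only the Python standard library so it runs without installing anything. In a real service you would more likely reach for cachetools.LRUCache or lru-cache as mentioned; the class below is a simplified illustration, not a drop-in replacement for either.

```python
from collections import OrderedDict

class LRUCache:
    """Dictionary-like cache that evicts the least recently used entry."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._data)

cache = LRUCache(max_entries=1000)
for user_id in range(100_000):  # every unique key used to add an entry forever
    cache.put(user_id, {"id": user_id})

print(len(cache))  # bounded by max_entries no matter how many keys pass through
```

The cap turns "memory grows with the number of unique keys" into "memory grows up to max_entries and stops", which is the property the plain Map or dict was missing.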

Step 6: Add a guardrail

Alert when the working set crosses ~85 percent of the limit so you catch a leak before it OOMs.

# Prometheus rule
- alert: ContainerNearOOM
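  # "> 0" drops containers that report a limit of 0 (no limit set); dividing by 0 would make them fire constantly
  expr: container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0) > 0.85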
  for: 10m
  labels:
    severity: warning

A restart-count alert catches the silent OOM-restart loop:

- alert: PodRestartLoop
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 0m
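To see what the threshold plus `for: 10m` combination in the first rule is meant to do, here is a simplified Python model of it. This is not how Prometheus evaluates rules internally; the function name and the sample tuples are assumptions, and the point is only that a brief spike should not page anyone while a sustained breach should.

```python
def near_oom_firing(samples, threshold=0.85, hold_seconds=600):
    """samples: (unix_ts, working_set_bytes, limit_bytes) tuples, oldest first.

    Fires only when usage stays above the threshold for hold_seconds in a row.
    """
    breach_started = None
    for ts, working_set, limit in samples:
        # limit == 0 means "no limit set"; skip it rather than divide by zero.
        if limit > 0 and working_set / limit > threshold:
            if breach_started is None:
                breach_started = ts
            if ts - breach_started >= hold_seconds:
                return True
        else:
            breach_started = None  # any dip below the threshold resets the timer
    return False

LIMIT = 768 * 2**20
sustained = [(t * 60, int(0.9 * LIMIT), LIMIT) for t in range(11)]
brief_spike = [(t * 60, int(0.9 * LIMIT), LIMIT) for t in range(5)]
brief_spike.append((300, int(0.5 * LIMIT), LIMIT))

print(near_oom_firing(sustained), near_oom_firing(brief_spike))
```

A leak produces the sustained pattern well before the kill, which is what gives the warning time to be useful.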

How to confirm it’s fixed

After raising the limit and/or patching the leak, run the same load you reproduced the OOM with.
Watch docker stats (or Grafana container_memory_working_set_bytes): RSS should plateau, not keep climbing, and stay under ~80 percent of the limit at peak.
Re-run docker inspect --format '{{.State.OOMKilled}} {{.RestartCount}}' my-container after a full cycle: OOMKilled should be false and the restart count should stop incrementing. (RestartCount is a top-level field in docker inspect output, not part of .State.)
For a suspected leak, take a heap snapshot before and after a few hours of steady traffic; the retained-size leaders should not grow.

FAQ

Does exit 137 always mean OOM? No. 137 is 128 + 9, i.e. the process received SIGKILL. The OOM killer is the most common source inside containers, but a liveness probe, a manual docker kill, or a host-level OOM also produce 137. Check {{.State.OOMKilled}} and dmesg to be sure.

OOMKilled is false but I still got exit 137 — why? The kill came from outside your container’s own memory limit. The usual culprits are a host-level OOM (the whole machine ran out of RAM, with or without a --memory cap set), an orchestrator liveness probe killing an unresponsive container, or a manual SIGKILL during a deploy. docker inspect only sets OOMKilled: true when the kill was attributed to the container’s own cgroup limit, so a node-level kill reads false. Confirm with dmesg: a line with Out of memory: but no Memory cgroup out of memory: prefix means it was host-level.

docker stats shows I’m well under the limit, so how did it OOM? The MEM USAGE column subtracts reclaimable page cache to approximate the working set, and it is a point-in-time sample, so it can miss a short spike. The kernel enforces the limit on the full cgroup charge, which includes native/off-heap memory that your runtime’s heap metrics never show. Trust cat /sys/fs/cgroup/memory.current and the dmesg line over the stats column.

Should I just raise --memory and move on? Only if RSS is stable and you OOM only at peak (limit-too-tight). If RSS climbs monotonically regardless of traffic, that is a leak; a bigger limit just buys a longer interval before the same crash.

What’s the right gap between --max-old-space-size and the container limit? Leave headroom for native memory, threads, and buffers. A common rule is to set V8’s heap to roughly 65-75 percent of the container limit (for example, --max-old-space-size=512 inside a 768 MB cap). This lets V8 garbage-collect before the kernel steps in.

Can I take a heap snapshot in production? Carefully. The snapshot pauses the main thread and is built in memory, so it can briefly double RSS and trigger another OOM on a tight container. Prefer a replica with headroom, or the --heapsnapshot-signal=SIGUSR2 + kill -USR2 pattern so you do not have to expose an inspector port.

Prevention

Every container has a memory limit, with an alert at ~85 percent of it.
p99 RSS is observed for at least a week of real traffic before the limit is set.
Caches and pools have explicit upper bounds.
Heap-profiling tooling ships in the image (or is attachable via sidecar) so you can dump in prod.
Restart count is tracked per service, and every OOM is investigated rather than swallowed.

Tags: #Backend #Troubleshooting #docker
