gRPC DEADLINE_EXCEEDED Errors Under Load

gRPC clients return DEADLINE_EXCEEDED when traffic rises. Propagate deadlines, set sensible per-RPC timeouts, and add a retry policy plus circuit breaker.

Latency was fine at noon. By 14:00 your gRPC service is on fire: clients see DEADLINE_EXCEEDED everywhere, the upstream is also seeing DEADLINE_EXCEEDED from its dependencies, and tail latency has eaten the dashboards. The pattern is almost always the same: a slow downstream, no deadline propagation, and per-RPC timeouts so tight they all trip at once. Fix it by propagating deadlines correctly, picking per-call timeouts that match user expectations, enabling gRPC's built-in retry policy with backoff, and putting a circuit breaker on the slowest path so a single bad backend stops cascading.

Common causes

Ordered by hit rate.

1. Server is genuinely slower than the client timeout

Service is at p99 = 1.5 s under load; client deadline is 1 s. Every p99 request fails. Adding more clients makes it worse because nobody backs off.

How to spot it: Server p99 > client deadline in the matching window.

2. Deadlines are not propagated through the chain

Client sets a 5 s deadline calling service A. Service A calls service B with no deadline, so that call can run indefinitely. Service A times out but service B keeps working on a request nobody is waiting for, wasting capacity.

How to spot it: Service B never sees cancellations or DEADLINE_EXCEEDED; requests keep finishing long after the user gave up, and in-flight goroutines climb.

3. No retry policy

A transient hiccup turns into a user-visible failure because the client gives up after the first failed attempt.

How to spot it: Error rate goes up sharply during minor blips; recovers slowly.

4. Naive retry without circuit breaker

Client retries every failure. A slow backend now receives up to 3x the load because every failing call is retried twice, which slows it down further. Retry storm.

How to spot it: Backend RPS spikes during slowdowns instead of dropping.
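The amplification is easy to estimate. The sketch below assumes each attempt fails independently with the same probability, which is a simplification (real failures are correlated); `expectedAttempts` is a hypothetical helper, not part of any gRPC API.

```go
package main

import "fmt"

// expectedAttempts returns the average number of attempts per logical call
// when each attempt fails with probability p and the client stops after
// maxAttempts. It equals 1 + p + p^2 + ... + p^(maxAttempts-1).
func expectedAttempts(p float64, maxAttempts int) float64 {
	total, reach := 0.0, 1.0 // reach = probability that attempt i happens at all
	for i := 0; i < maxAttempts; i++ {
		total += reach
		reach *= p
	}
	return total
}

func main() {
	// One original attempt plus two retries, at several failure rates.
	for _, p := range []float64{0.0, 0.5, 1.0} {
		fmt.Printf("failure rate %.1f -> %.2fx backend load\n", p, expectedAttempts(p, 3))
	}
}
```

With everything failing, three attempts per call means three times the traffic hitting a backend that is already struggling.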

5. Head-of-line blocking on a single connection

HTTP/2 streams on one connection share connection-level flow control, so a slow stream can slow everyone on that connection. With the default pick_first policy, a gRPC channel typically sends every RPC over a single TCP connection.

How to spot it: P99 degrades across all endpoints together even when only one endpoint is slow.

Shortest path to fix

Step 1: Propagate deadlines through every hop

Go server-side handler — ctx carries the deadline from the caller; pass it to downstreams.

func (s *server) GetOrder(ctx context.Context, req *pb.GetOrderReq) (*pb.Order, error) {
    // ctx has the caller's deadline; do NOT replace with context.Background()
    user, err := s.userClient.GetUser(ctx, &pb.GetUserReq{Id: req.UserId})
    if err != nil { return nil, err }
    // ...
}

Node client: set one deadline at the edge. On later hops, pass on the remaining budget rather than a fresh full timeout.

import { credentials, Metadata } from '@grpc/grpc-js';

const deadline = new Date(Date.now() + 2000);   // 2 s budget
client.getOrder({ id: '...' }, { deadline }, (err, res) => { /* ... */ });

Step 2: Pick per-RPC timeouts by SLO, not by guess

Call type                 Typical deadline
Sync user-facing read     200-500 ms
Sync user-facing write    1-2 s
Background batch          30-60 s
Streaming                 none on the call; deadline on each iteration

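Set defaults on the channel (for example with the service config timeout shown in Step 3) and override them per call where a call type needs a different budget.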

Step 3: Use built-in retry policy

gRPC supports a service config-driven retry policy. Use it instead of hand-rolled retries.

{
  "methodConfig": [{
    "name": [{ "service": "shop.OrderService" }],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.05s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"]
    },
    "timeout": "2s"
  }]
}

Note: DEADLINE_EXCEEDED should not be in retryableStatusCodes unless you also extend the timeout, otherwise retries die on the same deadline.

Wire it in (Go):

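const cfg = `{ "methodConfig": [...] }` // the JSON service config above
// Recent grpc-go releases prefer grpc.NewClient, which takes the same options.
conn, err := grpc.Dial("orders:50051",
    grpc.WithDefaultServiceConfig(cfg),
    grpc.WithTransportCredentials(insecure.NewCredentials()),
)
if err != nil {
    log.Fatalf("dial orders: %v", err)
}
defer conn.Close()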

Step 4: Add a circuit breaker on the slow path

Wrap the outbound call. After N failures in a window, open the breaker for M seconds and short-circuit immediately.

import "github.com/sony/gobreaker"

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "orders",
    MaxRequests: 1,
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    ReadyToTrip: func(c gobreaker.Counts) bool {
        return c.Requests >= 20 && float64(c.TotalFailures)/float64(c.Requests) > 0.5
    },
})

res, err := cb.Execute(func() (interface{}, error) {
    return client.GetOrder(ctx, req)
})

When the breaker is open, gobreaker returns gobreaker.ErrOpenState without calling the backend. Map that to a fast UNAVAILABLE instead of waiting for the deadline. This is how you stop cascades.

Step 5: Trace the slow span

# Send a test request (assumes an OpenTelemetry collector is running locally to receive the trace)
grpcurl -d '{"id":"abc"}' -plaintext orders:50051 shop.OrderService/GetOrder

In your tracing UI, filter for status_code = DEADLINE_EXCEEDED and look at the longest child span. That is the path to optimize. Common offenders: synchronous DB query under a lock, blocking call to a third-party API without its own timeout, N+1 RPC pattern.

Step 6: Spread load across more connections

conn, err := grpc.Dial("dns:///orders.example.com:50051",
    grpc.WithDefaultServiceConfig(`{"loadBalancingConfig":[{"round_robin":{}}]}`),
    grpc.WithTransportCredentials(insecure.NewCredentials()),
)
if err != nil {
    log.Fatalf("dial orders: %v", err)
}

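Round-robin opens one subchannel (and one HTTP/2 connection) per resolved address, so this only helps when the DNS name returns several backend IPs. Spreading RPCs across those connections avoids head-of-line blocking on a single HTTP/2 connection.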

Prevention

  • Deadlines propagate through every hop; no context.Background() in handlers.
  • Per-RPC timeouts derived from SLO, not guesses.
  • gRPC built-in retry policy with bounded attempts; never retry DEADLINE_EXCEEDED without extending the budget.
  • Circuit breaker on every external dependency.
  • Tracing on by default; alert on DEADLINE_EXCEEDED rate, not just error rate.
