Latency was fine at noon. By 14:00 your gRPC service is on fire: clients see DEADLINE_EXCEEDED everywhere, the upstream is also seeing DEADLINE_EXCEEDED from its dependencies, and tail latency has eaten the dashboards. The pattern is almost always the same: a slow downstream, no deadline propagation, and per-RPC timeouts so tight they all trip at once. Fix by propagating deadlines correctly, picking per-call timeouts that match user expectations, enabling gRPC's built-in retry policy with backoff, and putting a circuit breaker on the slowest path so a single bad backend stops cascading.
Common causes
Ordered by hit rate.
1. Server is genuinely slower than the client timeout
Service p99 is 1.5 s under load; client deadline is 1 s. At least the slowest 1% of requests fail, and usually more, because every request slower than 1 s fails. Adding more clients makes it worse because nobody backs off.
How to spot it: Server p99 > client deadline in the matching window.
2. Deadlines are not propagated through the chain
Client sets a 5 s deadline calling service A. Service A calls service B without a deadline, so the call defaults to no timeout at all. Service A times out, but service B keeps working on a request nobody is waiting for, wasting capacity.
How to spot it: Service B never sees DEADLINE_EXCEEDED or cancellation, and its requests keep finishing long after the user gave up. Inflight goroutines climb.
3. No retry policy
Transient hiccup turns into a permanent failure because the client gives up after the first deadline.
How to spot it: Error rate goes up sharply during minor blips; recovers slowly.
4. Naive retry without circuit breaker
Client retries every failure. A slow backend receives up to 3x the load because every failing call is retried twice, which slows it down further. Retry storm.
How to spot it: Backend RPS spikes during slowdowns instead of dropping.
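The amplification can be estimated with simple arithmetic. The sketch below assumes each attempt fails independently with the same probability, which is a simplification (real failures are correlated during an incident); `expectedAttempts` is a hypothetical helper name.

```go
package main

import "fmt"

// expectedAttempts returns the mean number of attempts per logical call when
// each attempt fails independently with probability failRate and the client
// makes at most maxAttempts attempts (first try included).
func expectedAttempts(failRate float64, maxAttempts int) float64 {
	total, pReach := 0.0, 1.0 // pReach: probability that attempt k is made
	for k := 0; k < maxAttempts; k++ {
		total += pReach
		pReach *= failRate
	}
	return total
}

func main() {
	// One first try plus two retries, at increasing failure rates.
	for _, p := range []float64{0.01, 0.5, 1.0} {
		fmt.Printf("failRate=%.2f -> %.2fx backend load\n", p, expectedAttempts(p, 3))
	}
}
```

At low failure rates retries are nearly free; as the failure rate approaches 1, load approaches the full attempt count, which is exactly when the backend can least afford it.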
5. Head-of-line blocking on a single connection
HTTP/2 streams on one connection share a connection-level flow-control window and a single TCP stream, so a slow stream can slow everyone on that connection. Each gRPC subchannel is one TCP connection, and a channel with the default pick_first policy sends everything over a single subchannel.
How to spot it: p99 degrades across all endpoints together even when only one endpoint is slow.
Shortest path to fix
Step 1: Propagate deadlines through every hop
Go server-side handler — ctx carries the deadline from the caller; pass it to downstreams.
func (s *server) GetOrder(ctx context.Context, req *pb.GetOrderReq) (*pb.Order, error) {
	// ctx has the caller's deadline; do NOT replace with context.Background()
	user, err := s.userClient.GetUser(ctx, &pb.GetUserReq{Id: req.UserId})
	if err != nil {
		return nil, err
	}
	// ...
}
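Node client: set one deadline at the edge. On later hops, derive the deadline from what remains of the caller's budget instead of starting a fresh one.
import { credentials } from '@grpc/grpc-js';
// `client` is a generated service client created elsewhere with these credentials.
const deadline = new Date(Date.now() + 2000); // 2 s total budget, set once at the edge
client.getOrder({ id: '...' }, { deadline }, (err, res) => { /* ... */ });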
Step 2: Pick per-RPC timeouts by SLO, not by guess
| Call type | Typical deadline |
|---|---|
| Sync user-facing read | 200-500 ms |
| Sync user-facing write | 1-2 s |
| Background batch | 30-60 s |
| Streaming | none on the call, deadline on each iteration |
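Set defaults on the channel through the service config (the per-method timeout field shown in Step 3), and override per call where a specific RPC needs a tighter or looser budget.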
Step 3: Use built-in retry policy
gRPC supports a service config-driven retry policy. Use it instead of hand-rolled retries.
{
  "methodConfig": [{
    "name": [{ "service": "shop.OrderService" }],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.05s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"]
    },
    "timeout": "2s"
  }]
}
Note: DEADLINE_EXCEEDED should not be in retryableStatusCodes unless you also extend the timeout, otherwise retries die on the same deadline.
Wire it in (Go):
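const cfg = `{ "methodConfig": [...] }` // the JSON above
conn, err := grpc.Dial("orders:50051",
	grpc.WithDefaultServiceConfig(cfg),
	grpc.WithTransportCredentials(insecure.NewCredentials()), // plaintext; use TLS credentials in production
)
if err != nil {
	log.Fatalf("dial orders: %v", err)
}
defer conn.Close()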
Step 4: Add a circuit breaker on the slow path
Wrap the outbound call. After N failures in a window, open the breaker for M seconds and short-circuit immediately.
import "github.com/sony/gobreaker"

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:        "orders",
	MaxRequests: 1,
	Interval:    30 * time.Second,
	Timeout:     10 * time.Second,
	ReadyToTrip: func(c gobreaker.Counts) bool {
		return c.Requests >= 20 && float64(c.TotalFailures)/float64(c.Requests) > 0.5
	},
})
res, err := cb.Execute(func() (interface{}, error) {
	return client.GetOrder(ctx, req)
})
When the breaker is open, Execute returns gobreaker.ErrOpenState without calling the backend. Map that error to a fast UNAVAILABLE instead of waiting for the deadline. This is how you stop cascades.
Step 5: Trace the slow span
# OpenTelemetry collector running locally
grpcurl -d '{"id":"abc"}' -plaintext orders:50051 shop.OrderService/GetOrder
In your tracing UI, filter for status_code = DEADLINE_EXCEEDED and look at the longest child span. That is the path to optimize. Common offenders: synchronous DB query under a lock, blocking call to a third-party API without its own timeout, N+1 RPC pattern.
Step 6: Spread load across more connections
conn, err := grpc.Dial("dns:///orders.example.com:50051",
	grpc.WithDefaultServiceConfig(`{"loadBalancingConfig":[{"round_robin":{}}]}`),
	grpc.WithTransportCredentials(insecure.NewCredentials()),
)
if err != nil {
	log.Fatalf("dial orders: %v", err)
}
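Round-robin over multiple subchannels reduces head-of-line blocking on a single HTTP/2 connection. It only helps if the DNS name resolves to more than one backend address; with a single address there is still only one connection.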
Prevention
- Deadlines propagate through every hop; no context.Background() in handlers.
- Per-RPC timeouts derived from SLO, not guesses.
- gRPC built-in retry policy with bounded attempts; never retry DEADLINE_EXCEEDED without extending the budget.
- Circuit breaker on every external dependency.
- Tracing on by default; alert on DEADLINE_EXCEEDED rate, not just error rate.
Related
- Backend Postgres connection pool exhausted
- Backend JWT expired clock skew
- Backend RabbitMQ consumer stuck
- Backend GraphQL rate limit cascade
- Edge function timeout
- Rate limit issue
Tags: #Backend #Troubleshooting #grpc