Your GraphQL gateway is fine at 100 RPS, then one popular query starts hammering a slow downstream API. The downstream rate-limits you with HTTP 429. Now every query that touches that downstream fails, even queries that should be cached. Within seconds, requests for unrelated parts of the schema also slow down because the gateway’s connection pool is full of stuck retries. A single hot path took the whole gateway down. Fix it by adding per-resolver complexity costs, DataLoader batching, and circuit breakers that fail fast on rate-limited upstreams.
Common causes
Ordered by how often each one turns out to be the cause, most frequent first.
1. No per-query complexity limit
Apollo Server and graphql-yoga do not limit query depth or field count by default. A 500-field query can cost roughly 500 times as much as a single-field query, yet both count as one request.
How to spot it: Run graphql-query-complexity against logged queries, or check the gateway logs for query sizes. If queries with more than 200 nodes are common, no limit is being enforced.
2. Resolvers do N+1 fetches without DataLoader
The resolver for Post.author runs once per post. A query that asks for 100 posts makes 100 upstream calls for authors and hits the upstream rate limit almost immediately.
How to spot it: Count upstream calls during a single GraphQL query. The count should scale with the number of distinct fields, not with the number of records.
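One way to get that count is to wrap the upstream client and tally calls while a single query runs. The sketch below is a minimal illustration; `countCalls`, `fetchAuthor`, and `resolveAuthorsNaively` are hypothetical names, and the upstream is a stand-in rather than a real client.

```typescript
// Hypothetical helper: wraps any async upstream function and counts its calls.
type AsyncFn<A extends unknown[], R> = (...args: A) => Promise<R>;

function countCalls<A extends unknown[], R>(fn: AsyncFn<A, R>) {
  let calls = 0;
  const wrapped: AsyncFn<A, R> = (...args) => {
    calls += 1; // counted synchronously, before the upstream call starts
    return fn(...args);
  };
  return { wrapped, getCalls: () => calls };
}

// Stand-in for the real upstream client (assumption for illustration).
const fetchAuthor = async (id: string) => ({ id, name: `author-${id}` });
const counted = countCalls(fetchAuthor);

// Simulates an N+1 resolver: one upstream call per post.
async function resolveAuthorsNaively(authorIds: string[]) {
  return Promise.all(authorIds.map((id) => counted.wrapped(id)));
}
```

In a real gateway the counter would be created per request (for example in the context function) and logged with the operation name, so a call count that grows with the record count stands out.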
3. Retries on 429 instead of fail-fast
Many retry policies treat 429 like a 5xx and retry it. Each retry adds load to an upstream that is already rate-limiting you, which worsens the cascade.
How to spot it: Check the retry logic. If a 429 triggers retries, even with exponential backoff, you are deepening the hole.
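The distinction can be made explicit in a small retry classifier. This is a sketch under assumptions: `decideRetry` and its `maxAttempts` default are hypothetical and not part of any HTTP client's API.

```typescript
// Hypothetical retry classifier: decides whether a failed upstream call is retried.
type RetryDecision = { retry: boolean; reason: string };

function decideRetry(status: number, attempt: number, maxAttempts = 3): RetryDecision {
  if (status === 429) {
    // Rate limits are never retried inside the request; surface the error instead.
    return { retry: false, reason: "rate-limited: fail fast" };
  }
  if (status >= 500 && status <= 599 && attempt < maxAttempts) {
    return { retry: true, reason: "transient server error" };
  }
  return { retry: false, reason: "not retryable or attempts exhausted" };
}
```

If your existing retry code has no branch like the first one, 429s are most likely being retried along with 5xx responses.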
4. Shared connection pool across resolvers
Single Axios / undici pool of 50 connections used by all resolvers. One slow resolver fills the pool with stalled connections. Fast resolvers can no longer borrow a connection.
How to spot it: Monitor pool utilization. 100 percent with most connections stalled = pool exhausted by one resolver.
5. Persisted queries not enforced
Bots probe with random queries. Without persisted queries, each random shape triggers full execution and downstream calls.
How to spot it: Check query shape distribution. Hundreds of unique shapes per hour suggests open queries.
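A rough way to measure shape distribution from logged query strings is to hash each query after collapsing whitespace and count the distinct hashes. The helpers below are hypothetical, and the normalization is deliberately crude: it does not normalize literals, aliases, or field order.

```typescript
import { createHash } from "node:crypto";

// Collapses whitespace so formatting differences do not count as new shapes.
function shapeHash(query: string): string {
  const normalized = query.replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Number of distinct shapes in a batch of logged queries (for example, one hour of logs).
function countUniqueShapes(queries: string[]): number {
  return new Set(queries.map(shapeHash)).size;
}
```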
6. No cache layer on stable lookups
User profile, product info, taxonomy — fetched on every query. Should be cached for 60s+.
How to spot it: Group upstream calls by key. The same key fetched 100 times per second is not being cached.
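If no analyzer is available, a few lines over the upstream call log can do the job. The sketch assumes a hypothetical log record with a `key` and a millisecond timestamp, and flags keys fetched more than a threshold number of times within one second.

```typescript
// Hypothetical log record: one entry per upstream fetch.
interface UpstreamCall {
  key: string;
  timestampMs: number;
}

// Returns keys fetched more than `threshold` times within a single one-second bucket.
function findHotKeys(calls: UpstreamCall[], threshold: number): string[] {
  const perBucket = new Map<string, number>();
  const hot = new Set<string>();
  for (const { key, timestampMs } of calls) {
    const bucket = `${key}@${Math.floor(timestampMs / 1000)}`;
    const count = (perBucket.get(bucket) ?? 0) + 1;
    perBucket.set(bucket, count);
    if (count > threshold) hot.add(key);
  }
  return [...hot];
}
```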
Before you start
- Confirm the rate limit source: which upstream returned 429.
- Identify the trigger query: GraphQL operation name and document hash.
- Check gateway metrics: request latency percentiles, upstream call counts, error rates.
- Document the cascade timeline: which queries failed first, which followed.
- Roll forward, not back, unless the trigger query was recently added.
Information to collect
- The slow upstream endpoint and its documented rate limit.
- Gateway logs with operation names and durations from the cascade window.
- DataLoader hit rates if instrumented.
- Connection pool stats from undici / Axios / fetch.
- Apollo or graphql-yoga server version and config.
Step-by-step fix
Step 1: Add query complexity limit
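import { GraphQLError } from 'graphql';
import { createComplexityRule, directiveEstimator, simpleEstimator } from 'graphql-query-complexity';
const MAX_COMPLEXITY = 1000;
const ComplexityLimitRule = createComplexityRule({
  maximumComplexity: MAX_COMPLEXITY,
  // Variables are not bound during validation; pass the request's variables if costs depend on them
  variables: {},
  // Estimators run in order: directive costs first, then 1 point for any other field
  estimators: [directiveEstimator(), simpleEstimator({ defaultComplexity: 1 })],
  onComplete: (cost) => {
    metrics.histogram('graphql_query_cost').record(cost);
  },
  createError: (max, actual) =>
    new GraphQLError(`Query is too complex: ${actual}. Maximum allowed: ${max}`),
});
const server = new ApolloServer({
  schema,
  validationRules: [ComplexityLimitRule],
});
Set complexity on heavy fields:
directive @complexity(value: Int!, multipliers: [String!]) on FIELD_DEFINITION
type Query {
  posts(first: Int = 10): [Post!]! @complexity(value: 1, multipliers: ["first"])
  search(query: String!, first: Int = 10): [Post!]! @complexity(value: 5, multipliers: ["first"])
}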
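To see how a base cost and a multiplier interact with the limit, it helps to work the arithmetic by hand. The sketch assumes the estimator computes (base cost + cost of selected child fields) × multiplier, with each plain child field costing 1; check your library's estimator before relying on exact numbers. `estimateFieldCost` is a hypothetical helper, not a library function.

```typescript
// Assumed formula: (base cost + cost of selected children) * list-size multiplier.
function estimateFieldCost(baseCost: number, childCost: number, first: number): number {
  return (baseCost + childCost) * first;
}

// posts(first: 50) { id title }: two child fields at 1 point each.
const postsCost = estimateFieldCost(1, 2, 50);

// search(query: "x", first: 50) { id title }: the higher base cost is multiplied too.
const searchCost = estimateFieldCost(5, 2, 50);

// Both fields in one query; compare against the limit of 1000.
const withinLimit = postsCost + searchCost <= 1000;
```

Under these assumptions the two fields together cost 500 points, so a client could request both at `first: 50` but not at `first: 100`.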
Step 2: Add DataLoader to every N+1 resolver
import DataLoader from 'dataloader';
// The batch function returns null for ids with no matching row, so the value type is Author | null
const createAuthorLoader = () => new DataLoader<string, Author | null>(
  async (authorIds) => {
    const authors = await db.author.findMany({
      where: { id: { in: [...authorIds] } },
    });
    const byId = new Map(authors.map(a => [a.id, a]));
    // Results must line up with the input keys, in the same order
    return authorIds.map(id => byId.get(id) ?? null);
  },
  { maxBatchSize: 100, cache: true }
);
// In context per request, so cached values never cross users
const context = ({ req }) => ({
  loaders: {
    author: createAuthorLoader(),
    tagsByPostId: createTagsLoader(), // defined the same way as createAuthorLoader
  },
});
// In resolver
Post: {
  author: (post, _, { loaders }) => loaders.author.load(post.authorId),
}
DataLoader collapses 100 calls into one batch.
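The batching itself is a small idea: queue every key requested in the same tick, then issue one call for the whole queue. The following is a simplified illustration of that idea, not the dataloader library's implementation; `createTinyLoader` is a hypothetical name, and it only batches loads issued synchronously before the queued microtask runs.

```typescript
// Minimal illustration of batching; it has no caching and no max batch size.
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;

function createTinyLoader<K, V>(batchFn: BatchFn<K, V>) {
  let queue: { key: K; resolve: (v: V) => void; reject: (e: unknown) => void }[] = [];

  const flush = async () => {
    const batch = queue;
    queue = [];
    try {
      // One upstream call for every key queued so far.
      const values = await batchFn(batch.map((item) => item.key));
      batch.forEach((item, i) => item.resolve(values[i]));
    } catch (err) {
      batch.forEach((item) => item.reject(err));
    }
  };

  return {
    load(key: K): Promise<V> {
      return new Promise<V>((resolve, reject) => {
        // The first load schedules a single flush; later loads join the same queue.
        if (queue.length === 0) queueMicrotask(flush);
        queue.push({ key, resolve, reject });
      });
    },
  };
}

// Stand-in batch function that counts how often the upstream is called.
let batchCalls = 0;
const tinyAuthorLoader = createTinyLoader(async (ids: string[]) => {
  batchCalls += 1;
  return ids.map((id) => ({ id, name: `author-${id}` }));
});
```

Calling `tinyAuthorLoader.load` for several ids inside one `Promise.all` results in a single invocation of the batch function.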
Step 3: Fail fast on 429 with a circuit breaker
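import CircuitBreaker from 'opossum';
import { GraphQLError } from 'graphql';
// callUpstream must throw an error carrying the HTTP status (err.status) on non-2xx responses
const breaker = new CircuitBreaker(callUpstream, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
  // errorFilter returns true for errors that should NOT count as failures.
  // Ignore other 4xx client errors; count 429s, 5xx responses, and timeouts.
  errorFilter: (err) => err.status >= 400 && err.status < 500 && err.status !== 429,
});
breaker.fallback(() => {
  throw new GraphQLError('Upstream rate-limited, please retry shortly', {
    extensions: { code: 'RATE_LIMITED' },
  });
});
// Use in resolver
async function fetchAuthor(id: string) {
  return breaker.fire(id);
}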
When upstream is rate-limited, the breaker opens for 30s and your gateway responds in milliseconds with a clear error instead of holding connections.
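For readers new to the pattern, the breaker is a three-state machine: closed (traffic flows), open (calls are rejected immediately), and half-open (trial traffic decides whether to close again). The sketch below illustrates those transitions with a simple failure count instead of an error percentage; `TinyBreaker` is a hypothetical class for explanation only, and a maintained library is the better choice in production.

```typescript
type BreakerState = "closed" | "open" | "halfOpen";

class TinyBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  // `now` is injectable so the timing can be tested without waiting.
  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Returns false while the breaker is open, so callers can fail fast.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "halfOpen"; // reset timeout elapsed: let trial traffic through
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failure during the trial, or too many failures while closed, opens the breaker.
    if (this.state === "halfOpen" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```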
Step 4: Isolate per-upstream connection pools
import { Agent, fetch } from 'undici';
// connections caps sockets per origin; one Agent per upstream keeps the budgets separate
const upstreamPools = {
  fastDb: new Agent({ connections: 50, pipelining: 1 }),
  slowApi: new Agent({ connections: 10, pipelining: 1 }), // smaller
  search: new Agent({ connections: 20, pipelining: 1 }),
};
// Use the right agent per resolver
fetch(url, { dispatcher: upstreamPools.slowApi });
The slow API can stall its own 10-connection pool without affecting the others.
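The same isolation can be applied one level up, in application code, with a bulkhead: a cap on in-flight calls per upstream that rejects instead of queueing once the cap is reached. This is a related technique, not something the connection pool does for you, and the `Bulkhead` class below is a hypothetical sketch. Its `utilization()` value is also a convenient number to export as a metric.

```typescript
// Hypothetical bulkhead: caps in-flight calls for one upstream and rejects when full.
class Bulkhead {
  private inFlight = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.limit) {
      throw new Error("bulkhead full"); // fail fast instead of waiting for a slot
    }
    this.inFlight += 1;
    try {
      return await task();
    } finally {
      this.inFlight -= 1;
    }
  }

  // Fraction of the limit currently in use, between 0 and 1.
  utilization(): number {
    return this.inFlight / this.limit;
  }
}
```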
Step 5: Enforce persisted queries
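import { createHash } from 'node:crypto';
import { GraphQLError } from 'graphql';
// Assumption: registeredHashes is a build-time list of SHA-256 hashes of your clients' query documents
const allowedHashes = new Set<string>(registeredHashes);
const server = new ApolloServer({
  schema,
  plugins: [{
    requestDidStart: async () => ({
      async didResolveOperation({ source }) {
        const hash = createHash('sha256').update(source ?? '').digest('hex');
        if (!allowedHashes.has(hash)) throw new GraphQLError('Query is not registered');
      },
    }),
  }],
});
Only pre-registered query shapes execute. Random bot queries are rejected before any resolver runs. Note that Apollo Server's built-in persistedQueries option (automatic persisted queries) is a bandwidth optimization and does not reject unknown queries on its own.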
Step 6: Cache stable lookups in Redis
// Assumes an ioredis client (redis.setex); node-redis v4 names this method setEx
async function getUserProfile(id: string) {
  const cached = await redis.get(`user:${id}`);
  if (cached) return JSON.parse(cached);
  const user = await db.user.findUnique({ where: { id } });
  await redis.setex(`user:${id}`, 60, JSON.stringify(user)); // 60-second TTL
  return user;
}
A 60-second TTL on profile data can cut upstream calls for hot users substantially, often by 80 to 95 percent, depending on how frequently each key is requested.
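The size of the reduction depends on how often a key is requested within one TTL window. A back-of-the-envelope model, assuming steady traffic for a single key, one miss per window, and no stampede on expiry, can be sketched as follows (`estimatedUpstreamReduction` is a hypothetical helper):

```typescript
// Rough model: one miss per TTL window, every other request in the window is a hit.
function estimatedUpstreamReduction(requestsPerSecond: number, ttlSeconds: number): number {
  const requestsPerWindow = requestsPerSecond * ttlSeconds;
  if (requestsPerWindow <= 1) return 0; // at most one request per window: every request misses
  return 1 - 1 / requestsPerWindow;
}

// A key requested about 6 times per minute with a 60-second TTL.
const moderateKey = estimatedUpstreamReduction(0.1, 60);

// A key requested 20 times per minute with the same TTL.
const hotKey = estimatedUpstreamReduction(1 / 3, 60);
```

Under this model a key requested 6 times per minute saves about 83 percent of upstream calls and one requested 20 times per minute saves about 95 percent, while keys requested less than once per TTL window gain nothing from the cache.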
Step 7: Add per-resolver tracing and alerting
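import { ApolloServerPluginUsageReporting } from '@apollo/server/plugin/usageReporting';
// Usage reporting needs APOLLO_KEY and APOLLO_GRAPH_REF set in the environment
const server = new ApolloServer({
  schema,
  plugins: [
    ApolloServerPluginUsageReporting({
      // Keep variable values and headers out of the reported traces
      sendVariableValues: { none: true },
      sendHeaders: { none: true },
    }),
  ],
});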
Apollo Studio shows per-resolver p99 latency. Alert when any resolver crosses 500ms p99.
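If you compute the alert condition yourself from raw resolver timings, the p99 check is only a few lines. This is a sketch using the nearest-rank percentile method; `percentile` and `shouldAlert` are hypothetical helpers, and the 500 ms default mirrors the threshold above.

```typescript
// Nearest-rank percentile over a window of resolver durations in milliseconds.
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) throw new Error("no samples");
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// True when the window's p99 exceeds the threshold.
function shouldAlert(durationsMs: number[], thresholdMs = 500): boolean {
  return percentile(durationsMs, 99) > thresholdMs;
}
```

With 100 samples, a single slow outlier does not trip the alert, but two slow samples do, which is the behavior you want from a p99 threshold.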
Verify
- Reproduce the trigger query on staging; gateway should return rate-limit error in under 200ms, not stall.
- Run a load test at 2x your typical peak; latency p99 should stay under 500ms.
- Confirm DataLoader hit rate is over 80 percent during traffic.
- Confirm circuit breaker opens and closes correctly under simulated upstream outage.
Long-term prevention
- Add a query complexity check to the PR template for any change that adds resolvers.
- Default-add DataLoader for any list-of-children resolver.
- Standardize per-upstream connection pools at gateway startup.
- Adopt persisted queries for all production clients within the first quarter.
- Make Apollo Studio or equivalent tracing mandatory for production.
Common pitfalls
- Increasing connection pool size to mask a 429 cascade — it just delays the failure.
- Adding retries on 429 — the upstream is already overwhelmed.
- Single DataLoader instance across requests — leaks data across users.
- Setting complexity limit too high (5000+) so it never triggers in practice.
FAQ
What complexity limit should I start with? 1000 is a reasonable default for most APIs. Tighten after a week of measurement.
Does DataLoader work for resolvers that need different fields per parent? Use multiple DataLoaders, one per shape of fetch.
How long should the circuit breaker stay open? 30 to 60 seconds. Long enough for the upstream to recover, short enough that users do not notice extended outages.