GraphQL Rate Limit Cascade: Stop One Slow Resolver Taking Down the Gateway

One slow resolver hits an upstream 429 and the whole GraphQL gateway stalls. Fix it with per-query complexity costs, DataLoader batching, fail-fast circuit breakers, and per-upstream connection pools.

Published: May 23, 2026 Updated: Jun 18, 2026 Author: AI Productivity Guide Team

Your GraphQL gateway is fine at 100 RPS, then one popular query starts hammering a slow downstream API. The downstream rate-limits you with HTTP 429. Now every query that touches that downstream fails, including queries that should be served from cache. Within seconds, unrelated parts of the schema slow down too, because the gateway’s shared connection pool is full of stuck retries. A single hot path took the whole gateway down.

Fastest stabilizer right now: stop retrying 429s and make the rate-limited upstream fail fast (circuit breaker in front of it), so the gateway stops holding connections and recovers in seconds. Then prevent recurrence with per-query complexity limits, DataLoader batching, and per-upstream connection pools. The rest of this page walks through both.

Which bucket are you in?

Symptom you see	Most likely cause	Jump to
One huge query (hundreds of fields) precedes the cascade	No complexity limit	Step 1
Upstream calls scale with row count, not field count	N+1 resolvers, no DataLoader	Step 2
`429`s keep coming back in waves after a brief recovery	Retries on `429` instead of fail-fast	Step 3
Fast, unrelated queries stall during the incident	Shared connection pool exhausted	Step 4
Hundreds of unique query shapes per hour from clients you do not control	No safelisting / persisted query list	Step 5
Same lookup key fetched dozens of times per second	No cache on stable data	Step 6

Common causes, ordered by hit rate

1. No per-query complexity limit

Apollo Server and graphql-yoga accept any query depth and any field count by default. A 500-field query costs roughly 500x more than a single-field query, but you bill both as one request and one rate-limit token.

How to spot it: instrument a complexity estimator (see Step 1) and log the cost per operation, or scan gateway logs for query document length. If queries over ~200 nodes are common, no limit is enforced.

2. Resolvers do N+1 fetches without DataLoader

The resolver for Post.author runs once per post. A query asking for 100 posts fires 100 separate upstream calls for authors, hitting the upstream rate limit almost instantly.

How to spot it: count upstream calls during a single GraphQL query. The count should track the number of fields, not the number of records.

3. Retries on 429 instead of fail-fast

Neither fetch nor axios retries on its own, but retry wrappers and hand-rolled retry loops often treat 429 the same as 500 (transient, retry with backoff). Retrying a rate-limited upstream just hammers it harder and deepens the cascade.

How to spot it: read your retry config. If 429 triggers exponential backoff with retries, you are digging the hole deeper. A 429 should be retried only after the interval in its Retry-After header has elapsed, never immediately.
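As a rough illustration of that rule, the sketch below decides whether a failed call may be retried and how long to wait. The function name `retryDelayMs` is hypothetical, and it assumes the upstream sends Retry-After either as a number of seconds or as an HTTP date; anything else is treated as "do not retry".

```typescript
// Returns the delay in milliseconds before a retry is allowed,
// or null when the call should fail fast instead of retrying.
// Hypothetical helper for illustration; not part of fetch or axios.
function retryDelayMs(
  status: number,
  retryAfter: string | null,
  nowMs: number = Date.now(),
): number | null {
  if (status !== 429) return null;      // this sketch only covers rate limiting
  if (retryAfter === null) return null; // no hint from the upstream: fail fast

  const value = retryAfter.trim();
  if (/^\d+$/.test(value)) {
    return Number(value) * 1000;        // delta-seconds form, e.g. "30"
  }

  const dateMs = Date.parse(value);     // HTTP-date form
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);   // never return a negative delay
}
```

A caller would schedule the retry outside the request path (for example from a background refresh) rather than blocking the resolver for the returned delay.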

4. Shared connection pool across resolvers

A single Axios / undici pool of, say, 50 connections is shared by all resolvers. One slow resolver fills the pool with stalled connections, and fast resolvers can no longer borrow one.

How to spot it: monitor pool utilization. 100% utilization with most connections stalled on one host means the pool is exhausted by a single upstream.

5. No safelisting / persisted query list

Clients (or bots) send arbitrary query shapes. Without a persisted query list enforced at the gateway, every novel shape triggers full parse, validation, execution, and downstream calls.

How to spot it: check query-shape distribution. Hundreds of unique operation hashes per hour from production clients suggests open, unrestricted queries.

6. No cache layer on stable lookups

User profiles, product info, and taxonomies are fetched fresh on every query even though they rarely change. Cache them for 60 seconds or more.

How to spot it: run an upstream-call analyzer. The same key fetched dozens of times per second is uncached.

Before you start

Confirm the rate-limit source: which upstream returned 429, and what its documented limit is.
Identify the trigger operation: the GraphQL operation name and document hash.
Pull gateway metrics for the cascade window: request latency percentiles, per-upstream call counts, error rates.
Document the cascade timeline: which queries failed first, which followed.
Roll forward, not back, unless the trigger query shipped in a recent deploy, in which case reverting that deploy is the fastest mitigation.

Step-by-step fix

Step 1: Add a query complexity limit

The older graphql-validation-complexity package exposes a validation rule but cannot see request variables (validation runs before variables are bound), so a posts(first: $n) cost can be undercounted. As of June 2026 prefer graphql-query-complexity, which runs in Apollo Server’s didResolveOperation hook and does have access to variables. The plugin shape below is for Apollo Server 5 (the current major; Apollo Server 4 reached end-of-life on 2026-01-26 and AS5 needs Node.js 20+). The @apollo/server import path is identical on AS4 and AS5, so the same plugin runs on either.

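import { ApolloServer } from '@apollo/server';
import {
  getComplexity,
  directiveEstimator,
  fieldExtensionsEstimator,
  simpleEstimator,
} from 'graphql-query-complexity';
import { GraphQLError } from 'graphql';

const MAX_COMPLEXITY = 1000;

const complexityPlugin = {
  async requestDidStart() {
    return {
      async didResolveOperation({ request, document, schema }) {
        const complexity = getComplexity({
          schema,
          operationName: request.operationName,
          query: document,
          variables: request.variables,
          // Estimators run in order; the first one that returns a number wins.
          estimators: [
            directiveEstimator(),                      // reads @complexity from the SDL
            fieldExtensionsEstimator(),                // reads extensions.complexity (code-first schemas)
            simpleEstimator({ defaultComplexity: 1 }), // fallback: 1 per field plus its children
          ],
        });
        // `metrics` stands in for your own metrics client.
        metrics.histogram('graphql_query_cost').record(complexity);
        if (complexity > MAX_COMPLEXITY) {
          throw new GraphQLError(
            `Query is too complex: ${complexity}. Maximum allowed: ${MAX_COMPLEXITY}`,
            { extensions: { code: 'QUERY_TOO_COMPLEX' } },
          );
        }
      },
    };
  },
};

const server = new ApolloServer({ schema, plugins: [complexityPlugin] });

Declare per-field cost in the SDL with the @complexity directive that directiveEstimator reads (its arguments are value and multipliers), multiplied by the pagination argument so a large first is priced accordingly. The directive is read from each field's AST node, so this applies to schemas built from SDL, and directiveEstimator() must come before simpleEstimator in the estimators list:

directive @complexity(value: Int!, multipliers: [String!]) on FIELD_DEFINITION

type Query {
  posts(first: Int = 10): [Post!]! @complexity(value: 1, multipliers: ["first"])
  search(query: String!, first: Int = 10): [Post!]! @complexity(value: 5, multipliers: ["first"])
}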

Start MAX_COMPLEXITY at 1000; tighten after a week of measuring real traffic with the histogram above.
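To get a feel for how a pagination multiplier drives the total, here is a small worked sketch. It uses a simplified cost model (own cost plus children, times the multiplier) that is an assumption of this example and not necessarily the library's exact algorithm; the `FieldNode` shape and `estimateCost` are hypothetical names.

```typescript
// Simplified cost model for illustration only:
// cost(field) = (own cost + sum of child costs) * multiplier
interface FieldNode {
  name: string;
  cost?: number;       // own cost, defaults to 1
  multiplier?: number; // e.g. the value of `first`, defaults to 1
  children?: FieldNode[];
}

function estimateCost(field: FieldNode): number {
  const own = field.cost ?? 1;
  const childTotal = (field.children ?? []).reduce(
    (sum, child) => sum + estimateCost(child),
    0,
  );
  return (own + childTotal) * (field.multiplier ?? 1);
}

// Models: posts(first: N) { id author { name } }
const postsQuery = (first: number): FieldNode => ({
  name: 'posts',
  multiplier: first,
  children: [
    { name: 'id' },
    { name: 'author', children: [{ name: 'name' }] },
  ],
});

console.log(estimateCost(postsQuery(100))); // (1 + 1 + 2) * 100 = 400, under a cap of 1000
console.log(estimateCost(postsQuery(500))); // (1 + 1 + 2) * 500 = 2000, over a cap of 1000
```

Under this model the same query shape passes or fails purely on the value of first, which is why the estimator needs access to request variables.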

Step 2: Add DataLoader to every N+1 resolver

Create a fresh DataLoader instance per request (never share one across requests — see pitfalls) so it batches and de-dupes all loads within a single GraphQL operation.

import DataLoader from 'dataloader';

const createAuthorLoader = () => new DataLoader<string, Author | null>(
  async (authorIds) => {
    const authors = await db.author.findMany({
      where: { id: { in: [...authorIds] } },
    });
    const byId = new Map(authors.map(a => [a.id, a]));
    // Return in the SAME order as authorIds, one slot per id
    return authorIds.map(id => byId.get(id) ?? null);
  },
  { maxBatchSize: 100, cache: true },
);

// New loaders per request in the context factory
const context = async ({ req }) => ({
  loaders: {
    author: createAuthorLoader(),
    tagsByPostId: createTagsLoader(), // built the same way as createAuthorLoader (not shown)
  },
});

// In the resolver
const resolvers = {
  Post: {
    author: (post, _args, { loaders }) => loaders.author.load(post.authorId),
  },
};

DataLoader collapses 100 author lookups in one tick into a single batched fetch. Two rules that bite people: the batch function must return results in the exact order of the input keys (one slot per key, null for misses), and the input/output array lengths must match.
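The ordering rule can be isolated into a small helper and checked on its own. The sketch below is dependency-free; `alignToKeys` is a hypothetical name, and it assumes each row exposes the key it was looked up by.

```typescript
// Re-orders fetched rows to match the requested keys, one slot per key,
// with null where the upstream returned nothing for that key.
function alignToKeys<K, V>(
  keys: readonly K[],
  rows: readonly V[],
  keyOf: (row: V) => K,
): Array<V | null> {
  const byKey = new Map<K, V>();
  for (const row of rows) byKey.set(keyOf(row), row);
  return keys.map((key) => byKey.get(key) ?? null);
}

// The database returns rows in its own order and silently drops misses.
const requested = ['a3', 'a1', 'a2'];
const fetched = [{ id: 'a1', name: 'Ada' }, { id: 'a3', name: 'Cy' }];

const aligned = alignToKeys(requested, fetched, (row) => row.id);
console.log(aligned.map((row) => row?.name ?? null)); // [ 'Cy', 'Ada', null ]
```

The output has the same length as the input keys, which is the second rule the batch function has to satisfy.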

Step 3: Fail fast on 429 with a circuit breaker

This is the single highest-leverage change during an active cascade. Wrap the upstream call in opossum (v9.0.0 is the current release as of June 2026) so a rate-limited upstream trips the breaker instead of stacking stuck connections.

import CircuitBreaker from 'opossum';
import { GraphQLError } from 'graphql';

// callUpstream must REJECT with an error that carries the HTTP status as err.status.
// fetch does not reject on a 429 by itself, so throw on non-OK responses inside callUpstream.
const breaker = new CircuitBreaker(callUpstream, {
  timeout: 3000,                  // give up on a single call after 3s (opossum default is 10000)
  errorThresholdPercentage: 50,   // open once >=50% of calls fail (this is also the default)
  resetTimeout: 30000,            // probe again after 30s in halfOpen (also the default)
  // errorFilter returns TRUE for errors that should NOT count toward opening.
  // Ignore only caller mistakes (4xx other than 429). 429s, 5xx responses, timeouts
  // and network errors return false, so they DO count and trip the breaker fast.
  errorFilter: (err) =>
    typeof err.status === 'number' &&
    err.status >= 400 && err.status < 500 && err.status !== 429,
});

breaker.fallback(() => {
  throw new GraphQLError('Upstream rate-limited, please retry shortly', {
    extensions: { code: 'RATE_LIMITED' },
  });
});

async function fetchAuthor(id: string) {
  return breaker.fire(id);
}

Mind the errorFilter polarity. opossum’s own docs define it as: “an optional function that will be called when the circuit’s function fails. If this function returns truthy, the circuit’s failPure statistics will not be incremented.” So returning true tells opossum to ignore that error (do not count it toward opening), and returning false counts it. The filter above returns true only for 4xx responses other than 429, so a 429 returns false and counts. Timeout errors carry no status (opossum rejects them with the code ETIMEDOUT), so they also return false and count. A narrower filter such as (err) => err.status !== 429 does count 429s, but it returns true for timeouts and 5xx responses, which hides a stalled upstream from the breaker. Once the upstream is rate-limited the breaker opens within a few requests and the gateway returns a clear error in milliseconds instead of holding a connection open for the full timeout. The breaker probes again after resetTimeout (it moves to a halfOpen state) and closes when the probe succeeds.
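The polarity can be checked without opossum at all. The sketch below models only the documented rule (a truthy filter result means "ignore this error"); `countsTowardOpening` and the sample error objects are hypothetical and exist purely for illustration.

```typescript
type UpstreamError = { status?: number };
type ErrorFilter = (err: UpstreamError) => boolean;

// Models the documented rule: a truthy filter result means the error is ignored.
function countsTowardOpening(filter: ErrorFilter, err: UpstreamError): boolean {
  return !filter(err);
}

const inverted: ErrorFilter = (err) => err.status === 429;     // ignores every 429
const countOnly429: ErrorFilter = (err) => err.status !== 429; // counts 429, ignores the rest

const rateLimited: UpstreamError = { status: 429 };
const notFound: UpstreamError = { status: 404 };

console.log(countsTowardOpening(inverted, rateLimited));     // false: the breaker never opens on 429
console.log(countsTowardOpening(countOnly429, rateLimited)); // true: a 429 trips the breaker
console.log(countsTowardOpening(countOnly429, notFound));    // false: a 404 is ignored
```

Running a candidate filter through a table like this before deploying it is a cheap way to catch an inverted condition.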

Step 4: Isolate per-upstream connection pools

Give each upstream its own undici Agent (its own connection budget), so a stall on the slow API cannot starve the fast ones.

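// Import fetch from undici too, so the dispatcher and fetch come from the same undici version.
import { Agent, fetch } from 'undici';

// `connections` is the limit per origin, so dedicate each Agent to one upstream host.
const upstreamPools = {
  fastDb: new Agent({ connections: 50, pipelining: 1 }),
  slowApi: new Agent({ connections: 10, pipelining: 1 }), // intentionally smaller
  search:  new Agent({ connections: 20, pipelining: 1 }),
};

// Pick the right dispatcher per resolver / per upstream host
await fetch(url, { dispatcher: upstreamPools.slowApi });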

Now the slow API can saturate its own 10-connection pool without touching the other 70 connections. This is the bulkhead pattern: blast radius is contained to one upstream.
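The containment property is easy to see in a toy model. The sketch below is not undici; it is a plain in-process counter (the `Bulkhead` class is a hypothetical name) that shows how separate budgets keep one upstream's saturation from affecting another.

```typescript
// Toy bulkhead: each upstream gets its own fixed budget of in-flight slots.
class Bulkhead {
  private inFlight = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Returns false immediately when the budget is spent (no queuing in this sketch).
  tryAcquire(): boolean {
    if (this.inFlight >= this.limit) return false;
    this.inFlight += 1;
    return true;
  }

  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }

  get available(): number {
    return this.limit - this.inFlight;
  }
}

const slowApi = new Bulkhead(2);
const fastDb = new Bulkhead(3);

// Two stalled calls use up the slow API's whole budget...
slowApi.tryAcquire();
slowApi.tryAcquire();
console.log(slowApi.tryAcquire()); // false: slow API is saturated
// ...but the fast upstream still has all of its own slots.
console.log(fastDb.available);     // 3
console.log(fastDb.tryAcquire());  // true
```

With a single shared budget of 5, the same two stalled calls would have left only 3 slots for everything else, and enough stalls would have left none.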

Step 5: Safelist queries with a persisted query list

To actually reject arbitrary bot queries, you need safelisting, not plain Automatic Persisted Queries (APQ). APQ only swaps a query hash for the full string to shrink requests — it does not stop a never-seen query from running. Safelisting requires a persisted query list (PQL) that the gateway enforces.

With GraphOS / Apollo Router: register your clients’ trusted operations to a PQL at build time, then set the router’s persisted-queries security level. As of June 2026 the levels escalate from log_unknown / audit mode (logs unregistered ops as a dry run) to full safelisting, where the router rejects any operation not in the PQL. See Safelisting with persisted queries. Run audit mode first until your logs show every legitimate client operation is registered, then flip to enforce.
On a self-hosted Apollo Server without GraphOS: APQ alone is not safelisting. Maintain your own allowlist of operation hashes and reject unknown hashes in a plugin’s didResolveOperation (throw a GraphQLError with a PERSISTED_QUERY_NOT_IN_LIST code), or move enforcement to the router.

Either way, combine safelisting with the complexity limit from Step 1 — safelisting blocks unknown shapes, complexity caps the known-but-expensive ones.
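For the self-hosted case, the core of an allowlist check is a hash lookup. The sketch below is a minimal illustration under stated assumptions: the manifest contents, `checkOperation`, and the `Decision` values are hypothetical, and it hashes the exact query string, so a real setup would need clients and the manifest to agree on the same normalized document text.

```typescript
import { createHash } from 'node:crypto';

const sha256 = (document: string): string =>
  createHash('sha256').update(document).digest('hex');

// Hypothetical build-time manifest of trusted operations.
const trustedOperations = ['query Viewer { viewer { id name } }'];
const allowedHashes = new Set(trustedOperations.map(sha256));

type Decision = 'allow' | 'log_unknown' | 'reject';

// enforce = false mirrors an audit (dry-run) phase; enforce = true is safelisting.
function checkOperation(document: string, enforce: boolean): Decision {
  if (allowedHashes.has(sha256(document))) return 'allow';
  return enforce ? 'reject' : 'log_unknown';
}

console.log(checkOperation('query Viewer { viewer { id name } }', true)); // allow
console.log(checkOperation('query Dump { users { email } }', false));     // log_unknown
console.log(checkOperation('query Dump { users { email } }', true));      // reject
```

Inside a plugin, a 'reject' decision would translate into throwing the GraphQLError described above, and 'log_unknown' into a log line that feeds the audit review.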

Step 6: Cache stable lookups in Redis

async function getUserProfile(id: string) {
  const cached = await redis.get(`user:${id}`);
  if (cached) return JSON.parse(cached);

  const user = await db.user.findUnique({ where: { id } });
  await redis.setex(`user:${id}`, 60, JSON.stringify(user)); // 60s TTL
  return user;
}

A 60-second TTL on profile data typically cuts upstream calls for hot users by 80 to 95 percent, which is often enough on its own to keep you under the upstream’s rate limit. Pick the TTL from how stale the data may safely be, and remember DataLoader’s per-request cache (Step 2) handles dedup within one query while Redis handles it across queries and requests.
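The 80 to 95 percent range follows from simple arithmetic. As a rough model, assume steady traffic for one key and at most one upstream fetch per TTL window (no stampede on expiry); the function name below is hypothetical.

```typescript
// Fraction of upstream calls avoided for one key, under the assumptions above:
// without a cache every request goes upstream; with a cache only one per TTL window does.
function upstreamCallReduction(requestsPerMinute: number, ttlSeconds: number): number {
  const requestsPerWindow = requestsPerMinute * (ttlSeconds / 60);
  if (requestsPerWindow <= 1) return 0; // the key is too cold for the cache to help
  return 1 - 1 / requestsPerWindow;
}

console.log(upstreamCallReduction(5, 60));   // 5 requests per window: 4 of 5 avoided
console.log(upstreamCallReduction(20, 60));  // 20 requests per window: 19 of 20 avoided
console.log(upstreamCallReduction(0.5, 60)); // 0: fewer than one request per window
```

Under this model the quoted range corresponds to keys requested roughly 5 to 20 times per TTL window; colder keys gain little, which is why the cache pays off mainly for hot users.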

Step 7: Add per-resolver tracing and alerting

import { ApolloServerPluginUsageReporting } from '@apollo/server/plugin/usageReporting';

const server = new ApolloServer({
  schema,
  plugins: [
    ApolloServerPluginUsageReporting({
      sendVariableValues: { none: true },
      sendHeaders: { none: true },
    }),
  ],
});

Apollo Studio (GraphOS) then shows per-resolver and per-operation p99 latency. Alert when any resolver crosses 500ms p99, and alert separately on upstream 429 rate so you catch the cascade before users do. If you do not use GraphOS, the equivalent is OpenTelemetry spans per resolver exported to your APM.
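If you compute the alert condition yourself from exported spans, the check is a percentile over a window of latencies. The sketch below uses the nearest-rank method; `percentile` and `shouldAlert` are hypothetical helpers, and the 500ms default simply mirrors the threshold named above.

```typescript
// Nearest-rank percentile over a window of latency samples (milliseconds).
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(0, rank - 1)];
}

function shouldAlert(samples: readonly number[], thresholdMs = 500): boolean {
  return percentile(samples, 99) > thresholdMs;
}

// 100 samples: one slow outlier does not move p99, two do.
const oneOutlier = [...Array(99).fill(100), 900];
const twoOutliers = [...Array(98).fill(100), 900, 900];

console.log(shouldAlert(oneOutlier));  // false: p99 is still 100ms
console.log(shouldAlert(twoOutliers)); // true: p99 is now 900ms
```

Evaluating this per resolver over a short sliding window keeps a single slow request from paging anyone while still catching a sustained slowdown.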

How to confirm it’s fixed

Reproduce the trigger query on staging. The gateway should return a RATE_LIMITED error in under ~200ms, not stall for seconds.
Run a load test at 2x your typical peak. Latency p99 should stay under 500ms and the 429 from one upstream should not surface as errors on unrelated queries.
Check DataLoader batching. During traffic, upstream call count per operation should track field count, and DataLoader hit rate should be over 80 percent.
Simulate an upstream outage. Confirm the circuit breaker opens (fast errors), then closes again once the upstream recovers, without manual intervention.
Verify isolation. With the slow upstream throttled, hit a fast query and confirm its latency is unchanged.

Long-term prevention

Make a per-field complexity cost part of the PR checklist whenever a new resolver is added.
Default to DataLoader for any list-of-children resolver.
Standardize per-upstream connection pools at gateway startup; review their sizes quarterly.
Move production clients onto a persisted query list (audit mode first, then safelisting) within the quarter.
Make per-resolver tracing (GraphOS or OpenTelemetry) mandatory in production.

Common pitfalls

Increasing connection pool size to mask a 429 cascade — it just delays the failure and wastes more upstream tokens.
Adding retries on 429 — the upstream is already overwhelmed; respect Retry-After instead and let the breaker absorb the rest.
Sharing one DataLoader instance across requests — its cache leaks data across users and never expires. Always build loaders per request.
Setting the complexity limit so high (5000+) it never triggers — measure real costs first, then set the cap just above your legitimate p99.
Assuming APQ gives you safelisting — it does not. APQ is a payload-size optimization; rejecting unknown queries needs a persisted query list.

FAQ

What complexity limit should I start with? 1000 is a reasonable default for most APIs. Log per-operation cost for a week, then set the cap just above your legitimate p99 so real queries pass and abusive ones do not.

Does DataLoader work for resolvers that need different fields per parent? Use multiple DataLoaders, one per distinct fetch shape (for example one keyed by authorId, another by postId). One loader per query intent keeps batching correct.

How long should the circuit breaker stay open? A resetTimeout of 30 to 60 seconds works well — long enough for the upstream to recover, short enough that the probe retries before users notice a sustained outage.

Is APQ enough to block bot queries? No. Automatic Persisted Queries only shorten requests by sending a hash. To reject unregistered queries you need safelisting via a persisted query list (GraphOS/Apollo Router) or a custom allowlist check in a server plugin.

Should I retry a 429 at all? Only after the Retry-After interval (or a long backoff) elapses, and never on the same request path that is feeding the cascade. During an active incident, fail fast and shed load rather than retrying.

My opossum breaker never opens even when 429s pour in. Why? Almost always an inverted errorFilter. opossum ignores an error (does not count it toward opening) when errorFilter returns truthy. If you wrote errorFilter: (err) => err.status === 429, you told it to IGNORE every 429, so the breaker stays closed forever. Flip it to (err) => err.status !== 429 so the 429 returns false and counts. Be aware that this form ignores every other error, including timeouts (which carry no status) and 5xx responses; if those should open the breaker too, return true only for the errors you really want ignored. Confirm by logging breaker.stats and watching failures climb during the incident.

Tags: #Backend #Troubleshooting #graphql