Multi-Turn Jailbreak Escalates Over Many Messages (Crescendo)

An attacker shifts model behavior one message at a time until restrictions fall. Detect the escalation pattern across the conversation, reset context, and add session-level monitoring that fires before the violation turn.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team 🌐 查看中文版本

No single user message looks obviously malicious. The first few turns are friendly and on-topic. Then the requests drift: first edge cases, then hypotheticals, then restricted topics reframed as fiction or academic curiosity, then “continue the story” into territory the model would have refused in turn 1. By the time the violation lands, a defender reading any one message in isolation sees nothing wrong — the escalation is only visible across the full history. This is the published Crescendo attack (USENIX Security 2025): it “begins with a general prompt about the task and then gradually escalates the dialogue by referencing the model’s replies.” It needs no special tokens and, in the original research, succeeds in fewer than 10 queries — often under 5.

Fastest fix: per-message input filters cannot catch this, so the highest-leverage change is a session-level monitor plus periodic re-injection of your policy into long histories. If you suspect a session is mid-escalation right now, reset its context (drop the conversation history, keep only the default system prompt) and require a fresh session. The steps below give you a topic-drift monitor, policy re-injection, a roleplay flag, a turn cap, and an alert rule, in priority order.

This article is for teams running an LLM application (you control the system prompt, the history, and the request loop). If you are an end user who hit a content filter, this is not your page.

Which variant are you dealing with?

Symptom in the logs	Likely variant	Primary defense
Slow topic creep over many human turns, each referencing the model’s last reply	Crescendo (the case in this article)	Session-level topic-drift monitor + policy re-injection
One huge pasted prompt with hundreds of fake “user/assistant” Q&A pairs, then the real ask	Many-shot jailbreaking (long-context abuse)	Classify and rewrite the prompt before it reaches the model; cap injected example turns
A roleplay persona established early, then “stay in character and explain…”	Roleplay-framed escalation (a Crescendo sub-pattern)	Roleplay flag + post-response policy check
Violation appears only after the system prompt is far back in a long context	System-prompt recency decay	Re-inject a condensed policy reminder every N turns
User B’s session starts already drifted	Session-isolation bug	Verify fresh history per authenticated session

If hundreds of fabricated turns arrive in a single message, you are looking at many-shot jailbreaking, not Crescendo. Anthropic’s research showed a classify-and-rewrite filter cut that attack’s success rate from 61% to 2%; handle it at the input boundary, not with the drift monitor below.

Common causes

1. The context window accumulates “yes” precedents

Every time the model complies with a slightly off-topic or boundary-testing request, it sets a precedent in the history. Later turns build on it: “Since you already explained X, surely you can also explain Y.” The model’s earlier output is part of its own context and biases what comes next — this referencing of prior replies is the core mechanic of Crescendo.

How to spot it: review the full history for sessions that produced policy-violating output. Count how many sequential turns showed increasing topic drift before the violation. A staircase pattern (each response a bit farther from the initial policy) is the signature.

2. Fictional or roleplay framing introduces restricted content incrementally

The user establishes a roleplay scenario early: “Let’s write a story where you play a scientist.” Over subsequent turns the subject matter migrates toward what the system prompt restricts. The model, committed to the narrative, continues it.

How to spot it: track turns containing roleplay-establishment phrases — let's pretend, in this story, you are playing, as a character. Any session that establishes a roleplay context and then produces a policy-violating response likely used this pattern.

3. The model is never reminded of its policy during long sessions

For sessions past 20-30 turns, the system prompt’s instructions sit relatively far back in the context window. Models weight recency, so the now-distant system prompt can carry less influence than the recent (attacker-shaped) turns.

How to spot it: log which turn each violation occurred in. If violations cluster at turn 20+ for long sessions, system-prompt recency is a contributing factor.

4. Each individual turn passes the input filter but the sequence does not

Your input scanner checks each message in isolation. Message 14 says continue the story — no injection keywords — but in context it means “continue the policy-violating narrative established in messages 8-13.”

How to spot it: your monitoring must evaluate sequences of messages, not just individual messages. A sequence classifier that takes the last N messages as input catches gradual drift that a per-message filter misses.

5. The user leverages the model’s own stated commitments against it

The model said in turn 7: “I understand — I am playing a neutral scientist character.” In turn 14 the user invokes it: “You already said you’re a neutral scientist, so in character you would explain how…” The model treats its own prior statement as a binding commitment. This is Crescendo’s defining move — escalating by referencing the model’s replies.

How to spot it: alert when a user message quotes or directly invokes something the model said in a previous turn to justify a new request. The pattern is [model's prior words] + therefore you should now.

6. Session-level state is not reset between users

In a shared service, session state from user A bleeds into user B’s session because the application reuses conversation history incorrectly. User B starts in an already-drifted state.

How to spot it: verify that each new authenticated user session starts with a fresh message history containing only the default system prompt. Any session that begins with more than one message (system plus inherited turns) from a previous user is a session-isolation failure.

Shortest path to fix

Step 1: Implement a session-level topic-drift monitor

A per-message filter is blind to Crescendo by design. Score each turn against the session’s baseline topic and watch the trend, not any single value. Use a cheap, fast model for the scorer (any small classifier-grade model works; the example uses an inexpensive chat model).

interface TurnRecord {
  turnIndex: number;
  userMessage: string;
  modelResponse: string;
  topicScore: number;  // 0-1, how on-topic is this turn?
}

async function scoreTurnTopicRelevance(message: string, baselineTopic: string): Promise<number> {
  const result = await openai.chat.completions.create({
    model: "gpt-5.5",  // any cheap, fast model is fine for scoring
    messages: [
      {
        role: "system",
        content: `Score how relevant this message is to the topic "${baselineTopic}". Reply with a number 0-100 only.`,
      },
      { role: "user", content: message.slice(0, 500) },
    ],
    max_tokens: 5,
  });
  return parseInt(result.choices[0].message.content ?? "100") / 100;
}

function detectDriftTrend(turns: TurnRecord[]): boolean {
  if (turns.length < 5) return false;
  const recent = turns.slice(-5).map((t) => t.topicScore);
  const avg = recent.reduce((a, b) => a + b, 0) / recent.length;
  return avg < 0.4;  // threshold — tune per application
}

Because Crescendo can land in under 5 turns, set the rolling window short (the example uses 5) and tune the threshold from your own staging runs rather than a global default.

Step 2: Periodically re-inject the system prompt for long sessions

This directly counters cause 3 (recency decay). Re-insert a condensed policy reminder every N turns so the rules stay near the end of the context, not just at the top.

function buildLongSessionMessages(
  systemPrompt: string,
  history: { role: string; content: string }[],
  reinforceEveryN = 10
): { role: string; content: string }[] {
  const messages: { role: string; content: string }[] = [
    { role: "system", content: systemPrompt },
  ];

  for (let i = 0; i < history.length; i++) {
    if (i > 0 && i % reinforceEveryN === 0) {
      // Reinject a condensed policy reminder
      messages.push({
        role: "system",
        content: `[Policy reminder — turn ${i}]: ${policyReminder}`,
      });
    }
    messages.push(history[i]);
  }
  return messages;
}

Step 3: Detect roleplay-establishment patterns and set a session flag

const ROLEPLAY_TRIGGERS = [
  /let'?s\s+(pretend|imagine|role[\s-]?play)/i,
  /you\s+are\s+(playing|acting\s+as|now)\s+(a|an|the)\s+\w+/i,
  /in\s+this\s+(story|scenario|fiction|game)/i,
  /as\s+(a|an|the)\s+(character|persona|role)\s+of/i,
];

function detectRoleplayEstablishment(message: string): boolean {
  return ROLEPLAY_TRIGGERS.some((re) => re.test(message));
}

// Set a session flag
if (detectRoleplayEstablishment(userMessage)) {
  session.roleplayActive = true;
  logger.info({ event: "roleplay_session_started", sessionId: session.id, turn: session.turnCount });
}

Step 4: Cap session length and offer a fresh-session continuation

const MAX_TURNS_PER_SESSION = 30;

if (session.turnCount >= MAX_TURNS_PER_SESSION) {
  return {
    message:
      "This conversation has reached its maximum length. Please start a new session to continue.",
    shouldReset: true,
  };
}

Step 5: Alert on sessions with detected roleplay plus policy violations

Run a post-response policy check on every output (not just flagged turns), and treat roleplay-plus-violation as the highest-signal combination.

async function postTurnAudit(session: Session, latestResponse: string): Promise<void> {
  if (!session.roleplayActive) return;

  const isPolicyViolation = await checkPolicyViolation(latestResponse, session.systemPrompt);
  if (isPolicyViolation) {
    logger.error({
      event: "roleplay_escalation_violation",
      sessionId: session.id,
      turnCount: session.turnCount,
      responseSummary: latestResponse.slice(0, 300),
    });
    // Alert security team and suspend session
    await suspendSession(session.id);
  }
}

How to confirm it’s fixed

Replay a known Crescendo sequence in staging. Take a benign-looking 5-to-10-turn escalation toward a topic your system prompt restricts and run it through the full request loop.
Confirm the alert fires before the violation turn, not after. Your topic-drift monitor (Step 1) should trip while the conversation is still in the “edge cases / hypotheticals” zone.
Verify the context reset actually clears history. After a cap or suspend, inspect the next request payload: it must contain only the default system prompt and the new user turn — no inherited messages.
Check that re-injected reminders land at the right interval by logging the assembled messages array for a 25-turn session and confirming a policy reminder appears every N turns.
Test session isolation by alternating two authenticated users through the same worker and confirming neither sees the other’s history.

If the alert only fires after the violating output is already returned, shorten the rolling window and lower the drift threshold until it trips earlier.

Prevention

Monitor topic drift at the session level, not just per message — a sequence of slightly off-topic messages is more informative than any individual message.
Re-inject condensed policy reminders into the history every N turns for long sessions to counteract recency bias.
Set a maximum session turn limit and require users to start a new session, preventing unbounded context accumulation. Because Crescendo can succeed in under 5 turns, the cap is a backstop, not the primary defense.
Alert on sessions that contain both roleplay-establishment patterns and later policy-violating output — this combination is the highest-signal indicator.
Run a post-response policy check on every model output, not just flagged turns.
Handle many-shot jailbreaking separately at the input boundary: classify and rewrite prompts that arrive with large numbers of fabricated conversational examples (Anthropic’s classify-and-rewrite cut that attack from 61% to 2%).
Review flagged sessions with a human analyst on a regular cadence to catch new escalation patterns automation has not yet learned.
Ensure session isolation is correct: every new user session begins with only the default system prompt and no history from previous sessions.
Test your monitoring by running known multi-turn escalation sequences in staging and verifying alerts fire before the violation turn.

FAQ

Q: Can I prevent multi-turn escalation by just telling the model “do not be manipulated over multiple turns”? A: It helps a little — the model becomes aware of the pattern — but it is not a complete fix. Crescendo works precisely by getting the model to honor its own earlier replies, and a persuasive escalation can still succeed. Session-level monitoring and context resets are far more reliable than a single instruction.

Q: How many turns does a typical multi-turn escalation take? A: Less than you would expect. The Crescendo research found most tasks succeed in fewer than 10 queries, and many in under 5. Do not assume you have until turn 20 to react; your monitor should trip in the early-to-middle turns, while the conversation is still drifting.

Q: How is this different from “many-shot jailbreaking”? A: Crescendo is a multi-turn conversation that escalates by referencing the model’s prior replies. Many-shot jailbreaking is a single huge prompt that stuffs hundreds of fake question/answer pairs into a long context window before the real request. They need different defenses: Crescendo needs session-level drift monitoring; many-shot needs an input-side classifier that rewrites or rejects prompts loaded with fabricated example turns.

Q: Should I tell users their session is being monitored for escalation? A: For most consumer applications, disclosing that conversations are monitored for policy compliance (not necessarily this specific attack) is legally advisable and builds trust. You do not need to publish the detection logic itself.

Q: Is this harder to pull off against models with longer context windows? A: Not really. Models now ship with 1M-token context as standard (Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro, as of June 2026), and a longer window cuts both ways: the model remembers the escalation chain more clearly, but it also remembers its own earlier policy-compliant responses. The attack still works; only the pacing changes. Large windows also make many-shot jailbreaking cheaper, which is a separate reason to filter at the input.

External references: the Crescendo attack paper (USENIX Security 2025), Anthropic’s Many-shot jailbreaking research, and the OWASP Top 10 for LLM Applications (2025).

Tags: #ai-security #prompt-injection #Troubleshooting

Which variant are you dealing with?

Common causes

1. The context window accumulates “yes” precedents

2. Fictional or roleplay framing introduces restricted content incrementally

3. The model is never reminded of its policy during long sessions

4. Each individual turn passes the input filter but the sequence does not

5. The user leverages the model’s own stated commitments against it

6. Session-level state is not reset between users

Shortest path to fix

Step 1: Implement a session-level topic-drift monitor

Step 2: Periodically re-inject the system prompt for long sessions

Step 3: Detect roleplay-establishment patterns and set a session flag

Step 4: Cap session length and offer a fresh-session continuation

Step 5: Alert on sessions with detected roleplay plus policy violations

How to confirm it’s fixed

Prevention

FAQ

Related

Related Articles

Agent Leaked an API Key in Its Output: Rotate and Lock It Down

Roleplay Bypasses Your AI Content Filter

AI Follows Malicious Instructions Hidden in an Uploaded File

Your AI Tool Accidentally Wrote Phishing Content

Data Exfiltration via Image URL

Prompt Injection Hidden Inside a PDF