AI Accidentally Assisted in Crafting Phishing Content

An AI assistant helped write a convincing phishing email or credential-harvesting page without recognizing the intent. How to detect the pattern and add intent-detection guardrails.

Your customer communication assistant generated what looked like a routine account-notification email: a credible sender name, official-looking formatting, a sense of urgency, and a link to “verify your credentials.” A security team member reviewing outbound drafts recognized the pattern: the email had all the hallmarks of a spear-phishing message. The user who prompted it described it as a “template for a security awareness training exercise,” but the resulting copy was indistinguishable from a real attack.

AI assistants are highly effective at writing persuasive, well-formatted text. That makes them useful for legitimate security training, but it also means they can inadvertently produce content that would be effective if misused. The observable signal in your system is a model output that combines urgency language, a credential-request call to action, impersonation of a trusted sender, and an external link. This article covers how to detect these patterns in output monitoring and how to build intent-verification gates for sensitive content categories.

Common causes

1. “Security awareness training” framing removes guardrails

Users often frame phishing-content requests as legitimate training exercises: “Write a realistic phishing email for our security awareness program.” The stated purpose may well be legitimate, but the output is the same regardless of intent. Many models produce the content because the training framing is plausible.

How to spot it: Review output logs for content that simultaneously contains: impersonation of a trusted organization + urgency language + credential-request calls to action. The three-factor combination is the signal, not the stated purpose.

2. Incremental generation via multi-turn avoids per-turn detection

The user first asks for “a formal account notification email template.” Then asks to “add urgency.” Then asks to “add a login link.” Each individual step appears innocuous; the assembled result is a phishing template.

How to spot it: Compare the final assembled output against phishing signal patterns, not just individual turns. A session that produces a credential-request email through incremental refinement should be flagged at the assembly stage, even when no single turn looked suspicious.
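
The difference between per-turn and assembled checks can be sketched as follows. This is a minimal illustration, not a production classifier: the SIGNALS patterns, the DraftTurn shape, and the assessSession helper are all hypothetical names introduced here, and the patterns would need tuning against real traffic.

```typescript
// Hypothetical signal patterns (assumptions for illustration only).
const SIGNALS: Record<string, RegExp> = {
  urgency: /\b(urgent|immediately|within\s+\d+\s+hours?|will\s+be\s+(suspended|locked|closed))\b/i,
  credentialRequest: /\b(verify|confirm|reset|update)\s+your\s+(account|identity|credentials?|password)\b/i,
  link: /https?:\/\/\S+/i,
};

// Returns the names of the signals present in a piece of text.
function matchedSignals(text: string): string[] {
  return Object.entries(SIGNALS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([name]) => name);
}

interface DraftTurn {
  request: string; // what the user asked for in this turn
  draft: string;   // the full draft returned after this turn
}

// Flags a session when the latest full draft carries every signal,
// even if no single user request carried any of them.
function assessSession(turns: DraftTurn[]): { perTurnMax: number; assembled: string[]; flagged: boolean } {
  const perTurnMax = Math.max(0, ...turns.map((t) => matchedSignals(t.request).length));
  const finalDraft = turns.length > 0 ? turns[turns.length - 1].draft : "";
  const assembled = matchedSignals(finalDraft);
  return { perTurnMax, assembled, flagged: assembled.length === Object.keys(SIGNALS).length };
}
```

Scoring the latest full draft rather than each request means a session built from three innocuous instructions is still evaluated as the single email it produces.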

3. Content generation tools have no output intent classifier

The application applies an input classifier (blocking obvious requests) but no output classifier. Incremental multi-turn requests, as described in cause 2, pass the input classifier and still produce policy-violating content.

How to spot it: Check whether your content moderation runs on model outputs, not only on user inputs. Many pipelines have input filters but no output classifiers.
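
One way to make the gap visible is to route every exchange through a single function that records which stage blocked it. The sketch below assumes the model call has already returned; moderateExchange, Check, and ModerationResult are hypothetical names, and the two checks are supplied by the caller.

```typescript
// A check returns true when the text violates policy.
type Check = (text: string) => boolean;

interface ModerationResult {
  blocked: boolean;
  stage?: "input" | "output"; // which check blocked the exchange, if any
  text?: string;              // the output, present only when nothing was blocked
}

// Applies one check to the user's prompt and a separate check to the
// model's output, so a clean prompt cannot carry a bad output through.
function moderateExchange(
  prompt: string,
  output: string,
  violatesInputPolicy: Check,
  violatesOutputPolicy: Check,
): ModerationResult {
  if (violatesInputPolicy(prompt)) return { blocked: true, stage: "input" };
  if (violatesOutputPolicy(output)) return { blocked: true, stage: "output" };
  return { blocked: false, text: output };
}
```

If your logs never show a block with stage "output", that is worth investigating: either the output check is not wired in, or it is too permissive.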

4. The system prompt for a “marketing email” tool is too broad

An email drafting tool has a permissive system prompt: “Write compelling emails for the user.” No restrictions on impersonation, urgency tactics, or credential harvesting calls to action are present. The tool produces anything the user describes.

How to spot it: Review your email drafting tool’s system prompt for explicit prohibitions on: impersonating specific organizations, creating fake login pages, and using deceptive urgency (e.g., “your account will be suspended in 24 hours”).
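
The review can be partly automated with a keyword audit that runs whenever the system prompt changes. This is a rough sketch under stated assumptions: REQUIRED_PROHIBITIONS and missingProhibitions are hypothetical, and the check only confirms that a topic is mentioned, not that it is actually prohibited, so a human still needs to read the prompt.

```typescript
// Topics the system prompt is expected to address (hypothetical list).
const REQUIRED_PROHIBITIONS: Array<{ topic: string; pattern: RegExp }> = [
  { topic: "impersonation", pattern: /impersonat/i },
  { topic: "credential or login requests", pattern: /credential|log-?in|password/i },
  { topic: "deceptive urgency", pattern: /urgen/i },
];

// Returns the topics that the system prompt never mentions.
function missingProhibitions(systemPrompt: string): string[] {
  return REQUIRED_PROHIBITIONS
    .filter(({ pattern }) => !pattern.test(systemPrompt))
    .map(({ topic }) => topic);
}
```

Run against the permissive prompt quoted above, “Write compelling emails for the user.”, all three topics are reported as missing.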

5. Legitimate social engineering simulations lack operator verification

A security company’s legitimate pen-testing platform needs this capability. But your general-purpose assistant tool has no way to distinguish a verified pen-tester from a regular user. Anyone can claim the “training” context.

How to spot it: Check whether your application has any user verification or elevated permission flow for security-testing content. If any user can invoke “security training mode” without verification, the gate is missing.

6. Generated HTML page combines form + credential field + external action URL

Beyond email, users can ask for “a realistic login page template for training.” The model generates an HTML form that submits credentials to an external URL. The HTML is functional and indistinguishable from a credential-harvesting page.

How to spot it: Scan generated HTML for the combination of: <form action="..."> with an external action URL + <input type="password">. This combination in generated output is high-signal regardless of stated purpose.

Shortest path to fix

Step 1: Add an output phishing-signal classifier

interface PhishingSignals {
  hasImpersonation: boolean;
  hasUrgencyLanguage: boolean;
  hasCredentialRequest: boolean;
  hasExternalLoginLink: boolean;
}

function detectPhishingSignals(text: string): PhishingSignals {
  return {
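    // Generic greetings and "your account at <brand>" phrasing typical of impersonated notices.
    hasImpersonation: /dear\s+(valued\s+)?(customer|user|member)|your\s+(account|subscription)\s+at\s+\w+/i.test(text),
    hasUrgencyLanguage: /(immediately|within\s+24\s+hours?|account\s+will\s+be\s+(suspended|closed|locked)|urgent|action\s+required)/i.test(text),
    hasCredentialRequest: /(verify\s+your\s+(identity|account|credentials?|password)|log\s+in\s+to\s+confirm|click\s+here\s+to\s+(reset|verify|update)\s+your\s+password)/i.test(text),
    // Replace yourdomain\.com with your own host. The delimiter check after the host stops
    // look-alike hosts such as yourdomain.com.attacker.example from passing as trusted.
    hasExternalLoginLink: /https?:\/\/(?!yourdomain\.com(?:[\/:?#]|$))[^\s]+\/(login|signin|verify|reset|account)/i.test(text),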
  };
}

function isPhishingTemplate(signals: PhishingSignals): boolean {
  const signalCount = Object.values(signals).filter(Boolean).length;
  return signalCount >= 3;  // 3 or more signals = high risk
}

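// Runs inside the request handler, after the model call and before the response is sent.
// `modelResponse` and `logger` are provided by the surrounding application.
const output = modelResponse.choices[0].message.content ?? "";
const signals = detectPhishingSignals(output);
if (isPhishingTemplate(signals)) {
  logger.error({ event: "phishing_template_detected", signals });
  return { blocked: true, reason: "Output flagged as potential phishing content." };
}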

Step 2: Add explicit prohibitions to the email drafting system prompt

You are an email drafting assistant.

PROHIBITED CONTENT — never produce:
- Emails that impersonate specific organizations (banks, government agencies, social media platforms) without explicit operator authorization.
- Urgency language combined with credential-reset or login links ("your account will be suspended — click here to verify").
- HTML forms that submit credentials to external URLs.
- Content designed to deceive recipients about the sender's identity.

If a request appears to involve these patterns, ask the user to clarify the legitimate business purpose before proceeding.

Step 3: Require explicit operator verification for security-training content

const SECURITY_TRAINING_USERS = new Set<string>(); // populated by admin verification flow

function canGenerateSecurityTrainingContent(userId: string): boolean {
  return SECURITY_TRAINING_USERS.has(userId);
}

if (requestedSecurityTraining && !canGenerateSecurityTrainingContent(req.user.id)) {
  return res.status(403).json({
    error: "Security-training content requires operator verification. Contact your admin to request access.",
  });
}
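
How the verified set gets populated is left to your admin flow. One possible shape, shown here as a sketch only, records who approved each grant and when it expires, so access is reviewable and does not persist indefinitely. TrainingGrant, grantSecurityTrainingAccess, and hasActiveTrainingGrant are hypothetical names, and a real deployment would persist grants rather than hold them in memory.

```typescript
interface TrainingGrant {
  approvedBy: string; // admin who verified the user
  expiresAt: Date;    // grants lapse and must be renewed
}

// In-memory store for illustration; a real system would use a database.
const trainingGrants = new Map<string, TrainingGrant>();

// Records a grant made by an admin. Self-approval is rejected.
function grantSecurityTrainingAccess(userId: string, approvedBy: string, expiresAt: Date): void {
  if (approvedBy === userId) {
    throw new Error("Users cannot approve their own access.");
  }
  trainingGrants.set(userId, { approvedBy, expiresAt });
}

// True only when the user has a grant that has not yet expired.
function hasActiveTrainingGrant(userId: string, now: Date = new Date()): boolean {
  const grant = trainingGrants.get(userId);
  return grant !== undefined && grant.expiresAt.getTime() > now.getTime();
}
```

Passing the current time as a parameter keeps the expiry check easy to test.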

Step 4: Add session-level intent tracking for email drafting

interface EmailDraftSession {
  addedImpersonation: boolean;
  addedUrgency: boolean;
  addedLoginLink: boolean;
  combinedSignalCount: number;
}

function updateSessionSignals(session: EmailDraftSession, newContent: string): void {
  if (/impersonat|on behalf of|posing as/i.test(newContent)) session.addedImpersonation = true;
  if (/urgent|within \d+ hours?|suspended|locked/i.test(newContent)) session.addedUrgency = true;
  if (/login|verify.{0,30}credentials|reset.{0,30}password/i.test(newContent)) session.addedLoginLink = true;

  session.combinedSignalCount = [
    session.addedImpersonation,
    session.addedUrgency,
    session.addedLoginLink,
  ].filter(Boolean).length;

  if (session.combinedSignalCount >= 2) {
    logger.warn({ event: "incremental_phishing_pattern", session });
  }
}

Step 5: Scan generated HTML for credential-harvesting patterns

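function detectHarvestingForm(html: string): boolean {
  // Matches quoted and unquoted type attributes.
  const hasPasswordInput = /<input\b[^>]*\stype\s*=\s*["']?password\b/i.test(html);
  // Replace yourdomain\.com with your own host. The delimiter check after the host stops
  // look-alike hosts such as yourdomain.com.attacker.example from passing as trusted.
  const hasExternalAction = /<form\b[^>]*\saction\s*=\s*["']?https?:\/\/(?!(?:localhost|yourdomain\.com)(?:[\/:?#"'\s>]|$))/i.test(html);
  return hasPasswordInput && hasExternalAction;
}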

if (detectHarvestingForm(generatedHtml)) {
  logger.error({ event: "credential_harvesting_html_detected" });
  throw new Error("Generated HTML contains a credential-harvesting form pattern.");
}
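
String matching on the raw action attribute is sensitive to how the URL is written. A complementary sketch, assuming the standard URL constructor is available in your runtime, resolves each form action against the page URL and compares the hostname with an allowlist. TRUSTED_HOSTS, isTrustedHost, and untrustedFormActions are hypothetical names, and the regex-based attribute extraction is a simplification; an HTML parser would be more robust.

```typescript
// Hypothetical allowlist of hosts your application legitimately posts to.
const TRUSTED_HOSTS = ["yourdomain.com"];

// True when the host is an allowlisted domain or a subdomain of one.
function isTrustedHost(hostname: string): boolean {
  const host = hostname.toLowerCase();
  return TRUSTED_HOSTS.some((trusted) => host === trusted || host.endsWith("." + trusted));
}

// Returns the resolved action URLs that point at untrusted hosts.
function untrustedFormActions(html: string, pageUrl: string): string[] {
  const actionPattern = /<form\b[^>]*\saction\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi;
  const flagged: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = actionPattern.exec(html)) !== null) {
    const raw = match[1] ?? match[2] ?? match[3] ?? "";
    try {
      // Resolves relative and protocol-relative actions against the page URL.
      const resolved = new URL(raw, pageUrl);
      const isWeb = resolved.protocol === "http:" || resolved.protocol === "https:";
      if (isWeb && !isTrustedHost(resolved.hostname)) flagged.push(resolved.href);
    } catch {
      flagged.push(raw); // unparseable action: surface it for review
    }
  }
  return flagged;
}
```

Comparing parsed hostnames treats login.yourdomain.com as trusted and yourdomain.com.attacker.example as external, which prefix matching on the raw string cannot distinguish reliably.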

Prevention

  • Apply an output phishing-signal classifier to all text generated by email or content drafting tools, rather than relying on input filters alone.
  • Add explicit prohibitions to email-tool system prompts covering impersonation, urgency deception, and credential-harvesting link patterns.
  • Require verified operator authorization for any user requesting security-awareness training content.
  • Track session-level signals: impersonation + urgency + credential link combinations across turns, not just per-message.
  • Scan generated HTML for credential-harvesting patterns (<form action="external"> + <input type="password">).
  • Log all email content generated by your application and retain it for 30 days so security teams can review if a phishing complaint is received.
  • Brief your support team: if a user reports receiving a suspicious email traced back to your platform, treat it as a security incident.
  • Review your content policies quarterly against the latest social engineering patterns — phishing techniques evolve faster than static keyword lists.
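
The 30-day retention item in the list above can be sketched as a small record type plus two helpers. This is an illustration under stated assumptions: GeneratedEmailRecord, splitByRetention, and findRecordsContaining are hypothetical names, and the storage layer is out of scope.

```typescript
interface GeneratedEmailRecord {
  userId: string;  // who generated the content
  createdAt: Date; // when it was generated
  content: string; // the generated email text
}

const RETENTION_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Splits records into those still inside the retention window and those past it.
function splitByRetention(
  records: GeneratedEmailRecord[],
  now: Date,
): { retained: GeneratedEmailRecord[]; expired: GeneratedEmailRecord[] } {
  const cutoff = now.getTime() - RETENTION_DAYS * MS_PER_DAY;
  return {
    retained: records.filter((r) => r.createdAt.getTime() >= cutoff),
    expired: records.filter((r) => r.createdAt.getTime() < cutoff),
  };
}

// Finds retained records containing a phrase quoted in a phishing complaint.
function findRecordsContaining(records: GeneratedEmailRecord[], phrase: string): GeneratedEmailRecord[] {
  const needle = phrase.toLowerCase();
  return records.filter((r) => r.content.toLowerCase().includes(needle));
}
```

When a complaint arrives, searching retained records for a distinctive phrase from the reported email gives the security team the user and timestamp to investigate.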

FAQ

Q: Can I safely support legitimate security awareness training without enabling abuse? A: Yes, with verified access control. Legitimate security-training platforms require that users authenticate with verifiable credentials (e.g., corporate email domain, contract confirmation). A general-purpose assistant tool should not offer this capability to anonymous users.

Q: Is this a model safety issue or an application design issue? A: Both. Model providers train safety policies to reduce phishing assistance, but those policies are probabilistic and can be bypassed through incremental or reframed requests. The application must add its own output monitoring and intent-verification gates as a second layer.

Q: If a user explicitly says they are building a training exercise, does that change the liability? A: Legally, this varies by jurisdiction. Operationally, treat any functional phishing content your system produces as a risk regardless of stated intent: it is just as dangerous whether or not the requester’s purpose was legitimate.

Q: Should I watermark AI-generated content to enable attribution if it is misused? A: AI content watermarking is an active research area. Until reliable watermarking is available, the most practical controls are audit logs (who generated what, when), access controls (who can generate sensitive content), and output retention policies.

Tags: #ai-security #prompt-injection #Troubleshooting