AI Qualitative Coding: Code Transcripts Like a Trained Researcher

Use AI to run open and axial coding on interview transcripts at scale, with Cohen's kappa reliability checks that catch hallucinated codes before they reach your findings.

Published: May 17, 2026 Updated: Jun 06, 2026 Author: AI Productivity Guide Team

TL;DR

A long-context model (Claude Opus 4.7, Gemini 3.1 Pro, or GPT-5.5 — all 1M tokens as of June 2026) can produce a first-pass codebook from interview transcripts in minutes instead of days. The catch: a single run is not reliable enough to publish. Run the same prompt 3-5 times, keep only codes that recur, and measure agreement against a human-coded 10% sample with Cohen’s kappa. A Dec 2025 study (arXiv 2512.20352) of ensemble LLM thematic analysis reached κ = 0.84-0.91 — “almost perfect” on the Landis & Koch scale — at roughly $0.15-0.20 per transcript versus $20-40 for human coding. Treat AI as a fast first coder, never the final word.

The task

You have a stack of interview transcripts, customer-support chats, survey open-ends, or diary studies. You need to find recurring themes (what users call “the thing”) and turn dozens of hours of conversation into a set of codes, definitions, and representative quotes you can defend in a readout or a paper.

When AI is the right tool

You have 10-100 transcripts and manual coding would take weeks.
You already have a coding framework (open, axial, or deductive) you want applied consistently across every transcript.
You need a first pass so a human researcher refines instead of starting from zero.
Your team needs faster signal between research cycles to keep product decisions moving.

When not to rely on AI alone

Academic publications where methodological transparency and an auditable trail are required — most journals still expect a human-verified codebook.
Sensitive topics (medical, legal, abuse) where misreading a quote has real consequences.
Small samples (under 5 transcripts) — manual coding is faster and more accurate, and the model has too little context to generalize.
Languages or cultures the model has thin training data for, where idiom and irony get flattened.

What to feed the model

The transcripts, with speaker labels intact and identifiers anonymized (strip names, emails, and employer before pasting).
The coding framework — a list of predefined codes, or an instruction to do open coding.
The research question in one sentence (“what blocks first-time users from completing setup?”).
Excerpts of how you coded similar data before, as exemplars. Two or three good ones sharply improve agreement.
A stop-list: codes too general to be useful (“user feedback”, “general comment”).
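
The anonymization step can be partly scripted before anything is pasted into a model. The sketch below is a minimal illustration, not a complete de-identification tool: the `anonymize` helper, the email pattern, and the sample names are all assumptions made for this example, and a regex pass will miss identifiers it was not told about, so a manual read-through is still needed.

```python
import re

# Assumption: a simple pattern that catches common email shapes, not every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def anonymize(text: str, known_terms: dict[str, str]) -> str:
    """Replace emails and a hand-maintained list of names/employers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Longest terms first, so a full name is replaced before any shorter term inside it.
    for term in sorted(known_terms, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
        text = pattern.sub(known_terms[term], text)
    return text

# Hypothetical transcript line and replacement map.
raw = "Dana Whitfield: I emailed dana.w@acme.example after Acme Corp switched tools."
redacted = anonymize(raw, {"Dana Whitfield": "P1", "Acme Corp": "[EMPLOYER]"})
print(redacted)
```

Mapping each real name to a stable pseudonym such as `P1` keeps the speaker labels usable for quote attribution later.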

How much fits in one prompt

A 60-minute interview transcript runs roughly 8,000-12,000 words, or about 11,000-16,000 tokens. With a 1M-token window you can technically load 40-60 transcripts at once, but quality degrades long before the limit — models lose track of line references and conflate speakers across a giant blob. Cap each run at 5-10 transcripts grouped by participant segment or interview wave, then merge codebooks. On ChatGPT Plus, in-app context is closer to 320 pages (full 1M only on the $200 Pro tier as of June 2026), so batch even smaller there.
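
A rough way to plan batches under that cap is sketched below. The tokens-per-word ratio is an assumption derived from the word and token ranges quoted above, and the `segment`, `id`, and `words` fields are hypothetical names for whatever metadata you keep per transcript.

```python
from collections import defaultdict

TOKENS_PER_WORD = 1.35  # assumption: implied by ~8,000-12,000 words ≈ 11,000-16,000 tokens
MAX_PER_BATCH = 10      # upper end of the 5-10 cap

def estimate_tokens(word_count: int) -> int:
    """Very rough token estimate; real counts depend on the model's tokenizer."""
    return round(word_count * TOKENS_PER_WORD)

def plan_batches(transcripts, max_per_batch=MAX_PER_BATCH):
    """Group transcripts by segment, then split each group into capped batches."""
    by_segment = defaultdict(list)
    for t in transcripts:
        by_segment[t["segment"]].append(t)
    batches = []
    for segment, items in by_segment.items():
        for start in range(0, len(items), max_per_batch):
            chunk = items[start:start + max_per_batch]
            batches.append({
                "segment": segment,
                "ids": [t["id"] for t in chunk],
                "est_tokens": sum(estimate_tokens(t["words"]) for t in chunk),
            })
    return batches

# Hypothetical study: 12 new-user interviews and 4 power-user interviews.
transcripts = (
    [{"id": f"new-{i}", "segment": "new_user", "words": 10_000} for i in range(12)]
    + [{"id": f"pow-{i}", "segment": "power_user", "words": 9_000} for i in range(4)]
)
batches = plan_batches(transcripts)
for b in batches:
    print(b["segment"], len(b["ids"]), b["est_tokens"])
```

Each batch then gets its own run of the prompt, and the resulting codebooks are merged afterwards.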

Copy-ready prompt

Replace each [bracketed] placeholder with your own text before sending.

You are assisting a qualitative researcher with thematic coding.

Research question: [research_question]
Coding approach: [open / axial / deductive]
Predefined codes (if any): [predefined_codes]
Stop-list (codes to never use): [stop_list]
Exemplar coding from prior data: [exemplars]

Transcripts to code:
"""
[transcripts]
"""

Output:
1. A table of codes:
   - Code name (2-4 words)
   - 1-sentence definition
   - 2 verbatim quotes that anchor the code (with speaker + line reference)
   - Frequency across transcripts
2. A short axial section: which codes cluster into 3-5 higher-order themes.
3. A "boundary cases" list: 3-5 quotes that were hard to code, with your reasoning.
4. A flag list: any code where you are less than 70% confident.

Rules:
- Quote text must appear verbatim in the source. Do not paraphrase.
- Cite speaker and line number for every quote.
- If a quote does not fit any code, place it in "uncoded — needs human review".
- Do not invent codes that lack at least 2 supporting quotes.

Run it three to five times, not once

Relying on a single pass is the biggest source of unreliable AI coding. The same transcript fed to the same model twice will produce slightly different code sets, and that run-to-run variance is the signal: it shows which codes are stable and which are not. The published ensemble method runs each transcript 3-6 times at temperature 0, then keeps only codes that surface in a majority of runs. Codes that appear in one run and vanish in the next are the unstable, often hallucinated ones you want to drop before they reach your analysis.

Practical recipe:

Run the prompt 3-5 times on the same batch.
Keep codes that appear in a majority of runs; quarantine the rest.
Hand-code a random 10% of segments yourself.
Compute Cohen’s kappa between your codes and the model’s consensus codes.
If κ falls below 0.61, tighten your definitions and stop-list, then re-run.
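
The majority-vote step in the recipe above can be sketched as follows. This is a minimal illustration, not the published method's implementation: it assumes each run has already been reduced to a set of code names, and that the same code carries the same name in every run (in practice you may need to merge near-duplicate names first). The `consensus_codes` helper and the sample codes are made up for the example.

```python
from collections import Counter

def consensus_codes(runs: list[set[str]]) -> tuple[set[str], set[str]]:
    """Split codes into those seen in a strict majority of runs and the rest."""
    counts = Counter(code for run in runs for code in run)
    threshold = len(runs) / 2
    kept = {code for code, n in counts.items() if n > threshold}
    quarantined = set(counts) - kept
    return kept, quarantined

# Hypothetical code sets from three runs on the same batch.
runs = [
    {"setup friction", "pricing confusion", "trust in data"},
    {"setup friction", "pricing confusion"},
    {"setup friction", "onboarding delight"},
]
kept, quarantined = consensus_codes(runs)
print(sorted(kept))
print(sorted(quarantined))
```

Quarantined codes are not deleted here; they stay available for a human to review, which matches the "quarantine the rest" step.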

What “good agreement” actually means

Cohen’s kappa corrects raw agreement for chance. The standard reference is the Landis & Koch (1977) scale:

Kappa (κ)	Interpretation	Safe to publish?
0.81 - 1.00	Almost perfect	Yes, with human spot-check
0.61 - 0.80	Substantial	Yes for internal readouts
0.41 - 0.60	Moderate	Refine codebook first
0.21 - 0.40	Fair	Do not rely on it
0.00 - 0.20	Slight	Recode manually
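
To make the chance correction concrete, here is a minimal from-scratch sketch of kappa, computed as (observed agreement − expected agreement) / (1 − expected agreement), plus a lookup against the bands in the table. It assumes each segment carries exactly one code from each coder; multi-code segments need a different setup. The function names and sample labels are illustrative only.

```python
from collections import Counter

def cohen_kappa(human: list[str], model: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same segments with one code each."""
    if len(human) != len(model) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    # Observed agreement: share of segments where both coders chose the same code.
    p_o = sum(h == m for h, m in zip(human, model)) / n
    # Expected agreement by chance, from each coder's own label frequencies.
    h_counts, m_counts = Counter(human), Counter(model)
    p_e = sum(h_counts[c] * m_counts[c] for c in h_counts) / (n * n)
    if p_e == 1.0:
        # Both coders used one identical label throughout; kappa is undefined,
        # and this sketch returns 1.0 as a labeled assumption.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

def landis_koch(kappa: float) -> str:
    """Map a kappa value to the Landis & Koch (1977) band."""
    bands = [(0.81, "almost perfect"), (0.61, "substantial"),
             (0.41, "moderate"), (0.21, "fair"), (0.0, "slight")]
    for floor, label in bands:
        if kappa >= floor:
            return label
    return "poor"  # below zero: worse than chance

# Hypothetical 10-segment validation sample with one disagreement.
human = ["A", "A", "B", "B", "A", "B", "A", "B", "A", "B"]
model = ["A", "A", "B", "B", "A", "B", "A", "B", "A", "A"]
kappa = cohen_kappa(human, model)
print(round(kappa, 3), landis_koch(kappa))
```

Raw agreement in this sample is 90%, but because each coder would match the other half the time by chance, kappa comes out lower.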

In the arXiv 2512.20352 study on an art-therapy interview transcript (Dec 2025), the ensemble method reached κ = 0.907 with Gemini 2.5 Pro, 0.853 with GPT-4o, and 0.842 with Claude 3.5 Sonnet — all in “almost perfect” territory. Those were the shipping models at the time; the June 2026 successors (Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.7/Sonnet 4.6) are stronger, so treat those numbers as a conservative floor. The same study used a dual metric: kappa for label agreement plus cosine similarity (92-95%) for semantic consistency, since two coders can pick different labels that mean the same thing.
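
The second half of that dual metric, cosine similarity, is simple to compute once each code label or definition has been turned into a vector. The sketch below shows only the similarity calculation; the embedding step is left out, and the three-dimensional vectors are made-up stand-ins, since the embedding model used in the study is not described here.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for two differently named codes with similar meaning.
setup_friction = [0.9, 0.1, 0.3]
onboarding_blockers = [0.8, 0.2, 0.4]
print(round(cosine_similarity(setup_friction, onboarding_blockers), 3))
```

A high similarity between two differently named codes suggests the coders agree in substance even where kappa counts a label mismatch.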

Cost and tool comparison

The economic case is the reason teams reach for AI coding at all. As of June 2026:

Approach	Cost per transcript	Speed	Notes
Human coder	$20-40	Hours	Gold standard; needed for the 10% check
LLM ensemble (API)	~$0.15-0.20	Minutes	3-5 runs at temperature 0
NVivo (AI add-on)	$295-595 / yr license	Mixed	Traceable, strong export; AI is assistive
ATLAS.ti AI Lab	$395-595 / yr license	Mixed	“AI-assisted manual”; expect to weed out first-order codes
Dedoose	$14.99 / mo	Mixed	Cloud, mixed-methods, good for remote teams

Dedicated QDA software (NVivo, ATLAS.ti, Dedoose) still wins where you need an auditable trail and journal-grade traceability. A raw model run wins on speed and cost for early signal. Many teams now do both: model for the first pass, QDA tool to document and defend the final codebook.

How to check the output

Verify every quote actually appears in the transcript and is attributed to the right speaker. This is where hallucination hides — models will confidently fabricate a plausible-sounding quote.
Sanity-check frequencies: a code that appears once is an observation, not a theme.
Stress-test boundary cases with a colleague. Disagreement is where insight lives.
Confirm the model honored the stop-list; it tends to drift back to generic codes over long batches.
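
The quote check in the first item above is the easiest one to automate. The sketch below assumes the transcript is available as a list of speaker turns (the `speaker` and `text` keys are hypothetical) and that a quote sits inside a single turn. It normalizes whitespace and curly quotes, then looks for an exact, case-sensitive substring match.

```python
import re

def normalize(text: str) -> str:
    """Collapse whitespace and unify curly quotes so trivial differences don't fail the match."""
    text = text.translate(str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"}))
    return re.sub(r"\s+", " ", text).strip()

def verify_quote(quote: str, speaker: str, transcript: list[dict]) -> str:
    """Return 'ok', 'wrong speaker', or 'not found' for one model-cited quote."""
    needle = normalize(quote)
    found_elsewhere = False
    for turn in transcript:
        if needle in normalize(turn["text"]):
            if turn["speaker"] == speaker:
                return "ok"
            found_elsewhere = True
    return "wrong speaker" if found_elsewhere else "not found"

# Hypothetical two-turn transcript.
transcript = [
    {"speaker": "P1", "text": "I gave up at the billing step.  It asked for a card before I saw anything."},
    {"speaker": "Interviewer", "text": "What happened next?"},
]
print(verify_quote("It asked for a card before I saw anything.", "P1", transcript))
print(verify_quote("What happened next?", "P1", transcript))
print(verify_quote("The billing page was confusing.", "P1", transcript))
```

Anything that comes back as "not found" is either a paraphrase or a fabrication, and both should go back for human review rather than into the report.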

Common mistakes

Running the prompt once and treating the output as final. Always ensemble.
Accepting paraphrased quotes — they are unusable for reporting and a red flag for fabrication.
Skipping the kappa check because the model “sounded confident”.
Coding at too coarse a level, so every theme blurs into “users want better UX”.
Pasting 40 transcripts into one prompt and trusting the line references; batch smaller.

FAQ

Can AI replace qualitative researchers? No. It accelerates coding but cannot interpret context, irony, or what is left unsaid. The published κ = 0.84-0.91 results all depend on human-defined frameworks and a human-coded validation sample.
Which model should I use? Any 1M-context model works (Claude Opus 4.7, Gemini 3.1 Pro, GPT-5.5). In the Dec 2025 benchmark the Gemini variant scored highest on kappa (0.907 vs 0.853 vs 0.842), but the spread was small; consistency across runs matters more than which model you pick.
How do I compute Cohen’s kappa? Build a confusion matrix of your codes versus the model’s on the same 10% sample, then use any stats package (cohen_kappa_score from scikit-learn’s sklearn.metrics in Python, or the irr package in R). NVivo and Dedoose calculate it for you.
How many transcripts per prompt? Group by participant segment or interview wave and cap at 5-10, even though the window holds more. Quality drops before the token limit.
How do I handle multilingual data? Code in the source language, then translate only the quotes you cite in the report — never translate first, because translation strips the nuance you are coding for.

For complementary patterns, see user feedback clustering prompts, the user feedback clustering AI workflow, and customer discovery questions AI. For background on chance-corrected agreement, see the Landis & Koch (1977) benchmark scale.

Tags: #Data analysis #Workflow #Research
