Using ChatGPT and other LLMs to analyse interviews and qualitative data - what works and what doesn't

Simple LLM tools like ChatGPT, Gemini, Claude, Grok or Copilot seem like the perfect solution for qualitative analysis: fast, easy, and impressive in demos. But practical experience and research reveal critical limitations that make them unsuitable for real analysis. Here's what you need to know before turning to AI with your research.


Picture this: You are a consultant who, after two long weeks, has just finished 25 expert interviews for a due diligence project. Or you're a user researcher who has just done 40 customer interviews about product experience. Or an EU policy analyst in Brussels who has just received a massive stack of 500 consultation feedback statements to analyse.

Seeing your desperate face, your colleague suggests: "Hey, why not just upload it all to ChatGPT and ask it to find the themes?"

It's a tempting idea. Tools like ChatGPT, Claude, Copilot, Gemini or Grok are free to use, fast, and impressive in demonstrations. Upload your transcripts, do some "prompt engineering" to create a query asking for themes and insights, and within seconds you have what looks like professional analysis. No expensive software, no weeks of manual coding, no learning complex methodology.

But here's the problem: what looks like good analysis isn't the same as good analysis. Recent research from academic institutions worldwide has confirmed what practitioners already knew: there are serious, fundamental limitations that make ChatGPT unsuitable for rigorous qualitative analysis.

If you're considering using ChatGPT for interview analysis, survey open-ends, or any qualitative research, here's what you need to know about what works, what fails, and what alternatives exist that actually deliver reliable results.


The attractive promise: why chat-based AI tools seem perfect for qualitative analysis

Let's be fair: ChatGPT does have genuine appeal for qualitative analysis, and it's not hard to see why so many researchers and analysts are experimenting with it.

Speed: Traditional thematic analysis of 25 interviews takes 40-80 hours of manual coding and theme development. ChatGPT produces an analysis in minutes.

Accessibility: No expensive software licenses. No weeks learning NVivo, ATLAS.ti or other clumsy previous-generation qualitative analysis tools. Just upload your data, write a prompt, get results.

Impressive first impressions: The analysis ChatGPT produces looks professional. It identifies themes, provides structure and produces quotes that seemingly come from your data. In a 15-minute demo, it's genuinely impressive.

Reducing manual drudgery: Researchers spend 1-3 hours manually coding each interview. The promise of automation is powerfully attractive - it's the difference between a week at the office and a week at the beach...

For business contexts with tight deadlines, or academic researchers facing large datasets, ChatGPT appears to solve a genuine problem. Which is why thousands of people are trying it for qualitative analysis right now.


The critical failures: what research reveals about ChatGPT's and other AI tools' limitations

There is a Finnish saying "Moni kakku päältä kaunis, vaan on silkkoa sisältä" translating literally as "Many cakes are beautiful on the surface, but just plain bread inside" or figuratively as "All that glitters is not gold". The promise of AI being a silver bullet for analysis unfortunately doesn't survive contact with reality.

Multiple peer-reviewed studies published in 2024-2025 have systematically tested ChatGPT and similar basic LLM tools for qualitative analysis, and the results reveal fundamental problems that make simple one-shot LLM methods unsuitable for serious research or business analysis.

These aren't just theoretical concerns. Researchers worldwide have documented their frustrations when attempting to use ChatGPT-type rudimentary AI tools for real qualitative analysis.

  • Philipp Mayring, testing both ChatGPT 3.5 and 4 for content analysis, found they "led in both versions at most to rough approximations of the sample solution with a large number of gross errors."
  • Ana Canhoto, a marketing professor, discovered that "ChatGPT is not consistent in its reading of a text. Slight variations in the phrasing of the prompt produce dramatically different interpretations."
  • Dr Wilf Nelson, after extensive testing for qualitative research, concluded that "ChatGPT doesn't see much nuance in human behaviour" and found it fundamentally "not suitable for conducting qualitative analysis due to its limitations."

The pattern is clear: what looks impressive in a quick demo fails when subjected to rigorous research standards.

Problem 1: Hallucinations - making up data that wasn't there

This is the most serious issue. ChatGPT doesn't just analyse your data; it sometimes invents things.

Morgan (2023) in the International Journal of Qualitative Methods found that "hallucination happens because each response depends on the continuing context... the software is continually trying to predict what comes next, which means that one misstep in its responses can be magnified in subsequent responses."

What this means in practice: You ask ChatGPT to identify themes in customer interviews. It returns a theme called "frustration with mobile app performance" supported by what looks like relevant quotes. But when you check the original transcripts, those exact phrases don't exist. ChatGPT has synthesised plausible-sounding quotes from fragments, or worse, invented them entirely.
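
If you do experiment with a chatbot, one basic safeguard is to check every quote it returns against your source transcripts before relying on it. Below is a minimal Python sketch of that check; the normalisation, fuzzy-matching threshold and file names are illustrative assumptions, not a standard procedure.

  import re
  from difflib import SequenceMatcher

  def normalise(text: str) -> str:
      # Lower-case and collapse whitespace so trivial differences don't hide a match
      return re.sub(r"\s+", " ", text.lower()).strip()

  def quote_found(quote: str, transcript: str, threshold: float = 0.9) -> bool:
      # True if the quote appears (near-)verbatim anywhere in the transcript
      q, t = normalise(quote), normalise(transcript)
      if q in t:
          return True
      # Fall back to a fuzzy scan to catch lightly paraphrased "quotes"
      window = len(q)
      step = max(1, window // 2)
      best = max(
          (SequenceMatcher(None, q, t[i:i + window]).ratio()
           for i in range(0, max(1, len(t) - window + 1), step)),
          default=0.0,
      )
      return best >= threshold

  # Flag every AI-generated quote that cannot be traced back to a transcript
  ai_quotes = ["The mobile app keeps crashing when I try to pay"]                 # pasted from the chatbot
  transcripts = [open(p, encoding="utf-8").read() for p in ["interview_01.txt"]]  # your own files

  for quote in ai_quotes:
      if not any(quote_found(quote, t) for t in transcripts):
          print(f"POSSIBLE HALLUCINATION: {quote!r}")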

Multiple studies have documented this problem. Nguyen-Trung (2025) in Quality & Quantity noted that "hallucinations have been documented in other studies of GenAI thematic analyses and present a substantial threat to its validity and trustworthiness." The conclusion was blunt: "LLM hallucination is inevitable and unavoidable."

For academic research, this destroys validity. For business decisions, it means basing strategy on insights that don't actually exist in your data. For policy analysis, it risks misrepresenting stakeholder positions. Even individual "minor" hallucinations destroy the credibility of the whole piece of research, as shown for example by the fallout from Deloitte's "vibe consulting" report: the firm had to pay the Australian government back after hallucinated quotes were discovered.

Problem 2: Inconsistency - different prompts, different results

Ask ChatGPT to analyse your interviews. Get a set of themes. Ask it again tomorrow with a slightly different prompt. Get different themes. Which analysis is correct? You have no way to know. With a human analyst, you could ask them to explain their evolving thinking and rationale; ask the LLM why it changed the categorisation and you will get a very plausible-sounding explanation that has nothing to do with the model's complex internal workings.

Lee et al. (2024) in the Journal of Medical Internet Research identified "output being prompt-dependent" as a major challenge, noting that "prompts requesting the same output but phrased differently will lead to different outputs."

What this means in practice: You can't reproduce your analysis. A stakeholder questions a finding. You can't go back and verify it because ChatGPT might give you different results. For peer-reviewed research, this fails basic reproducibility standards. For consulting work, you can't defend your analysis when challenged. "The AI chatbot told me so" is not an answer...
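
You can see how unstable this is on your own data by comparing two runs of the "same" request. A minimal sketch, assuming you have simply pasted the theme lists from two chatbot sessions into the script (the themes below are made up for illustration):

  # Two runs of the "same" analysis, pasted from separate chatbot sessions (made-up data)
  run_1 = {"pricing concerns", "onboarding friction", "feature requests", "support delays"}
  run_2 = {"cost sensitivity", "onboarding friction", "integration issues"}

  def jaccard(a: set, b: set) -> float:
      # Overlap between two theme sets: 1.0 means identical, 0.0 means nothing in common
      return len(a & b) / len(a | b) if (a | b) else 1.0

  print(f"Theme overlap: {jaccard(run_1, run_2):.0%}")   # 17% in this made-up example
  print("Only in run 1:", run_1 - run_2)
  print("Only in run 2:", run_2 - run_1)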

This makes ChatGPT fundamentally unsuitable for any work where you need to demonstrate: "Here's how we reached this conclusion, and here's how someone else could verify it."

Problem 3: Missing nuanced themes and minority insights

ChatGPT-type chatbot tools can find obvious, frequently mentioned themes. They struggle badly with subtle patterns, minority viewpoints, or interpretive insights requiring contextual understanding.

Sakaguchi, Sakama, and Watari (2025), in a comparative study in the Journal of Medical Internet Research, found that while "ChatGPT demonstrates strong capabilities in detecting widely recurring qualitative themes, its performance in identifying less frequently mentioned or nuanced themes remains limited compared to human analysis." Morgan (2023) noted: "ChatGPT performed reasonably well, but in both cases it was less successful at locating subtle, interpretive themes, and more successful at reproducing concrete, descriptive themes."

What this means in practice: In due diligence interviews, the most valuable insights often come from what 2-3 experts mentioned that others didn't. In policy consultations, minority views require equal representation. In customer research, outlier experiences often reveal critical product issues.

ChatGPT-type tools systematically de-prioritise or miss these insights. They give you the obvious patterns everyone agrees on, but miss the nuanced, high-value insights that require interpretation and contextual understanding.

Problem 4: Data reduction instead of condensation

Proper qualitative analysis of, for example, interviews involves condensing data: abstracting essential meanings while preserving important details. LLMs tend toward data reduction: simply omitting portions of the data.

Nguyen-Trung (2025) identified this clearly: "Rather than condensing the data (abstracting data while preserving its essential meanings), it tends to reduce data (omits portions) and could even hallucinate or make up responses." The author also identifies a major issue with even modern LLMs: their "capacity to do complex tasks involving long-text files and multiple steps remains limited... this limitation stems from GPT-4's context window, defined as 'the maximum number of tokens that can be used in a single request'."
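
A back-of-envelope calculation shows why this bites on real studies. The numbers below are rough, illustrative assumptions (transcript length, tokens per word and the 128,000-token window all vary by model and study), but the order of magnitude is the point:

  # Rough, illustrative arithmetic: why a full interview study overflows one request
  interviews      = 40
  words_each      = 9_000      # roughly one hour of transcribed conversation
  tokens_per_word = 1.3        # common rule of thumb for English text
  context_window  = 128_000    # e.g. a GPT-4-class model; varies by model

  total_tokens = int(interviews * words_each * tokens_per_word)
  print(f"Study size: ~{total_tokens:,} tokens")                   # ~468,000 tokens
  print(f"Fits in one request: {total_tokens <= context_window}")  # False
  print(f"Overflow factor: {total_tokens / context_window:.1f}x")  # ~3.7x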

What this means in practice: You have 40 interviews. ChatGPT processes them and identifies 8 themes. But which parts of the data did it ignore? What quotes didn't fit its theme structure and got dropped? You have no way to know what you're missing.

This is particularly dangerous for policy work where comprehensiveness matters, or business analysis where the critical insight might be the thing mentioned only twice.

Problem 5: Black box with zero transparency

When you use ChatGPT for analysis, you get results. But you can't trace how it reached those conclusions. You can't see its reasoning. You can't verify that every relevant data point was considered.

A recent article by Nguyen and Welch (2025) in the journal Organizational Research Methods concludes that "LLM chatbots are the wrong tool for qualitative data analysis. Researchers using LLM chatbots become trapped in an infinite loop of chatbot conversation."

What this means in practice: A client asks: "How do you know customers feel this way about pricing?" You can point to a ChatGPT-generated theme, but you can't show the systematic process that identified this pattern across multiple interviews. You can't demonstrate that every pricing-related comment was captured and considered.

For academic research, this fails peer review standards for methodological rigour. For consulting, you can't provide the audit trail clients expect for high-stakes decisions. For policy work, you can't demonstrate that all stakeholder voices were fairly represented.


Why traditional tools aren't much better

At this point, some readers are thinking: "Fine, so ChatGPT has problems. I'll just use NVivo, ATLAS.ti, MAXQDA or some other qualitative analysis tool instead."

These tools solve some problems (no hallucinations, reproducible analysis, clear audit trails) but create others. As we discussed in our comparison of qualitative analysis tools, traditional QDA software requires weeks to learn, costs £800-1,600 annually, and still requires 1-3 hours of manual coding per interview.

For a typical study of 30 interviews, you're looking at 60-90 hours of painstaking manual work. For business contexts with two-week deadlines, this is simply not feasible, which is why most people fall back on Excel, post-it sorting or other "hacks" when doing thematic analysis in a business setting. For academic researchers with limited time, it means accepting smaller sample sizes or sacrificing depth.

The fundamental problem: qualitative analysis has been stuck between two unsatisfying options:

  1. Fast but unreliable (ChatGPT and similar LLM approaches)
  2. Rigorous but impossibly slow (manual analysis with traditional tools)

The solution: systematic AI-assisted analysis with two-way transparency

The breakthrough comes from recognising what ChatGPT gets wrong: it tries to analyse data at query time, retrieving and interpreting on-the-fly. As we explained in Why RAG doesn't work for qualitative research, this approach is fundamentally unsuited to systematic analysis.

The alternative is to structure data systematically upfront, creating a stable, transparent framework that can then be queried, refined, and analysed without the inconsistency and hallucination problems of real-time LLM interpretation.

This is the approach Skimle takes: systematic analysis following established thematic analysis methodology, using AI to automate the mechanical coding work while preserving the rigour and transparency that serious analysis requires. Skimle follows the bottom-up qualitative analysis process of proper coding and category creation that our co-founder Henri Schildt used for more than a dozen published academic articles, but uses hundreds of atomic LLM calls to automate each step of the workflow.
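
To make the contrast concrete, here is a deliberately simplified sketch of the "many small calls" idea: each transcript segment is coded in its own isolated step against an explicit codebook, so no single request ever needs to hold the whole study. The code_segment() function below is a keyword-matching placeholder standing in for one small LLM call; this is a conceptual illustration, not Skimle's actual pipeline.

  # Conceptual sketch of per-segment ("atomic") coding - not Skimle's actual pipeline
  codebook = ["pricing concerns", "onboarding friction", "support delays"]

  def code_segment(segment: str, codebook: list[str]) -> list[str]:
      # Placeholder: in a real workflow this would be one small LLM call that
      # sees only this segment plus the codebook, nothing else
      return [code for code in codebook if code.split()[0] in segment.lower()]

  transcripts = {
      "interview_01": ["The pricing jumped 40% at renewal.", "Onboarding took us three weeks."],
      "interview_02": ["Support tickets sit for days before anyone replies."],
  }

  # One coding step per segment: the study never has to fit a single context window
  coded = [
      {"doc": doc, "segment": seg, "codes": code_segment(seg, codebook)}
      for doc, segments in transcripts.items()
      for seg in segments
  ]
  for row in coded:
      print(row)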

Skimle difference #1: Two-way transparency

One critical innovation is two-way transparency: the ability to trace from any theme back to the specific quotes supporting it, and from any document forward to all the themes it contributes to.

Theme to data: Click on "pricing concerns" and see every customer quote about pricing, across all interviews, with context. No hallucinations because every claim is linked to actual data. No missing nuanced points because systematic processing captured everything.

Data to themes: Open any interview and see exactly which themes this respondent's comments contributed to. Verify that the analysis captured everything important this person said, or adjust / recode as needed.

This two-way transparency solves the fundamental trust problem with ChatGPT analysis. When a stakeholder asks "How do you know this?", you don't point to an AI-generated summary you can't verify. You show them the complete audit trail from raw data to insight, in a format as clear as an Excel table, so people can trace the argument.
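
The underlying idea is simple enough to sketch in a few lines. The toy structure below (not Skimle's internal format) shows how a flat table of coded segments supports both directions of lookup:

  # Toy illustration of a coded-segment table with two-way lookups - not Skimle's internal format
  coded_segments = [
      {"doc": "interview_03", "theme": "pricing concerns",
       "quote": "We almost churned when the renewal price doubled."},
      {"doc": "interview_03", "theme": "support delays",
       "quote": "It took four days to get a reply from support."},
      {"doc": "interview_07", "theme": "pricing concerns",
       "quote": "The per-seat model punishes us for growing."},
  ]

  def quotes_for_theme(theme: str) -> list[tuple[str, str]]:
      # Theme -> data: every supporting quote, with its source document
      return [(s["doc"], s["quote"]) for s in coded_segments if s["theme"] == theme]

  def themes_for_doc(doc: str) -> set[str]:
      # Data -> themes: every theme this document contributes to
      return {s["theme"] for s in coded_segments if s["doc"] == doc}

  print(quotes_for_theme("pricing concerns"))   # the audit trail behind a claim
  print(themes_for_doc("interview_03"))         # coverage check for one interview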

Skimle difference #2: Systematic processing instead of retrieval

ChatGPT processes your data every time you query it, with no guarantee of consistency or completeness. Skimle's systematic AI-assisted analysis processes everything once, thoroughly, creating a stable structure that remains consistent. The resulting Skimle table (what insights each document has related to each category) is a format that both humans and computers can understand, navigate and manipulate.

Think of it like the difference between:

  • ChatGPT approach: Having an assistant rummage through filing cabinets every time you ask a question, pulling different files each time and giving you whatever they happen to find. Or running financial analysis through a calculator each time you change an assumption.
  • Systematic approach: Carefully organising all files into a structured system once, so every subsequent query accesses the same organised information. Or having a nicely laid out Excel table with source data, assumptions and formulas clearly laid out for real-time editing.

The systematic approach means:

  • Consistency: You are dealing with analysed data in a stable format, not output regenerated on every query
  • Completeness: You know everything was processed, not just what matched a particular query
  • Refinability: You can reorganise themes, merge categories, adjust structure - the underlying data coding doesn't change

Skimle difference #3: Validation and refinement

Chatbots give you results. Take them or leave them. You can't easily refine, adjust, or validate. And often, if you try to probe deeper, you get the sycophantic-apologetic side of LLMs: "You are absolutely right, I seem to have omitted this important theme. My mistake, I will work harder next time. Spotting the mistake was not just observant, it was truly next-level genius! Well done, let's add the theme to the analysis and proceed!"

Proper AI-assisted analysis treats the AI-generated structure as a first draft requiring human validation and refinement. The AI does the mechanical work of systematic coding and initial categorisation. The researcher applies judgment, expertise, and contextual understanding to refine themes, identify relationships, and develop insights.

This preserves what's valuable about human expertise while eliminating the mechanical drudgery that makes manual analysis so time-intensive.


When simple AI chatbots might be acceptable (and when definitely not!)

To be balanced: ChatGPT, Cursor, Gemini, Grok, Claude and the like aren't useless for all qualitative work. There are contexts where their limitations matter less:

Acceptable uses:

  • Quick exploratory analysis of small datasets (5-10 interviews) where you will later do the proper analysis anyway
  • Generating initial ideas for coding frameworks that you'll then apply manually
  • Summarising single interviews for your own note-taking
  • Brainstorming potential themes before systematic analysis
  • Asking the AI to find that one quote you know is in the data but can't find

Definitely not acceptable:

  • Offloading any steps of the academic research process to the AI
  • Business decisions involving significant investment or risk
  • Policy analysis where stakeholder views must be fairly represented
  • Any context where you need to defend your methodology
  • Legal or compliance work requiring audit trails
  • Studies with 10+ interviews where you are not manually verifying everything

The key question: Can you afford to have hallucinations, inconsistency, and missing insights in the AI-generated answers? If the answer is no, simple chatbots aren't suitable for your work.


Conclusion: promising technology, wrong application

ChatGPT and its friends are remarkable pieces of technology with genuine capabilities. But analysing qualitative research data systematically and reliably isn't one of them. The hallucinations, inconsistency, and transparency problems aren't minor issues that can be worked around with better prompts or next-generation models. They're fundamental limitations of how LLMs are designed to work.

The research is clear: Simple chatbot tools can assist with qualitative analysis in limited ways, but they cannot replace systematic methodology, human judgment, and transparent audit trails. Researchers and analysts who try to use them as a complete analysis solution are building their conclusions on unreliable foundations.

The good news: you don't have to choose between ChatGPT's speed and traditional approaches' rigour. Properly designed AI-assisted analysis can deliver both, using AI to automate mechanical coding work while preserving the systematic methodology and two-way transparency that serious analysis requires.

If you're analysing interviews, survey open-ends, policy consultations, or any other serious work where the conclusions matter, don't trust black-box LLM approaches. Use systematic analysis that you can verify, defend, and trust.


Ready to analyse your qualitative data with both speed and rigour? Try Skimle for free and experience systematic AI-assisted analysis with full two-way transparency from every insight back to source data.

Want to learn more about qualitative analysis methods? Read our guides on thematic analysis methodology, how to conduct effective interviews, and choosing the right qualitative analysis tools.

About the Authors

Henri Schildt is a Professor of Strategy at Aalto University School of Business and co-founder of Skimle. He has published over a dozen peer-reviewed articles using qualitative methods, including work in Academy of Management Journal, Organization Science, and Strategic Management Journal. His research focuses on organizational strategy, innovation, and qualitative methodology. Google Scholar profile

Olli Salo is a former Partner at McKinsey & Company where he spent 18 years helping clients understand the markets and themselves, develop winning strategies and improve their operating models. He has done over 1000 client interviews and published over 10 articles on McKinsey.com and beyond. LinkedIn profile