You have probably used ChatGPT, Claude, or Copilot dozens or even hundreds of times. You type a question, press enter, and get an answer that often sounds remarkably intelligent. But do you actually understand what is happening when you hit that enter button?
Most people have fundamental misconceptions about how these AI systems work. These misunderstandings matter because they lead to using AI tools in ways that produce unreliable results, particularly when analysing qualitative data or doing serious research work.
We have discussed these topics earlier in Skimlecast Episodes 6 and 8, and written, for example, about the challenges of using ChatGPT for analysing qualitative data on the Signal & Noise blog. In this article we go deeper into the topic.
Let's clear up the confusion and understand what ChatGPT actually does, using the example of when you ask it to analyse your interview transcripts or summarise documents.
Six common misconceptions about how AI works
Misconception 1: ChatGPT stores your data in a smart database somewhere
When you upload documents to ChatGPT, many people imagine the AI is storing those documents in some kind of searchable database that it can later query. This is wrong.
Large language models are not databases. They do not store your documents as retrievable files. When you upload a PDF or paste text, the AI converts it into text (tokens) that become part of the immediate conversation context. Once that conversation ends, the documents are gone unless you're using a system specifically designed to store them separately (e.g. some RAG systems).
This matters enormously when you think ChatGPT is "analysing all your interview data". The AI has no persistent store of those documents to refer back to. It cannot go back and recheck things unless you feed it the entire context again, and it is not structuring or indexing the data in any way.
Misconception 2: When ChatGPT says "I'm analysing your data" it's actually thinking
You have probably seen ChatGPT display a message like "Analysing your documents..." or "Let me think about that..." and assumed the AI is genuinely performing some internal analytical process during that pause.
What actually happens is much simpler. The model starts generating tokens immediately. Those "thinking" messages are just text predictions like any other output. The AI predicts that showing a "thinking" message before the answer matches the pattern of helpful assistants in its training data. There is no separate analytical process happening. It is generating text the entire time. Note that in some settings (e.g., when using Claude Code's harness) there are hidden output tokens, but that is not part of the core LLM workflow; it is an additional feature layered on top, which we return to later.
When you ask ChatGPT to analyse 50 interview transcripts, it's not going away to carefully review each one and build a systematic understanding. It's immediately starting to predict what text should follow your question based on patterns it learned during training.
Misconception 3: ChatGPT learns from your conversations in real time
This is perhaps the most widespread misconception, which we also discussed at length in Skimlecast Episode 4 in response to a listener question. Your conversations are not training the model in real time. Training a large language model takes weeks or months with massive computing resources. The ChatGPT you talk to today is a frozen snapshot that was trained months ago.
When you chat with ChatGPT, you are interacting with a static file, similar to running a video game from a disc (remember those...?). The model cannot learn from your feedback or remember your previous conversations, unless you are using specific features like custom instructions or memory, which work through separate systems and are often very light, as each memory has to be fed to the LLM at every turn of the chat.
This has big implications for qualitative research workflows. If you give ChatGPT feedback like "actually, that theme should be split into two categories", the model has not learned anything. The next time you analyse similar data, it will make the same mistakes unless you explicitly tell it again.
Misconception 4: ChatGPT understands language and meaning like humans do
When ChatGPT discusses thematic analysis in qualitative research or helps you categorise interview responses, it sounds like it understands what themes are and why certain quotes belong together.
LLMs do not comprehend language or meaning in the way humans do. They are extraordinarily sophisticated pattern-matching systems. They have learned which words tend to follow other words in millions of examples, including examples of people doing thematic analysis, writing research reports, and categorising qualitative data.
This distinction matters because the AI lacks the lived experience and contextual understanding that human researchers bring. When you analyse interview transcripts, you recognise when someone is being sarcastic, when they are expressing something important but struggling to articulate it, or when an offhand comment reveals a deeper insight. ChatGPT can only work with surface-level patterns in the text: sometimes it captures the meaning correctly, but often it does not.
Misconception 5: Newer AI models always perform better
You might assume that ChatGPT-5 or the latest model is automatically better at every task than older versions. This turns out to be false in many cases.
Recent research found that newer models like ChatGPT-4o sometimes performed worse than older ones at specific tasks, including scientific summarisation. Different models have different strengths. Some are better at creative writing, others at code, others at following instructions precisely. For analysing data like open text responses, you cannot simply assume the newest model will do the best job.
Misconception 6: ChatGPT can access and retrieve information like a search engine
When you ask ChatGPT "what does recent literature say about customer satisfaction drivers", many people imagine the AI is searching through articles somewhere and retrieving information it then processes. In reality, doing this live for every query would require huge amounts of compute and take far too long.
LLMs cannot access URLs or search the internet unless explicitly given that capability through additional tools. The base model can only work with text in the immediate conversation. What sounds like retrieval is actually the model generating text based on patterns it saw during training about what articles typically say about customer satisfaction.
If you are using "Deep research" functionalities, the LLM can perform web searches, but these take time just like human browsing would. No deep research tool actually searches through all the relevant websites; they visit the top results and then fill in the gaps with plausible text from their training data.
How large language models actually work
To understand why these misconceptions exist and what ChatGPT really does, you need a basic understanding of the underlying technology.
Next token prediction at massive scale
At its core, a large language model has one job: predict the next token. A token is roughly a word or part of a word. The entire system is trained to look at a sequence of tokens and predict which token should come next.
During training, the model sees billions of examples from the internet: articles, books, websites, code, conversations. For each example, it learns patterns. After seeing "The capital of Finland is" thousands of times followed by "Helsinki", it learns that "Helsinki" is the highly probable next token in that context. Crucially, the model is frozen after training. What you chat with is a snapshot of those learned patterns.
This process scales up dramatically. The model does not just learn that Helsinki follows "capital of Finland". It learns incredibly sophisticated patterns about how language works, how arguments are structured, how concepts relate to each other, and so on. So it is not entirely correct to call it a "fancy autocomplete": there are so many factors in play in producing the next token that models exhibit emergent behaviour.
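To make "predict the next token" concrete, here is a toy sketch, not a real model: a language model assigns scores ("logits") to every token in its vocabulary, converts them into probabilities with a softmax, and the most probable token wins. The vocabulary and scores below are made up for illustration.

```python
import math

# Toy illustration of next-token prediction (not a real model):
# the model maps a context to scores ("logits") over its vocabulary,
# turns them into probabilities, and picks the most likely token.
vocab = ["Helsinki", "Stockholm", "Oslo", "pizza"]
logits = [5.1, 2.3, 2.0, -3.0]  # made-up scores for "The capital of Finland is"

def softmax(scores):
    # Exponentiate and normalise so the scores form a probability distribution.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]
print(prediction)  # greedy choice: Helsinki
```

A real model does this over a vocabulary of tens of thousands of tokens, with logits computed by billions of learned parameters, but the final step is exactly this: a probability distribution and a choice.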
After training the LLM to predict the next token, it is tuned to produce helpful and safe responses in two stages. First, supervised fine-tuning (SFT) trains the model on curated examples of ideal assistant behaviour: human-written conversations showing how to respond properly. Then comes reinforcement learning from human feedback (RLHF): humans rank multiple model outputs to train a separate "reward model" that learns to score responses; the LLM generates responses, the reward model scores them, and the LLM's parameters are adjusted to make high-scoring outputs more likely. This process shapes the model's default behaviour and is why LLMs like ChatGPT prefer to be helpful (and sometimes even too sycophantic...) rather than just completing text in any direction.
Attention mechanisms and context
The breakthrough that made modern LLMs possible is called the "attention mechanism". This allows the model to figure out which previous tokens in the conversation are most relevant for predicting the next token.
When generating a response, the model looks at all the tokens in the current conversation and calculates attention scores. If you asked about "customer satisfaction in retail customers", the model pays more attention to those specific tokens when generating the response, rather than weighing every single word in the full conversation history equally.
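The scoring idea can be sketched with a stripped-down version of scaled dot-product attention. The token embeddings and the query vector below are made up two-dimensional toys, real models use vectors with thousands of dimensions and learned projections, but the mechanics are the same: relevance scores, a softmax, and weights that favour the most relevant tokens.

```python
import math

# Toy scaled dot-product attention over tiny made-up embeddings.
# Each previous token gets a relevance score against the query;
# softmax turns the scores into weights that sum to 1.
tokens = ["customer", "satisfaction", "in", "retail"]
embeddings = {
    "customer": [1.0, 0.2], "satisfaction": [0.9, 0.6],
    "in": [0.0, 0.1], "retail": [0.5, 0.6],
}
query = [1.0, 0.5]  # stand-in for "what am I predicting next?"

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

dim = len(query)
scores = [dot(query, embeddings[t]) / math.sqrt(dim) for t in tokens]
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

for t, w in sorted(zip(tokens, weights), key=lambda p: -p[1]):
    print(f"{t}: {w:.2f}")  # content words outweigh the filler word "in"
```

Running this shows "satisfaction", "customer", and "retail" getting much more weight than "in", which is the intuition behind the model "paying attention" to the relevant parts of your prompt.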
This attention mechanism is why LLMs seem to maintain context across a conversation. They can refer back to things you said earlier because those tokens, along with the entire previous conversation, are fed to the model every time, and the attention mechanism can bring them up. However, attention has limits. Most models have a maximum context window, perhaps 128,000 tokens (about 100,000 words). Once the conversation exceeds that length, older tokens get dropped. The model has no memory of them. And as the conversation gets longer, more and more tokens are fed to the model each turn, which makes it harder for the model to pick out the right tokens to pay attention to.
Generate, feed back, repeat
Here is what actually happens when you ask ChatGPT to analyse your interview data:
- You paste interview transcripts and ask a question. This becomes a long sequence of tokens.
- The model looks at all those tokens and predicts the single most likely next token based on patterns learned during training.
- That predicted token gets added to the context.
- The model looks at the entire sequence again, including the token it just generated, and predicts the next token.
- This repeats until the model generates a token that signals it's done.
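The steps above can be sketched as a loop. The predictor here is a stub returning a canned continuation, not a real model, but the structure is faithful: the entire sequence is re-fed to the predictor for every single new token.

```python
# A sketch of the autoregressive loop, with a stub in place of a real
# model. The key point: the *whole* sequence is the input every time.
END = "<end>"

def predict_next(tokens):
    # Stand-in for a real LLM: a canned continuation of a 3-token prompt.
    canned = ["Theme", "1:", "pricing", END]
    generated_so_far = len(tokens) - 3  # tokens beyond the 3-token prompt
    return canned[generated_so_far]

context = ["Summarise", "the", "interviews"]  # the user's prompt as tokens
while True:
    token = predict_next(context)  # model sees the full sequence each time
    if token == END:
        break
    context.append(token)          # the new token becomes part of the input

print(" ".join(context[3:]))  # -> Theme 1: pricing
```

Because the context grows by one token per step, each step processes more input than the last, which is exactly why long conversations slow down.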
Each token is predicted one at a time, with the full conversation fed back to the model each time. This is why longer conversations become slower. The model processes more tokens for each new word it generates.
When ChatGPT produces a thematic analysis of your interviews, it has not systematically reviewed each interview and built a structured understanding. It has predicted a plausible sequence of tokens that resembles thematic analyses it saw during training, using attention to focus on relevant parts of your interviews that are currently in context.
What this means when you use ChatGPT for analysis
Understanding how LLMs work reveals why they behave in unexpected and often problematic ways when you try to use tools like ChatGPT or Copilot for analysing data.
Every query re-analyses from scratch
Because the model predicts tokens based on the current conversation, asking the same question twice can give different answers. There is randomness in the prediction process (controlled by "temperature" settings). More fundamentally, the model does not build a persistent understanding or structure of your data that it would refine over time.
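The effect of the temperature setting can be shown in a few lines. Logits are divided by the temperature before the softmax: a low temperature sharpens the distribution towards near-deterministic output, a high temperature flattens it so less likely tokens get sampled more often. The three candidate scores are made up for illustration.

```python
import math

# Toy illustration of temperature: logits are divided by the
# temperature before the softmax, changing how "peaky" sampling is.
logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

def softmax_with_temperature(scores, temperature):
    scaled = [s / temperature for s in scores]
    exps = [math.exp(s) for s in scaled]
    return [e / sum(exps) for e in exps]

cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
print([round(p, 2) for p in cold])  # top token dominates
print([round(p, 2) for p in hot])   # probabilities much more even
```

At temperature 0.2 the top token gets over 99% of the probability mass; at 2.0 it drops below 50%, so repeated runs of the same prompt will diverge.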
If you ask ChatGPT to categorise 100 customer feedback responses, then later ask "what were the main themes again?", the model re-generates an answer based on the current conversation state. If parts of the original responses have scrolled out of the context window, the answer may be incomplete or inconsistent with the earlier analysis.
This is completely different from how humans do qualitative research. A researcher builds a mental model of the data, refines categories over time, and can reliably recall their analytical framework. ChatGPT does none of this. Each response is freshly predicted tokens.
Not retrieving anything, just predicting text
When you upload 50 interview transcripts and ask "what did people say about pricing?", it feels like ChatGPT is searching through those interviews to find pricing comments.
What actually happens is the model looks at the tokens representing your interviews (currently in context) and predicts a plausible response about pricing based on patterns from training data. If your context is too long, the model may not attend to all interviews equally. It might focus on interviews that appeared recently in the token sequence due to recency bias in attention mechanisms.
The model is not conducting a systematic review. It is generating text that sounds like a systematic review based on partial attention to your data. This is why ChatGPT-style document chat fails for rigorous analysis. The outputs look professional but lack the comprehensive coverage and consistency that proper research requires.
Hallucinations are a feature, not a bug
Studies show that LLMs produce inaccurate conclusions in up to 73% of scientific summary cases. The model exaggerates claims, invents citations, and confidently states things that do not appear in source documents.
This happens because the model is trained to predict plausible text, not accurate text. When summarising research, it predicts the kind of confident, broad conclusions that often appear in summaries, even if those conclusions go beyond what your specific data says.
For analysing open-ended survey responses or conducting due diligence research, hallucinations are catastrophic. You cannot distinguish between real insights drawn from your data and plausible-sounding fabrications unless you manually verify every claim against source documents.
Sycophancy means it tells you what you want to hear
LLMs exhibit sycophancy, the tendency to agree with users and validate their ideas even when those ideas are wrong. This happens because the training data in the fine-tuning stage includes many examples of helpful assistants who are agreeable and supportive.
If you have a hypothesis about what your interview data reveals and you ask ChatGPT to confirm it, the model is biased towards generating text that agrees with you. This is particularly dangerous for qualitative research where confirmation bias is already a major threat to rigour. Some researchers have noted that if they have first discussed or e.g., typo-corrected text related to a specific theoretical construct so that it's still stored in the LLMs context window, this has a heavy influence on the LLMs answers and it tries, often without mentioning it, form a coherent bridge between the theory and the data even if none exists.
A proper analytical process involves challenging your initial interpretations, looking for disconfirming evidence, and being surprised by what the data shows. ChatGPT's sycophancy works against this. It tends to confirm your hunches rather than revealing unexpected patterns.
Context limits mean incomplete analysis
Even with 128,000 token context windows, you cannot fit hundreds of long interview transcripts or thousands of survey responses into a single conversation. Once you exceed the limit, the model starts forgetting earlier documents.
This creates a fundamental problem for analysing qualitative data at scale. The AI cannot hold all your data in attention simultaneously. Its analysis is necessarily incomplete, focusing on whatever fits in the current context window.
Researchers working manually can build up a systematic understanding across a large dataset over time. ChatGPT cannot. Each query works with only what fits in context right now.
You need a harness around the LLM
Raw large language models are extraordinarily powerful pattern-matching engines. But on their own, they are not suitable for serious analytical work. You need to build a harness around the LLM that compensates for its limitations.
Tools, memory, and workflow
The future of AI is not chatting with a single model. Well-designed AI systems combine LLMs with tools, memory systems, and structured workflows.
Tools allow the LLM to perform actions rather than just generating text. Instead of predicting what a calculation result might be, the model calls a calculator tool and gets the actual answer. Instead of predicting what articles probably say, the model uses a search tool to retrieve real documents.
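The tool-calling pattern can be sketched as follows. The message shapes and names here are illustrative, not any vendor's actual API: the LLM (stubbed) emits a structured request such as `{"tool": "calculator", ...}`, and the harness executes the real tool and feeds the result back, instead of letting the model guess the answer.

```python
# Minimal sketch of tool calling (names illustrative, not a real API).
# The harness, not the model, performs the actual computation.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def fake_llm(messages):
    # Stand-in for a real model: it decides to call the calculator
    # rather than predicting what the arithmetic result might be.
    return {"tool": "calculator", "input": "128000 * 0.75"}

def run_turn(user_message):
    messages = [{"role": "user", "content": user_message}]
    decision = fake_llm(messages)
    if "tool" in decision:
        result = TOOLS[decision["tool"]](decision["input"])
        messages.append({"role": "tool", "content": result})
    return messages[-1]["content"]

print(run_turn("Roughly how many words fit in 128,000 tokens?"))  # 96000.0
```

The division of labour is the point: the model chooses *what* to do, the tool guarantees the answer is *correct*.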
Memory systems store information outside the conversation context so it persists across sessions. Rather than re-reading all your interviews for every query, a memory system maintains a structured representation of your data that the LLM can query.
Structured workflows break complex tasks into steps instead of asking the LLM to do everything in one go. Rather than "analyse these interviews", a workflow might: extract quotes, identify themes, categorise quotes into themes, summarise each theme, compare across segments, and generate a report. Each step uses the LLM appropriately rather than hoping it magically does rigorous analysis.
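A workflow like this can be sketched as a pipeline of small functions. The function names are illustrative (not Skimle's actual API) and the LLM call is a stub; the point is that each step is a small, checkable unit, LLM calls are confined to the steps that need them, and the quote-to-theme mapping lives in ordinary data structures outside the model.

```python
# Sketch of a structured analysis workflow (illustrative names, stubbed LLM).
def llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return "pricing" if "categorise" in prompt else "Pricing concerns dominate."

def extract_quotes(transcript: str) -> list[str]:
    # Deterministic step: no LLM needed to split out candidate quotes.
    return [s.strip() for s in transcript.split(".") if s.strip()]

def categorise(quote: str) -> str:
    return llm(f"categorise this quote: {quote}")

def summarise(theme: str, quotes: list[str]) -> str:
    return llm(f"summarise theme {theme}: {quotes}")

transcript = "The price is too high. I like the service."
quotes = extract_quotes(transcript)
themes: dict[str, list[str]] = {}
for q in quotes:
    # The quote-to-theme mapping is kept outside the LLM conversation.
    themes.setdefault(categorise(q), []).append(q)
report = {t: summarise(t, qs) for t, qs in themes.items()}
print(report)
```

Because the mapping is stored explicitly, every summary can be traced back to its source quotes, something a single "analyse these interviews" prompt cannot offer.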
Claude Code as an example
Claude Code is a good example of building a harness around an LLM. When you ask Claude Code to modify files in a codebase, it does not just generate text describing changes. It has access to tools that read files, write files, run bash commands, and execute code.
The LLM generates decisions about what to do next, but the tools perform actual actions. This combination makes Claude Code far more capable than ChatGPT for software development tasks. The LLM component is similar (Opus or Sonnet 4.5), but the harness transforms what's possible.
Skimle takes the same approach for qualitative analysis
This same principle applies to Skimle, our tool for qualitative data analysis. We originally tried to use raw ChatGPT-type tools (and believe us, we tested them all...) for analysing interview transcripts for research and public consultation statements for policy making, but found them unsuitable because they lacked the harness required for rigorous research.
Skimle builds a harness around LLMs that includes:
- Systematic processing workflow that analyses documents one by one rather than dumping everything into a prompt
- Structured data storage that maintains categories, themes, and quote-to-category mappings outside the LLM conversation
- Two-way transparency that lets you trace every insight back to source quotes and see what quotes fed each category
- Consistency mechanisms that ensure the same data produces the same analysis rather than random variation between runs
- Human oversight interfaces that keep researchers in control of analytical decisions rather than blindly trusting LLM outputs
The LLM component in Skimle does what LLMs do well: small, focused queries, for example to extract and categorise themes. But the harness handles what LLMs cannot do: maintaining comprehensive structured data, ensuring systematic coverage, providing transparency, and supporting iterative refinement.
You can read more about how this structured approach differs from RAG systems and why two-way transparency matters when using AI analysis tools.
The bottom line
ChatGPT and other large language models are remarkable technology, but they are not magic. Understanding that they work by predicting the next token based on learned patterns helps you recognise both their strengths and fundamental limitations.
For casual tasks like drafting emails, brainstorming ideas, or explaining concepts, raw LLMs work reasonably well. For serious analytical work like analysing interviews, conducting thematic analysis, or synthesising expert feedback, you need tools specifically designed for the rigour and transparency that research requires, just as serious AI-assisted coding work needs a harness like Claude Code.
The good news is that by understanding how LLMs actually work, you can evaluate AI tools more critically. Ask questions like: How does this tool ensure comprehensive coverage? Can I trace every conclusion back to source data? Does it produce consistent results? Is there a systematic workflow or just a chat interface?
These questions distinguish between genuine AI-assisted analysis and AI slop. The technology is powerful, but only when properly harnessed.
Ready to analyse your qualitative data with a dedicated AI harness built for serious knowledge work? Try Skimle for free and experience systematic AI-assisted analysis with full two-way transparency from every insight back to source data.
About the authors
Henri Schildt is a Professor of Strategy at Aalto University School of Business and co-founder of Skimle. He has published over a dozen peer-reviewed articles using qualitative methods, including work in Academy of Management Journal, Organization Science, and Strategic Management Journal. His research focuses on organisational strategy, innovation, and qualitative methodology. Google Scholar profile
Olli Salo is a former Partner at McKinsey & Company where he spent 18 years helping clients understand the markets and themselves, develop winning strategies and improve their operating models. He has done over 1000 client interviews and published over 10 articles on McKinsey.com and beyond. LinkedIn profile
Further reading
- 10 Common Misconceptions About Large Language Models - Machine Learning Mastery
- Simon Willison: Misconceptions about large language models
- Chatting Isn't Training: Demystifying Memory in LLMs - Choice 360
- The debate over understanding in AI's large language models - PNAS
- It's 2026. Why Are LLMs Still Hallucinating? - Duke University Libraries
- Prominent chatbots routinely exaggerate science findings, study shows
- 6 Common Misconceptions about ChatGPT - LinkedIn
