Bias in AI-assisted qualitative analysis: what it looks like, and how to catch it

Bias in AI-assisted qualitative analysis is not the model being rude about a protected group. It is something quieter and more dangerous: the AI's coding errors are not random with respect to who said what. A peer-reviewed 2025 study testing GPT and Llama on real interview data found exactly this: errors correlated with refugee status, gender, and education. That means even a model that looks accurate overall can produce confidently wrong conclusions about specific groups.

No AI system, including the ones underneath Skimle, can claim to have eliminated this. What can change is whether you, the researcher, are ever in a position to notice it.

What does bias look like in AI-assisted qualitative analysis?

It rarely looks like an obvious slur or a flagged word. It looks like a finding that is wrong in a specific, patterned way that is easy to miss because the output still reads as coherent and well-supported.

We demonstrated a version of this directly by feeding an AI tool 650 randomly labelled customer comments, half tagged "Finnish," half tagged "US," with no actual connection between the label and the content. The AI confidently reported that Finnish comments were terse and verdict-focused while American comments were narrative and emotional, classic national stereotypes, complete with supporting quotes. When the labels were flipped and the exact same question asked again, the AI found the same stereotypes again, just pulling different quotes to support the now-relabelled data. The model was not reading the data. It was pattern-matching to what it already expected nationality to predict, then selecting evidence to fit.

That is what bias looks like in this context: not an absence of evidence, but evidence selected and framed to match a prior expectation the model picked up from its training data, applied to data where that expectation does not hold.
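The label-flip check described above can be written as a small harness. The following is a minimal sketch, not the code used in our test: the function names, the toy comments, and both stand-in analysers are invented for illustration. In real use, `analyse` would wrap a call to whichever AI tool is being tested and reduce its answer to a single claim (here, which label it judges to be wordier).

```python
from typing import Callable, Optional, Sequence

Row = tuple[str, str]  # (label, comment text)
Analyser = Callable[[Sequence[Row]], Optional[str]]  # label judged wordier, or None


def flip_labels(rows: Sequence[Row], a: str, b: str) -> list[Row]:
    """Swap labels a and b on every row; the comment text is untouched."""
    swap = {a: b, b: a}
    return [(swap[label], text) for label, text in rows]


def flip_test(rows: Sequence[Row], a: str, b: str, analyse: Analyser) -> str:
    """Run the same analysis before and after swapping the two labels."""
    before = analyse(rows)
    after = analyse(flip_labels(rows, a, b))
    if before is None and after is None:
        return "no difference claimed"
    if before is None or after is None:
        return "inconsistent: a difference is claimed in only one run"
    if before == after:
        return "suspicious: the finding follows the label name, not the text"
    return "the finding follows the text"


def wordier_by_count(rows: Sequence[Row]) -> Optional[str]:
    """Data-driven stand-in: which label has the higher mean word count?"""
    words: dict[str, list[int]] = {}
    for label, text in rows:
        words.setdefault(label, []).append(len(text.split()))
    means = {label: sum(c) / len(c) for label, c in words.items()}
    best = max(means, key=means.get)
    if all(m == means[best] for m in means.values()):
        return None
    return best


def stereotyping_stub(rows: Sequence[Row]) -> Optional[str]:
    """Hypothetical stand-in for a model that ignores the data entirely."""
    return "US"


# Invented toy comments, for illustration only.
rows = [
    ("Finnish", "Works fine."),
    ("US", "I bought this for my sister and she absolutely loved it."),
    ("Finnish", "Too expensive."),
    ("US", "Honestly the best purchase we made all year."),
]

print(flip_test(rows, "Finnish", "US", wordier_by_count))
print(flip_test(rows, "Finnish", "US", stereotyping_stub))
```

The first analyser changes its answer when the labels are swapped, because its claim depends on the text. The stub gives the same answer both times, which is the signature of the failure described above: the claim is attached to the label name rather than to what was said.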

What does the research actually say?

This is not a one-off demonstration. It is a documented pattern across multiple major model providers.

A 2025 study by Julian Ashwin, Aditya Chhabra, and Vijayendra Rao, published in Sociological Methods & Research, tested two versions of OpenAI's GPT models and two versions of Meta's Llama models on real qualitative data: open-ended interviews with Rohingya refugees in Cox's Bazaar, Bangladesh. Their core finding is precise and uncomfortable: the errors the models made in coding the interviews were not random with respect to the characteristics of the interview subjects, specifically refugee status, gender, and education. As the authors put it, even a model with 95% coding accuracy can still produce arbitrary bias in your conclusions if the 5% of errors it makes are systematically related to the variable you are studying. Random errors wash out with a large enough sample. Systematic ones do not.
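The arithmetic behind that point can be made concrete. The sketch below uses invented numbers, not figures from the study: two groups of 1,000 interviews each, a theme that is truly present in 30% of both, and a coder that makes exactly 100 errors (95% accuracy) in two different ways. The `Group` class and its field names are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Group:
    n: int          # interviews in the group
    true_pos: int   # interviews that really contain the theme
    false_pos: int  # theme coded where it is absent
    false_neg: int  # theme missed where it is present

    @property
    def observed_rate(self) -> float:
        """Share of the group that ends up coded with the theme."""
        return (self.true_pos - self.false_neg + self.false_pos) / self.n


def accuracy(groups: list[Group]) -> float:
    total = sum(g.n for g in groups)
    wrong = sum(g.false_pos + g.false_neg for g in groups)
    return (total - wrong) / total


def gap(a: Group, b: Group) -> float:
    """Apparent difference in theme prevalence between the two groups."""
    return b.observed_rate - a.observed_rate


# Scenario 1: errors unrelated to group (5% of each cell, in both groups).
even = [Group(1000, 300, 35, 15), Group(1000, 300, 35, 15)]
# Scenario 2: the same 100 errors, all false positives in the second group.
skewed = [Group(1000, 300, 0, 0), Group(1000, 300, 100, 0)]

for name, groups in [("even", even), ("skewed", skewed)]:
    print(f"{name}: accuracy {accuracy(groups):.0%}, "
          f"apparent gap {gap(*groups) * 100:.0f} points")
# even: accuracy 95%, apparent gap 0 points
# skewed: accuracy 95%, apparent gap 10 points
```

Both scenarios report the same headline accuracy. Only the second produces a ten-point difference between groups that does not exist in the underlying data, and collecting more interviews would not shrink it.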

Anthropic, Claude's own developer, ran a comparable evaluation on their own model rather than waiting for outside researchers to find the problem first. They generated prompts across 70 decision scenarios (hiring, housing, medical treatment, loans) and systematically varied the demographic details in each one, then tested Claude 2.0's responses. They found patterns of both positive and negative discrimination in select settings with no intervention applied, and showed that careful prompt design could significantly reduce, not eliminate, both. To Anthropic's credit, they published the dataset and prompts publicly and explicitly stated they do not endorse using language models to make automated decisions in the high-risk scenarios they tested.

Three things are worth taking from this body of research together: the pattern shows up across different model families and providers, it shows up specifically in qualitative coding tasks (not just hypothetical decision scenarios), and even the companies building these models are finding it in their own evaluations rather than dismissing it.

Why "use a better prompt" or "use a better model" does not fix this

The instinctive responses to this problem are both reasonable-sounding and both insufficient on their own.

Anthropic's own research found that careful prompting reduces the size of the effect, not its existence; their paper describes "significantly decreasing" discrimination, not eliminating it. Our own testing found the same pattern: explicitly instructing a model to "never hallucinate" and act like a careful expert changed how many quotes it produced to back up its claims, but the underlying wrong conclusion stayed exactly the same. The model was not being careless. It was doing what it was built to do: produce a coherent, well-supported-looking answer. A stereotype is exactly that kind of answer if you do not check it against the actual data.

Switching models does not solve it either, for the same reason. The Ashwin, Chhabra and Rao study found the systematic bias pattern across both OpenAI's and Meta's models, not just one. Better prompting and newer models can reduce the rate of these errors. Neither removes the underlying mechanism: a model trained on a vast corpus of human-written text inevitably absorbs the patterns, including the stereotypes and the underrepresentation, present in that text, and brings them along uninvited whenever it analyses your specific dataset.

What you can actually do about it

The fair position is that no qualitative analysis tool, including Skimle, can promise that the AI underneath it has zero bias. What differs sharply between tools is whether bias is checkable at all once it happens.

In a black-box chat tool, you ask a question, you get a confident-sounding answer, and there is no stable, inspectable structure standing between your data and that answer. If the model's analysis is systematically off for one group of respondents, you have no way to see it; the summary just looks plausible.

Skimle's architecture is built around two-way transparency: every category traces down to the exact quotes that support it, and every quote traces up to the categories it was coded into. That does not stop a biased coding decision from happening. It means a biased coding decision leaves a visible trail you can actually go and check, the same way you would check a human research assistant's coding, rather than a trail that disappears the moment the chat window scrolls.

The [way we do coding](insight-order-in-skimle) is also deliberately designed to limit one route by which bias enters. Instead of feeding the model the entire document (which would typically include both explicit mentions of the informant's background and plenty of clues from which the model could infer it), we feed it one chunk at a time and identify insights chunk by chunk. This way the thoughts and ideas stand on their own merit instead of being labelled as coming from a specific type of informant.

In practice, that means:

Check coding patterns against metadata, deliberately. If your data includes attributes like gender, language, region, or role, filter the category structure by that variable and look for categories that correlate suspiciously well with a demographic rather than with what people actually said. This is the same logic as the flip test: if a finding only holds up when you already expected it to, be suspicious of it.

Read the quotes behind a finding before you act on it, especially a flattering or expected one. A surprising finding tends to get scrutinised. A finding that confirms what everyone already assumed rarely does, which is exactly backwards, since that is the kind of finding a pattern-matching model is most likely to produce whether or not it is true.

Treat AI-coded analysis on sensitive or vulnerable populations with extra caution. The Ashwin, Chhabra and Rao study specifically involved refugee interview subjects, a population where getting the analysis systematically wrong has real consequences. This is exactly the kind of work that academic researchers and public sector and policy teams handle regularly: stakeholder consultations, vulnerable-population studies, asylum and welfare casework. The case for keeping every coding decision auditable and editable, rather than trusting a generated summary, is stronger here, not weaker.

Do not treat editability as a fix in itself, or the review as a step you can skip. Skimle's coding is editable specifically so that a human reviewer can catch and correct exactly this kind of systematic skew. That review step is doing real work; treating the AI's first pass as final because it looked reasonable defeats the purpose of building a tool this way.

Pay particular attention when analysing non-English data. The training data behind most large language models is disproportionately English, which means coding quality and the risk of stereotype leakage are not guaranteed to be uniform across languages. Skimle analyses material across 100+ languages and supports combining findings across languages in one project, but that capability does not exempt the underlying model from this risk. If a study spans several languages, it is worth spot-checking whether the category structure for one language looks thinner or more generic than another, rather than assuming uniform quality across the dataset.
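For teams that export their coded segments, the metadata check in the list above can be approximated in a few lines. This is a rough sketch under stated assumptions: the input is a list of (group, category) pairs, one per coded segment, where the group is whatever attribute you are checking (gender, region, role, or interview language). The function names, the toy data, and the 0.25 threshold are all invented for illustration and are not part of any Skimle export format.

```python
from collections import Counter, defaultdict
from typing import Sequence


def category_rates_by_group(
    rows: Sequence[tuple[str, str]],
) -> dict[str, dict[str, float]]:
    """For each category, the share of each group's coded segments it holds."""
    group_totals = Counter(group for group, _ in rows)
    counts: dict[str, Counter] = defaultdict(Counter)
    for group, category in rows:
        counts[category][group] += 1
    return {
        category: {g: counts[category][g] / group_totals[g] for g in group_totals}
        for category in counts
    }


def flag_skewed(
    rates: dict[str, dict[str, float]], min_gap: float = 0.25
) -> list[tuple[str, float]]:
    """Categories whose share differs between groups by at least min_gap."""
    flagged = []
    for category, by_group in rates.items():
        gap = max(by_group.values()) - min(by_group.values())
        if gap >= min_gap:
            flagged.append((category, round(gap, 2)))
    return sorted(flagged, key=lambda item: -item[1])


# Invented toy data: 20 coded segments per group.
rows = (
    [("A", "pricing")] * 10
    + [("A", "emotional tone")] * 10
    + [("B", "pricing")] * 10
    + [("B", "emotional tone")] * 2
    + [("B", "usability")] * 8
)

for category, gap in flag_skewed(category_rates_by_group(rows)):
    print(f"read the quotes behind '{category}' (gap {gap})")
```

A flagged category is a prompt to go and read the quotes, not evidence of bias on its own: groups do sometimes say different things. The same tabulation, run with interview language as the group, gives a quick view of whether one language's category structure looks thinner than another's.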

This is not unique to Skimle, and that is the point

It would be more comfortable to write this article as "other tools have this problem, ours does not." That claim would not survive contact with the research above, and a comparison post that made it would not deserve to be trusted on anything else either. The training-data bias documented in GPT, Llama, and Claude is a property of how these models are built, not a defect specific to any one application built on top of them, including Skimle's.

What we have tried to build is a tool where that underlying risk is something a careful researcher can actually investigate and correct, rather than something hidden behind a fluent paragraph. That is a meaningfully different proposition from "bias-free," and it is the one we are willing to stand behind.

Frequently asked questions

Can AI bias in qualitative analysis be eliminated entirely?

No tool can currently claim this, including Skimle. The bias originates in the underlying language model's training data, and reducing it (through better prompting, fine-tuning, or evaluation) is an active area of research with partial, not complete, results so far, as Anthropic's own published findings on Claude show.

Does using more than one AI model reduce the risk?

It can help surface disagreement worth investigating, since the Ashwin, Chhabra and Rao study found the bias pattern in both OpenAI's and Meta's models but not necessarily in identical ways. Running the same dataset through a different model and comparing results is a reasonable sanity check, though it does not guarantee either model is unbiased.
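One way to make that comparison systematic is to compute agreement between the two models separately for each respondent group, using Cohen's kappa (the same statistic often used for intercoder reliability). The sketch below is illustrative only: it assumes each segment received exactly one code from each model, and the group names, codes, and data are invented.

```python
from collections import Counter, defaultdict
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two coders on the same segments."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty code lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both coders used one identical code throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def kappa_by_group(rows: Sequence[tuple[str, str, str]]) -> dict[str, float]:
    """rows: (group, code_from_model_a, code_from_model_b), one per segment."""
    by_group: dict[str, tuple[list[str], list[str]]] = defaultdict(lambda: ([], []))
    for group, code_a, code_b in rows:
        by_group[group][0].append(code_a)
        by_group[group][1].append(code_b)
    return {g: cohen_kappa(a, b) for g, (a, b) in by_group.items()}


# Invented toy data: the models agree fully on group_1, at chance on group_2.
rows = [
    ("group_1", "cost", "cost"),
    ("group_1", "cost", "cost"),
    ("group_1", "trust", "trust"),
    ("group_1", "trust", "trust"),
    ("group_2", "cost", "cost"),
    ("group_2", "cost", "trust"),
    ("group_2", "trust", "cost"),
    ("group_2", "trust", "trust"),
]

for group, kappa in kappa_by_group(rows).items():
    print(f"{group}: kappa = {kappa:.2f}")
```

Markedly lower agreement for one group shows where to start reading source quotes. High agreement is weaker evidence: two models trained on similar text can share the same skew and agree with each other while both being wrong.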

Is this worse than human researcher bias?

Human coders carry their own biases too, and qualitative methods have spent decades developing practices (reflexivity, intercoder reliability, peer debriefing) to surface and manage that. AI bias is different mainly in scale and visibility: it can affect every document in a dataset uniformly and quickly, and without an inspectable structure it can be much harder to notice than one coder's idiosyncrasies. Two-way transparency exists specifically to make the AI's coding decisions as inspectable as a human coder's marked-up transcript.

Should I avoid AI-assisted analysis for research on vulnerable or sensitive populations?

Not necessarily, but the bar for verification should be higher. Keep the underlying coding visible and editable, check findings against metadata where you have it, and treat any AI-generated conclusion about a specific subgroup as a hypothesis to verify against the source quotes, not a finished result.

Want to see how an auditable coding structure works in practice? Try Skimle for free and check a finding against its source quotes yourself, rather than taking a summary on faith.

About the authors

Henri Schildt is a Professor of Strategy at Aalto University School of Business and co-founder of Skimle. He has published over a dozen peer-reviewed articles using qualitative methods, including work in Academy of Management Journal, Organization Science, and Strategic Management Journal. Google Scholar profile

Olli Salo is a co-founder at Skimle and a former Partner at McKinsey & Company, where he spent 18 years helping clients understand their markets and themselves, develop winning strategies, and improve their operating models. He has conducted over 1,000 client interviews and published over 10 articles on McKinsey.com and beyond. LinkedIn profile