How to anonymise and pseudonymise qualitative research data: IRB-compliant de-identification of interview transcripts

Step-by-step guide to anonymising and pseudonymising qualitative interview data for IRB, GDPR, and HIPAA compliance with Skimle Anonymise. Covers identifiers, audit trails, and methods documentation.


To anonymise qualitative interview data for IRB and ethics board compliance: remove or transform all direct identifiers (names, contact details, precise locations), address indirect identifiers that could re-identify participants in context (specific roles, organisations, unique demographic combinations), apply transformations consistently across all documents, and retain a documented audit trail of what was changed and how. Tools like Skimle Anonymise provide research-grade pseudonymisation with AI detection across six identifier categories, per-category transformation controls, and a PDF audit report documenting all decisions. This satisfies most IRB documentation requirements and journal methods section expectations.


Why anonymisation matters more than researchers expect

Most qualitative researchers understand that they need to anonymise their data. Fewer appreciate just how much is riding on doing it correctly. This guide focuses specifically on the anonymisation and pseudonymisation step; for the broader question of how AI can support your analysis once data is de-identified, see how Skimle works and the guide to using AI in qualitative research for academics.

IRB protocols typically specify not just that anonymisation will occur, but how it will be carried out. When your study is approved, the ethics board is approving a specific data handling procedure. If your actual approach deviates from what you described (because you used a different tool, skipped certain identifier categories, or handled data inconsistently across transcripts), you have a compliance gap that could affect publication, data archiving, or institutional review.

Journal reviewers are increasingly asking about anonymisation methods. A methods section that says "names were changed" does not answer the questions a rigorous reviewer will ask: were indirect identifiers addressed? Was the same pseudonym used consistently across transcripts? Was a re-identification key retained, and if so, how is it stored?

For European researchers, GDPR applies to interview data as personal data for as long as re-identification is possible. Under US regulations, HIPAA sets out specific de-identification standards with clearly enumerated criteria. Getting this wrong can lead to rejection, later retraction, or even an institutional investigation. In any case, reckless de-identification can expose your participants to harm.

The good news is that with a systematic approach, getting it right is achievable. The rest of this guide explains how.


The spectrum from pseudonymisation to anonymisation

These two terms are often used interchangeably, but they describe legally and methodologically distinct things.

Pseudonymisation replaces identifying information with a code or substitute identifier, while retaining a key that allows re-identification if needed. "Participant 7" is a pseudonym. "Sarah M." replaced with "Helen K." is a pseudonym. The original identities can be recovered by anyone with access to the translation table. Pseudonymised data is still personal data under GDPR. The ICO guidance on anonymisation and pseudonymisation is explicit on this point: pseudonymisation reduces privacy risk but does not remove the data from the GDPR regime.

Anonymisation is the irreversible removal of all identifying information, including destruction of the re-identification key. Truly anonymised data is no longer personal data under GDPR or most other frameworks, because re-identification is not possible. The HIPAA Safe Harbor standard defines 18 specific identifiers that must be removed or transformed for data to be considered de-identified under US law.

In practice, most academic qualitative research requires pseudonymisation, not true anonymisation. You need to be able to go back to your original transcripts, verify quotes, respond to reviewer questions, and potentially share data under a data sharing agreement. Destroying your re-identification key is rarely appropriate for active research data. It is appropriate once a study is complete and archiving decisions have been made.

What this means practically: when IRB protocols and consent forms say "your data will be anonymised," they usually mean pseudonymised. This is not dishonest, and it is how the term is commonly used in research ethics contexts. But it is worth being precise in your methods section, because reviewers in technical fields may push back on the terminology.


What counts as an identifier

Most researchers know to replace participant names. Far fewer address the full range of identifiers that appear in interview transcripts.

Direct identifiers are the obvious ones: personal names, email addresses, phone numbers, postal addresses, institution names that would identify a specific individual, and precise dates tied to identifiable events.

Indirect identifiers are more subtle. These are combinations of information that, in context, allow a reader to identify a participant even without a name. Consider:

  • "As the only female partner at the firm at the time..."
  • "The CEO who led the 2019 acquisition of the Helsinki subsidiary..."
  • "From our office in Rovaniemi, which serves the whole region..."
  • "I was the lead negotiator on the public sector contract you've probably read about..."
  • "When I moved from the Oslo office to take up the London role..."

None of these contain a personal name. All of them, in the right context, identify a specific individual. The combination of role, organisation, location, and time period can be more identifying than a name, because names are common while the combination of attributes is unique.

This is the core limitation of find-and-replace anonymisation. A script can catch "Henri Schildt" reliably. It will not catch "the only Finn on the European management committee during the restructuring."
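To make the limitation concrete, here is a minimal sketch of dictionary-based find-and-replace (an illustration of the general technique, not any tool's implementation). It reliably replaces the identifiers you enumerate, and silently passes through everything you did not:

```python
# Illustrative sketch: dictionary-based find-and-replace only catches
# the identifiers that were enumerated in advance.
replacements = {
    "Henri Schildt": "Participant 3",
}

def naive_anonymise(text: str, table: dict[str, str]) -> str:
    # Replace each enumerated identifier; anything not in the table survives.
    for original, substitute in table.items():
        text = text.replace(original, substitute)
    return text

transcript = ("Henri Schildt was the only Finn on the European "
              "management committee during the restructuring.")
anonymised = naive_anonymise(transcript, replacements)

# The name is gone, but the indirect identifier (the unique combination
# of nationality, role, and time period) remains fully intact.
```

Contextual detection exists precisely to catch what this approach cannot: identifying information that was never written down as a name.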

The categories that matter for qualitative interview data are:

  • Names (personal, including first names alone if distinctive)
  • Titles and roles (especially rare or unique positions)
  • Locations (specific cities, buildings, regions that narrow the population)
  • Organisations (employers, institutions, clients, including those mentioned by participants but not their own)
  • Dates (specific enough to link to identifiable events)
  • Other (physical descriptions, unusual career histories, family circumstances mentioned in passing)

Each category requires a decision about treatment, and that decision may differ by research context and compliance requirement.


How Skimle Anonymise works for academic researchers

Skimle Anonymise is built specifically for qualitative research data. The workflow is designed to be reproducible, documentable, and appropriate for IRB and ethics board requirements.

Upload and detection

You upload your interview transcripts (Word documents, PDFs, or plain text files). The AI analyses each document and detects candidate identifiers across all six categories above. Detection is not a simple word list; it uses contextual analysis to catch indirect identifiers as well as direct ones. Passages flagged for re-identification risk (where a combination of information creates a vulnerability even after direct identifier removal) are surfaced separately for researcher review.

Choosing your de-identification level

Three preset levels map to common compliance requirements:

Level 1: light pseudonymisation. Addresses direct identifiers only. Names, contact information, and explicit location references are replaced with pseudonyms or generalisations. The translation key is retained. This level is appropriate for internal research use, preliminary analysis, or contexts where data will be handled within a secure institutional environment and the participant population is low-risk.

Level 2: strong pseudonymisation. Addresses both direct and indirect identifiers. Roles are generalised, locations are broadened, specific dates are shifted, and re-identification risk passages are reviewed. The translation key is retained. This is the level most academic journal publications and ethics boards will expect.

Level 3: strong anonymisation. All identifying information is transformed, the translation key is destroyed, and the output is designed to meet the HIPAA Safe Harbor standard for de-identified data. This level is appropriate for data sharing under open-data requirements, or for archiving data after a study is complete.

Per-category control

For each of the six identifier categories, you choose the transformation type: keep as-is, pseudonymise (replace with a plausible substitute), generalise (replace with a broader descriptor), or redact (replace with [REDACTED]). You can also write custom rules, for example "generalise all references to specific departments within the organisation" or "replace the name of the client company with 'a major European retailer'."

This level of control matters because the right treatment varies by category and context. Names are typically pseudonymised. Specific locations are often generalised ("in a small town in northern Finland" rather than "in Lieksa"). Precise dates related to identifiable public events may need to be shifted rather than removed entirely.
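The per-category decisions described above can be thought of as a simple rules mapping. The sketch below is a hypothetical configuration (not Skimle's actual format); the category names follow the six listed earlier, and the values are the researcher's chosen treatments:

```python
# Hypothetical per-category rule set, for illustration only.
# Each of the six identifier categories gets one default treatment;
# custom rules can then override individual entities.
ALLOWED = {"keep", "pseudonymise", "generalise", "redact", "shift"}

rules = {
    "names": "pseudonymise",
    "titles_roles": "generalise",
    "locations": "generalise",      # e.g. "Lieksa" -> "a small town in northern Finland"
    "organisations": "pseudonymise",
    "dates": "shift",               # systematic offset preserving intervals
    "other": "redact",              # replace with [REDACTED]
}

# Sanity check: every category has a recognised treatment.
assert all(value in ALLOWED for value in rules.values())
```

Writing the rules down explicitly, whatever the format, is also what makes the process documentable for your audit trail.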

Date shifting

Dates are a category that requires particular care. Removing dates entirely can destroy temporal information that is analytically important. "Three months after the acquisition" is meaningful; removing the date entirely may break the interpretive thread of the analysis.

Skimle Anonymise offers systematic date shifting: all dates within a document can be offset by the same amount, preserving all temporal relationships while making the absolute dates non-identifying. The phrase "three months after the acquisition" remains "three months after the acquisition." The calendar date of the acquisition shifts by the same offset as every other date in the document.
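The principle behind systematic date shifting can be shown in a few lines. This is a sketch of the general technique, not Skimle's internal implementation: one fixed offset per document moves every absolute date while leaving every interval untouched.

```python
# Sketch of systematic date shifting: one fixed per-document offset.
from datetime import date, timedelta

OFFSET = timedelta(days=-137)  # chosen once per document

def shift(d: date) -> date:
    return d + OFFSET

# Hypothetical example dates, three months apart.
acquisition = date(2019, 3, 12)
interview = date(2019, 6, 12)

shifted_acquisition = shift(acquisition)
shifted_interview = shift(interview)

# Absolute dates change, but the interval between events is preserved
# exactly, so "three months after the acquisition" stays true.
assert (interview - acquisition) == (shifted_interview - shifted_acquisition)
```

The analytically important information (the order and spacing of events) survives; only the identifying information (the calendar dates themselves) changes.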

Cross-file consistency

In multi-transcript studies, the same participant will appear in multiple documents: their own transcript, any secondary materials where they are mentioned, and potentially in other transcripts where co-participants refer to them. Good pseudonymisation is consistent across all documents: "Bob" should become the same pseudonym in every document where he appears.

Skimle Anonymise maintains cross-file consistency automatically. It also supports entity merging: if a participant is referred to by their name in one document, by their title in another, and by their company role in a third, you can merge these into a single pseudonym that is applied consistently.
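The logic of cross-file consistency with entity merging can be sketched as a small pseudonym registry. This is a hypothetical helper, not Skimle's API: several surface forms (a name, a title, a role) are merged into one canonical entity, which receives a single pseudonym reused everywhere.

```python
# Sketch of a cross-file pseudonym registry with entity merging
# (hypothetical helper, not a real tool's API).
class PseudonymRegistry:
    def __init__(self) -> None:
        self._alias_to_entity: dict[str, str] = {}
        self._entity_to_pseudonym: dict[str, str] = {}
        self._counter = 0

    def merge(self, *aliases: str) -> None:
        """Declare that several surface forms refer to the same person."""
        entity = aliases[0]
        for alias in aliases:
            self._alias_to_entity[alias] = entity

    def pseudonym(self, alias: str) -> str:
        """Return the same pseudonym for an entity, in every document."""
        entity = self._alias_to_entity.get(alias, alias)
        if entity not in self._entity_to_pseudonym:
            self._counter += 1
            self._entity_to_pseudonym[entity] = f"Participant {self._counter}"
        return self._entity_to_pseudonym[entity]

registry = PseudonymRegistry()
registry.merge("Bob Virtanen", "the CFO", "the finance lead")

# All three surface forms resolve to one pseudonym across all files.
assert registry.pseudonym("Bob Virtanen") == registry.pseudonym("the CFO")
```

The key design point is that the mapping lives outside any single document, so "Bob" cannot accidentally become "Participant 4" in one transcript and "Participant 9" in another.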

Export

The export package includes three outputs:

  1. Anonymised DOCX files. One per uploaded transcript, with all transformations applied.
  2. PDF audit report. Documents every transformation made, the rules applied to each category, the researcher decisions taken, and any passages flagged for risk review. This is the document that goes into your IRB records and supports your methods section.
  3. Excel translation table. The re-identification key, mapping original identifiers to their pseudonyms. This is stored separately from the anonymised data and is the document that needs to be protected under your data management plan.

Using the audit report in your methods section

The audit report is where Skimle Anonymise pays its way in academic contexts. Getting the anonymisation right is necessary but not sufficient. You also need to be able to document and defend what you did.

A Skimle Anonymise audit report contains:

  • The tool and version used
  • The date and time of processing
  • The number of documents processed and the identifier categories detected
  • The transformation rules applied to each category
  • A log of researcher decisions (for any items where the researcher overrode the AI's default or made a manual choice)
  • A list of passages flagged for re-identification risk and the researcher's disposition of each flag
  • Confirmation of whether the translation key was retained or destroyed

This provides the evidence base for a methods section paragraph that might read:

"Transcripts were anonymised using Skimle Anonymise. All six identifier categories (names, titles/roles, locations, organisations, dates, and other identifying information) were detected using AI analysis and transformed according to Level 2 pseudonymisation standards, with all indirect identifiers addressed in addition to direct identifiers. A systematic date-shifting offset was applied to preserve temporal relationships. Cross-document consistency was verified. The translation key is retained by the research team in accordance with the data management plan approved by [Ethics Committee]. An audit report documenting all transformations is available from the corresponding author."

That paragraph answers every question a rigorous methods reviewer will ask. It names the tool, describes the scope of detection, explains the level of anonymisation applied, addresses the specific technical questions (date handling, cross-document consistency), and points to the documentation.

For IRB documentation, the audit report itself is the primary artefact. Ethics boards are increasingly asking for evidence of how anonymisation was carried out, not just an assertion that it was. A PDF audit trail showing the tool, the configuration, and the transformation log is strong evidence of a systematic and documented process.


Practical guidance for different research contexts

The right approach varies by research design, institutional context, and regulatory regime.

Interview-based studies

This is the core use case. For researchers setting up their interview workflow from scratch, the practical setup guide for recording, transcription, and AI-assisted analysis covers how to structure the pipeline before the anonymisation step. For longitudinal interview studies where you return to participants over time, maintain the same pseudonym across all interview rounds. The translation table becomes important for tracking participant identity across the dataset over months or years.

For team research, the translation table should be stored in a secure shared location accessible only to research team members, not in the same folder as the anonymised data. Ideally, only the PI or designated data manager holds the key.

For research where participants are at elevated risk if identified (participants from marginalised communities, whistleblowers, employees discussing sensitive workplace issues), consider Level 2 pseudonymisation as a minimum and review all risk-flagged passages carefully before sharing any data.

Archival and document analysis

Anonymisation is not only relevant for primary interview data. If your qualitative research involves internal documents, memos, policy papers, or other materials that contain named individuals, the same framework applies. Documents often contain identifying information about third parties (people mentioned but not themselves research participants), and these individuals are equally entitled to privacy protection.

Solo researcher vs. team research

A solo PhD student running a small interview study has different practical needs from a multi-site research team. PhD researchers with limited budgets will also find the qualitative research on a PhD budget guide relevant for tool selection decisions at this stage. For solo work, the critical discipline is separating the translation table from the anonymised data from the first day, not as a retrospective clean-up before submission. For team research, the critical discipline is ensuring everyone uses the same pseudonyms, which cross-file consistency tools handle automatically.

EU GDPR context

Under GDPR, pseudonymised data remains personal data and requires a legal basis for processing. For most academic research, this is either legitimate interests or the specific research exemption under Article 89. Your institution's data protection officer can advise on which applies. The key practical point is that pseudonymised transcripts cannot simply be uploaded to any cloud tool without checking whether the processing is covered by your data processing agreements.

Skimle is GDPR compliant with EU-based processing. This matters when you are uploading original (pre-anonymised) transcripts for processing, because the upload itself is personal data processing and it needs to occur within a GDPR-compliant environment.

US HIPAA context

HIPAA's Safe Harbor method requires removal or transformation of 18 specific identifier types, including all elements of dates (other than year), all ages over 89, geographic subdivisions smaller than a state, and several others that are less commonly thought of as sensitive. The Expert Determination method requires statistical analysis of re-identification risk. For most qualitative academic research, the Safe Harbor method (corresponding to Level 3 in Skimle Anonymise) is the practical standard.

If you are working in a health research context and are unsure which standard applies, the HHS guidance on de-identification under HIPAA provides the authoritative reference.


Before you start: a quick checklist

Before beginning your anonymisation process, confirm:

  • Does your IRB protocol specify how anonymisation will be carried out? If so, your process needs to match.
  • Have you agreed on a pseudonym scheme with your team (if applicable) before anyone starts processing?
  • Is the translation table going to be stored separately from the anonymised data, with access restricted?
  • Have you noted the version of any tool you are using? Methods sections require this. For online tools like Skimle Anonymise, record the date of processing in place of a version number.
  • Does your data management plan specify how long the translation table will be retained and under what conditions it will be destroyed?

If you are in the EU, confirm that any tool processing your original transcripts operates under a signed data processing agreement and processes data within the EU (Skimle data is securely stored in Sweden and Germany). If you are in the US and working under HIPAA, confirm whether Safe Harbor or Expert Determination applies to your study.


The audit trail as a research output

Qualitative research is increasingly expected to provide transparency about analytical decisions. The decisions made during anonymisation (which identifiers were treated how, which indirect identifiers were caught and how they were addressed, what judgements were made about re-identification risk) are part of the analytical record. They affect what the data looks like and therefore what analysis is possible.

Transparency in AI tools is a theme that applies just as much to anonymisation tools as it does to coding and analysis tools. Researchers who combine AI-assisted analysis with manual workflows will find relevant guidance in the manual coding and REFI-QDA export guide, including how to structure your analytical record for reviewers. A tool that produces anonymised output without documenting what it did provides less assurance than one that shows its work.

The combination of a systematic process, a documented audit trail, and appropriate storage of the translation key is the foundation of anonymisation practice that will satisfy an ethics board, stand up to reviewer scrutiny, and protect your participants. These are not competing objectives. A process that genuinely protects participants is also the process that satisfies reviewers, because both are asking the same underlying question: was this done carefully and systematically?


Ready to anonymise your qualitative research data with a documented audit trail? [Get started for free](../pricing) with Skimle Anonymise, included in all Skimle plans.


About the authors

Henri Schildt is a Professor of Strategy at Aalto University School of Business and co-founder of Skimle. He has published over a dozen peer-reviewed articles using qualitative methods, including work in Academy of Management Journal, Organization Science, and Strategic Management Journal. His research focuses on organisational strategy, innovation, and qualitative methodology. Google Scholar profile

Olli Salo is a former Partner at McKinsey & Company where he spent 18 years helping clients understand their markets and themselves, develop winning strategies, and improve their operating models. He has conducted over 1,000 client interviews and published more than 10 articles on McKinsey.com and beyond. LinkedIn profile
