Transcribing audio interviews and videos with Skimle

To transcribe audio interviews with Skimle, upload your audio or video file (MP3, M4A, WAV, MP4, or MOV) and receive an accurate transcript within minutes, complete with automatic speaker identification. Skimle supports 100+ languages, handles multi-speaker recordings, and processes everything on GDPR-compliant EU infrastructure. Your source audio or video file is securely deleted after transcription; only the transcript remains in your project.

If you record interviews regularly and spend time dealing with transcription, this guide covers how Skimle's transcription works, why we chose the engine we did, how to get the best results from your recordings, and what to do with the transcript once it arrives. The free trial includes over 3 hours of transcription; after that, it costs less than half the price of typical competing options.

For a full end-to-end walkthrough from recording to analysis, see our practical interview setup guide.

How Skimle's transcription works in practice

The process is straightforward. Log in to your Skimle account, open the Transcripts panel in the left sidebar, and upload your file. Skimle supports the most common audio and video formats:

Audio: MP3, M4A, WAV
Video: MP4, MOV

You do not need to extract audio from a video file before uploading. If you recorded an interview over Zoom or Teams and have the video file, you can upload it directly.
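If you have a folder of recordings from different devices, a quick local check can tell you which files are already in a listed format. The sketch below is only an illustration: the helper name and the sample file names are made up, it looks at file extensions only, and it does not call any Skimle API.

```python
from pathlib import Path

# Formats listed in this guide. The set and helper are illustrative, not part of Skimle.
SUPPORTED_EXTENSIONS = {".mp3", ".m4a", ".wav", ".mp4", ".mov"}


def split_by_support(paths):
    """Return (uploadable, needs_conversion) based on file extension only."""
    uploadable, needs_conversion = [], []
    for raw in paths:
        path = Path(raw)
        # Compare in lower case so "interview.MP3" is treated like "interview.mp3".
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            uploadable.append(path.name)
        else:
            needs_conversion.append(path.name)
    return uploadable, needs_conversion


# Hypothetical file names for illustration.
recordings = ["interview_01.MP3", "zoom_call.mp4", "voice_memo.ogg", "notes.docx"]
ok, other = split_by_support(recordings)
print("Ready to upload:", ok)
print("Check or convert first:", other)
```

An extension check says nothing about whether the file itself is intact, so treat it as a first filter rather than a guarantee.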

Language detection and multi-language support

Skimle detects the language automatically. The transcription engine covers 100+ languages, including Finnish, Swedish, Norwegian, Danish, German, French, Spanish, and Portuguese alongside English. This matters for research teams operating across language markets and for projects with interviews conducted in participants' native languages.

Speaker identification

When a recording contains more than one speaker, Skimle identifies and labels them automatically. The transcript marks each speaker's turns so you can see immediately who said what. Labels default to "Speaker 1", "Speaker 2", and so on; you can rename them to match your participants once the transcript arrives.

Speaker diarisation (the technical term for identifying who is speaking when) works best when speakers have distinct voices and there is reasonable audio quality. It is less reliable when multiple people talk simultaneously or when audio quality is poor. More on that in the recording tips section below.

How long does transcription take?

For a typical 60-minute interview, transcription completes in under five minutes. Shorter recordings are faster; longer ones scale roughly proportionally. You can continue working elsewhere while Skimle processes the file.
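For planning a batch of uploads, the rule of thumb above can be turned into a rough estimate. This is a sketch under the assumption that processing time scales linearly at the "under five minutes per 60 minutes" rate; the constant, the function name, and the sample recording lengths are ours, and real processing times will vary.

```python
# Rule of thumb from this guide: a 60-minute interview finishes in under 5 minutes,
# and longer recordings scale roughly proportionally. This is an estimate, not a guarantee.
PROCESSING_RATIO = 5 / 60  # assumed upper bound: minutes of processing per minute of audio


def estimated_processing_minutes(recording_minutes):
    """Rough upper-bound estimate of processing time for one recording."""
    return recording_minutes * PROCESSING_RATIO


batch = [45, 60, 90]  # hypothetical recording lengths in minutes
for minutes in batch:
    estimate = estimated_processing_minutes(minutes)
    print(f"{minutes} min recording: up to about {estimate:.1f} min of processing")
```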

What happens to the audio file

After transcription completes, the source audio or video file is deleted from Skimle's servers. Only the transcript text remains. This is a deliberate data minimisation choice: transcripts are what you need for analysis, and there is no reason to retain a recording that may contain sensitive material once it has served its purpose.

All processing takes place on EU-hosted infrastructure throughout, which means your interview data does not leave the EU at any stage. For researchers and organisations with GDPR obligations, this removes a common compliance concern with transcription services that route data through US-based infrastructure.

How we chose Skimle's transcription engine

Transcription is one of those decisions that looks simple until you start working with real data at scale.

For qualitative research, transcription quality matters more than people assume at first. Errors in a consumer context (a voice assistant mishearing "set a timer for ten minutes") are annoying but inconsequential. Errors in a research transcript are a different problem. If the transcription mishears a technical term, a product name, or a person's name, you carry that error into your coding and analysis. When you are working with 30 interviews on a sensitive topic, consistent mistranscription of key terminology creates real downstream problems: codes built on misread text, quotes you cannot use verbatim, and time spent correcting rather than analysing.

What we tested

When building Skimle's transcription feature, we tested options across three broad categories:

Cloud-based API services from major providers. These are fast and generally accurate in English, but quality across less common languages varied considerably. Some produced noticeably weaker results in Finnish and Swedish, which are important languages for our user base. Data routing was also a concern with services that process outside the EU.

Consumer transcription tools designed primarily for note-taking and meeting summaries. These are useful for their intended purpose but are not built for research-grade output: they often strip timestamps, do not handle multi-speaker recordings well, and require manual export steps that interrupt the research workflow.

Local open-source options, primarily Whisper from OpenAI. Whisper produces good quality output and has the advantage of processing entirely on your own machine, which makes it attractive when institutional policy requires that audio never leaves your environment. The downside is speed: when it runs locally at roughly real-time speed, a 60-minute interview takes about 60 minutes to transcribe, which adds friction when you have multiple recordings to process. We cover local Whisper setup in the practical interview setup guide as a backup option.

What mattered most

Our criteria, ranked by importance for qualitative research use:

Accuracy across languages: a lot of research is not done in English
Speaker diarisation quality: correctly separating multiple voices
Data security: no processing outside the EU
Speed: fast enough not to interrupt the research workflow
Cost: reasonable at the volume a research project generates, so that we can price transcription affordably

The engine we integrated performed best on the combination of these factors.

Why we integrated it directly and what it costs

We could have left transcription as a separate step and asked users to upload finished transcripts. Many research workflows do work that way. But the friction of managing separate tools, converting file formats, and manually importing transcripts adds up across a project with dozens of recordings.

Integrated transcription means your audio is transcribed to a format that is 100% ready for analysis. It also means you work within a single security perimeter: audio is transcribed inside Skimle, deleted, and the transcript feeds the analysis workflow without leaving the platform or requiring you to sign on to additional services.

Existing research-grade transcription services charge an arm and a leg: NVivo costs EUR 30 per hour and MAXQDA EUR 10 per hour. In Skimle, one minute of transcription consumes 1 credit, so the trial already comes with over 3 hours of free transcription, and on paid Skimle plans the cost works out to around EUR 5 per hour, depending on the plan.
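To see what these figures mean for a whole project, here is a small worked calculation. It is a sketch using only the numbers quoted above: the Skimle rate is the approximate "around EUR 5 per hour" figure and will differ by plan, and the 30-interview project is a hypothetical example.

```python
CREDITS_PER_MINUTE = 1  # from this guide

# Hourly prices in EUR as quoted in this guide.
# The Skimle figure is approximate and depends on the plan.
EUR_PER_HOUR = {"NVivo": 30, "MAXQDA": 10, "Skimle (approx.)": 5}


def project_totals(interview_count, minutes_each):
    """Return (credits_needed, {service: cost_in_eur}) for equal-length interviews."""
    total_minutes = interview_count * minutes_each
    credits_needed = total_minutes * CREDITS_PER_MINUTE
    hours = total_minutes / 60
    cost_by_service = {name: rate * hours for name, rate in EUR_PER_HOUR.items()}
    return credits_needed, cost_by_service


# Hypothetical project: 30 interviews of 60 minutes each.
credits, costs = project_totals(interview_count=30, minutes_each=60)
print(f"Credits needed: {credits}")
for name, cost in costs.items():
    print(f"{name}: EUR {cost:.0f}")
```

For this example project, that is 1,800 credits and roughly EUR 150 at the approximate Skimle rate, against EUR 300 and EUR 900 at the other two quoted rates.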

Tips for getting the best results

Transcription quality is not determined solely by the engine. The recording that goes in shapes the transcript that comes out. A few practical steps at the recording stage prevent most transcription problems.

Recording quality

The practical interview setup guide covers this in full detail, but the essentials are:

Quiet room. Background noise reduces accuracy more than almost anything else. Close windows, turn off fans, and avoid coffee shops for important interviews.
Phone on the table. If you are recording on a smartphone, place it on the table rather than holding it. Position the microphone end (usually the bottom of the phone) toward the speaker.
Right distance. The sweet spot is 30 to 60 centimetres between the microphone and the speaker's mouth. Closer risks distortion; further means the microphone picks up more room noise than voice.
One speaker at a time. Overlapping speech is the main cause of speaker identification errors. It does not need to be a formal turn-taking exercise, but brief pauses between speakers help the diarisation significantly.

For interviews where recording quality is consistently a challenge, a clip-on wireless microphone (such as the RØDE Wireless GO II, around 250 EUR) is a significant upgrade. The practical setup guide covers equipment options in more detail.

Reviewing the transcript before analysis

AI transcription handles accents, varying speech patterns, and moderate background noise well. What it handles less well is proper nouns: specific company names, product names, technical acronyms, and people's names, particularly unusual ones.

Before moving to analysis, read through the transcript with this specifically in mind. You do not need to correct every minor error. An error rate of 2 to 5% is acceptable for qualitative analysis, where you are working with meaning and themes rather than counting word frequencies. Spot-correct the terms that matter: key concepts in your research topic, names you plan to quote, and acronyms that appear repeatedly in your codes.

A useful practical approach: run a word search for the technical terms and proper nouns you expect to appear before you read through in full. Fix those first, then skim the rest. For a 60-minute interview, this typically takes 10 to 15 minutes.
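If you keep a copy of the transcript text outside Skimle, the same word search can be scripted. The sketch below is one possible approach, not a Skimle feature: it counts exact matches for each expected term and uses Python's standard difflib module to flag similar-looking words that may be mistranscriptions. The function name, the similarity cutoff, and the sample text are assumptions for illustration, and it handles single-word terms only.

```python
import re
from collections import Counter
from difflib import get_close_matches


def check_terms(transcript, expected_terms, cutoff=0.75):
    """For each expected single-word term, count exact matches and list similar-looking words."""
    # Lower-case the text and keep runs of letters only (digits and punctuation are dropped).
    words = re.findall(r"[^\W\d_]+", transcript.lower())
    counts = Counter(words)
    vocabulary = list(counts)
    report = {}
    for term in expected_terms:
        key = term.lower()
        # Words that resemble the term but are not identical are candidate mistranscriptions.
        near = [w for w in get_close_matches(key, vocabulary, n=5, cutoff=cutoff) if w != key]
        report[term] = {"exact": counts[key], "possible_misspellings": near}
    return report


# Made-up transcript excerpt for illustration.
sample = (
    "Speaker 1: We moved the team to Skimle last year. "
    "Speaker 2: Yes, Skimel made the coding faster, and Nvivo was retired."
)
for term, result in check_terms(sample, ["Skimle", "NVivo", "MAXQDA"]).items():
    print(term, result)
```

A term with zero exact matches and no near matches may simply not have come up, or it may have been transcribed as something quite different, so the full read-through is still worth doing.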

What happens after transcription

Getting the transcript is the end of the transcription step and the beginning of the research workflow.

If you plan to share transcripts with colleagues, clients, or external reviewers, or if your research involves a promise of anonymity to participants, anonymise the transcripts before doing so.

Skimle Anonymise handles this systematically. It detects identifiers across six categories (names, titles, locations, organisations, dates, and other), applies transformation rules consistently across all files in a project, and produces an audit report that documents every decision. This is the step that manual find-and-replace typically gets wrong: it catches obvious names but misses indirect identifiers (the unusual job title, the regional office, the combination of details that together identify a person).

For business and HR settings, our guide on anonymising interview transcripts for compliance covers the practical and legal dimensions in more detail.

Move to analysis

Once your transcripts are clean and anonymised, you are ready for systematic analysis. Skimle's analysis workflow reads each transcript using structured AI calls to identify insights and themes, builds a unified category structure across all interviews, and links every insight back to the specific quotes that support it.

The how to analyse interview transcripts guide walks through this in detail, from first read to synthesis. For multilingual projects, the multi-language analysis guide covers how Skimle handles analysis across languages and what to watch for when coding interviews conducted in different languages. For teams integrating qualitative analysis into broader AI workflows, agentic chat and MCP shows how Skimle's structured analysis connects to external AI environments after transcription is complete.

On AI you can trust

One question that comes up frequently: how confident should you be in AI-generated outputs, both from transcription and analysis?

It is a fair question. The answer is that transparency matters at both stages. For transcription, you can verify the output against the source audio for any quote you plan to use verbatim. For analysis, every theme in Skimle links back to the specific passages that generated it, so you can check the reasoning rather than taking the output on faith. Learn more about how Skimle works.

Ready to try Skimle's transcription on your own interviews? Start for free and upload your first recording today. Transcription is included in all plans, and you can be working with a finished transcript within minutes.

Want to go deeper on the workflow? Read our end-to-end practical interview setup guide for the full picture from recording to analysis, or jump straight to how to analyse interview transcripts once your transcripts are ready.

Compare options: Best AI transcription tools for researchers in 2026

About the authors

Henri Schildt is a Professor of Strategy at Aalto University School of Business and co-founder of Skimle. He has published over a dozen peer-reviewed articles using qualitative methods, including work in Academy of Management Journal, Organization Science, and Strategic Management Journal. His research focuses on organisational strategy, innovation, and qualitative methodology. Google Scholar profile

Olli Salo is a former Partner at McKinsey & Company, where he spent 18 years helping clients understand their markets and themselves, develop winning strategies, and improve their operating models. He has conducted over 1,000 client interviews and published over 10 articles on McKinsey.com and beyond. LinkedIn profile
