FOIA document analysis with AI: finding the story in 10,000 pages of leaked or released records

Quick answer: To analyse a large FOIA release or leaked document set, upload everything into one workspace, let AI build a theme structure across the set (recurring claims, contradictions between documents), then verify every finding against its source page before writing a word. Tools like Skimle handle the thematic analysis; tools like DocumentCloud handle OCR and publishing source documents alongside your story.

A FOIA response lands in your inbox as a single PDF, 4,200 pages long, with no index. A leaked dataset arrives as a folder of scanned memos in no particular order. In both cases, the actual story is buried somewhere inside, and you will not find it by reading from page one. Investigative journalists, accountability researchers, and the public records officers who support them have built an entire toolkit around exactly this problem: how do you find the four memos that matter inside four thousand that do not, and how do you prove it when someone challenges your story?

Federal agencies received 1.5 million FOIA requests in fiscal year 2024, a 25% increase from the 1.2 million requests filed the year before, according to the Brechner Center for Freedom of Information. More requests means more releases landing on more desks, often in the same unwieldy, unindexed PDF dumps that have always made document-heavy reporting slow. This guide covers how to approach large document sets methodically, where purpose-built journalism tools like DocumentCloud and Overview fit, and where AI-assisted thematic analysis adds a layer those tools were not built to provide.

Why large document releases are a different problem to a single leaked memo

A single leaked document is a research problem: read it, verify it, find context. A 4,000-page FOIA release or a multi-gigabyte leak is a different kind of problem entirely. The story might be in any one of those pages, several pages might contradict each other, and you will not know which until you have read enough to notice the pattern.

The Panama Papers are the reference case for what this looks like at the extreme end. The leak comprised more than 11.5 million financial and legal records, totalling 2.6 terabytes of data, according to the International Consortium of Investigative Journalists (ICIJ). ICIJ coordinated more than 100 media partners working in 25 languages across nearly 80 countries to work through the material over the course of a year. Most newsrooms will never see a leak that size, but the underlying challenge scales down perfectly to a 500-page FOIA release: you cannot read everything closely enough to be confident you found what matters, and you cannot publish a claim you cannot point to on a specific page.

This is also why document-heavy investigative work increasingly depends on a workflow with two distinct stages. The first is triage and verification: get usable OCR text, handle redactions, find the relevant pages, then publish the source alongside the story. The second is analysis: read everything, find the patterns and contradictions, and decide what those patterns mean. Most of the established tools in this space, including DocumentCloud, are built for the first stage. The second stage is where a thematic analysis approach, the kind used across qualitative research more broadly, becomes useful for journalism too.

What journalism-specific tools already do well

Before deciding where AI analysis fits, it helps to know what the existing toolkit already covers.

DocumentCloud, run by the MuckRock Foundation (the platform sat under Investigative Reporters and Editors from 2011 to 2017, then merged with MuckRock in 2018), is the closest thing to a standard for newsroom document handling. It lets journalists upload primary source documents, run OCR (including Amazon Textract integration through its Add-Ons system), annotate pages, redact sensitive material, and publish documents publicly alongside a story, according to DocumentCloud's own about page and its Wikipedia entry. As of May 2023 the platform hosted more than 5 million uploaded documents from newsrooms worldwide. It is built around a single premise: if you are more open about your sourcing, readers trust the reporting more.

Overview (OverviewDocs) was built specifically to address the FOIA-and-leak triage problem. Developed by researchers including Jonathan Stray, it clusters large document collections by content similarity and displays them as an interactive tree, so a journalist working through thousands of pages can see groupings rather than a flat list, according to the original research paper from the project. It adds entity detection, tagging, and full-text search on top. Case studies from the project's own research cite collections such as 625 White House emails and 6,849 State Department cables, and Overview-assisted reporting was a finalist for the 2014 Pulitzer Prize in the Public Service category for an investigation into New York state secrecy laws hiding police misconduct.
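To make "clustering by content similarity" concrete, here is a minimal sketch using TF-IDF weights and cosine similarity with a greedy grouping pass. It is not Overview's implementation; the function names, the greedy strategy, and the 0.3 threshold are all assumptions chosen for illustration, and a real tool would use a more robust clustering method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} vector per document, using smoothed TF-IDF."""
    token_lists = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(docs, threshold=0.3):
    """Greedy grouping: a document joins the first group whose seed
    (first member) is similar enough, otherwise it starts a new group.
    Returns a list of groups, each a list of document indices."""
    vectors = tfidf_vectors(docs)
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical three-document collection, invented for this example.
sample = [
    "Bridge inspection found corrosion on the steel supports",
    "Follow-up bridge inspection confirms corrosion on supports",
    "Cafeteria menu changes for the spring quarter",
]
print(group_by_similarity(sample))  # [[0, 1], [2]]
```

The two inspection memos land in one group because they share most of their vocabulary, while the unrelated memo starts its own. Grouping of this kind helps with triage, but it says nothing about what the documents in a group actually claim.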

Hunchly is sometimes mentioned in the same breath as these tools, but it solves a different problem: it is an automatic web-capture tool for OSINT investigators, recording the pages, screenshots, and metadata of websites visited during an investigation so the evidence holds up later. It does not analyse document dumps; it preserves web-based evidence. Worth knowing about if your investigation involves a lot of open-source web research, but not a substitute for document-set analysis tools.

Tool	Built for	Core strength	What it does not do
DocumentCloud	Newsroom document publishing and OCR	Hosting, OCR, redaction, annotation, public sourcing	Cross-document thematic analysis at scale
Overview	Visual document mining for FOIA/leaks	Clustering, entity detection, topic visualisation	Synthesised cross-document narrative with full source traceability
Hunchly	OSINT web evidence capture	Recording and preserving web pages visited	Document-set or FOIA release analysis
Skimle	Thematic analysis of document corpora	Recurring themes, claims, contradictions, traced to source page	OCR, redaction, public document hosting

How does AI thematic analysis fit into a FOIA or leak investigation?

The tools above solve the problem of finding and preserving the right pages. They do not solve the problem of reading several hundred documents closely enough to notice that three separate memos, written months apart by different officials, describe the same unresolved safety issue in slightly different language, or that an agency's public statement contradicts an internal briefing released six weeks later.

This is the layer Skimle is built for. Upload the FOIA release, the leaked files, or both, as a single project. Skimle reads every document and builds a bottom-up theme structure of what topics recur, what claims are made, and where documents agree or contradict each other, the same approach Skimle applies to due diligence document sets and competitive intelligence research. Every theme links back to the specific document and passage that supports it, so when you write "internal emails show officials were aware of the problem in March, three months before the public statement," you can click through to the exact email and line.

That traceability matters more in journalism than almost anywhere else. A consultant who misreads a data room document embarrasses a client. A journalist who misreads a FOIA release and publishes a wrong claim faces a correction, a credibility hit, or in some jurisdictions a legal letter. Being able to show an editor, a lawyer, or a reader exactly which document and page supports a specific claim is not an analytical nice-to-have. It is the evidentiary basis the story stands on.
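One way to picture that traceability is as a data structure in which no claim exists without a pointer to its source. The sketch below is an assumption about how such a record could look, not Skimle's actual data model; the names `Citation`, `Finding`, and `untraceable` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    document: str  # file name or identifier of the source document
    page: int      # 1-based page number within that document
    passage: str   # verbatim text that supports the claim


@dataclass
class Finding:
    claim: str
    citations: list = field(default_factory=list)

    def is_traceable(self):
        """True only if at least one citation carries a non-empty passage."""
        return any(c.passage.strip() for c in self.citations)


def untraceable(findings):
    """Return the claims that cannot yet be pointed to on a specific page."""
    return [f.claim for f in findings if not f.is_traceable()]


# Hypothetical findings, invented for this example.
findings = [
    Finding(
        "Officials were aware of the problem in March",
        [Citation("emails_batch_2.pdf", 14, "We have known about this since early March.")],
    ),
    Finding("The agency ignored repeated warnings"),
]
print(untraceable(findings))  # ['The agency ignored repeated warnings']
```

Anything the last call returns is a lead to keep chasing, not a sentence for the story.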

A practical workflow for analysing a large document release

1. Get the documents into one place, in readable form. If the release arrived as a single scanned PDF with no OCR layer, run it through DocumentCloud or another OCR tool first. Skimle supports most common document formats, but garbage-in OCR text produces garbage-out analysis regardless of what reads it next.

2. Set out what you are actually looking for. Vague curiosity produces vague results. A working hypothesis, "did the agency know about X before it told the public," or "do the leaked emails show a pattern of officials downplaying a known risk," gives the analysis somewhere to anchor. You can describe this in plain language and let the analysis structure itself around it, or set up predefined categories if you already know the specific claims you are checking.

3. Let the full set get read, not sampled. A 2,000-page release sampled by an exhausted reporter at 11pm on deadline is not the same as a 2,000-page release read in full. This is the biggest single advantage AI analysis brings to document-heavy reporting: completeness. Nothing gets skipped because it appeared in the back half of document 38 of 50.

4. Review themes, consensus, and contradictions. Look specifically for places where documents disagree with each other, or where an internal account contradicts a public one. These contradictions are frequently where the actual story lives, and they are exactly the pattern that is hardest to spot by reading documents one at a time, weeks apart, without a structured way to compare them.

5. Verify every claim against its source before you write it. Click through from the theme to the supporting passage. Confirm the document, the date, and the page. This step is not optional for any story that will face an editor, a lawyer, or a subject's response. If a finding cannot be traced to a specific passage, it does not go in the story until it can be.

6. Keep the rest of your existing workflow intact. Publishing the source documents alongside the story (DocumentCloud), preserving web-based evidence (Hunchly, where relevant), and visual triage of enormous collections (Overview) remain valuable. Skimle is not trying to replace any of that. It sits between "I have the documents" and "I know what they show," which is where a lot of reporting time disappears on a tight deadline.
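Steps 2 to 4 can be sketched in a few lines, under heavy simplifying assumptions. The code below checks every page rather than a sample, looks for hypothesis terms, and lists the pages that predate a public statement. Plain substring matching stands in for thematic analysis here, so it will miss the same issue described in different wording; the `Page` record, the function names, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Page:
    document: str  # source file name
    page: int      # 1-based page number
    written: date  # date of the document this page belongs to
    text: str      # readable (already OCR'd) page text


def scan_all_pages(pages, terms):
    """Check every page, with no sampling, and return those mentioning any term."""
    lowered = [t.lower() for t in terms]
    return [p for p in pages if any(t in p.text.lower() for t in lowered)]


def mentions_before(pages, terms, public_statement_date):
    """Pages mentioning the terms that predate the public statement, oldest first."""
    hits = scan_all_pages(pages, terms)
    earlier = [p for p in hits if p.written < public_statement_date]
    return sorted(earlier, key=lambda p: (p.written, p.document, p.page))


# Hypothetical release, invented for this example.
release = [
    Page("memo_a.pdf", 3, date(2024, 3, 4), "Staff flagged the valve defect again."),
    Page("press.pdf", 1, date(2024, 6, 10), "The agency has learned of a valve defect."),
    Page("memo_b.pdf", 1, date(2024, 1, 15), "Budget notes for Q1."),
    Page("memo_c.pdf", 7, date(2024, 2, 2), "Valve defect remains unresolved."),
]
for p in mentions_before(release, ["valve defect"], date(2024, 6, 10)):
    print(p.written, p.document, "page", p.page)
```

Each line of output is a document and page to open and read in full (step 5), not a finding in its own right.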

What this approach cannot do

AI-assisted thematic analysis surfaces patterns and flags contradictions. It does not establish that a document is authentic, it does not replace source protection practices, and it does not make an editorial or legal judgement about what is safe or fair to publish. Those decisions stay with the journalist and the newsroom. The analysis layer narrows 4,000 pages down to the 40 that matter and shows you exactly where each finding came from. What you do with that, and how you verify it independently before publication, is reporting, not software.

It is also worth being clear about scope: Skimle does not run OCR on scanned documents, manage redactions, or host source files for public access the way DocumentCloud does. If your release arrives as unreadable scans, or if your story requires publishing the underlying documents for reader transparency, you still need a tool built for that. Skimle's role starts once the text is readable and ends before publication: it covers the analysis stage in between.

Frequently asked questions

What is the best AI tool for analysing FOIA documents?

There is no single tool that covers the entire workflow. DocumentCloud is the standard for OCR, annotation, and publishing source documents alongside a story. Overview is built for visual clustering and triage of very large collections. Skimle is built for the analysis layer: reading every document, building a theme structure, and tracing every finding back to its source page. Most serious document-heavy investigations end up using more than one tool, each for the part of the workflow it is built for.

Can AI analyse leaked documents that have not been OCR'd?

Not usefully. If a document is a scanned image with no text layer, any text-based analysis tool, including Skimle, needs the text extracted first. Run scanned material through an OCR step (DocumentCloud's Add-Ons, Amazon Textract, or similar) before uploading it for thematic analysis.
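A quick way to tell whether a release needs an OCR pass is to look at the text your PDF tool extracted for each page. The heuristic below is a rough sketch: it assumes you already have one string per page from whatever extraction tool you use, and the `min_chars` and `min_alpha_ratio` thresholds are illustrative guesses, not tested defaults.

```python
def needs_ocr(page_text, min_chars=40, min_alpha_ratio=0.6):
    """Heuristic: True if a page has too little text, or too few letters
    among its non-whitespace characters, to be usable for analysis."""
    stripped = "".join(page_text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return True
    alpha = sum(ch.isalpha() for ch in stripped)
    return alpha / len(stripped) < min_alpha_ratio


def pages_needing_ocr(page_texts):
    """Return the 1-based page numbers that the heuristic flags."""
    return [i for i, text in enumerate(page_texts, start=1) if needs_ocr(text)]


# Hypothetical extraction results: an image-only page, a clean page,
# and a page where extraction produced mostly symbols.
extracted = [
    "",
    "The committee reviewed the inspection report and agreed to postpone repairs.",
    "|||| .... 0000 ---- //// :::: ;;;; #### 1111 ^^^^ ____",
]
print(pages_needing_ocr(extracted))  # [1, 3]
```

Pages that are legitimately short, such as cover sheets, or mostly numeric, such as tables, will also be flagged, so treat the result as a list to eyeball rather than a verdict.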

How do journalists verify AI-generated findings from a document set before publishing?

By tracing every claim back to its exact source passage and checking it manually before it goes in the story. A workflow with traceability built in, where each theme or finding links to the specific document and page it came from, makes this step fast. A workflow without it means re-reading the relevant documents from scratch to confirm anything the AI surfaced, which defeats much of the time saving.
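Part of that manual check can be backed by a simple mechanical one: does the quoted passage actually appear on the cited page? The sketch below is illustrative only. It assumes you have the page text as a string, and it normalises whitespace, letter case, and curly quotes before comparing. It does not repair words that OCR split with a hyphen across a line break, so a failed match means "look more closely", not "the quote is wrong".

```python
import re


def normalize(text):
    """Straighten curly quotes, collapse whitespace, and casefold."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().casefold()


def quote_on_page(quote, page_text):
    """True if the non-empty quote appears on the page after normalisation."""
    normalized_quote = normalize(quote)
    return bool(normalized_quote) and normalized_quote in normalize(page_text)


# Hypothetical page text, with a line break in the middle of the passage.
page = "Officials were aware of the\nproblem in March."
print(quote_on_page("aware of the problem in March", page))  # True
print(quote_on_page("aware of the issue in March", page))    # False
```

A passing check confirms the wording only. Whether the passage means what the finding says it means is still a human reading.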

Is it safe to upload leaked or sensitive documents to an AI tool?

Check the tool's data handling policy before uploading anything sensitive, including where data is stored, whether it is used to train models, and what retention and deletion practices apply. For source protection reasons, many newsrooms also keep a strict separation between document analysis tools and any platform used to communicate with sources.

How many documents can AI realistically analyse in one project?

This depends on the tool, but modern AI-assisted analysis platforms are built to handle document sets that would take a human reader weeks to get through manually. The constraint is usually not document count but document quality (clean OCR text) and how well the research question is framed before analysis starts.

Ready to find the story in your document set?

Working through a large FOIA release or leaked dataset? Try Skimle for free and see how the analysis surfaces themes and contradictions across the full set, with every finding traceable to its source page.

Want to go deeper on document-heavy analysis methodology? Read our guides on commercial due diligence with large document sets and how transparency builds confidence in AI-assisted findings. If you are an independent reporter or researcher working without a newsroom's resources, see how Skimle fits a solo or informal research workflow.

About the authors

Henri Schildt is a Professor of Strategy at Aalto University School of Business and co-founder of Skimle. He has published over a dozen peer-reviewed articles using qualitative methods, including work in Academy of Management Journal, Organization Science, and Strategic Management Journal.

Olli Salo is a former Partner at McKinsey & Company, where he spent 18 years helping clients understand their markets, develop winning strategies, and improve their operating models. He has done over 1,000 client interviews and published over 10 articles on McKinsey.com and beyond.