Wayback Machine blocking is now the weapon of choice for publishers fighting AI companies, and it’s backfiring spectacularly. Twenty-three major sites, including The New York Times, USA Today, Reddit, and The Guardian, have blocked ia_archiverbot, the web crawler that powers the Internet Archive’s Wayback Machine, according to analysis by Originality AI. The outlets claim archived content is being used to train AI models. In blocking it, they’re erasing one of journalism’s most essential tools for accountability.
Key Takeaways
- 23 major news sites now block Wayback Machine’s crawler, fearing AI model training on archived content
- The New York Times stated archived content is being used “to directly compete with us”
- Internet Archive holds over one trillion archived web pages, used by journalists, courts, and fact-checkers
- A coalition of over 100 journalists, including Rachel Maddow and Taylor Lorenz, signed a letter supporting the Archive
- The Guardian allows crawling but filters archived content from public access and API use
Why Wayback Machine blocking threatens journalism
The irony is almost too perfect: news outlets that built their credibility on accountability journalism are now blocking the very tool that enables it. USA Today used the Wayback Machine to investigate ICE detention policies and track policy changes under the Trump administration. The New York Times itself has relied on the Archive—in 2016, the Wayback Machine exposed the Times quietly revising a Bernie Sanders article without disclosure. Now these same outlets are cutting off access, claiming that archived content trains competing AI models unfairly.
Mark Graham, director of the Wayback Machine at the Internet Archive, captured the contradiction bluntly: “They’re able to pull together their story research because the Wayback Machine exists. At the same time, they’re blocking access.” Publishers justify the blocks as routine anti-scraping measures, but the timing is suspicious. USA Today Co., which operates over 200 media outlets, frames its blocking as standard bot defense, not a targeted attack on the Archive. Yet the selective nature of these blocks, which target the Wayback Machine crawler specifically while allowing others, suggests something more deliberate.
The AI training argument doesn’t hold up
Publishers claim that archived content violates copyright when used to train AI models. The New York Times stated that archived material is being used “to directly compete with us.” But this argument conflates two separate issues: whether AI training constitutes fair use (a legal question still being litigated in courts), and whether blocking the Archive will actually stop that training (it won’t). The Electronic Frontier Foundation has pointed out that blocking the Wayback Machine erases the web’s historical record without actually preventing AI companies from accessing or training on published content. Publishers are already suing AI companies directly over training practices; blocking the Archive is a side effect, not the main battle.
The real problem is that publishers want to control how their content is used after publication. Archived pages represent editorial decisions, corrections, and changes over time. Once content goes online, publishers lose absolute control over it. Blocking the Archive doesn’t restore that control; it just hides the evidence of what was published and when. For a news industry already struggling with credibility, that’s a dangerous move.
Who supports keeping Wayback Machine open
A coalition of over 100 journalists, including Rachel Maddow, Kat Tenbarge, and Taylor Lorenz, signed a letter supporting continued access to the Internet Archive. Their argument echoes how journalism has always worked: “In previous generations, journalists would turn to the physical archives of a local newspaper or of a local public library to access historical reporting and follow the threads of the present back into history.” The Wayback Machine is simply the digital equivalent of those physical archives.
Mark Graham, director of the Wayback Machine, is in talks with blocked outlets to restore access. The stakes are enormous. The Wayback Machine contains over one trillion archived web pages, used by journalists for story research, researchers for historical analysis, courts for evidence, and fact-checkers for editorial accountability. Losing access to that archive doesn’t hurt AI companies; it hurts the people trying to hold institutions accountable.
What happens if more outlets block Wayback Machine
If blocking spreads beyond the current 23 outlets, the consequences compound. Journalists investigating corporate malfeasance, political corruption, or editorial dishonesty lose their ability to prove what was originally published. Fact-checkers can’t verify claims about what a publication said in the past. Researchers studying media narratives over time lose access to the historical record. The web becomes less archival and more ephemeral—which benefits those who want to rewrite history without evidence.
The Guardian’s approach—allowing crawling but filtering archived content from public access and excluding it from the Archive’s API—is a middle ground that suggests compromise is possible. Full blocking is scorched earth. Partial blocking at least preserves some historical record while restricting AI training access. Yet even this compromise is imperfect, since it still limits the Archive’s core mission.
Is Wayback Machine blocking really about AI training?
Publishers claim blocking protects against unfair AI training. But the timing and breadth of blocks suggest a broader concern: loss of control over their content once it’s published online. AI training is the headline, but the real issue is that archived content shows editorial changes, corrections, and reversals that publishers might prefer to hide. A news organization that published a false story, then quietly corrected it, would rather that correction history disappear. The Archive makes that impossible.
This isn’t unique to news. The Wayback Machine has exposed corporate malfeasance, government agency reversals, and institutional hypocrisy across every industry. Blocking it is an attempt to make the internet more forgettable. For publishers claiming to defend truth and accountability, that’s a contradiction they can’t spin away.
Can outlets legally block the Wayback Machine?
Yes, technically. Websites can use robots.txt files or other technical measures to exclude crawlers, including the Internet Archive’s bot. A robots.txt file is a voluntary convention that only works when a crawler chooses to honor it, and The New York Times has gone beyond standard robots.txt measures to block the Archive. Legally, publishers have the right to do this. Ethically and practically, it’s a different story. The Archive respects opt-out requests, which means publishers could have negotiated access terms rather than implementing blanket blocks. That they chose blocking instead suggests either failed negotiation or an unwillingness to compromise.
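To make the robots.txt mechanism concrete, here is a minimal sketch in Python using the standard library's urllib.robotparser. The robots.txt contents, the news.example domain, and the SomeOtherBot name are hypothetical and written only for illustration; they are not copied from any publisher's actual file. The ia_archiverbot name is used as it appears in this article. The sketch shows how a selective rule can shut out one named crawler while leaving every other crawler untouched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not taken from any real site.
# "ia_archiverbot" is the crawler name as reported in this article.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical article URL on a placeholder domain.
    url = "https://news.example/2016/politics/story.html"
    for agent in ("ia_archiverbot", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

A check like this only describes what a well-behaved crawler would do. Nothing in robots.txt enforces the rule, which is presumably why some publishers add server-side blocking on top of it.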
Should you worry about Wayback Machine being blocked?
Yes, if you care about digital accountability. Every time you fact-check a claim about what a website said in the past, you’re relying on the Wayback Machine. Every time a journalist investigates corporate or government misconduct by showing what was originally published versus what was changed, the Archive is the evidence. Losing access to that tool means losing the ability to verify historical claims about what was actually published. In an era of misinformation and editable online content, that’s catastrophic.
How does the Wayback Machine differ from other archive tools?
The Wayback Machine is the dominant public web archive, holding over one trillion pages. While physical newspaper archives and library systems exist, they’re geographically limited and require physical access. The Wayback Machine democratized historical access to the web. There’s no direct digital replacement. Losing access to it doesn’t mean switching to an alternative—it means losing the ability to verify web history at scale.
The fight over Wayback Machine blocking reveals a deeper tension: publishers want the credibility benefits of digital permanence without the accountability costs. They want readers to trust their reporting, yet they block the tool that proves what they actually reported. That contradiction is unsustainable. Either publishers commit to digital transparency, or they accept that blocking the Archive will be seen as an attempt to hide their editorial history. Right now, they’re choosing the latter—and damaging journalism’s credibility in the process.
This article was written with AI assistance and editorially reviewed.
Source: Tom's Hardware


