1.5.2 Use of Data

Recent developments in artificial intelligence are reshaping how internet businesses operate — and how the web itself is preserved. For decades, one quiet norm of the digital world was that web pages were archived for posterity by the Internet Archive’s Wayback Machine. This corner of the internet isn’t just for nostalgia. Like any archive, it serves as a historical record — preserving evidence, accountability, and cultural memory.
That norm is now under pressure. Several news organizations have reported that businesses are increasingly blocking the Wayback Machine from archiving their pages. On February 17, 2026, Mark Graham, Director of the Wayback Machine, responded to these developments in a post on Techdirt, reiterating the organization’s mission: “to preserve knowledge and make it accessible for research, accountability, and historical understanding.”
In October 2025, the Internet Archive celebrated a milestone: the archiving of its one-trillionth web page. At the same time, artificial intelligence was rapidly gaining ground. For many major technology companies, AI represents the future. Google, Meta, and others have invested billions of dollars in building and training large language models (LLMs). AI has already proven useful in areas such as accelerating scientific research at institutions like MIT.
The concern, however, lies in how these systems are trained. AI companies have faced lawsuits over allegations that they trained models on copyrighted materials without permission. According to reporting by The Guardian, there is evidence that Google trained its AI models on content from numerous websites, including archived pages from the Internet Archive.
Robert Hahn, head of Business Affairs and Licensing at The Guardian, told Nieman Lab:
A lot of these AI businesses are looking for readily available, structured databases of content. The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.
AI training has already disrupted the Internet Archive. In 2023, an AI company overwhelmed its servers with thousands of data extraction requests per second, temporarily forcing the archive offline. The company later apologized and donated funds to the Archive.
At the heart of the issue are copyright and permissions. Some companies have been accused of scraping material without authorization, using it to train models that generate derivative content. The New York Times sued OpenAI over alleged unauthorized use of its reporting. News Corp filed suit against Perplexity AI. In these cases, publishers argue that their intellectual property is being exploited without compensation.
Importantly, few frame the Internet Archive itself as a bad actor. The Archive’s mission — preserving the web for posterity — is widely respected. The Guardian, as reported by Nieman Lab, described the Internet Archive as a “frequent crawler” and a “good citizen” in terms of web archiving. Yet even publishers who value archival work have moved to block the Wayback Machine, arguing that archived pages make it easier for third parties to scrape paywalled material.
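In practice, publishers typically implement such blocks through the Robots Exclusion Protocol: a robots.txt file at the site root listing the crawlers a site asks to stay away. A minimal sketch follows. The user-agent tokens shown (ia_archiver and archive.org_bot) are ones commonly associated with the Internet Archive's crawlers, but the exact tokens a given site must list, and whether the Archive honors them in every case, are assumptions here; some publishers block at the network level instead.

  # robots.txt at https://example-publisher.com/robots.txt
  # Ask the Internet Archive's crawlers not to fetch any page
  User-agent: ia_archiver
  Disallow: /

  User-agent: archive.org_bot
  Disallow: /

  # All other crawlers remain unaffected
  User-agent: *
  Disallow:

A directive like this only requests exclusion; compliance is voluntary, which is why some sites pair it with server-side blocking of the crawlers' requests.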
Mark Graham offers a different perspective:
The Wayback Machine is built for human readers. We use rate limiting, filtering, and monitoring to prevent abusive access, and we watch for and actively respond to new scraping patterns as they emerge.
We acknowledge that systems can always be improved. We are actively working with publishers on technical solutions to strengthen our systems and address legitimate concerns without erasing the historical record.
Paywall circumvention is another concern: archived versions of articles may let readers access content without paying. Blocking archivists to protect subscription revenue, however, runs counter to the Internet Archive's mission, and as Nieman Lab noted, many publishers lack the infrastructure to preserve their own historical pages. Poynter has partnered with the Internet Archive to help news organizations preserve their digital records.
Meanwhile, major AI companies have sought to address copyright concerns by signing licensing agreements with publishers. These agreements allow AI firms to scrape and train on content in exchange for payment. Companies that have entered into such contracts include:
OpenAI: Shutterstock, Associated Press, Axel Springer, Le Monde & Prisa Media, Financial Times, Dotdash Meredith, News Corp, The Atlantic, Vox Media, Time, Condé Nast, Hearst, Future, Axios, Schibsted, The Guardian, The Washington Post, Reddit
Meta: CNN, Fox News, People Inc., Reuters
Microsoft: People Inc., Financial Times, Reuters, Axel Springer, Hearst Magazines, USA Today Network, Informa
Perplexity: Getty, Gannett, The Independent, Los Angeles Times, Lee Enterprises, Time, Der Spiegel, Texas Tribune, Fortune
Amazon: Condé Nast, Hearst, The New York Times
Prorata.AI: News Media Alliance, DMG Media, The Guardian, Sky News, Prospect, Financial Times, The Atlantic, Axel Springer, Fortune
Synthesia: Shutterstock
Mistral: Agence France-Presse
Google: Associated Press, Reddit
As this list shows, some publishers contract with multiple AI firms, creating new revenue streams from their archives. Yet companies such as Reddit, which has licensed its content to Google, have blocked the Internet Archive from preserving their pages.
The consequences are significant. If platforms prevent archival access, posts can be edited, deleted, or replaced without public trace. Controversial statements may vanish. Historical accountability becomes harder to enforce. While companies claim they are not targeting archivists specifically, the practical effect is a narrowing of public memory.
The internet began as a project grounded in the free flow of information. Increasingly, that information is enclosed behind paywalls, licensing deals, and access controls.
For Graham, the stakes extend beyond copyright disputes.
Generative AI presents real challenges in today’s information ecosystem. But preserving the time-honored role of libraries and archives in society has never been more important. We’ve worked alongside news organizations for decades. Let’s continue working together in service of an open, referenceable, and enduring web.
In the race to monetize AI and protect intellectual property, the question remains: who preserves the past — and who decides what disappears?
Sources
Techdirt, Mark Graham response:
https://www.techdirt.com/2026/02/17/preserving-the-web-is-not-the-problem-losing-it-is/
Nieman Lab, on publishers limiting Internet Archive access over AI scraping concerns:
https://www.niemanlab.org/2026/01/news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns/
Press Gazette, "Meta signs raft of AI content licensing deals"
The Conversation, on businesses locking out the Internet Archive
MIT School of Science, on AI in scientific research:
https://science.mit.edu/researchers-explore-mutual-benefits-of-ai-and-science/