1.2.10
Newsjunkie.net is a resource guide for journalists. We show who's behind the news, and provide tools to help navigate the modern business of information.
Use of Data
The Internet Archive is an independent digital library. With a collection spanning billions of websites, books, movies, music and software, IA is among the largest digital repositories in the world. Its stated mission is to provide “Universal Access to All Knowledge.”
Archive.org is the entry point to this vast collection. It operates online services that search, access and expand the information base, including the Wayback Machine (historic webpages) and the Open Library (digitized books). The extensive record of ephemeral media, including billions of websites captured at key moments over decades, comprises a unique asset for researchers and historians. “Every librarian has two things in their soul: preservation and access,” said Brewster Kahle, the Internet innovator who founded IA in 1996. “We try to put things in context so people know how to relate to older materials.”
Through Archive-It, launched in 2006, IA has helped digitize physical libraries and the collections of various cultural heritage institutions. Over 800 groups, including Caltech, the Southern Poverty Law Center, the American Academy of Pediatrics, and the New York State Archives, use Archive-It to store and manage their collections. The data captured varies according to the needs of the organization, but everything is added to IA’s database, which maintains primary and secondary copies of all the material. “A lot of a librarian’s job is just to keep things available in shifting times,” added Kahle of the wide-ranging digitization efforts.
Keeping things on the shelves also involves recording government documents. An End of Term (EOT) Archive has been hosted on IA during the U.S. presidential transitions since 2008. It captures material from federal sources that are “at risk of changing (i.e., whitehouse.gov) or disappearing altogether during government transitions.” These documents come from a select list of government websites that researchers and institutions, using Archive-It, record and upload to an archive accessible to the public through the Wayback Machine.
IA operates digitizing and storage centers around the world, where it houses multiple copies of more than 150 petabytes—150,000,000 gigabytes—of data.
Near San Francisco’s historic Presidio, a decommissioned Christian Science Church now houses the Internet Archive offices. Looming Doric columns decorate the facade of the grand white building, echoing IA’s Library of Alexandria-esque logo. The complex also serves as one of several locations for the group’s digitizing equipment and storage servers.
The Wayback Machine, launched in 2001, is the most popular way to access IA’s treasures. It provides a publicly searchable database of over 900 billion webpage snapshots. Anyone can upload snapshots to the archives, but the majority of these pages are logged by webcrawlers, or programs that systematically search and record websites.
IA also operates Open Library, a project with the stated goal “to make all the published works of humankind available to everyone in the world,” particularly by creating a webpage for every published book. It contains more than 28 million works. For titles in the public domain, there are public-facing webpages with the entire text. For other, newer works, Open Library offers a digital version of a traditional lending service, where users can borrow copies.
Its massive collection of varied data has made IA an appealing source for groups building artificial intelligence models, which require enormous troves of information to refine their products. Kahle said that most major players in the AI world have used archival materials to train their models.
Other significant IA projects include the TV News Archive, a collection of televised news broadcasts. Its first public-facing project was a record of news broadcasts covering the 9/11 terrorist attacks. IA’s website describes the project as cataloging an “ephemeral medium” similar to the Internet.
“As things are moving more digital, libraries are being counted out,” said Kahle. “But I think we’re in a great position if we do things right to allow people to use library materials at scale in really interesting and useful ways. That will require empowering the cultural heritage sector.”
Kahle, 64, founded IA with the intention of archiving as much of the Internet as possible. From his undergraduate education at MIT in the 1980s where he studied in the university’s AI lab, to his early years in Silicon Valley, he witnessed the Internet’s accelerating evolution. This prompted him to want to preserve elements from the Internet for posterity. To that end, he started IA around the time he founded his first company, Alexa Internet, a web traffic analysis company. He sold Alexa to Amazon in 1999 for $250 million in stock.
Kahle contrasted IA’s mission with the publisher-first landscape of today’s entertainment and news industries, in which companies determine both how their products can be consumed and the narrative around their products’ history. Under that system, said Kahle, “you can change the past.”
“You have a generation raised on screens, but let’s just take the 20th century. It’s not on the Internet,” he added. “If you can’t quote things, you can’t put them into a new context. You can’t think critically. And when you’re talking about what we think—with news, books, reference materials—they can be changed at any time by one player. That should be frightening. Traditionally, libraries have been the antidote to that.”
Brewster Kahle is clear about the mission. “Our job is to try to keep the good works of humankind available to whoever comes looking.” Ω
As a 501(c)(3) nonprofit, Internet Archive is funded primarily by donations. It publishes a list of significant contributors, such as the National Science Foundation, the Democracy Fund and the Andrew W. Mellon Foundation.
More About End of Term Harvest and public data archives
Sources
Newsjunkie. Brewster Kahle interview by Andrew Checchia, Jan 17, 2025
KALW. Internet Archive stores our digital history
ProPublica. Internet Archive financial and tax profile
The Verge. Podcast: Mark Graham on AI, linkrot
Newsjunkie. Mark Graham interview
Internet Archive. About Internet Archive
Internet Archive. TV news - general archive
Internet Archive. TV news - 9/11 archive
Internet Archive. Granted official library designation by California (Jun 25, 2007)
Internet Archive. Brewster leads tour of IA scanning center (Mar 29, 2013)
Open Library. Welcome to Open Library
Archive-It. What you can do with your Archive It account
TED. Video - Brewster Kahle: A digital library, free to the world (2007)
© 2025 Newsjunkie.net