1.2.10
Newsjunkie.net is a resource guide for journalists. We show who's behind the news, and provide tools to help navigate the modern business of information.
Use of Data
The Internet Archive is an innovative and capacious repository of all things digital. Since 2008 it has anchored a coalition called End of Term Archive, consisting of institutional libraries and data services working to capture and preserve federal research websites during the months before and after presidential transitions.
Mark Graham, director of the Wayback Machine, the aptly-named website capture/retrieval system within the Internet Archive’s labyrinthine libraries, sat down with Newsjunkie’s Angie Coiro to discuss the scope and scale of the End of Term project, and how the public benefits.
Mark Graham: The End of Term Archive is for everyone. It's for the world. It's for the history books. It’s part of larger efforts by the Internet Archive and other organizations and individuals to help preserve more of what our society creates, ensuring that this material is available not just for future generations after we're long gone, but even for next week. To be able to reference and understand where we've been, and where we're going.
It's definitely more politicized. People have the example of what happened when Donald Trump assumed the presidency the first time. There was a fair amount of research on the archives made principally by EDGI, the Environmental Data and Governance Initiative.
A lot of press at the time documented changes made to what was available from government websites. I think people look at that near history, and at other signposts along the way, as harbingers of what may come to pass.
So there's definitely a heightened sensitivity and an awareness of some of the risks we face. But to a large degree, nothing has changed. We're still doing what we have been doing since 2008—collectively working to help to archive much of the material made available by the U.S. government. That really hasn't changed in any kind of material way. Of course we want to archive the at-risk material first, helping to ensure a degree of completeness around that.
But how does one identify what at-risk material is? You don't necessarily know until it's gone, and then it's too late. So you cast your net widely; try to be informed by past events, or by signals you might have that certain material might be at risk.
Now, just because something might be considered at risk doesn't necessarily mean there's a malevolent factor behind it. It could be a completely benign motivation. For example, the CNN front page changes every few minutes. There are lots of normal reasons websites change material. The White House website will absolutely change immediately after the inauguration. Because it's under new management! The names of the people there, the schedules—they're all going to be different. In a perfect world, it means it's just been moved around. It doesn't mean it was deleted.
One of the challenges is that the web has no inherent change-control mechanism. It has no ordered set of rules, policies, or practices about how this might be done: how updates might be made, how material might be moved around, how live archives might be organized and presented.
So people will go to a URL that used to work and now it doesn't, and think immediately, “Oh my, this material was deleted,” when in fact, it was simply moved to some other part of that website.
And there are those occurrences where material might actually be simply removed, or even more challenging, changed. In a news article, for example, there might be a spelling error or some other minor error that gets corrected or edited after the fact. Not all news organizations are going to put a little note on that page saying, “we made this change on this date,” et cetera. They'll just make the change and republish the page.
So one might archive a page one day, then another day or another week archive the same page. It might look pretty much the same, but there might be subtle differences in it. There's no mechanism in the web to be alerted about that. With the Wayback Machine (a time machine for the web that periodically backs up large amounts of the public web), we have built tools to help people identify and visualize changes to individual web pages over time.
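To make the idea of "subtle differences" between two captures concrete, here is a minimal sketch using Python's standard `difflib`. It is not the Wayback Machine's own comparison tooling; the function name `changed_lines` and the sample text are invented for illustration, and it assumes the two captures have already been reduced to plain text.

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> tuple[list[str], list[str]]:
    """Return (removed, added) lines between two text captures of one page."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # autojunk=False keeps the comparison exact even for long, repetitive pages.
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    removed: list[str] = []
    added: list[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return removed, added


if __name__ == "__main__":
    # Invented sample captures, one week apart.
    week_one = "Office hours: 9 to 5\nContact: press office"
    week_two = "Office hours: 10 to 4\nContact: press office"
    removed, added = changed_lines(week_one, week_two)
    print("removed:", removed)
    print("added:", added)
```

A line-level comparison like this only flags that something changed; judging whether a change is a routine correction or something more significant still takes a human reader.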
The project lives within the context of the larger web archiving the Internet Archive does, in partnership with more than a thousand librarians through our Archive-It program. These are libraries, museums, governments, and others we work with, doing their own job of curating lists of webpages and websites to be archived. That's an ongoing effort that happens day in and day out, and will continue independent of the End of Term Archive project.
Another key difference: in the End of Term Archive, the files we store are made publicly available. These can be downloaded by researchers from the Internet Archive directly. In addition, we upload these files to Amazon Web Services. They have an open data program where material that's publicly available can be posted, and researchers can download the material from that source as well.
The End of Term archive is one of many projects where we focus on a particular collection of websites or genre for a period of time. It's defined by the specific websites we're addressing, and particular time periods where we will work to get a lot of the material. It's very hard to understand in many cases what “most” or “all” is for a website. But we endeavor to get a lot of the material archived from tens of thousands of websites identified as published by the U.S. government.
We worked with the Library of Congress, the National Archives and Records Administration, and the U.S. Government Publishing Office to identify the sites. We put out calls for people to nominate websites or web pages that they thought might be at risk, or that we should pay special attention to. Then we crawl and archive the pages and sites for the time leading up to the presidential election. Now we're doing a second phase of this project: post-election, pre-inauguration. Then we're going to do a third pass after the inauguration of the president on January 20th.
Sometimes we've done it just before the election, and after the inauguration. We don't expect material is going to change very much prior to the inauguration. But there will be significant changes in those first few hours of the new presidency: think of the “under new management” signs you would expect to see. Of course there will be more changes that take place gradually over time.
Generally speaking, all of the material archived into the Wayback Machine is publicly available. You're able to see archives going back over time: a web page at a particular time and date, which might be 10 years past, and replay that page.
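Replaying a page "at a particular time and date" corresponds to the Wayback Machine's public replay address pattern, `https://web.archive.org/web/<timestamp>/<url>`, where the timestamp is a 14-digit `YYYYMMDDhhmmss` value. The sketch below only builds such an address; the helper names and the example URL are assumptions for illustration, and the timestamp is treated as UTC. The service itself redirects to the capture closest to the requested moment.

```python
from datetime import datetime

WAYBACK_REPLAY_PREFIX = "https://web.archive.org/web"


def wayback_timestamp(moment: datetime) -> str:
    """Format a datetime as the 14-digit YYYYMMDDhhmmss string used in replay URLs."""
    return moment.strftime("%Y%m%d%H%M%S")


def replay_url(page_url: str, moment: datetime) -> str:
    """Build a replay address for page_url as captured near the given moment."""
    return f"{WAYBACK_REPLAY_PREFIX}/{wayback_timestamp(moment)}/{page_url}"


if __name__ == "__main__":
    # Hypothetical page, requested as it looked around noon on 20 January 2015.
    print(replay_url("https://www.example.gov/", datetime(2015, 1, 20, 12, 0, 0)))
```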
In the case of the End of Term archive, we do make this material available. How much material are we talking about? Since 2008, we’ve gathered a little bit more than a half a petabyte of material. This round, we're going to get a half a petabyte just in this pass.
In other words, this time around the End of Term Archive will be equal in size to all of the other End of Term Archive projects we've done since 2008. Why? Well, there's more material available. As the web gets older, more material has been added; but it's also a reflection of our efforts to do an even better job of trying to get more of the material, and assuring it is preserved.
It's been fairly well-documented that there were changes made around definitions of climate change. For example, there was material produced for teaching and learning on the EPA website that discussed climate change—that was removed. And there were other subtle changes in how things were described—for example the words used to describe climate change. In many cases, it was changed to the word “weather,” as just one example.
In some ways this is par for the course. You would expect that any administration, any politician, any person in any dimension of responsibility is going to put forward their own brand, their own priorities, emphasize more of this and less of that.
What I'm focused on is working very hard to archive this material, make it available to researchers and others so that they can do the analysis and identify what the changes might be, and help to make them understandable to people—so that they don't just jump to the conclusion, “Oh, this administration is trying to suppress this information or delete this bit of history.” That might be true. But I don't think that's the majority of the cases that we're looking at here.
The web itself is a constantly changing ecosystem. It's this information space managed through URLs, through web pages that don't have any kind of predictable way to have changes be represented. That mechanism doesn't exist for the general web; ergo, the Wayback Machine.
I'm an archivist. So my number one, number two, and number three job is to hit record on this digital landscape and ensure that those recordings are preserved for future generations. And then work to make what is preserved discoverable by people and make it useful.
If I reduce down the Wayback Machine to a mission statement, it would be something like, “to help make the web more useful and reliable.” In that vein, the End of Term archive is a special focus on helping make material published by the U.S. government more useful and more reliable.
Past participants in this project were the Library of Congress, the U.S. Government Publishing Office, and the National Archives and Records Administration. They are not participants at this time; they have been historically, but not now. At the same time we have some new participants. One would be the Common Crawl Foundation, and also the Library Innovation Lab at Harvard University. Those are in addition to participants I mentioned earlier: EDGI, the University of North Texas, Stanford Libraries, and some others.
This is a coalition of the willing, if you will. Organizations came together and pooled their resources and work every four years on this effort.
Maybe a little bit. But generally speaking, this comes out of the general funds of the Internet Archive, which enjoys support from more than 150,000 individuals who have contributed money this last year. Some foundations, the Filecoin Foundation for the Decentralized Web in particular, helped fund this particular phase of the End of Term Archive.
There are opportunities to use AI to analyze material after the fact but we're not using any AIs right now, either in identification of the URLs of the websites to archive, or the actual archiving process. I'm certain AI will be used by researchers and others to help analyze and understand the material, to help make it more useful and accessible.
An example of how we are using AI: in addition to archiving much of the public web, we archive television news from around the world, from Russia, Belarus, Ukraine, Iran, et cetera. We're archiving material from, in many cases, state-run national news channels within those countries, in the language of those countries: Farsi, Russian, what have you. We’re using a variety of AIs in these pipelines to, for example, transcribe the audio into text.
So you start with Russian audio and translate that into English. Then take all of that for a period of time, for example a day, and identify what the unique stories were. Then take those 10 or 15 or 20 key stories over, say, a day, and summarize them: maybe a one-paragraph summary with a synthetically created headline for that particular story. So that's maybe four different AIs that we're currently using to help make, for example, material from Russian news more accessible to a global audience.
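The four stages described here (transcribe, translate, group into stories, summarize) can be pictured as a simple chain. The sketch below shows only that chaining; it is not the Internet Archive's pipeline. Every name in it is hypothetical, and the four stage functions are passed in by the caller because the interview does not say which models or services are used.

```python
from typing import Any, Callable, Iterable


def run_news_pipeline(
    audio_clips: Iterable[Any],
    transcribe: Callable[[Any], str],
    translate: Callable[[str], str],
    group_stories: Callable[[list[str]], list[list[str]]],
    summarize: Callable[[list[str]], dict],
) -> list[dict]:
    """Chain four caller-supplied stages over one period (e.g. a day) of clips."""
    # Stage 1: audio to text in the original language.
    transcripts = [transcribe(clip) for clip in audio_clips]
    # Stage 2: original-language text to English.
    english = [translate(text) for text in transcripts]
    # Stage 3: group the period's segments into distinct stories.
    stories = group_stories(english)
    # Stage 4: one summary (with a generated headline) per story.
    return [summarize(story) for story in stories]


if __name__ == "__main__":
    # Toy stand-ins so the chain can be run without any real model.
    result = run_news_pipeline(
        ["clip-a", "clip-b"],
        transcribe=lambda clip: f"text({clip})",
        translate=lambda text: f"en({text})",
        group_stories=lambda texts: [texts],
        summarize=lambda story: {"headline": f"{len(story)} segments", "summary": " ".join(story)},
    )
    print(result)
```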
No, this is pretty widely supported. I can't think of any time when anyone has said, “Do less of that,” or “We don't want that to be there.” That would be very disturbing. And it wouldn't go unnoticed. If it was material published by the U.S. government, it would ostensibly be a U.S. government entity that would make that request, or a particular politician. It wouldn't be very wise in the United States in any kind of political climate, because there would be attention drawn to it. And I don't think it would be well received. So no, thankfully not.
But there are very practical examples of where material was available and then all of a sudden it's not, and you could ascribe some political motivation to this. I'm just going to say this is how the system works. Select committees, for example, of Congress: in the last Congress there were four select committees established, one investigating the events around January 6th. When the new Congress came into power, one of the first things they did was eliminate that website. They eliminated the websites of all four select committees. There was a period of time where, if you went to the URLs for those committee websites, you got a “page not found,” with no redirect or hint of where to go. After a period of time, at least one of those committee websites was moved to a member's website. So it was like /archives/ something or other, and you could find it. Also, the National Archives and Records Administration archives much of this material as well, although they may not make it available immediately.
In the case of the Wayback Machine, the material is available pretty much immediately, within minutes of it being archived into the Wayback Machine. The moment they were not available on the live web, you could get to them from the Wayback Machine.
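When a live URL starts returning "page not found," one way to check for an archived copy programmatically is the Wayback Machine's availability endpoint, `https://archive.org/wayback/available`, which returns JSON describing the closest capture. The sketch below builds the query address and reads such a response; the function names are illustrative, the committee URL is hypothetical, and the sample response is hand-written to mirror the documented shape rather than copied from a real lookup. Fetching the address is left to the reader.

```python
import json
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the availability lookup address for page_url (timestamp is optional)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the replay URL of the closest capture, or None if none is reported."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


if __name__ == "__main__":
    print(availability_query("example.gov/committee", "20230103"))
    # Hand-written sample response, not real data.
    sample = json.dumps({
        "archived_snapshots": {
            "closest": {
                "available": True,
                "url": "http://web.archive.org/web/20230101000000/http://example.gov/committee",
                "timestamp": "20230101000000",
                "status": "200",
            }
        }
    })
    print(closest_snapshot(sample))
```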
“All” is a challenging word. We get a lot of it done. I'm sure there will be some things that for technical reasons represent challenges. Or places where we are not going to go deep enough or far enough.
“All” is probably exabytes of material. There's a lot of material that's out there. We're going to archive a few hundred terabytes of this material—material that's available in text format, HTML, et cetera. It doesn't take a lot of terabytes to add up to a lot of material.
When you start getting into things like U.S. Geological Survey data, weather data, NASA data from telescopes, or high-energy physics data, the numbers start to get very big. We're not trying to get that material.
So no, we're not going to get all of it, but that's okay.
Yes. One can always help to archive the Web. Use the Wayback Machine. We have a feature on the right-hand side called Save Page Now. Anyone can enter a URL. Or you can use a Google Sheet to enter a lot of URLs, then archive those through Save Page Now. Those won't necessarily make their way into the End of Term Archive project, but they will make their way into the Wayback Machine.
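For readers who keep their own list of pages to save, here is a minimal sketch of preparing that list for Save Page Now. It relies on the public `https://web.archive.org/save/<url>` address form; opening such an address asks the Wayback Machine to capture the page. The helper names and example URLs are invented, this is not the Google Sheets workflow mentioned above, and rate limits or sign-in requirements are not modeled.

```python
from typing import Iterable

SAVE_PAGE_NOW_PREFIX = "https://web.archive.org/save/"


def clean_url_list(raw_lines: Iterable[str]) -> list[str]:
    """Strip whitespace, drop blank entries, and remove duplicates, keeping order."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for line in raw_lines:
        url = line.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


def save_page_now_urls(raw_lines: Iterable[str]) -> list[str]:
    """Turn a list of page URLs into Save Page Now request addresses."""
    return [SAVE_PAGE_NOW_PREFIX + url for url in clean_url_list(raw_lines)]


if __name__ == "__main__":
    # Hypothetical pages, as they might be pasted from a spreadsheet column.
    pasted = ["https://example.gov/a", "", " https://example.gov/a ", "https://example.gov/b"]
    for address in save_page_now_urls(pasted):
        print(address)
```

Removing duplicates before submitting keeps the number of capture requests small, which is considerate toward a shared public service.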
Try practicing archiving some material. One of the partners in this project is Webrecorder. You could download their software and do some archiving locally on your own hard drive. It’s a way to get your feet wet and start experiencing what it's like to be a citizen archivist.
So I say just do it. Start using the archives by searching for things, follow your curiosity and explore what's there.
Published January 22, 2025
Source
Angie Coiro interviewed Mark Graham on December 27, 2024.
© 2025 Newsjunkie.net