Archiving the Web

Archiving the web is an important mission, and the resources it requires are immense. It is thanks to a select few that it gets done at all.


If I were to reach into the history of the world to find some kind of precedent for the World Wide Web, I might pull out the Library of Alexandria. It was built circa 300 B.C. by the ancient Greeks, and its goal was ambitious, if not unique: to house the collective knowledge of everything that ever was. The library was packed with over 700,000 scrolls spanning every known topic and originating from every known region. Scholars came from all over to study there. If there was some nugget of information, no matter how obscure, it could be found at the Library of Alexandria. Its grand design might not be a direct influence on the web, but they are certainly cut from the same cloth.

As the legend goes, a few hundred years later, the Library of Alexandria was destroyed in a fire ignited by invaders, and all of those collected learnings evaporated in the smoke. That’s why it’s often used as a symbol and a warning whenever humanity faces a similar loss.

The Library of Alexandria burns down

Except the Library of Alexandria didn’t burn down all at once. Historians have discovered that the process was far more gradual than that; it was years and years of neglect and small, isolated incidents that, over time, led to the library’s rather unspectacular demise. Which is really interesting. Because we’re talking about the web here. A web that has been decaying from the edges since the beginning, as sites blink out of existence, as pages are replaced, or worse, vanish with nothing to show except a 404 status and an error that reads “Not Found.” There’s a word for it. It’s called link rot. And it’s a massive problem on the web.
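
Link rot is easy enough to see for yourself. A minimal sketch in Python, with a placeholder list of URLs, that flags pages answering with an error (or not answering at all):

```python
# A minimal sketch of detecting link rot: request each URL and flag
# anything that answers with an error status or doesn't answer at all.
# The URL list is a placeholder; some servers reject HEAD requests,
# so a production checker would fall back to GET.
import urllib.error
import urllib.request

def is_rotten(url: str) -> bool:
    """Return True if the link appears dead (4xx/5xx or unreachable)."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status >= 400
    except (urllib.error.URLError, OSError):
        return True  # covers HTTPError (like a 404) and network failures

for url in ["http://example.com/", "http://example.com/long-gone"]:
    print(url, "->", "rotten" if is_rotten(url) else "alive")
```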

Incidentally, the Library of Alexandria is the through line of Brewster Kahle’s career. Kahle began working with computers, and the Internet, in the mid-’80s, when he co-founded the company Thinking Machines. His first major project was known as Wide Area Information Servers (WAIS).

Like the web (even though it predated it), WAIS allowed users to connect from a client to a remote server and access the documents stored there. But WAIS was a decentralized system with an emphasis on search. With a single text field, users could search through hundreds of WAIS servers and line up a list of relevant results. It was, in essence, a way of collecting knowledge in a readable and accessible format and then disseminating it into the minds of users as effectively as possible. A digital descendant of the Library of Alexandria.

WAIS also included a built-in version control system that allowed users to navigate through previous versions of documents stored on servers. A small piece of the puzzle that foreshadowed the rest of Kahle’s career.

Thinking Machines sold to the Oracle Corporation, and Kahle got to work on his next project. This time he dropped all pretense and named the project Alexa, after his most palpable influence. Alexa crawled the web, slowly and deliberately, analyzing user patterns as it went. It used this data to help users find their next destination from a webpage, based on the decisions everyone else had made when they reached that same page.

As Alexa snaked its way through the web, it would slowly deposit each page (on a six-month delay) into another project of Kahle’s. A project he called the Internet Archive.

The Internet Archive is an archive for the Internet. Which might sound like a matter-of-fact concept, but it is limitlessly complex. The Internet Archive’s goal is to store a copy of every single website. And not just one copy. Hundreds of them. Thousands of them. Copies of every single webpage, and every single change ever made to that page. All housed together on cheap, durable, and distributed storage so that our history is never lost.

The Wayback Machine when it first launched

Alexa and the Internet Archive were tied together until 1999, when Alexa was officially sold to Amazon. The day after the sale, Kahle created an independent non-profit for the archive, and it’s been running on its own ever since. The tools have improved, and the organization has grown, but its purpose has remained constant and substantial. To date, it has captured hundreds of billions of webpages. One page at a time.

The Internet Archive isn’t just webpages, either. Over the years, it’s amassed digital records of books, videos, TV, images, and music. But the web has always been the backbone of the archive. In 2001, the web archiving section of the site broke off into its own entity, one that most of you are likely familiar with: the Wayback Machine.

The Wayback Machine has a pulse, rhythmic and steady, as it loops endlessly through the spaces of the web. At each stop, it collects what it can, grabbing a snapshot of a webpage frozen in the moment of its capture. It’s not always perfect, in fact it rarely is, but it is essential. Because the average lifespan of a webpage is a mere 100 days. We add 500TB of data a day to Facebook alone. There are over 20 petabytes of data in the Internet Archive. That’s 20,000,000,000,000,000 bytes. The numbers are staggering, and the effort is never-ending.

It’s an effort that is underwritten by an army of historians, librarians, and archivists who are given the rare opportunity to manually deposit sites into the archive, earmarking them for special insertion. There are times when their expediency is needed, when a fleeting moment needs to be recorded. Like in 2014, when a Ukrainian separatist group posted about their involvement in an attack on a Boeing 777. The post was deleted within hours, but before it vanished, an Internet Archive volunteer deposited the page into the archive for the official record.

And there are others that have answered the call. In 2009, Yahoo up and decided it was time to put an end to Geocities. Geocities, a site that brought hundreds of thousands of people online for the very first time. That offered free digital space for users to piece together unique, wondrous, though sometimes ugly fragments of HTML into the smallest of digital corners. Yahoo was going to delete every single one, all at once. So Jason Scott decided to step in.

Jason Scott

Scott, one of the more charismatic characters of the web, had been interested in archiving, in a general sense, from a pretty young age. In the late ’90s he downloaded a bunch of messages from just about every bulletin board system (a whopping 12MB in all) and hosted them all on the site textfiles.com. The Geocities news struck a chord. Scott gathered up a few like-minded archivists who didn’t want to see a massive piece of the web splinter off, and began downloading Geocities. All of it. One site at a time. By the time Yahoo pulled the plug, they had managed to download the whole thing (you can still see it mirrored at Reocities). Out of this communal spirit came the Archive Team.

The Archive Team is a group of individuals who run into the incessant fire of link rot and save what is most precious. Whenever a company decides it’s lights out for a popular service or site, the Archive Team rallies its volunteers to save it. They write a script that will crawl the site and download its contents into one place. Then a whole bunch of people install a virtual machine on their computer that runs the script in tandem, crowdsourcing bandwidth and downloads from its users. You can do it too.
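
To make the idea concrete, here is a rough sketch, in Python, of the crawl-and-save loop at the heart of such a grab. It is an illustration only, not the Archive Team’s actual tooling (their volunteers run a purpose-built virtual machine fed by a central tracker), and every name in it is a placeholder:

```python
# A rough illustration of the crawl-and-save idea behind a grab:
# pull URLs from a queue, download each page, and write it to disk
# with a timestamp before the site goes dark. The queue and output
# directory here are placeholders; a real effort coordinates many
# volunteers' machines through a central tracker.
import time
import urllib.request
from pathlib import Path

def grab(url: str, out_dir: Path) -> None:
    """Download one page and save it under a timestamped filename."""
    stamp = time.strftime("%Y%m%d%H%M%S")
    safe_name = url.replace("://", "_").replace("/", "_")
    req = urllib.request.Request(url, headers={"User-Agent": "archive-sketch"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        (out_dir / f"{stamp}_{safe_name}").write_bytes(resp.read())

out = Path("grabbed")
out.mkdir(exist_ok=True)
for url in ["http://example.com/"]:  # stand-in for a tracker-fed queue
    grab(url, out)
```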

For almost ten years, the Archive Team has been saving data before it can be ripped away. They’ve done it for Geocities, MobileMe, .Mac, Posterous, and plenty of others. And they’re still going.

There’s also Perma.cc, a project out of Harvard Library that offers a simple solution to what’s commonly referred to as reference rot. As information has migrated from the physical world to the digital web, it’s become the basis for far more citations by lawyers, academics, and historians. When sites disappear, though, those citations are rendered invalid. Perma.cc bridges the gap, allowing its users to manually deposit pages into an archive, dated appropriately and automatically appended to citations. It’s a simple and elegant solution to a problem that’s not going away anytime soon.
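
For the curious, depositing a page programmatically looks something like the sketch below. Treat the endpoint and payload as assumptions based on Perma.cc’s public API documentation, not gospel; the API key is a placeholder:

```python
# A hedged sketch of depositing a page via Perma.cc's REST API. The
# endpoint, header, and payload shape are assumptions drawn from
# Perma.cc's public developer docs; verify them there before use.
# PERMA_API_KEY is a placeholder, not a real credential.
import json
import urllib.request

PERMA_API_KEY = "your-api-key-here"

def perma_archive(url: str) -> dict:
    """Ask Perma.cc to capture `url`; returns the API's JSON response."""
    req = urllib.request.Request(
        "https://api.perma.cc/v1/archives/",
        data=json.dumps({"url": url}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"ApiKey {PERMA_API_KEY}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```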

When the Internet Archive kicked off the Wayback Machine, it also made the entire archive public. Using the Wayback Machine, you can browse about as close to the entirety of the web as you’ll find anywhere, from the ’90s to today. It’s a rewarding and essential experience. The Wayback Machine is visited between three and four million times every day. That says something. The web’s history is meaningful and imperative. It’s worth saving.
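
The archive is open programmatically, too. The Wayback Machine’s public availability endpoint will hand back the snapshot of a page closest to any date you ask for; a small sketch, with a placeholder URL and date:

```python
# A small sketch against the Wayback Machine's public availability
# endpoint, which returns the snapshot closest to a given date. The
# endpoint is real; the example URL and date are placeholders.
import json
import urllib.parse
import urllib.request

def closest_snapshot(url: str, timestamp: str):
    """Return the archived URL closest to `timestamp` (YYYYMMDD), or None."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(closest_snapshot("example.com", "19990101"))
```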
