A story in a list.
1. I love archive.org. It's essential. Without it, we would have absolutely no record of the early sites that are gone. Hooray for Brewster Kahle. No sarcasm. He should get a Turing Award. Researchers of the future will praise him, and rightfully so, for having the foresight and will to provide an archive of so much of the early web.
2. If you want to see how necessary it is, spend a few minutes clicking on links from the archive of this blog from the 1990s. Most of the items I linked to are gone. The only way to see the content they once pointed to is to go to archive.org.
I picked one random month, June 1998. Very depressing. All the links to DaveNet pieces in that month, for example, are broken. Oy. After a little investigation it's clear why: umpteen migrations later, there's still a dependency on .htaccess files, and S3 doesn't understand .htaccess files. Oh well. Not easily fixable in a few minutes. But the data is still there. You just have to know where to find it.
Who's the best? It looks like the NY Times. All their links still work. They have some amazing future-thinking people there, and they mean business. Good work.
3. However, sometimes that doesn't work, because the archive is imperfect. Sometimes the sites had bugs, and saw the archive.org crawler as a denial-of-service attack. A large number of UserLand-hosted sites are missing from archive.org for that reason. It was our fault, but in our defense, we were fighting fires with a very small staff and a company that was perpetually running out of money. Yet, we had as one of our goals to preserve the content we were hosting. Luckily most, but not all, of our sites are still accessible at their original addresses. (See #6 below to learn why; it's not an accident.)
4. As the blogging art progressed, we realized we should be creating static content. That would make it easier to archive. So we were able to find a home for the Radio UserLand community, with Matt Mullenweg. He has a similar problem for WordPress sites, and Matt is young, and likely to be around longer than I am, so it made sense to trust him with this content. It's not by any means an ideal solution, but it's good enough. And I do trust Matt. He has a good heart and sees a big picture. I think he values history as much as I do.
5. We had a scare with the RSS 2.0 spec. It went offline in a server upgrade at Harvard a couple of years after I gave the spec to them, hoping it would survive longer there than on UserLand's servers. I learned a lot from the experience of trying to future-safe it. It's been at its current location for almost ten years. So I'm fairly optimistic that it will survive for a while longer. :-)
6. I am hosting a large part of the archive from UserLand and scripting.com. Jake Savin, a former UserLand guy, is hosting the other large part, out of the goodness of his heart, and respecting our early mission to produce a record. Both of us are looking for a way to make this stuff much safer than it currently is.
7. Not to be dramatic or anything, but no more than 40 days after I die, and probably much sooner, all of the content I'm hosting will disappear. That's just not acceptable. What was the point of trying to save it if that's the best we can do? (And after I'm gone it isn't my problem anymore.)
8. I'm looking to create new art for long-lived websites. I see this as more of a financial and organizational problem than a technical one. From a technical standpoint, the best approach is static files served by any software capable of serving folders of static files. I like Apache. The domains should continue to work.
9. More people would then create new content in this format, so that there will be no need to archive the sites. There should be the idea of content that's designed to survive as long as the web survives. This is especially important for sites that are trying to create a record. Government and news sites. Researchers who want their work to be built on in the future. Writers who think their ideas might be meaningful in a future context. Basically, anything that isn't completely ephemeral. We complain that the UGC companies aren't doing a good job of preserving their archives. But the awful truth is that no one is. The web is very very fragile. We can and should be making an investment in making it stronger and more resistant to damage.
10. Right now the archive of the early blogosphere is in an unknown state. There were sites at Blogger, Movable Type, we hosted some at EditThisPage.com, and weblogs.com. I would like to see all of it safe and ready for future readers and researchers. Not as museum exhibits (this is what an ancient blog looked like) but as literature that's available to anyone at any time. There probably were some great writers back then, and there certainly is history that we have already lost that can be resurrected. But every day that gets harder. We should do something about this.
I am mentioning this now because it's a long-term objective. I hope to be around for a while to work on this with others, hopefully younger people, who have a longer horizon than I do. I'm not looking for anything like a quick solution, because this is a problem that took many years to create. But I want to start making things better, one step at a time. I will go to conferences, if necessary -- but only conferences that are about preserving the early blogosphere. I am in favor of preserving everything, but one person can't take on everything. :-)
Technorati kept an archive of the blogosphere it had indexed, from its birth in '03 until it ceased to be a blog search engine in whatever-it-was ('10?). It's a different kind of company now: same name, different business. Anyway, there was a lost opportunity to save the data on those servers. From what I gather, they got wiped.
Maybe what's needed is less reliance on domain names. You have to go through a fairly substantial checklist to port a WordPress site to a new domain, which includes running queries against the database.
Code can tell what domain I'm on. Relative links can go a long way.
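As a sketch of what "relative links can go a long way" means in practice, here's a hypothetical helper (the name and domains are mine, not from any real tool) that rewrites same-site absolute links as relative ones, so they keep working after a move to a new domain:

```python
# Sketch: rewrite same-site absolute links as relative ones so the
# page survives a move to a new domain. Names here are illustrative.
from urllib.parse import urlparse

def make_relative(href, site_host):
    """If href points at our own host, drop the scheme and host."""
    parts = urlparse(href)
    if parts.netloc.lower() == site_host.lower():
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        if parts.fragment:
            path += "#" + parts.fragment
        return path
    return href  # off-site links are left alone
```

An off-site link passes through untouched; a same-site link loses its scheme and host, so nothing breaks when the domain changes out from under the pages.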
Maybe there's a file in your site that says:
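Purely as a guess at what such a file might declare -- every field name here is invented, not part of any real standard:

```json
{
  "canonical-domain": "example.org",
  "previous-domains": ["old-name.example.com"],
  "maintainer": "someone@example.org",
  "policy": "static files, serve as-is, mirror freely"
}
```

With something like that in place, a mirror or a future host could tell which domain the links were written against, and rewrite or redirect accordingly.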
Which sort of leads me to think sites could be managed via a version control system. In fact, you can serve a website from a GitHub repo using something they call Pages: http://pages.github.com/. You basically tell it your homepage and it serves the pages out of the repo. If you combine a repo with a deployment system, you could sort of hop around from host to host. If no one was there to do your hopping, maybe your repo could be put into cold storage on cheap Amazon archival storage until a researcher pulled it out, deployed it with a click (locally, even), did her research, and moved on.