Disaster recovery planning.

The Wikimedia Foundation is in no danger of collapse. There are all sorts of deeply problematic things about it, but no more than at any other small charity. Situation normal: all fouled up.

But it would be prudent to be quite sure that the Foundation failing — through external attack or internal meltdown — would not be a disaster.

The projects’ content: The dumps are good for small wikis, but not for English Wikipedia — they notoriously take ages and frequently don’t work. There are no good dumps of English Wikipedia available from Wikimedia. (I asked Brion about this and he says the backup situation should improve pretty soon, and Jeff Merkey has been putting backups up for BitTorrent.)

The English Wikipedia full text history is about ten gigabytes. The image dumps (which ahahaha you can’t get at all from Wikimedia) are huge, as in hundreds of gigabytes. It’ll be a few years before hard disks are big enough for interested geeks to download this stuff for the sake of it. What can be done to encourage widespread BitTorrenting right now?
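
One way to lower the barrier: make turning a dump file into a torrent trivial. A minimal sketch in Python (the tracker URL and file name in the example are placeholders, not anything Wikimedia actually runs):

    import hashlib, os, time

    def bencode(obj):
        # Minimal bencoding: just enough for a single-file .torrent.
        if isinstance(obj, int):
            return b"i%de" % obj
        if isinstance(obj, str):
            obj = obj.encode("utf-8")
        if isinstance(obj, bytes):
            return b"%d:%s" % (len(obj), obj)
        if isinstance(obj, list):
            return b"l" + b"".join(bencode(x) for x in obj) + b"e"
        if isinstance(obj, dict):
            return (b"d" + b"".join(bencode(k) + bencode(v)
                                    for k, v in sorted(obj.items())) + b"e")
        raise TypeError(type(obj))

    def make_torrent(path, tracker, piece_length=2 ** 20):
        # BitTorrent wants the SHA-1 of each fixed-size piece, concatenated.
        pieces = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(piece_length)
                if not chunk:
                    break
                pieces.append(hashlib.sha1(chunk).digest())
        meta = {
            "announce": tracker,  # placeholder; any working tracker will do
            "creation date": int(time.time()),
            "info": {
                "name": os.path.basename(path),
                "length": os.path.getsize(path),
                "piece length": piece_length,
                "pieces": b"".join(pieces),
            },
        }
        out = path + ".torrent"
        with open(out, "wb") as f:
            f.write(bencode(meta))
        return out

    # Example (file name is illustrative):
    # make_torrent("enwiki-pages-meta-history.xml.bz2",
    #              "http://tracker.example.org/announce")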

The easiest way for a hosting organisation to proprietise a wiki, despite the license, is simply not to make dumps available or usable, and to block anyone spidering the database fast enough to substitute for them. This is happening inadvertently now; it would be too easy to do deliberately.

Who are you? The user-password database is private to the Foundation, for obvious good reason. But I really hope the devs trusted with access to it are keeping backups in case of Foundation failure.

In the longer term, moving to something like OpenID may be a less bad way of identifying editors.

Hosting it somewhere that can handle it: MediaWiki is a resource hog. Citizendium got lots of media interest and their servers were crippled by the load, with the admin having to scramble to reconfigure things. Conservapedia was off the air for days at a time just from blogosphere interest. Who could put up a copy of English Wikipedia quickly and not be crippled by it?

Suitable country for hosting: What is a good legal regime for the hosting to be under? The UK is horrible. The US seems workable. The Netherlands is fantastic if you can afford the hosting fees. Others? (I fear languages going to the countries they’re spoken in would be a disaster for NPOV.)

Multiple forks: No-one will let a single organisation be the only Wikipedia host again. So we’ll end up with multiple forks for the content. In the short term we’ll have gaffer-and-string kludges for content merging … and lots of POV forking. A Foundation collapse would effectively “publish” Wikipedia as of the collapse date — or as of the previous good dump — as the final result of all this work.

(The English Wikipedia community could certainly do with a reboot. Hopefully that would be a benefit. It could, of course, get worse.)

In the longer term, for content integrity, we’ll need a good distributed database backend. (There’s apparently-moribund academic work to this end, and Wikileaks note they’ll need something similar.)

Worst case scenario: A 501(c)(3) can only be eaten by another 501(c)(3), but the assets of a dead one (domains, trademarks, logos, servers) can be bought by anyone. Causing the Foundation to implode could be a very profitable endeavour for a commercial interest, particularly if they smelt blood in the water.

Second worst case scenario: The Wikimedia Foundation’s assets (particularly the trademarks and logos) go to another 501(c)(3): Google.org. Wikipedia’s hosting problems are solved forever and Google further becomes the Internet. Google gets slack about providing database dumps …

What we need:

  • Good database dumps more frequently. This is really important right now. If the Foundation fails tomorrow, we lose the content.
    • People to want to and be able to BitTorrent these routinely (a sketch of mirroring and verifying a dump follows this list).
  • Backups of the user database.
    • A user identification mechanism that isn’t a single point of failure.
  • Multiple sites not just willing but ready to host it.
  • Content merging mechanisms between the multiple redundant installations.
    • A good distributed database backend.
  • The trademarks to become generic should the Foundation fail.
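
On the first two items: mirroring should be easy enough that a bored volunteer can run it from cron. Here’s a rough sketch of the sort of thing; the URL and file names are illustrative guesses at a dump-site layout, not a documented Wikimedia interface:

    import hashlib, sys, urllib.request

    BASE = "http://download.wikimedia.org/enwiki/latest/"   # illustrative
    DUMP = "enwiki-latest-pages-articles.xml.bz2"           # illustrative
    SUMS = "enwiki-latest-md5sums.txt"                      # illustrative

    def fetch(name):
        # Stream to disk rather than holding a multi-gigabyte file in memory.
        with urllib.request.urlopen(BASE + name) as resp, open(name, "wb") as out:
            while True:
                block = resp.read(1 << 20)
                if not block:
                    break
                out.write(block)

    def md5_of(name):
        h = hashlib.md5()
        with open(name, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    def expected_md5(name):
        # Assumes the checksum file lists "<md5>  <filename>" per line.
        for line in open(SUMS):
            digest, _, fname = line.strip().partition("  ")
            if fname == name:
                return digest
        return None

    if __name__ == "__main__":
        fetch(SUMS)
        fetch(DUMP)
        ok = md5_of(DUMP) == expected_md5(DUMP)
        print("checksum OK" if ok else "checksum MISMATCH")
        sys.exit(0 if ok else 1)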

I’d like your ideas and participation here. What do we do if the Foundation breaks tomorrow?

(See also the same question on my LJ.)

Correction: Google.org is not a 501(c)(3). So it couldn’t gobble up Wikimedia directly.

15 thoughts on “Disaster recovery planning.”

  1. David,

    Very interesting write-up. I wish Wikipedia the best.

    One minor point: Google.org is a “for-profit charity”, not a 501(c)(3), and as such couldn’t merge with a 501(c)(3).

    Sincerely,
    Mike

  2. Good question. Note that any individual GFDL article can be accessed as XML using Special:Export, though. However, if you try to get the lot fast enough to be useful, you’ll be blocked.
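
    Something like this pulls one article’s current revision as XML (a rough sketch; the URL form is the standard Special:Export one, and the pause is there so you don’t get blocked):

        import time, urllib.parse, urllib.request

        def export_current(title):
            # Special:Export/<Title> returns the current revision as MediaWiki XML.
            safe = urllib.parse.quote(title.replace(" ", "_"))
            url = "http://en.wikipedia.org/wiki/Special:Export/" + safe
            req = urllib.request.Request(url, headers={"User-Agent": "export-sketch/0.1"})
            with urllib.request.urlopen(req) as resp:
                return resp.read()

        # Be polite: one page at a time, with a pause between requests.
        for title in ["Wiki", "Encyclopedia"]:
            with open(title.replace(" ", "_") + ".xml", "wb") as out:
                out.write(export_current(title))
            time.sleep(5)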

  3. No current backups? I doubt Iliad would have thought his recent cartoon was so timely.

    I always thought Wikipedia performed its backups following the example of Linus Torvalds: allow the Internet to copy the work. At least the current versions of each article could be restored. Recovering the previous versions would probably be far more difficult.

    Geoff

  4. “It’ll be a few years before hard disks are big enough for interested geeks to download this stuff for the sake of it.”

    A mere 300 bucks buys you a 1TB external hard drive, and that should be more than enough to store the history of all Wikipedias and all images. Affordable hard drive space grows faster than Wikipedia.

    My worst case scenario: the board invites a couple of people to get “outside perspectives”, and those outsiders determine that running ads on Wikipedia would be a good fund raising idea. Overnight this creates thousands of highly determined vandals with inside knowledge and the project dies.

    Axel

  5. I used to think assimilation by Google would be the best thing that could happen to the bloody train wreck.
    Besides, it seems to be King James’ wet dream, so he can buy a jet to go with his yacht.
    Now I’m not so sure…especially after witnessing what has happened to YouTube in the wake of its “Anschluss” with big G.
    Contrary to popular belief, David, I don’t wish to see Wikipedia destroyed…at least not yet:) There is too much good there still (Baby meet bathwater). Which makes all the fuck-ups all the more infuriating.

  6. I am left wondering if the Wikimedia Foundation is covered under any insurance policies. If not, where are we most at risk and what policies can we afford?

    Failure is not inevitable, but planning for it is still prudent. This post is a thoughtful writeup.

    -George

  7. I have wondered what would happen if a hurricane or fire were to take out the Florida datacenter.

    In private industry, I have always trusted Sungard (http://www.availability.sungard.com/). This may not be optimal for Wikimedia.

    Perhaps the simplest thing is to spread the servers around the country similar to Google’s “borg cube” strategy.

    -Jason Potkanski

  8. Jason – yep, or the distributed backend. Wikimedia has servers around the world, but they’re all Squids – the data lives on three master servers in Tampa.

    I have long joked that the admin politics on English Wikipedia are silly because the developers hold all the real power. However, I’m increasingly thinking that developers are critical to the content surviving and continuing to be worked on – that improving MediaWiki is the vitally important thing. And even though content writers love Wikipedia and flock to it, developers don’t. Hence my frequent attempts to recruit anyone customising MediaWiki to their own ends to contribute back to the main stream of development.

    The idea of rebooting the English Wikipedia community – delete most or all of the Wikipedia: page space and start over – is appealing to more than a few people. Perhaps they all grew up in the 1980s and remember the Cold War feeling of comfortable doom. In any case, anyone wanting to destroy Wikipedia only has until the next good database dump ;-)

  9. Hundreds of gigs – that doesn’t sound like very much in an age where 300GB+ hard drives are sold over the counter for affordable prices.
    How many hundreds of GB are we talking about?

    I used to download the nld:, nl: and af: text wikis every quarter; if it is a service to the wiki community, I am willing to consider backups of Commons, provided we can work out a good way to download them.

  10. It might be a very good thing.

    Rewriting the policy basis to make it coherent would be a good thing. Not pretending that the notions that work in a small community scale up well would be another. Also, deciding *what it is* would help. Citizendium has the advantage there: it knows what it wants to be. Wikipedia suffers a lot from lack of direction, and from the vagueness of principle that underlies it.

    And losing Jimbo, who is a total dead weight on the project, would be a positive. A thousand courtiers clapping when the emperor fucks something up doesn’t make it less of a fuckup.

  11. Just a brief answer about storage: in Serbia a storage computer (6×1.5TB, good tower and power supply, dual-core processor, something like 2-4GB RAM) costs somewhat less than 1000 EUR. I suppose that such a configuration may cost less than $1000 in the USA, maybe close to $500. And I suppose that there are enough enthusiasts who are able to buy that.

    With a 16Mbps link it is possible to synchronize the file repository (images and so on) in less than 20 days.
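
    Rough arithmetic behind that (a sketch; it assumes the link runs flat out the whole time, which real links rarely do):

        # 16 Mbps, flat out, for 20 days:
        mbps = 16
        bytes_per_day = mbps * 1e6 / 8 * 86400               # about 170 GB/day
        print(round(bytes_per_day * 20 / 1e12, 1), "TB in 20 days")   # ~3.5 TB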

    BTW, backups are not just about disaster recovery, but also about doing something useful with them, like categorizing and describing images. If we had a client program for that purpose, similar to Google Picasa but also connected to Wikipedia, we would be able to have millions of copies of everything. In a couple of years, updating Wikipedia (and Commons, and Wikibooks, and…) may become the same sort of thing as updating Debian. And we need a P2P implementation for that.
