Staring into the eye of Cthulhu.

The MediaWiki wikitext parser is not a “parser” as such; it’s a pile of regular expressions, using PCRE as found in PHP. There are preprocessing and postprocessing steps. No formal definition of wikitext exists; the definition is literally “whatever the parser does.” Lots of features of wikitext that people use in practice are actually quirks of the implementation.
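To make the “pile of regular expressions” point concrete, here is a toy sketch in Python of what an ordered chain of regex substitutions looks like. The rules and the `render` function are made up for illustration; they are not MediaWiki’s actual patterns or code, and the real thing is PHP with far more passes.

```python
import re

# Toy rules, applied strictly in order. These patterns are illustrative
# assumptions, not the ones MediaWiki actually uses.
RULES = [
    (re.compile(r"'''(.+?)'''"), r"<b>\1</b>"),  # bold
    (re.compile(r"''(.+?)''"), r"<i>\1</i>"),    # italic
    # piped link: [[Target|label]]
    (re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]"), r'<a href="/wiki/\1">\2</a>'),
    # plain link: [[Target]]
    (re.compile(r"\[\[([^\]|]+)\]\]"), r'<a href="/wiki/\1">\1</a>'),
]

# Same rules with the first two swapped, to show that order matters.
ITALIC_FIRST = [RULES[1], RULES[0]] + RULES[2:]


def render(text, rules=RULES):
    """Run each substitution over the whole text, one rule after another."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


print(render("'''bold''' and ''italic''"))  # <b>bold</b> and <i>italic</i>
print(render("'''bold'''", ITALIC_FIRST))   # <i>'bold</i>'
```

Swap two rules and the same input renders differently. With no grammar behind it, neither answer is “wrong”; the output is whatever the rule order happens to produce, which is how quirks turn into features.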

This is a serious problem. Rendering a complex page on en:wp can take several seconds on the reasonably fast WMF servers. Third-party processing of wikitext into XML, HTML or other formats is not reliably possible. You can’t drop in a faster parser if you happen to have access to gcc on your server. Solid WYSIWYG editing, as opposed to the many approximations over the years (some very good, but still very approximate), could really do with a formally-described language to work to. (That’s not all it needs, but it’s pretty much needed to make it solid.)

Actually describing wikitext is something many people have attempted, only to end up dashing their brains against the rocks. The hard stuff is the last 5%, and almost all of the horrible stuff needs to work because it’s used in the vast existing body of wikitext. Wikitext is provably impossible to describe as EBNF. Steve Bennett tried ANTLR and that effort failed too.

If you’ve ever spat and cursed at the MediaWiki parser, you may care to glance at this month’s wikitext-l archives. (That’s the list Domas Mituzas created to keep us from clogging wikitech-l with gibbering insanity.) Andreas Jonsson has been having a good hack at it, and he thinks he’s cracked it.

This won’t become the parser without some serious compatibility testing … and being faster than the existing one. But this even existing will mean third parties can use a compiled C parser instead of PHP, third parties can process wikitext with blithe abandon without a magic black box MediaWiki installation, dogs and cats can live together in Californian gay marriage and the world will be just that little bit more beautiful. Andreas’ mortal shell, mind destroyed by contemplation of insanity beyond the power of the fragile human frame to take, would be in line for the Nobel Prize for Wikipedia. Could be good. Should be in the WMF Subversion within a few days.

Update: Svn, explanation. Performance is actually comparable to the present parser. Not perfect as yet, but not bad.

We will add your activist distinctiveness to our own.

Despite the media attention, I don’t think this is any threat to the integrity of the encyclopedias’ content.

The Wikipedias get waves of activists and are used to dealing with them. The ones who don’t take the time to understand Neutral Point Of View, their stuff gets removed. The ones who do, their stuff stays and their cause gets accurately described and represented. Best case, we get more good new Wikipedians.

This applies to any activist for any cause whatsoever and has applied at least since I started on en:wp in 2004.

The advice I have for activists is: strict neutrality with excellent citations will do your cause justice. Everything else will be removed.

The broader advice is: there is no plausible attack on the integrity of the encyclopedias themselves that is not already something we have been dealing with on a daily basis for many years.

I wonder if the presently prominent group of activists have taken this on board in their quest to have their stuff stick.

There’s a hole in my bucket.

Only a few calls after the Afghan War Diary from people who think Wikileaks is part of Wikimedia. I must stress again the two are utterly unconnected, though I remain a big fan of Wikileaks.

What happens if the Pentagon manages to nail Julian Assange? Maybe, just maybe, Wikileaks posts the key to the file tagged “INSURANCE”.

In the meantime, US military are banned from looking at Wikileaks. I’m sure that’ll seal all leaks just fine. The Taliban can still read it, of course.

The old media aren’t happy either. I bet the RIAA wishes it had thought of calling in military strikes on Napster.

And to be on-topic: Wikileaks reveals US Army Intelligence cribs from Wikipedia, too. (Cache.)