Staring into the eye of Cthulhu.

The MediaWiki wikitext parser is not a “parser” as such; it’s a pile of regular expressions, using PCRE as found in PHP. There are preprocessing and postprocessing steps. No formal definition of wikitext exists; the definition is literally “whatever the parser does.” Lots of features of wikitext that people use in practice are actually quirks of the implementation.

This is a serious problem. Rendering a complex page on en:wp can take several seconds on the reasonably fast WMF servers. Third-party processing of wikitext into XML, HTML or other formats is not reliably possible. You can’t drop in a faster parser if you happen to have access to gcc on your server. Solid WYSIWYG editing, as opposed to the many approximations over the years (some very good, but still very approximate), could really do with a formally-described language to work to. (That’s not all it needs, but it’s pretty much needed to make it solid.)

Actually describing wikitext is something many people have attempted and ended up dashing their brains against the rocks of. The hard stuff is the last 5%, and almost all of the horrible stuff needs to work because it’s used in the vast existing body of wikitext. Wikitext is provably impossible to describe in EBNF. Steve Bennett tried ANTLR and that effort failed too.

If you’ve ever spat and cursed at the MediaWiki parser, you may care to glance at this month’s wikitext-l archives. (That’s the list Domas Mituzas — not Tim Starling, as I originally wrote — created to keep us from clogging wikitech-l with gibbering insanity.) Andreas Jonsson has been having a good hack at it, and he thinks he’s cracked it.

This won’t become the parser without some serious compatibility testing … and being faster than the existing one. But this even existing will mean third parties can use a compiled C parser instead of PHP, third parties can process wikitext with blithe abandon without a magic black box MediaWiki installation, dogs and cats can live together in Californian gay marriage and the world will be just that little bit more beautiful. Andreas’ mortal shell, mind destroyed by contemplation of insanity beyond the power of the fragile human frame to take, would be in line for the Nobel Prize for Wikipedia. Could be good. Should be in the WMF Subversion within a few days.

Update: Svn, explanation. Performance is actually comparable to the present parser. Not perfect as yet, but not bad.

6 Responses to “Staring into the eye of Cthulhu.”

  1. Powers says:

    If I may, what are the “features” that are actually quirks?

  2. David Gerard says:

    Things like the apostrophe handling, and distinguishing whether it’s bold, italic, French or what. The current behaviour is just how PCRE in PHP happens to behave, not anything hugely planned – wikitext was originally just supposed to translate more or less directly to HTML, after all.

  3. […] Staring into the eye of Cthulhu. « David Gerard Actually describing wikitext is something many people have attempted and ended up dashing their brains against the rocks of. (tags: wikipedia mediawiki parser wikitext) […]

  4. Platonides says:

    That’s wrong. Apostrophe handling is done on purpose that way: trying to attach the ‘ to a single letter when the quote markers don’t match, and so on.
    Precisely that heuristic makes it hard to incorporate into another pass.

  5. MZMcBride says:

    I think Domas created wikitext-l, not Tim. http://lists.wikimedia.org/pipermail/wikitech-l/2007-November/035050.html and my IRC logs from that day seem to confirm this.

    Tim did (kind of recently) create a “parsers” directory in MediaWiki trunk, though: http://svn.wikimedia.org/viewvc/mediawiki/trunk/parsers/

  6. Baylink says:

    Yeah, when wikitext-l was created was the last time I kibitzed on this, and whoever’s project was nascent then is the one I was thinking about.
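The apostrophe heuristic discussed in comments 2 and 4 can be sketched in a few lines. This is my own toy approximation, not MediaWiki’s actual quote-handling code (Parser::doQuotes): ‘’ toggles italic, ‘’’ toggles bold, and if the bold markers on a line don’t pair up, one ‘’’ that follows a single-letter word (as in French “l’”) is reinterpreted as a literal apostrophe plus an italic marker. The point is that no single marker can be classified without looking at the whole line — which is exactly why this is hard to fit into another pass:

```python
import re

def render_quotes(line):
    """Toy sketch of MediaWiki-style apostrophe disambiguation.
    Simplified: ignores runs of 4+ apostrophes and cross-line state."""
    tokens = re.split(r"('{2,3})", line)
    bold_idx = [i for i, t in enumerate(tokens) if t == "'''"]
    if len(bold_idx) % 2 == 1:
        # Unmatched bold marker: demote one to a literal apostrophe
        # plus an italic toggle, preferring one after a single-letter
        # word, e.g. the "l" in "l'''exemple''".
        def after_single_letter(i):
            prev = tokens[i - 1] if i > 0 else ""
            return bool(re.search(r"(?:^|[^A-Za-z])[A-Za-z]$", prev))
        victim = next((i for i in bold_idx if after_single_letter(i)),
                      bold_idx[0])
        new_tokens = []
        for i, t in enumerate(tokens):
            if i == victim:
                new_tokens.extend(["'", "''"])  # apostrophe + italic
            else:
                new_tokens.append(t)
        tokens = new_tokens
    out, italic, bold = [], False, False
    for t in tokens:
        if t == "''":
            out.append("</i>" if italic else "<i>")
            italic = not italic
        elif t == "'''":
            out.append("</b>" if bold else "<b>")
            bold = not bold
        else:
            out.append(t)
    if italic:
        out.append("</i>")
    if bold:
        out.append("</b>")
    return "".join(out)
```

So `render_quotes("l'''exemple''")` yields `l'<i>exemple</i>` rather than opening an unclosed bold — the whole-line lookbehind is the part that resists a clean grammar.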
