Regular expressions to EBNF?

Last Thursday at London.PM, I got asked a lot why MediaWiki wikitext doesn’t have a WYSIWYG editor. The answer is that a WYSIWYG editor would need to know wikitext grammar, and there is no defined grammar. The MediaWiki “parser” is not actually a parser — it’s a twisty series of regular expressions (PHP’s version of PCREs).
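
To illustrate what a twisty series of regular expressions means in practice, here is a toy sketch (in Python rather than the actual PHP; nothing below is MediaWiki's real code). Each rule blindly rewrites the whole text, and no parse tree ever exists:

    import re

    def munge(wikitext):
        """Toy regex "parser": rewrite wikitext into HTML by blind substitution."""
        html = wikitext
        # '''bold''' -> <b>bold</b>
        html = re.sub(r"'''(.+?)'''", r"<b>\1</b>", html)
        # ''italic'' -> <i>italic</i>
        html = re.sub(r"''(.+?)''", r"<i>\1</i>", html)
        # [[Page|label]] and [[Page]] -> links (grossly simplified)
        html = re.sub(r"\[\[([^\]|]+)\|([^\]]+)\]\]", r'<a href="/wiki/\1">\2</a>', html)
        html = re.sub(r"\[\[([^\]]+)\]\]", r'<a href="/wiki/\1">\1</a>', html)
        return html

    print(munge("'''Bold''' text and a [[Main Page|link]]."))
    # The rules interact in surprising ways: try "'''''bold italic'''''".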

So any grammar effort (and several What You See Is All You Get editors — others just forget wikitext and write HTML) requires reverse-engineering that, and lots of people have tried and gotten 90% of the way before stalling. It doesn’t help that wikitext is (I’m told) provably impossible to just put into a single lump of EBNF.

The goal is to replace the twisty series of regexps with something generated from a grammar. Tim Starling has said, more or less: “We can’t change wikitext. Go away and write something that (a) covers almost all of it (b) is comparably fast in PHP.” Harsh, but fair.

It occurred to me that there must exist tools to convert regexps into EBNF. And that if we can get it into even a few disparate lumps of hideous EBNF, there should be tools to take those and simplify them somewhat. (Presumably with steps to say what given bits mean.) Or possibly things other than EBNF, just as long as the result is parseable.
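
I don't know of an off-the-shelf regexp-to-EBNF tool, but the construction is mechanical for the easy cases, since every regular expression describes a regular language. Here is a toy Python sketch of my own (using the standard library's sre_parse to get at the regex's parse tree) that handles literals, grouping, alternation and repetition, and gives up on everything else, which is roughly what it would have to do on most of the real MediaWiki patterns:

    import sre_parse                     # deprecated since Python 3.11, but still importable
    from sre_constants import MAXREPEAT  # the "no upper bound" sentinel

    def regex_to_ebnf(pattern, name="rule"):
        """Very restricted regex -> EBNF converter: literals, grouping,
        alternation and repetition only. Anything fancier raises."""

        def walk(tokens):
            parts = []
            for op, arg in tokens:
                kind = repr(op)                          # opcode name, e.g. 'LITERAL'
                if kind == "LITERAL":
                    parts.append('"%s"' % chr(arg))
                elif kind == "BRANCH":                   # a|b|c
                    _, alts = arg
                    parts.append("( " + " | ".join(walk(a) for a in alts) + " )")
                elif kind == "SUBPATTERN":               # ( ... )
                    parts.append("( " + walk(arg[3]) + " )")
                elif kind in ("MAX_REPEAT", "MIN_REPEAT"):   # x*, x+, x?, x{m,n}
                    lo, hi, item = arg
                    inner = walk(item)
                    required = [inner] * lo
                    if hi == MAXREPEAT:
                        optional = ["{ %s }" % inner]        # unbounded tail
                    else:
                        optional = ["[ %s ]" % inner] * (hi - lo)
                    parts.append(" , ".join(required + optional))
                else:
                    raise NotImplementedError("no EBNF mapping for %s" % kind)
            return " , ".join(parts)

        return "%s = %s ;" % (name, walk(sre_parse.parse(pattern)))

    print(regex_to_ebnf("'{2,}", "quotes"))      # quotes = "'" , "'" , { "'" } ;
    print(regex_to_ebnf("(ab|cd)?e", "demo"))    # demo = [ ( ( "a" , "b" | "c" , "d" ) ) ] , "e" ;

The hard part is everything the toy refuses to handle: character classes, backreferences, lookaheads and the other PCRE extensions, plus deciding what the resulting productions actually mean.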

I am not (even slightly) a computer scientist, but many of you are. Does anyone have any ideas on this? Or pointers to anyone having done anything even remotely similar? Or knowledgeable friends they could point this query at?

The other approach is parserTests.php: see the manual on running maintenance scripts, the scripts themselves (look for parserTests) and the list of tests. A “parser” will be anything that passes the unit tests.
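
In that spirit, the harness doesn't need to care how a candidate works, only what it produces. A minimal sketch, with hypothetical test pairs and a hypothetical parse_to_html callable (not the real parserTests format or data):

    # Hypothetical test pairs; the real cases live in MediaWiki's own test list.
    CASES = [
        ("''italic''",    "<p><i>italic</i></p>"),
        ("'''bold'''",    "<p><b>bold</b></p>"),
        ("[[Main Page]]", '<p><a href="/wiki/Main_Page">Main Page</a></p>'),
    ]

    def run(parse_to_html):
        """Run every case through the candidate parser and report mismatches."""
        passed = 0
        for wikitext, expected in CASES:
            got = parse_to_html(wikitext)
            if got == expected:
                passed += 1
            else:
                print("FAIL %r\n  expected: %r\n  got:      %r" % (wikitext, expected, got))
        print("%d/%d passed" % (passed, len(CASES)))

    # run(my_candidate_parser)   # any callable taking wikitext and returning HTML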

4 thoughts on “Regular expressions to EBNF?”

  1. The absence of a formal grammar for the wiki markup is only one reason for not having WYSIWYG — another is that many parts of the wiki syntax cannot be handled on the client side at all, because they require access to the database: this includes templates, some parser functions and many extension tags. Also, a true WYSIWYG editor has no way to edit invisible things like DISPLAYTITLE and such. And I can’t even begin to imagine what would happen if you tried to edit one of the more complex templates with such an editor. So the maximum one could aim for is WYSIWYM (what you see is what you mean), maybe a bit like what the WikEd script does [1]. And it would have to be very careful not to break things that it doesn’t “understand”. It would need “soft parsing” — a tricky problem.

    As to parsing: converting regular expressions to EBNF is relatively trivial; it might not even be needed, depending on what framework you use to build the parser. The problem is that MediaWiki syntax is not regular [2], and very likely not LL(1) or LR(1) [3] either. I’m not even convinced that it’s context-free in all cases [4] (parsing a non-context-free grammar is very hard, even theoretically). It also has an annoying tendency to depend on localization and customization: a prominent example is image links. They follow a different grammar than normal links, but can only be detected when knowing all local names of the image namespace. Another problem is templates that are “syntactically incomplete”: you cannot parse the template code individually and then use the result. You need a separate preprocessor pass (using a limited grammar) to resolve them (which requires database access). Same goes for some parser functions. This makes it very hard to detach the parser from the rest of the system.

    For the parser tests: they are badly named, really — what they test is code generation, not parsing as such. What we currently have is a “munger” that mogrifies the wikitext until it resembles HTML — a real parser would not generate HTML. It would generate a parse tree (or parse events), and a code generator would be plugged into that (see the sketch at the end of this comment). For a WYSIWYG engine, it might not generate HTML code at all, but a DOM tree or something like that. And for exporting, it might generate something else, like TeX (that would rock).

    The advantage of having a formal grammar would be to allow people to build parsers on different platforms — in PHP to be used in MediaWiki itself, in JavaScript for a web-based WYSIWYM editor, in Python for use with bots, etc.

    My point is: it’s not just a lot of work someone needs to do. There are conceptual problems with this. One of the biggest is that, due to the reliance on configuration, localization, database content and extension-defined syntax, a formal grammar in the academic sense (EBNF or a production grammar [5]) is not even possible.

    To me this means that we cannot hope for a simple or clean solution. We can only try to build something that is better than what we have, and live with the quirks.

    [1] http://en.wikipedia.org/wiki/User:Cacycle/wikEd
    [2] http://en.wikipedia.org/wiki/Regular_grammar
    [3] http://en.wikipedia.org/wiki/LL_parser resp. http://en.wikipedia.org/wiki/LR_parser
    [4] http://en.wikipedia.org/wiki/Context_free_grammar
    [5] http://en.wikipedia.org/wiki/Production_%28computer_science%29
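
    A minimal sketch of that parse-tree-plus-code-generators shape (toy Python, my own names, nothing like MediaWiki's actual internals; it only knows plain text and '''bold'''):

        def parse(wikitext):
            """Return a parse tree: a list of ('text', s) and ('bold', subtree) nodes."""
            tree, pos = [], 0
            while pos < len(wikitext):
                start = wikitext.find("'''", pos)
                if start == -1:
                    tree.append(("text", wikitext[pos:]))
                    break
                end = wikitext.find("'''", start + 3)
                if end == -1:                      # unclosed markup: keep it as plain text
                    tree.append(("text", wikitext[pos:]))
                    break
                if start > pos:
                    tree.append(("text", wikitext[pos:start]))
                tree.append(("bold", [("text", wikitext[start + 3:end])]))
                pos = end + 3
            return tree

        def to_html(tree):
            return "".join("<b>%s</b>" % to_html(v) if k == "bold" else v for k, v in tree)

        def to_tex(tree):
            return "".join("\\textbf{%s}" % to_tex(v) if k == "bold" else v for k, v in tree)

        tree = parse("Some '''bold''' text")
        print(tree)           # [('text', 'Some '), ('bold', [('text', 'bold')]), ('text', ' text')]
        print(to_html(tree))  # Some <b>bold</b> text
        print(to_tex(tree))   # Some \textbf{bold} text

    The point is only the shape: the generators never see wikitext, so adding a TeX or DOM back end does not mean touching the parser.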

  2. Oh yeah. I should have noted that a grammar is necessary, not sufficient.

    I think the real win will be in third-party parsers of provable 100% accuracy. PHP in the default install, compiled C on a heavy-duty site or for intensive post-processing.

    Tim has said we don’t necessarily have to preserve every stupid corner case and piece of emergent behaviour … unless they’re used by people and considered part of wikitext. This is a soft boundary.
