{"id":92,"date":"2008-04-09T23:09:11","date_gmt":"2008-04-09T23:09:11","guid":{"rendered":"http:\/\/davidgerard.co.uk\/notes\/2008\/04\/09\/regular-expressions-to-ebnf\/"},"modified":"2008-04-09T23:12:36","modified_gmt":"2008-04-09T23:12:36","slug":"regular-expressions-to-ebnf","status":"publish","type":"post","link":"https:\/\/davidgerard.co.uk\/notes\/2008\/04\/09\/regular-expressions-to-ebnf\/","title":{"rendered":"Regular expressions to EBNF?"},"content":{"rendered":"<p>Last Thursday at <a href=\"http:\/\/london.pm.org\/\">London.PM<\/a>, I got asked a lot why <a href=\"http:\/\/www.mediawiki.org\/wiki\/MediaWiki\">MediaWiki<\/a> <a href=\"http:\/\/en.wikipedia.org\/wiki\/Wikipedia:Cheatsheet\">wikitext<\/a> doesn&#8217;t have a <a href=\"http:\/\/www.mediawiki.org\/wiki\/WYSIWYG_editor\">WYSIWYG editor<\/a>. The answer is that a WYSIWYG editor would need to know wikitext <a href=\"http:\/\/en.wikipedia.org\/wiki\/Grammar_(formal_language_theory)\">grammar<\/a>, and <i>there is no defined grammar<\/i>. The MediaWiki &#8220;parser&#8221; is not actually a parser &mdash; it&#8217;s a <a href=\"http:\/\/www.mediawiki.org\/wiki\/Manual:Parser.php\">twisty series<\/a> of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Regular_expression\">regular expressions<\/a> (PHP&#8217;s version of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Perl_Compatible_Regular_Expressions\">PCRE<\/a>s).<\/p>\n<p>So any grammar effort (and several What You See Is All You Get editors &mdash; others just forget wikitext and write HTML) requires reverse-engineering that, and lots of people have <a href=\"http:\/\/www.mediawiki.org\/wiki\/Markup_spec\">tried<\/a> and <a href=\"http:\/\/www.mediawiki.org\/wiki\/Category:Parser\">gotten 90% of the way<\/a> before stalling. 
It doesn&#8217;t help that wikitext is (I&#8217;m told) provably impossible to just put into a single lump of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Extended_Backus%E2%80%93Naur_form\">EBNF<\/a>.<\/p>\n<p>The goal is to replace the <a href=\"http:\/\/svn.wikimedia.org\/svnroot\/mediawiki\/trunk\/phase3\/includes\/Parser.php\">twisty series of regexps<\/a> with something generated from a grammar. <a href=\"http:\/\/en.wikipedia.org\/wiki\/User:Tim_Starling\">Tim Starling<\/a> has said, more or less: <i>&#8220;We can&#8217;t change wikitext. Go away and write something that (a) covers almost all of it (b) is comparably fast in PHP.&#8221;<\/i> Harsh, but fair.<\/p>\n<p>It occurred to me that there must exist tools to convert regexps into EBNF. And that if we can get it into even a few disparate lumps of hideous EBNF, there should be tools to take those and simplify them somewhat. (Presumably with steps to say what given bits mean.) Or possibly <a href=\"http:\/\/www.mediawiki.org\/wiki\/Markup_spec\/ANTLR\">things other than EBNF<\/a>, just as long as the result is parseable.<\/p>\n<p>I am not (even slightly) a computer scientist, but many of you are. Does anyone have any ideas on this? Or pointers to anyone having done anything even remotely similar? Or knowledgeable friends they could point this query at?<\/p>\n<p>The other approach is parserTests.php. <a href=\"http:\/\/www.mediawiki.org\/wiki\/Manual:Maintenance_scripts\">Running maintenance scripts<\/a>, <a href=\"http:\/\/svn.wikimedia.org\/viewvc\/mediawiki\/trunk\/phase3\/maintenance\/\">the scripts<\/a> (look for parserTests), <a href=\"http:\/\/svn.wikimedia.org\/viewvc\/mediawiki\/trunk\/phase3\/maintenance\/parserTests.txt\">the list of tests<\/a>. A &#8220;parser&#8221; will be anything that passes the unit tests.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last Thursday at London.PM, I got asked a lot why MediaWiki wikitext doesn&#8217;t have a WYSIWYG editor. 
The answer is that a WYSIWYG editor would need to know wikitext grammar, and there is no defined grammar. The MediaWiki &#8220;parser&#8221; is not actually a parser &mdash; it&#8217;s a twisty series of regular expressions (PHP&#8217;s version of &hellip; <a href=\"https:\/\/davidgerard.co.uk\/notes\/2008\/04\/09\/regular-expressions-to-ebnf\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Regular expressions to EBNF?&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[4],"tags":[],"class_list":["post-92","post","type-post","status-publish","format-standard","hentry","category-wiki"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p4FmVR-1u","_links":{"self":[{"href":"https:\/\/davidgerard.co.uk\/notes\/wp-json\/wp\/v2\/posts\/92","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/davidgerard.co.uk\/notes\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/davidgerard.co.uk\/notes\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/davidgerard.co.uk\/notes\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/davidgerard.co.uk\/notes\/wp-json\/wp\/v2\/comments?post=92"}],"version-history":[{"count":0,"href":"https:\/\/davidgerar
d.co.uk\/notes\/wp-json\/wp\/v2\/posts\/92\/revisions"}],"wp:attachment":[{"href":"https:\/\/davidgerard.co.uk\/notes\/wp-json\/wp\/v2\/media?parent=92"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/davidgerard.co.uk\/notes\/wp-json\/wp\/v2\/categories?post=92"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/davidgerard.co.uk\/notes\/wp-json\/wp\/v2\/tags?post=92"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}