<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Regular expressions to EBNF?</title>
	<atom:link href="http://davidgerard.co.uk/notes/2008/04/09/regular-expressions-to-ebnf/feed/" rel="self" type="application/rss+xml" />
	<link>http://davidgerard.co.uk/notes/2008/04/09/regular-expressions-to-ebnf/</link>
	<description>arrogant pontification</description>
	<lastBuildDate>Sun, 22 Apr 2012 11:18:34 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.4</generator>
	<item>
		<title>By: David Gerard</title>
		<link>http://davidgerard.co.uk/notes/2008/04/09/regular-expressions-to-ebnf/comment-page-1/#comment-7994</link>
		<dc:creator>David Gerard</dc:creator>
		<pubDate>Thu, 10 Apr 2008 10:54:50 +0000</pubDate>
		<guid isPermaLink="false">http://davidgerard.co.uk/notes/2008/04/09/regular-expressions-to-ebnf/#comment-7994</guid>
		<description>Oh yeah. I should have noted that a grammar is necessary, not sufficient.

I think the real win will be in third-party parsers of provable 100% accuracy. PHP in the default install, compiled C on a heavy-duty site or for intensive post-processing.

Tim has said we don&#039;t necessarily have to preserve every stupid corner case and piece of emergent behaviour ... unless they&#039;re used by people and considered part of wikitext. This is a soft boundary.</description>
		<content:encoded><![CDATA[<p>Oh yeah. I should have noted that a grammar is necessary, not sufficient.</p>
<p>I think the real win will be in third-party parsers of provable 100% accuracy. PHP in the default install, compiled C on a heavy-duty site or for intensive post-processing.</p>
<p>Tim has said we don&#8217;t necessarily have to preserve every stupid corner case and piece of emergent behaviour &#8230; unless they&#8217;re used by people and considered part of wikitext. This is a soft boundary.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Kinzler</title>
		<link>http://davidgerard.co.uk/notes/2008/04/09/regular-expressions-to-ebnf/comment-page-1/#comment-7993</link>
		<dc:creator>Daniel Kinzler</dc:creator>
		<pubDate>Thu, 10 Apr 2008 10:40:49 +0000</pubDate>
		<guid isPermaLink="false">http://davidgerard.co.uk/notes/2008/04/09/regular-expressions-to-ebnf/#comment-7993</guid>
		<description>The absence of a formal grammar for the wiki markup is only one reason for not having wysiwyg -- another one is that many parts of the wiki syntax can not be handled at all on the client side, because they require access to the database: this includes templates, some parser functions and many extension tags. Also, a true wysiwyg-editor has no way to edit invisible things like DISPLAYTITLE and such. And I can&#039;t even begin to imagine what would happen if you tried to edit one of the more complex templates with such an editor. So, the maximum one could aim for is wysiwym (what you see is what you mean), maybe a bit like what the WikEd script does [1]. And it would have to be very careful not to break things that it doesn&#039;t &quot;understand&quot;. It would need &quot;soft parsing&quot; -- a tricky problem.

As to parsing: converting regular expressions to EBNF is relatively trivial, and it might not even be needed, depending on what framework you use to build the parser. The problem is that mediawiki syntax is not regular [2], and very likely not LL(1) or LR(1) [3] either. I&#039;m not convinced that it&#039;s even context-free in all cases [4] (parsing non-CF grammar is very hard, even theoretically).  It also has an annoying tendency to depend on localization and customization: a prominent example is image links. They follow a different grammar than normal links, but can only be detected when knowing all local names of the image namespace. Another problem is templates that are &quot;syntactically incomplete&quot;: you can not parse the template code individually, and then use the result. You need a separate preprocessor pass (using a limited grammar) to resolve them (which requires database access). Same goes for some parser functions.  This makes it very hard to detach the parser from the rest of the system.

For the parser tests: they are badly named, really -- what they test is code generation, not parsing as such. What we currently have is a &quot;munger&quot; that mogrifies the wikitext until it resembles html -- a real parser would not generate html. It would generate a parse tree (or parse events), and a code generator would be plugged into that. For a wysiwyg engine, it might not generate html code at all, but a DOM tree or something like that. And for exporting, it might generate something else, like TeX (that would rock).

The advantage of having a formal grammar would be to allow people to build parsers on different platforms -- in PHP to be used in MediaWiki itself, in JavaScript for a web-based wysiwym-editor, in Python for use with bots, etc.

My point is: it&#039;s not just a lot of work someone needs to do. There are conceptual problems with this. One of the biggest is that due to the reliance on configuration, localization, database content and extension-defined syntax, a formal grammar in the academic sense (EBNF or a production grammar [5]) is not even possible.

To me this means that we can not hope for a simple or clean solution. We can only try to build something that is better than what we have, and live with the quirks.

[1] http://en.wikipedia.org/wiki/User:Cacycle/wikEd
[2] http://en.wikipedia.org/wiki/Regular_grammar
[3] http://en.wikipedia.org/wiki/LL_parser resp.  http://en.wikipedia.org/wiki/LR_parser
[4] http://en.wikipedia.org/wiki/Context_free_grammar
[5] http://en.wikipedia.org/wiki/Production_%28computer_science%29</description>
		<content:encoded><![CDATA[<p>The absence of a formal grammar for the wiki markup is only one reason for not having wysiwyg &#8212; another one is that many parts of the wiki syntax can not be handled at all on the client side, because they require access to the database: this includes templates, some parser functions and many extension tags. Also, a true wysiwyg-editor has no way to edit invisible things like DISPLAYTITLE and such. And I can&#8217;t even begin to imagine what would happen if you tried to edit one of the more complex templates with such an editor. So, the maximum one could aim for is wysiwym (what you see is what you mean), maybe a bit like what the WikEd script does [1]. And it would have to be very careful not to break things that it doesn&#8217;t &#8220;understand&#8221;. It would need &#8220;soft parsing&#8221; &#8212; a tricky problem.</p>
<p>As to parsing: converting regular expressions to EBNF is relatively trivial, and it might not even be needed, depending on what framework you use to build the parser. The problem is that mediawiki syntax is not regular [2], and very likely not LL(1) or LR(1) [3] either. I&#8217;m not convinced that it&#8217;s even context-free in all cases [4] (parsing non-CF grammar is very hard, even theoretically).  It also has an annoying tendency to depend on localization and customization: a prominent example is image links. They follow a different grammar than normal links, but can only be detected when knowing all local names of the image namespace. Another problem is templates that are &#8220;syntactically incomplete&#8221;: you can not parse the template code individually, and then use the result. You need a separate preprocessor pass (using a limited grammar) to resolve them (which requires database access). Same goes for some parser functions.  This makes it very hard to detach the parser from the rest of the system.</p>
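<p>To illustrate the image-link point, here is a minimal sketch, not MediaWiki code. The function name <code>classify_link</code> and the <code>image_namespace_names</code> parameter are hypothetical stand-ins for the wiki's localized configuration. It shows that the same link text is classified differently depending on which namespace names the wiki knows about.</p>

```python
def classify_link(wikitext_link, image_namespace_names):
    """Classify a [[...]] link as 'image' or 'page'.

    image_namespace_names is a hypothetical stand-in for the wiki's
    localized configuration; it is not a real MediaWiki API.
    """
    inner = wikitext_link.strip()[2:-2]      # drop the [[ and the closing brackets
    target = inner.split("|", 1)[0]          # text before the first pipe
    prefix, sep, _rest = target.partition(":")
    known = {name.lower() for name in image_namespace_names}
    if sep and prefix.strip().lower() in known:
        return "image"
    return "page"


# Assumed example configurations: one wiki only knows "Image",
# the other also knows a localized name.
english_wiki = ["Image"]
german_wiki = ["Image", "Bild"]

link = "[[Bild:Foo.jpg|thumb]]"
print(classify_link(link, english_wiki))  # prints: page
print(classify_link(link, german_wiki))   # prints: image
```

<p>The grammar for the link cannot be chosen from the text alone; the namespace list has to be passed in from outside, which is exactly what a static EBNF cannot express.</p>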
<p>For the parser tests: they are badly named, really &#8212; what they test is code generation, not parsing as such. What we currently have is a &#8220;munger&#8221; that mogrifies the wikitext until it resembles html &#8212; a real parser would not generate html. It would generate a parse tree (or parse events), and a code generator would be plugged into that. For a wysiwyg engine, it might not generate html code at all, but a DOM tree or something like that. And for exporting, it might generate something else, like TeX (that would rock).</p>
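<p>A rough sketch of that separation, assuming a toy grammar that only knows bold text in triple apostrophes. The names <code>parse</code>, <code>to_tex</code> and <code>to_dom</code> are made up for illustration and are not part of MediaWiki. The parser returns a parse tree only, and two independent code generators are plugged into it.</p>

```python
import re


def parse(wikitext):
    """Parse a toy wikitext subset into a flat list of (kind, text) nodes."""
    # With a capture group, re.split keeps the bold spans at odd indexes.
    parts = re.split(r"'''(.+?)'''", wikitext)
    nodes = []
    for index, part in enumerate(parts):
        if part:
            nodes.append(("bold" if index % 2 == 1 else "text", part))
    return nodes


def to_tex(nodes):
    """Code generator 1: emit TeX from the parse tree."""
    pieces = []
    for kind, text in nodes:
        pieces.append("\\textbf{" + text + "}" if kind == "bold" else text)
    return "".join(pieces)


def to_dom(nodes):
    """Code generator 2: emit a DOM-like nested structure instead of markup."""
    children = []
    for kind, text in nodes:
        children.append({"tag": "b", "children": [text]} if kind == "bold" else text)
    return {"tag": "p", "children": children}


tree = parse("plain '''bold''' plain")
print(tree)
print(to_tex(tree))
print(to_dom(tree))
```

<p>Neither generator looks at the wikitext again; they only consume the tree, so a test of <code>parse</code> alone would be a real parser test.</p>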
<p>The advantage of having a formal grammar would be to allow people to build parsers on different platforms &#8212; in PHP to be used in MediaWiki itself, in JavaScript for a web-based wysiwym-editor, in Python for use with bots, etc.</p>
<p>My point is: it&#8217;s not just a lot of work someone needs to do. There are conceptual problems with this. One of the biggest is that due to the reliance on configuration, localization, database content and extension-defined syntax, a formal grammar in the academic sense (EBNF or a production grammar [5]) is not even possible.</p>
<p>To me this means that we can not hope for a simple or clean solution. We can only try to build something that is better than what we have, and live with the quirks.</p>
<p>[1] <a href="http://en.wikipedia.org/wiki/User:Cacycle/wikEd" rel="nofollow">http://en.wikipedia.org/wiki/User:Cacycle/wikEd</a><br />
[2] <a href="http://en.wikipedia.org/wiki/Regular_grammar" rel="nofollow">http://en.wikipedia.org/wiki/Regular_grammar</a><br />
[3] <a href="http://en.wikipedia.org/wiki/LL_parser" rel="nofollow">http://en.wikipedia.org/wiki/LL_parser</a> resp.  <a href="http://en.wikipedia.org/wiki/LR_parser" rel="nofollow">http://en.wikipedia.org/wiki/LR_parser</a><br />
[4] <a href="http://en.wikipedia.org/wiki/Context_free_grammar" rel="nofollow">http://en.wikipedia.org/wiki/Context_free_grammar</a><br />
[5] <a href="http://en.wikipedia.org/wiki/Production_%28computer_science%29" rel="nofollow">http://en.wikipedia.org/wiki/Production_%28computer_science%29</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Gerard</title>
		<link>http://davidgerard.co.uk/notes/2008/04/09/regular-expressions-to-ebnf/comment-page-1/#comment-7988</link>
		<dc:creator>David Gerard</dc:creator>
		<pubDate>Thu, 10 Apr 2008 07:11:56 +0000</pubDate>
		<guid isPermaLink="false">http://davidgerard.co.uk/notes/2008/04/09/regular-expressions-to-ebnf/#comment-7988</guid>
		<description>Yeah. It&#039;s doing surprisingly well. But I doubt they&#039;d disagree at the singular ... joys ... of reverse-engineering wikitext.</description>
		<content:encoded><![CDATA[<p>Yeah. It&#8217;s doing surprisingly well. But I doubt they&#8217;d disagree at the singular &#8230; joys &#8230; of reverse-engineering wikitext.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brad</title>
		<link>http://davidgerard.co.uk/notes/2008/04/09/regular-expressions-to-ebnf/comment-page-1/#comment-7970</link>
		<dc:creator>Brad</dc:creator>
		<pubDate>Thu, 10 Apr 2008 01:34:46 +0000</pubDate>
		<guid isPermaLink="false">http://davidgerard.co.uk/notes/2008/04/09/regular-expressions-to-ebnf/#comment-7970</guid>
		<description>Have you seen &lt;a href=&quot;http://mediawiki.fckeditor.net/&quot; rel=&quot;nofollow&quot;&gt;MediaWiki+FCKeditor&lt;/a&gt;? It appears to be under &lt;a href=&quot;http://dev.fckeditor.net/report/12&quot; rel=&quot;nofollow&quot;&gt;active development&lt;/a&gt;.</description>
		<content:encoded><![CDATA[<p>Have you seen <a href="http://mediawiki.fckeditor.net/" rel="nofollow">MediaWiki+FCKeditor</a>? It appears to be under <a href="http://dev.fckeditor.net/report/12" rel="nofollow">active development</a>.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

