/usr/share/doc/ruby-treetop/manual-html/pitfalls_and_advanced_techniques.html is in ruby-treetop 1.4.10-5.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 | <html><head><link href="./screen.css" rel="stylesheet" type="text/css" />
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-3418876-1";
urchinTracker();
</script>
</head><body><div id="top"><div id="main_navigation"><ul><li>Documentation</li><li><a href="contribute.html">Contribute</a></li><li><a href="index.html">Home</a></li></ul></div></div><div id="middle"><div id="main_content"><div id="secondary_navigation"><ul><li><a href="syntactic_recognition.html">Syntax</a></li><li><a href="semantic_interpretation.html">Semantics</a></li><li><a href="using_in_ruby.html">Using In Ruby</a></li><li>Advanced Techniques</li></ul></div><div id="documentation_content"><h1>Pitfalls</h1>
<h2>Left Recursion</h2>
<p>An weakness shared by all recursive descent parsers is the inability to parse left-recursive rules. Consider the following rule:</p>
<pre><code>rule left_recursive
left_recursive 'a' / 'a'
end
</code></pre>
<p>Logically it should match a list of 'a' characters. But it never consumes anything, because attempting to recognize <code>left_recursive</code> begins by attempting to recognize <code>left_recursive</code>, and so goes an infinite recursion. There's always a way to eliminate these types of structures from your grammar. There's a mechanistic transformation called <em>left factorization</em> that can eliminate it, but it isn't always pretty, especially in combination with automatically constructed syntax trees. So far, I have found more thoughtful ways around the problem. For instance, in the interpreter example I interpret inherently left-recursive function application right recursively in syntax, then correct the directionality in my semantic interpretation. You may have to be clever.</p>
<h1>Advanced Techniques</h1>
<p>Here are a few interesting problems I've encountered. I figure sharing them may give you insight into how these types of issues are addressed with the tools of parsing expressions.</p>
<h2>Matching a String</h2>
<pre><code>rule string
'"' (!'"' . / '\"')* '"'
end
</code></pre>
<p>This expression says: Match a quote, then zero or more of any character but a quote or an escaped quote followed by a quote. Lookahead assertions are essential for these types of problems.</p>
<h2>Matching Nested Structures With Non-Unique Delimeters</h2>
<p>Say I want to parse a diabolical wiki syntax in which the following interpretations apply.</p>
<pre><code>** *hello* ** --> <strong><em>hello</em></strong>
* **hello** * --> <em><strong>hello</strong></em>
rule strong
'**' (em / !'*' . / '\*')+ '**'
end
rule em
'**' (strong / !'*' . / '\*')+ '**'
end
</code></pre>
<p>Emphasized text is allowed within strong text by virtue of <code>em</code> being the first alternative. Since <code>em</code> will only successfully parse if a matching <code>*</code> is found, it is permitted, but other than that, no <code>*</code> characters are allowed unless they are escaped.</p>
<h2>Matching a Keyword But Not Words Prefixed Therewith</h2>
<p>Say I want to consider a given string a characters only when it occurs in isolation. Lets use the <code>end</code> keyword as an example. We don't want the prefix of <code>'enders_game'</code> to be considered a keyword. A naiive implementation might be the following.</p>
<pre><code>rule end_keyword
'end' &space
end
</code></pre>
<p>This says that <code>'end'</code> must be followed by a space, but this space is not consumed as part of the matching of <code>keyword</code>. This works in most cases, but is actually incorrect. What if <code>end</code> occurs at the end of the buffer? In that case, it occurs in isolation but will not match the above expression. What we really mean is that <code>'end'</code> cannot be followed by a <em>non-space</em> character.</p>
<pre><code>rule end_keyword
'end' !(!' ' .)
end
</code></pre>
<p>In general, when the syntax gets tough, it helps to focus on what you really mean. A keyword is a character not followed by another character that isn't a space.</p></div></div></div><div id="bottom"></div></body></html>
|