PetitParser Bon Mots

By: on August 20, 2011

I’ve been working with parsing number literals in Smalltalk recently. I’ve discovered a small collection of tricks not entirely explained in PetitParser’s otherwise excellent documentation, so I thought I’d share my experiences:

Lookahead/peeking

If we want to peek into the stream – see what’s next, without changing our position – use the and and not combinators. The result of that peek will be included in what the containing parser returns:

    ($a asParser , $a asParser and , $a asParser) parse: 'aa'
    "=>  #($a $a $a)"
    "         ^^    "
    ($a asParser , $a asParser and , $a asParser and, $a asParser) parse: 'aa'
    "=> #($a $a $a $a)"
    "        ^^^^^    "
    ($a asParser , $a asParser not , $b asParser) parse: 'ab'
    "=> ($a nil $b)"
    "       ^^^    "
    ($a asParser , $a asParser not , $a asParser) parse: 'aa'
    "=> ' at 1', meaning a failed parse."

Note that the not combinator peeks, and fails the parse when that peek succeeds. It, like and, preserves our position in the stream.

This works for arbitrary parsers too:

    "Make sure the stream contains the string 'abc', but only read the $a."
    'abc' asParser and , $a asParser parse: 'abc'
    "=>  #('abc' $a)"

Invoking external parsers

Sometimes we already have a parser for some subsection of our grammar. Perhaps it’s part of a third party library, and we want to integrate it with our PetitParser grammar. We need a PPActionParser, added to our parser by [:stream | "do our manual parsing here"] asParser. So let’s try it out:

    ($a asParser , [:stream | $b asParser] asParser) parse: 'a'
    "=>  {$a . a PPLiteralObjectParser(930611200, $b)}"

Er, what? Well, that PPActionParser hasn’t actually parsed anything. Instead, the result of the block has been returned as part of our parse result: “an ‘a’, and a parser that can parse a ‘b'”. Let’s see another example:

    ($a asParser , [:stream | $b asParser] asParser , $b asParser) parse: 'ab'
    "=> {$a . a PPLiteralObjectParser(1027866624, $b) . $b}"

Indeed, the PPActionParser does not, in fact parse anything. Which makes sense, since you didn’t ask it to. The second example shows that we can make a parser that returns a parser, just for kicks. So, to use a PPActionParser properly:

    ($a asParser , [:stream | $b asParser parseOn: stream] asParser end) parse: 'ab'
    "=> #($a $b)"

    "A parser that when it parses something, returns a parser than can accept a $b:"
    (([:stream | $b asParser] asParser) parse: 'ab') parse: 'b'
    "=>  $b"

Instead of a simple parser like the one we’re using here, we can use any object that consumes a stream and returns a result.

In summary, when you use a PPActionParser, make sure that the block returns a result, not a parser.

Conditional parsing

Smalltalk number literals are usually decimal – 112358, 3.14 – but also permit any base from 2 to 36 – 2r101, -36r-xyzzy.quux. When parsing a number literal, we thus need to read in some signed series of digits. After this point, if there’s an ‘r’ in the stream, we need to read in the rest of the number in base magnitude-of-whatever-we-just-read, otherwise we just read in a base-10 number. This sounds complicated! So here’s the plan of attack: we peek ahead in the stream to see if we’re dealing with a specific-base number or not. We then parse the entire literal either as a decimal number, or in some particular base (which could, of course, be base 10).

    PPSmalltalkNumberParser >> number
        ^ [:stream | | radix result |
            "Read in just enough of the number to decide which parser to use."
            result := ($- asParser optional , digits flatten , $r asParser optional) and
                parseOn: stream.
            result isPetitFailure
                ifTrue: [PPFailure message: 'Digit expected' at: stream position]
                ifFalse: [
                    "result first is the optional sign. result second is the sequence of digits."
                    radix := Number readFrom: result second.
                    "result third is the possible $r, indicating an explicit radix."
                    (result third
                        ifNil: [decimalNumber]
                        ifNotNil: [self numberBase: radix]) parseOn: stream]] asParser

There are two distinct phases we need to bear in mind: composition time, and parse time. Composition time is when you instantiate the parser, which is when the productions in our PPCompositeParser execute. In particular, that means that our productions – our method implementations – are essentially lazy initialisers. Writing something like this:

    PPSmalltalkGrammar >> number
        ^ PPSmalltalkNumberGrammar new number

looks expensive: “but aren’t we going to create a new, stateless, object every time we call this?”. Yes, but it’s only ever (normally) going to execute once, when the PPCompositeParser does its initialisation. I say “normally” – as we’ve seen we usually refer to subparsers through an instance variable. If we use a message send – self number, say – then of course we’ll be doing extra work. (Note though that PPSmalltalkGrammar >> number is still a constant function – we won’t see any side effects there… but we could, if we accidentally wrote a side-effectful method.)

In particular, this means that PPSmalltalkNumberParser >> number above is a bit tricky. It runs once, yet it (a) has a conditional parse in it and (b) the second half of that conditional instantiates a parameterised parser (through the self numberBase: radix send): feed in '2r101' and a different parser will execute to parsing 10r9.

One could trade memory and readability for execution speed by making this parse explicit:

    PPSmalltalkNumberGrammar >> integer
        ^ ($- optional, '16r' asParser , PPPredicateObjectParser anyOf: '0123456789abcdefABCDEF')
            / ($- optional , '8r' asParser , PPPredicateObjectParser anyOf: '01234567')
            / ($- optional , '2r' asParser , PPPredicateObjectParser anyOf: '01')
            / ($- optional , PPPredicateObjectParser anyOf: '0123456789')

and so on. The above isn’t complete, but you can see how tedious it would get…

So: PetitParser is an expressive library allowing one to quickly write a parser. Through the use of delegate parsers it handles left-recursive rules. And it’s not even that inefficient either!

Share

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*