Non-blocking parsing

By: on March 23, 2013

Last month we saw one way how to produce decent error messages while parsing. I’ve also mentioned that parsing with derivatives is a non-blocking parsing technique. What’s that actually mean?

Let’s look at a counterexample first. PetitParser assumes that it will parse a complete whatever-it-is-you’re-parsing: a complete Smalltalk method, a complete SIP packet, a complete HTML document. If you attach a PetitParser to a socket, and you give it half a SIP packet, it will wait for more input, blocking the execution of that thread. (Really, in most Smalltalks that means blocking a Process, which represents a green (preemptively multitasked) thread.

If you’re writing a network server and you use a blocking parser, that means you’ve just added a denial of service vector: an attacker can open connections which will use up more and more threads, and give you a headache.

In contrast, a non-blocking parser will read what it can from its input, and return. In the case of parsing with derivatives, it will return a parser that contains parses for the input received so far. So let’s try it out!

| d num s |
d := DerivingParser deriverBlock.
num := (LiteralSet tokens: ($0 to: $9)) star reduce: [:digits |
    digits inject: 0 into: [:acc :each |
        (acc * 10) + (each asInteger - $0 asInteger)]].

s := '12a3' reading positionTracking parsing: num.
{s get parseNull.
 s get parseNull.
 [s get] on: ParseError do: [:e | e messageText]}
"=> {a Set(1).
     a Set(12).
     'Error at line 1, column 3: Expected one of <$0, $1, $2, $3, $4, $5, $6, $7, $8, $9>, found a <$a>.'}"

Note that each get pulls precisely one character off the underlying stream. In the context of a network server, we can simply read whatever’s in our socket buffer, and the parser will process what it can and return. No denial of service, and it means you can employ standard asynchronous IO in a thread that services many sockets as and when they receive data.

Under the hood, the implementation’s quite simple: #parsing: returns an XTTransformReadStream. This stream lets us read an arbitrary number of elements from a source stream, process them in any fashion we like, and then write an arbitrary number of elements to a destination stream (from which downstream streams will consume). Here, we read one element from input, calculate the derivative of the current parser with that object, record that parser, and push it to output. We do have to fiddle a bit with the output – contentsSpecies: Array so that XTreams knows how to store the interim parsers.

XTPositionTrackingStream >> parsing: aParser
    "Return an XTream that takes a parser P, consumes a character C, calculates
     the derivative of P with respect to C, and produces a new parser. Throws a
     ParseError if it unexpectedly derives Empty."

    | d derivative parse |
    derivative := aParser.
    d := DerivingParser deriverBlock.
    ^ (self transforming: [:input :output |
        parse := [:next :p | | drv |
            drv := ParserCompacter value: (d value: next value: p).
            drv isEmptyParser
                ifTrue: [ParseError
                    signalLine: input lineNumber
         column: input columnNumber
          error: drv error]
                ifFalse: [drv]].
        derivative := parse value: input get value: derivative.
        output write: derivative])
            contentsSpecies: Array;

One last note for those not familiar with Smalltalk syntax, the bit with the semicolon is a cascade. Messages separated by semicolons indicate a sequence of messages sent to the receiver of the first message. So

| arr |
arr := #(1 2 3).
    at: 1 put: 0;
    at: 2 put: 1;
    at: 3 put: 2;

"is equivalent to"
arr at: 1 put: 0.
arr at: 2 put: 1.
arr at: 3 put: 2.
arr yourself.

arr yourself == arr "=> true"

The last message – #yourself – simply returns the original receiver. In XTPositionTrackingStream >> parsing: that means we return the XTTransformReadStream.


Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>