simple s-expression parser in python

A few weeks back, I boasted in a bar that anyone could write a simple s-expression parser from scratch (no flex, no yacc, no bison, no ANTLR, etc).  Well, I could not boast without doing it myself, so I finally did.  My s-expression lexer/parser weighs in at about 170 lines without tests (450 lines with them), and parses integers and symbols.  I even created a little repl which takes this:

(+ 1 2)

And converts it into this:

[Symbol(+), Number(1), Number(2)]

A pretty-printed python list consisting of objects representing symbols and numbers.

I think the design is fairly readable, if a bit unusual. There is a class for each kind of token (instances of which are used for both fully and partially formed tokens), and we “add” these tokens together to grow a larger token until one of them raises a next_token_exception, allowing the lexer to yield the largest, greediest, most fully-formed token possible.
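The “additive” token idea can be sketched in a few lines. The names symbol_token and next_token_exception come from the post; the internals (the regex, the immutable-grow style) are my assumption about how it might look, not the actual implementation:

```python
import re


class next_token_exception(Exception):
    """Raised when a token cannot absorb the next character."""


class symbol_token:
    # Assumed character set for symbols; the real parser may differ.
    PATTERN = re.compile(r"[A-Za-z0-9+\-*/<>=!?_]+$")

    def __init__(self, value):
        self.value = value

    def __add__(self, char):
        # Grow greedily: accept the character only while the result is
        # still a valid symbol; otherwise signal "start the next token".
        if self.PATTERN.match(self.value + char):
            return symbol_token(self.value + char)
        raise next_token_exception()
```

With this, `symbol_token("h") + "i"` produces a token whose value is `'hi'`, and adding a space raises next_token_exception, which is exactly the signal the lexer uses to emit the finished token.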

Here’s an example of the lexer in action:

>>> symbol_token("h")

>>> _ + symbol_token("i")
<parse_sexp.symbol_token object at 0x7fb9e2d25710>
>>> formed = _
>>> formed + " "
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "parse_sexp.py", line 33, in __add__
    raise next_token_exception()
parse_sexp.next_token_exception
>>> formed.value
'hi'

After lexing, we’re left with a stream of fully-formed tokens; it’s a simple matter to filter out the whitespace tokens and start yielding instances of Symbol() and Number() classes with proper, pythonic types for the data.
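That post-lex step might look something like this. Symbol and Number mirror the classes named above, but the token-stream format (simple kind/text pairs) is my assumption for the sketch:

```python
class Symbol:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Symbol({self.name})"


class Number:
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"Number({self.value})"


def parse(tokens):
    """tokens: iterable of (kind, text) pairs from the lexer."""
    for kind, text in tokens:
        if kind == "whitespace":
            continue  # filter out the whitespace tokens
        if kind == "number":
            yield Number(int(text))  # proper, pythonic int
        elif kind == "symbol":
            yield Symbol(text)
```

Feeding it the tokens lexed from `+ 1 2` yields `[Symbol(+), Number(1), Number(2)]`, the pretty-printed list shown at the top.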

Although I used regexes (quite unnecessarily) and generators, this little thing could probably be ported to C. It’d be the fastest, most useless parser/lexer around :)

Happy Hacking!
