A few weeks back, I boasted in a bar that anyone could write a simple s-expression parser from scratch (no flex, no yacc, no bison, no antler, etc). Well, I could not boast without doing it myself, so I finally did. My s-expression lexer/parser weighs in at about 170 lines without tests (450 lines with them), and parsers integers and symbols. I even created a little repl which takes this:
(+ 1 2)
And converts it into this:
[Symbol(+), Number(1), Number(2)]
A pretty-printed python list consisting of objects representing symbols and numbers.
I think the design is fairly readable if a bit unusual. There are classes for each token (instances of which are used for fully and partially formed tokens) and we “add” these tokens together to form a larger token and ultimately invoke a next_token_exception allowing the lexer to yield the largest, greediest, most fully-formed token possible.
Here’s an example of the lexer in action:
>>> symbol_token("h") >>> _ + symbol_token("i") <parse_sexp.symbol_token object at 0x7fb9e2d25710> >>> formed = _ >>> formed + " " Traceback (most recent call last): File "", line 1, in File "parse_sexp.py", line 33, in __add__ raise next_token_exception() parse_sexp.next_token_exception >>> formed.value 'hi'
After lexing, we’re left with a stream of fully-formed tokens; it’s a simple matter to filter out the whitespace tokens and start yielding instances of Symbol() and Number() classes with proper, pythonic types for the data.
Although I used regexes (quite unnecessarily) and generators, this little thing could probably be ported to C. It’d be the fastest, most useless parser/lexer around :)
Happy Hacking!
Comments !