a tour of python’s ElementTree path language

posted on May 03, 2012 - tagged as: python, xml

A few weeks back I started looking at ElementPath package of the ElementTree module to explore a new syntax for finding things in xml documents, but this was a bust. The existing path syntax (a sort of limited and significantly less confusing subset of xpath) works quite nicely and I can’t see any reason to fiddle with it. And while the implementation of ElementPath is simple and readable, I liked exploring it so much I thought I’d share some thoughts since it serves as a nice example of the power of generators and first class functions.

First consider this simple xml document as an example:

<root>
  <a>
    <b value="2"/>
  </a>
  <a>
    <b>
      <c value="101"/>
    </b>
  </a>
  <a value="will not be found"/>
</root>

The ElementPath expression for finding all “b” elements that are children of “a” elements is simply “a/b” (quite familiar if you’ve ever worked with xpath). The code to build up the ElementTree and actually evaluate this expression is also fairly straightforward. Assuming we’ve defined the string TEST_DOC to be the xml document in the code section above, to do this we simply build the ElementTree and call findall():

root = xml.etree.ElementTree.fromstring(TEST_DOC)
root.findall(path)

First it’s useful to think how we might write some code to find all “b” elements that are children of an “a” element without the benefit of a path language. Writing this by hand, we simply walk the tree, finding all “a” elements and checking to see if they have a “b” element as a child. Here is one possible way to write this code:

def brute_find(root):
    def find_a(root):
        for node in root:
            if node.tag == "a":
                 yield node

    def find_b(root):
        for node in root:
            if node.tag == "b":
                yield node

        for a_node in find_a(root):
            for b_node in find_b(a_node):
                yield b_node

In this particular implementation, we chain together two generators. The first yields all “a” elements, and the second yields all “b” elements that are children of those elements yielded by the first function (“a” elements). When evaluating the path expression “a/b” the actual ElementPath implementation ends up doing something quite similar. Let’s take a look at what happens step by step.

First the expression is tokenized using a regular expression. The stream of tokens for “a/b” looks like:

[('', 'a'), ('/', ''), ('', 'b')]

The meat of the work in the path language happens when it uses the first first element of the first tuple in the stream of token-tuples to dispatch to the functions responsible for creating the code that will be used to find elements.

To build up code that will walk the tree like our hand-coded example above, the ElementPath implementation looks at the first item in the tuples of the stream and dispatches to helper functions. The first item in the first tuple of the stream of tokens above, the empty string, dispatches to the prepare_child() function, which looks this:

def prepare_child(next, token):
    tag = token[1]
    def select(context, result):
        for elem in result:
            for e in elem:
                if e.tag == tag:
                    yield e
    return select

Generally speaking, all tokens in the stream cause these helper functions to be dispatched (if you are looking at the ElementPath source code, each helper function starts with “prepare”). Each helper function returns a function that contains the code for walking the tree and finding nodes of interest (and optionally the helper function gobbles up more tokens in the stream using the next() iterator). All the functions returned contain a common calling convention, taking a context and a result as parameters. I’ll ignore the context parameter except to say that it is used for path expressions that look backwards up the tree instead of just at attributes or child elements. The result parameter is the result of the previously filtered node (if any) or the root of the tree. In this particular instance, prepare_child() returns a function select() that ignores the context (it does no backtracking) and iterates over the result element, yielding children of the result element if their tag matches the second item in the provided token-tuple.

All of the functions returned by the helper functions are appended to a list, selector, and ultimately chained together:

result = [elem]
...
for select in selector:
    result = select(context, result)
return result

So, in our example path, “a/b”, the select() function looking for “a” elements is provided as input to the select() function looking for “b” elements. Conceptually we can think of this as:

function_looking_for_b_elements(context={},
                                result=function_looking for_a_elements())

Therefore when the first select function (looking for “b” elements) is evaluated, it forces the evaluation of the second function looking for “a” elements. Remember that these are all generators and inherently lazy, and this means that the entire process is lazy. If you’re only looking for one match, you’ll get the first one in the xml tree (by first I’m mean using a DFS) and no further evaluation is required. The find(), findall(), and iterfind() methods on each ElementTree node let you find one, all, or as many nodes as you need respectively.

To summarize this process, the original path expression is tokenized. Then functions are dispatched using these tokens. The functions may consume more tokens in the stream, but ultimately return functions themselves. The functions returned are chained together and the root of the tree is passed in.  All of the functions in the chain are generators, and asking for the first matched element will force evaluation all the way up the chain.

As a final thought, remember that all of this finding of elements happens against a tree that is completely loaded into memory. Obviously this isn’t always plausible. Some subsets of this path language and others like xpath, can, however, be used in a streaming setting. Adding something like this to the standard library could be useful (especially if it was general enough to be applied to streams of JSON).

All of the code I wrote for this post can be found in this gist. It has been tested in Python 3.2 only.

Comments !

social

tags