Elegant XML parsing using the ElementTree Module


Mark Mruss

Note: This article was first published in the October 2007 issue of Python Magazine

XML is everywhere. It seems you can’t do much these days unless you utilize XML in one way or another. Fortunately, Python developers have a new tool in our standard arsenal: the ElementTree module. This article aims to introduce you to reading, writing, saving, and loading XML using the ElementTree module.

  1. Introduction
  2. Reading XML data
  3. Listing 1
  4. Listing 2
  5. Reading XML Attributes
  6. Writing XML
  7. Listing 3
  8. Writing XML Attributes
  9. Reading XML Files
  10. Writing XML Data to a File
  11. Reading from the Web
  12. Conclusion

Introduction

It seems like everyone needs to parse XML these days. They’re either saving their own information in XML or loading in someone else’s data. This is why I was glad to learn that, as of Python 2.5, the ElementTree XML package has been added to the standard library as xml.etree.

What I like about the ElementTree module is that it just seems to make sense. This might seem like a strange thing to say about an XML module, but I’ve had to parse enough XML in my time to know that if an XML module makes sense the first time you use it, it’s probably a keeper. The ElementTree module allows me to work with XML data in a way that is similar to how I think about XML data.

A subset of the full ElementTree module is available in the Python 2.5 standard library as xml.etree, but you don’t have to use Python 2.5 in order to use the ElementTree module. If you are still using an older version of Python (1.5.2 or later) you can simply download the module from its website and manually install it on your system. The website also has very easy to follow installation instructions, which you should consult to avoid issues while installing ElementTree.

In general, the ElementTree module treats XML data as a list of lists. All XML has a root element that will have zero or more subelements (or child elements). Each of those subelements may in turn have subelements of their own. The best way to think about this is with a brief example.

First let’s take a look at some sample XML data:

<root>
<child>One</child>
<child>Two</child>
</root>

Here we have a root element with two child elements. Each child element has some text associated with it, seen here as “One” and “Two”. If we examine the XML as a hierarchical list of lists, we see that we have one element, “root”, in our root list. Within the “root” element we have a list containing two subelements, “child” and “child”. The two “child” elements would then contain empty lists representing their lack of subelements. Not too complicated so far, is it?
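To make the list-of-lists idea concrete, here is a minimal sketch. It uses the ET.XML function that is introduced in the next section, and the variable name root is only illustrative. The single-argument print calls are written so that they run on both Python 2 and Python 3.

```python
from xml.etree import ElementTree as ET

# Parse the sample document from above; ET.XML returns the root Element.
root = ET.XML("<root><child>One</child><child>Two</child></root>")

# An Element behaves like a list of its subelements.
print(root.tag)        # root
print(len(root))       # 2
print(root[0].text)    # One
print(len(root[1]))    # 0, because this <child> has no subelements
```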

Reading XML data

Now let’s use the ElementTree package to parse this XML and print the text data associated with each child element. To start, we’ll create a Python file with the contents shown in Listing 1.

Listing 1

#!/usr/bin/env python

def main():
	pass

if __name__ == "__main__":
	main()

This is basically a template that I use for many of my simple “*.py” files. It doesn’t actually do anything except set up the script so that when the file is run, the main method will be executed. Some people like to use the Python interactive interpreter for simple hacking like this. Personally, I prefer having my code stored in a handy file so I can make simple changes and re-run the entire script when I am just playing around.

The first thing that we need to do in our Python code is import the ElementTree module:

from xml.etree import ElementTree as ET

Note: If you are not using Python 2.5 and have installed the ElementTree module on your own, you should import the ElementTree module as follows:

from elementtree import ElementTree as ET

This will import the ElementTree section of the module into your program aliased as ET. However, you don’t have to import ElementTree using an alias; you can simply import it and access it as ElementTree. Using ET is demonstrated in the Python 2.5 “What’s new” documentation[1] and I think it’s a great way to eliminate some key strokes.
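If the same script has to run both on Python 2.5 or later and on an older Python with the separately installed package, one common approach is to try the standard library import first and fall back to the standalone one. This is only a sketch of that idea:

```python
try:
    # Python 2.5 and later: ElementTree ships with the standard library.
    from xml.etree import ElementTree as ET
except ImportError:
    # Older Pythons with the separately installed elementtree package.
    from elementtree import ElementTree as ET

# Either way, the rest of the code can refer to the module as ET.
print(ET.__name__)
```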

Now we’ll begin writing code in the main method. The first step is to load the XML data described above. Normally you will be working with a file or URL; for now we want to keep this simple and load the XML data directly from the text:

element = ET.XML(
       "<root><child>One</child><child>Two</child></root>")

The XML function is described in the ElementTree documentation as follows: “Parses an XML document from a string constant. This function can be used to embed “XML literals” in Python code”[2].

Be careful here! The XML function returns an Element object, and not an ElementTree object as one might expect. Element objects are used to represent XML elements, whereas the ElementTree object is used to represent the entire XML document. Element objects may represent the entire XML document if they are the root element but will not if they are a subelement. ElementTree objects also add “some extra support for serialization to and from standard XML.”[3] The Element object that is returned here represents the root element of our XML data.
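The difference between the two object types can be checked directly. The following sketch uses ET.iselement and the ElementTree class to show what ET.XML returns and how an Element can be wrapped in an ElementTree when one is needed:

```python
from xml.etree import ElementTree as ET

element = ET.XML("<root><child>One</child><child>Two</child></root>")

# ET.XML hands back the root Element, not an ElementTree.
print(ET.iselement(element))                 # True
print(isinstance(element, ET.ElementTree))   # False

# Wrap the Element when document-level features are needed.
tree = ET.ElementTree(element)
print(tree.getroot() is element)             # True
```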

Thankfully, the Element object is iterable, so we can use a for loop to loop through all of its child elements:

for subelement in element:

This will give us all the child elements in the root element. As mentioned earlier, each element in the XML tree is represented as an Element object, so as we iterate through the root element’s child elements we are getting Element objects with which to work. This means that each pass through the for loop will give us the next child element in the form of an Element object until there are no more children left. In order to print out the text associated with an Element object we simply have to access the Element object’s text attribute:

for subelement in element:
       print subelement.text

To recap, have a look at the code in Listing 2.

Listing 2

#!/usr/bin/env python

from xml.etree import ElementTree as ET

def main():
	element = ET.XML("<root><child>One</child><child>Two</child></root>")
	for subelement in element:
		print subelement.text

if __name__ == "__main__":
	# Someone is launching this directly
	main()

Once you run the code you should get the following output:

One
Two

If an XML element does not have any text associated with it, like our root element, the Element object’s text attribute will be set to None. If you want to check if an element had any text associated with it, you can do the following:

if element.text is not None:
       print element.text
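One detail worth seeing in action: whitespace between tags counts as text. In the sketch below (the variable names compact and spaced are just for illustration), the text attribute is None only when there is truly nothing between an element's opening tag and its first child or closing tag.

```python
from xml.etree import ElementTree as ET

compact = ET.XML("<root><child>One</child><child/></root>")
print(compact.text is None)       # True: nothing between <root> and <child>
print(compact[1].text is None)    # True: the second <child/> is empty

# With line breaks between the tags, the whitespace becomes the text.
spaced = ET.XML("<root>\n<child>One</child>\n</root>")
print(repr(spaced.text))          # '\n'
```

So if your XML is formatted across several lines, checking for None alone may not be enough; stripping the text before testing it is one option.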

Reading XML Attributes

Let’s alter the XML that we are working with to add attributes to the elements and look at how we would parse that information.

If the XML uses attributes in addition to, or instead of, inner text they can be accessed using the Element object’s attrib attribute. The attrib attribute is a Python dictionary and is relatively easy to use:

def main():
       element = ET.XML(
               '<root><child val="One"/><child val="Two"/></root>')
       for subelement in element:
               print subelement.attrib

When you run the code you get the following output:

{'val': 'One'}
{'val': 'Two'}

These are the attributes for each child element stored in a dictionary. Being able to work with an XML element’s attributes as a Python dictionary is a great feature and fits well with the dynamic nature of XML attributes.
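Because attrib is a plain dictionary, the usual dictionary idioms apply. Here is a short sketch; the second child deliberately has no val attribute, and the "(no val)" default is made up for the example:

```python
from xml.etree import ElementTree as ET

element = ET.XML('<root><child val="One"/><child/></root>')

for subelement in element:
    # get() returns the default instead of raising KeyError when missing.
    print(subelement.get("val", "(no val)"))

first = element[0]
print("val" in first.attrib)      # True
print(sorted(first.attrib.items()))
```

The Element object's own get method is a shortcut for attrib.get, which saves a lookup when you only need a single attribute.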

Writing XML

Now that we’ve tried our hand at reading XML, let’s try creating some. If you understand the reading process, you should have no trouble understanding the creation process because it works in much the same manner. What we are going to do in this example is recreate the XML data that we were working with above.

The first step is to create our element:

#create the root <root>
root_element = ET.Element("root")

After this code is executed, the variable root_element is an Element object, just like the Element objects that we used earlier to parse the XML.

The next step is to create the two child elements. There are two ways to do this.

In the first method, if you know exactly what you are creating, it’s easiest to use the SubElement method, which creates an Element object that is a subelement (or child) of another Element object:

#create the first child <child>One</child>
child = ET.SubElement(root_element, "child")

This will create an Element that is a child of root_element. We then need to set the text associated with that element. To do this we use the same text attribute that we used in the first parsing example. However, instead of simply reading the text attribute we set its value:

child.text = "One"

The second approach to creating a child element is to create an Element object separately (rather than a subelement) and append it to a parent Element object. The results are exactly the same; this is simply a different approach that may come in handy when creating your XML, or working with two sets of XML data.

First we create an Element object in the same way that we created the root element:

#create the second child <child>Two</child>
child = ET.Element("child")
child.text = "Two"

This creates the child Element object and sets its text to “Two”. We then append it to the root element:

#now append
root_element.append(child)

Pretty simple! Now, if we want to look at the contents of our root_element (or any other Element object for that matter) we can use the handy tostring function. It does exactly what it says that it does: it converts an Element object into a human readable string.

#Let's see the results
print ET.tostring(root_element)

Listing 3

#!/usr/bin/env python

from xml.etree import ElementTree as ET

def main():
	#create the root <root>
	root_element = ET.Element("root")
	#create the first child <child>One</child>
	child = ET.SubElement(root_element, "child")
	child.text = "One"
	#create the second child <child>Two</child>
	child = ET.Element("child")
	child.text = "Two"
	#now append
	root_element.append(child)
	#Let's see the results
	print ET.tostring(root_element)

if __name__ == "__main__":
	# Someone is launching this directly
	main()

To recap, have a look at the code in Listing 3. When you run this code you will get the following output:

<root><child>One</child><child>Two</child></root>
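A quick way to convince yourself that the serialized output is correct is to feed it straight back into ET.XML. The sketch below does that round trip. Note that on Python 3, tostring returns a bytes object by default rather than a str, which ET.XML accepts just the same.

```python
from xml.etree import ElementTree as ET

root_element = ET.Element("root")
ET.SubElement(root_element, "child").text = "One"
ET.SubElement(root_element, "child").text = "Two"

serialized = ET.tostring(root_element)

# Feeding the serialized form back to ET.XML rebuilds an equivalent tree.
reparsed = ET.XML(serialized)
print([child.text for child in reparsed])   # ['One', 'Two']
```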

Writing XML Attributes

If you want to create the XML with attributes (as illustrated in the second reading example), you can use the Element object’s set method. To add the val attribute to the first element, use the following:

child.set("val","One")

You may also set attributes when you create Element objects:

child = ET.Element("child", val="One")
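Putting both approaches together, this sketch rebuilds the attribute version of our XML, using set for the first child and a keyword argument for the second:

```python
from xml.etree import ElementTree as ET

root_element = ET.Element("root")

# First child: attribute added after creation with set().
child = ET.SubElement(root_element, "child")
child.set("val", "One")

# Second child: attribute passed as a keyword argument at creation.
ET.SubElement(root_element, "child", val="Two")

print([c.attrib for c in root_element])   # [{'val': 'One'}, {'val': 'Two'}]
```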

Reading XML Files

Most of the time you won’t be working with XML data that you explicitly create in your code. Instead, you will usually read the XML data in from a data source, work with it, and then save it back out when you are done. Fortunately, configuring ElementTree to work with different data sources is very easy. For example, let’s take the XML data that we first used and save it into a file named our.xml in the same location as our Python file.

There are a few methods that we can use to load XML data from a file. We are going to use the parse function. This function is nice because it will accept, as a parameter, the path to a file OR a “file-like” object. The term “file-like” is used on purpose because the object does not have to be a file object per se – it simply has to be an object that behaves in a file-like manner. A “file-like” object is an object that implements a “file-like” interface, meaning that it shares many (if not all) methods with the file object. If an object is “file-like” this fact will usually be prominently mentioned in its documentation.
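To see what “file-like” means in practice, here is a small sketch that hands parse an in-memory io.BytesIO object instead of a real file. It assumes Python 2.6 or later, where the io module and byte string literals are available; the name fake_file is just for illustration.

```python
import io
from xml.etree import ElementTree as ET

# BytesIO is an in-memory object with the same read() interface as a file.
fake_file = io.BytesIO(b"<root><child>One</child><child>Two</child></root>")

tree = ET.parse(fake_file)
print([child.text for child in tree.getroot()])   # ['One', 'Two']
```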

The first thing that we need to do in order to load the XML data is determine the full path to the our.xml file. In order to calculate this, we determine the full path of our Python source file, strip the filename from it, and then append our.xml to the path. This is rather simple given that the __file__ attribute (available in Python 2.2 and later) is the relative path and filename of our Python source file. Although the __file__ attribute will be a relative path, we can use it to calculate the absolute path using the standard os module:

import os

We then call the abspath function to get the absolute path:

xml_file = os.path.abspath(__file__)

However, since we only want the directory name (not the full path and filename of our Python source file) we have to strip off the filename:

xml_file = os.path.dirname(xml_file)

Now that we have the directory in which the our.xml file resides, all we have to do is append the our.xml filename to the xml_file variable. However, instead of just doing something like:

xml_file += "/our.xml"

we will use the os module to join the two paths so that the resulting path is always correct regardless of what operating system our code is executed on:

xml_file = os.path.join(xml_file, "our.xml")

Note: If you have any trouble understanding what any of the code used to determine the path of our.xml is doing, try printing out xml_file after each of the above lines and it should become clear.
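If you find yourself repeating these three steps, they fold naturally into one small helper. The function name sibling_path and the example path below are made up for this sketch; in a real script you would pass __file__ as the first argument.

```python
import os

def sibling_path(script_path, filename):
    """Return the absolute path of filename in script_path's directory."""
    directory = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(directory, filename)

# In a script you would pass __file__; a made-up path is used here.
print(sibling_path("scripts/example.py", "our.xml"))
```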

We now have the full path to the our.xml file. In order to load its XML data we simply pass the path to the parse function:

tree = ET.parse(xml_file)

We now have an ElementTree object instance that represents our XML file.

Since we are working with files, we should watch out for incorrect paths, I/O errors, or the parse function failing for any other reason. If you wish to be extra careful, you can wrap the parse function in a try/except block in order to catch any exceptions that may be thrown:

try:
       tree = ET.parse(xml_file)
except Exception, inst:
       print "Unexpected error opening %s: %s" % (xml_file, inst)
       return

In the except block, I catch the Exception base class so that I catch any and all exceptions that may be thrown (in the case of a missing file it will most likely be an IOError exception).
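If you would rather tell a missing file apart from malformed XML, the two cases can be caught separately. The sketch below assumes Python 2.7 or later, where ElementTree provides the ParseError exception and the “except ... as” syntax is available; the function name load_tree is hypothetical.

```python
from xml.etree import ElementTree as ET

def load_tree(xml_file):
    """Parse xml_file and return the tree, or None if that fails."""
    try:
        return ET.parse(xml_file)
    except (IOError, OSError) as inst:
        # Missing file, permission problem, and similar I/O failures.
        print("Could not read %s: %s" % (xml_file, inst))
    except ET.ParseError as inst:
        # The file was read but is not well-formed XML.
        print("Could not parse %s: %s" % (xml_file, inst))
    return None
```

Returning None keeps the calling code simple: it only has to check the result before going any further.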

Writing XML Data to a File

Now that we know how to read in XML data, we should look at how one writes XML data out to a file. Let’s assume that after reading in the our.xml file we want to add another item to the XML file that we just read in:

child = ET.SubElement(tree.getroot(), "child")
child.text = "Three"

Notice that in order to add a child to the root element we used the ElementTree object’s getroot method. The getroot method simply returns the root Element object of the XML data.
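The same getroot call is what you need whenever you want to loop over a parsed document, since it is the root Element, not the ElementTree, that you iterate over. In this sketch an in-memory io.BytesIO object (available in Python 2.6 and later) stands in for the our.xml file:

```python
import io
from xml.etree import ElementTree as ET

tree = ET.parse(io.BytesIO(b"<root><child>One</child><child>Two</child></root>"))

# Ask the tree for its root Element, then work with that.
root = tree.getroot()
ET.SubElement(root, "child").text = "Three"

for subelement in root:
    print(subelement.text)
```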

Now that we have a third child element, let’s write the XML data back out to our.xml. Thanks to ElementTree this is a painless experience:

tree.write(xml_file)

That’s it!

If we want to be really careful when writing the XML data out to a file, we’ll watch out for exceptions. However, most of the time the write method will succeed without throwing an exception; it is more important to be sure that the path used is correct. Oftentimes, instead of getting the exception that you want, you end up with an XML file stored in some far-off and strange location on your hard drive because your path was incorrect or you did not specify the full path. But, as is often the case when programming, better safe than sorry:

try:
       tree.write(xml_file)
except Exception, inst:
       print "Unexpected error writing to file %s: %s" % (xml_file, inst)
       return
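Since a wrong path is the more likely problem, one simple check is to read the file back right after writing it. The following sketch does this with a temporary directory standing in for the script's own directory, so it can be run anywhere without touching your files:

```python
import os
import tempfile
from xml.etree import ElementTree as ET

root = ET.XML("<root><child>One</child><child>Two</child></root>")
ET.SubElement(root, "child").text = "Three"

# A temporary directory stands in for the script's own directory.
xml_file = os.path.join(tempfile.mkdtemp(), "our.xml")
ET.ElementTree(root).write(xml_file)

# Read the file back to confirm the path and contents are as expected.
reloaded = ET.parse(xml_file).getroot()
print([child.text for child in reloaded])   # ['One', 'Two', 'Three']
```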

To recap, you can find all of the code from this section in Listing 4.

Listing 4

#!/usr/bin/env python

from xml.etree import ElementTree as ET
import os

def main():

	xml_file = os.path.abspath(__file__)
	xml_file = os.path.dirname(xml_file)
	xml_file = os.path.join(xml_file, "our.xml")

	try:
		tree = ET.parse(xml_file)
	except Exception, inst:
		print "Unexpected error opening %s: %s" % (xml_file, inst)
		return

	child = ET.SubElement(tree.getroot(), "child")
	child.text = "Three"

	try:
		tree.write(xml_file)
	except Exception, inst:
		print "Unexpected error writing to file %s: %s" % (xml_file, inst)
		return

if __name__ == "__main__":
	# Someone is launching this directly
	main()

When you run the code and take a look at the our.xml file you should see that the third child element has been added:

<root>
<child>One</child>
<child>Two</child>
<child>Three</child>
</root>

Reading from the Web

Working with a local file is very useful, but you might also be in a situation where you will have to work with an XML file that is located on the Internet, perhaps an RSS feed. Fortunately, since the parse function explained above works with file-like objects, loading a URL is very easy.

First off, you need to import the urllib module, a standard module that allows you to open URLs in a manner similar to opening local files:

import urllib

In order to open a URL we use:

feed = urllib.urlopen("http://pythonmagazine.com/c/news/atom")
tree = ET.parse(feed)
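The code above is written for Python 2. In Python 3 the urlopen function lives in urllib.request instead, so a version that runs on both might look like the sketch below. The helper name load_remote_tree is hypothetical, and the response is closed explicitly once parsing is done.

```python
from xml.etree import ElementTree as ET

try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib import urlopen           # Python 2, as in the article

def load_remote_tree(url):
    """Open url and parse the response as XML."""
    response = urlopen(url)
    try:
        return ET.parse(response)
    finally:
        # Close the connection whether or not parsing succeeded.
        response.close()
```

As with local files, network access can fail for many reasons, so wrapping the call in a try/except block is a sensible precaution.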

Conclusion

And that’s that! This concludes our brief introduction to XML parsing using the ElementTree module. Hopefully throughout this article you have seen how easy it is to create and manipulate XML using ElementTree …and I’ve only scratched the surface. For more information take a look at the official Python documentation and some of the great examples on the effbot website. I’m sure you’ll be an XML wizard in no time.

[1] http://docs.python.org/whatsnew/modules.html#SECTION0001420000000000000000
[2] http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.XML-function
[3] http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.ElementTree-class


28 Responses to “Elegant XML parsing using the ElementTree Module”

  1. georgeblunt
    Says:

    It’s elegant, that’s for sure. But what’s still missing, is a good support for the xpath API. That’s a shame.

  2. Anonymous
    Says:

    If you want to use xpath just try lxml. You can use same ElementTree api and also have access to full xpath support querying.

  3. georgeblunt
    Says:

    Yes, I know about lxml.. but lxml doesn’t come with the standard python package (which isn’t such a huge problem, i know, but in my opinion it would be great to have xpath functionality in the standard python libs, just for convenience. The XPath API isn’t such an out-of-the-world feature for an xml library)

  4. Eoin
    Says:

    I think your websites .htaccess has been compromised.
    it is redirecting all users coming via google away from your site and to some kind of advertising site.

  5. Ankur
    Says:

    An elegantly written article. Thank you Sir !

  6. Python to CUAHSI WaterML & WaterOneFlow web service, Pt. 1 « Mi estero
    Says:

    [...] with Python 2.5, ElementTree comes as part of the standard library, pre-installed. ElementTree “treats XML data as a lists of lists”, and is widely considered a more intituitive and pythonic way of processing XML, working seamlessly [...]

  7. Raja
    Says:

    Awesome tutorial. thanks

  8. Artellos.com Blog » Using Python to access the RolePlayGateway API
    Says:

    [...] of the site, Eric Martindale, who is actively maintaining RolePlayGateway. Actually accessing and manipulating XML in python was easier then I thought it would be and bringing it into practice was a lot of fun. With a bit of [...]

  9. nml
    Says:

    Really cool tutorial

  10. Aaron Calderon
    Says:

    This xml reading/writing tutorial is very helpful. I think I can use it to work with the Wine tutorial series that you have.

    Thanks again for your grate work.

  11. Peter Downs
    Says:

    I find it useful to define a parsing function before I do anything with XML. The parsing function will return an array of all of the elements it has found, whether they are stored as element.text or element.attrib. Here it is:

    def parse_XML(element):
        a = []
        for subelement in element:
            if subelement.text is None:
                a.append(subelement.attrib)
            else:
                a.append(subelement.text)
        return a

  12. Artellos.com » Using Python to access the RolePlayGateway API
    Says:

    [...] of the site, Eric Martindale, who is actively maintaining RolePlayGateway. Actually accessing and manipulating XML in python was easier then I thought it would be and bringing it into practice was a lot of fun. With a bit of [...]

  13. Parsing XML with Python | bundles of links
    Says:

    [...] http://www.learningpython.com/2008/05/07/elegant-xml-parsing-using-the-elementtree-module/ [...]

  14. david
    Says:

    IT’s a totally simple yet helpful tutorial. Thanks alot

  15. Saint
    Says:

    The article is very disappointing when it comes to parsing xml files. Im sure if somebody is parsing a xml file, he would know how to form paths etc. Even if it absolutely had to be explained here, I guess the writer completely forgot to add the part where any post processing could be done on parsed xml file. Whats the point of just parsing. What next after parsing?? How do I access different element of the tree? How do I see the parsed data?
    Its a useless article which doesnt tell you ANYTHING whatsoever.

  16. selsine
    Says:

    Hi Saint,

    Thanks for the feedback, this is a simple introduction to parsing xml. What you do after you have parsed the xml is up to you, and dependent on what you need to do.

    I go into using the parsed data a little bit in the Reading XML Data, I have some examples of reading text and attributes. I also loop through elements in the tree.

    So hopefully that might help you. But if you don’t like it that’s all right as well. To each their own, and best of luck.

  17. Mariana_Scorp
    Says:

    Thanks a lot. You saved my life today))

  18. Rajeev
    Says:

    Wow. Thanks a ton. Just what I was looking for to use in test automation. Very useful indeed.

    Rajeev

  19. xml man
    Says:

    EXCELLENT tutorial for beginner ….
    thanks…

  20. Yuliy
    Says:

    Thanks a lot! This brief tutorial was very useful for me.

  21. Finding where an operator reads from the construction history | eX-SI Support
    Says:

    [...] a Python snippet that uses ElementTree to parse the connectionstack XML and then log the TextureOp tooltip that says where the op reads [...]

  22. OKeymaker
    Says:

    Thank you! This guide explains XML and python in sucha way that almost anyone can understand. I bet you have saved several noobs lives.
    Keep it up! :)

  23. OKeymaker
    Says:

    Just a thing. How to print out the xml file that I have loaded from internet? Thats all I want and thats all I dont know…

  24. Infinit's Notepad
    Says:

    AI File Management….

    Goals. Generate file List.  File list to contain the following: filename filepath filesize hash  How will the list be stored?  XML Paths to be scanned to be held in config file….

  25. danish
    Says:

    hi..it’s a very helpful article…but i was having a problem. i know how to take a file in a variable(tree) through ET.parse..but this is a ElementTree object..now i am having trouble to loop through the elements of that object, because i cant iterate over tree in a for loop(for iteration i need elements)..there should be a step through which i can access the elements of the object

  26. jorge
    Says:

    Excellent and well written tutorial. Thx for your time

  27. Manish
    Says:

    Hi

    I Just started learning python and trying to do web data parsing. I am trying to use xpath to get the value from the below html code.

    A
    AAAA
    CNAME
    HINFO
    MX

    What I want is to search and print the value of “SELECTED value” that is it should print the output as “CNAME”.

    I tried this code, but it’s print only [].

    xpath(‘//select[@name="record\[[0-9]{1,3}\]\[type\]“]/option/following-sibling::text()[1]‘)

    Any help would be appreciated.

  28. Manish
    Says:

    Sorry here is the html code which I missed it.

    `

    A
    AAAA
    CNAME
    HINFO
    MX

    `
