How I Used the lxml Library to Parse XML 20x Faster in Python

I had to parse 400mb of XML for some client work and I tried a few different strategies. Here's what I ended up with.

Not too long ago I was writing a Flask service for a client that had to interact with a SOAP API (gross, I know), and one of the goals of this service was to take a bunch of XML data and then compare -> manipulate -> save it to a database.

Most requests were less than 20MB, in which case the first solution I used (the xmltodict Python library) was fine and dandy, but once I had to deal with 400MB of data things got quite slow.

Suddenly it was taking 80 seconds to convert an XML string into a proper data structure that I could iterate over and access fields on. This was the main bottleneck of the service.

After I spent a few hours researching how to improve the parsing speed, I landed on using the lxml library and I was able to bring the parse time down from 80 seconds to 4 seconds which is a 20x improvement.

Following Along? Getting Set Up

This article will have a few code snippets and if you plan to follow along you will need to install the xmltodict library as well as the lxml library so we can compare both libraries.

Creating a directory to store a few files:

It doesn’t matter where you create this directory but we will be creating a few Python files, an XML file and optionally a Dockerfile.

# I created mine within WSL at this location:
mkdir /d/src/tmp/pythonxml

# And now I moved into this directory since we'll be running our commands here:
cd /d/src/tmp/pythonxml

A Dockerfile that you can use:

Since I’m a big fan of Docker, here’s a Dockerfile that you can use to get up and running quickly. If you’re not using Docker and already have a Python 3.x development environment set up then you can install these packages on your system directly.

Create a new Dockerfile and make it look like this:

FROM python:3.7.4-slim-buster

LABEL maintainer="Nick Janetakis <nick.janetakis@gmail.com>"

WORKDIR /app

RUN apt-get update \
  && apt-get install -y build-essential python3-lxml --no-install-recommends \
  && pip install xmltodict==0.12.0 lxml==4.4.1 \
  && rm -rf /var/lib/apt/lists/* \
  && rm -rf /usr/share/doc && rm -rf /usr/share/man \
  && apt-get purge -y --auto-remove build-essential \
  && apt-get clean

ENV PYTHONUNBUFFERED="true"

COPY . .

CMD ["python3"]

It’s worth pointing out that this Dockerfile also apt installs python3-lxml, which pulls in the libxml2 and libxslt C libraries on Debian based systems. One of the reasons why lxml is so fast is because it’s a Python binding to those C libraries, which do most of the heavy lifting for parsing XML.

The 2 Python libraries we’re installing with pip are xmltodict==0.12.0 and lxml==4.4.1.

Building the Docker image:

Now we need to build our Docker image from our Dockerfile.

docker image build -t pythonxml .

It will take a few minutes to build and when it’s done we’ll have an image named pythonxml.

Creating a Python script to generate a ~250mb sample XML file:

Creating a large XML file by hand would be lame so I whipped up a simple script to generate a ~250mb file for us. This XML file will be the file we run our benchmarks on.

You’ll want to create a new file called generatexml.py and put this in it:

import random
import string

from timeit import default_timer as timer


timer_start = timer()

print('Starting to write ~250mb XML file')

with open('sample.xml', 'w') as xml:
    books = ''

    for _ in range(2000000):
        title = ''.join(random.choices(string.ascii_uppercase, k=16))
        first_name = ''.join(random.choices(string.ascii_lowercase, k=8))
        last_name = ''.join(random.choices(string.ascii_lowercase, k=12))
        alive = random.choice(['yes', 'no'])

        books += f'''
    <Book>
        <Title>{title}</Title>
        <Author alive="{alive}">{first_name} {last_name}</Author>
    </Book>'''

    content = f'''<?xml version="1.0" encoding="utf-8"?>
<Catalog>{books}
</Catalog>
'''

    xml.write(content)

seconds = timer() - timer_start

print(f'Finished writing ~250mb XML file in {seconds} seconds')

If you’re a Python developer I’m sure you can make sense of the above. How this script generates the sample file isn’t too important. Just know it creates a sample.xml file in the current directory with 2 million <Book></Book> entries.
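
If you want a quick sanity check before generating the full file, here’s a minimal sketch that builds a tiny catalog with the same shape and confirms it parses. The make_book and make_catalog helpers and the 3 book count are my own illustrative choices. It also assumes the Author element opens with a normal start tag so the name ends up inside it, and it swaps the repeated += for str.join.

```python
import random
import string
import xml.etree.ElementTree as ET


def make_book():
    """Return one <Book> entry as a string (hypothetical helper)."""
    title = ''.join(random.choices(string.ascii_uppercase, k=16))
    first_name = ''.join(random.choices(string.ascii_lowercase, k=8))
    last_name = ''.join(random.choices(string.ascii_lowercase, k=12))
    alive = random.choice(['yes', 'no'])

    return f'''
    <Book>
        <Title>{title}</Title>
        <Author alive="{alive}">{first_name} {last_name}</Author>
    </Book>'''


def make_catalog(count):
    """Join `count` books with str.join instead of repeated += on a string."""
    books = ''.join(make_book() for _ in range(count))

    return f'<?xml version="1.0" encoding="utf-8"?>\n<Catalog>{books}\n</Catalog>\n'


catalog = make_catalog(3)

# fromstring raises ParseError if the markup is not well-formed.
root = ET.fromstring(catalog)

print(len(root.findall('./Book')), 'books parsed')
```

If that prints a book count instead of raising an error, the template is well-formed and it’s safe to scale the count up.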

The reason I generated so many is because each item has very few XML attributes. In my real XML file I had almost 50 XML attributes and over 100,000 items. I also had closer to a 400mb file, but I wanted to keep it a bit smaller for this isolated benchmark.

Running the Python script to generate a sample 250mb XML file:

Since I’m running everything in Docker I am running a Docker command but if you’re not using Docker then you can just run python3 generatexml.py.

docker container run --rm -v "${PWD}":/app pythonxml python3 generatexml.py

That command should finish running in less than a minute and produce similar output to:

Starting to write ~250mb XML file
Finished writing ~250mb XML file in 32.84721500000025 seconds

It took a while for me since I’m running all of this inside of WSL (v1) with Docker for Windows and I didn’t write it to my SSD. Have to protect those write cycles!

And if you look in your current directory, you should see:

nick@archon:/d/src/tmp/pythonxml $ ls -la
total 237316
drwxr-xr-x 1 nick nick      4096 Aug 17 13:46 .
drwxrwxrwx 1 nick nick      4096 Aug 17 13:44 ..
-rw-r--r-- 1 nick nick       474 Aug 17 13:41 Dockerfile
-rw-r--r-- 1 nick nick       870 Aug 17 13:42 generatexml.py
-rw-r--r-- 1 nick nick      1971 Aug 17 13:46 parsexml.py
-rwxrwxrwx 1 nick nick 243000759 Aug 17 13:44 sample.xml

In my case it generated a 243mb sample.xml file.

You can investigate it by running less sample.xml and paging up / down to view it. Press q to quit less:

<?xml version="1.0" encoding="utf-8"?>
<Catalog>
    <Book>
        <Title>OKICQOHZWMDOERUD</Title>
        <Author alive="yes">tcwbfagh nyolfhzeljep</Author>
    </Book>
    <Book>
        <Title>XYMOSXGHGMOBVIOE</Title>
        <Author alive="no">vkukayhe igtodhnkmgaf</Author>
    </Book>
    [...]
</Catalog>

Cool, so now we have our sample data. The next step is to run a few parsing benchmarks against it using 3 different XML parsing strategies.

Creating a Python script to parse the sample XML file:

The last thing we need to set up is the parsexml.py file to demonstrate how to parse the XML file and also benchmark it.

Create a new parsexml.py and make it look like this:

import sys

from timeit import default_timer as timer


def sample_xml(opts):
    """Return the sample XML file as a string."""
    with open('sample.xml', opts) as xml:
        return xml.read()


# xmltodict--------------------------------------------------------------------
def parse_xmltodict():
    import xmltodict

    xml_as_string = sample_xml('r')

    timer_start = timer()

    print('[xmltodict] Starting to parse XML')

    xml_xmltodict = xmltodict.parse(xml_as_string, dict_constructor=dict)

    seconds = timer() - timer_start

    print(f'[xmltodict] Finished parsing XML in {seconds} seconds')


# etree with Python's standard library ----------------------------------------
def parse_etree_stdlib():
    import xml.etree.ElementTree as etree_stdlib

    xml_as_string = sample_xml('r')

    timer_start = timer()

    print('[etree stdlib] Starting to parse XML')

    tree = etree_stdlib.fromstring(xml_as_string)

    xml_etree_stdlib = tree.findall('./Book', {})

    seconds = timer() - timer_start

    print(f'[etree stdlib] Finished parsing XML in {seconds} seconds')


# etree with lxml -------------------------------------------------------------
def parse_etree_lxml():
    from lxml import etree as etree_lxml

    xml_as_bytes = sample_xml('rb')

    timer_start = timer()

    print('[etree lxml] Starting to parse XML')

    tree = etree_lxml.fromstring(xml_as_bytes)

    xml_etree_lxml = tree.findall('./Book', {})

    seconds = timer() - timer_start

    print(f'[etree lxml] Finished parsing XML in {seconds} seconds')


# command line arguments ------------------------------------------------------
# Fall back to an empty string so a missing argument prints the help message
# instead of raising an IndexError.
strategy = sys.argv[1] if len(sys.argv) > 1 else ''

if strategy == 'xmltodict':
    parse_xmltodict()
elif strategy == 'etree_stdlib':
    parse_etree_stdlib()
elif strategy == 'etree_lxml':
    parse_etree_lxml()
else:
    print('Invalid arg, please supply: xmltodict, etree_stdlib or etree_lxml')
    sys.exit(1)

We’ll go over this in a little more detail when comparing the results.

But the basic idea is we read in the sample.xml file and then parse it using 1 of the 3 strategies. We also use the default_timer function from Python’s timeit module to track how long it took to do the work.

I know there are more robust ways to run benchmarks but this gets the job done for this use case.
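
For what it’s worth, one step up from a single timer is to repeat the measurement and keep the best run. Here’s a rough sketch using timeit.repeat from the standard library. The SMALL_XML document and the parse_once function are stand-ins I made up so it runs quickly. It isn’t what I used for the numbers in this post.

```python
import timeit
import xml.etree.ElementTree as ET

# A small in-memory document so the sketch runs quickly. Swap in the contents
# of sample.xml to benchmark the real file.
SMALL_XML = '<Catalog>' + '<Book><Title>T</Title></Book>' * 1000 + '</Catalog>'

RUNS_PER_MEASUREMENT = 5


def parse_once():
    """Parse the document and collect its books, like parsexml.py does."""
    return ET.fromstring(SMALL_XML).findall('./Book')


# Take 3 measurements, each one timing 5 back to back parses.
timings = timeit.repeat(parse_once, repeat=3, number=RUNS_PER_MEASUREMENT)

# The minimum is usually the least noisy estimate of how fast the code can run.
best_per_parse = min(timings) / RUNS_PER_MEASUREMENT

print(f'Best of 3: {best_per_parse:.6f} seconds per parse')
```

Taking the minimum instead of the average filters out runs that were slowed down by whatever else the machine happened to be doing.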

A specific parsing strategy can be run depending on what command line argument we pass in, and those can be found near the bottom of the script.

xmltodict vs Python’s Standard Library vs lxml

Now the fun part. Comparing the numbers:

$ docker container run --rm -v "${PWD}":/app pythonxml python3 parsexml.py xmltodict
[xmltodict] Starting to parse XML
[xmltodict] Finished parsing XML in 47.105290600000046 seconds

$ docker container run --rm -v "${PWD}":/app pythonxml python3 parsexml.py etree_stdlib
[etree stdlib] Starting to parse XML
[etree stdlib] Finished parsing XML in 12.256522099999984 seconds

$ docker container run --rm -v "${PWD}":/app pythonxml python3 parsexml.py etree_lxml
[etree lxml] Starting to parse XML
[etree lxml] Finished parsing XML in 3.200624800000014 seconds

With a ~250mb sample it’s not quite a 20x difference between xmltodict and lxml, but it was 20x with my 400mb file. Even in this case it’s still about a 15x improvement, which is a huge win.
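
If you want to see the ratios spelled out, here’s a tiny sketch that divides each timing by the lxml one. The numbers are copied straight from the output above and the timings dictionary is just my own way of holding them.

```python
# Timings copied from the benchmark output above (seconds).
timings = {
    'xmltodict': 47.105290600000046,
    'etree_stdlib': 12.256522099999984,
    'etree_lxml': 3.200624800000014,
}

fastest = timings['etree_lxml']

for name, seconds in timings.items():
    # How many times longer this strategy took compared to lxml.
    print(f'{name}: {seconds / fastest:.1f}x the lxml time')
```

That works out to roughly 14.7x for xmltodict and 3.8x for the standard library’s etree.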

What’s interesting is that both Python’s standard library and lxml have an etree module, and the lxml variant has pretty close to the same API as the standard library’s. It’s just a lot faster.

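If you look at the code in the parsexml.py file, both etree functions are nearly the same. The main difference is that the lxml version reads the file as bytes instead of a string. That’s because lxml raises a ValueError if you hand it a Python string that contains an XML encoding declaration, which our sample.xml has on its first line.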
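
Here’s a small sketch of the string vs bytes difference. The XML_TEXT document is made up for the example, and the lxml half is guarded so the snippet still runs if lxml isn’t installed. I’m assuming lxml 4.4.1 like the rest of this post, so other versions may behave differently.

```python
import xml.etree.ElementTree as etree_stdlib

XML_TEXT = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<Catalog><Book><Title>AAA</Title></Book></Catalog>'
)
XML_BYTES = XML_TEXT.encode('utf-8')

# The standard library accepts both a str and bytes.
stdlib_from_str = etree_stdlib.fromstring(XML_TEXT)
stdlib_from_bytes = etree_stdlib.fromstring(XML_BYTES)

try:
    from lxml import etree as etree_lxml
except ImportError:
    etree_lxml = None  # lxml isn't installed, skip the lxml half

if etree_lxml is not None:
    # Bytes work fine with lxml.
    lxml_from_bytes = etree_lxml.fromstring(XML_BYTES)

    try:
        etree_lxml.fromstring(XML_TEXT)
    except ValueError as error:
        # Expected for a str that carries an encoding declaration.
        print(f'lxml rejected the str input: {error}')
```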

It’s also worth pointing out you can parse files directly with etree instead of first opening a file and passing in its value to etree.fromstring. For that, look in the docs for etree.parse or even etree.iterparse if you want to read the file in chunks instead of all at once.

Using iterparse could be handy for dealing with massive files that don’t fit in memory or even reading it in from a stream using the requests library if it’s the result of an API call.
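
To give a rough idea of what that looks like, here’s a minimal iterparse sketch. I’m using the standard library’s version so it runs anywhere, and an in-memory io.BytesIO object stands in for a big file or a streamed response. The 2 book document is made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a large file. A real use would pass a filename or an open
# binary file object instead.
source = io.BytesIO(
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<Catalog>'
    b'<Book><Title>AAA</Title><Author alive="yes">ann lee</Author></Book>'
    b'<Book><Title>BBB</Title><Author alive="no">bob ray</Author></Book>'
    b'</Catalog>'
)

titles = []

# By default iterparse yields ('end', element) once an element is fully read.
for event, elem in ET.iterparse(source):
    if elem.tag == 'Book':
        titles.append(elem.findtext('Title'))

        # Drop the book's children so finished books don't pile up in memory.
        elem.clear()

print(titles)  # ['AAA', 'BBB']
```

The important bit is doing your work on each book as soon as it’s complete and then clearing it, otherwise you end up holding the whole tree in memory anyway.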

How Do You Iterate over the XML with All 3 Strategies?

This is starting to get a bit beyond the scope of this blog post but here are the basics.

With xmltodict:

for book in xml_xmltodict['Catalog']['Book']:
    print(book)

Produces this output:

{'Title': 'KLLNIKKMUGGCWWZB', 'Author': {'@alive': 'no', '#text': 'trdjaxkj hemsfuuovtrw'}}
{'Title': 'COQDCPGIAUWQIKGG', 'Author': {'@alive': 'yes', '#text': 'vkkraogc bpxqworeqbbk'}}

Since it’s a dictionary you can do whatever you can do with a Python dictionary. Attributes are nested dictionaries and all of this is included in xmltodict’s docs.
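
As a quick sketch of reaching into one of those entries, here’s a book copied from the output above as a plain dictionary, so it runs without xmltodict installed. The variable names are my own.

```python
# One entry copied from the xmltodict output above.
book = {
    'Title': 'KLLNIKKMUGGCWWZB',
    'Author': {'@alive': 'no', '#text': 'trdjaxkj hemsfuuovtrw'},
}

title = book['Title']
author_name = book['Author']['#text']    # element text lives under '#text'
author_alive = book['Author']['@alive']  # attributes are prefixed with '@'

print(f'{title} by {author_name} (alive: {author_alive})')
```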

With etree (both standard library and lxml):

for book in xml_etree_lxml:
    for prop in book:
        print(prop.tag)
        print(prop.text)
        print(prop.attrib)

    print()

Produces this output:

Title
KLLNIKKMUGGCWWZB
{}
Author
trdjaxkj hemsfuuovtrw
{'alive': 'no'}

Title
COQDCPGIAUWQIKGG
{}
Author
vkkraogc bpxqworeqbbk
{'alive': 'yes'}

Here we can reach into the properties of the book and get anything we want. There is comprehensive documentation available in Python’s docs as well as lxml’s documentation.
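
If you already know which fields you want, you can also look them up by name instead of looping over every child. Here’s a small sketch with the standard library’s etree. The single book is made up to match the sample data, and find, findtext and get are available on lxml elements too.

```python
import xml.etree.ElementTree as ET

# One <Book> shaped like the sample data.
book = ET.fromstring(
    '<Book>'
    '<Title>KLLNIKKMUGGCWWZB</Title>'
    '<Author alive="no">trdjaxkj hemsfuuovtrw</Author>'
    '</Book>'
)

title = book.findtext('Title')      # text of the first <Title> child
author = book.find('Author')        # the <Author> element itself
author_name = author.text
author_alive = author.get('alive')  # attribute lookup, None if it's missing

print(f'{title} by {author_name} (alive: {author_alive})')
```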

So that’s all there is to it. I chose to use lxml and so far it’s working out great.

What are your favorite tips for parsing XML in Python? Let me know below.
