How I Used the lxml Library to Parse XML 20x Faster in Python
I had to parse 400MB of XML for some client work, and I tried a few different strategies. Here's what I ended up with.
Not too long ago I was writing a Flask service for a client that had to interact with a SOAP API (gross, I know), and one of the goals of this service was to take a bunch of XML data and then compare -> manipulate -> save it to a database.
Most requests were less than 20MB, in which case my first solution (the xmltodict Python library) was fine and dandy, but once I had to deal with 400MB of data things got quite slow.
Suddenly it was taking 80 seconds to convert an XML string into a proper data structure that I could iterate over and access fields on. This was the main bottleneck of the service.
After I spent a few hours researching how to improve the parsing speed, I landed on using the lxml library, and I was able to bring the parse time down from 80 seconds to 4 seconds, which is a 20x improvement.
# Following Along? Getting Set Up
This article has a few code snippets, and if you plan to follow along you'll need to install both the xmltodict and lxml libraries so we can compare them.
Creating a directory to store a few files:
It doesn’t matter where you create this directory but we will be creating a few Python files, an XML file and optionally a Dockerfile.
# I created mine within WSL at this location:
mkdir /d/src/tmp/pythonxml
# And now I moved into this directory since we'll be running our commands here:
cd /d/src/tmp/pythonxml
A Dockerfile that you can use:
Since I’m a big fan of Docker, here’s a Dockerfile that you can use to get up and running quickly. If you’re not using Docker and already have a Python 3.x development environment set up then you can install these packages on your system directly.
Create a new Dockerfile and make it look like this:
FROM python:3.7.4-slim-buster
LABEL maintainer="Nick Janetakis <nick.janetakis@gmail.com>"
WORKDIR /app
RUN apt-get update \
&& apt-get install -y build-essential python3-lxml --no-install-recommends \
&& pip install xmltodict==0.12.0 lxml==4.4.1 \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /usr/share/doc && rm -rf /usr/share/man \
&& apt-get purge -y --auto-remove build-essential \
&& apt-get clean
ENV PYTHONUNBUFFERED="true"
COPY . .
CMD ["python3"]
It’s worth pointing out that the lxml library requires apt installing python3-lxml on Debian based systems, which pulls in the libxml2 and libxslt C libraries. One of the reasons lxml is so fast is that it hands most of the heavy lifting of parsing XML off to those C libraries. The 2 Python libraries we’re installing are pip install xmltodict==0.12.0 lxml==4.4.1.
Building the Docker image:
Now we need to build our Docker image from our Dockerfile.
docker image build -t pythonxml .
It will take a few minutes to build and when it’s done we’ll have an image named pythonxml.
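If you want to double check that both libraries made it into the image, this optional one-liner should print their versions (xmltodict exposes __version__ and lxml exposes LXML_VERSION):
docker container run --rm pythonxml python3 -c "import xmltodict, lxml.etree; print(xmltodict.__version__, lxml.etree.LXML_VERSION)"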
Creating a Python script to generate a ~250MB sample XML file:
Creating a large XML file by hand would be lame, so I whipped up a simple script to generate a ~250MB file for us. This is the file we'll run our benchmarks against.
You’ll want to create a new file called generatexml.py and put this in it:
import random
import string
from timeit import default_timer as timer

timer_start = timer()
print('Starting to write ~250MB XML file')

with open('sample.xml', 'w') as xml:
    books = ''

    # Build up 2 million <Book> entries with random titles and authors.
    for _ in range(2000000):
        title = ''.join(random.choices(string.ascii_uppercase, k=16))
        first_name = ''.join(random.choices(string.ascii_lowercase, k=8))
        last_name = ''.join(random.choices(string.ascii_lowercase, k=12))
        alive = random.choice(['yes', 'no'])

        books += f'''
    <Book>
        <Title>{title}</Title>
        <Author alive="{alive}">{first_name} {last_name}</Author>
    </Book>'''

    content = f'''<?xml version="1.0" encoding="utf-8"?>
<Catalog>{books}
</Catalog>
'''

    xml.write(content)

seconds = timer() - timer_start
print(f'Finished writing ~250MB XML file in {seconds} seconds')
If you’re a Python developer I’m sure you can make sense of the above. How this script generates the sample file isn’t too important. Just know it creates a sample.xml file in the current directory with 2 million <Book></Book> entries.
The reason I generated so many is because there are very few XML attributes. In my real XML file I had almost 50 XML attributes and over 100,000 items. I also had closer to a 400MB file, but I wanted to keep it a bit smaller for this isolated benchmark.
Running the Python script to generate a ~250MB sample XML file:
Since I’m running everything in Docker I’m running a Docker command, but if you’re not using Docker then you can just run python3 generatexml.py.
docker container run --rm -v "${PWD}":/app pythonxml python3 generatexml.py
That command should finish running in less than a minute and produce similar output to:
Starting to write ~250MB XML file
Finished writing ~250MB XML file in 32.84721500000025 seconds
It took a while for me since I’m running all of this inside of WSL (v1) with Docker for Windows and I didn’t write it to my SSD. Have to protect those write cycles!
And if you look in your current directory, you should see:
nick@archon:/d/src/tmp/pythonxml $ ls -la
total 237316
drwxr-xr-x 1 nick nick 4096 Aug 17 13:46 .
drwxrwxrwx 1 nick nick 4096 Aug 17 13:44 ..
-rw-r--r-- 1 nick nick 474 Aug 17 13:41 Dockerfile
-rw-r--r-- 1 nick nick 870 Aug 17 13:42 generatexml.py
-rw-r--r-- 1 nick nick 1971 Aug 17 13:46 parsexml.py
-rwxrwxrwx 1 nick nick 243000759 Aug 17 13:44 sample.xml
In my case it generated a 243MB sample.xml file.
You can investigate it by running less sample.xml and paging up / down to view it. Press q to quit less:
<?xml version="1.0" encoding="utf-8"?>
<Catalog>
    <Book>
        <Title>OKICQOHZWMDOERUD</Title>
        <Author alive="yes">tcwbfagh nyolfhzeljep</Author>
    </Book>
    <Book>
        <Title>XYMOSXGHGMOBVIOE</Title>
        <Author alive="no">vkukayhe igtodhnkmgaf</Author>
    </Book>
[...]
</Catalog>
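Since the script writes each opening <Book> tag on its own line, a quick sanity check with grep should report 2000000 matches:
grep -c '<Book>' sample.xml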
Cool, so now we have our sample data. The next step is to run a few parsing benchmarks against it using 3 different XML parsing strategies.
Creating a Python script to parse the sample XML file:
The last thing we need to set up is the parsexml.py file to demonstrate how to parse the XML file and also benchmark it.
Create a new parsexml.py and make it look like this:
import sys
from timeit import default_timer as timer


def sample_xml(opts):
    """Return the contents of sample.xml as a string or bytes."""
    with open('sample.xml', opts) as xml:
        return xml.read()


# xmltodict --------------------------------------------------------------------
def parse_xmltodict():
    import xmltodict

    xml_as_string = sample_xml('r')

    timer_start = timer()
    print('[xmltodict] Starting to parse XML')

    xml_xmltodict = xmltodict.parse(xml_as_string, dict_constructor=dict)

    seconds = timer() - timer_start
    print(f'[xmltodict] Finished parsing XML in {seconds} seconds')


# etree with Python's standard library ----------------------------------------
def parse_etree_stdlib():
    import xml.etree.ElementTree as etree_stdlib

    xml_as_string = sample_xml('r')

    timer_start = timer()
    print('[etree stdlib] Starting to parse XML')

    tree = etree_stdlib.fromstring(xml_as_string)
    xml_etree_stdlib = tree.findall('./Book', {})

    seconds = timer() - timer_start
    print(f'[etree stdlib] Finished parsing XML in {seconds} seconds')


# etree with lxml -------------------------------------------------------------
def parse_etree_lxml():
    from lxml import etree as etree_lxml

    # lxml wants bytes when the document has an encoding declaration.
    xml_as_bytes = sample_xml('rb')

    timer_start = timer()
    print('[etree lxml] Starting to parse XML')

    tree = etree_lxml.fromstring(xml_as_bytes)
    xml_etree_lxml = tree.findall('./Book', {})

    seconds = timer() - timer_start
    print(f'[etree lxml] Finished parsing XML in {seconds} seconds')


# command line arguments ------------------------------------------------------
if len(sys.argv) < 2:
    print('Invalid arg, please supply: xmltodict, etree_stdlib or etree_lxml')
    sys.exit(1)

if sys.argv[1] == 'xmltodict':
    parse_xmltodict()
elif sys.argv[1] == 'etree_stdlib':
    parse_etree_stdlib()
elif sys.argv[1] == 'etree_lxml':
    parse_etree_lxml()
else:
    print('Invalid arg, please supply: xmltodict, etree_stdlib or etree_lxml')
    sys.exit(1)
We’ll go over this in a little more detail when comparing the results.
But the basic idea is we read in the sample.xml file and then parse it using 1 of the 3 strategies. We also use the default_timer function from Python’s timeit module to track how long the work took.
I know there are more robust ways to run benchmarks, but this gets the job done for this use case.
A specific parsing strategy can be run depending on what command line argument we pass in, and those can be found near the bottom of the script.
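If you ever want slightly sturdier numbers than a single timed run, the timeit module can repeat the measurement for you. Here’s a standalone sketch (not part of parsexml.py) that takes the best of 3 runs for the lxml strategy:
import timeit

# Read the file once in setup so only the parsing itself is timed.
setup = '''
from lxml import etree

with open('sample.xml', 'rb') as xml:
    xml_as_bytes = xml.read()
'''

best = min(timeit.repeat('etree.fromstring(xml_as_bytes)',
                         setup=setup, number=1, repeat=3))
print(f'Best of 3 runs: {best} seconds')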
# xmltodict vs Python’s Standard Library vs lxml
Now the fun part. Comparing the numbers:
$ docker container run --rm -v "${PWD}":/app pythonxml python3 parsexml.py xmltodict
[xmltodict] Starting to parse XML
[xmltodict] Finished parsing XML in 47.105290600000046 seconds
$ docker container run --rm -v "${PWD}":/app pythonxml python3 parsexml.py etree_stdlib
[etree stdlib] Starting to parse XML
[etree stdlib] Finished parsing XML in 12.256522099999984 seconds
$ docker container run --rm -v "${PWD}":/app pythonxml python3 parsexml.py etree_lxml
[etree lxml] Starting to parse XML
[etree lxml] Finished parsing XML in 3.200624800000014 seconds
With a ~250MB sample it’s not quite a 20x difference, but it was 20x with my 400MB sample. Even in this case it’s about a 15x improvement over xmltodict, which is a huge win.
What’s interesting is both Python’s standard library and lxml have an etree library, and the lxml variant has nearly the same API as the standard library’s, except it’s quite a bit more optimized.
If you look at the code in the parsexml.py file, both etree functions are the same. The only difference is lxml expects the file’s contents as bytes instead of a string, because our document starts with an encoding declaration (lxml raises a ValueError if you hand it a str that contains one).
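Here’s a minimal sketch showing that difference on a tiny inline document:
from lxml import etree

doc = b'<?xml version="1.0" encoding="utf-8"?><Catalog><Book/></Catalog>'

# Bytes parse fine, even with the encoding declaration:
print(etree.fromstring(doc).tag)  # Catalog

# The same document as a str raises a ValueError:
try:
    etree.fromstring(doc.decode('utf-8'))
except ValueError as err:
    print(err)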
It’s also worth pointing out you can parse files directly with etree instead of first opening a file and passing its contents to etree.fromstring. For that, look in the docs for etree.parse, or even etree.iterparse if you want to read the file in chunks instead of all at once.
Using iterparse could be handy for dealing with massive files that don’t fit in memory, or even for reading a stream with the requests library if it’s the result of an API call.
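To give you an idea of what that looks like, here’s a minimal iterparse sketch against the same sample.xml (the memory-freeing idiom at the end comes from lxml’s docs, and the processing step is just a placeholder):
from lxml import etree

# Stream over the file instead of loading it all at once. By default
# iterparse yields ('end', element) pairs as each matching tag is closed.
for _, book in etree.iterparse('sample.xml', tag='Book'):
    title = book.findtext('Title')

    # ...do something with each book here...

    # Free this element and any already-processed siblings.
    book.clear()
    while book.getprevious() is not None:
        del book.getparent()[0]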
# How Do You Iterate over the XML with All 3 Strategies?
This is starting to get a bit beyond the scope of this blog post but here’s the basics.
With xmltodict:
for book in xml_xmltodict['Catalog']['Book']:
    print(book)
Produces this output:
{'Title': 'KLLNIKKMUGGCWWZB', 'Author': {'@alive': 'no', '#text': 'trdjaxkj hemsfuuovtrw'}}
{'Title': 'COQDCPGIAUWQIKGG', 'Author': {'@alive': 'yes', '#text': 'vkkraogc bpxqworeqbbk'}}
Since it’s a dictionary you can do whatever you can do with a Python dictionary. Attributes are nested dictionaries and all of this is included in xmltodict’s docs.
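For example, pulling out individual fields is plain dictionary access (using the same xml_xmltodict from the benchmark script):
book = xml_xmltodict['Catalog']['Book'][0]

print(book['Title'])             # KLLNIKKMUGGCWWZB
print(book['Author']['#text'])   # trdjaxkj hemsfuuovtrw
print(book['Author']['@alive'])  # no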
With etree (both standard library and lxml):
for book in xml_etree_lxml:
    # 'prop' avoids shadowing Python's built-in property().
    for prop in list(book):
        print(prop.tag)
        print(prop.text)
        print(prop.attrib)

    print()
Produces this output:
Title
KLLNIKKMUGGCWWZB
{}
Author
trdjaxkj hemsfuuovtrw
{'alive': 'no'}
Title
COQDCPGIAUWQIKGG
{}
Author
vkkraogc bpxqworeqbbk
{'alive': 'yes'}
Here we can reach into the properties of the book and get anything we want. There is comprehensive documentation available in Python’s docs as well as lxml’s documentation.
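If you just need specific fields, find and findtext work in both etree implementations, so you could write something like this instead of looping over every property:
for book in xml_etree_lxml:
    title = book.findtext('Title')
    author = book.find('Author')

    print(title, author.text, author.get('alive'))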
So that’s all there is to it. I chose to use lxml and so far it’s working out great.
What are your favorite tips for parsing XML in Python? Let me know below.