Processing Lines of Output in a Loop with Your Shell Can Be Slow
I ended up rewriting a tool from Bash to Python, and one of the reasons was that processing lines of output in a loop was slow.
Tools like `cat`, `grep`, `cut`, `awk` and friends are super fast. You can provide them hundreds of thousands of lines of text and they will blaze through processing it.
That makes piping output between processes very efficient. You can manipulate your output as you go to solve a whole host of text processing problems without any custom programming or scripting.
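For example, a throwaway pipeline like this one (the file and pattern are made up, in the spirit of the `example.csv` used later on) filters lines, pulls out a column and counts how often each value appears, with every tool streaming into the next:

```bash
# grep filters the lines, cut extracts the second comma separated column,
# then sort + uniq -c count how many times each value shows up.
grep hello example.csv | cut -d "," -f 2 | sort | uniq -c
```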
But sometimes you need to do a bunch of processing where piping stuff from A to B isn’t enough. What you really need is to introduce custom scripting into the mix.
I ran into this scenario when building Plutus, which is a CLI tool for income and expense tracking. I open sourced it after rewriting it from Bash to Python.
For example, maybe you’re using `grep` to get a filtered list of results and now you want to break apart those lines, parse some stuff, calculate results or do whatever you need, and then produce a combination of new output and variable state changes that will be processed later.
You can do that in a few ways:
grep_matches="$(grep hello example.csv)"
while read -r matched_line; do
  # TODO: Implement your custom logic here.
  echo "${matched_line}"
done <<< "${grep_matches}"
Or, depending on what you’re doing, maybe this is more convenient:
while read -r matched_line; do
  echo "${matched_line}"
done < <(grep hello example.csv)
Using `echo` isn’t too slow, but if you start running other processes in that loop it can be wildly inefficient. Once you get into the thick of it, you might find yourself spending 3-5 seconds in a loop on 1,000 items that can be done in 30ms in other scripting languages like Python.
Here’s a one-liner to reproduce the issue with 1,000 lines and 1 process call per line. It takes 900ms to complete on my machine:
bash -c 'time for i in {1..1000}; do wc -c <<< "${i}"; done'
If you remove the process call and do `:` (a no-op) instead, it’s almost instant:
bash -c 'time for i in {1..1000}; do :; done'
Even with 100,000 iterations instead of 1,000, it finishes in 150ms on my machine. `echo` and `printf` are both pretty fast too because they are shell built-ins; you can run `type echo` to see that.
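If you want to check that on your own machine, here are a couple of variations on the earlier no-op one-liner; the exact numbers will differ but the pattern should hold:

```bash
# The no-op loop scaled up to 100,000 iterations.
bash -c 'time for i in {1..100000}; do :; done'

# echo and printf at the same scale stay quick because nothing is forked.
bash -c 'time for i in {1..100000}; do echo "${i}"; done' > /dev/null
bash -c 'time for i in {1..100000}; do printf "%s\n" "${i}"; done' > /dev/null

# Confirm they are built-ins (each line should say "is a shell builtin").
type echo printf
```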
This example takes 3.2 seconds to run for me over 3,328 matched lines. You can’t reproduce this one exactly, but it was a 10,000 line CSV file being filtered with grep:
while read -r matched_line; do
  # Count the characters for each line.
  wc -c <<< "${matched_line}"
done <<< "${grep_matches}"
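You can get close to it with something self-contained, though. This hypothetical version generates a few thousand fake matched lines instead of reading my CSV file:

```bash
# Build ~3,000 fake "matched" lines, then fork wc once per line.
grep_matches="$(seq 1 3000 | sed 's/$/,hello,world/')"

time while read -r matched_line; do
  wc -c <<< "${matched_line}"
done <<< "${grep_matches}"
```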
If all you wanted to do was count the characters for each matched line individually, you could use `grep hello example.csv | awk "{ print length() }"`, which finishes in 100ms, about 30x faster! But it’s not always that simple; maybe you want to do many other things, so piping directly into another program isn’t going to cut it.
You might be thinking, oh, maybe `wc` is slow, but it’s not `wc`. You can replace that line with `cut -d "," -f 1` to get the first column and it’s as slow as `wc` (~3 seconds to process it).
It gets linearly slower depending on how many processes you call. For example, this takes 6.2 seconds, roughly 3 seconds per process called:
while read -r matched_line; do
  wc -c <<< "${matched_line}"
  cut -d "," -f 1 <<< "${matched_line}"
done <<< "${grep_matches}"
Doing the same double process call on ~37k matched lines (10x the amount above) takes 1 minute and 12 seconds, which is really painful.
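As a preview of the options below: when the per-line work happens to map onto awk, the whole loop can often be collapsed into one awk process reading the stream. This is just a sketch that assumes all you want is the byte count and first column of each matched line:

```bash
# A single awk process handles every line. length($0) + 1 roughly matches
# `wc -c` (which also counts the trailing newline) and $1 is the first
# comma separated field, like `cut -d "," -f 1`.
grep hello example.csv | awk -F ',' '{ print length($0) + 1; print $1 }'
```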
Google and other sources say you can try putting your lines into an array with `readarray -t matched_lines` and then looping over it with `for matched_line in "${matched_lines[@]}"`, but in my case it made no difference. It was equally slow, because the problem isn’t the loop itself, it’s running a process inside that loop.
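When the per-line work is simple enough, one thing that can make a difference is keeping it all inside the shell with parameter expansion so nothing gets forked. Here’s a sketch that covers the two operations from the slow loop above; the lengths won’t exactly match `wc -c` because the trailing newline isn’t counted:

```bash
while read -r matched_line; do
  # ${#var} expands to the string's length and ${var%%,*} strips everything
  # from the first comma onward. Both are expansions, so no process is forked.
  echo "${#matched_line}"
  echo "${matched_line%%,*}"
done <<< "${grep_matches}"
```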
# Bash Is Amazing Except When It Isn’t
I’ve been writing shell scripts for almost 10 years. For tons of text processing problems you can get really fast scripts by piping between tools, but if you find yourself needing to do a lot of custom processing on each line then you really have 4 options:
- Become a literal god with `awk` and `perl`
- Accept the slow speed because maybe that’s ok
- Offload components of your script to another language or tool
- Use something else like Python or whatever your preference is
I never got super into `awk` and `perl` but I know they are very powerful. I remember spending like 30 minutes using `awk` to convert human time formats of `1h 30m` or `90m` into decimal hours such as `1.5`.
Here’s the code for that, which I cobbled together from about half a dozen sources. It’s part of an invoice script I wrote a long time ago:
awk -F '[h: ]' '{ if ($1 ~ /m$/) printf("%.2f\n", ($1 / 60));
else printf("%.2f\n", $1 + ($2 / 60)) }' <<< "${hours}${minutes}"
By the time I was done with it, it was robust and worked amazingly well, but I also feel like I lost 5 years of my life in the process of putting it together. I also don’t want to have to write 30 lines of comments for every 1 line of awk.
As for accepting the slowness, I don’t know. I’m a big fan of The Pragmatic Programmer book and it has this concept of broken windows. Plutus would be run at least once a month and I don’t want to be reminded of inefficiencies every time I run it.
Waiting 3-5 seconds for it to generate a report would kill me inside, especially when I put together a proof of concept in Python and was seeing the same thing get produced in 30ms.
I also knew I wanted to open source it and I know how much I value performance, within reason. I’ve thrown away tools and never used them again because they opened too slowly.
It didn’t feel worth trying to write parts of the script in Bash and others in Python, so I made the call to rewrite all of it. I learned a lot along the way so it was very much worth it.
That doesn’t mean Bash is horrible or useless. I use it all the time and I will continue to use it but you sometimes need to realize it’s ok to use other tools.
Realistically, using Python was a better choice for this tool anyway.
There is a lot of command line argument parsing, sub-commands, robust CSV file linting, data structure requirements, decimal calculations and other stuff.
The demo video below demonstrates how quickly a loop in Bash can get slow. It shows 10k and 100k line CSV files being filtered by grep.
# Demo Video
Timestamps
- 0:55 – CLI tools can be fast by themselves
- 1:37 – A no-op loop is really fast
- 2:18 – Using `echo` is pretty quick
- 2:40 – `wc` is slow
- 3:26 – So is `cut`
- 4:16 – An example you can reproduce
- 5:29 – `echo` and `printf` are shell built-ins
- 6:54 – Using a Bash array didn’t make it faster
- 7:31 – Bash is amazing except when it isn’t
When have you encountered slow loops when shell scripting? Let me know below.