Processing Lines of Output in a Loop with Your Shell Can Be Slow

I ended up rewriting a tool from Bash to Python, and one reason for that was that processing lines of output in a loop was slow.

Tools like cat, grep, cut, awk and friends are super fast. You can provide them hundreds of thousands of lines of text and they will blaze through processing it.

That makes piping output between processes very efficient. You can manipulate your output as you go to solve a whole host of text processing problems without any custom programming or scripting.
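
For example, a quick pipeline over a hypothetical CSV file (the same kind of example.csv used later in this post) can filter, slice and summarize without any custom scripting:

# Filter rows containing "hello", grab the first column, then count how often each value appears.
grep hello example.csv | cut -d "," -f 1 | sort | uniq -c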

But sometimes you need to do a bunch of processing where piping stuff from A to B isn’t enough. What you really need is to introduce custom scripting into the mix.

I ran into this scenario when building Plutus, which is a CLI tool for income and expense tracking. I open sourced it after rewriting it from Bash to Python.

For example, maybe you’re using grep to get a filtered list of results and now you want to break apart those lines, parse some fields, calculate results or do whatever else you need, and then produce a combination of new output and variable state changes that will be processed later.

You can do that in a few ways:

grep_matches="$(grep hello example.csv)"

while read -r matched_line; do
  # TODO: Implement your custom logic here.
  echo "${matched_line}"
done <<< "${grep_matches}"

Or, depending on what you’re doing, maybe this is more convenient:

while read -r matched_line; do
  echo "${matched_line}"
done < <(grep hello example.csv)

Using echo isn’t too slow but if you start running other processes in that loop it can be wildly inefficient. Once you get into the thick of it, you might find yourself spending 3-5 seconds in a loop on 1,000 items that can be done in 30ms in other scripting languages like Python.

Here’s a one-liner to reproduce the issue with 1,000 lines and calling 1 process; it takes my machine 900ms to complete:

bash -c 'time for i in {1..1000}; do wc -c <<< "${i}"; done'

If you remove the process call and do : (a no-op) it’s almost instant:

bash -c 'time for i in {1..1000}; do :; done'

Even with 100,000 instead of 1,000 it finishes in 150ms on my machine. echo and printf are both pretty fast too because they are shell built-ins; you can run type echo to see that.
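
If you want to verify that yourself, here’s a sketch in the same spirit as the one-liners above (exact timings will vary by machine):

# echo is a shell built-in, so no extra process is spawned per iteration.
bash -c 'time for i in {1..100000}; do echo "${i}"; done > /dev/null'

# Confirm which commands are built-ins vs. external binaries.
type echo printf wc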

This example takes 3.2 seconds to run for me over 3,328 matched lines. You can’t reproduce this one exactly, but it was a 10,000 line CSV file being filtered by grep:

while read -r matched_line; do
  # Count the characters for each line.
  wc -c <<< "${matched_line}"
done <<< "${grep_matches}"

If all you wanted to do was count the characters for each matched line individually, you could use grep hello example.csv | awk '{ print length() }' which finishes in 100ms, about 30x faster! But it’s not always that simple; maybe you want to do many other things, so piping directly into another program isn’t going to cut it.
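
If you have a similar file handy, the timed version of that pipeline looks like this (treat the 100ms figure as specific to my file and machine):

# time is a shell keyword here, so it times the entire pipeline.
time grep hello example.csv | awk '{ print length() }'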

You might be thinking that maybe wc itself is slow, but it’s not wc. You can replace that line with cut -d "," -f 1 to get the first column and it’s just as slow as wc (~3 seconds to process it).

It gets linearly slower depending on how many processes you call. For example, this takes 6.2 seconds, roughly 3 seconds per process called:

while read -r matched_line; do
  wc -c <<< "${matched_line}"
  cut -d "," -f 1 <<< "${matched_line}"
done <<< "${grep_matches}"

Doing the same double process call on ~37k matched lines (10x the amount above) takes 1 minute and 12 seconds, which is really painful.

Google and other sources suggest putting your lines into an array with readarray -t matched_lines and then looping over it with for matched_line in "${matched_lines[@]}", but in my case it made no difference. It was equally slow because the problem isn’t the loop itself, it’s running a process inside that loop.
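
For completeness, here’s roughly what that array version looks like; it’s still slow for the same reason, the external process spawned on every iteration:

# Load the matched lines into an array first...
readarray -t matched_lines <<< "${grep_matches}"

# ...then loop over the array. The wc call per iteration still dominates the runtime.
for matched_line in "${matched_lines[@]}"; do
  wc -c <<< "${matched_line}"
done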

# Bash Is Amazing Except When It Isn’t

I’ve been writing shell scripts for almost 10 years, and for tons of text processing problems you can get really fast scripts by piping between tools. But if you find yourself needing to do a lot of custom processing on each line, you really have 4 options:

  • Become a literal god with awk and perl (see the sketch after this list)
  • Accept the slow speed because maybe that’s ok
  • Offload components of your script to another language or tool
  • Use something else like Python or whatever your preference is
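
To give a flavor of the awk route, the two process calls from the earlier loop could be collapsed into a single awk pass over the grep output. This is only a hedged sketch of the idea, not how Plutus actually did it:

# One awk process handles the whole stream instead of spawning wc and cut per line.
# length($0) + 1 matches wc -c's count because <<< adds a trailing newline that wc counts.
grep hello example.csv | awk -F ',' '{ print length($0) + 1; print $1 }'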

I never got super into awk and perl but I know they are very powerful. I remember spending like 30 minutes using awk to convert human time formats of 1h 30m or 90m into decimal hours such as 1.5.

Here’s the code for that which I cobbled together from about half a dozen sources. It’s part of an invoice script I wrote a long time ago:

    awk -F '[h: ]' '{ if ($1 ~ /m$/) printf("%.2f\n", ($1 / 60));
            else printf("%.2f\n", $1 + ($2 / 60)) }' <<< "${hours}${minutes}"
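
To show how it gets called, here’s a hedged usage sketch; the hours and minutes values are my assumption about the inputs (the one-liner expects them concatenated, so 1h30m or just 90m):

# Hypothetical inputs; in the real invoice script these came from elsewhere.
hours="1h"
minutes="30m"

awk -F '[h: ]' '{ if ($1 ~ /m$/) printf("%.2f\n", ($1 / 60));
        else printf("%.2f\n", $1 + ($2 / 60)) }' <<< "${hours}${minutes}"
# Prints 1.50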

By the time I was done with it, it was robust and worked amazingly well, but I also feel like I lost 5 years of my life in the process of putting it together. I also don’t want to have to write 30 lines of comments for each 1 line of awk.

As for accepting the slowness, I don’t know. I’m a big fan of The Pragmatic Programmer book and it has this concept of broken windows. Plutus would be run at least once a month and I don’t want to be reminded of inefficiencies every time I run it.

Waiting 3-5 seconds for it to generate a report would kill me inside, especially when I put together a proof of concept in Python and was seeing the same thing get produced in 30ms.

I also knew I wanted to open source it, and I know how much I value performance (within reason). I’ve thrown away tools and never used them again because they opened too slowly.

It didn’t feel worth trying to write parts of the script in Bash and others in Python so I made the call to rewrite all of it. I learned a lot along the way so it was very much worth it.

That doesn’t mean Bash is horrible or useless. I use it all the time and I will continue to use it but you sometimes need to realize it’s ok to use other tools.

Realistically, using Python was a better choice for this tool anyway.

There is a lot of command line argument parsing, sub-commands, robust CSV file linting, data structure requirements, decimal calculations and other stuff.

The demo video below demonstrates how quickly a loop in Bash can get slow. It shows 10k and 100k line CSV files being filtered by grep.

# Demo Video

Timestamps

  • 0:55 – CLI tools can be fast by themselves
  • 1:37 – A no-op loop is really fast
  • 2:18 – Using echo is pretty quick
  • 2:40 – wc is slow
  • 3:26 – So is cut
  • 4:16 – An example you can reproduce
  • 5:29 – Echo and printf are shell built-ins
  • 6:54 – Using a Bash array didn’t make it faster
  • 7:31 – Bash is amazing except when it isn’t

When have you encountered slow loops when shell scripting? Let me know below.
