
Randomly Order Lines on the Command Line with sort or shuf

Once in a while you may want to randomly shuffle lines or select N random lines from a file.

Quick Jump: Order Lines Randomly | Selecting 1 or More Random Lines | Demo Video

We have a couple of ways to do this with sort and shuf.

Neither option works on macOS out of the box. That's because one of the sort flags below is only available in the GNU version of sort, and shuf isn't included with macOS at all. In both cases you can brew install coreutils and run gsort and gshuf if you want to follow along.
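
If you're on a Mac, the workflow could look like this (the GNU versions get a "g" prefix so they don't clash with the built-in tools):

$ brew install coreutils

$ seq 5 | gshuf
$ seq 5 | gsort --random-sort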

An easy way to play around with both is to use the seq command, which prints a sequence of numbers. Once we have that we can randomize the order of the lines and also pick 1 or more of them at random.

$ seq 5
1
2
3
4
5

Order Lines Randomly

Of the 2 options below, shuf runs in linear time while sort's random sort does a full sort on hashed keys. Either one is probably going to be fast enough unless you have millions of lines that are being shuffled frequently. We'll cover benchmarks in a bit.

sort

#              -uR works too if you prefer the short flags
$ seq 5 | sort --unique --random-sort
3
1
5
2
4

If your lines exist in a file you can run: sort --unique --random-sort myfile

Uniqueness use cases?

Technically --unique is optional; it really comes down to your use case.

There are cases where you may want duplicates. For example, if you're running a contest where participants can enter more than once, their name could exist on multiple lines (1 for each entry).

Or, maybe you have a bunch of names in a list and you want to pair everyone up in groups of 2. Making that list unique wouldn't hurt since each person should only show up once when being paired up.
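
Here's a quick sketch of both use cases with made up names (entries.txt is hypothetical). One caveat: sort --random-sort hashes each line's contents, so duplicate lines get grouped together rather than shuffled independently, which means it won't give repeat entries extra weight. shuf shuffles every line on its own, so it's the safer pick for a weighted drawing:

# entries.txt has 1 line per contest entry, so "alice" has 2 entries
$ printf 'alice\nbob\nalice\ncarol\n' > entries.txt

# weighted drawing: alice's 2 entries are 2 separate chances to win
$ shuf entries.txt | head -n 1

# pairing: dedupe first, then shuffle and join every 2 lines into a pair
$ sort -u entries.txt | shuf | paste - -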

shuf

$ seq 5 | shuf
3
4
5
1
2

If your lines exist in a file you can run: shuf myfile

You can even use shuf itself to generate the sequence, such as:

#      -i 1-5 works too if you prefer the short flag
$ shuf --input-range 1-5
4
1
2
3
5

If you want unique lines you can do sort -u myfile | shuf so duplicates are removed ahead of time. Watch out for uniq here: it only removes adjacent duplicates, so it needs sorted input, and running shuf myfile | uniq won't reliably deduplicate since duplicates rarely end up next to each other after a shuffle.
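
You can see why with a quick test since uniq only collapses adjacent duplicates:

$ printf 'a\nb\na\n' | uniq
a
b
a

$ printf 'a\nb\na\n' | sort -u
a
b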

Benchmarks

This isn't super scientific. I ran each set of commands once on my 10 year old workstation that I assembled from parts (3.2 GHz i5, 16 GB of memory, first gen SSD).

The line counts range from 100 up to 10,000,000 for sort and up to 100,000,000 for shuf.

Letting shuf generate its own sequence is substantially faster than the other 2 options.
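
If you want to reproduce these numbers, a one liner along these lines works (a rough sketch, not the exact script I used):

$ for n in 100 1000 10000 100000 1000000 10000000; do time (seq "$n" | shuf -o tmpfile); done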

$ time (seq 100 | sort -R -o tmpfile)
( seq 100 | sort -R -o tmpfile; )  0.00s user 0.00s system 79% cpu 0.002 total

$ time (seq 1000 | sort -R -o tmpfile)
( seq 1000 | sort -R -o tmpfile; )  0.01s user 0.00s system 96% cpu 0.005 total

$ time (seq 10000 | sort -R -o tmpfile)
( seq 10000 | sort -R -o tmpfile; )  0.04s user 0.01s system 100% cpu 0.045 total

$ time (seq 100000 | sort -R -o tmpfile)
( seq 100000 | sort -R -o tmpfile; )  0.54s user 0.00s system 100% cpu 0.537 total

$ time (seq 1000000 | sort -R -o tmpfile)
( seq 1000000 | sort -R -o tmpfile; )  6.65s user 0.01s system 100% cpu 6.652 total

$ time (seq 10000000 | sort -R -o tmpfile)
( seq 10000000 | sort -R -o tmpfile; )  81.07s user 0.33s system 100% cpu 1:21.34 total

$ time (seq 100 | shuf -o tmpfile)
( seq 100 | shuf -o tmpfile; )  0.00s user 0.00s system 39% cpu 0.002 total

$ time (seq 1000 | shuf -o tmpfile)
( seq 1000 | shuf -o tmpfile; )  0.00s user 0.00s system 39% cpu 0.002 total

$ time (seq 10000 | shuf -o tmpfile)
( seq 10000 | shuf -o tmpfile; )  0.00s user 0.00s system 128% cpu 0.003 total

$ time (seq 100000 | shuf -o tmpfile)
( seq 100000 | shuf -o tmpfile; )  0.01s user 0.00s system 111% cpu 0.012 total

$ time (seq 1000000 | shuf -o tmpfile)
( seq 1000000 | shuf -o tmpfile; )  0.20s user 0.02s system 102% cpu 0.209 total

$ time (seq 10000000 | shuf -o tmpfile)
( seq 10000000 | shuf -o tmpfile; )  2.81s user 0.37s system 101% cpu 3.110 total

$ time (seq 100000000 | shuf -o tmpfile)
( seq 100000000 | shuf -o tmpfile; )  40.05s user 2.92s system 100% cpu 42.800 total

$ time shuf -i 1-100 -o tmpfile
shuf -i 1-100 -o tmpfile  0.00s user 0.00s system 87% cpu 0.001 total

$ time shuf -i 1-1000 -o tmpfile
shuf -i 1-1000 -o tmpfile  0.00s user 0.00s system 88% cpu 0.001 total

$ time shuf -i 1-10000 -o tmpfile
shuf -i 1-10000 -o tmpfile  0.00s user 0.00s system 93% cpu 0.002 total

$ time shuf -i 1-100000 -o tmpfile
shuf -i 1-100000 -o tmpfile  0.01s user 0.00s system 98% cpu 0.013 total

$ time shuf -i 1-1000000 -o tmpfile
shuf -i 1-1000000 -o tmpfile  0.12s user 0.01s system 99% cpu 0.129 total

$ time shuf -i 1-10000000 -o tmpfile
shuf -i 1-10000000 -o tmpfile  1.67s user 0.04s system 99% cpu 1.710 total

$ time shuf -i 1-100000000 -o tmpfile
shuf -i 1-100000000 -o tmpfile  18.20s user 0.77s system 97% cpu 19.517 total

If you're wondering how much of that time is spent generating the sequence vs doing the shuffling, it takes a little over 2 seconds to generate a 100 million number sequence.

$ time seq 100000000 > lines
seq 100000000 > lines  0.86s user 0.45s system 58% cpu 2.241 total

For comparison, here's how long it took shuf to shuffle the lines from that file:

$ time shuf lines -o lines_shuffled
shuf lines -o lines_shuffled  35.62s user 1.40s system 99% cpu 37.030 total

Selecting 1 or More Random Lines

Now let’s say you only want to grab a few random lines from many lines of output. Both examples below grab 3 lines but you can change that to be 1 or however many you want.

sort

$ seq 5 | sort --unique --random-sort | head -n 3
5
2
3

If your lines exist in a file you can run: sort -uR myfile | head -n 3

shuf

#                        -n 3 works too if you prefer the short flag
$ shuf --input-range 1-5 --head-count 3
3
4
1

If your lines exist in a file you can run: shuf -n 3 myfile

If -n happens to be larger than your sequence then shuf caps it at the number of available lines. For example -i 1-5 -n 10 will only ever output 5 lines.
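
For example (your ordering will differ since it's random):

$ shuf --input-range 1-5 --head-count 10
2
5
1
4
3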

Which One Is Faster?

Here’s the results of picking 3 random lines from a 10 million line file:

$ time shuf -n 3 lines
60518680
61202124
46322091
shuf -n 3 lines  4.59s user 0.16s system 99% cpu 4.756 total

sort -R lines | head -n 3 took so long that I gave up waiting. I sat there watching it, went to get some water, came back and it was still running. That was maybe ~1.5 minutes total. I'm impatient!

Needless to say the shuf solution is a lot faster – at least on my machine.

If you're optimizing for raw performance, have very large files and need to shuffle them regularly, I'm sure there are faster solutions out there.

The video below goes over running most of these commands.

Demo Video

Timestamps

  • 0:35 – Extra steps if you’re on macOS
  • 1:06 – Generating ordered lines of output
  • 1:33 – Using sort
  • 2:31 – Reading lines from a file
  • 3:25 – Using shuf
  • 4:07 – Going over the benchmarks
  • 7:03 – Selecting random lines from a file with sort and shuf

When was the last time you wanted to do this? Let me know below.
