Using the Cut Command to Parse a CSV File by Delimiters and Characters
Cut can make quick work of extracting useful data from CSV files or any output that has a pattern of characters or bytes.
If you prefer video, I recorded a zero-ad video version of this on YouTube.
I use the cut command a lot. Recently I was parsing a bunch of Stripe payments from a CSV file and wanted to extract a few pieces of information, including the last few characters of a specific column.
Here’s a simplified version of the CSV file:
ch_abc123,ORD018190: Build a SAAS App with Flask - Nick Janetakis,Payment complete.
ch_def456,ORD018189: Dive Into Docker - Nick Janetakis,Payment complete.
ch_ghi789,ORD018188: Build a SAAS App with Flask - Nick Janetakis,Payment complete.
ch_jkl123,ORD018184: Build a SAAS App with Flask - Nick Janetakis,Payment complete.
ch_mno456,ORD018180: Build a SAAS App with Flask - Nick Janetakis,Payment complete.
What I wanted to do was get the order number of each item without a prefix or zero padding. For example, instead of ORD018190 I wanted to get 18190.
The cut command lets you extract specific parts of a file by a delimiter or character. Technically it can handle bytes too, but I haven’t used that feature.
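As a quick aside, here’s a small sketch of what byte selection looks like (this part is my assumption since I haven’t needed -b in practice). For plain ASCII input, -b and -c behave the same:

```shell
# -b selects by byte position, -c by character position.
# With plain ASCII input they produce identical output.
printf 'ch_abc123\n' | cut -b 1-2   # prints "ch"
printf 'ch_abc123\n' | cut -c 1-2   # prints "ch"
```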
Here’s how I used cut to arrive at an answer to my CSV parsing problem, although it’s worth mentioning that cut isn’t limited to only working on CSV files (we’ll see that soon).
Identifying the Main Pattern
We’re dealing with a CSV file, so one pattern is that the data is split up by commas. Our order number is also in the 2nd column, so we can use cut to return only that column with:
$ cut -d "," -f 2 demo.csv
ORD018190: Build a SAAS App with Flask - Nick Janetakis
ORD018189: Dive Into Docker - Nick Janetakis
ORD018188: Build a SAAS App with Flask - Nick Janetakis
ORD018184: Build a SAAS App with Flask - Nick Janetakis
ORD018180: Build a SAAS App with Flask - Nick Janetakis
The -d "," flag sets our delimiter and the -f 2 flag picks which column we want back.
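The delimiter can be any single character, not just a comma. For example, here’s a colon-delimited line in the style of /etc/passwd:

```shell
# -d accepts any single character as the delimiter.
# Field 1 of a passwd-style line is the username.
printf 'root:x:0:0\n' | cut -d ":" -f 1   # prints "root"
```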
Slicing Up Our Results Even Further
Since I was looking to get the order number, we can look at this output and spot another pattern: there’s a colon (:) after the order number.
Let’s build up our command using Unix pipes:
$ cut -d "," -f 2 demo.csv | cut -d ":" -f 1
ORD018190
ORD018189
ORD018188
ORD018184
ORD018180
This is technically an example of not parsing a CSV file because the 2nd cut command is just operating on the output of the previous command.
We’re feeding the 2nd cut command a list of output that looks like this:
ORD018190: Build a SAAS App with Flask - Nick Janetakis
And we’re then splitting it on the colon and grabbing the first field, aka ORD018190.
Using cut to Remove the First N Characters
The last piece of the puzzle is to return only the order number itself, without the padded ORD0 prefix. If we count those characters it’s a total of 4, so we want the 5th character, and not just the 5th character but everything after it too.
$ cut -d "," -f 2 demo.csv | cut -d ":" -f 1 | cut -c 5-
18190
18189
18188
18184
18180
Besides splitting on a delimiter, you can tell cut to select only specific characters; that’s what we’re doing with the -c 5- flag. The - at the end is very important: it tells cut we want everything from the 5th character onward.
The - also works with the -f flag. Here are a few examples:
# Return the 2nd column and everything after it.
$ cut -d "," -f 2- demo.csv
# Return the 2nd column and everything before it.
$ cut -d "," -f -2 demo.csv
# Return the 1st and 2nd column (this X-Y pattern works with -c too).
$ cut -d "," -f 1-2 demo.csv
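If you want to try the range examples yourself, here’s a self-contained version that recreates a 2-row slice of the sample file first:

```shell
# Recreate a 2-row version of the article's sample CSV.
cat > demo.csv <<'EOF'
ch_abc123,ORD018190: Build a SAAS App with Flask - Nick Janetakis,Payment complete.
ch_def456,ORD018189: Dive Into Docker - Nick Janetakis,Payment complete.
EOF

cut -d "," -f 2- demo.csv   # columns 2 through the end
cut -d "," -f -2 demo.csv   # columns 1 through 2
cut -d "," -f 1-2 demo.csv  # same result as -2 with this file
```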
When you combine all of this together you have a very powerful tool to help you parse and make sense out of semi-structured data.
A Couple of cut Extras
That’s not the only thing cut can do. There’s a --complement flag that you can use to return the opposite of what was cut.
For example, this will return columns 1 and 3. It’s the same cut command as before except with the --complement flag added.
$ cut -d "," -f 2 --complement demo.csv
ch_abc123,Payment complete.
ch_def456,Payment complete.
ch_ghi789,Payment complete.
ch_jkl123,Payment complete.
ch_mno456,Payment complete.
Alternatively, you can return multiple columns by separating them with a comma, as in -f 1,3. This produces the same result as above:
# Notice the lack of --complement flag here.
$ cut -d "," -f 1,3 demo.csv
ch_abc123,Payment complete.
ch_def456,Payment complete.
ch_ghi789,Payment complete.
ch_jkl123,Payment complete.
ch_mno456,Payment complete.
As for which one to use, that’s up to you. It depends.
Stripe can have CSV files with 56 columns. If you wanted to create 2 separate files where file A has only column 4 and file B has everything but column 4, then --complement is nice because you can run the same cut -d "," -f 4 stripe.csv > only_column_4 command with and without the --complement flag and a different output file name.
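Here’s a sketch of that two-file split, using a tiny 5-column stand-in for the real 56-column export (stripe.csv and the file names are made up for the example):

```shell
# Create a tiny 5-column stand-in for the Stripe export.
printf 'a,b,c,d,e\n' > stripe.csv

# Same cut command twice; only --complement and the output name change.
cut -d "," -f 4 stripe.csv > only_column_4
cut -d "," -f 4 --complement stripe.csv > everything_but_column_4
```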
You can also change the output delimiter, which could be handy if another program expects a certain format:
$ cut -d "," -f 1,3 --output-delimiter "|" demo.csv
ch_abc123|Payment complete.
ch_def456|Payment complete.
ch_ghi789|Payment complete.
ch_jkl123|Payment complete.
ch_mno456|Payment complete.
You can use multiple chars for the output delimiter too, such as:
--output-delimiter "WOW"
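Here’s what that looks like on a single line of the same shape as our sample data:

```shell
# "WOW" is a multi-character output delimiter; cut joins the
# selected fields with it instead of the input delimiter.
printf 'ch_abc123,ORD018190,Payment complete.\n' |
  cut -d "," -f 1,3 --output-delimiter "WOW"
# prints: ch_abc123WOWPayment complete.
```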
Now to wrap things up, if you wanted to sort the original list of order IDs you could do:
$ cut -d "," -f 2 demo.csv | cut -d ":" -f 1 | cut -c 5- | sort
18180
18184
18188
18189
18190
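One thing to keep in mind (a caveat I’m adding here, not something that bit me above): plain sort compares text, so once the zero padding is stripped, IDs with fewer digits can land in the wrong spot. sort -n sorts numerically:

```shell
# Lexical sort puts "18190" before "9999" because '1' < '9'.
printf '9999\n18190\n' | sort      # 18190 comes first
# Numeric sort compares the values instead.
printf '9999\n18190\n' | sort -n   # 9999 comes first
```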
I love the command line.
Grep with a Regex Could Have Worked Here Too
There are lots of ways to solve problems on the command line. For this specific tiny CSV file, the regular expression below works too:
$ grep -oP "1\d{4}" demo.csv
18190
18189
18188
18184
18180
But what if an order doesn’t start with a 1? Then you’d need to expand the pattern with something like (1|2), and in the real CSV file there are 56 columns of data with multiple IDs that could contain numbers, so you run the risk of false positives.
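One way to reduce those false positives (an alternative sketch, not what I actually used) is to anchor on the ORD prefix and use PCRE’s \K to drop the prefix and zero padding from the match:

```shell
# ORD0* matches the prefix and any zero padding; \K discards
# everything matched so far, so only the digits are printed.
printf 'ch_abc123,ORD018190: Dive Into Docker,Payment complete.\n' |
  grep -oP 'ORD0*\K\d+'   # prints "18190"
```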
Then you need to factor in the human time to create your regex. I spent longer thinking about that regex than the total time of just chaining together a few easy cut commands.
Also, it was for a 1 off script, and while I didn’t time how long it took, it felt like the triple cut command solution finished within 250ms, so performance wasn’t an issue.
With that said, I do think grep and regular expressions are great and I use them all the time, often in combination with cut. I ended up using grep a lot to help filter and group results in my Stripe CSV parsing session.
Demo Video
Timestamps
- 0:22 – I used cut a lot recently to parse a few CSV files
- 1:43 – Delimiting by a comma and getting the 2nd column
- 2:18 – Parsing out the order id by running cut a second time
- 4:01 – Using cut a third time to remove the first 4 characters of output
- 5:10 – Using a trailing hyphen to get all matches after a specific cut
- 6:07 – Plucking out a few specific columns at once
- 6:47 – Using the hyphen again for returning multiple columns
- 7:07 – The complement flag will get everything but the matched cuts
- 8:48 – Investigating cut’s man pages
- 9:28 – Using the complement flag with the count flag
- 10:23 – What have you cut recently?
What have you parsed recently with the cut command? Let me know below!