
Using the Cut Command to Parse a CSV File by Delimiters and Characters


Cut can make quick work of extracting useful data from CSV files or any output that has a pattern of characters or bytes.


If you prefer video, I recorded an ad-free video version of this on YouTube.

I use the cut command a lot. Recently I was parsing a bunch of Stripe payments from a CSV file and wanted to extract a few pieces of information, which included returning the last few characters of a specific column.

Here’s a simplified version of the CSV file:

ch_abc123,ORD018190: Build a SAAS App with Flask - Nick Janetakis,Payment complete.
ch_def456,ORD018189: Dive Into Docker - Nick Janetakis,Payment complete.
ch_ghi789,ORD018188: Build a SAAS App with Flask - Nick Janetakis,Payment complete.
ch_jkl123,ORD018184: Build a SAAS App with Flask - Nick Janetakis,Payment complete.
ch_mno456,ORD018180: Build a SAAS App with Flask - Nick Janetakis,Payment complete.

What I wanted to do was get the order number of each item without a prefix or zero padding. For example, instead of ORD018190 I wanted to get 18190.

The cut command lets you filter out specific parts of a file by a delimiter or character. Technically it can handle bytes too but I haven’t used that feature.
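
If you’re curious what the byte flag looks like, here’s a minimal sketch against the demo file. With plain ASCII text like this, -b behaves the same as selecting characters:

$ cut -b 1-9 demo.csv

ch_abc123
ch_def456
ch_ghi789
ch_jkl123
ch_mno456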

Here’s how I used cut to solve my CSV parsing problem, although it’s worth mentioning cut isn’t limited to only working on CSV files (we’ll see that soon).

Identifying the Main Pattern

We’re dealing with a CSV file, so one pattern is that the data is split up by commas. Our order number is in the 2nd column, so we can use cut to return only that information with:

$ cut -d "," -f 2 demo.csv

ORD018190: Build a SAAS App with Flask - Nick Janetakis
ORD018189: Dive Into Docker - Nick Janetakis
ORD018188: Build a SAAS App with Flask - Nick Janetakis
ORD018184: Build a SAAS App with Flask - Nick Janetakis
ORD018180: Build a SAAS App with Flask - Nick Janetakis

The -d "," flag is our delimiter and the -f 2 flag is which column we want to get back.

Slicing Up Our Results Even Further

Since I was looking to get the order number, we can look at this output and spot another pattern: there’s a colon (:) right after the order number.

Let’s build up our command using Unix pipes:

$ cut -d "," -f 2 demo.csv | cut -d ":" -f 1

ORD018190
ORD018189
ORD018188
ORD018184
ORD018180

This is technically an example of not parsing a CSV file because the 2nd cut command is just operating on the output of the previous command.

Each line of output we’re feeding into the 2nd cut command looks like this:

ORD018190: Build a SAAS App with Flask - Nick Janetakis

We’re then splitting it on the colon and grabbing the first field, aka ORD018190.
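
That also means cut isn’t limited to files. For example, piping a single line of text straight into the 2nd cut command gives the same result:

$ echo "ORD018190: Build a SAAS App with Flask - Nick Janetakis" | cut -d ":" -f 1

ORD018190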

Using cut to Remove the First N Characters

The last piece of the puzzle is to return only the order number itself without the padded ORD0 prefix. If we count those characters there’s a total of 4, so we want the 5th character, but not just the 5th character on its own, we want the 5th character and everything after it.

$ cut -d "," -f 2 demo.csv | cut -d ":" -f 1 | cut -c 5-

18190
18189
18188
18184
18180

Besides splitting on a delimiter, you can tell cut to select specific characters, which is what we’re doing with the -c 5- flag. The - at the end is very important, it’s letting cut know we want everything from the 5th character onwards.
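
To see why the trailing - matters, here’s the same pipeline with -c 5 on its own, which only returns the 5th character of each line:

$ cut -d "," -f 2 demo.csv | cut -d ":" -f 1 | cut -c 5

1
1
1
1
1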

The - works with the -f flag too. Here are a few examples:

# Return the 2nd column and everything after it.
$ cut -d "," -f 2- demo.csv

# Return the 2nd column and everything before it.
$ cut -d "," -f -2 demo.csv

# Return the 1st and 2nd column (this X-Y pattern works with -c too).
$ cut -d "," -f 1-2 demo.csv
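
And since the X-Y pattern works with -c too, here’s a quick example that grabs just the ORD prefix (characters 1 through 3) from the 2nd column:

$ cut -d "," -f 2 demo.csv | cut -c 1-3

ORD
ORD
ORD
ORD
ORD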

When you combine all of this together you have a very powerful tool to help you parse and make sense out of semi-structured data.

A Couple of cut Extras

That’s not the only thing cut can do. There’s a --complement flag that you can use to return the opposite of what was cut.

For example, this will return columns 1 and 3. It’s the same cut command as before except with the --complement flag added.

$ cut -d "," -f 2 --complement demo.csv

ch_abc123,Payment complete.
ch_def456,Payment complete.
ch_ghi789,Payment complete.
ch_jkl123,Payment complete.
ch_mno456,Payment complete.

Alternatively, you can return multiple columns by separating them with a comma, such as -f 1,3. This produces the same result as above:

# Notice the lack of --complement flag here.
$ cut -d "," -f 1,3 demo.csv

ch_abc123,Payment complete.
ch_def456,Payment complete.
ch_ghi789,Payment complete.
ch_jkl123,Payment complete.
ch_mno456,Payment complete.

As for which one to use, that’s up to you. It depends.

Stripe’s CSV exports can have 56 columns. If you wanted to create 2 separate files where file A has only column 4 and file B has everything except column 4, then --complement is nice because you can run the same cut -d "," -f 4 stripe.csv > only_column_4 command with and without the --complement flag and a different output file name.
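
For example, that could look like this (the 2nd output file name is just an example):

# File A: only column 4.
$ cut -d "," -f 4 stripe.csv > only_column_4

# File B: everything except column 4.
$ cut -d "," -f 4 --complement stripe.csv > everything_but_column_4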

You can also change the output delimiter which could be handy if another program expects a certain format:

$ cut -d "," -f 1,3 --output-delimiter "|" demo.csv

ch_abc123|Payment complete.
ch_def456|Payment complete.
ch_ghi789|Payment complete.
ch_jkl123|Payment complete.
ch_mno456|Payment complete.

You can use multiple characters for the output delimiter too, such as: --output-delimiter "WOW"
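
Here’s what that looks like with the same 2 columns as before:

$ cut -d "," -f 1,3 --output-delimiter "WOW" demo.csv

ch_abc123WOWPayment complete.
ch_def456WOWPayment complete.
ch_ghi789WOWPayment complete.
ch_jkl123WOWPayment complete.
ch_mno456WOWPayment complete.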

Now to wrap things up, if you wanted to sort the original list of order IDs you could do:

$ cut -d "," -f 2 demo.csv | cut -d ":" -f 1 | cut -c 5- | sort

18180
18184
18188
18189
18190
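
One small note on that sort: it compares text by default, which works here because every order ID is the same length. If the IDs varied in length you’d want a numeric sort instead:

$ cut -d "," -f 2 demo.csv | cut -d ":" -f 1 | cut -c 5- | sort -n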

I love the command line.

Grep with a Regex Could Have Worked Here Too

There are lots of ways to solve problems on the command line. For this specific tiny CSV file, the regular expression below works too:

$ grep -oP "1\d{4}" demo.csv

18190
18189
18188
18184
18180

But what if an order doesn’t start with a 1? Then you’d need to expand the pattern with something like (1|2), and the real CSV file has 56 columns of data with multiple IDs that could contain numbers, so you run the risk of false positives.
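
For illustration, that expanded pattern might look like the line below. It still works on this demo file, but every alternation you add widens the net in a 56 column file:

$ grep -oP "(1|2)\d{4}" demo.csv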

Then you need to factor in the human time to create your regex. I spent longer thinking about that regex than the total time of just chaining together a few easy cut commands.

Also, it was for a one-off script and while I didn’t time how long it took, it felt like the triple cut command solution finished within 250ms so performance wasn’t an issue.

With that said, I do think grep and regular expressions are great and I use them all the time, often in combination with cut. I ended up using grep a lot to help filter and group results in my Stripe CSV parsing session.

Demo Video

Timestamps

  • 0:22 – I used cut a lot recently to parse a few CSV files
  • 1:43 – Delimiting by a comma and getting the 2nd column
  • 2:18 – Parsing out the order id by running cut a second time
  • 4:01 – Using cut a third time to remove the first 4 characters of output
  • 5:10 – Using a trailing hyphen to get all matches after a specific cut
  • 6:07 – Plucking out a few specific columns at once
  • 6:47 – Using the hyphen again for returning multiple columns
  • 7:07 – The complement flag will get everything but the matched cuts
  • 8:48 – Investigating cut’s man pages
  • 9:28 – Using the complement flag with the count flag
  • 10:23 – What have you cut recently?

What have you parsed recently with the cut command? Let me know below!
