Using Unix Tools and Bash to Convert Blog Post Titles - Part 1

You can get a lot done using the command line when you combine Bash with a couple of Unix tools. Here's a real world example.

This is a 2 part series. You are reading part 1. Looking for part 2?

I’ve been blogging now for a couple of years and over that period of time I’ve written and published over 200 posts.

I’m not sure if you’ve noticed but each blog post title is capitalized a certain way.

The title case strategy I used really came down to what was available in whatever code editor I happened to be using. When I started this blog I was using Sublime Text with the Smart Title Case plugin, and in VSCode I used VSCodeFirstUpper.

I’ve listed both of these in my Sublime Text plugins and VSCode extensions articles.

Then about 6 weeks ago I switched to Vim which didn’t have a title case plugin that I was happy with, so I wrote a wrapper script called tcc which calls out to the titlecaseconverter.com website to get a very high quality title case based on the Chicago Manual of Style. Then I simply mapped that tcc script to a hotkey in Vim, done!

Of the 3 solutions, titlecaseconverter.com produces the most accurate title case, but that means I now have a backlog of 200+ post titles that were created with the old strategies.

I didn’t want to individually go to each blog post and update each title one at a time. That would have been quite boring and error prone. Instead I wanted to spend up to 30 minutes cobbling together a solution with a few Unix tools and Bash.

If I had about 10 or maybe 20 posts I probably would have done it manually, but 200+ is pushing my limits for patience, plus I was confident I could come up with an automated solution that was equal to or faster than doing it manually.

That’s especially true if you account for human nature, where you might decide to do half of them and then reward yourself with a bit of YouTube or something else. Before you know it, the task ends up taking twice as long or more.

That and even if my estimate were off, I also title case my h3 and h4 headings which would push it to 750+ titles and there’s no way I’m going to do that manually.

While you might not have an identical task to solve, the tools I used can be used to solve all sorts of text related problems, so let’s go over what was done. It’s also a good exercise in breaking down problems.

Thinking about the Problem before Doing Anything

I gave myself a few minutes to think about the problem which I think is an important step because it helps you break down the problem into tiny steps. This way you can make progress instead of it feeling like an all or nothing task.

Individual Steps

  1. Get a list of blog posts (they are markdown files)
  2. Find the line that contains the blog post’s title
  3. Parse just the title out of that line
  4. Take the original title and convert it into the new title
  5. Replace the original title with the converted title in the file

That’s pretty much it.

Getting a List of Parsed Blog Post Titles

In this post we’re going to cover the first 3 steps of the problem.

1. Getting a List of Posts

I have all of my blog posts sitting inside of a _posts directory, which is the layout that Jekyll expects, and Jekyll is running my whole site. Each file name starts with the post’s date, so they are naturally sorted by date. Getting the list was easy enough to do with ls.

For example, here are the first 3 blog posts that were output by ls:

2015-05-20-build-a-saas-app-with-flask-part-1.md
2015-05-23-build-a-saas-app-with-flask-part-2.md
2015-05-30-build-a-saas-app-with-flask-part-3.md 

2. Finding the Title in Each Blog Post

I have a bit of info about each post stored in YAML front-matter, which is specific to Jekyll.

---
layout: "post"
tags: ["flask"]

card: "blog/cards/build-a-saas-app-with-flask-part-1.jpg"
title: "Build a SAAS App with Flask: Part 1"
description:
  Learn about the Build a SAAS App with Flask project, this is part 1 of a 5
  part series.

toc:
  - "Looking for the Other Parts of This Series?"
  - "Baseline Features"
  - "Assets"
  - "User Module"
  - "Support Module"
  - "Billing Module"
  - "Pages Module"
  - "On the Horizon"
---

[insert blog post written in Markdown]

There are other things I can set, but those are the basics. The first decision to make was to figure out what makes a title unique. In other words, how can I without a doubt identify which line in this file is the blog post’s title?

At first glance, we could say any line that contains title: is a title.

The grep tool was made for this. It lets you quickly search a file for some text or a regex pattern and prints every line that matches.

So I ran grep "title:" 2015-05-20-build-a-saas-app-with-flask-part-1.md and, as expected, it returned a match: title: "Build a SAAS App with Flask: Part 1".
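Here’s a minimal sketch of that single file search that you can run anywhere. The temporary directory and the trimmed down front matter are made up for illustration, they are not my real post.

```shell
# Throwaway post with trimmed down front matter (hypothetical sample data).
posts_dir="$(mktemp -d)"
post="${posts_dir}/2015-05-20-build-a-saas-app-with-flask-part-1.md"

cat > "${post}" <<'EOF'
---
layout: "post"
title: "Build a SAAS App with Flask: Part 1"
---

Post body goes here.
EOF

# Print every line in the file that contains the text title:
match="$(grep "title:" "${post}")"
echo "${match}"

# Clean up the sample data.
rm -rf "${posts_dir}"
```

With this sample file, the only line printed is the title line from the front matter.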

Combining Steps 1 and 2 with Grep

It just so happens grep can operate on multiple files at once. What I really ended up running was grep -R "title:" . which recursively (-R) looks through all files in the current directory (.) and returns any lines that match title:.

Running that produced a lot of output so I knew I was on the right track but I wasn’t 100% sure how many matches were found. I wanted to double check how many matches it found vs how many blog posts I have on disk to see if they match.

Getting a taste of Unix pipes to check the results:

The mini task at hand here was to make sure that both ls and grep reported back the same number of results.

# List all of the posts and then count how many results there are.
ls | wc -l
218

# List all of the grep results and then count how many results there are.
grep -R "title:" . | wc -l
219

The | is the pipe symbol, and it lets us chain together multiple commands where we send the output of one program as input into another program.

In the above cases we’re sending the output of ls and grep into the wc tool. The wc tool lets you count words in a file and -l gives you a line count.
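To see how the two counts can drift apart, here’s a small sketch using 2 made up posts, one of which has a heading that happens to contain title:. The file names and contents are assumptions for the sake of the example.

```shell
# Throwaway directory with 2 made up posts (hypothetical sample data).
posts_dir="$(mktemp -d)"

cat > "${posts_dir}/2015-05-20-first-post.md" <<'EOF'
---
title: "First Post"
---

Nothing special in this one.
EOF

cat > "${posts_dir}/2019-02-26-second-post.md" <<'EOF'
---
title: "Second Post"
---

##### Customizing the terminal window's title:
EOF

# Count the files, then count every line that contains title: in any file.
# tr strips the padding that some versions of wc put around the number.
file_count="$(ls "${posts_dir}" | wc -l | tr -d ' ')"
match_count="$(grep -R "title:" "${posts_dir}" | wc -l | tr -d ' ')"

echo "files: ${file_count}, matches: ${match_count}"

# Clean up the sample data.
rm -rf "${posts_dir}"
```

This prints files: 2, matches: 3 because the heading in the second post counts as a match too.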

As you can see here, we’re off by one. grep found 1 extra match, and it’s safe to assume ls is the correct one (which I did verify with 1 and 2 files just to be sure).

Making grep more strict:

I scrolled up into my output of grep -R "title:" . to see what went wrong and I noticed this match ./2019-02-26-launching-wsl-programs-from-a-right-click-windows-menu.md:##### Adding finishing touches by customizing the terminal window's title:.

Now it was pretty obvious to see what went wrong. The search pattern wasn’t strict enough and it reported a false positive since I happened to have title: in a sub-heading that wasn’t a blog post title.

Adjusting it to grep -R "title: " . (with an extra space after the colon) fixed the problem, but now that I was aware of potential false positives I wanted to tighten up my grep pattern to be even more restrictive.

I don’t want to go too deep into regular expressions in this post but what I ended up with was grep -RE '^title: ".*$' . | wc -l.

The TL;DR is it makes sure the line starts with title:, followed by 1 space and a double quote, and then allows any number of characters until the end of the line.
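Here’s a rough sketch that compares the loose and strict patterns side by side. It uses a single made up post where a heading contains title:, so the sample content is an assumption and not one of my real posts.

```shell
# Throwaway post where a heading also contains title: (hypothetical sample data).
posts_dir="$(mktemp -d)"

cat > "${posts_dir}/2019-02-26-sample-post.md" <<'EOF'
---
title: "Sample Post"
---

##### Customizing the terminal window's title:
EOF

# The loose pattern matches any line that contains title: anywhere.
loose_count="$(grep -R "title:" "${posts_dir}" | wc -l | tr -d ' ')"

# The strict pattern only matches lines that start with title: "
strict_count="$(grep -RE '^title: ".*$' "${posts_dir}" | wc -l | tr -d ' ')"

echo "loose: ${loose_count}, strict: ${strict_count}"

# Clean up the sample data.
rm -rf "${posts_dir}"
```

This prints loose: 2, strict: 1 since the heading no longer counts once the pattern is anchored to the start of the line.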

That produced an output of 218 which matched ls and I removed the wc -l to look at the matches manually. It looked good at a quick glance but just to be safe I wrote the results of both ls and that grep command to files and diffed them. They were exactly the same.

Confirming the matches with diff:

# Output only the names of the files that matched (-l) and write them to a file.
grep -RlE '^title: ".*$' . > /tmp/grep_results

# All of the file names looked like this in the /tmp/grep_results file.
./2015-05-20-build-a-saas-app-with-flask-part-1.md

# I used Vim's visual block mode to remove the ./ at the start of each file
# name and applied it to all of the lines.

# Output the files and write it to a file.
ls > /tmp/ls_results

# Diff both files. This produced no output (ie. no differences).
diff /tmp/grep_results /tmp/ls_results

As an aside, I’m really enjoying Vim. That ./ replacement process took literally 3 seconds to do which included the time for me to think about how I wanted to solve it.

It immediately occurred to me to use visual block mode, select the first 2 characters in the first line, hit G to jump to the bottom of the file and then d to delete the selection.
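If you’d rather keep the whole comparison on the command line, the manual Vim step could be swapped out for sed. This isn’t what I did, it’s just a sketch of an alternative. It assumes grep’s -l flag to print only the matching file names, and it sorts both lists since grep -R doesn’t promise any particular order. The sample posts and temporary directories are made up.

```shell
# Throwaway directories for the sample posts and the result files
# (hypothetical sample data).
posts_dir="$(mktemp -d)"
results_dir="$(mktemp -d)"

for name in 2015-05-20-first-post.md 2015-05-23-second-post.md; do
  cat > "${posts_dir}/${name}" <<'EOF'
---
title: "Sample Post"
---
EOF
done

# -l prints only the names of the files that matched, sed strips the
# leading ./ from each name and sort puts both lists in the same order.
(cd "${posts_dir}" && grep -RlE '^title: ".*$' . | sed 's|^\./||' | sort) \
  > "${results_dir}/grep_results"
(cd "${posts_dir}" && ls | sort) > "${results_dir}/ls_results"

# diff prints nothing when both files are identical.
diff_output="$(diff "${results_dir}/grep_results" "${results_dir}/ls_results")"

if [ -z "${diff_output}" ]; then
  echo "no differences"
fi

# Clean up the sample data.
rm -rf "${posts_dir}" "${results_dir}"
```

The trade-off is a slightly longer command in exchange for something you can re-run without opening an editor.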

Parse the Title Out of the Grep Result

The goal now is to transform title: "Build a SAAS App with Flask: Part 1" into Build a SAAS App with Flask: Part 1. This way later on I will be able to take that title and pass it directly into the tcc script to convert it.

This could have been solved in a number of ways, but I decided to Google for a grep solution and ended up with grep -oPR --no-filename '^title: "\K[^"]+' .:

  • -o only outputs the part of the line that matched instead of the whole line
  • -P instructs grep to use Perl style regular expression patterns
  • --no-filename removes the file name from the output
  • \K drops everything matched up to that point (title: ") from the output, and [^"]+ then matches everything up to the closing quote

The above command gives us the exact blog post title as output like this:

Build a SAAS App with Flask: Part 1
Build a SAAS App with Flask: Part 2
Build a SAAS App with Flask: Part 3

Whereas the original grep -RE '^title: ".*$' . command output this instead:

./2015-05-20-build-a-saas-app-with-flask-part-1.md:title: "Build a SAAS App with Flask: Part 1"
./2015-05-23-build-a-saas-app-with-flask-part-2.md:title: "Build a SAAS App with Flask: Part 2"
./2015-05-30-build-a-saas-app-with-flask-part-3.md:title: "Build a SAAS App with Flask: Part 3"
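Here’s a small sketch of the extraction running against 2 made up posts. One thing to be aware of is that -P is a GNU grep feature, so it may not be available everywhere (the grep that ships with macOS doesn’t have it, for example). The sed line is an alternative I’m adding as an assumption for those systems, it’s not what I ran.

```shell
# Throwaway directory with 2 made up posts (hypothetical sample data).
posts_dir="$(mktemp -d)"

cat > "${posts_dir}/2015-05-20-build-a-saas-app-with-flask-part-1.md" <<'EOF'
---
title: "Build a SAAS App with Flask: Part 1"
---
EOF

cat > "${posts_dir}/2015-05-23-build-a-saas-app-with-flask-part-2.md" <<'EOF'
---
title: "Build a SAAS App with Flask: Part 2"
---
EOF

# \K drops title: " from the reported match and [^"]+ keeps everything
# up to the closing quote. sort makes the order predictable.
grep_titles="$(grep -oPR --no-filename '^title: "\K[^"]+' "${posts_dir}" | sort)"

# Alternative without -P: capture the text between the quotes with sed
# and print only the lines where the substitution happened.
sed_titles="$(sed -n 's/^title: "\([^"]*\)".*/\1/p' "${posts_dir}"/*.md)"

echo "${grep_titles}"

# Clean up the sample data.
rm -rf "${posts_dir}"
```

Both variables end up holding the same 2 titles, one per line, without the title: prefix or the quotes.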

So now we’re in good shape to continue on with steps 4 and 5 of our problem and for that I will see you in part 2.

What types of problems have you solved on the command line? Let me know below!
