Updated on April 2, 2019 in #linux

Using Unix Tools and Bash to Convert Blog Post Titles - Part 2

Combine Bash, grep, sed and Python to title case any number of words. In this example, it is being done on 200+ blog posts.

This is a 2 part series. You are reading part 2. Catch up by reading part 1.

Back in part 1, we covered the general problem and also tackled the first 3 steps to solve it which left us with being able to get a parsed list of blog post titles from any number of Markdown files that are formatted for Jekyll.

Now it’s time to pick up from where we left off and cover taking the original title, converting it to the proper title case and then replacing the original title with the converted title for all of the blog posts. We’re going to do that with a few lines of Bash.

# Talking about the Results First

Before getting into the code, I want to talk about the results first because it’s important to see if it’s even worth it.

I mean, if you continued to stick with existing VSCode or Sublime Text plugins to do this, it might be good enough for you, but…

I was pretty surprised at the outcome. Across 218 blog post titles, titlecaseconverter.com changed 28 of them for the better. That means 12.8% of my titles were incorrect before.

It fixed a bunch of cases where plugins for VSCode and Sublime Text incorrectly capitalized words like “with”, “without”, “from”, “how” and quite a few others.

3 examples where this solution fixed mistakes by other editor plugins:

# The top title is the original title and the bottom title is the converted title.

Build a SAAS App With Flask Free Sample Videos
Build a SAAS App with Flask Free Sample Videos

5 Steps to Take Before Moving Your Applications Into Docker
5 Steps to Take before Moving Your Applications into Docker

Understanding how the Docker Daemon and Docker CLI Work Together
Understanding How the Docker Daemon and Docker CLI Work Together

It’s also not as simple as adding hard coded rules to fix these words because the context in which you use them changes how they are capitalized. Sometimes they should be capitalized while other times they shouldn’t.

No title case converter is optimized for programming terms:

After running the script against all of my blog posts, I had to go and manually make changes to 19 of them, but this isn’t due to mistakes from titlecaseconverter.com.

It’s optimized for title casing standard titles, but a decent number of my titles have words like nginx and tmux in them so it capitalized them as “Nginx” and “Tmux”.

The 19 fixes I had to make came down to transforming those words into lower case.

While creating the script I built in extra functionality to detect changes between the original and converted title. I wanted to minimize the manual work I had to do because I expected I would have a few programming term related titles to fix.

With the help of some print statements it only took a few minutes to identify and fix them.
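As a rough sketch of how that manual clean-up could be automated, the helper below lower cases a fixed list of programming terms in an already converted title. The function name restore_terms and the term list are assumptions for illustration, not part of the original script, and it relies on GNU sed for \b and the I flag.

```shell
#!/usr/bin/env bash

# Hypothetical helper: lower case known programming terms in a converted title.
# The term list is an assumption; extend it with whatever your titles use.
restore_terms() {
    local title="$1" term
    for term in nginx tmux; do
        # \b is a word boundary and the I flag makes the match
        # case-insensitive (both are GNU sed extensions).
        title="$(printf '%s\n' "${title}" | sed -E "s/\b${term}\b/${term}/Ig")"
    done
    printf '%s\n' "${title}"
}

restore_terms "How to Use Nginx and Tmux Together"
```

This only handles single words. Multi-word cases such as Docker commands would still need a manual pass.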

# Creating a Game Plan to Develop the Script

There was no way I was going to test this live on my main directory of posts, even though I have daily backups of it. I knew I was going to have to make a bunch of tests and changes that couldn’t be reverted, so it was best to do it in isolation.

By the way, they couldn’t be reverted because after running the script, the original title would be changed to the newly converted title which means the original title is long gone. In other words, there is no undo.
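One way to soften the "no undo" problem, separate from what the script in this post does, is to let GNU sed keep a copy of each file it edits. This is a minimal sketch. The function name replace_with_backup is hypothetical and it assumes the titles contain no regex metacharacters and no @ character.

```shell
#!/usr/bin/env bash

# replace_with_backup FILE FROM TO
# Hypothetical helper. FROM and TO are dropped into the sed expression as-is,
# so this sketch assumes they contain no regex metacharacters and no "@".
replace_with_backup() {
    local file="$1" from="$2" to="$3"
    # -i.bak edits FILE in place and saves the original as FILE.bak (GNU sed).
    sed -i.bak "s@${from}@${to}@" "${file}"
}

# To undo a change afterwards: mv "${file}.bak" "${file}"
```

Working on a copy of the posts is still the safer default since the .bak files only cover the most recent run.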

Creating a Backup Directory

The first thing I did was create a /tmp/tcc directory and this was going to be my working directory for creating and testing the script.

Then I created a couple of directories in there:

/tmp/tcc/posts_all is where I copied all of my posts to
/tmp/tcc/posts_few is where I copied 3 posts to use as an initial test
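Setting that up could look something like the sketch below. The function name make_sandbox and its arguments are assumptions. The source directory is whatever holds your real posts, and "the first 3 posts" is just one way to pick a small test set.

```shell
#!/usr/bin/env bash

# make_sandbox SRC_DIR WORK_DIR [COUNT]
# Hypothetical helper: copy every post into WORK_DIR/posts_all and the first
# COUNT posts (default 3) into WORK_DIR/posts_few.
make_sandbox() {
    local src="$1" work="$2" count="${3:-3}" n=0 file

    mkdir -p "${work}/posts_all" "${work}/posts_few"
    cp -- "${src}"/*.md "${work}/posts_all/"

    for file in "${work}/posts_all"/*.md; do
        [ "${n}" -lt "${count}" ] || break
        cp -- "${file}" "${work}/posts_few/"
        n=$((n + 1))
    done
}

# Example (the source path is an assumption):
#   make_sandbox ~/blog/_posts /tmp/tcc
```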

# Creating and Testing the Script

Looking at the individual steps from the previous article, the agenda was:

Take the original title and convert it into the new title
Replace the original title with the converted title in the file

I already decided I was going to use Bash because from prior experience I knew this was going to be a few lines of code and I had an idea of what the components would look like.

Breaking down the steps further to flesh out what this script will do:

Loop over a list of blog posts
Store the original title in a variable
Convert that title into a new title and store that in another variable
Replace the old title with the new title for the current file in the loop

Iterating on the Script

In most tutorials, they just give you the polished end result which is handy but I think showing the struggle of how you got there is the most important part.

So I’m going to break down how I developed this script from start to finish.

1. Getting something to work:

#!/usr/bin/env bash

echo "Hello world"

No joke. I do this with most projects. I’ll take any and all victories, no matter how small they are. I just like to see things work.

2. Looping over and printing the blog posts:

#!/usr/bin/env bash

for file in /tmp/tcc/posts_few/*.md; do
    [ -f "${file}" ] || break

    echo "${file}"
done

Pretty standard Bash here. That’s how you can loop over a directory of files. I wanted to limit it to *.md which are only Markdown files. Technically this would have worked on all files since only Markdown files are in that directory but I did that out of habit.

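[ -f "${file}" ] || break is another guard to make sure that the item is an actual file. Technically a directory called foo.md/ could exist and I didn’t want to process that. Note that break stops the loop at the first item that isn’t a file rather than skipping it. continue is what you’d use to skip that one item and keep going.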
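For comparison, here is a small sketch of the same loop using continue, which skips a non-file and carries on with the rest of the directory. The function name list_posts is hypothetical and exists only so the loop can be run against any directory.

```shell
#!/usr/bin/env bash

# list_posts DIR
# Hypothetical helper: print every regular *.md file in DIR.
list_posts() {
    local dir="$1" file
    for file in "${dir}"/*.md; do
        # Skip anything that is not a regular file, such as a directory named
        # foo.md or the unexpanded pattern itself when nothing matches.
        [ -f "${file}" ] || continue
        printf '%s\n' "${file}"
    done
}
```

With a.md, a directory called foo.md/ and z.md in the same directory, this prints a.md and z.md. The break version would stop at foo.md/ and never reach z.md.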

That trick was fresh in my memory since I just used it in some client work.

Then I echo’d the file to make sure it lined up with what ls told me. Yep, I got 3 blog posts back and I’m not going to bother outputting it here. Comparing it to ls was just a confidence booster and acts as a test.

3. Setting up the old and new title variables:

#!/usr/bin/env bash

for file in /tmp/tcc/posts_few/*.md; do
    [ -f "${file}" ] || break

    original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
    converted_title="${original_title}"

    echo "${original_title}"
    echo "${converted_title}"
    echo
done

All I did was drop in the grep pattern from the end of part 1 and saved it to a variable.
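As a quick refresher on what that pattern does, here is a minimal sketch wrapped in a function so it can be tried on a single file. The function name extract_title is an assumption, and -R and --no-filename are left out because they make no difference when grep is given exactly one file.

```shell
#!/usr/bin/env bash

# extract_title FILE
# Hypothetical helper: print the title from Jekyll-style front matter.
extract_title() {
    # -o prints only the matched text and -P enables Perl-style regex
    # (GNU grep). \K throws away everything matched so far, which is the
    # 'title: "' prefix, and [^"]+ then matches up to the closing quote.
    grep -oP '^title: "\K[^"]+' "$1"
}
```

If a file has no title line, grep prints nothing and exits with a non-zero status, which leaves the variable empty.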

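Also instead of acting on the current directory, I passed in the ${file} variable instead. If you’re wondering why there are so many quotes, it’s because you should always quote your variables with Bash. Without quotes, a value with spaces in it gets split into multiple words and characters like * can get expanded into file names.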

As for the converted_title I wasn’t ready to start calling the Python tcc script to convert the titles for real because that makes an external network call, so I just mocked it out by setting the converted_title to be the original for now.

Then I echo’d out both variables with an extra space because I wanted to see them together as a group. I knew for sure I wanted to compare them at a glance.

That ended up looking like this:

Build a SAAS App With Flask: Part 1
Build a SAAS App With Flask: Part 1

Build a SAAS App With Flask: Part 2
Build a SAAS App With Flask: Part 2

Build a SAAS App With Flask: Part 3
Build a SAAS App With Flask: Part 3

4. Doing the title replacement for each file:

#!/usr/bin/env bash

for file in /tmp/tcc/posts_few/*.md; do
    [ -f "${file}" ] || break

    original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
    converted_title="${original_title}"

    sed -i "s/${original_title}/${converted_title}/g" "${file}"

    echo "${original_title}"
    echo "${converted_title}"
    echo
done

The only addition here is the sed command which does an in place edit (-i) on the file to replace the original title with the converted title. sed is great for doing a find / replace.

I was hoping to get no syntax errors and see the exact same output as before which is exactly what happened. Technically the g isn’t needed at the end since we’re only replacing 1 occurrence of the string, but I put it in there at the end out of habit.

5. Repeating step 4 but with all of the blog posts instead of a few:

#!/usr/bin/env bash

for file in /tmp/tcc/posts_all/*.md; do
    [ -f "${file}" ] || break

    original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
    converted_title="${original_title}"

    sed -i "s/${original_title}/${converted_title}/g" "${file}"

    echo "${original_title}"
    echo "${converted_title}"
    echo
done

The only thing that changed was the path. Now it’s acting on all of the posts.

I expected the same output as before except there would be more titles printed out, but 2 unexpected things happened after I ran it.

First, I noticed there was way too much output for my liking. Having 218 post titles get output twice plus an extra line break means there are about 650 lines of output (218 x 3 = 654). I didn’t realize it would be so much until I saw it. It made it too hard to skim for differences.

Secondly, as I was going through the output, I saw a few cases of sed throwing an error. That error was: sed: -e expression #1, char 47: unknown option to `s'

Here’s the 3 blog post titles that were throwing the error:

It didn’t take long to see the pattern here. They all have / in their title.

I got kind of lucky here because prior experience let me know that sed supports using any character as a separator. Most people use / but you can use anything. For clarity, the issue here is the / in the title isn’t escaped so sed thinks it’s a separator.

6. Fixing the sed error and limiting the output:

#!/usr/bin/env bash

for file in /tmp/tcc/posts_all/*.md; do
    [ -f "${file}" ] || break

    original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
    converted_title="${original_title}"

    sed -i "s@${original_title}@${converted_title}@g" "${file}"

    if [ "${original_title}" == "${converted_title}" ]; then
        echo "${converted_title}"
    else
        echo
        echo "${original_title}"
        echo "${converted_title}"
        echo
    fi
done

The first thing I did was limit the output. If both titles were the same I only output the title by itself. I still wanted to see this output so I could keep tabs on the progress of the script. Without this, I wouldn’t know how far along the script was.

Then if the titles were different, I added an extra space. This will naturally group the titles that are different so it will be really easy to see what changed at a glance.

Then, I swapped the sed separator character from / to @. I decided to go with @ because I was 100% sure none of the titles used that character.
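Picking a separator that never shows up in the data works, but the title is still treated as a regular expression, so characters like . and * keep their special meaning. A more defensive variation, which is not what the script in this post does, is to escape both sides before handing them to sed. The function names below are hypothetical, and the sketch assumes GNU sed and single-line titles.

```shell
#!/usr/bin/env bash

# Escape characters that are special in a basic regex, plus the @ separator.
escape_sed_pattern() {
    printf '%s' "$1" | sed -e 's/[[\.*^$@]/\\&/g'
}

# Escape characters that are special in the replacement, plus the @ separator.
escape_sed_replacement() {
    printf '%s' "$1" | sed -e 's/[\&@]/\\&/g'
}

# replace_literal FILE FROM TO
# Replace the first occurrence of the literal string FROM on each line.
replace_literal() {
    local file="$1" from="$2" to="$3"
    sed -i "s@$(escape_sed_pattern "${from}")@$(escape_sed_replacement "${to}")@" "${file}"
}
```

With this in place the choice of separator stops mattering because any @ in a title gets escaped along with everything else.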

After running the script again, I was greeted with less output and no sed errors. Yay!

As for the output, every title was printed by itself because nothing changed. That makes sense because the original and converted titles are still the same value.

7. Converting the titles for real with the tcc Python script:

#!/usr/bin/env bash

for file in /tmp/tcc/posts_all/*.md; do
    [ -f "${file}" ] || break

    original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
    converted_title="$(tcc "${original_title}")"

    sed -i "s@${original_title}@${converted_title}@g" "${file}"

    if [ "${original_title}" == "${converted_title}" ]; then
        echo "${converted_title}"
    else
        echo
        echo "${original_title}"
        echo "${converted_title}"
        echo
    fi
done

The only thing I changed was converted_title. Instead of it being set to the original title, now it’s using the tcc script I wrote to interface with the titlecaseconverter.com website.

All I had to do now was run it, sit back and relax. I was expecting most of the titles not to change, but then occasionally see grouped up titles that did change.

And that’s exactly what happened:

# A change where the tcc script fixed the word "with" (part of the 28 differences).
Build a SAAS App With Flask Free Sample Videos
Build a SAAS App with Flask Free Sample Videos

# Another change where Docker commands got capitalized (part of the 19 differences).
Docker Tip #24: Difference between docker ps vs docker container ls
Docker Tip #24: Difference between Docker Ps vs Docker Container Ls

At this point I was happy. Up until now, all of this from beginning to end was about 20 minutes of real life time. But while it worked on my backup directory of posts, it’s another story to run it on the real directory of posts, so I began thinking about other edge cases.

8. Accounting for 2 edge cases:

#!/usr/bin/env bash

for file in /tmp/tcc/posts_all/*.md; do
    [ -f "${file}" ] || break

    original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"

    if [[ -n "${original_title}" ]]; then
        converted_title="$(tcc "${original_title}")"

        sed -i 's@'"${original_title}"'@'"${converted_title}"'@g' "${file}"

        if [ "${original_title}" == "${converted_title}" ]; then
            echo "${converted_title}"
        else
            echo
            echo "${original_title}"
            echo "${converted_title}"
            echo
        fi
    fi
done

The first edge case was if for whatever reason I ended up with an empty original_title, I didn’t want anything else to happen so I used Bash to make sure original_title wasn’t empty before I did anything. The -n makes sure a variable isn’t empty.

The other edge case was much more subtle and potentially deadly. Since I ran into issues with / being in the title I started to think about other characters that might be an issue.

So I scanned through all of my titles manually and made a mental list of interesting characters I saw that did work. That list included: :, ?, #, $, ., (, ), ,, ', & and -.

I noticed I never once used ! in a title, so I changed one of my titles to test for that just in case I wanted to use this script again in the future and ! was good to go, but then I tried ` and it failed. I got another error with sed.

` has a special meaning in Bash. Anything inside of backticks will be evaluated as a command. Technically this could happen if you decided to put inline code into Markdown. This probably wouldn’t happen for blog post titles but it could happen in an h3 or h4.

I didn’t know how to solve that off the top of my head so I Googled around for how to escape backticks with sed and ended up finding the answer. The key is to use single quotes to instruct Bash to treat backticks as literal strings:

sed -i 's@'"${original_title}"'@'"${converted_title}"'@g' "${file}"

But I still wanted to double quote the title variables since they have spaces in them, and what you see above is the end result of getting it to work.

Running It against My Real Blog Posts

Once I saw the latest iteration of the script work on the backup directory, I was confident it would work on the real posts too since it was the same exact data in the end.

While this wasn’t pushing to production directly, it still had that feel to it where you’re sweating a bit when you press the big red button.

But, it all worked out in the end and after about 90 seconds or so the script churned through all of my blog post titles and fixed them without any script errors.

Then I spent the remainder of the time adjusting the 19 posts that had programming terms.

It’s kind of eerie how setting a time limit works out. It took almost exactly 30 minutes and since it took that long, I decided to ignore converting all of the h3 and h4 titles for now.

But what about dealing with `h3` and `h4` headers?

h4 shouldn’t be too bad and after a minute I did come up with grep -oPR '^#### \K.+' . but h3 is much more difficult since they are not presented as ### Hello in my Markdown. They are part of a toc: property in the YAML front-matter.

That was mentioned in part 1. For example:

toc:
  - "Looking for the Other Parts of This Series?"
  - "Baseline Features"

So now I would have to look for lines that are bullets and quoted, but I know I’ve been lazy with the quotes in some of my blog posts and in the past I also used single quotes. Instead of spending the whole day trying to figure that out I wrote these 2 blog posts instead.
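For what it’s worth, here is one possible starting point for pulling those entries out. It’s an untested-on-real-posts sketch that switches from grep to awk so it can track whether it’s inside the toc: block, and it strips double quotes, single quotes or no quotes at all. The function name extract_toc is hypothetical, and it assumes each entry sits on its own indented "- " line directly under toc:.

```shell
#!/usr/bin/env bash

# extract_toc FILE
# Hypothetical helper: print each entry of the toc: list in the front matter.
extract_toc() {
    awk -v sq="'" '
        # Start collecting when the toc: key is found.
        /^toc:[ \t]*$/ { in_toc = 1; next }

        # Indented "- something" lines right after toc: are entries.
        in_toc && /^[ \t]+-[ \t]+/ {
            item = $0
            sub(/^[ \t]+-[ \t]+/, "", item)
            sub(/[ \t]+$/, "", item)
            first = substr(item, 1, 1)
            last = substr(item, length(item), 1)
            # Drop one pair of matching surrounding quotes, if present.
            if (length(item) >= 2 && first == last && (first == "\"" || first == sq)) {
                item = substr(item, 2, length(item) - 2)
            }
            print item
            next
        }

        # Any other line ends the toc: block.
        { in_toc = 0 }
    ' "$1"
}
```

This only covers reading the entries. Writing the converted versions back while keeping each file’s original quote style is the harder half of the problem.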

Priorities! I suppose the lesson there is you have to pick your battles. Maybe another time.

What are some of your Bash, sed and grep accomplishments? Let me know below.
