Using Unix Tools and Bash to Convert Blog Post Titles - Part 2
Combine Bash, grep, sed and Python to title case any number of words. In this example, it is being done on 200+ blog posts.
This is a 2 part series. You are reading part 2. Catch up by reading part 1.
Back in part 1, we covered the general problem and tackled the first 3 steps to solve it. That left us with a parsed list of blog post titles from any number of Markdown files that are formatted for Jekyll.
Now it’s time to pick up from where we left off and cover taking the original title, converting it to the proper title case and then replacing the original title with the converted title for all of the blog posts. We’re going to do that with a few lines of Bash.
# Talking about the Results First
Before getting into the code, I want to talk about the results first because it’s important to see if it’s even worth it.
I mean, if you continued to stick with existing VSCode or Sublime Text plugins to do this, it might be good enough for you, but…
I was pretty surprised at the outcome. Across 218 blog post titles, titlecaseconverter.com changed 28 of them for the better. That means 12.8% of my titles were incorrect before.
It fixed a bunch of cases where plugins for VSCode and Sublime Text incorrectly capitalized words like “with”, “without”, “from”, “how” and quite a few others.
3 examples where this solution fixed mistakes made by other editor plugins. In each pair, the top title is the original title and the bottom title is the converted title:
Build a SAAS App With Flask Free Sample Videos
Build a SAAS App with Flask Free Sample Videos
5 Steps to Take Before Moving Your Applications Into Docker
5 Steps to Take before Moving Your Applications into Docker
Understanding how the Docker Daemon and Docker CLI Work Together
Understanding How the Docker Daemon and Docker CLI Work Together
It’s also not as simple as adding hard coded rules to fix these words because the context in which you use them changes how they are capitalized. Sometimes they should be capitalized while other times they shouldn’t.
No title case converter is optimized for programming terms:
After running the script against all of my blog posts, I had to go and manually make changes to 19 of them, but this isn’t due to mistakes from titlecaseconverter.com.
It’s optimized for title casing standard titles, but a decent number of my titles have words like nginx and tmux in them so it capitalized them as “Nginx” and “Tmux”.
The 19 fixes I had to make came down to transforming those words into lower case.
While creating the script I built in extra functionality to detect changes between the original and converted title. I wanted to minimize the manual work I had to do because I expected I would have a few programming term related titles to fix.
With the help of some print statements it only took a few minutes to identify and fix them.
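Those fixes were made by hand. Here's a minimal sketch of how that clean-up could be automated instead. It isn't part of the script in this post: lowercase_terms and restore_lowercase_terms are hypothetical names, the term list is only an example and it assumes Bash 4 or newer.

```shell
#!/usr/bin/env bash

# Hypothetical list of terms that should stay lower case in titles.
lowercase_terms=(nginx tmux)

restore_lowercase_terms() {
  local title="${1}"
  local term capitalized

  for term in "${lowercase_terms[@]}"; do
    # ${term^} upper cases the first letter (Bash 4+), so nginx -> Nginx.
    capitalized="${term^}"

    # This is a plain substring replace, so it would also touch longer
    # words that merely contain the term.
    title="${title//${capitalized}/${term}}"
  done

  printf '%s\n' "${title}"
}

restore_lowercase_terms "Using Tmux to Tail Nginx Logs"
```

You would call it on the converted title before writing it back to the file.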
# Creating a Game Plan to Develop the Script
No way was I going to test this live on my main directory of posts, even though I have daily backups of it. I knew I was going to have to run a bunch of tests and make changes that couldn’t be reverted, so it was best to do it in isolation.
By the way, they couldn’t be reverted because after running the script, the original title would be changed to the newly converted title which means the original title is long gone. In other words, there is no undo.
## Creating a Backup Directory
The first thing I did was create a /tmp/tcc directory. This was going to be my working directory for creating and testing the script.
Then I created a couple of directories in there:
- /tmp/tcc/posts_all is where I copied all of my posts to
- /tmp/tcc/posts_few is where I copied 3 posts to use as an initial test
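The setup above can be sketched as a small function. This is only an illustration under assumptions: create_sandbox is a made up name and the location of the real posts is something you would pass in yourself.

```shell
#!/usr/bin/env bash

# Copy all posts into <work>/posts_all and the first few into <work>/posts_few.
create_sandbox() {
  local src="${1}"      # directory holding the real posts (hypothetical)
  local work="${2}"     # working directory, such as /tmp/tcc
  local few="${3:-3}"   # how many posts to copy into posts_few
  local copied=0 file

  mkdir -p "${work}/posts_all" "${work}/posts_few"

  for file in "${src}"/*.md; do
    [ -f "${file}" ] || continue
    cp "${file}" "${work}/posts_all/"

    if [ "${copied}" -lt "${few}" ]; then
      cp "${file}" "${work}/posts_few/"
      copied=$((copied + 1))
    fi
  done
}
```

For example, create_sandbox ~/blog/_posts /tmp/tcc would do it in one go, where ~/blog/_posts is just a placeholder path.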
# Creating and Testing the Script
Looking at the individual steps from the previous article, the agenda was:
- Take the original title and convert it into the new title
- Replace the original title with the converted title in the file
I already decided I was going to use Bash because from prior experience I knew this was going to be a few lines of code and I had an idea of what the components would look like.
Breaking down the steps further to flesh out what this script will do:
- Loop over a list of blog posts
- Store the original title in a variable
- Convert that title into a new title and store that in another variable
- Replace the old title with the new title for the current file in the loop
## Iterating on the Script
In most tutorials, they just give you the polished end result which is handy but I think showing the struggle of how you got there is the most important part.
So I’m going to break down how I developed this script from start to finish.
1. Getting something to work:
#!/usr/bin/env bash
echo "Hello world"
No joke. I do this with most projects. I’ll take any and all victories, no matter how small they are. I just like to see things work.
2. Looping over and printing the blog posts:
#!/usr/bin/env bash
for file in /tmp/tcc/posts_few/*.md; do
  [ -f "${file}" ] || break
  echo "${file}"
done
Pretty standard Bash here. That’s how you can loop over a directory of files.
I wanted to limit it to *.md so it only picks up Markdown files. Technically this would have worked on all files since only Markdown files are in that directory but I did that out of habit.
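[ -f "${file}" ] || break is another guard to make sure that the item is an actual file. Technically a directory called foo.md/ could exist and I didn’t want to process that. Keep in mind that break stops the whole loop at the first item that isn’t a file, which also covers the case where the glob matches nothing. If you would rather skip that 1 item and keep going, use continue instead.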
That trick was fresh in my memory since I just used it in some client work.
Then I echo’d the file to make sure it lined up with what ls told me.
Yep, I got 3 blog posts back and I’m not going to bother outputting it here.
Comparing it to ls was just a confidence booster and acts as a test.
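Here's a variation on that loop as a minimal sketch. It assumes you want to skip anything that isn't a regular file and get no iterations at all when nothing matches. list_markdown_files is a made up name and it isn't what the script in this post uses.

```shell
#!/usr/bin/env bash

# Print every regular *.md file in a directory, one per line.
# The parentheses run the body in a subshell, so nullglob does not leak out.
list_markdown_files() (
  shopt -s nullglob   # an unmatched glob expands to nothing

  for file in "${1}"/*.md; do
    [ -f "${file}" ] || continue   # skip a directory such as foo.md/
    printf '%s\n' "${file}"
  done
)
```

Without nullglob, an unmatched glob is passed along as the literal string *.md, which is the case the -f guard catches in the original loop.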
3. Setting up the old and new title variables:
#!/usr/bin/env bash
for file in /tmp/tcc/posts_few/*.md; do
  [ -f "${file}" ] || break
  original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
  converted_title="${original_title}"
  echo "${original_title}"
  echo "${converted_title}"
  echo
done
All I did was drop in the grep pattern from the end of part 1 and saved it to a variable.
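To see what that pattern pulls out, here's a small self-contained sketch. It assumes GNU grep since -P isn't available everywhere. extract_title and the sample post are made up for the demo.

```shell
#!/usr/bin/env bash

# Print the title from the front matter of a Jekyll post.
extract_title() {
  # \K drops everything matched so far (title: ") from the output and
  # [^"]+ then matches up to, but not including, the closing quote.
  grep -oP '^title: "\K[^"]+' "${1}"
}

sample_post="$(mktemp)"
printf '%s\n' '---' 'title: "Build a SAAS App With Flask: Part 1"' '---' 'Post body.' > "${sample_post}"

extract_title "${sample_post}"
```

Only the text between the quotes is printed, which is what gets stored in original_title.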
Also instead of acting on the current directory, I passed in the ${file} variable. If you’re wondering why there are so many quotes, it’s because you should always quote your variables with Bash.
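A quick way to see why quoting matters is to count how many arguments a command receives. This is just an illustration and count_args is a made up helper.

```shell
#!/usr/bin/env bash

# Report how many arguments were received.
count_args() {
  echo "${#}"
}

title="Build a SAAS App With Flask"

count_args ${title}     # unquoted: split on spaces into 6 arguments
count_args "${title}"   # quoted: passed along as 1 argument
```

Unquoted, a title turns into a bunch of separate words, and a command like sed or grep would treat the extra words as additional arguments.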
As for the converted_title, I wasn’t ready to start calling the Python tcc script to convert the titles for real because that makes an external network call, so I just mocked it out by setting the converted_title to be the original for now.
Then I echo’d out both variables followed by a blank line because I wanted to see them together as a group. I knew for sure I wanted to compare them at a glance.
That ended up looking like this:
Build a SAAS App With Flask: Part 1
Build a SAAS App With Flask: Part 1
Build a SAAS App With Flask: Part 2
Build a SAAS App With Flask: Part 2
Build a SAAS App With Flask: Part 3
Build a SAAS App With Flask: Part 3
4. Doing the title replacement for each file:
#!/usr/bin/env bash
for file in /tmp/tcc/posts_few/*.md; do
  [ -f "${file}" ] || break
  original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
  converted_title="${original_title}"
  sed -i "s/${original_title}/${converted_title}/g" "${file}"
  echo "${original_title}"
  echo "${converted_title}"
  echo
done
The only addition here is the sed command which does an in place edit (-i) on the file to replace the original title with the converted title. sed is great for doing a find / replace.
I was hoping to get no syntax errors and see the exact same output as before which is exactly what happened. Technically the g isn’t needed at the end since we’re only replacing 1 occurrence of the string, but I put it in there out of habit.
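Since -i overwrites the file, there is no undo. As a minimal sketch, assuming GNU or BSD sed, attaching a suffix to -i keeps a copy of the untouched file next to the edited one. replace_with_backup, the temp file and the sample title are made up for this demo and aren't part of the script in this post.

```shell
#!/usr/bin/env bash

# Replace text in a file while keeping the original as <file>.bak.
replace_with_backup() {
  local file="${1}" from="${2}" to="${3}"

  # -i.bak edits in place and saves the untouched file with a .bak suffix.
  sed -i.bak "s/${from}/${to}/" "${file}"
}

demo_dir="$(mktemp -d)"
printf 'title: "Hello With World"\n' > "${demo_dir}/post.md"

replace_with_backup "${demo_dir}/post.md" "Hello With World" "Hello with World"

cat "${demo_dir}/post.md"       # the edited file
cat "${demo_dir}/post.md.bak"   # the original, kept as a backup
```

The trade-off is a .bak file for every post, which you would need to clean up once you're happy with the result.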
5. Repeating step 4 but with all of the blog posts instead of a few:
#!/usr/bin/env bash
for file in /tmp/tcc/posts_all/*.md; do
  [ -f "${file}" ] || break
  original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
  converted_title="${original_title}"
  sed -i "s/${original_title}/${converted_title}/g" "${file}"
  echo "${original_title}"
  echo "${converted_title}"
  echo
done
The only thing that changed was the path. Now it’s acting on all of the posts.
I expected the same output as before except there would be more titles printed out, but 2 unexpected things happened after I ran it.
First, I noticed there was way too much output for my liking. Having 218 post titles get output twice means there are over 650 lines of output once you include the extra line break (3 lines for each of the 218 posts is 654 lines). I didn’t realize it would be so much until I saw it. It made it too hard to skim for differences.
Secondly, as I was going through the output, I saw a few cases of sed throwing an error. That error was sed: -e expression #1, char 47: unknown option to `s'.
Here are the 3 blog post titles that were throwing the error:
- Should You Install Docker With the Docker Toolbox or Docker for Mac / Windows?
- Docker Tip #41: Should You Use Virtualenv / RVM in Your Docker Images?
- Enable HTTP/2 with nginx on Debian Jessie / Stretch and Ubuntu 16
It didn’t take long to see the pattern here. They all have / in their title.
I got kind of lucky here because prior experience let me know that sed supports using any character as a separator. Most people use / but you can use anything. For clarity, the issue here is the / in the title isn’t escaped so sed thinks it’s a separator.
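Here's a tiny reproduction of the problem and the fix that you can run without touching any files. The title is a shortened version of one of the failing titles and changed is just a placeholder replacement.

```shell
#!/usr/bin/env bash

title="Enable HTTP/2 with nginx"

# With / as the separator, the / inside the title ends the pattern early,
# so sed rejects the expression.
if ! printf '%s\n' "${title}" | sed "s/${title}/changed/" 2>/dev/null; then
  echo "sed failed with / as the separator"
fi

# With @ as the separator, the / in the title is just a regular character.
printf '%s\n' "${title}" | sed "s@${title}@changed@"
```

The first command fails before sed processes any input, while the second one prints the replaced line.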
6. Fixing the sed error and limiting the output:
#!/usr/bin/env bash
for file in /tmp/tcc/posts_all/*.md; do
  [ -f "${file}" ] || break
  original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
  converted_title="${original_title}"
  sed -i "s@${original_title}@${converted_title}@g" "${file}"
  if [ "${original_title}" == "${converted_title}" ]; then
    echo "${converted_title}"
  else
    echo
    echo "${original_title}"
    echo "${converted_title}"
    echo
  fi
done
The first thing I did was limit the output. If both titles were the same I only output the title by itself. I still wanted to see this output so I could keep tabs on the progress of the script. Without this, I wouldn’t know how far along the script was.
Then if the titles were different, I printed both of them with a blank line above and below. This naturally groups the titles that are different so it’s really easy to see what changed at a glance.
Then, I swapped the sed separator character from / to @. I decided to go with @ because I was 100% sure none of the titles used that character.
After running the script again, I was greeted with less output and no sed errors. Yay!
As for the output, every title was printed by itself because nothing changed. That makes sense because the original and converted titles are still the same value.
7. Converting the titles for real with the tcc Python script:
#!/usr/bin/env bash
for file in /tmp/tcc/posts_all/*.md; do
  [ -f "${file}" ] || break
  original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
  converted_title="$(tcc "${original_title}")"
  sed -i "s@${original_title}@${converted_title}@g" "${file}"
  if [ "${original_title}" == "${converted_title}" ]; then
    echo "${converted_title}"
  else
    echo
    echo "${original_title}"
    echo "${converted_title}"
    echo
  fi
done
The only thing I changed was converted_title. Instead of it being set to the original title, now it’s using the tcc script I wrote to interface with the titlecaseconverter.com website.
All I had to do now was run it, sit back and relax. I was expecting most of the titles not to change, but then occasionally see grouped up titles that did change.
And that’s exactly what happened:
# A change where the tcc script fixed the word "with" (part of the 28 differences).
Build a SAAS App With Flask Free Sample Videos
Build a SAAS App with Flask Free Sample Videos
# Another change where Docker commands got capitalized (part of the 19 differences).
Docker Tip #24: Difference between docker ps vs docker container ls
Docker Tip #24: Difference between Docker Ps vs Docker Container Ls
At this point I was happy. It worked on my backup directory of posts, but it’s another story to run it on the real directory of posts, so I began thinking about other edge cases. Up until now, all of this from beginning to end took about 20 minutes of real life time.
8. Accounting for 2 edge cases:
#!/usr/bin/env bash
for file in /tmp/tcc/posts_all/*.md; do
  [ -f "${file}" ] || break
  original_title="$(grep -oPR --no-filename '^title: "\K[^"]+' "${file}")"
  if [[ -n "${original_title}" ]]; then
    converted_title="$(tcc "${original_title}")"
    sed -i 's@'"${original_title}"'@'"${converted_title}"'@g' "${file}"
    if [ "${original_title}" == "${converted_title}" ]; then
      echo "${converted_title}"
    else
      echo
      echo "${original_title}"
      echo "${converted_title}"
      echo
    fi
  fi
done
The first edge case was if for whatever reason I ended up with an empty original_title, I didn’t want anything else to happen so I used Bash to make sure original_title wasn’t empty before I did anything. The -n makes sure a variable isn’t empty.
The other edge case was much more subtle and potentially deadly. Since I ran into issues with / being in the title I started to think about other characters that might be an issue.
So I scanned through all of my titles manually and made a mental list of interesting characters I saw that did work. That list included: :, ?, #, $, ., (, ), ,, ', & and -.
I noticed I never once used ! in a title, so I changed one of my titles to test for that just in case I wanted to use this script again in the future. ! was good to go, but then I tried ` and it failed. I got another error with sed.
` has a special meaning in Bash. Anything inside of backticks will be evaluated as a command. Technically this could happen if you decided to put inline code into Markdown. This probably wouldn’t happen for blog post titles but it could happen in an h3 or h4.
I didn’t know how to solve that off the top of my head so I Googled around for how to escape backticks with sed and ended up finding the answer. The key is to use single quotes to instruct Bash to treat backticks as literal strings:
sed -i 's@'"${original_title}"'@'"${converted_title}"'@g' "${file}"
But I still wanted to double quote the title variables since they have spaces in them and single quotes would stop them from being expanded. So the single quotes close before each variable and reopen after it, and what you see above is the end result of getting it to work.
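Swapping the separator takes care of /, but characters such as . and * are still regular expression syntax to sed, and & in the replacement stands for the matched text. That means a title can be accepted without an error and still not be replaced literally. Here's a hedged sketch of escaping both sides before handing them to sed. It assumes GNU sed and @ as the separator, the 3 function names are made up and none of this is in the script I ran.

```shell
#!/usr/bin/env bash

# Escape characters that are special in a sed basic regex, plus the @ separator.
escape_sed_pattern() {
  printf '%s\n' "${1}" | sed 's/[]\@$*.^[]/\\&/g'
}

# Escape characters that are special in a sed replacement, plus the @ separator.
escape_sed_replacement() {
  printf '%s\n' "${1}" | sed 's/[\&@]/\\&/g'
}

# Replace one literal string with another inside a file.
replace_literal() {
  local file="${1}" from to

  from="$(escape_sed_pattern "${2}")"
  to="$(escape_sed_replacement "${3}")"

  sed -i "s@${from}@${to}@g" "${file}"
}
```

In the loop you would call replace_literal "${file}" "${original_title}" "${converted_title}" in place of the sed line.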
## Running It against My Real Blog Posts
Once I saw it work with the latest iteration of the script I was confident it would work since it was the same exact data in the end.
While this wasn’t pushing to production directly, it still had that feel to it where you’re sweating a bit when you press the big red button.
But, it all worked out in the end and after about 90 seconds or so the script churned through all of my blog post titles and fixed them without any script errors.
Then I spent the remainder of the time adjusting the 19 posts that had programming terms.
It’s kind of eerie how setting a time limit works out. It took almost exactly 30 minutes and since it took that long, I decided to ignore converting all of the h3 and h4 titles for now.
## But What about Dealing with h3 and h4 Headers?
h4 shouldn’t be too bad and after a minute I did come up with grep -oPR '^#### \K.+' . for that, but h3 is much more difficult since they are not presented as ### Hello in my Markdown. They are part of a toc: property in the YAML front-matter.
That was mentioned in part 1. For example:
toc:
- "Looking for the Other Parts of This Series?"
- "Baseline Features"
So now I would have to look for lines that are bullets and quoted, but I know I’ve been lazy with the quotes in some of my blog posts and in the past I also used single quotes. Instead of spending the whole day trying to figure that out I wrote these 2 blog posts instead.
Priorities! I suppose the lesson there is you have to pick your battles. Maybe another time.
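For what it's worth, here's a rough and untested-on-my-posts sketch of how pulling out those toc: entries might look. It leans on awk, assumes the list sits directly under a toc: line and handles entries with double quotes, single quotes or no quotes. extract_toc_entries is a made up name.

```shell
#!/usr/bin/env bash

# Print each entry of the toc: list in a post, without surrounding quotes.
extract_toc_entries() {
  awk -v sq="'" '
    /^toc:/ { in_toc = 1; next }

    in_toc && /^[[:space:]]*- / {
      entry = $0
      sub(/^[[:space:]]*- /, "", entry)   # drop the bullet
      sub(/[[:space:]]+$/, "", entry)     # drop trailing whitespace

      # Strip one pair of matching double or single quotes, if present.
      first = substr(entry, 1, 1)
      last = substr(entry, length(entry), 1)
      if (length(entry) >= 2 && first == last && (first == "\"" || first == sq)) {
        entry = substr(entry, 2, length(entry) - 2)
      }

      print entry
      next
    }

    # Any other line ends the toc: block.
    { in_toc = 0 }
  ' "${1}"
}
```

That only covers reading the entries. Writing the converted entries back with the original quoting intact is the harder half of the problem.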
What are some of your Bash, sed and grep accomplishments? Let me know below.