Solving Random Business Problems with Command Line Tools
In this case we'll create a quick and dirty sitemap by looking at files in a specific directory.
Chances are you won’t have to solve this exact problem, but hopefully it demonstrates how you can combine a few general tools to solve your specific problem.
Let’s treat this one as a real business use case. If you only care about the code based solution then feel free to skip to it.
In this case, the problem is you have an upcoming penetration test and the company doing the test is asking for a list of URLs that your sites have because they don’t have a crawling tool to get this list themselves.
They requested a list of URLs in a newline separated text file, such as:

```
https://www.example.com/hello
https://www.example.com/world
https://www.example.com/login
```
The web framework the apps are built with (CodeIgniter 3) doesn’t create a sitemap by default, and 95% of the URLs are behind a login. You’re not a primary developer on the project, but you know there are likely going to be 100-150+ URLs.
You don’t want to bug the dev team yet because they are working on other features, but you may want to coordinate with them later once you have an assembled list so they can hand tune it based on experience.
How long is this going to take? How will you approach this problem? Does your answer change if you only have a few days to come up with a solution while juggling other tasks?
# To Crawl or Not to Crawl
Given the time constraints I didn’t want to investigate crawling tools. That’s one area of web development I don’t have a lot of experience with. Maybe I could have gotten away with quickly putting something together with Scrapy. Perhaps next time.
The good news is we didn’t need a perfect solution. It’s not important that we have a 100% accurate list of every single URL that exists within the site.
What we were really after was the 80%+. For example, if a controller has a bunch of different actions which are URL endpoints, we really only need to identify the index page because the pen tester can discover most of the other pages from there.
We are also going to give them a live demo of the site beforehand to call out the most important areas to test, the ones with the highest impact and importance to us.
In my mind, I classified this problem as a side quest. Is this something I can complete in less than an hour with “good enough” accuracy and move on? Turns out yes, it was more like 15-20 minutes using a few built-in Unix CLI tools.
# Looking for Patterns
Thankfully the application does have structure. There is a controllers directory with a bunch of files. Some of them have uppercase characters and some do not.
Each of those controller files has a bunch of functions and most of those functions end up being URL endpoints. The good news is just about every controller has an index page, so we can ignore the contents of the files themselves.
What I did notice was that some of the files likely aren’t meant to be accessible URLs. For example, a couple of files started with `test_` and there was also an `errors` file. That means we probably want a way to filter out some results.
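For illustration, a listing of the controllers directory might look something like this (these file names are made up to match the patterns described above):

```
ls src/application/controllers
# Another_thing.php  Reports.php  Yep.php  errors.php  test_fixtures.php
```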
# A High Level Solution
Our strategy can be something like this:
- Get a list of file names in the controllers directory
- Remove the `.php` file extension from the file name
- Convert the file name to lowercase
- Prefix the file name with the base URL
- Filter out the files we don’t want to include
- Sort the results alphabetically
- Write the results to a file
You could describe the above as a pipeline, and the command line is really good at providing tools to pipe together. The ordering of steps 2 through 6 doesn’t matter too much. The most efficient solution on paper would be to filter things first, but depending on which tools you use, you might complete steps 1-4 in one command.
In any case, those are the steps we need to perform so let’s convert it to code.
# Creating the Script
In our case we have three sites and we want to create three text files.
Each site has its own project directory with the same directory structure, so I figured the script could be called like `./urls myapp` or `./urls anotherapp`. It would look for `myapp` or `anotherapp` on disk where the project lives and then dump out a `myapp-urls.txt` file in the same directory as the script.
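As a rough sketch, here’s the layout and invocation I had in mind (the `sites/` path matches the `project_path` used in the script below, and the app names are placeholders):

```
# Assumed layout (hypothetical, based on the description above):
#
#   urls                                               <- this script
#   sites/myapp/src/application/controllers/*.php
#   sites/anotherapp/src/application/controllers/*.php

./urls myapp         # writes ./myapp-urls.txt
./urls anotherapp    # writes ./anotherapp-urls.txt
```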
## Boilerplate
Here are the basics to implement the above:
```
#!/usr/bin/env bash

# Exit on errors, fail the whole pipeline if any command in it fails,
# and treat references to unset variables as errors.
set -o errexit
set -o pipefail
set -o nounset

app="${1}"
project_path="sites/${app}"

# Our pipeline of commands will go here soon.
urls="placeholder"

echo "${urls}" > "${app}-urls.txt"
```
## Get a List of File Names
The final script will be provided at the end, but instead of duplicating everything in each step, let’s just focus on the code responsible for each step.
Also, while working on the script I didn’t write out the URLs file to begin with. I output things to the screen so I could get quick feedback.
```
controller_path="src/application/controllers"

# We only care about files (not directories) and avoid recursing into other directories.
find "${project_path}/${controller_path}" -maxdepth 1 -type f
```
An example of the output looks like this:
```
src/application/controllers/Another_thing.php
src/application/controllers/errors.php
src/application/controllers/Yep.php
```
## Converting File Names into URLs
This is the most complicated part of the script because it ties together quite a few things, but we’ll cover it in more detail after seeing the output.
```
find ... -execdir sh -c 'printf "https://'"${app}"'.example.com/%s\n" "${0%.*}"' {} ";"
```
The output now looks like this:
```
https://myapp.example.com/./Another_thing
https://myapp.example.com/./Yep
https://myapp.example.com/./errors
```
Your output might be slightly different. For example, the GNU version of `find` will add `./` to the file name, so if you’re on macOS you may not see that, but we’ll clean that up in a later step.
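If you’re curious where the `./` comes from, you can see it by running `-execdir` with a plain `echo` (behavior varies by `find` implementation, so treat this as a sketch):

```
find src/application/controllers -maxdepth 1 -type f -execdir echo {} ";"
# ./Another_thing.php   <- GNU findutils prefixes the name with "./"
# Another_thing.php     <- BSD/macOS find typically prints just the name
```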
We’re getting close. We’re using `-execdir` to execute a command for each file that `find` returns. In this case we’re using the shell’s `printf` to modify the file name a bit and give us a newline separated list of URLs.
'"${app}"'
is broken up with double quotes instead of the surrounding single
quotes because we want $app
to be interpolated. If we didn’t do this then
literally ${app}
would get output. Shell supports having adjacent quotes like
this to allow for what we just did.
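Here’s that quoting trick in isolation, if it helps to see it outside of the `find` command:

```
app="myapp"

# The single-quoted pieces are literal; the double-quoted piece in the
# middle is interpolated before the string is glued back together.
echo 'literal: ${app} / interpolated: '"${app}"' / literal again: ${app}'
# literal: ${app} / interpolated: myapp / literal again: ${app}
```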
Within the `printf`, `%s` is the string, AKA the file name, and `"${0%.*}"` is a bit of shell magic to remove the file extension.
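The `%` expansion is easier to see on a plain variable. A minimal example:

```
f="Another_thing.php"

# "%.*" strips the shortest match of ".*" from the end, i.e. the extension.
echo "${f%.*}"   # Another_thing
```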
There’s a lot going on with `{} ";"`, which I’ll defer to this StackOverflow post if you want more details. If not, just know that either that or `{} +` needs to exist at the end.
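If you want to see the difference between the two terminators, here’s a quick comparison with a plain `echo` (shown with `-exec`, but the same applies to `-execdir`):

```
# ";" runs the command once per file.
find . -maxdepth 1 -type f -exec echo "one call for:" {} ";"

# "+" batches as many file names as possible into each run.
find . -maxdepth 1 -type f -exec echo "one call for all of:" {} +
```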
## Filtering Out Unwanted Files
Next up, let’s remove our unwanted results before doing any additional processing.
```
... | grep -E -v "\/?(test_|errors)"
```
Here’s our latest output:
```
https://myapp.example.com/./Another_thing
https://myapp.example.com/./Yep
```
In this case the `errors` file got removed. We’re using `-v` to do an inverted match: everything except lines matching the pattern gets returned.
We’re using a basic OR regex to match on a few things. The `\/?` handles an optional forward slash. In our example it wouldn’t make a difference, but it helps if you’re matching against strings that don’t occur directly after the `/`.
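As a quick sanity check, you can feed `grep` a few sample lines directly (the URLs here are just for illustration):

```
printf "https://myapp.example.com/test_tools\nhttps://myapp.example.com/errors\nhttps://myapp.example.com/yep\n" \
  | grep -E -v "\/?(test_|errors)"
# https://myapp.example.com/yep
```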
## Normalizing the Dot Slash
Since only the GNU version of `find` adds the `./`, we can handle it in a way that works for everyone by using `sed` to do a find / replace that removes it.
```
... | sed "s|\./||"
```
Here’s our new output:
```
https://myapp.example.com/Another_thing
https://myapp.example.com/Yep
```
We’re using `|` for `sed`’s delimiter because it avoids needing to escape the `/`, which we’d have to do if we stuck with the usual `/` delimiter.
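Here’s the same substitution written both ways, so you can compare the readability (the sample URL is for illustration only):

```
echo "https://myapp.example.com/./yep" | sed "s/\.\///"   # "/" delimiter, both the dot and slash escaped
echo "https://myapp.example.com/./yep" | sed "s|\./||"    # "|" delimiter, only the dot escaped
# Both print: https://myapp.example.com/yep
```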
## Making Everything Lower Case
At this point we’re just adding more commands to our pipeline.
```
... | tr "[:upper:]" "[:lower:]"
```
Here’s our new output:
```
https://myapp.example.com/another_thing
https://myapp.example.com/yep
```
That’s a quick way to make sure everything becomes lowercase.
## Sorting Everything Alphabetically
Lastly, this was to make it easier to skim, but it’s certainly not necessary.
```
... | sort
```
And here’s the final output:
```
https://myapp.example.com/another_thing
https://myapp.example.com/yep
```
We didn’t bother using `-u` for a unique sort since the file names are already unique.
## Putting It All Together
Here’s the script, complete with writing the results to a file:
```
#!/usr/bin/env bash

set -o errexit
set -o pipefail
set -o nounset

app="${1}"
project_path="sites/${app}"
controller_path="src/application/controllers"

# List controller files, turn each one into a URL, filter, normalize and sort.
urls="$(find "${project_path}/${controller_path}" -maxdepth 1 -type f \
  -execdir sh -c 'printf "https://'"${app}"'.example.com/%s\n" "${0%.*}"' {} ";" \
  | grep -E -v "\/?(test_|errors)" \
  | sed "s|\./||" \
  | tr "[:upper:]" "[:lower:]" \
  | sort)"

echo "${urls}" > "${app}-urls.txt"
```
The video below shows everything getting built up in stages.
# Demo Video
Timestamps
- 0:08 – Going over the business use case
- 1:35 – Should we write a crawler?
- 2:25 – Looking for patterns
- 3:38 – High level strategy
- 4:42 – Boilerplate script
- 6:12 – Getting a list of file names
- 8:08 – Convert file names into URLs
- 10:31 – Filtering unwanted files
- 11:40 – Normalizing the dot slash
- 13:21 – Make everything lower case
- 13:49 – Sorting things for good measure
- 14:28 – Putting it all together
What was the last problem you solved using CLI tools? Let me know below.