Solving Random Business Problems with Command Line Tools
In this case we'll create a quick and dirty sitemap by looking at files in a specific directory.
Chances are you won’t have to solve this exact problem, but hopefully it demonstrates how you can combine a few general tools to solve your specific problem.
Let’s treat this one as a real business use case. If you only care about the code based solution then feel free to skip to it.
In this case, the problem is you have an upcoming penetration test and the company doing the test is asking for a list of URLs that your sites have because they don’t have a crawling tool to get this list themselves.
They requested a list of URLs in a newline separated text file, such as:

```
https://www.example.com/hello
https://www.example.com/world
https://www.example.com/login
```
The web framework the apps are built with (CodeIgniter 3) doesn’t create a sitemap by default, and 95% of the URLs are behind a login. You’re not a primary developer on the project, but you know there are likely going to be 100-150+ URLs.
You don’t want to bug the dev team yet because they are working on other features, but you may want to coordinate with them later once you have an assembled list so they can hand tune it based on experience.
How long is this going to take? How will you approach this problem? Does your answer change if you only have a few days to come up with a solution while juggling other tasks?
# To Crawl or Not to Crawl
Given the time constraints I didn’t want to investigate crawling tools. That’s one area of web development I don’t have a lot of experience with. Maybe I could have gotten away with quickly putting something together with Scrapy. Perhaps next time.
The good news is we didn’t need a perfect solution. It’s not important that we have a 100% accurate list of every single URL that exists within the site.
What we were really after was the 80%+. For example, if a controller has a bunch of different actions which are URL endpoints, we really only need to identify the index page because the pen tester can discover most of the other pages from there.
We are also going to give them a live demo of the site beforehand to call out the most important areas to test, the ones with the highest impact and importance to us.
In my mind, I classified this problem as a side quest. Is this something I can complete in less than an hour with “good enough” accuracy and move on? Turns out yes, it was more like 15-20 minutes using a few built-in Unix CLI tools.
# Looking for Patterns
Thankfully the application does have structure. There is a controllers directory with a bunch of files. Some of them have uppercase characters and some do not.
Each of those controller files has a bunch of functions and most of those functions end up being URL endpoints. The good news is just about every controller has an index page, so we can ignore the contents of the files themselves.
What I did notice was that some of the files likely aren’t meant to be accessible URLs. For example, a couple of files started with `test_` and there was also an `errors` file. That means we probably want a way to filter out some results.
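For illustration, a listing of the controllers directory might look something like this (these file names are made up to match the patterns described above):

```
ls src/application/controllers
# Another_thing.php  Reports.php  Yep.php  errors.php  test_fixtures.php
```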
# A High Level Solution
Our strategy can be something like this:
- Get a list of file names in the controllers directory
- Remove the `.php` file extension from the file name
- Convert the file name to lowercase
- Prefix the file name with the base URL
- Filter out the files we don’t want to include
- Sort the results alphabetically
- Write the results to a file
You could describe the above as a pipeline, and the command line is really good at providing tools to pipe together. The ordering of steps 2 through 6 doesn’t matter too much. The most efficient solution on paper would be to filter things first, but depending on which tools you use, you might complete steps 1-4 in one command.
In any case, those are the steps we need to perform so let’s convert it to code.
# Creating the Script
In our case we have three sites and we want to create three text files.
Each site has its own project directory with the same directory structure, so I figured the script could be called like `./urls myapp` or `./urls anotherapp`. It would look for `myapp` or `anotherapp` on disk where the project lives and then dump out a `myapp-urls.txt` file in the same directory as the script.
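As a rough sketch, here’s the layout and invocation I had in mind (the `sites/` path matches the `project_path` used in the script below, and the app names are placeholders):

```
# Assumed layout (hypothetical, based on the description above):
#
#   urls                                               <- this script
#   sites/myapp/src/application/controllers/*.php
#   sites/anotherapp/src/application/controllers/*.php

./urls myapp         # writes ./myapp-urls.txt
./urls anotherapp    # writes ./anotherapp-urls.txt
```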
## Boilerplate
Here are the basics to implement the above:
```
#!/usr/bin/env bash

# Exit on errors, fail the whole pipeline if any command in it fails,
# and treat references to unset variables as errors.
set -o errexit
set -o pipefail
set -o nounset

app="${1}"
project_path="sites/${app}"

# Our pipeline of commands will go here soon.
urls="placeholder"

echo "${urls}" > "${app}-urls.txt"
```
## Get a List of File Names
The final script will be provided at the end, but instead of duplicating everything in each step, let’s just focus on the code responsible for each step.
Also, while working on the script I didn’t write out the URLs file to begin with. I output things to the screen so I could get quick feedback.
```
controller_path="src/application/controllers"

# We only care about files (not directories) and avoid recursing into other directories.
find "${project_path}/${controller_path}" -maxdepth 1 -type f
```
An example of the output looks like this:
```
src/application/controllers/Another_thing.php
src/application/controllers/errors.php
src/application/controllers/Yep.php
```
## Converting File Names into URLs
This is the most complicated part of the script because it ties together quite a few things, but we’ll cover it in more detail after seeing the output.
```
find ... -execdir sh -c 'printf "https://'"${app}"'.example.com/%s\n" "${0%.*}"' {} ";"
```
The output now looks like this:
```
https://myapp.example.com/./Another_thing
https://myapp.example.com/./Yep
https://myapp.example.com/./errors
```
Your output might be slightly different. For example, the GNU version of `find` will add `./` to the file name, so if you’re on macOS you may not see that, but we’ll clean that up in a later step.
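If you’re curious where the `./` comes from, you can see it by running `-execdir` with a plain `echo` (behavior varies by `find` implementation, so treat this as a sketch):

```
find src/application/controllers -maxdepth 1 -type f -execdir echo {} ";"
# ./Another_thing.php   <- GNU findutils prefixes the name with "./"
# Another_thing.php     <- BSD/macOS find typically prints just the name
```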
We’re getting close. We’re using `-execdir` to execute a command for each file that `find` returns. In this case we’re using the shell’s `printf` to modify the file name a bit and give us a newline separated list of URLs.
'"${app}"'
is broken up with double quotes instead of the surrounding single
quotes because we want $app
to be interpolated. If we didn’t do this then
literally ${app}
would get output. Shell supports having adjacent quotes like
this to allow for what we just did.
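Here’s that quoting trick in isolation, if it helps to see it outside of the `find` command:

```
app="myapp"

# The single-quoted pieces are literal; the double-quoted piece in the
# middle is interpolated before the string is glued back together.
echo 'literal: ${app} / interpolated: '"${app}"' / literal again: ${app}'
# literal: ${app} / interpolated: myapp / literal again: ${app}
```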
Within the `printf`, `%s` is the string, AKA the file name, and `"${0%.*}"` is a bit of shell magic to remove the file extension.
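The `%` expansion is easier to see on a plain variable. A minimal example:

```
f="Another_thing.php"

# "%.*" strips the shortest match of ".*" from the end, i.e. the extension.
echo "${f%.*}"   # Another_thing
```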
There’s a lot going on with `{} ";"`, which I’ll defer to this StackOverflow post if you want more details. If not, just know that either that or `{} +` needs to exist at the end.
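If you want to see the difference between the two terminators, here’s a quick comparison with a plain `echo` (shown with `-exec`, but the same applies to `-execdir`):

```
# ";" runs the command once per file.
find . -maxdepth 1 -type f -exec echo "one call for:" {} ";"

# "+" batches as many file names as possible into each run.
find . -maxdepth 1 -type f -exec echo "one call for all of:" {} +
```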
## Filtering Out Unwanted Files
Next up, let’s remove our unwanted results before doing any additional processing.
```
... | grep -E -v "\/?(test_|errors)"
```
Here’s our latest output:
```
https://myapp.example.com/./Another_thing
https://myapp.example.com/./Yep
```
In this case the `errors` file got removed. We’re using `-v` to do an inverted match: everything except lines matching the pattern gets returned.
We’re using a basic OR regex to match on a few things. The `\/?` handles an optional forward slash. In our example it wouldn’t make a difference, but it helps if you’re matching against strings that don’t occur directly after the `/`.
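As a quick sanity check, you can feed `grep` a few sample lines directly (the URLs here are just for illustration):

```
printf "https://myapp.example.com/test_tools\nhttps://myapp.example.com/errors\nhttps://myapp.example.com/yep\n" \
  | grep -E -v "\/?(test_|errors)"
# https://myapp.example.com/yep
```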
## Normalizing the Dot Slash
Since only the GNU version of `find` adds the `./`, we can handle it in a way that works for everyone by using `sed` to do a find / replace that removes it.
```
... | sed "s|\./||"
```
Here’s our new output:
```
https://myapp.example.com/Another_thing
https://myapp.example.com/Yep
```
We’re using `|` for `sed`’s delimiter because it avoids needing to escape the `/`, which we’d have to do if we stuck with the usual `/` delimiter.
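Here’s the same substitution written both ways, so you can compare the readability (the sample URL is for illustration only):

```
echo "https://myapp.example.com/./yep" | sed "s/\.\///"   # "/" delimiter, both the dot and slash escaped
echo "https://myapp.example.com/./yep" | sed "s|\./||"    # "|" delimiter, only the dot escaped
# Both print: https://myapp.example.com/yep
```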
## Making Everything Lower Case
At this point we’re just adding more commands to our pipeline.
```
... | tr "[:upper:]" "[:lower:]"
```
Here’s our new output:
```
https://myapp.example.com/another_thing
https://myapp.example.com/yep
```
That’s a quick way to make sure everything becomes lowercase.
## Sorting Everything Alphabetically
Lastly, this was to make it easier to skim, but it’s certainly not necessary.
```
... | sort
```
And here’s the final output:
```
https://myapp.example.com/another_thing
https://myapp.example.com/yep
```
We didn’t bother using `-u` for a unique sort since the file names are already unique.
## Putting It All Together
Here’s the script, complete with writing the results to a file:
```
#!/usr/bin/env bash

set -o errexit
set -o pipefail
set -o nounset

app="${1}"
project_path="sites/${app}"
controller_path="src/application/controllers"

# List controller files, turn each one into a URL, filter, normalize and sort.
urls="$(find "${project_path}/${controller_path}" -maxdepth 1 -type f \
  -execdir sh -c 'printf "https://'"${app}"'.example.com/%s\n" "${0%.*}"' {} ";" \
  | grep -E -v "\/?(test_|errors)" \
  | sed "s|\./||" \
  | tr "[:upper:]" "[:lower:]" \
  | sort)"

echo "${urls}" > "${app}-urls.txt"
```
The video below shows everything getting built up in stages.
# Demo Video
Timestamps
- 0:08 – Going over the business use case
- 1:35 – Should we write a crawler?
- 2:25 – Looking for patterns
- 3:38 – High level strategy
- 4:42 – Boilerplate script
- 6:12 – Getting a list of file names
- 8:08 – Convert file names into URLs
- 10:31 – Filtering unwanted files
- 11:40 – Normalizing the dot slash
- 13:21 – Make everything lower case
- 13:49 – Sorting things for good measure
- 14:28 – Putting it all together
What was the last problem you solved using CLI tools? Let me know below.