
Validate File Types by Reading the First Few Bytes of a File


Easily confirm whether a file is really a jpg, png, gz or whatever file type you want. We'll use the od command line tool, Python and Ruby.


Let’s say you support user uploads and you want to confirm someone is really uploading a jpg or png file. You can do that in a few different ways server side:

  • Check the file extension
    • This is really easy to fake because anyone can simply rename an exe file to jpg
  • Check the MIME type header
    • This is a little better but you can still pretty easily spoof this
      • Make an HTTP request and set whatever MIME type header you want

While very few things are bulletproof, you can check the first few and / or last few bytes of a file to see if they match the known hex byte sequences for a specific file type.

Here are a few hexadecimal sequences. You can find more by Googling:

  • jpg | jpeg - Starts with FF D8 and ends with FF D9
  • png - Starts with 89 50 4E 47 0D 0A 1A 0A
  • gz - Starts with 1F 8B

All 3 solutions we’ll cover run in constant O(1) time, which means they will be just as fast on a 2 KB file as on a 20 GB file, even when reading the last bytes of a file.
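As a rough illustration of how the signatures listed above could be checked in code, here is a minimal Python sketch. The `SIGNATURES` table and the `sniff_type` function are hypothetical names made up for this example, and it only covers the three starting sequences from the list.

```python
# Starting byte sequences copied from the list above; extend as needed.
SIGNATURES = {
    "jpg": bytes.fromhex("FF D8"),
    "png": bytes.fromhex("89 50 4E 47 0D 0A 1A 0A"),
    "gz": bytes.fromhex("1F 8B"),
}


def sniff_type(header):
    """Return the first type whose signature prefixes `header`, else None."""
    for name, magic in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


print(sniff_type(bytes.fromhex("89 50 4E 47 0D 0A 1A 0A")))  # png
print(sniff_type(b"MZ\x90\x00"))  # None
```

It only needs the first 8 bytes of a file (the longest signature in the table), so how those bytes get read is up to you.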

# Using the Built-In od Command

od (octal dump) is a built-in command that exists on Linux and macOS.

To grab the first 2 bytes you can do:

od -t x1 -N 2 myfile.jpg
# Normally I'm a fan of using long flag names but they don't work on macOS,
# the short flag names work for both GNU and BSD versions of od.
#
# -t | --format
# -N | --read-bytes

0000000 ff d8
0000002

You can chop out just the bytes by piping the above to head and cut, as in od -t x1 -N 2 myfile.jpg | head -n 1 | cut -d " " -f 2-, which gives us the output of ff d8.

You can output bytes in different formats too. Check out the docs.

To grab the last 2 bytes you can do:

tail -c 2 myfile.jpg | od -t x1 | head -n 1 | cut -d " " -f 2-

ff d9

In this case we’re taking advantage of tail’s -c flag to get the last 2 bytes of the file. From there we do exactly what we did to get the first 2 bytes, except we don’t need the -N flag since tail already gave us only 2 bytes.

Thankfully this also runs in constant time because tail can seek to the end of the file, at least the GNU version does. I’m not sure if the BSD version does, but if you’re on macOS you can try it on a large file to see if it finishes quickly.

You can compare the difference in speed on a large file by running this pipeline instead:

od -t x1 myfile.jpg | tail -n 2 | head -n 1 | rev | cut -d " " -f -2 | rev

ff d9

The above gives the same result except it’s linear time O(n). In this case od needs to read and dump the whole file before tail can pick out the last lines. It works but it would be wildly inefficient on a large file. You probably shouldn’t ever use this version.
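The same constant vs linear difference can be sketched in Python. Both function names below are hypothetical, and the demo file is a throwaway temp file rather than a real image. It is only meant to show that both approaches return the same bytes while doing very different amounts of work.

```python
import os
import tempfile


def last_bytes_seek(path, n):
    """Jump straight to the final n bytes; the work does not depend on file size."""
    with open(path, "rb") as f:
        f.seek(-n, os.SEEK_END)
        return f.read(n)


def last_bytes_full_read(path, n):
    """Read the whole file, then slice; the work grows with file size."""
    with open(path, "rb") as f:
        return f.read()[-n:]


# Build a ~1 MB stand-in file that starts with FF D8 and ends with FF D9.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xff\xd8" + b"\x00" * 1_000_000 + b"\xff\xd9")
    demo_path = tmp.name

print(last_bytes_seek(demo_path, 2).hex())       # ffd9
print(last_bytes_full_read(demo_path, 2).hex())  # ffd9
os.remove(demo_path)
```

The second function also holds the entire file in memory, which is another reason to avoid that approach on large files.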

Typically if I’m doing byte analysis it’s for something web related in which case I’ll reach for a programming language such as Python or Ruby. Both languages offer ways to seek to a specific point in the file which makes it run in constant time too.

That said, I have used od a few times to quickly verify a file by looking at its first few bytes.

# Using Python

To grab the first 2 bytes you can do:

with open("myfile.jpg", "rb") as file:
    print(file.read(2).hex())

ffd8

The rb mode opens the file for reading in binary mode, and .hex() is available on bytes objects in Python 3.5+.

To grab the last 2 bytes you can do:

with open("myfile.jpg", "rb") as file:
    file.seek(-2, 2)
    print(file.read(2).hex())

ffd9

We’re doing the same thing except we first seek to 2 bytes before the end of the file. The second argument to seek, 2 (which is the same as os.SEEK_END), makes the offset relative to the end of the file, and the first argument, -2, moves back 2 bytes from there. Then we read the next 2 bytes like before.
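Putting both reads together, a jpg check could look something like the sketch below. The `looks_like_jpeg` name and the constants are hypothetical and only cover the FF D8 / FF D9 pair mentioned earlier. One detail it handles: seeking back 2 bytes from the end raises an OSError when the file is shorter than 2 bytes, so the size is checked first.

```python
import os

JPEG_START = bytes.fromhex("FF D8")
JPEG_END = bytes.fromhex("FF D9")


def looks_like_jpeg(path):
    """Return True when the file starts with FF D8 and ends with FF D9."""
    # Too small to hold both markers, and seeking backwards past the
    # start of the file would raise OSError, so bail out early.
    if os.path.getsize(path) < len(JPEG_START) + len(JPEG_END):
        return False

    with open(path, "rb") as file:
        if file.read(len(JPEG_START)) != JPEG_START:
            return False
        file.seek(-len(JPEG_END), os.SEEK_END)
        return file.read(len(JPEG_END)) == JPEG_END
```

Calling looks_like_jpeg("myfile.jpg") returns True or False without reading anything beyond those 4 bytes, so it stays constant time regardless of file size.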

# Using Ruby

To grab the first 2 bytes you can do:

# Open in binary mode; the block form closes the file for us.
puts File.open("myfile.jpg", "rb") { |file| file.read(2).unpack("H*") }

ffd8

We’re unpacking it as a hex string. There are quite a few other formats you can use if you want, all of which are documented for the unpack method.

To grab the last 2 bytes you can do:

# Open in binary mode; the block form closes the file for us.
File.open("myfile.jpg", "rb") do |file|
  file.seek(-2, :END)
  puts file.read(2).unpack("H*")
end

ffd9

We’re seeking to the end of the file, going back 2 bytes and then reading them. Instead of :END you can also use IO::SEEK_END – both do the same thing.

The video below goes over running these commands and scripts.

# Demo Video

Timestamps

  • 0:16 – File types typically have well known starting bytes
  • 0:51 – All 3 ways will be constant time O(1)
  • 1:15 – Using the od (octal dump) built-in command
  • 4:52 – Constant vs linear time with getting the last bytes
  • 8:57 – Python version
  • 11:54 – Ruby version
  • 15:19 – Using a library to help detect file types

When was the last time you had to check bytes of a file? Let me know below.
