Validate File Types by Reading the First Few Bytes of a File
Easily confirm whether a file is really a jpg, png, gz or whatever file type you want. We'll use the od command line tool, Python and Ruby.
Let’s say you support user uploads and you want to confirm someone is really uploading a jpg or png file. You can do that in a few different ways server side:
- Check the file extension
- This is really easy to fake because anyone can simply rename an exe file to jpg
- Check the MIME type header
- This is a little better but you can still pretty easily spoof this
- Make an HTTP request and set whatever MIME type header you want
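To see why the file extension alone proves very little, here is a minimal Python sketch. It uses the standard library's mimetypes module, which guesses a type from the file name only and never opens the file. The file name is made up for illustration.

```python
import mimetypes

# Hypothetical upload: an exe file that was simply renamed to end in .jpg.
uploaded_name = "totally-a-photo.jpg"

# guess_type() only looks at the name, it never reads the file's contents.
guessed_type, _encoding = mimetypes.guess_type(uploaded_name)

print(guessed_type)
```

On a typical setup this should report image/jpeg no matter what the file really contains, which is exactly the problem.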
While very few things are bulletproof, you can check the first few and / or last few bytes of a file to see if they match known hex byte sequences for specific file types.
Here are a few hexadecimal sequences. You can find more by Googling:
- jpg | jpeg - Starts with FF D8 and ends with FF D9
- png - Starts with 89 50 4E 47 0D 0A 1A 0A
- gz - Starts with 1F 8B
All 3 solutions we’ll cover run in constant O(1) time, which means they will be just as fast on a 2 KB file as on a 20 GB file, even when reading the last bytes of a file.
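As a rough idea of where this is heading, the starting sequences listed above could be turned into a small lookup in Python. This is only a sketch: the names SIGNATURES, detect_type and detect_file_type are made up for illustration, and it only covers the 3 file types from the list.

```python
from typing import Optional

# Starting byte sequences taken from the list above.
SIGNATURES = {
    "jpg": bytes.fromhex("FF D8"),
    "png": bytes.fromhex("89 50 4E 47 0D 0A 1A 0A"),
    "gz": bytes.fromhex("1F 8B"),
}

# The longest signature decides how many bytes we ever need to read.
MAX_SIGNATURE_LENGTH = max(len(sig) for sig in SIGNATURES.values())


def detect_type(header: bytes) -> Optional[str]:
    """Return the matching file type name, or None if nothing matches."""
    for name, signature in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None


def detect_file_type(path: str) -> Optional[str]:
    """Read only the first few bytes of the file at path and classify it."""
    with open(path, "rb") as file:
        return detect_type(file.read(MAX_SIGNATURE_LENGTH))
```

Since it never reads more than the longest signature (8 bytes here), the amount of work doesn't depend on the size of the file.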
# Using the Built-In od Command
od (octal dump) is a built-in command that exists on Linux and macOS.
To grab the first 2 bytes you can do:
od -t x1 -N 2 myfile.jpg
# Normally I'm a fan of using long flag names but they don't work on macOS,
# the short flag names work for both GNU and BSD versions of od.
#
# -t | --format
# -N | --read-bytes
0000000 ff d8
0000002
You can chop out just the bytes by piping the above through head and cut:
od -t x1 -N 2 myfile.jpg | head -n 1 | cut -d " " -f 2-
ff d8
You can output bytes in different formats too. Check out the docs.
To grab the last 2 bytes you can do:
tail -c 2 myfile.jpg | od -t x1 | head -n 1 | cut -d " " -f 2-
ff d9
In this case we’re taking advantage of tail’s -c flag to get the last 2 bytes of the file. Then from there we do exactly what we did to get the first 2 bytes except we don’t need to use -N (--read-bytes) since we already have just the 2 bytes from tail.
Thankfully this also runs in constant time because tail can seek to the end of the file, at least the GNU version does. I’m not sure if the BSD version does but if you’re on macOS you can try it on a large file to see if it finishes quickly.
You can compare the difference in speed on a large file by running this pipeline instead:
od -t x1 myfile.jpg | tail -n 2 | head -n 1 | rev | cut -d " " -f -2 | rev
ff d9
The above gives the same result except it’s linear time O(n). In this case od
needs to read the whole file before passing it to tail. It works but it would
be wildly inefficient on a large file. You probably shouldn’t ever use this
version.
Typically if I’m doing byte analysis it’s for something web related in which case I’ll reach for a programming language such as Python or Ruby. Both languages offer ways to seek to a specific point in the file which makes it run in constant time too.
That said, I have used od a few times to quickly verify a file by looking at the first few bytes.
# Using Python
To grab the first 2 bytes you can do:
with open("myfile.jpg", "rb") as file:
    print(file.read(2).hex())
ffd8
rb reads the file in binary mode and .hex() is available in Python 3.5+.
To grab the last 2 bytes you can do:
with open("myfile.jpg", "rb") as file:
    file.seek(-2, 2)
    print(file.read(2).hex())
ffd9
We’re doing the same thing except we’re first seeking to the end of the file and then going back 2 bytes. The 2 arg (which is the same as os.SEEK_END) tells seek to measure from the end of the file and the -2 arg goes back 2 bytes from there. Then we read the next 2 bytes like before.
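Putting both reads together, a jpg check could look something like the sketch below. The function name looks_like_jpeg is made up for illustration. It also guards against very small files, since seeking to before the start of a file raises an error.

```python
import os

JPEG_START = b"\xff\xd8"
JPEG_END = b"\xff\xd9"


def looks_like_jpeg(path: str) -> bool:
    """Check the first 2 and last 2 bytes of the file at path."""
    with open(path, "rb") as file:
        first = file.read(2)

        # Seeking to before the start of a file raises OSError, so bail
        # out early on files too small to hold both markers.
        file.seek(0, os.SEEK_END)
        if file.tell() < 4:
            return False

        # Go back 2 bytes from the end and read them.
        file.seek(-2, os.SEEK_END)
        last = file.read(2)

    return first == JPEG_START and last == JPEG_END
```

One caveat: some real world jpg files may have extra bytes after the FF D9 marker, so the end check might be too strict for your use case. If so, you could drop it and only check the starting bytes.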
# Using Ruby
To grab the first 2 bytes you can do:
puts File.new("myfile.jpg").read(2).unpack("H*")
ffd8
We’re unpacking it as a hex string. There are quite a few other formats you can use if you want, all of which are documented for the unpack method.
To grab the last 2 bytes you can do:
file = File.new("myfile.jpg")
file.seek(-2, :END)
puts file.read(2).unpack("H*")
ffd9
We’re seeking to the end of the file, going back 2 bytes and then reading them. Instead of :END you can also use IO::SEEK_END, both do the same thing.
The video below goes over running these commands and scripts.
# Demo Video
Timestamps
- 0:16 – File types typically have well known starting bytes
- 0:51 – All 3 ways will be constant time O(1)
- 1:15 – Using the od (octal dump) built-in command
- 4:52 – Constant vs linear time with getting the last bytes
- 8:57 – Python version
- 11:54 – Ruby version
- 15:19 – Using a library to help detect file types
When was the last time you had to check bytes of a file? Let me know below.