When you saw that today's topic was file compression, I'm sure you dropped everything, cleared your schedule, and got ready to sit back for an exciting read. Or ... more likely ... you’re about to hit the delete button. Hold on, don’t give up on me yet. Understanding file compression probably won’t help you all that much in your day-to-day life; it won’t help with an RFP or assessing a software solution; it won’t help you work with your IT teams. This falls squarely into the “gee whiz” category. But if you stick with me, you might learn a little something that helps you speak geek better than most.
Before we dive into the guts of a file compression algorithm, I want to cover a little basic computer science vocabulary.
Bit: A bit is the smallest unit of data a computer deals with. It can have two values: 0 or 1. Sometimes you will see them represented as false (0) or true (1).
Byte: A byte is a sequence of eight bits. That’s it, just a sequence of zeros and ones. For example: 10011011.
Character: A byte, or sequence of bytes, that represents some letter or symbol. For example, the letter A may be represented as 01000001 - a sequence of eight bits (a byte) that we all agree should represent the letter A.
Encoding: A system, or standard, for converting bytes into symbols or characters. Unless the whole world can agree upon how to convert a sequence of bits into a letter or symbol, we’ll have a heck of a time sharing information. So the world has agreed upon certain ways to encode bytes into symbols. One of the most basic encoding standards is ASCII. 01000001 represents a capital A in ASCII encoding, so if you hand any computer in the world the sequence 01000001 and tell it that it is an ASCII character, that computer will recognize it as A.
Puzzles and Trees
If you look up file compression algorithms you will find an extraordinary amount of math. Understanding the math is necessary if you are going to build software to implement the algorithms. But if you just want to understand the basic concepts of how they work, you don’t need complicated math.
Time for an example. Let’s say we have a text file containing the words: add all apples. If we encode this using the ASCII standard it becomes:
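To make that concrete, here's a quick Python sketch (not part of the original example) showing what the ASCII-encoded text looks like as bits:

```python
# Encode the text using ASCII: one 8-bit byte per character.
text = "add all apples"
ascii_bytes = text.encode("ascii")
bits = "".join(format(b, "08b") for b in ascii_bytes)

print(len(ascii_bytes))  # 14 bytes
print(len(bits))         # 112 bits
print(bits[:8])          # 01100001 -- the ASCII code for lowercase 'a'
```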
Including spaces, we’ve got 14 bytes of data. But with so few characters, we really don’t need a full byte to represent each character; we’re just doing that to comply with standards. Let's list each character that we're using and how many times it occurs in the string.
a - 3 occurrences
l - 3 occurrences
[space] - 2 occurrences
d - 2 occurrences
p - 2 occurrences
e - 1 occurrence
s - 1 occurrence
Now, what if we make up our own encoding mechanism that assigns fewer bits to the characters that occur more often?
[space] - 0
a - 1
d - 00
l - 01
p - 10
e - 11
s - 000
Here's our new file: 100000101010110100111000
We just reduced our file size to 24 bits - or three bytes.
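The substitution above is simple enough to sketch in a few lines of Python (this is my own illustration, not code from the article):

```python
# Our made-up variable-length code: frequent characters get fewer bits.
codes = {" ": "0", "a": "1", "d": "00", "l": "01",
         "p": "10", "e": "11", "s": "000"}

text = "add all apples"
encoded = "".join(codes[ch] for ch in text)

print(encoded)       # 100000101010110100111000
print(len(encoded))  # 24 bits
```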
Not quite that easy
Unfortunately, life isn’t always that simple. For a compression algorithm to work, you need to be able to reverse the process to get the original file back. In our example above, we can’t do that. For example, if we see the bits 11 in the compressed file, is that two a’s or is that a single e? There is no way to tell.
Back in 1951, a student at MIT, David Huffman, developed an algorithm to handle this problem as part of a class project. Rather than just create a lookup table of shortened codes to translate bytes to characters, he encoded all the information into a graph - specifically, a tree. I made an example of a tree below.
To determine the code for any given character, simply follow the path from the root at the top to the appropriate node at the bottom. For example, to get the letter "a" we follow the path 0-0, and to get the letter "p" we follow the path 1-1-0-1. Our new character map looks like this:
a - 00
[space] - 01
d - 10
l - 1100
p - 1101
s - 1110
e - 1111
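For the curious, here's a sketch of a textbook Huffman construction in Python (my own illustration). Note that the exact codes depend on how ties between equal frequencies are broken, so the codes it produces may differ from the example tree above - but they will always be unambiguous.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Each heap entry: (frequency, tie-break id, {char: code-so-far}).
    # The id keeps tuple comparison from ever reaching the dict.
    heap = [(freq, i, {ch: ""})
            for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least-frequent subtrees into one node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prefix every code in the left subtree with 0, the right with 1.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("add all apples")
print(codes)
```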
Using this coding our file now looks like this:
42 bits, or a little more than 5 bytes. Not as good as our first attempt, but this time it is reversible (we call this lossless compression). We just need to start at the left and read each bit and follow the tree until we find a character.
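You can check that 42-bit figure yourself with a quick Python sketch using the character map above:

```python
# The tree above, flattened into a lookup table.
huffman = {"a": "00", " ": "01", "d": "10",
           "l": "1100", "p": "1101", "s": "1110", "e": "1111"}

encoded = "".join(huffman[ch] for ch in "add all apples")
print(len(encoded))  # 42 bits
```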
The first bit is a 0, so we start at the top of the tree and move left, following the zero. The next bit is also a zero so we move left again and that brings us to an ‘a’ … first letter is a. The next bit is 1, starting back at the top of the tree we follow the 1 to the right. The next bit is 0, so we follow the 0 to the left and end up at ‘d’, so now we have “ad”. The animation below shows how we can continue this process until we have recreated our original text: add all apples.
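The same walk can be expressed in a few lines of Python. This sketch (my own, not from the article) uses a reverse lookup table rather than walking tree nodes, but the logic is identical: keep reading bits until the bits so far match exactly one code.

```python
huffman = {"a": "00", " ": "01", "d": "10",
           "l": "1100", "p": "1101", "s": "1110", "e": "1111"}
# Flip the table so we can look characters up by their code.
decode_table = {code: ch for ch, code in huffman.items()}

def decode(bits):
    result, current = [], ""
    for bit in bits:
        current += bit
        # Because no code is a prefix of another, the first match
        # is always the right character.
        if current in decode_table:
            result.append(decode_table[current])
            current = ""
    return "".join(result)

encoded = "".join(huffman[ch] for ch in "add all apples")
print(decode(encoded))  # add all apples
```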
Often we think of file compression as something we do to a file after we save it, such as when we zip a file or files to make them smaller to send over email. But compression algorithms are built into much of what we do now. Image formats (png, gif, jpeg, etc.) all have their own specific compression algorithms that are built into the format - that way the images take up less space on your drive and, more importantly, use less bandwidth when transmitted. Speaking of bandwidth ... ever stream a video? File compression is at the heart of all of your video streaming services, making it possible to efficiently transmit videos to your TV through the internet.
Hug a computer scientist
You have them to thank for making it possible to get all that content to your computer, phone, and smart TV without breaking the bank. The algorithm we reviewed today, Huffman coding, is just one of many, many compression algorithms. Some are general purpose; others are designed specifically for certain types of data such as audio or video. Where they were once made for convenience - saving some space on your hard drive - they are now an integral part of how we move information around the internet.
Need help with a project? Need an extra pair of eyes on our RFP? Need an expert perspective on your technology strategy? Give me a call.
Compression techniques often involve an encoder to compress and a decoder to decompress, a pairing used in image and video compression. The word is made up of encode and decode ... codec.
Some compression algorithms don't decompress a file with 100% accuracy. Information is lost. Before you write this off as a bad idea, consider this: JPEG, MP3, and WMV all leverage lossy compression - some of the most important media formats use compression that loses data.
Morse code is an early form of data compression. More commonly used letters are assigned shorter codes than lesser-used ones. This makes transmitting Morse code more efficient.