Olive Branch Technology
  • Services
    • Management Consulting >
      • Professional Developmet
    • Big Data/Analytics
    • Software Engineering
  • Dollar Dashboards
  • News and Information
  • Contact
  • Learning

The Compiler

Technical knowledge for non-technical professionals

File Compression - more interesting than you think

5/17/2018

0 Comments

 
When you saw that today's topic was file compression, I'm sure you dropped everything, cleared your schedule, and got ready to sit back for an exciting read. Or ... more likely ... you’re about to hit the delete button. Hold on, don’t give up on me yet. Understanding file compression probably won’t help you all that much in your day to day life; it won’t help with an RFP or assessing a software solution; it won’t help you work with your IT teams. This falls squarely into the “gee wiz” category. If you stick with me, you might learn a little something that helps you speak geek better than most.

Some vocabulary
Before we dive into the guts of a file compression algorithm I want to cover a little basic computer science vocabulary.

Bit. A bit is the smallest unit of data a computer deals with. It can have two values: 0 or 1. Sometimes you will see them represented as a false (0) or a true (1),

Byte. A byte is a sequence of eight bits. That’s it, just a sequence of zeros and ones. For example: 10011011.

Character: A byte, or sequence of bytes, that represents some letter or symbol.  For example, the letter A may be represented as ‭01000001‬ - a sequence of eight bits (a byte) that we all agree should represent the letter A.

Encoding: A system, or standard, for converting bytes into symbols or characters. Unless the whole world can agree upon how to convert a sequence of bits into a letter or symbol, we’ll have a heck of a time sharing information. So the world has agreed upon certain ways to encode bytes into symbols. One of the most basic encoding standards is the ASCII standard. 01000001 represents a capital A in ASCII encoding so if you hand any computer in the world the sequence 01000001 and tell it that it is an ASCII character, that computer will recognize it as A.

Puzzles and Trees
If you look up file compression algorithms you will find an extraordinary amount of math. Understanding the math is necessary of you are going to build software to implement the algorithms. But if you just want to understand the basic concepts of how they work, you don’t need complicated math.

Time for an example. Let’s say we have a text file containing the words: 
add all apples.  If we encode this using the ASCII standard it becomes:

01100001 01100100 01100100 00100000 01100001 01101100 01101100 00100000 01100001 01110000 01110000 01101100 01100101 01110011

Including spaces, we’ve got 14 bytes of data. But with so few characters, we really don’t need a full byte to represent each character, we’re just doing that to comply with standards. Let's list each character that we're using and how many times it occurs in the string.


[space]  - 3 occurrences
a - 3 occurrences
l - 3 occurrences
d - 2 occurrences
p - 2 occurrences
e - 1 occurrences
s - 1 occurrences

Now what if we make up our own encoding mechanism that assigns fewer bits to characters  that occur more often.

[space]  - 0
a - 1
d - 00
l - 01
p - 10
e - 11
s - 000

Here's our new file: 100000101010110100111000

We just reduced our file size to 24 bits - or three bytes.

Not quite that easy
Unfortunately, life isn’t always that simple. For a compression algorithm to work you need to be able to reverse the process to get the original file back. In our example above, we can’t do that.  For example, if we see two positive bits … 11 in the compressed file, is that two a’s or is that a single s?There is no way to find out.

Back in 1951 a student at MIT, David Huffman, developed an algorithm to handle this problem as part of a class project.Rather than just create a lookup table of shortened codes to translate bytes to characters, he encoded all the information into a graph, specifically a tree. I made an example of a tree below.
To determine the value for any given character, simply follow the path from the root at the top to the apropriate node at the bottom. For example, to get the letter "a" we simply follow the path 0-0, to get the letter 'p' we follow the path 1-1-0-1. Our new character map looks like this:

a - 00
[space] - 01
d - 10
l - 1100
p - 1101
s - 1110 
e - 1111

Using this coding our file now looks like this:

001010010011001100010011011101110011111110

42 bits, or  a little more than 5 bytes.  Not as good as our first attempt, but this time it is reversible (we call this lossless compression). We just need to start at the left and read each bit and follow the tree until we find a character.

The first bit is a 0, so we start at the top of the tree and move left, following the zero. The next bit is also a zero so we move left again and that brings us to an ‘a’ … first letter is a. The next bit is 1, starting back at the top of the tree we follow the 1 to the right. The next bit is 0, so we follow the 0 to the left and end up at ‘d’, so now we have “ad”.  The animation below shows how we can continue this process until we have recreated our original text: add all apples.
It's everywhere!
Often we think of file compression as something we do to a file after we save it, such as when we zip a file or files to make them smaller to send over email. But compression algorithms are built into much of what we do now. Image formats (png, gif, bmp, etc.) all have their own specific compression algorithms that are built into the format - that way the images take up less space on your drive and, more importantly, use less bandwidth with transmitted. Speaking of bandwidth ... ever stream a video? File compression is at the heart of all of your video streaming services, making it possible to efficiently transmit videos to your tv through the internet.

Hug a computer scientist
You have them to thank for making it possible to get all that content to your computer, phone, and smart tv without breaking the bank. The alorith we reviewed today, Huffman Coding, is just one of many many compression algorithms. Some are general purpose, others are designed specifically for certain types of data such as audio or video. Where they were once made for convenience - saving some space on your hard drive - they are now an integral part of how we move information around the internet.
Need help with a project? Need an extra pair of eyes on our RFP? Need an expert perspective on your technology strategy? Give me a call.

$10 Word

CODEC

Compression techniques involving an encoder to compress and a decoder to decompress. Used in image and video compression. The word is made up of encode and decode ... codec.

Lossy Compression

Some compression algorithms don't decompress a file with 100% accuracy. Information is lost. Before you write this off as a bad idea, consider this. JPEG, MP3, and WMV formats are leverage lossy compression - some of the most important media formats all use compression formats that lose data.

Morse Code?

Morse code is an early form of data compression. The codes used for more commonly used letters are assigned shorter codes than the lesser used characters. This makes transmitting Morse code mor eefficient.

0 Comments

    Archives

    June 2018
    May 2018
    April 2018
    February 2018

    Categories

    All

    RSS Feed

Home

Management Consulting

Big Data & Analytics

Software Engineering

Conta​ct

Copyright © 2016, Olive Branch Technology
  • Services
    • Management Consulting >
      • Professional Developmet
    • Big Data/Analytics
    • Software Engineering
  • Dollar Dashboards
  • News and Information
  • Contact
  • Learning