Saturday, October 31, 2009
When an "A" is not an "A"
Why do some audio codecs need so much more data than others? Consider this:
Text files and image files take completely different approaches to how they save information. A text file doesn’t save how your words look, it saves what the words are. An image file is the opposite: it preserves the look of things, and doesn’t care what they represent.
If a text file only needs to distinguish among 256 different characters – and that gives you ten alphabets’ worth of special characters – then each letter needs eight bits. One byte per character. That’s pretty efficient, but it only works because we’ve limited its abilities: it’s only allowed to store text, not pictures, and we have to agree beforehand on what those 256 characters are.
Image files, on the other hand, only assume that the source is an image – any image. A good example is the .jpg format (JPEG is “Joint Photographic Experts Group,” the party animals who first agreed on that standard). The format starts with an arbitrary image, a checkerboard of any number of pixels, and then follows an algorithm to boil that down as best it can by removing redundancies. But it’s always limited by its inability to make many simplifying assumptions. It can’t take the easy road that a .txt file can, just assuming the source is a letter; it could be a fondue pot.
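JPEG’s real pipeline (block transforms, quantization, entropy coding) is far more elaborate than I can show here, but a toy run-length encoder gives the flavor of what “removing redundancies” means – this is just an illustrative sketch, not how JPEG actually works:

```python
# Toy illustration of redundancy removal in pixel data.
# Runs of identical pixel values collapse into (value, count) pairs.
def rle(pixels):
    """Collapse runs of identical values into [value, count] pairs."""
    out = []
    for p in pixels:
        if out and out[-1][0] == p:
            out[-1][1] += 1   # extend the current run
        else:
            out.append([p, 1])  # start a new run
    return out

# A row of mostly-white pixels with one black stroke compresses well:
row = [255] * 60 + [0] * 10 + [255] * 90
print(rle(row))  # [[255, 60], [0, 10], [255, 90]]
```

A 160-value row shrinks to three pairs – but only because this particular row happened to be redundant. A row of random noise would get longer, which is exactly the “no simplifying assumptions” problem.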
Let’s try a simple case: I want to send the letter “A” somewhere.
Here’s your eight bits: 01000001 in binary. A bunch of people called ASCII agreed on this lookup table many decades ago – it’s just another kind of Morse code.
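You can watch that lookup table at work in Python (just an illustration – any language exposes the same table):

```python
# ASCII maps each character to one agreed-upon number, one byte each.
letter = "A"
code = ord(letter)                  # look up the ASCII code: 65
print(code)                         # 65
print(format(code, "08b"))          # the eight bits: 01000001
print(len(letter.encode("ascii")))  # one byte per character
```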
Since I want to send a clear picture of the letter “A,”
I’ll start with an image space about 160 x 180 pixels. That’s 28,800 pixels for a really clear letter (see how nice it looks?). Then, depending on how aggressive I’m willing to be, I can boil that down anywhere from 2x to maybe 20x. Add overhead for arbitrary color (three channels, remember), and this one-byte letter is now hundreds or thousands of bytes.
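Here’s the back-of-the-envelope arithmetic, assuming one bit per pixel for a black-and-white version of that 160 x 180 canvas:

```python
# Rough size estimate for sending "A" as a picture instead of a byte.
width, height = 160, 180
pixels = width * height        # 28,800 pixels on the canvas
raw_bytes = pixels // 8        # 1 bit/pixel black-and-white: 3,600 bytes
best = raw_bytes // 20         # aggressive 20x compression
worst = raw_bytes // 2         # mild 2x compression
color_factor = 3               # three color channels on top of that
print(raw_bytes)               # 3600
print(best, worst)             # 180 1800
print(best * color_factor, worst * color_factor)
```

Compare any of those numbers to the single byte ASCII needed.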
What did I get in return for all this extra space? I can send any picture I want.
Speech codecs are like that too – they offer the same tradeoffs. I’ll talk about that next.
Footnote: Text coding is a slippery slope. The move to multilingual documentation has pushed a transition to double-byte coding. 256 choices are no longer enough; we now need 256^2, or 65,536, choices to handle the possibilities of Chinese, Farsi, and all the rest. See what happened? By growing the number of choices, we grow the “codebook” and so make the coding more versatile but less efficient. This is a form of what is called, in its most general form, “vector coding.” Vector coding can also be used for image compression, where the codebook can dynamically adapt to the content of the image. The result is more efficient – fewer bits – but also more complex – more calculation.
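As it turned out, the world largely went to variable-width coding (UTF-8) rather than a flat double-byte table, but the footnote’s tradeoff is easy to see: a bigger codebook costs more bits per symbol.

```python
# Byte cost of the same character under different codebooks.
# UTF-8 is variable-width: plain ASCII stays at one byte,
# while characters from larger alphabets cost more.
for ch in ("A", "é", "中"):
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
# "A" is 1 byte in UTF-8; "中" needs 3 bytes in UTF-8, 2 in UTF-16.
```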