Saturday, October 31, 2009

When an "A" is not an "A"

Why do some audio codecs need so much more data than others? Consider this:

Text files and image files take completely different approaches to how they save information.  A text file doesn’t save how your words look, it saves what the words are.   An image file is the opposite:  it preserves the look of things, and doesn’t care what they represent.   

If a text file only needs to distinguish among 256 different characters – and that gives you ten alphabets’ worth of special characters – then each letter needs eight bits.  One byte per character.  That’s pretty efficient, but it only works because we’ve limited its abilities: it’s only allowed to store text, not pictures, and we have to agree beforehand on what those 256 characters are.

Image files, on the other hand, only assume that the source is an image, any image.  A good example is .jpg format (JPEG is “Joint Photographic Experts Group,” the party animals who first agreed on that standard).  So it starts with an arbitrary image, a checkerboard of any number of pixels, and then follows an algorithm to boil that down as best it can by removing redundancies.  But it’s always limited by its inability to make many simplifying assumptions.  It can’t take the easy road that a .txt file can, just assuming the source is a letter; it could be a fondue pot.

Let’s try a simple case:  I want to send the letter “A” somewhere. 

Text file
Here’s your eight bits:  01000001 in binary.  The people behind ASCII (the American Standard Code for Information Interchange) agreed on this lookup table many decades ago – it’s just another kind of Morse code. 
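Here is a minimal sketch of that lookup in Python, using only the built-in string encoder. The variable names are mine, chosen for illustration.

```python
# Encode the single letter "A" as one ASCII byte and show its bit pattern.
letter = "A"
encoded = letter.encode("ascii")   # a bytes object holding one byte
bits = format(encoded[0], "08b")   # zero-padded 8-bit binary string

print(len(encoded), "byte:", bits)
```

One character in, one byte out, and the bit pattern matches the one quoted above.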

Jpg file
Since I want to send a clear picture of the letter “A,” I’ll start with an image space about 160 x 180 pixels.  That’s 28,800 pixels for a really clear letter.   Then, depending on how aggressive I’m willing to be, I can boil that down anywhere from 2x to maybe 20x.   Add overhead for arbitrary color (three colors, remember), and this one-byte letter is now thousands of bytes. 
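To put rough numbers on the picture of the "A," here is a back-of-the-envelope sketch. It assumes three bytes per pixel (one each for red, green and blue) and treats the 2x and 20x figures as simple division of the raw size. Real JPEG output depends on the image content and quality setting, so these are estimates, not measurements.

```python
WIDTH, HEIGHT = 160, 180
BYTES_PER_PIXEL = 3  # assumption: 8 bits each for red, green, blue

pixels = WIDTH * HEIGHT
raw_bytes = pixels * BYTES_PER_PIXEL


def compressed_size(raw: int, ratio: float) -> int:
    """Estimated size in bytes after shrinking `raw` by a factor of `ratio`."""
    return round(raw / ratio)


print("pixels:", pixels)
print("raw bytes:", raw_bytes)
for ratio in (2, 20):
    print(f"{ratio}x: about {compressed_size(raw_bytes, ratio)} bytes")
```

Even at the aggressive end, the picture costs thousands of times more than the single byte the text file needed.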

What did I get in return for all this extra space?   I can send any picture I want.  

Speech codecs are like that too - they offer the same tradeoffs.  I’ll talk about that next.

Footnote:  Text coding is a slippery slope.  The move to multilingual documentation has pushed a transition to double-byte coding.  256 choices is no longer enough; we now need 256^2, or 65,536 choices, to handle the possibilities of Chinese, Farsi, and all the rest.  See what happened?  By growing the number of choices, we grow the “codebook” and so make the coding more versatile but less efficient.  This is an instance of what is called, in its most general form, “vector coding” (more often known as vector quantization).  Vector coding can be used for image compression, and then the codebook can dynamically adapt to the content of the image.  It’s more efficient (fewer bits) but more complex (more calculation).  
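The codebook tradeoff in the footnote can be sketched in a few lines. The helper name `bits_per_symbol` is hypothetical, and the fixed two-byte encoding is stood in for by Python's `utf-16-be` codec, which is my choice for illustration and not necessarily the scheme any particular document uses.

```python
def bits_per_symbol(codebook_size: int) -> int:
    """Whole number of bits needed to index a codebook of the given size."""
    return (codebook_size - 1).bit_length()


for size in (256, 256 ** 2):
    print(size, "choices ->", bits_per_symbol(size), "bits per character")

# The same plain "A" costs twice as much once every character gets two bytes.
one_byte = "A".encode("latin-1")
two_bytes = "A".encode("utf-16-be")
print(len(one_byte), "byte versus", len(two_bytes), "bytes")
```

The bigger codebook can name far more characters, and every character, even a plain "A," pays for that range.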

Thursday, October 29, 2009


Our ability to understand speech can be measured.  The most common method is to select random syllables and ask a group of listeners to write them down, then degrade the audio (muffle the sound by lowering the bandwidth, for example), and run the test again.  When this is done, the result looks like this:

This picture shows that as we pass more and more of the sound by raising the cut-off frequency, the listener’s accuracy also increases.  This makes sense; we all know it’s hard to understand someone who’s very muffled, and it gets easier when they speak clearly.  That’s why wideband audio in telephony is so important:  with normal phone fidelity – 3 kHz – people make mistakes on one out of ten syllables!   Raise the cut-off to 7 kHz with HD Voice, and the chart's red line shows accuracy rising to nearly 100%.  And in real-world settings, with accented speech, speakerphones, or people sitting too far away, the difference is even more dramatic.

Sunday, October 11, 2009

The Spice of Fidelity

It's been said that the higher sound frequencies aren't very important for speech because there isn't much energy there.  It's hard to argue the point, because it's true - well, part of it is true.  The amount of speech energy that a standard telephone carries is ever so much higher than the amount it cuts off - a hundred times, or more.  So yeah, it's not cutting off that much, there's not much up there.
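For anyone who likes to see the arithmetic, here is a small sketch of what "a hundred times" means in the units audio people use. It takes the ratio of 100 at face value from the paragraph above. The variable names are mine.

```python
import math

energy_ratio = 100  # carried energy divided by cut-off energy, per the text

# An energy (power) ratio converts to decibels as 10 * log10(ratio).
ratio_db = 10 * math.log10(energy_ratio)

# Share of the total speech energy that the telephone throws away.
cut_off_share = 1 / (1 + energy_ratio)

print(f"{ratio_db:.0f} dB difference")
print(f"{cut_off_share:.1%} of the energy is cut off")
```

So the telephone discards only about one percent of the energy, which is exactly the number the "not important" argument leans on.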

But this argument is like those tricksters we find in politics, where a fallacy is linked to a truth and then presented as two truths.  There isn't much energy there?  Check.  Not important for speech?  Sound the alarm.  We have to ask: when did we start to assume that the value of those higher frequencies was proportional to their energy?  This is one of those assumptions that looks good on the surface, but we actually prove it fallacious all the time.

Let's talk soup.  If you make a quart of good chicken soup, you'll be using almost a quart of water.  Some noodles, some chicken, and some of this and that: oregano, thyme, marjoram perhaps, a touch of chile, and a nice pinch of salt.  This gives you roughly a kilogram of soup.

How much do you suppose those seasonings weigh?  About three grams, mostly salt.  So we can leave them out, right?  They're just a tiny part of the total weight, so they can't be important.  Who cares about three-tenths of a percent?  It's a lot of trouble to get fresh spices anyway, so the economics don't add up.  We'll just leave it all out, and our chicken soup will be soup and chicken, and nobody will be the wiser.

Stop looking at me like that, I'm just proving a point.

Yes, you're right.  Leave out those herbs and the salt, and you've destroyed the succulence of the dish.  And yet they're way less than a hundredth of its weight.

This principle applies everywhere.  Oil on bearings?  It's milligrams on grams.  Perfume?  Maybe five milligrams, dabbed behind the ear of a 50,000,000 milligram lady.  You've got a microscopic veneer of paint on your house, a hint of pigment that floods that deep color into a beautiful silk scarf, a half-carat diamond (1/300 of an ounce) making something suddenly very romantic out of a plain gold ring.  They're all very tiny proportions, but they're what characterizes the finished product.

The audio that conventional telephones ignore is like that.  Not much energy, but it turns "failing" into "sailing" and an exhausted finish into a successful and energized meeting.

Set your own company's gourmet meeting chefs free.  Restore those HD Voice spices and seasonings so they can begin serving up tastier, more productive telephone conferences!