JR's UC Voice: 2009

Thursday, December 3, 2009

Telepresence: Better than Airbags!

Passenger car fatalities have dropped from 23,000 in 1985 to 17,800 in 2006, according to the US Department of Transportation. Every two months, more people are killed on our roads than died in the 9/11 attack.

The decrease in highway deaths across the years is doubtless due to a number of factors. Airbags alone are credited with saving 6,377 lives since 1990, a period in which DOT counts 600,110 total road deaths.

But here’s a proposal for saving many more lives: encourage telecommuting. What would this achieve? Let’s look at some figures.

Some workers cannot telecommute at all, due to the nature of their jobs. Others could telecommute virtually every day. Let’s assume that only 50% of the workforce could telecommute at all, and that of that 50%, the average would be one day every two weeks. I think this is likely a very low estimate, but let’s go with it for now. So we now have the average American worker telecommuting ¼ day per week.

Let’s also assume that of the miles driven in a day, the commute represents an average of 50%. Again, probably low, but it’s a start. Averaged across all workers, driving is now reduced by 1/8 (12.5%) of one day’s miles per 5-day workweek, or 2.5% for the week.

Fatality rates are based on miles driven. This means that as we reduce the number of miles driven, we also reduce the number of fatalities. Cut miles by 2.5%? You reduce deaths by 2.5%.

And what’s 2.5% of 600,000? 15,000 lives! Telecommuting, even if as sparsely applied as this, would save over twice as many lives as airbags!
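For anyone who wants to check the arithmetic or try other assumptions, here is a minimal sketch. It starts from the post's figure of ¼ telecommute day per week for the average worker and assumes, as the post does, that deaths scale directly with miles driven. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope estimate using the figures quoted in the post.
AVG_TELECOMMUTE_DAYS_PER_WEEK = 0.25  # average worker, per the post
COMMUTE_SHARE_OF_DAILY_MILES = 0.50   # assumed share of a day's driving
WORKDAYS_PER_WEEK = 5
ROAD_DEATHS_SINCE_1990 = 600_000      # rounded DOT total quoted in the post
LIVES_SAVED_BY_AIRBAGS = 6_377        # airbag figure quoted in the post


def estimate_lives_saved(avg_days, commute_share, workdays, total_deaths):
    """Return (fractional mileage reduction, deaths avoided).

    Assumes deaths are proportional to miles driven.
    """
    mileage_reduction = avg_days * commute_share / workdays
    return mileage_reduction, mileage_reduction * total_deaths


reduction, lives = estimate_lives_saved(
    AVG_TELECOMMUTE_DAYS_PER_WEEK,
    COMMUTE_SHARE_OF_DAILY_MILES,
    WORKDAYS_PER_WEEK,
    ROAD_DEATHS_SINCE_1990,
)
print(f"Mileage reduction: {reduction:.1%}")
print(f"Estimated lives saved: {lives:,.0f}")
print(f"Compared with airbags: {lives / LIVES_SAVED_BY_AIRBAGS:.2f}x")
```

Change any of the constants at the top to see how sensitive the result is to each assumption.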

This makes HD Voice, videoconferencing, and telepresence all potent tools for safety because they greatly ease the process of the telecommute. What are we waiting for?

Sunday, November 15, 2009

The Phone Network, an 80-year-old dog

Why has it been so hard to teach that old dog, the phone network, the new trick of sound quality that's as good as what we hear?

In 1930, the goal of the phone system was stated like this: “In the Bell system the general objective which has been set up for the transmitted frequency range for new designs of telephone message circuits is a range having a width of 2,500 cycles, extending from about 250 cycles to about 2,750 cycles.”

The performance of the public telephone network has not gotten much better over the years; in some ways it has even gotten a little worse. As recently as 1984, the range for long-distance calls still ran from about 280 Hz at the lower end to about 2.7 kHz at the higher end. It’s hard to take something like the global phone network, one with fundamental goals set eighty years ago, and change its underlying fidelity in any meaningful way.

This is why telephones are moving to VoIP, Voice over Internet Protocol (or internet telephony), to make better fidelity possible for most people. The internet carries data, and since any kind of signal can be converted into data, this means the internet can carry almost any kind of signal. Did you get that?

1. The Internet Carries Data

2. Any Signal can be turned into Data

Therefore: The Internet Carries Any Signal
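To make step 2 concrete, here is a small sketch of turning a signal into data. It samples a pure tone and packs the samples as 16-bit numbers, which is roughly what any digital audio path does before sending sound over a network. The 440 Hz tone, the 16 kHz sample rate, and the function name are assumptions chosen for illustration, not anything a particular phone system uses.

```python
import math
import struct

SAMPLE_RATE_HZ = 16_000  # assumed rate; can represent audio up to 8 kHz
TONE_HZ = 440            # assumed test tone
DURATION_S = 0.01        # 10 milliseconds of signal


def tone_to_pcm(freq_hz, sample_rate_hz, duration_s):
    """Sample a sine tone and pack it as little-endian 16-bit integers."""
    n_samples = round(sample_rate_hz * duration_s)
    samples = [
        round(32767 * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz))
        for n in range(n_samples)
    ]
    return struct.pack(f"<{n_samples}h", *samples)


pcm = tone_to_pcm(TONE_HZ, SAMPLE_RATE_HZ, DURATION_S)
print(len(pcm), "bytes, ready to travel like any other data")
```

Once the sound is just a string of bytes, the network does not need to know or care that it started out as a voice.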

Signals can be HD Voice, desktop video, Immersive Videoconferencing, and most other kinds of information you may want to share. Because the Internet can be made arbitrarily versatile and fast, it can keep up with the needs of live human communication for a long time to come.

Reference: W. H. Martin, “Transmitted Frequency Range for Telephone Message Circuits,” Bell System Technical Journal, July 1930. Ref. JR1/11

Thursday, November 5, 2009

What Is "HD Voice?"

There’s a spectrum for sound as there is for light, and it spans the range from low, booming bass like kettledrums and distant thunder, to high whistles and hisses like birdsong and squeaking hinges.

The “color” of sound is described by its frequency. A single note of music has a frequency, the number of times it vibrates in a second. You see this when a guitar string is strummed – you can almost watch the lowest string throb, while the higher strings move so quickly they’re just a blur. The moving string makes the air move, and those repeating cycles of moving air are what we hear, because they move our eardrums.

It makes sense, then, that the frequency of a tone is measured in cycles per second, or cps. In 1960, this nice clear name was renamed the “Hertz” after Heinrich Hertz, thereby inconveniencing all posterity for the sake of a dead guy, but the term still means cycles per second. One of these is a hertz (1 Hz), a thousand of these is a kilohertz (1,000 Hz or 1 kHz), a million a megahertz (1 MHz). The human ear is usually described as hearing 20 Hz to 20 kHz, which is three orders of magnitude or ten octaves.
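The "three orders of magnitude or ten octaves" figure is easy to check: an order of magnitude is a factor of ten and an octave is a doubling, so both are just logarithms of the same ratio. A quick sketch, with variable names of my own choosing:

```python
import math

LOW_HZ = 20       # lower limit of hearing quoted in the post
HIGH_HZ = 20_000  # upper limit of hearing quoted in the post

ratio = HIGH_HZ / LOW_HZ
orders_of_magnitude = math.log10(ratio)  # how many factors of ten
octaves = math.log2(ratio)               # how many doublings of frequency

print(f"{orders_of_magnitude:.1f} orders of magnitude, {octaves:.2f} octaves")
```

The octave count comes out just under ten, which is why "ten octaves" is the usual round number.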

In between these extremes is the human voice. The sounds we make when we talk or sing lie mostly between 80 Hz at the low end and 14 kHz at the higher end. Vowels and “smooth” sounds are mostly below 4 kHz, while a lot of the consonants that tell one word from another, like “fell” from “sell,” are above 4 kHz.

The telephone, however, only carries the thin slice of frequencies from 300 Hz to 3300 Hz. That’s right, your ears can hear with five times the fidelity of your phone, which is why phones sound so muffled. This started accidentally in the twenties, because the metal wafers, carbon granule microphones, and cloth-insulated coils they used could do no better, but phones today haven’t gotten much better than they were back then.

That’s where HD Voice comes in. “HD Voice” means a phone that has extended its fidelity to at least 7 kHz – doubling the frequency response compared to conventional narrowband phones, and dramatically improving the sound and the ease of use.

I’ll be writing more to give a fuller perspective on why that’s important, but in short: by restoring the missing four-fifths of speech that the telephone cuts out, HD Voice boosts accuracy, reduces fatigue, overcomes accents and background noise, and makes telling people apart easier and a lot more natural.

Saturday, October 31, 2009

When an "A" is not an "A"

Why do some audio codecs need so much more data than others? Consider this:

Text files and image files take completely different approaches to how they save information. A text file doesn’t save how your words look, it saves what the words are. An image file is the opposite: it preserves the look of things, and doesn’t care what they represent.

If a text file only needs to distinguish among 256 different characters – and that gives you ten alphabets’ worth of special characters – then each letter needs eight bits. One byte per character. That’s pretty efficient, but it only works because we’ve limited its abilities: it’s only allowed to store text, not pictures, and we have to agree beforehand on what those 256 characters are.

Image files, on the other hand, only assume that the source is an image, any image. A good example is .jpg format (JPEG is “Joint Photographic Experts Group,” the party animals who first agreed on that standard). So it starts with an arbitrary image, a checkerboard of any number of pixels, and then follows an algorithm to boil that down as best it can by removing redundancies. But it’s always limited by its inability to make many simplifying assumptions. It can’t take the easy road that a .txt file can, just assuming the source is a letter; it could be a fondue pot.

Let’s try a simple case: I want to send the letter “A” somewhere.

Text file

Here’s your eight bits: 01000001 in binary. A bunch of people called ASCII agreed on this lookup table many decades ago – it’s just another kind of Morse code.
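You can see the lookup table at work in a couple of lines. This is only a sketch using Python's built-in character functions:

```python
letter = "A"

code_point = ord(letter)           # position of the letter in the table
bits = format(code_point, "08b")   # the same number written as eight bits
encoded = letter.encode("ascii")   # the bytes that would actually be sent

print(code_point, bits, len(encoded))
```

The letter goes out as a single byte, and the receiver turns it back into an "A" only because both ends agreed on the table beforehand.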

Jpg file

Since I want to send a clear picture of the letter “A,” I’ll start with an image space about 160 x 180 pixels. That’s 28,800 pixels for a really clear letter. Then, depending on how aggressive I’m willing to be, I can boil that down anywhere from 2x to maybe 20x. Add overhead for arbitrary color (three colors, remember), and this one-byte letter is now thousands of bytes.

What did I get in return for all this extra space? I can send any picture I want.

Speech codecs are like that too - they offer the same tradeoffs. I’ll talk about that next.

Footnote: Text coding is a slippery slope. The move to multilingual documentation has pushed a transition to double-byte coding. 256 choices is no longer enough; we now need 256^2, or 65,536 choices, to handle the possibilities of Chinese, Farsi, and all the rest. See what happened? By growing the number of choices, we grow the “codebook” and so make the coding more versatile but less efficient. This is a form of what is called, in its most general form, “vector coding.” Vector coding can be used for image compression, and then the codebook can dynamically adapt to the content of the image. It’s more efficient (fewer bits) but more complex (more calculation).
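The cost of the bigger codebook shows up directly in the bit count. The sketch below uses UTF-16 as a stand-in for "double-byte coding"; that choice is my assumption, since the footnote doesn't name a specific encoding, and the helper name is made up.

```python
def bits_needed(codebook_size):
    """Smallest whole number of bits that can index every codebook entry."""
    return (codebook_size - 1).bit_length()


single_byte_bits = bits_needed(256)       # the original 256-entry codebook
double_byte_bits = bits_needed(256 ** 2)  # the 65,536-entry codebook

# With a two-byte encoding, even a plain "A" now costs two bytes,
# the same as a Chinese character such as U+4E2D.
latin_a = "A".encode("utf-16-be")
han_char = "\u4e2d".encode("utf-16-be")

print(single_byte_bits, double_byte_bits, len(latin_a), len(han_char))
```

More choices, more bits per symbol: the versatility is paid for on every character, including the ones that never needed it.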

Thursday, October 29, 2009

LET’S MEASURE HD VOICE!

Our ability to understand speech can be measured. The most common method is to select random syllables and ask a group of listeners to write them down, then mess up the audio (muffle the sound by lowering the bandwidth, for example), and run the test again. When this is done, the result can be charted as listener accuracy against cut-off frequency.

This picture shows that as we pass more and more of the sound by raising the cut-off frequency, the listener’s accuracy also increases. This makes sense; we all know it’s hard to understand someone who’s very muffled, and it gets easier when they speak clearly. That’s why wideband audio in telephony is so important: with normal phone fidelity – 3 kHz – people make mistakes on one out of ten syllables! Raise the cut-off to 7 kHz with HD Voice, and the chart's red line shows accuracy rising to nearly 100%. And in real-world settings, like accented speech, speakerphones, and people sitting too far away, the difference is even more dramatic.
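The scoring step of a test like this is simple enough to sketch. The syllables and listener answers below are invented purely to show the mechanics; they are not measurements, and the function name is hypothetical.

```python
def syllable_accuracy(presented, written):
    """Fraction of syllables the listener wrote down correctly."""
    if not presented or len(presented) != len(written):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == w for p, w in zip(presented, written))
    return correct / len(presented)


# Invented example: ten syllables played, one misheard ("fa" taken for "sa").
presented = ["fa", "sa", "ta", "ka", "pa", "ba", "da", "ga", "ma", "na"]
written = ["sa", "sa", "ta", "ka", "pa", "ba", "da", "ga", "ma", "na"]

print(f"Accuracy: {syllable_accuracy(presented, written):.0%}")
```

Run the same scoring once per cut-off frequency and you have the points for a chart of accuracy against bandwidth.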

Sunday, October 11, 2009

The Spice of Fidelity

It's been said that the higher sound frequencies aren't very important for speech because there isn't much energy there. It's hard to argue the point, because it's true - well, part of it is true. The amount of speech energy that a standard telephone carries is ever so much higher than the amount it cuts off - a hundred times, or more. So yeah, it's not cutting off that much, there's not much up there.

But this argument is like those tricksters we find in politics, where a fallacy is linked to a truth and then presented as two truths. There isn't much energy there? Check. Not important for speech? Sound the alarm. We have to ask: when did we start to assume that the value of those higher frequencies was proportional to their energy? This is one of those assumptions that looks good on the surface, but we actually prove it fallacious all the time.

Let's talk soup. If you make a quart of good chicken soup, you'll be using almost a quart of water. Some noodles, some chicken, and some of this and that: oregano, thyme, marjoram perhaps, a touch of chile, and a nice pinch of salt. This gives you roughly a kilogram of soup.

How much do you suppose those seasonings weigh? About three grams, mostly salt. So we can leave them out, right? They're just a tiny part of the total weight, so they can't be important. Who cares about three-tenths of a percent? It's a lot of trouble to get fresh spices anyway, so the economics don't add up. We'll just leave it all out, and our chicken soup will be soup and chicken, and nobody will be the wiser.

Stop looking at me like that, I'm just proving a point.

Yes, you're right. Leave out those herbs and the salt, and you've destroyed the succulence of the dish. And yet they're way less than a hundredth of its weight.

This principle applies everywhere. Oil on bearings? It's milligrams on grams. Perfume? Maybe five milligrams, dabbed behind the ear of a 50,000-gram lady. You've got a microscopic veneer of paint on your house, a hint of pigment that floods that deep color into a beautiful silk scarf, a half-carat diamond (1/300 of an ounce) making something suddenly very romantic out of a plain gold ring. They're all very tiny proportions, but they're what characterizes the finished product.

The audio that conventional telephones ignore is like that. Not much energy, but it turns "failing" into "sailing" and an exhausted finish into a successful and energized meeting.

Set your own company's gourmet meeting chefs free. Restore those HD Voice spices and seasonings so they can begin serving up tastier, more productive telephone conferences!

Thursday, September 24, 2009

From consonants to culture

I did not see this one coming.

I've been talking for years about HD Voice (I'll post another thing about what that is) and how it becomes essential when you have a phone session that's anything less than perfect: accents, speakerphones, room noise, soft talkers, identification, technical talk or detail of any sort. But a CEO was telling a story today that added a new dimension for confusion: culture.

He spoke to us with a noticeable European accent, although nothing particularly thick. And he told us how some months ago, he was leading a discussion, by speakerphone, with his company's Asia team. He got to the core of the discussion and launched into a long, involved explanation of the most critical points. He had some (but not too many) slides, he carefully guided the team through his spreadsheets, he illustrated his lecture with other stories and lessons, and finally reached the triumphant end of a well thought-out and clearly delivered exposition, energized and ready for the flood of excited questions he knew would be coming.

Instead, there was silence. The distant chirping of crickets. The room held a long, uncomfortable pause, and finally a quiet, respectful voice hesitatingly stuttered a question that made it instantly clear that they not only didn't understand what he had said, they didn't even understand what the subject was!

As he drew them out with questions of his own, the light dawned. He realized that their culture was one in which they would never interrupt with questions. By their traditions, many of them would never ask, even afterward - the greater offense would be to question such a person as him, so they accepted their lot and held silent.

This CEO realized then that any tool that could help them better talk together was going to be essential if this partnership were to succeed. That's when he had the link converted to wideband audio, HD Voice, for the next meeting.

So I'm adding "culture" to the master list of why-HD-is-essential. It's another element of how people talk, and it can get in the way when they can't talk clearly.
