Talking to Machines: Automatic Speech Recognition
Even if we don’t realize it, most of us talk to machines on a fairly regular basis. If you have an iPhone, you’ve probably asked Siri about restaurants in your area or, quite possibly, why she’s not better at answering your questions (the answer was probably not helpful). Those without iPhones still talk to machines whenever they call a large organization and use voice prompts to navigate the answering system.
What is this strange, wonderful, and sometimes frustrating technology? It’s called automatic speech recognition, and it’s only going to be used more and more as we move forward, so you’d better learn how it works and how best to deal with it.
There’s More Than One Kind of ASR
All ASR technology works on the same basic principle: it captures a waveform when you speak into a microphone, filters out background noise, and breaks the waveform down into individual phonemes, such as the hard “k” sound in “key.” Starting with the first phoneme in a given word, the software then uses both context and statistical probability to work out what you most likely said. But one of the most important things to learn about automatic speech recognition technology is that there are two different types of ASR, and you can’t interact with them in the same way.
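To make the first two steps a little more concrete, here is a minimal Python sketch of chopping a waveform into short frames and keeping only the ones loud enough to be speech. It is a crude energy gate, not what a real ASR product does, and everything in it is an assumption made for illustration: the synthetic signal, the function names, the frame size of 200 samples, and the 0.1 loudness threshold.

```python
import math
import random

def make_signal(sample_rate=8000):
    """Hypothetical test signal: 0.1 s of faint noise, 0.1 s of a loud tone, 0.1 s of faint noise."""
    rng = random.Random(0)  # fixed seed so the example is repeatable
    n = sample_rate // 10
    noise = lambda: [rng.uniform(-0.02, 0.02) for _ in range(n)]
    tone = [0.5 * math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(n)]
    return noise() + tone + noise()

def frames(samples, size):
    """Split the waveform into consecutive fixed-size frames."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def rms(frame):
    """Root-mean-square loudness of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def speech_frames(samples, size=200, threshold=0.1):
    """Return the indices of frames loud enough to be treated as speech."""
    return [i for i, f in enumerate(frames(samples, size)) if rms(f) >= threshold]

print(speech_frames(make_signal()))  # [4, 5, 6, 7]
```

Only the four frames in the middle, where the loud tone stands in for a voice, survive the gate. The quiet frames on either side are discarded before any attempt is made to identify phonemes.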
Natural Language Processing (NLP) ASR. Siri is the perfect example of this kind of ASR technology. Through the use of statistical inference and “machine learning,” NLP software draws on real-world examples of speech it has been trained on to make an educated guess at what you’re saying. It also “remembers” specific phrases and combinations of words that you tend to use together frequently.
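The “educated guess” idea can be sketched in a few lines of Python. The toy model below simply counts which words follow which in some example sentences, then uses those counts to choose between words that sound alike. The class name, the example sentences, and the counting approach are all assumptions for illustration; real assistants use far larger and more sophisticated models.

```python
from collections import Counter

class BigramGuesser:
    """Toy model that counts which word pairs appear together in example text."""

    def __init__(self):
        self.pair_counts = Counter()

    def learn(self, sentence):
        """Record every adjacent word pair in one example sentence."""
        words = sentence.lower().split()
        self.pair_counts.update(zip(words, words[1:]))

    def guess(self, previous_word, candidates):
        """Pick the candidate seen most often after previous_word (ties go to the first candidate)."""
        return max(candidates, key=lambda w: self.pair_counts[(previous_word.lower(), w)])

model = BigramGuesser()
for example in ["turn right at the light", "turn right here", "write a note", "the call flow failed"]:
    model.learn(example)

print(model.guess("turn", ["write", "right"]))  # right
print(model.guess("call", ["flow", "flo"]))     # flow

# "Remembering" a phrase this particular user says often shifts the guess.
model.learn("call flo")
model.learn("call flo")
print(model.guess("call", ["flow", "flo"]))     # flo
```

The last three lines are the “remembering” part: once the model has heard this user say “call Flo” a couple of times, that pairing outweighs the general examples.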
Directed Dialog ASR. This type of ASR is the one that most of us have probably been using for longer. It’s the one that you encounter when you call in to businesses and have to talk your way through the voice prompts of their phone system. Directed dialog ASR technology works by having a specific set of expected words and phrases programmed in, and it will only respond to those sounds. For example, it will get you to the right place if you say “representative,” but you might have trouble if you just say “I want to talk to a person!” Directed dialog technology isn’t designed to learn and evolve, so you have to adapt to it.
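Here is a minimal Python sketch of why this kind of system feels so rigid. It assumes the speech has already been turned into text and that the phone system only knows a small menu of keywords. The menu entries and destination names are hypothetical.

```python
# Hypothetical menu: the only words this phone system listens for.
MENU = {
    "representative": "agent_queue",
    "billing": "billing_menu",
    "hours": "store_hours",
}

def route_call(utterance):
    """Return a destination only when the caller says exactly one expected keyword."""
    keyword = utterance.strip().lower().rstrip(".!?")
    return MENU.get(keyword)  # None stands for "Sorry, I didn't get that"

print(route_call("Representative"))               # agent_queue
print(route_call("I want to talk to a person!"))  # None
```

Nothing in the lookup tries to work out what the caller means, which is why the second request goes nowhere even though a human would understand it instantly.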
Automatic Speech Recognition Still Has Limitations
Under ideal circumstances, automatic speech recognition software is accurate 96 percent of the time. That’s pretty amazing. Unfortunately, human beings are rarely in “ideal circumstances” when using this kind of technology, so most of us have probably experienced a level of accuracy that’s far less than 96 percent. Here are several things that can cause problems with ASR.
Background noise. If you’re walking around looking for a place to eat and you’re next to a busy road, there’s a good chance that all of those cars rushing by will make it hard for your ASR program to understand what you’re saying.
Multiple speakers. Ever tried asking Siri a question when you’re in a crowd? ASR programs are designed to single out voices and minimize background noise. Unfortunately, they’re not as good at distinguishing between two or more speakers, especially if those speakers are relatively close to the device microphone.
Indistinct speaking. People who have trouble enunciating, who speak with an accent, or who have other issues that make them difficult to understand cause problems for ASR programs.
Low-quality hardware. Even if you’re using the most advanced ASR program in the world, you’re likely to have trouble if the microphone that you speak into is cheaply made.
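The article doesn’t say how the 96 percent figure is measured. One common yardstick for this kind of software is word error rate: the number of word substitutions, deletions, and insertions needed to turn the machine’s transcript into what was actually said, divided by the number of words actually spoken. The Python sketch below works that out for a made-up example; the function name and both sentences are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edits needed to match the reference words seen so far to the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

said = "find a place to eat near me"
heard = "find a plate to eat near"   # one wrong word, one dropped word
print(round(word_error_rate(said, heard), 2))  # 0.29
```

Two mistakes in a seven-word request already give an error rate of about 29 percent, which shows how quickly a noisy street or a cheap microphone can drag accuracy well below the ideal figure.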
Despite all this, ASR technology will only get better and better as time goes on. To learn more about this fascinating technology, check out this infographic from West Interactive.