News & Analysis
Architecture eyes network, local speech
Phil Shinn, Speech Scientist, HeyAnita Inc., Los Angeles; Alan Schwartz, Vice President, Business Development, SpeechWorks, Los Angeles
3/4/2002 7:06 AM EST
Speech technology and telematics are teaming up to enable car drivers and passengers to browse Internet-based content with their voices, plugging them into customized information and entertainment.
Speech recognition systems are based on complex statistical models that are able to characterize the properties of the sounds of the language to be recognized. This is an especially challenging problem because of the large degree of variability introduced along the way. Things such as accents, speaking style, voice quality, background noise, telephone handset variability and transmission channel differences all affect the properties of the signals that reach the speech recognition system. It is the job of the statistical models, which are trained on large amounts of real speech, to take into account all of the factors described above in order to judge the probability that a given segment of speech is a particular phoneme.
A speech recognition system comprises a number of components that work together to deliver effective applications. First, when a caller speaks, the system captures the utterance and digitizes it.
In order to support "barge-in," which allows the user to interrupt the prompt, the system must perform echo cancellation to remove the echo of the outgoing prompt from the signal, and must support speech detection in the presence of noise and any residual echo from the outgoing prompt. In order for barge-in to be effective, these algorithms must detect speech and cut off the outgoing prompt very quickly, ideally within 100 milliseconds from the beginning of the user's speech.
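The speech-detection half of barge-in can be illustrated with a minimal sketch. It assumes echo cancellation has already been applied and uses a simple energy threshold over 10 ms frames; the frame length, the 12 dB margin and the three-frame rule are illustrative assumptions, not values from any deployed system.

```python
import math

FRAME_MS = 10  # assumed frame length in milliseconds


def frame_energy_db(frame):
    """Return the mean energy of one frame of samples, in decibels."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)


def detect_barge_in(frames, noise_floor_db, margin_db=12.0, min_frames=3):
    """Return the time in ms at which the prompt should be cut, or None.

    Speech is declared once `min_frames` consecutive frames exceed the
    noise floor by `margin_db`. Requiring several frames in a row keeps
    a single loud click from cutting off the prompt.
    """
    run = 0
    for index, frame in enumerate(frames):
        if frame_energy_db(frame) > noise_floor_db + margin_db:
            run += 1
            if run >= min_frames:
                return (index + 1) * FRAME_MS
        else:
            run = 0
    return None
```

With these assumed settings the detector needs 30 ms of sustained speech before it fires, which leaves room inside the 100-millisecond target for cutting off the prompt.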
Spectral representation describes the way the caller's spoken words have been broken into individual frequency components.
Spectral representation first converts the signal into the spectral domain (energies over time in 128 different frequency bands) and then maps that signal onto a nonlinear spectral scale, which mimics the way the human ear works. A number of techniques are used in the spectral-representation stage to reduce the variability caused by noise and channel conditions.
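The two steps described above can be sketched in a few lines. The example computes energies in 128 linear frequency bands for one frame and then pools them onto a nonlinear scale. The article does not name the scale, so the mel formula used here is an assumption, as are the function names, the 20 output bands and the simple rectangular pooling (production front ends typically use overlapping filters and a fast Fourier transform).

```python
import bisect
import math


def hz_to_mel(hz):
    """One common mel-scale formula (an assumption; the article names no scale)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def linear_band_energies(frame, num_bands=128):
    """Energy in num_bands linearly spaced bands, via a naive DFT.

    Expects len(frame) == 2 * num_bands so the bands span 0 Hz to Nyquist.
    """
    n = len(frame)
    energies = []
    for k in range(num_bands):
        re = sum(x * math.cos(2.0 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(2.0 * math.pi * k * t / n) for t, x in enumerate(frame))
        energies.append(re * re + im * im)
    return energies


def mel_band_edges(num_mel_bands, max_hz):
    """Band edges in Hz, equally spaced on the mel scale."""
    top = hz_to_mel(max_hz)
    return [mel_to_hz(top * i / num_mel_bands) for i in range(num_mel_bands + 1)]


def pool_to_mel(energies, sample_rate, num_mel_bands=20):
    """Sum linear-band energies into mel-spaced bands (rectangular pooling)."""
    max_hz = sample_rate / 2.0
    hz_per_band = max_hz / len(energies)
    edges = mel_band_edges(num_mel_bands, max_hz)
    pooled = [0.0] * num_mel_bands
    for k, energy in enumerate(energies):
        band = bisect.bisect_right(edges, k * hz_per_band) - 1
        pooled[min(band, num_mel_bands - 1)] += energy
    return pooled
```

Because the mel-spaced bands are narrow at low frequencies and wide at high frequencies, the pooled output gives more resolution to the region where the ear discriminates best.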
Statistical models translate the spectral representation into phonemes. Complex statistical models are able to distinguish dynamic properties of the individual speech sounds, thus increasing overall accuracy.
Phonetic modeling measures the properties of the speech signal and determines the most probable distribution of each of the phonetic units, using complex statistical models that are trained on large amounts of previously collected speech data. This training sets the millions of parameters in the statistical model to allow the model to best match the characteristics of each of the speech sounds. The previous stages are able to take a speech signal and produce a network of possible segmentations, each with associated probabilities of every possible speech sound. The job of the search stage is to compare that with every possible thing the user might have said and then find the best match.
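A toy version of the search stage makes the idea concrete. This sketch assumes the earlier stages have already produced a single segmentation with one probability table per segment, and that each candidate phrase has exactly one pronunciation; real recognizers search a whole network of segmentations. The vocabulary, phoneme labels and probabilities below are invented for illustration.

```python
import math


def score(pronunciation, segment_probs):
    """Log-probability that the segments spell out this pronunciation."""
    if len(pronunciation) != len(segment_probs):
        return float("-inf")  # cannot align in this simplified model
    total = 0.0
    for phoneme, probs in zip(pronunciation, segment_probs):
        # Unseen phonemes get a small floor probability instead of zero.
        total += math.log(probs.get(phoneme, 1e-6))
    return total


def best_match(vocabulary, segment_probs):
    """Return the vocabulary entry whose pronunciation scores highest."""
    return max(vocabulary, key=lambda word: score(vocabulary[word], segment_probs))


# Hypothetical vocabulary and phonetic-model output.
vocabulary = {
    "yes": ["y", "eh", "s"],
    "yet": ["y", "eh", "t"],
    "no": ["n", "ow"],
}
segment_probs = [
    {"y": 0.8, "n": 0.2},
    {"eh": 0.7, "ow": 0.3},
    {"s": 0.6, "t": 0.4},
]
result = best_match(vocabulary, segment_probs)
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long utterances.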
Deploying successful speech applications in the car is challenging because of the noisy acoustic environment. Noise sources include road bumps, wind, the radio, windshield wipers, turn signal indicators, engine noise, other cars, tire whine and cabin resonances. In addition, the hands-free microphone and speaker systems are acoustically quite different from the sorts of microphones and speakers for which most speech systems are designed.
The microphones in the car, for example, generally are not close-talking microphones, and therefore pick up more background noise. Hands-free systems also use echo suppression algorithms that limit input sensitivity while the system is speaking to the driver. This creates problems when users interact with the system, especially when they try to speak while the system is still speaking.
Companies have been looking for ways to improve performance for the in-car environment. SpeechWorks, for instance, has recently built speech recognition models and processing methodologies specifically tuned to cell phone use in the car.
Working with engineers at HeyAnita, SpeechWorks launched an acoustic-modeling project. The engineers first ran a test on audio data collected from within a moving car, then created new acoustic models and measured performance on a held-out data set. Performance improved dramatically.
Algorithm developed
Further analysis of barge-in errors showed that the car's turn signals interacted strongly with the barge-in algorithms. The result: An endpointing algorithm was developed specifically for the in-car environment.
SpeechWorks is attempting to improve in-car performance using a new breed of speech recognition solutions called Distributed Speech Recognition. DSR is an architecture that allows applications to combine local speech processing that occurs entirely on the device with remote access to network-based speech services. With DSR, signal processing, including noise reduction, occurs on the device, which then sends data over a digital network to a network-based speech service. The network-based speech service processes the resulting signal to determine the user's request and responds to the caller using a voice output or visual display, or both.
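The device/network split can be sketched as a pair of functions: one that runs on the device and packs already-computed feature values into a compact payload, and one that runs on the network side and unpacks them for recognition. The packet layout here (a 16-bit count followed by 16-bit quantized values) is a hypothetical illustration and is not the format defined by any DSR standard.

```python
import struct


def device_encode(features, scale=100.0):
    """Device side: quantize features to signed 16-bit values and pack them.

    Layout (assumed): little-endian uint16 count, then `count` int16 values.
    """
    quantized = [max(-32768, min(32767, round(v * scale))) for v in features]
    return struct.pack(f"<H{len(quantized)}h", len(quantized), *quantized)


def network_decode(payload, scale=100.0):
    """Network side: unpack the payload back into approximate feature values."""
    (count,) = struct.unpack_from("<H", payload, 0)
    values = struct.unpack_from(f"<{count}h", payload, 2)
    return [v / scale for v in values]


# Hypothetical feature vector for one frame of speech.
features = [1.25, -3.5, 0.0, 12.34]
payload = device_encode(features)
decoded = network_decode(payload)
```

Sending quantized features instead of audio keeps the payload small (10 bytes for this frame) and means the noise reduction done on the device is not undone by a lossy voice codec on the way to the recognizer.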
Industry efforts are under way to standardize the DSR approach. For example, ETSI, the European Telecommunications Standards Institute, has already issued a first version of a standard called Aurora. An updated version of Aurora is expected this year.



