Design Article

Speech Recognition Technology Backgrounder

Andrew W. Davis

11/17/1996 12:00 AM EST

 
 
While algorithm developers are continuously working to improve the accuracy and speed of their recognizers, DSP chip vendors are also evaluating architectures that lend themselves more readily to speech problems, and designers of products ranging from TV and VCR programmers to cellular telephones, desktop computers, dedicated dictation machines, and vehicle control systems are investigating or deploying speech systems as the next generation interface.

Automatic Speech Recognition (ASR) is also an exciting business and profit opportunity for VARs and OEMs to provide their customers with a new user interface, one which promises natural speech as a substitute for pointing, clicking, and keyboard or keypad entries. The concept of speech recognition is simple. The user's speech pattern is analyzed by a computer, matched with an existing database, and then replaced by the ASCII ones and zeros which the computer and all its programs understand so well.

ASR technology is not new. The fundamentals of speech and speech recognition are as old as the computer industry itself. In the late 60s, Stanley Kubrick's movie 2001: A Space Odyssey featured the HAL computer with incredible ASR and text-to-speech (TTS) capabilities. While HAL remains a technological pipe-dream even today, the fact is that tremendous advances have been made in the ASR field in the past few years, advances which have improved performance and dropped implementation costs dramatically. In the past five years, while the performance available from desktop computers has gone up by a factor of 100-200, speech algorithms have improved to actually reduce the computational power needed to process a given vocabulary by a factor of 20. The combined result is an improvement factor of up to 4000:1. So, even though we have not yet approached the science-fiction ideal of unconstrained continuous-speech dictation, there is no question that that is the direction in which we are heading.

Speech developers are still searching for the holy grail of machine interfaces, a recognition engine which understands any speaker, interprets natural speech patterns, remains impervious to background noise, and has an infinite vocabulary with contextual understanding. However, practical product designers, OEMs, and VARs can indeed use today's speech recognition engines to make major improvements to today's mainstream markets and applications. Selecting such an engine for any product requires understanding how the speech technologies impact performance and cost factors, and how these factors fit in with the intended application.

ASR is commonly described as converting speech to text. The reverse process, in which text is converted to speech (TTS), is known as speech synthesis. Speech synthesizers often produce results that are not very natural sounding. Speech synthesis is different from voice processing, which involves digitizing, compressing (not always), recording, and then playing back snippets of speech. Voice processing results are very natural sounding, but the technology is limited in flexibility and is disk storage-space-intensive compared to speech synthesis.


Speech Recognition Technology Overview

Every speech recognition product can be evaluated along a multi-dimensional price-performance-features matrix:

  • Speaker-Independent vs. Speaker-Dependent Technology
    This is probably the "great divide" in the speech recognition field. Speaker independent means that in theory any individual can speak commands to the computer, without having to "train" the system for his voice. The cornerstone of this technology is that rather than having one person train the system, hundreds or thousands of people do so as part of the recognition engine development cycle. The speech product is delivered to the end user in a working state; the computer performs statistical matches between what any speaker says and the "canned" library of speech patterns. Speaker-independent approaches are the only ones that make sense when an ASR interface is being used for access or use by the general public. You could not expect every user who calls in to an information service to go through the bother of first training the recognition engine on his voice. Apple, IBM, Microsoft and other computer vendors are pursuing this technology for office applications.

    Speaker dependent means that operators have to "train" the system by speaking the words to be used. This entails an extra set-up operation at the end user site; every individual operator of the speech system must train the system to recognize his/her own voice, speaking the words that are to be recognized in the final application. This can take from ten minutes to three hours, but need be done only once per operator. Fortunately, there are many applications where the list of words to be recognized, and hence trained-on, is small. The benefit accrued from this training task is that speaker dependent systems provide on-line performance advantages in speed, accuracy, and industry-specific vocabulary applications. Speaker-dependent systems are impervious to accents and also work better in high noise environments like factories.

    Some speaker independent systems are now marketed as "adaptive." While they work "out-of-the-box" and require no training, they also "learn" and modify themselves to recognize the patterns of a specific user. These adaptive systems are not suited to systems used by dozens or hundreds of users (such as those accessed over the phone system) since they would be in constant learning mode.

  • Discrete vs. Continuous Recognition
    Discrete systems force the operator to pause between words or phrases, whereas continuous products aim to recognize a natural speech cadence. However, even continuous systems face a practical limit on operator speed. In fact, the distinction between discrete and continuous speech recognition products is not sharp; there is a continuum of "discreteness" in the products and technologies available. Sometimes people refer to products in the middle as "connected"; this means that at least 50 milliseconds of silence must separate words, whereas discrete systems may require up to five times as long. Often the "discreteness" of a system is dictated by its recognition engine's response time; it would be too confusing to speak with 25 millisecond gaps if the recognizer required 35 milliseconds to recognize each word. In applications like voice response, where short answers are generally uttered in response to programmed questions, discrete recognition is appropriate.

  • Small Vocabulary vs. Large Vocabulary
    Like the issue of "discreteness", vocabulary size is also continuous. The real issue is how big a vocabulary is needed by the application and how much of the vocabulary can be made active at one time. For example, an office dictation application might require a vocabulary of 10,000 words while an industrial inspection task might require only 200 words, and a simple IVR program 50 words. The maximum number of words active at one time can depend on memory available, accuracy required, and response time needed by the different applications.

  • Portable vs. Non-Portable Hardware Systems
    Some voice recognition applications such as manufacturing inspection or environmental monitoring require portable hardware. This places size, power, and memory limitations on the design that necessitate other compromises. Other systems can be mainframe based, or take advantage of stationary desktop computers to use more powerful processors and memory intensive algorithms to achieve higher levels of ASR performance.

  • Close vs. Remote Talking
    In close talk applications, the speaker is close to the microphone or headset. In far talk applications the speaker is remote from the microphone; echo and noise conditions are special problems the recognizer must overcome. Telephony applications are a special set of close-talk, because although the speaker is in fact close to the microphone, the variability of microphone quality is high, and the connection between the speaker and the recognition engine is via the phone system, which presents an entirely different set of bandwidth and noise problems.
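The difference between "connected" and "discrete" operation described above comes down to how long a silence the recognizer needs before it treats the next sound as a new word. The following is a minimal sketch of that idea, not any vendor's actual endpoint detector. The function name split_on_silence, the amplitude threshold, and the 8000 samples-per-second rate are assumptions made for illustration.

```python
SAMPLE_RATE = 8000  # samples per second (assumed, telephone-grade audio)

def split_on_silence(samples, min_gap_ms, threshold=0.05, sample_rate=SAMPLE_RATE):
    """Return (start, end) index pairs for runs of speech that are separated
    by at least min_gap_ms of silence. Shorter pauses stay inside one run."""
    min_gap = int(sample_rate * min_gap_ms / 1000)  # required silence, in samples
    segments = []
    start = None      # index where the current segment began
    last_loud = None  # index of the most recent non-silent sample
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            if start is None:
                start = i
            elif i - last_loud > min_gap:
                # The silent run was long enough: close the segment, open a new one.
                segments.append((start, last_loud + 1))
                start = i
            last_loud = i
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments

# Hypothetical input: two 200 ms "words" separated by a 60 ms pause.
word = [0.5] * 1600
pause = [0.0] * 480
utterance = word + pause + word

connected = split_on_silence(utterance, min_gap_ms=50)   # 50 ms gap is enough
discrete = split_on_silence(utterance, min_gap_ms=250)   # five times as long
print(len(connected), "segments with a 50 ms rule")
print(len(discrete), "segment(s) with a 250 ms rule")
```

With the 50 millisecond rule the 60 millisecond pause separates the two words; with the 250 millisecond rule the same audio is treated as a single utterance, which is why a speaker must pause longer for a discrete system.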

Figure 1:  Five major factors in designing a speech recognition system

Attribute                            Speaker Independent  Speaker Dependent
Recognition Accuracy                 Moderate             High
Operation in High Noise Environment  Poor                 Good
Operator Familiarization Required    Maybe                Yes
Vocabulary Training Required         No                   Yes
Suitability for Continuous Speech    Low-Moderate         High
Vocabulary Size                      Large                Medium
Vocabulary Flexibility               Maybe                Yes
Memory Requirements                  Higher               Lower
Recognition Speed                    Lower                Higher
Support for Accents and Dialects     No                   Yes

Table 1:  Comparison of speaker independent and speaker dependent attributes

One useful feature of ASR systems is referred to as 'cut through', the ability to recognize a response when it overlaps or even precedes a prompt; this is helpful in many applications where the operator is cycling through the same routines, or is familiar with the questions and doesn't need to wait for the entire prompt before answering.

The five major attributes described above all boil down to one issue in operation: recognition accuracy. Speech recognizers make three types of errors. Substitution errors occur when the machine substitutes an incorrect word for the spoken word. You said "apple" and it recognized "orange." A rejection error occurs when the recognizer does not classify a spoken word but rejects it instead. You said "seven" and the recognizer responded, "sorry, I didn't understand you." A spurious response error occurs when the recognizer classifies a sound, noise or invalid word (not part of the acceptable vocabulary) as a valid word. For example, the recognizer identifies a door slam as a "yes" or a cough as a "no."
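The three error types above can be tallied from a log of what was actually said versus what the recognizer reported. The sketch below is only an illustration of the definitions; the classify function, the trial data, and the use of None to stand for "noise or out-of-vocabulary input" and "recognizer rejected the input" are all assumptions, not part of any particular product.

```python
from collections import Counter

def classify(spoken, recognized):
    """Label one trial.
    spoken:     the vocabulary word actually said, or None for noise or an
                invalid (out-of-vocabulary) word.
    recognized: the word the recognizer reported, or None if it rejected."""
    if spoken is None:
        # Nothing valid was said: any reported word is a spurious response.
        return "correct_rejection" if recognized is None else "spurious"
    if recognized is None:
        return "rejection"
    return "correct" if recognized == spoken else "substitution"

# Hypothetical test log mirroring the examples in the text.
trials = [
    ("apple", "orange"),  # substitution
    ("seven", None),      # rejection
    (None, "yes"),        # door slam heard as "yes"
    (None, "no"),         # cough heard as "no"
    ("file", "file"),
    ("print", "print"),
    (None, None),         # noise correctly ignored
]

tally = Counter(classify(spoken, recognized) for spoken, recognized in trials)
for outcome, count in sorted(tally.items()):
    print(f"{outcome}: {count}")
```

Keeping the three counts separate matters in practice, because lowering the rejection threshold typically trades rejection errors for substitution and spurious-response errors.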

Accuracy or error rate is largely a function of computational resources. For example, to achieve the same level of accuracy, you need more CPU and algorithm horsepower with a continuous speech system vs. a discrete system, and you need more horsepower as the size of the active vocabulary increases. You also need more horsepower if frequency bandwidth is limited and if line echo is present. Available horsepower, of course, is a function of budget, which is a function of target application.


Three Major Markets for Speech Recognition

The different types of technologies used in speech recognition are aimed at three broad classes of applications. Each class has different price, performance, and feature sets.

  • Office
    Many voice systems intended for office applications are offered as enhancements to the basic desktop PC. Most of these are based on speaker-independent technology and use headsets or desktop microphones. Some products require add-in boards with special digital signal processing chips; others use the chips available on sound cards; still others use software-only techniques that employ the host CPU for the speech algorithms. This approach is guaranteed to become more popular and more practical as the processing power of desktop computers continues its evolution, as Pentium and PowerPC chips replace older architectures. The resultant differences include speed of speech recognition and vocabulary size. Voice recognition is also used in many devices to aid the physically handicapped: keyboard and mouse replacements for data entry, wheelchair and appliance control, etc. Office products are also available as dedicated hardware and software systems for customers in dentistry, mammography, radiology, pathology, legal, and other identified niches. These solutions are often sold as automated dictation systems.

  • Industrial
    Industrial applications usually represent a different design center for the technology; devices must work in rugged, often noisy environments; they must be accurate to avoid early operator frustration; and they must be easy to use, since the voice operator is always busy doing something else, such as driving a forklift, or working a shipping/receiving line. Industrial voice products are usually also portable. Some use radios to send data to host PCs or mainframe computers. Horsepower requirements also impact design for portability. Hence, for many industrial applications where workers require portable, battery-powered voice terminals, speaker dependent technology is a better choice. Speaker dependent speech requires less CPU power, is suitable for continuous speech, and easily accommodates job-specific vocabulary.

  • Telephony
    Voice recognition over the telephone is a major application of the technology. ASR can be used where the caller does not have a touch-tone phone, still very common outside the U.S., or for car-phone services, where hands-free, eyes-free operation has important benefits, or for applications where the vocabulary doesn't map well onto the limited sixteen-pad touch-tone phone. (How would you order a blue button-down oxford-cloth dress shirt in size 15 1/2 x 34 using a touch-pad?)

    The reduced bandwidth of a telephone line, the poor quality of many telephone microphones, line problems such as echo, static, and background noise combine to create a far more difficult environment for telephone-based speech recognition than is typical of office or even industrial environments. Many of these factors (such as line quality and microphone quality) are usually outside the control of the speech system designer so the system must be designed to handle the worst cases. Hence many over-the-telephone voice recognition systems seem to be much more limited than those intended for office and industrial use. The problem of cellular telephone voice recognition is even more demanding. All telephone products employ high horsepower electronics and are typically shared over multiple phone lines.


Market and Technology Fit

The different types of speech recognizers can be visualized on a diagram whose five axes correspond to the five attributes described above. The figure maps the design features of an ASR product optimized for telephony applications. For comparison, an ASR optimized for industrial inspection is also presented. Neither product is necessarily better than the other; they are just different, and intended for different operating environments.

Figure 2a:  ASR Engine Optimized for Telephony Application

Figure 2b:  ASR Engine Optimized for Industrial Inspection Application


Speech Recognition and Telephony

The Telephone Network
Probably the biggest market today for speech recognition systems is the general area known as telephony. Telephony-based applications of ASR are special. In a stand-alone or desktop configuration, the speaker is directly connected to the recognition engine and uses the same microphone all the time; in a telephony configuration, the public switched telephone network is in the middle and a wide variety of microphones are in use.

Figure 3:  Telephone access to a speech recognition system

Figure 4:  The phone system as seen by the recognition engine

Discrimination of words over the telephone network is a challenging task. Recognition success depends on how well and how consistently the speech signals can be analyzed and identified after having passed through a variety of different types and qualities of telephone microphones and switched network line connections. From the ASR engine perspective, a human voice produces a signal in the presence of background noise. This noise could be from office equipment or from a passing truck. A telephone microphone with unknown acoustic properties converts the speech and background noise into an electrical audio signal with various amounts of echo. Network noise is added and the signal is also filtered to cut out frequencies below 300 Hz and above 3300 Hz. The bandwidth filtering is designed to eliminate unwanted signals that could disturb conversations or cause errors in control signals; it also is the reason why music over the phone sounds "tinny" and not very rich. To the extent that speech recognizers rely on frequency analysis techniques (see below), the phone filters make the recognition task more difficult. In addition, variable equalizers in the system are used to bring different lines up to common transmission standards. Hence, ASR engines for telephony need to be designed or optimized for robustness, for insensitivity to the wide range of speakers, microphones, and line qualities present.
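To see what the 300 Hz to 3300 Hz filtering removes, it can help to list the harmonics of a voiced sound and check which ones fall inside the passband. The sketch below treats the phone line as an ideal "brick-wall" filter, which real networks are not, and assumes a 110 Hz fundamental (a plausible low-pitched voice) purely for illustration; the function names are hypothetical.

```python
PASSBAND_HZ = (300, 3300)  # telephone passband quoted in the text

def harmonics(fundamental_hz, up_to_hz):
    """Frequencies of the harmonics of a voiced sound, up to a ceiling."""
    count = int(up_to_hz // fundamental_hz)
    return [fundamental_hz * k for k in range(1, count + 1)]

def through_phone_line(freqs, passband=PASSBAND_HZ):
    """Keep only the components an idealized phone line would pass."""
    low, high = passband
    return [f for f in freqs if low <= f <= high]

voice = harmonics(110, 4000)          # assumed 110 Hz pitch, components to 4 kHz
kept = through_phone_line(voice)
lost = [f for f in voice if f not in kept]

print("components before filtering:", len(voice))
print("components after filtering: ", len(kept))
print("removed:", lost)
```

Under these assumptions the fundamental and its second harmonic never reach the recognizer, along with everything above 3300 Hz, so a recognizer that relies on frequency analysis must work from an incomplete spectrum.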

Examples of telephony-based applications of speech recognition:

  • Home banking applications allow customers to query their account balances and the status of which checks have cleared, etc.
  • Shopping programs allow customers to specify products, enter credit card numbers, addresses, and other pertinent information. Shopping can be a 24 hour/day activity, from anywhere where there is a phone line.
  • Companies can fax back product literature in response to voice requests. Customers can quickly specify the exact literature they need without wading through layers of menu selections. The same approach works for technical support.
  • Queries can also be handled in response to voice prompts. Transportation companies can provide schedule information (What day are you traveling on? etc.) without requiring users to enter an endless chain of keypad responses. This same approach is being used by entertainment companies.
  • Advanced messaging systems will forward voice calls, email, and faxes after receiving voice commands.
  • In call routing, the caller may be asked to say what department or individual he is calling and then the call is automatically sent to the right extension.
  • Cellular phone services will provide voice activated calling after a user dials in to the central site, most likely via a speed dial button on the cellular phone. "Call Andrew Davis" will then be all that is needed for call completion. Mobile drivers will not have to take their eyes off the road.
  • Telephone companies can provide directory assistance for businesses as well as for residences. Users speak the name of the city and the name of the business; the computer responds with the telephone number.
  • Simple queries that result in simple decisions can be completely automated. "Will you accept the collect call from XXX" can be totally automated, reducing the costs of providing phone services.
  • Brokerage and stock exchanges are looking at voice activated systems to track orders and enter them into the system.


ASR System Architectures

Continuous progress in ASR algorithms has reduced the error rate, while making larger vocabularies more practical. At the same time, improvements in processor technology have made it possible to run many ASR tasks on general purpose computers, rather than on specialized digital signal processor (DSP) chips. For ASR tasks of a "personal" nature, in other words, single user, today's Pentium and PowerPC chips are quite capable of running speech recognition software with reasonable performance. This is known as native signal processing (NSP). Many systems today in fact ship with some ASR software bundled in via the operating system, via a separate applications program, or via an add-on enhancement to a word processor or spreadsheet application.

However, today's desktop hardware and operating systems are unable to handle ASR for more than a few simultaneous users (perhaps not even two, and definitely not more than four). Hence, for commercial applications (see below), dedicated speech recognition hardware (using DSP chips of some sort) is invariably preferred. Figure 5 shows a configuration where multiple end users might dial in to an ASR-based application where the ASR resources are shared across a network. Note that the ASR engine itself may be shared across multiple users either via multitasking or via timesharing, but the ASR task is offloaded to one or more DSP subsystems. A DSP peripheral with appropriate software can provide the real-time speech performance needed by today's market applications while a general purpose processor is best suited for managing the speech resources, user interface, and high level applications programs.

Figure 5:  Multiuser, network-based telephony ASR

Today's DSP chips have many features which maximize speech performance, including single-cycle multiply-accumulate instructions and parallel fixed-point and floating-point operations. Some DSP chips can run more than one recognizer at the same time, and some DSP boards provide more than one DSP chip per slot. For large commercial ASR configurations, the robustness, speed, and accuracy of DSP-based systems make them the preferred choice. Some speech recognition systems are dedicated cards that act as shared resources among multiple channels. Other speech systems are built directly onto a voice card.


How ASR Works

There are almost as many technical approaches to the science of speech recognition as there are researchers in the field, and a detailed analysis of the theories is beyond the scope of this article. But it is useful to consider a brief technical overview. Suppose the task is to recognize the phrase "file print". Using one typical approach, the operator speaks into a microphone or headset; the analog voice signal is digitized, usually at 8000 samples per second, and broken into blocks of approximately 200 samples; the blocks of digital data are then transformed from the time domain into the frequency domain by a DSP chip. The amplitudes of perhaps 20 frequency components represent the "profile" or template for the phrase "file print". In a speaker-independent system, the profile is developed by looking at thousands of speech samples from many people and deriving some sort of average. (From a practical point of view, it is possible to buy pre-recorded samples of thousands of people uttering phrases in a "standardized", phonetically-balanced American English and other major languages of the world. This eliminates one source of tedium for the developer of recognition engines.) The deployed system may work fine in Topeka, but will probably have trouble in some neighborhoods in Boston, and may be totally unsuitable for Sydney. An obvious point: the accuracy of the speaker independent engine will improve the closer the user audience is to the training base. Adaptive systems are now available which allow the initial library to be expanded or modified to account for specific user speech patterns.
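The front-end steps just described (digitize at 8000 samples per second, cut into 200-sample blocks, transform to the frequency domain, keep about 20 amplitudes) can be sketched in a few lines. This is a simplified illustration, not the method of any particular recognizer: it uses a direct discrete Fourier transform where a DSP chip would use a fast transform, and the choice of 20 equal-width bands and all function names are assumptions.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second, as in the text
BLOCK_SIZE = 200    # samples per block (25 ms at 8000 samples per second)
NUM_BANDS = 20      # frequency components kept per block

def split_into_blocks(samples, block_size=BLOCK_SIZE):
    """Cut the sample stream into consecutive full blocks; drop any remainder."""
    count = len(samples) // block_size
    return [samples[i * block_size:(i + 1) * block_size] for i in range(count)]

def dft_magnitudes(block):
    """Magnitudes of DFT bins 0 .. N/2 - 1, computed directly (O(N^2))."""
    n = len(block)
    mags = []
    for k in range(n // 2):
        acc = sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def band_profile(block, num_bands=NUM_BANDS):
    """Average the bin magnitudes into num_bands equal-width frequency bands."""
    mags = dft_magnitudes(block)
    per_band = len(mags) // num_bands
    return [sum(mags[b * per_band:(b + 1) * per_band]) / per_band
            for b in range(num_bands)]

# Stand-in for digitized speech: 0.1 s of a 1000 Hz tone.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(800)]
blocks = split_into_blocks(tone)
profile = band_profile(blocks[0])
strongest = profile.index(max(profile))
band_width_hz = SAMPLE_RATE / 2 / NUM_BANDS
print(len(blocks), "blocks; strongest band starts at", strongest * band_width_hz, "Hz")
```

Each block yields one 20-number profile; a template for a phrase such as "file print" would be built from the sequence of such profiles over the whole utterance.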

With a speaker-dependent system, the profile is specific to the individual operator. During the training session, the speech system will require the operator to repeat every needed phrase several times until the system is reasonably sure that it has a consistent pattern for that operator speaking that phrase. So if the end objective is to have the speech system recognize every one of the 5000 stocks listed on the NYSE and AMEX, the training requirements may be unacceptable. Training for the 26 individual alphabetic characters is of course no problem.

Once the speech recognition engine has digitized the incoming speech and calculated the required parameters, the system must compare the inputs to a "library" of known phrases and find the best match. There are many approaches to quickly searching for matching templates, the details of which are outside the scope of this article. These problems are akin to database searches; the goal of course is to eliminate as many false paths as quickly as possible and proceed down the right search tree to the correct endpoint. Newer algorithms to do this take advantage of the higher processing power available today in order to achieve faster, more accurate searches.
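The matching step can be illustrated with the simplest possible search: compare the incoming profile against every template and keep the closest one, rejecting the input if even the best candidate is too far away. This exhaustive scan is deliberately naive and is not the pruned tree search the paragraph alludes to. The four-number templates, the Euclidean distance measure, and the rejection threshold are all hypothetical values chosen for the example.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(profile, library, reject_above):
    """Return (phrase, distance) for the closest template, or (None, distance)
    if even the closest template is farther away than reject_above."""
    best_phrase, best_dist = None, float("inf")
    for phrase, template in library.items():
        d = distance(profile, template)
        if d < best_dist:
            best_phrase, best_dist = phrase, d
    if best_dist > reject_above:
        return None, best_dist
    return best_phrase, best_dist

# Hypothetical library; real templates would hold many more numbers per phrase.
LIBRARY = {
    "file print": [0.9, 0.2, 0.7, 0.1],
    "file save":  [0.9, 0.2, 0.1, 0.8],
    "file close": [0.1, 0.8, 0.3, 0.4],
}

spoken = [0.85, 0.25, 0.65, 0.15]  # close to the "file print" template
noise = [0.5, 0.5, 0.5, 0.5]       # resembles nothing in the library

print(best_match(spoken, LIBRARY, reject_above=0.5))
print(best_match(noise, LIBRARY, reject_above=0.5))
```

The cost of this scan grows in direct proportion to the size of the active vocabulary, which is one reason larger vocabularies call for smarter search strategies and more processing power.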

The description above covers "whole word" recognition, which is the basis for all first generation speech products and for some of today's recognizers as well. Some second generation products use a sub-word technology known as "phonetic recognition." These support much larger vocabulary sizes because the language is broken down into a small number of phonetic elements, the building blocks of all other words. A set of 47 subwords represents all the sounds of the English language. With this technology, words in the recognition vocabularies are generated by combining sub-word models in appropriate order, avoiding the large data collections of whole-word models.
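A small sketch shows why sub-word models scale better than whole-word templates: each vocabulary entry is just an ordered list of shared units, so adding a word usually requires no new acoustic data. The lexicon below is hypothetical and uses ARPAbet-style symbols for illustration only; it is not the 47-unit set mentioned above, and the function names are assumptions.

```python
# Hypothetical pronunciation lexicon: word -> ordered sub-word units.
LEXICON = {
    "file":  ["F", "AY", "L"],
    "print": ["P", "R", "IH", "N", "T"],
    "fine":  ["F", "AY", "N"],
    "tile":  ["T", "AY", "L"],
    "pint":  ["P", "AY", "N", "T"],
}

def phrase_model(phrase, lexicon=LEXICON):
    """Build the unit sequence for a phrase by concatenating its word entries."""
    units = []
    for word in phrase.split():
        units.extend(lexicon[word])
    return units

def units_in_use(lexicon=LEXICON):
    """The set of distinct sub-word models the current vocabulary relies on."""
    return {unit for units in lexicon.values() for unit in units}

def new_units_needed(pronunciation, lexicon=LEXICON):
    """Units a new word would require that no existing word already uses."""
    return sorted(set(pronunciation) - units_in_use(lexicon))

print(phrase_model("file print"))
print(sorted(units_in_use()))
# Adding "lint" reuses models that already exist, so nothing new must be collected.
print(new_units_needed(["L", "IH", "N", "T"]))
```

In this toy lexicon eight unit models cover five words, and a sixth word can be added by writing down its pronunciation alone, whereas a whole-word system would need fresh recordings of every new entry.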




