Design Article
A Text-Dependent Approach to Speaker Identification
A. Sankaranarayanan
9/16/2002 12:00 AM EDT
ABOUT THE AUTHOR
A. Sankaranarayanan received a Bachelor's degree in Electronics and Telecommunication Engineering from the University of Mumbai and plans to pursue graduate studies in Electrical Engineering. His area of interest is speech signal processing.
Although digital fingerprint identification and iris scanning are extremely accurate indicators of an individual's identity, speaker identification is an emerging technique. Speaker identification systems are popular in spite of their poorer accuracy vis-à-vis those techniques because they are the least expensive to build (they can be implemented on any general-purpose computer) and are non-invasive. Speaker identification systems may be classified into two categories based on their principle of operation.
- Text-dependent systems, which make use of a fixed utterance for test and training, and rely on specific features of the test utterance in order to effect a match.
- Text-independent systems, which make use of different utterances for test and training, and rely on long-term statistical characteristics of speech for making a successful identification.
Text-dependent systems require less training than text-independent systems and are capable of producing good results with a fraction of the test speech sample required by a text-independent system.
Speech-Production Model
The development of a text-dependent speaker identification system requires a thorough understanding of the nature of speech and the model of speech production. At a relatively high level, speech may be thought of as being composed of a string of phonemes (basic sound units). The English language consists of approximately 42 phonemes.
Speech is produced by the flow of air through the various articulators such as the vocal tract, lips, tongue, and nose. Air is forced out of the lungs through the trachea and the glottis, where it passes through the vocal cords. The vocal cords, if tense, vibrate like an oscillator, but if relaxed, do not vibrate and simply let the air pass through. The air stream then passes through the pharynx cavity and, depending on the position of a movable flap called the velum, exits either through the oral cavity (mouth), or the nasal cavity (nostrils). In the former case, the tongue and the teeth may modify the flow of the air stream as well. Different positions of these articulators give rise to different types of sounds. All sounds can be divided into the following broad categories.
- Voiced sounds are produced whenever the vocal cords are tensed and vibrate. Vowels ('a', 'e', 'i', 'o', and 'u') and diphthongs fall in this category of sounds. The frequency of vibration of the vocal cords is called the pitch. Moreover, the vocal-tract configuration for these sounds results in a resonant structure; the vocal-tract resonance frequencies are known as formants.
- Unvoiced sounds are produced when the vocal cords are relaxed and, therefore, do not vibrate. Fricatives (sounds such as 'shh' and 'f') and aspirated sounds (whispered speech) are examples of unvoiced sounds. Turbulent airflow occurs either at the mouth (fricatives) or at the glottis (aspirated sounds) to produce speech that exhibits a distinct lack of periodicity. The spectrum of unvoiced sounds usually lacks resonant peaks and has a broadband structure.
- Plosive sounds are produced when there is a build-up of pressure due to constriction at some point in the vocal tract followed by a sudden release, which leads to transient excitation. This may occur with or without vocal cord excitation. Examples of plosive sounds include the 'p' in 'pin' (an unvoiced plosive) and 'b' in 'bin' (a voiced plosive).
A powerful tool for analysis of speech is the source-filter model (Figure 1 shows a simplified version) of human speech production.
This model is an approximate representation of the excitation source and the vocal tract. Although not very accurate for some types of sounds (especially unvoiced sounds), it provides a useful way to quantify several parameters that you can use for speaker identification.
The model in Figure 1 assumes two sources: the switch alternates between the glottal pulse generator (for voiced sounds) and the random noise generator (for unvoiced sounds). These sources are filtered by the vocal tract (represented by the time-varying filter). The figure omits some details (such as the mouth radiation model) for simplicity.
- The glottal pulse generator represents the vibration of the vocal cords and is the active source for production of voiced sounds such as vowels. It is also known as the buzz source. The period of the impulse train generated by this source is known as the pitch period; its reciprocal is the fundamental frequency of the utterance. The output frequency spectrum is rich in harmonics of the fundamental frequency.
- The random noise generator is responsible for generating the random turbulence and pressure build-up waveform for unvoiced sounds such as the fricatives. It is sometimes called a hiss source. The frequency spectrum of this source is relatively flat; this explains the broadband nature of unvoiced sounds.
- You can represent the dynamic nature of the speech articulators constituting the human vocal tract by a time-varying digital filter, labeled in Figure 1 as the vocal-tract filter model. The parameters (coefficients) associated with this filter vary over a period of about 5 to 20 milliseconds, depending on the nature of the utterance, in step with the changing configuration of the vocal tract. Since you can model the vocal tract as a tube whose shape changes with time, it exhibits resonance at specific frequencies (formants). Peaks in the frequency response of the vocal-tract filter represent these formants.
The source-filter model assumes that it is possible to separate the excitation source from the vocal-tract filter, and also assumes an all-pole (autoregressive) vocal-tract filter. These assumptions are not entirely accurate for many speech sounds. Nevertheless, this model forms a very useful basis for understanding the nature of speech production and for quantifying several parameters that characterize speech.
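The source-filter idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling rate, the pitch period of 80 samples, and the single two-pole resonator standing in for the vocal-tract filter (one formant near 700 Hz) are hypothetical values chosen for the example, and the function names are not taken from the author's system.

```python
import math
import random

def excitation(n_samples, voiced, pitch_period=80, seed=0):
    """Buzz source (impulse train) if voiced, hiss source (white noise) otherwise."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(x, a):
    """All-pole (autoregressive) filter: y[n] = x[n] + sum_k a[k-1] * y[n-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y

def resonator(freq_hz, bandwidth_hz, fs):
    """Feedback coefficients of a two-pole resonator with one formant-like peak."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius (< 1, so stable)
    theta = 2.0 * math.pi * freq_hz / fs         # pole angle
    return [2.0 * r * math.cos(theta), -r * r]

fs = 8000                                        # assumed sampling rate in Hz
a = resonator(700.0, 100.0, fs)                  # assumed single formant
voiced = all_pole_filter(excitation(400, True, pitch_period=80), a)
unvoiced = all_pole_filter(excitation(400, False), a)
```

After the initial transient, the voiced output repeats every 80 samples (the assumed pitch period), while the unvoiced output shows no such periodicity.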
Speaker-Identification Features
The source-filter model discussed in the previous section provides useful parameters for identifying a speaker. One such quantity is the pitch period (or, equivalently, the fundamental frequency) of speech. Pitch varies from one individual to another; pitch frequency is typically higher for female voices and lower for male voices. This suggests that pitch might be a suitable parameter to distinguish one speaker from another, or at least to narrow down the set of probable matches.
Analysis of the frequency spectrum of the test utterance also provides valuable information about speaker identity. The spectrum contains both pitch harmonics and vocal-tract resonant peaks, making it possible to identify the speaker with a high probability of being correct.
You can also use the vocal-tract filter parameters (filter coefficients) to good effect for speaker identification. This is due to the fact that different speakers have different vocal-tract configurations for the same utterance.
In any text-dependent speaker identification system, an important decision is the choice of test utterance. As discussed in the previous section, the source-filter model is most accurate at representing voiced sounds, such as the vowels. Vowels have a definite, consistent pitch period. The vocal-tract configuration for vowel-utterances exhibits a clear formant (resonant) structure. The frequency spectrum corresponding to vowel-utterances therefore contains a wealth of information that can be used for speaker identification. The prototype speaker identification system built by the author (to be described later in this paper) makes use of the vowels ('a', 'e', 'i', 'o', and 'u') for the test utterance.
Pitch-Period Estimation
A number of algorithms exist for pitch-period estimation. The two broad categories of pitch-estimation algorithms are time-domain algorithms and frequency-domain algorithms. Time-domain algorithms attempt to determine pitch period directly from the speech waveform (examples include the Gold-Rabiner algorithm and the autocorrelation algorithm). Frequency-domain algorithms use some form of spectral analysis to determine the pitch period (an example is the method of cepstral truncation).
Although frequency-domain algorithms may yield higher accuracy, time-domain algorithms have the advantage that they can be implemented with minimal difficulty on a general-purpose digital computer. A computationally efficient algorithm due to Gold and Rabiner makes use of parallel processors to produce pitch period estimates that are quite reliable. A brief description of the algorithm follows.
The algorithm begins by passing the speech signal through a low-pass filter with a cutoff frequency of 600-800 Hz, which removes the higher harmonics of pitch frequency that might interfere with accurate pitch estimation. This is acceptable, since the pitch frequency rarely increases above 500 Hz, even for a high-pitched female voice.
The filtered speech signal is processed to generate six impulse trains. These impulse trains come from the local maxima and minima of the speech waveform; their function is to retain the periodicity of the speech signal while discarding features irrelevant to the process of pitch detection. The reason for using six impulse trains is that the algorithm must function with few errors even under extreme conditions (in the presence of harmonics). In many cases, only two or three of the six impulse trains will indicate the correct pitch period; the rest will be incorrect. However, the redundancy built into the algorithm ensures that it is able to determine the fundamental frequency with a low probability of error even in these cases.
The six impulse trains are fed to six identical pitch extractors. Each pitch extractor latches on to an impulse and holds it for a blanking interval, during which subsequent impulses are ignored. After the blanking interval, the latched value begins to decay exponentially. The decay period ends when the pitch extractor encounters an impulse that is greater in amplitude than the instantaneous amplitude of the decaying value. The time period between the initial impulse latch and the end of the decay phase is the new pitch-period estimate. The current average pitch estimate is calculated as the mean of the previous average pitch estimate and the new pitch period estimate. New values for the blanking interval and exponential-decay constant are empirically determined from the current average pitch estimate.
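The behavior of a single pitch extractor can be sketched as follows. This is a simplified illustration, not the Gold-Rabiner implementation itself: the blanking interval and decay time constant are derived from the running average with hypothetical factors (`blank_frac`, `decay_frac`), whereas the original algorithm determines them empirically, and the initial period guess is also an assumption.

```python
import math

def pitch_extractor(impulses, init_period=80.0, blank_frac=0.4, decay_frac=0.7):
    """Track pitch periods in one impulse train.

    impulses: one non-negative amplitude per sample (0.0 means no impulse).
    Returns a list of (sample_index, period_in_samples) estimates.
    """
    estimates = []
    avg = init_period            # current average pitch estimate (samples)
    latch_idx = None             # sample index of the latched impulse
    latch_amp = 0.0              # amplitude of the latched impulse
    for n, amp in enumerate(impulses):
        if amp <= 0.0:
            continue
        if latch_idx is None:    # first impulse: just latch it
            latch_idx, latch_amp = n, amp
            continue
        elapsed = n - latch_idx
        blank = blank_frac * avg             # hypothetical blanking rule
        if elapsed <= blank:
            continue                         # ignored during blanking interval
        tau = avg / decay_frac               # hypothetical decay time constant
        threshold = latch_amp * math.exp(-(elapsed - blank) / tau)
        if amp > threshold:                  # impulse exceeds the decaying value
            estimates.append((n, elapsed))   # new pitch-period estimate
            avg = 0.5 * (avg + elapsed)      # mean of old average and new estimate
            latch_idx, latch_amp = n, amp
    return estimates

# Synthetic impulse train: true period 50 samples, plus weaker impulses
# halfway between (a crude stand-in for a second harmonic).
impulses = [0.0] * 400
for n in range(0, 400, 50):
    impulses[n] = 1.0
for n in range(25, 400, 50):
    impulses[n] = 0.3
estimates = pitch_extractor(impulses)
```

In this synthetic case the weaker intermediate impulses are rejected either by the blanking interval or by the decaying threshold, so every estimate equals the true period of 50 samples.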
The final pitch-period estimate is determined from the current and previous pitch estimates (and the sums of the current and previous pitch estimates) of each of the six pitch extractors through a process of consensus. This ensures accuracy of the algorithm.
The algorithm occasionally picks the wrong pitch-period estimate; this problem manifests itself in the form of impulsive noise that occurs randomly in the pitch-estimate array and can cause serious errors during comparison. A low-pass filter will remove these impulses, but will 'spread' or 'blur' the noise over the pitch contour. A median filter, however, produces the desired result of removing most of the impulsive noise while retaining the original pitch contour (Figure 2). For most purposes, a three- or five-point median filter is suitable for eliminating noise in the pitch estimates.
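A median filter of this kind takes only a few lines. The sketch below is one possible version; shortening the window at the ends of the contour is an assumption made here for simplicity, and other edge treatments are equally valid.

```python
def median_filter(values, width=5):
    """Replace each value by the median of a window centered on it.

    The window is shortened at the ends of the sequence; for a shortened
    window of even length, the upper of the two middle values is used.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    out = []
    for i in range(len(values)):
        window = sorted(values[max(0, i - half):i + half + 1])
        out.append(window[len(window) // 2])
    return out

# Pitch contour (in samples) with two impulsive errors: 200 and 50.
contour = [100, 101, 100, 200, 102, 101, 50, 100, 101]
smoothed = median_filter(contour, width=3)
```

Both outliers disappear from `smoothed`, while the remaining values stay within the range of the original contour rather than being blurred toward the outliers as a low-pass filter would do.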
Spectral Analysis: Wavelets
Spectral analysis of speech is complicated by the fact that the speech signal is non-stationary; in other words, it has a time-varying frequency spectrum depending on the utterance. However, the speech articulators vary relatively slowly, and it is reasonable to assume that short segments (about 10-20 milliseconds) of speech are stationary. This leads to the idea of short-time techniques, in which analysis is carried out on such spectrally invariant segments (windows) of speech. The short-time Fourier transform is one of the most popular techniques in this category. It results in a spectrogram or time-frequency plot, which illustrates the temporal variation of the spectral components of speech.
Although popular, the short-time Fourier transform is limited by the uncertainty principle of spectral analysis, which states that the product of uncertainty in time and in frequency has a finite lower bound. In other words, resolution in time and frequency cannot be increased independently of one another: an increase in time resolution (a smaller window) results in a decrease in frequency resolution (spectral leakage) and vice-versa. The short-time Fourier transform uses nominally fixed window widths with the consequence that it can only provide fixed resolution in time and frequency.
Recently, we've seen the emergence of a new technique known as the wavelet transform for spectral analysis of non-stationary signals. It makes use of special time functions known as wavelets, and provides the flexibility in time-frequency resolution unobtainable with the classical short-time Fourier transform. With wavelets, it is possible to analyze a signal at several levels of resolution, making it possible to capture transient, high-frequency bursts with poor frequency resolution and also slowly varying characteristics with high frequency resolution. Therefore, it is possible to trade off frequency resolution for better time resolution (for analyzing transients) and time resolution for better frequency resolution (for analyzing slow variations), a facility not afforded by the short-time Fourier transform.
The CWT (Continuous Wavelet Transform) is given by Equation 1.

$$W(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} f(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt \qquad (1)$$

f(t) is the non-stationary time signal to be analyzed. The function ψ(t) is called the mother wavelet (the asterisk denotes complex conjugation). The mother wavelet is an oscillatory function having zero mean; most of its energy is confined in a small region near the origin. The parameter a is referred to as the scale or dilation. The scale specifies the time duration or 'stretch' of the wavelet; a large value of scale indicates poor time resolution and increased frequency resolution and vice-versa. The parameter b is known as the translation. The translation specifies the position of the wavelet on the time axis. Both parameters are continuous.
You can use a continuous-time convolution operation to interpret the CWT given by Equation 1. The scale parameter specifies an infinite number of impulse responses with which to convolve the signal f(t). This interpretation is equivalent to passing the signal f(t) through a bank of (infinite) analog filters, each having an impulse response specified by one value of scale (Figure 3). The filters are of the band-pass variety (this is expected, since the mother wavelet has zero mean) and have the special property that their Q-factors (center frequency to bandwidth ratio) are equal.
The CWT is of little computational value. For implementation on a digital computer, you must discretize the scale and translation parameters. The discretization is usually dyadic, meaning scale and translation parameters are integral powers of two. This leads to a representation of the continuous-time function as a linear combination of dyadically scaled and translated wavelets known as the DWT (Discrete Wavelet Transform). There is a further complication. Although the DWT discretizes the scale and translation parameters, it still applies to a continuous-time function. Digital computers, on the other hand, work with a discrete version of the time signal itself (obtained by sampling the continuous-time signal at or above the Nyquist rate).
The above considerations lead to a modified form of the DWT that digital filters can implement. Samples of the discrete-time signal are considered to be the approximation coefficients of the signal at the highest (finest) possible level of resolution (labeled the 0th level of resolution). These represent the entire digital frequency range from 0 to π radians. A process of high-pass filtering using a half-band filter and down sampling produces the detail coefficients at the next (coarser) level of resolution (the first level). The detail coefficients represent the frequency range between π/2 and π radians. Similarly, the approximation coefficients at the first level of resolution are obtained by passing the signal through a low-pass filter and down sampling the result. These coefficients contain spectral information in the range 0 to π/2 radians. Continuing in this fashion, you can use the approximation coefficients at this coarser level to generate approximation and detail coefficients at further coarser levels (levels 2, 3, ...). At each level, the spectrum of the approximation coefficients is divided in two by the low-pass and high-pass filtering operations; thus the DWT is reduced to a form of dyadic sub-band filtering (Figure 4 illustrates a three-level decomposition).
This process is carried out recursively with a bank of digital filters until the required level of frequency resolution is achieved (for a speech signal band-limited to about 6 kHz, a seven-level analysis is usually sufficient). The process of generating the approximation and detail coefficients at the kth level of resolution given the approximation coefficients at the (k-1)st level is summarized by the schematic of Figure 5.
In Figure 5, a_k(n) and b_k(n) are the approximation and detail coefficients respectively at resolution level k. a_{k-1}(n) are the approximation coefficients at the (k-1)st level of resolution. h(n) is the low-pass (approximation) filter and g(n) is the high-pass (detail) filter. The exact nature (impulse response) of these filters depends on the wavelet chosen.
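The filter-and-downsample step of Figure 5 can be sketched in Python as shown below. For brevity the example uses the two-tap Haar filters for h(n) and g(n); this is an assumption made for illustration and is not necessarily the filter pair used elsewhere in this article. The sketch also assumes the signal length is a multiple of 2 raised to the number of levels, and it applies the filters as a sliding inner product with periodic wrap-around.

```python
import math

S = 1.0 / math.sqrt(2.0)
HAAR_H = [S, S]     # low-pass (approximation) filter h(n), assumed Haar
HAAR_G = [S, -S]    # high-pass (detail) filter g(n), assumed Haar

def analysis_step(a_prev, h=HAAR_H, g=HAAR_G):
    """One level: filter a_{k-1}(n) with h and g, keep every second output."""
    n = len(a_prev)
    half = n // 2
    approx = [sum(h[k] * a_prev[(2 * i + k) % n] for k in range(len(h)))
              for i in range(half)]
    detail = [sum(g[k] * a_prev[(2 * i + k) % n] for k in range(len(g)))
              for i in range(half)]
    return approx, detail

def dwt(signal, levels):
    """Return (coarsest approximation, [detail at level 1, level 2, ...])."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = analysis_step(approx)
        details.append(detail)
    return approx, details

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = dwt(signal, 3)
```

Each level halves the number of coefficients, and because the Haar filters are orthonormal, the total energy of the coefficients equals the energy of the input samples.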
Linear Predictive Analysis
LPA (Linear Predictive Analysis) is a powerful and popular technique for estimating the vocal-tract filter coefficients (predictor coefficients) which, as already mentioned, are useful for speaker identification since different speakers have different vocal-tract configurations for the same utterance. The basic premise of LPA is that you can approximate the current sample of the speech signal (within reasonable accuracy limits) as a linear combination of past samples of speech. The difference between the predicted sample and the actual sample is known as the prediction error. You can determine a set of predictor coefficients by minimizing the mean-squared error. Thus, the theory of LPA is intimately tied to the source-filter model of speech production.
The number of coefficients used to characterize the time-varying vocal-tract filter is known as the order of the predictor. As already mentioned, the filter is treated as an all-pole system, also known as an autoregressive model. This imposes certain limitations on the filter in that it is able to accurately model only voiced sounds, and introduces significant prediction error for unvoiced sounds. Moreover, the transfer function of the filter requires zeros for accurately modeling nasals, a facility the autoregressive model does not afford. In spite of these limitations, autoregressive LPA provides a sufficiently accurate model for speaker identification, especially if the test utterance comprises vowels.
The vocal-tract filter is a time-varying system. A new set of predictor coefficients must, therefore, be evaluated once every 10-20 milliseconds. The LPA algorithm typically sections the speech signal into windows of length 10-20 milliseconds, with an overlap of about 5-10 milliseconds. A set of linear equations (p equations, where p is the predictor order) results from minimizing the mean-squared error between the predicted and actual samples within the window. You can solve this set of equations using one of two techniques: the autocorrelation method or the covariance method.
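Although the latter results in faster convergence, the former guarantees a stable predictor and is more often used. The matrix form of these equations for the autocorrelation method is given by Equation 2.

$$\begin{bmatrix} R(0) & R(1) & \cdots & R(p-1) \\ R(1) & R(0) & \cdots & R(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(p-1) & R(p-2) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(p) \end{bmatrix} \qquad (2)$$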
In Equation 2, R(k) represents the short-time autocorrelation function of the speech signal, and (a_1, a_2, ..., a_p) represent the p predictor coefficients. The solution of this set of linear equations can be found using the usual matrix inversion technique, but a computationally efficient iterative solution due to Levinson and Durbin is often employed. This algorithm exploits the special properties of the autocorrelation matrix in Equation 2 (the matrix is symmetric, has equal elements along each diagonal, and is said to possess the Toeplitz property).
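A compact sketch of the Levinson-Durbin recursion follows. It assumes the predictor convention in which the current sample is approximated by the sum of a_k times the kth past sample, and it takes the autocorrelation values R(0) to R(p) of one windowed frame as input; the function names are illustrative.

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation R(0..max_lag) of one (windowed) frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz system of Equation 2 for the predictor coefficients.

    r: autocorrelation values [R(0), ..., R(order)].
    Returns ([a_1, ..., a_order], final mean-squared prediction error).
    """
    a = [0.0] * (order + 1)      # a[0] is unused padding
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("autocorrelation sequence is not positive definite")
        # Reflection coefficient for stage i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)     # prediction error shrinks at every stage
    return a[1:], err

coeffs, error = levinson_durbin([2.0, 1.0, 0.2], 2)
```

For this small example the result agrees with direct solution of the 2-by-2 system: a_1 = 0.6 and a_2 = -0.2, with a residual error of 1.44. The recursion needs on the order of p squared operations, compared with p cubed for general matrix inversion.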
You can obtain a reasonably accurate estimate of the vocal-tract filter using a tenth- or twelfth-order predictor. The transfer function and frequency response of the vocal-tract filter can be easily determined once the predictor coefficients have been evaluated. Figure 6 shows the vocal-tract response for a 20-millisecond frame of the voiced utterance 'a' for two speakers. The spectrum is smooth and shows no harmonic ripple due to pitch. A clear formant structure is visible; the location as well as amplitude of these formants is different, thus vindicating the effectiveness of LPA for speaker identification.
Distance Metrics
During the training phase, the features described in the previous sections must be extracted from the training utterance and stored in a database (the collection of features extracted will henceforth be referred to as a profile). The test phase involves creation of a profile from the test utterance (which is the same as the training utterance in a text-dependent speaker-identification system) and comparison of this profile with those stored in the database. The profile in the database that is 'closest' to the test profile (subject to some independent threshold) is then declared a match. The measure of 'closeness' between two profiles is provided by suitable distance metrics. Different features within the profile may use different distance metrics.
The squared-Euclidean distance is eminently suitable for computing the distance between pitch estimates of the two profiles. The squared-Euclidean distance between two N-dimensional vectors (denoting the pitch vectors) {a_1, a_2, ..., a_N} and {b_1, b_2, ..., b_N} is given by Equation 3.

$$d(a, b) = \sum_{i=1}^{N} (a_i - b_i)^2 \qquad (3)$$
Pitch vectors extracted from speech will almost certainly be of different lengths and the larger vector will have to be truncated to the size of the smaller one before Equation 3 is applied. Normalization of the distance is also usually performed to avoid variability in pitch vector length.
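As a small illustration, the truncation and normalization steps might look like this in Python. Dividing by the common length is one plausible form of normalization and is an assumption here, since the exact normalization is not specified.

```python
def pitch_distance(a, b):
    """Squared-Euclidean distance between two pitch vectors.

    The longer vector is truncated to the length of the shorter one, and
    the sum is divided by that common length (assumed normalization).
    """
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("pitch vectors must not be empty")
    return sum((x - y) ** 2 for x, y in zip(a[:n], b[:n])) / n

d = pitch_distance([100, 102, 104], [101, 102, 100, 99])
```

Here the fourth element of the second vector is discarded, and the distance is (1 + 0 + 16) / 3.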
The DWT coefficients contain spectral information in dyadic sub-bands whose location and extent depend on the level of resolution. One possible method for comparing the two sets of DWT coefficients follows. For both DWTs, the fraction of normalized (per sample) energy in each scale is evaluated, and the ratio of the corresponding fractional energy in each DWT is taken (for similar DWTs, this ratio should be close to unity; it is inverted if less than unity). These ratios are weighted by a non-linear (decreasing) function of the form a^n, where n is the scale index and 0.92 < a < 0.96. This is because ratios of fractional energies at higher scales are in greater error due to a smaller number of samples; assigning lower weights to these scales reduces the error in the final distance measure. The logarithm of each weighted ratio is then accumulated. For DWTs of two utterances by the same speaker, this distance is close to zero.
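The comparison described above leaves some room for interpretation. The sketch below takes one reading: the weight for scale n multiplies the logarithm of the (inverted if necessary) energy ratio, so that identical DWTs give a distance of exactly zero. That reading, the default weight base of 0.94, and the small floor used to avoid division by zero are all assumptions.

```python
import math

def fractional_energies(scales, floor=1e-12):
    """Per-sample energy of each scale as a fraction of the total."""
    per_sample = [max(sum(c * c for c in s) / len(s), floor) for s in scales]
    total = sum(per_sample)
    return [e / total for e in per_sample]

def dwt_distance(scales_a, scales_b, base=0.94):
    """Weighted log-ratio distance between two sets of DWT coefficients.

    scales_a, scales_b: lists of coefficient lists, one per scale, ordered
    from the lowest scale to the highest.
    """
    fa = fractional_energies(scales_a)
    fb = fractional_energies(scales_b)
    dist = 0.0
    for n, (ea, eb) in enumerate(zip(fa, fb), start=1):
        ratio = ea / eb
        if ratio < 1.0:
            ratio = 1.0 / ratio              # invert ratios below unity
        dist += (base ** n) * math.log(ratio)  # lower weight at higher scales
    return dist

x = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0], [3.0]]
louder_x = [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0], [6.0]]
y = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0], [1.0]]
```

Because only fractional energies are compared, `dwt_distance(x, louder_x)` is essentially zero even though the second set is twice as large in amplitude, whereas `dwt_distance(x, y)` is clearly positive.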
LPA provides only an approximate estimate of the vocal-tract frequency response. Due to noise as well as the inexactness of the linear prediction model, the predictor coefficients obtained from two speech samples of the same utterance by the same individual will vary. The Itakura distance provides an estimate of the distance between two sets of linear predictor coefficients. The mathematical expression for this distance metric is given by Equation 4.

$$d(a, \hat{a}) = \ln\!\left(\frac{\hat{a}^{T} R\, \hat{a}}{a^{T} R\, a}\right) \qquad (4)$$

In Equation 4, a and â are the two predictor coefficient vectors being compared. R is the autocorrelation matrix corresponding to the profile stored in the database (see Equation 2). This distance metric is accumulated for each frame of speech (after an initial adjustment to make the number of LPA frames equal). The final distance may be normalized to account for speech-rate variability.
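A per-frame Itakura distance can be sketched as follows. The sketch uses one common convention, which is an assumption here: each predictor vector is extended to the prediction-error filter [1, -a_1, ..., -a_p], R is the (p+1)-by-(p+1) Toeplitz autocorrelation matrix of the stored (reference) frame, and the reference coefficients appear in the denominator so that the distance is never negative.

```python
import math

def quadratic_form(v, m):
    """Compute v^T m v for a vector v and a square matrix m."""
    size = len(v)
    return sum(v[i] * m[i][j] * v[j] for i in range(size) for j in range(size))

def itakura_distance(a_ref, a_test, r_ref):
    """Itakura distance for one frame.

    a_ref, a_test: predictor coefficients [a_1, ..., a_p] of the stored and
    test frames. r_ref: autocorrelation values [R(0), ..., R(p)] of the
    stored frame.
    """
    p = len(a_ref)
    toeplitz = [[r_ref[abs(i - j)] for j in range(p + 1)] for i in range(p + 1)]
    v_ref = [1.0] + [-c for c in a_ref]      # prediction-error filter (stored)
    v_test = [1.0] + [-c for c in a_test]    # prediction-error filter (test)
    return math.log(quadratic_form(v_test, toeplitz) /
                    quadratic_form(v_ref, toeplitz))

r_ref = [2.0, 1.0, 0.2]
a_ref = [0.6, -0.2]          # optimal second-order predictor for r_ref
d_same = itakura_distance(a_ref, a_ref, r_ref)
d_other = itakura_distance(a_ref, [0.5, 0.0], r_ref)
```

With these numbers the denominator equals the minimum prediction error (1.44), so `d_same` is zero and `d_other` is ln(1.5 / 1.44), a small positive value.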
The final distance between two profiles is a weighted sum of the three distance metrics previously discussed. Weighting is necessary, since not all features are equally effective at identifying a speaker. The pitch estimates of two individuals may be similar, in which case the squared-Euclidean distance would be small. By contrast, DWT and LPA coefficients are much better at identifying a speaker, yielding relatively small distances for a match and large distances for a mismatch.
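The weighted combination and the thresholded nearest-profile decision can be illustrated with a short sketch. The weights, the threshold, and the speaker names are hypothetical placeholders; the values actually used would have to be tuned on real data.

```python
def combined_distance(d_pitch, d_dwt, d_lpa, weights=(0.2, 0.4, 0.4)):
    """Weighted sum of the three feature distances (weights are assumed)."""
    w_pitch, w_dwt, w_lpa = weights
    return w_pitch * d_pitch + w_dwt * d_dwt + w_lpa * d_lpa

def best_match(distances, threshold):
    """Return the closest enrolled profile, or None if none is close enough.

    distances: dict mapping profile name to combined distance.
    """
    if not distances:
        return None
    name = min(distances, key=distances.get)
    return name if distances[name] <= threshold else None

# Hypothetical combined distances from a test profile to two stored profiles.
scores = {"speaker_a": 0.3, "speaker_b": 0.9}
accepted = best_match(scores, threshold=0.5)
rejected = best_match(scores, threshold=0.2)
```

With the looser threshold the closest profile is accepted; with the stricter one no profile is close enough and the test utterance is rejected.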
Performance Criteria
The performance of a speaker-identification system is described in terms of three parameters:
- A false acceptance occurs when the system incorrectly identifies an unregistered individual as an enrolled one, or when one registered individual is mistaken for another. The FAR (False Acceptance Ratio) is the ratio of the number of false acceptances to the total number of trials. You can reduce the FAR by setting a strict (low) threshold.
- A false rejection occurs when the system incorrectly refuses to identify an individual who is registered with the system. The FRR (False Rejection Ratio) is the ratio of the number of false rejections to the total number of trials. You can minimize the FRR by setting the threshold to a liberal (high) value.
- The equal error rate is defined as the error rate offered by the system when the FAR and FRR are made equal to each other. You can obtain the equal error rate by plotting the FAR and FRR curves against the threshold value and locating the point where they intersect.
The requirements for low FAR and FRR are seen to be conflicting, and both parameters cannot be simultaneously lowered. However, a low FAR is vital for good speaker identification systems (otherwise security of the system would be jeopardized), and most systems are biased for good FAR performance at the expense of FRR.
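The trade-off can be made concrete with a small sketch that computes FAR and FRR from lists of trial distances and searches for the threshold at which the two are closest. Two assumptions are made: a trial is accepted when its distance is at or below the threshold, and each ratio is taken over the trials of the relevant kind (impostor trials for FAR, genuine trials for FRR). The trial distances themselves are made-up illustrative numbers, not measured results.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) for one threshold.

    genuine: distances from trials that should be accepted.
    impostor: distances from trials that should be rejected.
    """
    far = sum(d <= threshold for d in impostor) / len(impostor)
    frr = sum(d > threshold for d in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Return (threshold, error rate) where FAR and FRR are closest."""
    def gap(t):
        far, frr = error_rates(genuine, impostor, t)
        return abs(far - frr)
    candidates = sorted(set(genuine) | set(impostor))
    best = min(candidates, key=gap)
    far, frr = error_rates(genuine, impostor, best)
    return best, (far + frr) / 2.0

# Illustrative (made-up) trial distances.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.7, 0.8, 0.9]
strict = error_rates(genuine, impostor, 0.3)
eer_threshold, eer = equal_error_rate(genuine, impostor)
```

At the strict threshold of 0.3 the FAR is zero but one genuine trial in four is rejected; moving the threshold to 0.4 makes both rates equal at 0.25.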
Prototype System
The author has developed a small-scale prototype speaker identification system based on the principles described in the previous sections of this paper. The entire system has been developed using object-oriented concepts in the C++ language. An important design objective was to ensure a modular and highly portable system.
The prototype system uses a fixed training and test utterance comprising the English vowels ('a', 'e', 'i', 'o', and 'u') for reasons discussed earlier. A sampling rate of 11,025 Hz is used, limiting the maximum analog frequency to about 5.5 kHz, which is sufficient to preserve all required information. In the training phase, feature-extraction algorithms are used to create a profile from the speech sample. The Gold-Rabiner algorithm is used to estimate pitch; pitch post-processing makes use of a five-point median filter. Extraction of spectral information is accomplished using a seven-level DWT, yielding a peak frequency resolution of about 40 Hz at the lower end of the spectrum. The DWT makes use of a filter bank corresponding to the Daubechies (D2) wavelet. LPA is performed on the speech signal after first-order pre-emphasis (high-pass filtering) to account for the 6 dB/octave roll-off characteristic of the vocal tract. A twelfth-order predictor is used. Profiles thus created are stored in a local disk database.

In the test phase, the same features are used to create a profile from the test utterance. The test profile is then compared with the profiles in the database. The profiles in the database are indexed on overall average pitch, and a modified binary-search algorithm is used to retrieve the profiles more efficiently than a sequential search. The profile in the database that yields the smallest distance to the test profile is chosen (subject to an independent threshold) as the match.

The system is adaptive; in other words, it is capable of tracking slight changes in speech patterns over multiple test utterances. A successful match causes the profile in the database to be updated upon request.
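The first-order pre-emphasis step mentioned above is commonly realized as a simple difference filter. The sketch below shows that common form; the coefficient value of 0.95 is a typical choice and an assumption here, since the value used in the prototype is not stated.

```python
def pre_emphasis(samples, alpha=0.95):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha is an assumed, typical value; the first sample is passed through
    unchanged because it has no predecessor.
    """
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

A constant (zero-frequency) input is almost entirely suppressed after the first sample, while an input alternating at the highest representable frequency is nearly doubled in amplitude, which is the high-frequency boost the pre-emphasis stage is meant to provide.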
The system was tested with a group of fifteen speakers consisting of nine males and six females. Ten of the fifteen speakers were enrolled in the database. Three values of threshold (STRICT, NORMAL, and LIBERAL) were used to evaluate the performance of the system. Three trials were conducted for every individual for each value of threshold. The system performance characteristics, FAR and FRR, were determined for each threshold. The point of intersection of the FAR and FRR curves yielded the equal error rate. The system was found to yield a very low error rate (FAR and FRR) for registered individuals. The error rate (FAR) was, however, quite considerable for individuals not registered with the system. Tests also indicated that the system was resistant to minor changes in the utterance rate and intonation.