THE METHOD OF VOICE IDENTIFICATION
The method by which a voice is identified is a multifaceted
process requiring the use of both aural and visual senses. In the typical voice
identification case the examiner is given several recordings: one or more recordings of
the voice to be identified and one or more recorded voice samples of one or more suspects.
It is from these recordings that the examiner must determine the identity of
the unknown voice.
The first step is to evaluate the recording of the unknown
voice, checking to make sure the recording has a sufficient amount of speech with which to
work and that the quality of the recording is of sufficient clarity in the frequency range
required for analysis.1 The volume of the recorded voice signal must be significantly
higher than that of the environmental noise. The greater the number of obscuring events,
such as noise, music, and other speakers, the longer the sample of speech must be. Some
examiners report that they reject as many as sixty percent of the cases submitted to them,
with one of the main reasons for rejection being the poor quality of the recording of the
unknown voice.
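The requirement that the voice signal sit well above the environmental noise can be expressed as a signal-to-noise ratio. The following is a minimal sketch, assuming the examiner has already isolated one stretch of samples containing speech and one containing only background noise; the function names are illustrative, and no acceptance threshold is set because the text does not give one.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of amplitude samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech_samples, noise_samples):
    """Signal-to-noise ratio in decibels.

    speech_samples: a stretch of the recording containing the voice.
    noise_samples:  a stretch containing only environmental noise.
    Both selections are assumed to have been made by hand.
    """
    speech = rms(speech_samples)
    noise = rms(noise_samples)
    if noise == 0:
        return math.inf      # no measurable noise at all
    if speech == 0:
        return -math.inf     # no measurable speech at all
    return 20.0 * math.log10(speech / noise)
```

A larger value means the voice stands further above the noise; how large is large enough remains a matter of examiner judgment.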
Once the unknown voice sample has been determined to be
suitable for analysis, the examiner then turns his attention to the voice samples of the
suspects. Here also, the recordings must be of sufficient clarity to allow comparison,
although at this stage, the recording process is usually so closely controlled that the
quality of recording is not a problem.
The examiner can only work with speech samples which are
the same as the text of the unknown recording. Under the best of circumstances the
suspects will repeat, several times, the text of the recording of the unknown speaker and
these words will be recorded in a similar manner to the recording of the unknown speaker.
For example, if the recording of the unknown speaker was a bomb threat made to a recorded
telephone line then each of the suspects would repeat the threat, word for word, to a
recorded telephone line. This will provide the examiner with not only the same speech
sounds for comparison but also with valuable information about the way each speech sound
completes the transition to the next sound.
There are those times when a voice sample must be obtained
without the knowledge of the suspect. It is possible to make an identification from a
surreptitious recording but the amount of speech necessary to do the comparison is usually
much greater. If the suspect is being engaged in conversation for the purpose of obtaining
a voice sample, the conversation must be manipulated in such a way so as to have the
suspect repeat as many of the words and phrases found in the text of the unknown recording
as possible.
The worst exemplar recordings with which an examiner must
work are those of random speech. It is necessary to obtain a large sample of speech to
improve the chances of obtaining a sufficient amount of comparable speech.
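How much comparable speech a random-speech exemplar offers can be estimated from transcripts before any listening is done. The sketch below assumes both recordings have been transcribed by hand; the function names and the simple word-level matching are illustrative assumptions, and matching words are only candidates, since they must still be pronounced in the same manner to be of use.

```python
import re
from collections import Counter

def words(text):
    """Lower-case word tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z']+", text.lower())

def comparable_words(unknown_text, exemplar_text):
    """Words occurring in both transcripts.

    Returns a dict mapping each shared word to the number of usable
    pairs, i.e. the smaller of its two occurrence counts.
    """
    unknown_counts = Counter(words(unknown_text))
    exemplar_counts = Counter(words(exemplar_text))
    return dict(unknown_counts & exemplar_counts)

shared = comparable_words(
    "The package is at the door",
    "I saw the package by the door yesterday",
)
```

A longer exemplar generally raises these counts, which is why random speech calls for a large sample.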
As in any other form of identification analysis, as the
quality of the evidence with which the examiner has to work declines, the amount of
evidence and time necessary to complete the analysis increases, and a positive
conclusion becomes less likely.
Once the evidence has been determined to be sufficient to
perform the analysis, the examiner then begins the two-step process of voice sample
comparison: one aural (listening) and the other spectrographic (visual). These are two
different but interwoven and equally important analytical methods which the examiner
combines to reach the final conclusion. The first step is an aural comparison of the voice
samples.2 Here the examiner compares both single speech sounds and series of speech sounds
of the known and unknown samples. At this stage the examiner is conducting a number of
tasks: comparing for similarities and differences, screening out less useful portions of
the samples, and indexing the samples for further analysis. An example of the initial
aural comparison is the screening of the samples for pronunciation similarities or
discrepancies; for instance, the word "the" may be said with a short "a"
sound or a long "e" sound. If the word is not pronounced in the same manner in both
samples, it loses comparison value.
Once the examiner has located those portions to be used for
the analysis, a more detailed aural comparison is undertaken. This comparison can be
accomplished in many different ways. One of the most commonly used methods of aural
comparison is re-recording a speech sound sample of the unknown followed immediately by a
re-recording of the same speech sounds of the suspect. This is repeated several times so
that the final product is a recording of specific speech sounds, in alternating order, by
the unknown speaker followed by the suspect. Such comparisons have been greatly
facilitated by the use of audio digital recording equipment which allows for the digital
recording, storage, and repeated playback of only the desired speech sounds to be
examined.
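With digital equipment, the alternating re-recording described above amounts to concatenating short clips. The following is a minimal sketch, assuming each clip is already available as a list of amplitude samples at the same sample rate; the function name, the number of repeats, and the optional silent gap are illustrative assumptions.

```python
def alternate_clips(unknown_clip, suspect_clip, repeats=3, gap_samples=0):
    """Build one comparison track: unknown, suspect, unknown, suspect, ...

    repeats:     how many unknown/suspect pairs to include.
    gap_samples: samples of silence placed between consecutive clips.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    silence = [0.0] * gap_samples
    ordered = [unknown_clip, suspect_clip] * repeats
    track = []
    for index, clip in enumerate(ordered):
        if index > 0:
            track.extend(silence)   # gap goes between clips, never at the ends
        track.extend(clip)
    return track
```

The resulting track can be played back repeatedly so the same speech sound is heard from each speaker in immediate succession.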
During the aural comparison the examiner studies the
psycholinguistic features of the speaker's voice. A large number of qualities and
traits are examined, from general traits such as accent and dialect to inflection,
syllable grouping and breath patterns. The examiner also scrutinizes the samples for signs
of speech pathologies and peculiar speech habits.
The second step in the voice identification process is the
spectrographic analysis of the recorded samples. The sound spectrograph is an automatic
sound wave analyzer with a high quality, fully functional tape recorder. The speech
samples to be analyzed are recorded on the sound spectrograph. The recording is then
analyzed in two-and-one-half-second segments. The product is a spectrogram, a graphic
display of the recorded signal on the basis of time and frequency with a general
indication of amplitude.
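A digital analogue of this process can be sketched with a short-time Fourier transform. This is not a model of the sound spectrograph hardware itself; it is a minimal illustration, assuming the recording is a list of amplitude samples, and the frame size, hop, and Hann window are illustrative choices. A naive transform is used to keep the example self-contained, so it is only practical for short inputs.

```python
import cmath
import math

def split_into_segments(samples, sample_rate, segment_seconds=2.5):
    """Cut a recording into consecutive segments of segment_seconds each."""
    seg_len = int(sample_rate * segment_seconds)
    if seg_len < 1:
        raise ValueError("segment length must be at least one sample")
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]

def spectrogram(samples, frame_size=64, hop=32):
    """Time-frequency magnitudes: one row per time frame, one column per
    frequency bin from 0 Hz up to half the sample rate.
    Bin k corresponds to k * sample_rate / frame_size Hz."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        row = []
        for k in range(frame_size // 2 + 1):
            # Naive discrete Fourier transform of the windowed frame
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            row.append(abs(acc))    # magnitude stands in for amplitude
        rows.append(row)
    return rows

# Hypothetical input: five seconds of a 1000 Hz tone sampled at 8000 Hz
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate)
        for n in range(sample_rate * 5)]
segments = split_into_segments(tone, sample_rate)
rows = spectrogram(segments[0][:256])
```

Each row is one time slice and each column one frequency band, which matches the time, frequency, and amplitude axes of the printed spectrogram.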
The spectrograms of the unknown speaker are then visually
compared to the spectrograms of the suspects. Only those speech sounds which are the same
are compared.3 The comparisons of the spectrograms are based on the displayed patterns
representing the psychoacoustical features of the captured speech. The examiner studies
the bandwidths, mean frequencies, and trajectory of vowel formants; vertical striations,
distribution of formant energy and nasal resonances; stops, plosives and fricatives;
interformant features, the relation of all features present as affected during
articulatory changes and any peculiar acoustic patterning.4 The examiner looks not only
for similarities but also for differences. The differences are closely examined to
determine if they are due to pronunciation differences or if they are indicative of
different speakers.
When the analysis is complete the examiner integrates his
findings from both the aural and spectrographic analyses into one of five standard
conclusions: a positive identification, a probable identification, a positive elimination,
a probable elimination, or no decision. In order to arrive at a positive identification
the examiner must find a minimum of twenty speech sounds which possess sufficient aural
and spectrographic similarities. There can be no differences, either aural or
spectrographic, that cannot be accounted for.
The probable identification conclusion is reached when
there are fewer than twenty similarities and no unexplained differences. This conclusion is
usually reached when working with small samples, random speech samples or recordings of
lower quality. The result of positive elimination is rendered when twenty differences
between the samples are found that cannot be attributed to anything other than different
voices having produced the samples. A probable elimination decision is usually reached
when working with limited text or a recording of lower quality. The no decision conclusion
is used when the quality of the recording is so poor that there is insufficient
information with which to work or when there are too few common speech sounds suitable for
comparison.
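The five conclusions can be summarized as a simple decision rule. The sketch below is illustrative only and rests on stated assumptions: the examiner has already counted matching speech sounds and unexplained differences, a probable elimination is taken to mean between one and nineteen unexplained differences (the text gives no count for it), and the names Conclusion, classify, and sufficient_material are hypothetical.

```python
from enum import Enum

class Conclusion(Enum):
    POSITIVE_IDENTIFICATION = "positive identification"
    PROBABLE_IDENTIFICATION = "probable identification"
    POSITIVE_ELIMINATION = "positive elimination"
    PROBABLE_ELIMINATION = "probable elimination"
    NO_DECISION = "no decision"

THRESHOLD = 20  # minimum count stated in the text for a positive conclusion

def classify(similarities, unexplained_differences, sufficient_material=True):
    """Map the examiner's tallies onto one of the five standard conclusions.

    similarities:            speech sounds matching aurally and spectrographically
    unexplained_differences: differences with no accounting other than
                             different speakers
    sufficient_material:     False when the recording is too poor or there
                             are too few common speech sounds
    """
    if not sufficient_material:
        return Conclusion.NO_DECISION
    if similarities == 0 and unexplained_differences == 0:
        return Conclusion.NO_DECISION
    if unexplained_differences >= THRESHOLD:
        return Conclusion.POSITIVE_ELIMINATION
    if unexplained_differences > 0:
        # Assumption: any smaller number of unexplained differences
        # points toward elimination without reaching the threshold.
        return Conclusion.PROBABLE_ELIMINATION
    if similarities >= THRESHOLD:
        return Conclusion.POSITIVE_IDENTIFICATION
    return Conclusion.PROBABLE_IDENTIFICATION
```

In practice the tallies themselves come from the examiner's aural and spectrographic judgment; the rule only records how those tallies map onto a conclusion.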