APPENDIX 1
The following are summaries of studies of spectrographic voice identification and an FBI survey of forensic cases.
Greenwald, M., "The Effects of Decreased Frequency Bandwidth on Speaker Identification by Aural and Spectrographic Examination of Speech Samples", Master's thesis, Michigan State University, 1979
Hall, M. C., "Spectrographic Analysis of Interspeaker and Intraspeaker Variables of Professional Mimicry", Master's thesis, Michigan State University, 1975
Hazen, B., "Effects of Different Phonetic Contexts on Spectrographic Speaker Identification", 54 J. Acoust. Soc. Am. 650, 1973
Hollien, H., & McGlone, R., "The Effect of Disguise on Voiceprint Identification", in Proceedings of the Carnahan Crime Countermeasures Conference, University of Kentucky, University of Kentucky Press, Lexington, KY, 1976
Kersta, L. G., "Voiceprint Identification", 196 Nature 1253, Dec. 29, 1962
Reich et al., "Effects of Selected Vocal Disguises upon Spectrographic Speaker Identification", 60 J. Acoust. Soc. Am. 919, 1976
Reich & Duke, "Effects of Selected Vocal Disguises upon Speaker Identification by Listening", 66 J. Acoust. Soc. Am. 1023, 1979
Smrkovski, L. L., "Collaborative Study of Speaker Identification by the Voiceprint Method", 58 J. AOAC 453, 1975
Smrkovski, L. L., "Study of Speaker Identification by Aural and Visual Examination of Non-Contemporary Speech Samples", 59 J. AOAC 927, 1976
Stevens et al., "Speaker Authentication and Identification: A Comparison of Spectrographic and Auditory Presentations of Speech Material", 44 J. Acoust. Soc. Am. 1596, 1968
Tosi et al., "Experiment on Voice Identification", 51 J. Acoust. Soc. Am. 2030, 1972
Tosi & Greenwald, "Voice Identification by Subjective Methods of Minority Group Voices", paper presented at the 6th Meeting of the International Association of Voice Identification, New Orleans, La., 1978
Young, M. A., & Campbell, R. A., "Effects of Context on Talker Identification", 42 J. Acoust. Soc. Am. 1250, 1967
KERSTA
1962
Examiners: 8 high school girls
Training duration: 1 week
Method: visual
Speaker population: 123
Number of words: 10 words excerpted from sentences
Context type: isolated and random context
Temporal sequence: contemporary
Type of trial: closed
Total number of trials: 2000
Type of decision: forced decisions; limited sample; limited time; random context; no aural examination; examiners lacked sufficient experience
Results: closed trials, range of errors for false ID: 0.35 to 1.0%; 10 words excerpted: 0.00 to 2.0%
YOUNG & CAMPBELL
1967
Examiners: 7 PhD candidates in ASC; 3 assistant professors in ASC
Training duration: 1 week
Method: visual
Speaker population: 5 adult males
Number of words: 2 words ("you" and "it") in isolation and excerpted from 4 short sentences
Context type: 1 word in isolation; 2 words from random context
Temporal sequence: contemporary
Type of trial: closed
Total number of trials: 1046
Type of decision: forced decisions; limited sample; random context; no aural examination; examiners not trained
Results: closed trials, range of errors for false ID: "you" in isolation: 10.4 to 18.0%; "it" in isolation: 22.7 to 33.0%; "you"/"it" from random context in trial 1 of 15: mean error 62.7%
STEVENS
1968
Examiners: college students (6 in the open trials, 4 in the closed trials)
Training duration: 1 week
Method: aural vs. visual, but not combined
Speaker population: 24 males
Number of words: catalogue of 11 words in different random order; only 1 word used in most trials
Context type: 1 to 4 words
Temporal sequence: noncontemporary (1 week)
Type of trial: closed & open
Total number of trials: 216
Type of decision: forced decisions; limited sample (1 to 4 words); random context; no aural examination; examiners not trained
Results: open trials, range of errors for false ID for 4 examiners/1 word: visual trials 31.0 to 47.0%; aural trials 6.0 to 8.0%. Closed trials, range of errors for false ID with 1 to 4 discrete words: visual trials 20.0 to 30.0%; aural trials 5.0 to 18.0%
TOSI ET AL
1968 - 1970
Examiners: 29 of various backgrounds
Training duration: 1 month
Method: visual
Speaker population: 250 males randomly selected from a population of 25,000
Number of words: 6 & 9 words
Context type: isolated, fixed and random context
Temporal sequence: contemporary & noncontemporary (1 month)
Type of trial: closed & open
Total number of trials: 34,992
Type of decision: forced decisions, but allowed to rate confidence level; limited sample; limited time; no aural examination; examiners lacked sufficient experience
Results: range of errors for false ID across all trials: 0.51 to 6.43%; when only the "fairly certain" and "almost certain" decisions are combined, the false ID error rate drops to 2.4%
HAZEN
1972
Examiners: college students (7 panels of 2)
Training duration: 5 lectures and 3 practice sessions
Method: visual
Speaker population: 60 males
Number of words: 5 words in the same context; 5 words physically excerpted from random conversation
Context type: fixed and random context
Temporal sequence: contemporary
Type of trial: closed & open
Total number of trials: 280
Type of decision: forced decisions; limited sample (5 words); no aural examination; random & fixed context; examiners lacked sufficient experience; used the most dissimilar spectrographic utterances; compared sounds from totally different words; studied changing phonetic context; examiners could not evaluate effects of coarticulation due to questionable word boundaries
Results: closed trials, errors for false ID: fixed context range 10.0 to 30.0%, mean 20.0%; random context range 50.0 to 90.0%, mean 74.29%. Open trials, errors for false ID: fixed context range 16 to 66%, mean 42.86%; random context range 66 to 100%, mean 83%
SMRKOVSKI
1974
Examiners: 7 police & private
Training duration: more than 2 years' experience / less than 2 years' experience
Method: combined aural and visual
Speaker population: 7 male & female
Number of words: 38 to 54 words
Context type: fixed context
Temporal sequence: noncontemporary (1 week)
Type of trial: open
Total number of trials: 84
Type of decision: no forced decisions; allowed 1 to 5 conclusions; no time limit; aural & visual examination; trained and experienced examiners
Results: open trials: trainees with less than 2 yr experience: false ID 0.0%, false elim. 5.0%, no decision 25.0%, 0.35 to 1.0%; examiners with more than 2 yr experience: false ID 0.0%, false elim. 0.0%, no decision 22.0%
SMRKOVSKI
1975
Examiners: 12 scientists, police and private
Training duration: novice: no training; trainee: < 2 yr; professional: > 2 yr
Method: combined aural and visual
Speaker population: 20 male & female
Number of words: 9 words
Context type: fixed context
Temporal sequence: noncontemporary
Type of trial: open
Total number of trials: 120
Type of decision: no forced decisions; allowed 1 to 5 conclusions; no time limit; aural & visual examination; compared words in context; trainees, novices and experienced examiners
Results: open trials, errors: novices: false ID 5.0%, false elim. 25.0%, no decision 2.5%; trainees: false ID 0.0%, false elim. 0.0%, no decision 2.5%; professionals: false ID 0.0%, false elim. 0.0%, no decision 7.5%
HALL
1975
Examiners: 4 professional and 20 college graduates
Training duration: IAVI certified voice identification examiner
Method: combined aural and visual / visual only
Speaker population: professional mimic and 6 celebrity voices
Number of words: mimic (mean of 25 sec.), celebrities (mean of 35 min.)
Context type: quasi-fixed and random context
Temporal sequence: contemporary / noncontemporary
Type of trial: open
Total number of trials: aural (20/examiner); visual (200/examiner)
Type of decision: same, different or undecided; 5 IAVI classifications
Results: Interspeaker variability does not exist between a mimicked, disguised voice and the natural voice of the subject mimicked. Intraspeaker variability is minute and not significant when comparing the mimicked voices with the mimic's natural voice. Aurally: the smaller the signal-to-noise ratio within the recording and the more similar the context, the greater the percentage of accuracy in distinguishing between speakers.
Aural examination, grand means:
                   Right   Wrong   Undecided
  Grad. students   0.74    0.18    0.08
  Professional     0.92    0.082   0.0
HOLLIEN/McGLONE
1975-76
Examiners: 5 faculty, 1 graduate student
Training duration: "the authors were familiar with the 'voiceprint' method of speaker identification"
Method: visual only (spectrograms were cut & mounted)
Speaker population: 25 faculty and graduate students of the University of Florida
Number of words: 7 words
Context type: "I do not set the same store"
Temporal sequence: contemporary
Type of trial: open
Total number of trials: 25/examiner
Type of decision: record a match / indicate none was possible
Results: ". . . even skilled auditors such as these were unable to match correctly the disguised speech to the reference (normal) samples as much as 25% of the time . . . these groups were able to disguise their voices in such manners that their identification by the 'voiceprint' technique became little more than a matter of chance."
REICH ET AL
1976
Examiners: 2 PhD candidates in speech science; 2 PhD candidates in speech pathology
Training duration: 3 courses in speech science plus previous experience with speech spectrograms; 4 weeks at 10-15 hr/wk
Method: visual only (words excerpted and mounted)
Speaker population: 40 adult males (mean age: 27.3 yrs)
Number of words: 9 words
Context type: fixed context
Temporal sequence: noncontemporary
Type of trial: open
Total number of trials: 105 (7 matching tasks with 15 known & 15 unknown)
Type of decision: 1 to 5 certainty scale
Results: The examiners were able to match speakers with a moderate degree of accuracy (55.67%) when there was no attempt at vocal disguise. Disguised speech significantly interfered with speaker identification. Further research is needed . . . in which the examiners may listen to the voice as well as view the spectrograms.
ROTHMAN
1977
Examiners: 30 listeners; 6 visual examiners
Training duration: none
Method: Study I: aural; Study II: visual (0 to 8 kHz)
Speaker population: 12
Number of words: four 2-second speech segments
Context type: random context
Temporal sequence: contemporary / noncontemporary (1 wk)
Type of trial: open
Total number of trials: 5 visual; 38 aural
Type of decision: same/different, for each of the contemporary and noncontemporary conditions
Results: 94% correct identifications were obtained for contemporary speech segments. 42% correct identifications were obtained for noncontemporary speech segments. 58.45% correct identifications were obtained when comparing different speakers. All examiners achieved 100% correct matching in the visual pretest. The aural method is clearly superior to the spectrographic or "voiceprint" method.
McGLONE, HOLLIEN & HOLLIEN
1977
Examiners: 4 phoneticians
Training duration: experienced
Method: visual measurement of formant frequencies and fundamental frequency (fo)
Speaker population: 23 adult males
Number of words: 7 words ("I do not set the same store")
Context type: fixed (normal & disguised) context
Temporal sequence: contemporary
Type of trial:
Total number of trials: 46/phonetician
Type of decision:
Results: A great amount of variability in fo was found between normal and disguised speech. The mean bandwidth differences (f1, f2, f3) for the group were large and also demonstrated considerable variability. Phonetic means also differed.
HOULIHAN - Study I
1977
Examiners: 21 undergraduate students
Training duration: a series of lectures and discussions on phonetics, acoustics, sound spectrography, and speaker identification
Method: visual only
Speaker population: 9 female, 5 male undergraduates; homogeneous age and geographic background
Number of words: 9 words
Context type: fixed context; 5 voice conditions (normal, lowered, falsetto, whispered and muffled)
Temporal sequence: contemporary
Type of trial: open
Total number of trials: 18 matches
Type of decision: same/different
Results: correct identifications:
               F-voice   M-voice
  normal       100%      95%
  lowered      85%       95%
  falsetto     95%       90%
  whispered    5%        98%
  muffled      75%       100%
Range: 39 to 70% correct; mean: 58.8%; standard deviation: 8.7%
HOULIHAN - Study II
1977
Examiners: 7 experimental phonetics students
Training duration: completion of Study I with feedback
Method: visual only
Speaker population: 8 female, 8 male (mean age: 25.3 yrs)
Number of words: 8 words
Context type: fixed context: "There's a bomb in the main post office"
Temporal sequence: contemporary
Type of trial: closed
Total number of trials: 16/examiner
Type of decision: instructed to consider the sets in a particular order; all examiners considered undisguised before disguised sets
Results: correct identifications:
               F-voice   M-voice
  normal       71%       100%
  lowered      85%       100%
  falsetto     100%      67%
  whispered    71%       71%
  muffled      85%       100%
The results suggest that minimally trained examiners have little difficulty with spectrographic identification in closed, contemporary, undisguised trials. The results do not suggest that female voices are more difficult to identify than male voices.
TOSI ET AL
1979
Examiners: professionals and students
Training duration: IAVI certified voice examiners and 2 weeks of training, respectively
Method: aural only, visual only and aural/visual combined
Speaker population: Chicano (25 female and 25 male)
Number of words: four sentences in Spanish (approximately 2.4 seconds)
Context type: fixed context
Temporal sequence: noncontemporary
Type of trial: open, randomized
Total number of trials: 600/examiner
Type of decision: same, different, no opinion; qualified by a percentage of self-confidence from 51 to 100%
Results: For both student and professional examiners, the mean percentage of errors of elimination and identification was greater for noisy samples than for quiet samples; however, the professional examiners' errors were due to aural-only examinations, whereas combined spectrographic/aural examinations produced 0.0% errors. The "no opinion" option was used more by professional examiners.
REICH
1979
Examiners: 24 undergraduate students, 3 doctoral students, 3 professors of Speech and Hearing Science
Training duration: brief lecture; 120 discrimination trials identical to the experiment
Method: aural only
Speaker population: 40 adult males (mean age: 27.3 yrs)
Number of words: 9 words (it, is, on, you, and, the, I, to, me)
Context type: fixed context
Temporal sequence: noncontemporary (2+ weeks)
Type of trial: open
Total number of trials: 18 matches
Type of decision: same/different (1 to 5 certainty)
Results: Both groups were able to discriminate speakers with moderately high accuracy: 92% correct for undisguised speech. Accuracy on disguised trials ranged from 59 to 81%, depending on the disguise. The authors recommended further research on the combined aural/spectrographic method.
GREENWALD
1979
Examiners: 3 professionals, 5 trainees (less than 2 years' experience)
Training duration: professionals: 8 yrs each; trainees: < 2 yrs
Method: aural only, visual only and aural/visual combined
Speaker population: 12 female, 12 male; American Midwest dialect
Number of words: 24 words
Context type: fixed context
Temporal sequence: noncontemporary
Type of trial: open
Total number of trials: 192 discrimination types
Type of decision: the five IAVI alternatives
Results: Professional examiners produced no errors of false identification or false elimination; all eight examiners made 1536 decisions. Restricted bandwidths (240-2K, 240-2.5K, 240-3K, and 240-4K Hz) do not increase errors but do increase the percentage of "no decisions". Examiner training has a strong effect on error rate: trainees produced 6.1% false identifications and 4.1% false eliminations across all trials, although at 240-4K Hz they produced 0.0% errors of false identification or elimination.
KOENIG - FBI SURVEY
1986
Examiners: Federal Bureau of Investigation voice identification examiners
Training duration: minimum of 2 yrs experience, completion of at least 100 actual voice comparison cases, formal approval by other trained examiners
Method: combined aural/visual method
Speaker population: actual criminal cases
Number of words: varied with each case
Context type:
Temporal sequence: noncontemporary
Type of trial: open
Total number of trials: 2000 forensic comparisons
Type of decision: very similar; very dissimilar; no decision (low confidence)
Results:
                             Number   Percent
  No decision/low conf.      1304     65.2
  Elimination                378      18.9
  Identification             318      15.9
  Errors:
    False elimination        2        0.53
    False identification     1        0.31
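The error percentages in the Koenig survey appear to be computed against the number of decisions of the matching type rather than against all 2000 comparisons: 2 false eliminations out of 378 eliminations is about 0.53%, and 1 false identification out of 318 identifications is about 0.31%, whereas 2 out of 2000 would be only 0.1%. The short Python sketch below reproduces the reported figures from the counts. The choice of denominators is an inference from the arithmetic, not something the survey summary states, and the names used are illustrative only.

```python
# Counts as reported in the Koenig (1986) FBI survey summary.
TOTAL = 2000
outcomes = {
    "no decision / low confidence": 1304,
    "elimination": 378,
    "identification": 318,
}


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of all 2000 comparisons falling into each outcome category.
shares = {name: pct(n, TOTAL) for name, n in outcomes.items()}

# Assumption (inferred from the arithmetic): each error rate uses the matching
# decision type as its base, i.e. false eliminations / eliminations and
# false identifications / identifications.
false_elim_rate = pct(2, outcomes["elimination"])
false_id_rate = pct(1, outcomes["identification"])

if __name__ == "__main__":
    # The three outcome categories account for every comparison.
    assert sum(outcomes.values()) == TOTAL
    for name, share in shares.items():
        print(f"{name}: {share}%")
    print(f"false elimination: {false_elim_rate}%")
    print(f"false identification: {false_id_rate}%")
```

Running it prints 65.2%, 18.9% and 15.9% for the three outcome categories and 0.53% and 0.31% for the two error rates, matching the survey table; dividing the errors by 2000 instead would not reproduce those values.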