VIII

APPENDIX 1

The following are summaries of studies of spectrographic voice identification and an FBI survey of forensic cases.

Greenwald, M., "The Effects of Decreased Frequency Bandwidth on Speaker Identification by Aural and Spectrographic Examination of Speech Samples", Master's Thesis, Michigan State University, 1979

Hall, M. C., "Spectrographic Analysis of Interspeaker and Intraspeaker variables of Professional Mimicry", Master's Thesis, Michigan State University, 1975

Hazen, B., "Effects of Different Phonetic Contexts on Spectrographic Speaker Identification", 54 J. Acoust. Soc. Am. 650, 1973

Hollien, H., & McGlone, R., "The Effect of Disguise on Voiceprint Identification", In the Proceedings of the Carnahan Crime Countermeasures Conference, University of Kentucky, University of Kentucky Press, Lexington, KY, 1976

Kersta, L. G., "Voiceprint Identification", 196 Nature Magazine 1253, Dec. 29, 1962

Reich, et al., "Effects of Selected Vocal Disguises upon Spectrographic Speaker Identification", 60 J. Acoust. Soc. Am. 919, 1976

Reich & Duke, "Effects of selected vocal disguises upon speaker identification by listening", 66 J. Acoust. Soc. Am. 1023, 1979

Smrkovski, L. L., "Collaborative Study of Speaker Identification by the Voiceprint Method", 58 J. AOAC 453, 1975

Smrkovski, L. L., "Study of Speaker Identification by Aural and Visual Examination of Non-Contemporary Speech Samples", 59 J. AOAC 927, 1976

Stevens, et al., "Speaker Authentication and Identification: A Comparison of Spectrographic and Auditory Presentations of Speech Material", 44 J. Acoust. Soc. Am. 1596, 1968

Tosi, et al., "Experiment on Voice Identification", 51 J. Acoust. Soc. Am. 2030, 1972

Tosi & Greenwald, "Voice Identification by Subjective Methods of Minority Group Voices", Paper presented at the 6th Meeting of the International Association of Voice Identification, New Orleans, La., 1978

Young, M. A., & Campbell, R. A., "Effects of Context on Talker Identification", 42 J. Acoust. Soc. Am. 1250, 1967

 

KERSTA

1962

 

Examiners:

8 high school girls

Training duration:

1 week

 

 

 

 

Method:

visual

Speaker population:

123

 

 

 

 

Number of words:

10 words excerpted from sentences

Context type:

isolated

random context

 

 

 

 

Temporal sequence:

contemporary

Type of trial:

closed

 

 

 

 

Total number of trials:

2000

 

 

 

 

 

 

Type of decision:

forced decisions

limited sample

limited time

random context

no aural examination

examiners lacked sufficient experience

Results:

closed trials

range of errors for false ID -

0.35 to 1.0%

10 words excerpted

0.00 to 2.0%

 

 

YOUNG & CAMPBELL

1967

 

Examiners:

7 PhD candidates in ASC

3 assistant professors in ASC

Training duration:

1 week

 

 

 

 

Method:

visual

Speaker population:

5 adult males

 

 

 

 

Number of words:

2 words (you/it) in isolation & excerpted from 4 short sentences

Context type:

1 word in isolation

2 words from random context

 

 

 

 

Temporal sequence:

contemporary

Type of trial:

closed

 

 

 

 

Total number of trials:

1046

 

 

 

 

 

 

Type of decision:

forced decisions

limited sample

random context

no aural examination

examiners not trained

Results:

closed trials

range of errors for false ID -

"you" in isolation:

10.4 to 18.0%

"it" in isolation:

22.7 to 33.0%

"you/it" from random context in trial 1 of 15:

mean error: 62.7%

 

 

 

STEVENS

1968

 

Examiners:

college students

6 in the open trials

4 in the closed trials

Training duration:

1 week

 

 

 

 

Method:

aural vs. visual, but not combined

Speaker population:

24 males

 

 

 

 

Number of words:

catalogue of 11 words in different random order - only 1 word used in most trials

Context type:

1 to 4 words

 

 

 

 

Temporal sequence:

non-contemporary

(1 week)

Type of trial:

closed & open

 

 

 

 

Total number of trials:

216

 

 

 

 

 

 

Type of decision:

forced decisions

limited sample

(1 to 4 words)

random context

no aural examination

examiners not trained

Results:

open trials:

range of errors for false ID for 4 examiners/1 word visual trials -

31.0 to 47.0%

aural trials -

6.0 to 8.0%

closed trials:

range of errors for false ID -

1 - 4 discrete words visual trials

20.0 to 30.0%

aural trials

5.0 to 18.0%

 

 

 

TOSI ET AL

1968 - 1970

 

Examiners:

Training duration:

1 month

 

 

 

 

Method:

visual

Speaker population:

250 males randomly selected from a population of 25,000

 

 

 

 

Number of words:

6 & 9 words

Context type:

isolated, fixed and

random context

 

 

 

 

Temporal sequence:

contemporary & noncontemporary (1 month)

Type of trial:

closed & open

 

 

 

 

Total number of trials:

34,992

 

 

 

 

 

 

Type of decision:

forced decisions, but allowed to rate confidence level

limited sample

limited time

no aural examination

examiners lacked sufficient experience

Results:

range of errors for all trials false ID -

0.51 to 6.43%

when only 'fairly certain' and 'almost certain' decisions are combined, the false ID error rate falls to 2.4%

 

 

 

HAZEN

1972

 

Examiners:

college students

(7 panels of 2)

Training duration:

5 lectures and 3 practice sessions

 

 

 

 

Method:

visual

Speaker population:

60 males

 

 

 

 

Number of words:

5 words in the same context, 5 words physically excerpted from random conversation

Context type:

fixed and random context

 

 

 

 

Temporal sequence:

contemporary

Type of trial:

closed & open

 

 

 

 

Total number of trials:

280

 

 

 

 

 

 

Type of decision:

forced decisions

limited sample (5 words)

no aural examination

random & fixed context

examiners lacked sufficient experience

used the most dissimilar spectrographic utterances

compared sounds from totally different words

studying changing phonetic context

examiners could not evaluate effects of coarticulation due to questionable word boundaries

Results:

closed trials

errors for false ID -

fixed context

range: 10.0 to 30.0%

mean: 20.0%

random context

range: 50.0 to 90.0%

mean: 74.29%

open trials

errors for false ID -

fixed context

range: 16 to 66%

mean: 42.86%

random context

range: 66 to 100%

mean: 83%

 

 

SMRKOVSKI

1974

 

Examiners:

7 police & private

Training duration:

more than 2 years experience/less than 2 years experience

 

 

 

 

Method:

combined aural and visual

Speaker population:

7 male & female

 

 

 

 

Number of words:

38 to 54 words

Context type:

fixed context

 

 

 

 

Temporal sequence:

noncontemporary (1 week)

Type of trial:

open

 

 

 

 

Total number of trials:

84

 

 

 

 

 

 

Type of decision:

no forced decisions

allowed 1 to 5 conclusions

no limited time

aural & visual examination

trained and experienced examiners

Results:

open trials

trainees w/less than 2 yr experience:

false ID - 0.0%

false elim. 5.0%

no decision 25.0%

0.35 to 1.0%

examiners w/more than 2 yr experience:

false ID - 0.0%

false elim. 0.0%

no decision 22.0%

 

 

SMRKOVSKI

1975

 

Examiners:

12 scientists, police and private

Training duration:

novice: no training

trainee: < 2 yr

Professional: > 2 yr

 

 

 

 

Method:

combined aural and visual

Speaker population:

20 male & female

 

 

 

 

Number of words:

9 words

Context type:

fixed context

 

 

 

 

Temporal sequence:

noncontemporary

Type of trial:

open

 

 

 

 

Total number of trials:

120

 

 

 

 

 

 

Type of decision:

no forced decisions

allowed 1 to 5 conclusions

no limited time

aural & visual examination

compared words in context - trainees, novices and experienced examiners

Results:

open trials:

errors novices

false ID 5.0%

false elim 25.0%

no decision 2.5%

trainee

false ID 0.0%

false elim 0.0%

no decision 2.5%

Professional

false ID 0.0%

false elim 0.0%

no decision 7.5%

 

HALL

1975

 

Examiners:

4 professional and 20 college graduates

Training duration:

IAVI certified voice identification examiner

 

 

 

 

Method:

combined aural and visual / visual only

Speaker population:

professional mimic and 6 celebrity voices

 

 

 

 

Number of words:

mimic (mean of 25 sec.), celebrities (mean of 35 min.)

Context type:

quasi-fixed and random context

 

 

 

 

Temporal sequence:

contemporary/ noncontemporary

Type of trial:

open

 

 

 

 

Total number of trials:

aural (20/examiner)

visual (200/examiner)

Type of decision:

same, different or undecided

5 IAVI classifications

 

 

 

 

Results:

Interspeaker variability does not exist between a mimicked, disguised voice and the natural voice of the subject mimicked. Intraspeaker variabilities are minute and not significant when comparing the mimic's voice and the natural voice of the mimic.

Aurally: The smaller the signal-to-noise ratio within the recording and the more similar the context, the greater the percentage of accuracy in distinguishing between speakers.

AURAL EXAMINATION:

Grand means:      RIGHT    WRONG    UNDEC.
Grad. students    0.74     0.18     0.08
Professional      0.92     0.082    0.0

 

HOLLIEN/McGLONE

1975-76

 

Examiners:

5 faculty

1 graduate student

Training duration:

"the authors were familiar with the 'voiceprint' method of speaker identification"

 

 

 

 

Method:

visual only (spectrograms were cut & mounted)

Speaker population:

25 faculty and graduate students of the University of Florida

 

 

 

 

Number of words:

7 words

Context type:

"I do not set the same store"

 

 

 

 

Temporal sequence:

contemporary

Type of trial:

open

 

 

 

 

Total number of trials:

25/examiner

 

 

 

 

 

 

Type of decision:

record a match/ indicate none was possible

Results:

". . . even skilled auditors such as these were unable to match correctly the disguised speech to the reference (normal) samples as much as 25% of the time . . . these groups were able to disguise their voices in such manners that their identification by the 'voiceprint' technique became little more than a matter of chance."

 

REICH ET AL

1976

 

Examiners:

2 PhD candidates in speech science

2 PhD candidates in speech pathology

Training duration:

3 courses in speech science plus previous experience with speech spectrograms: 4 weeks at 10-15 hr/wk

 

 

 

 

Method:

visual only (words excerpted and mounted)

Speaker population:

40 adult males (mean: 27.3 yrs)

 

 

 

 

Number of words:

9 words

Context type:

fixed context

 

 

 

 

Temporal sequence:

noncontemporary

Type of trial:

open

 

 

 

 

Total number of trials:

105 (7 matching tasks w/15 known & 15 unknown)

 

 

 

 

 

 

Type of decision:

1 to 5 certainty scale

Results:

The examiners were able to match speakers with a moderate degree of accuracy (55.67%) when there was no attempt to vocally disguise. Disguised speech significantly interfered with speaker identification. Further research is needed . . . in which the examiners may listen to the voice as well as view the spectrograms.

 

ROTHMAN

1977

 

Examiners:

30 listeners

6 visual examiners

Training duration:

none

 

 

 

 

Method:

Study I: Aural

Study II: Visual (0 to 8 kHz)

Speaker population:

12

 

 

 

 

Number of words:

four 2-second speech segments

Context type:

random context

 

 

 

 

Temporal sequence:

contemporary/ noncontemporary (1 wk)

Type of trial:

open

 

 

 

 

Total number of trials:

5 visual

38 aural

 

 

 

 

 

 

Type of decision:

same/different for each contemporary and noncontemporary

Results:

94% correct identifications were obtained for contemporary speech segments. 42% correct identifications were obtained for noncontemporary speech segments. 58.45% correct identifications were obtained when comparing different speakers. All examiners in pretest visual achieved 100% correct matching. Aural method is clearly superior to the spectrographic or 'voiceprint' method.

McGLONE, HOLLIEN & HOLLIEN

1977

 

Examiners:

4 phoneticians

Training duration:

experienced

 

 

 

 

Method:

visual measurement of formant fundamental frequency to obtain fo

Speaker population:

23 adult males

 

 

 

 

Number of words:

7 words ("I do not set the same store")

Context type:

fixed (normal & disguised) context

 

 

 

 

Temporal sequence:

contemporary

Type of trial:

 

 

 

 

 

Total number of trials:

46/phonetician

 

 

 

 

 

 

Type of decision:

 

Results:

A great amount of variability in the fo was found between normal and disguised speech. The mean bandwidth differences (f1, f2, f3) for the group were large and also demonstrated considerable variability. Phonetic means also differed.

 

HOULIHAN - Study I

1977

 

Examiners:

21 undergraduate students

Training duration:

series of lectures & discussions on phonetics, acoustics, and sound spectrography and speaker identification

 

 

 

 

Method:

visual only

Speaker population:

9 female, 5 male undergraduates - homogeneous age and geographic background

 

 

 

 

Number of words:

9 words

Context type:

fixed context: 5 voice conditions (normal, lowered, falsetto, whispered and muffled)

 

 

 

 

Temporal sequence:

contemporary

Type of trial:

open

 

 

 

 

Total number of trials:

18 matches

 

 

 

 

 

 

Type of decision:

same/different

Results:

correct identifications:

             F-voice   M-voice
normal       100%      95%
lowered      85%       95%
falsetto     95%       90%
whispered    5%        98%
muffled      75%       100%

range: 39 to 70% correct

mean: 58.8%

Std.D.: 8.7%

 

HOULIHAN - Study II

1977

 

Examiners:

7 students from Experimental phonetics

Training duration:

completion of Exp. I with feedback

 

 

 

 

Method:

visual only

Speaker population:

8 female, 8 male (mean age: 25.3 yrs)

 

 

 

 

Number of words:

8 words

Context type:

fixed context: "There's a bomb in the main post office"

 

 

 

 

Temporal sequence:

contemporary

Type of trial:

closed

 

 

 

 

Total number of trials:

16/examiner

 

 

 

 

 

 

Type of decision:

instructed to consider the sets in a particular order. All examiners considered undisguised before disguised

Results:

correct identifications:

             F-voice   M-voice
normal       71%       100%
lowered      85%       100%
falsetto     100%      67%
whispered    71%       71%
muffled      85%       100%

The results suggest that minimally trained examiners have little difficulty with spectrographic identification in closed, contemporary, undisguised trials. Results do not suggest that female voices are more difficult to identify than male voices.

 

TOSI ET AL

1979

 

Examiners:

professional and students

Training duration:

IAVI certified voice examiners and 2 weeks of training, respectively

 

 

 

 

Method:

aural only, visual only and aural/visual combined

Speaker population:

Chicano (25 female and 25 male)

 

 

 

 

Number of words:

four sentences approximately 2.4 seconds in Spanish

Context type:

fixed context

 

 

 

 

Temporal sequence:

noncontemporary

Type of trial:

open - randomized

 

 

 

 

Total number of trials:

600/examiner

 

 

 

 

 

 

Type of decision:

same, different, or no opinion; each decision qualified by a self-confidence percentage from 51 to 100%

Results:

For both student and professional examiners, mean error rates for elimination and identification were greater for noisy samples than for quiet samples. However, the professional examiners' errors were due to aural-only examinations, whereas spectrographic/aural examinations produced 0.0% errors. The 'no opinion' option was used more by professional examiners.

 

REICH

1979

 

Examiners:

24 undergraduate students, 3 doctoral students, 3 professors of Speech and Hearing Science

Training duration:

brief lecture; 120 discrimination trials identical to the experiment

 

 

 

 

Method:

aural only

Speaker population:

40 adult males (mean age: 27.3 yrs)

 

 

 

 

Number of words:

9 words (it, is, on, you, and, the, I, to, me)

Context type:

fixed context

 

 

 

 

Temporal sequence:

noncontemporary (2 weeks +)

Type of trial:

open

 

 

 

 

Total number of trials:

18 matches

 

 

 

 

 

 

Type of decision:

same/different (1 to 5 certainty)

Results:

Both groups were able to discriminate speakers with moderately high degrees of accuracy, 92% correct for undisguised. Disguised trials ranged from 59 to 81% depending on the disguise. Recommended further research to study the combined aural/spectrographic method.

 

GREENWALD

1979

 

Examiners:

3 professional, 5 trainees (less than 2 years experience)

Training duration:

professionals: 8 yrs each

trainees: < 2 yrs

 

 

 

 

Method:

aural only, visual only and aural/visual combined

Speaker population:

12 female, 12 male; American midwest dialect

 

 

 

 

Number of words:

24 words

Context type:

fixed context

 

 

 

 

Temporal sequence:

noncontemporary

Type of trial:

open

 

 

 

 

Total number of trials:

192 discrimination types

Type of decision:

the five IAVI alternatives

 

 

 

 

Results:

Professional examiners produced no errors of false identification or elimination. All eight examiners together made 1536 decisions. Restricting the bandwidth (240-2K, 240-2.5K, 240-3K, and 240-4K) does not increase the errors but does increase the percentage of 'no decisions'. Examiner training has an important effect on the error rate. Trainees produced errors as follows: 6.1% false identification and 4.1% false elimination for all trials. However, at 240-4 kHz, they produced 0.0% errors of false identification or elimination.
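The 1536 decisions reported for Greenwald's study are consistent with the figures listed above, if one assumes (this is not stated in the summary) that each of the eight examiners completed all 192 discrimination trials once. A minimal arithmetic check under that assumption:

```python
# Figures taken from the Greenwald (1979) summary above.
professionals = 3
trainees = 5
trials_per_examiner = 192  # assumption: every examiner completed all 192 trials

examiners = professionals + trainees
total_decisions = examiners * trials_per_examiner

print(examiners)        # 8
print(total_decisions)  # 1536
```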

 

KOENIG - FBI SURVEY

1986

 

Examiners:

Federal Bureau of Investigation voice identification examiners

Training duration:

minimum of 2 yrs experience, completion of at least 100 actual voice comparison cases, formal approval by other trained examiners

 

 

 

 

Method:

combined aural/visual method

Speaker population:

actual criminal cases

 

 

 

 

Number of words:

varied with each case

Context type:

 

 

 

 

 

Temporal sequence:

noncontemporary

Type of trial:

open

 

 

 

 

Total number of trials:

2000 forensic comparisons

 

 

 

 

 

 

Type of decision:

very similar

very dissimilar

no decision (low confidence)

Results:

                  number   percent
no/low conf.      1304     65.2
elimination       378      18.9
identification    318      15.9
errors
false elim.       2        0.53
false id.         1        0.31
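The percentages in the survey results can be reproduced from the counts. The sketch below is only an arithmetic check: it shows that the three decision categories are percentages of all 2000 comparisons, while the two error figures match only if they are taken as percentages of the eliminations and of the identifications respectively. That reading of the denominators is inferred from the numbers, not stated in the summary, and the helper name `pct` is purely illustrative.

```python
# Counts as listed in the FBI survey summary above.
total = 2000
no_or_low_conf = 1304
eliminations = 378
identifications = 318
false_elims = 2
false_ids = 1

def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)

# The three decision categories account for every comparison.
assert no_or_low_conf + eliminations + identifications == total

print(pct(no_or_low_conf, total))        # 65.2
print(pct(eliminations, total))          # 18.9
print(pct(identifications, total))       # 15.9

# Error rates relative to decisions of the same type (inferred denominators).
print(pct(false_elims, eliminations))    # 0.53
print(pct(false_ids, identifications))   # 0.31
```

Computed against all 2000 comparisons instead, the same error counts would give 0.1% and 0.05%, which is why the choice of denominator matters when comparing these figures with the laboratory studies above.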

 

 
