Criteria for Acoustic-Phonetic Segmentation and Word Labelling in the Australian National Database of Spoken Language

Karen Croot and Belinda Taylor

Speech, Hearing and Language Research Centre,
Macquarie University,

1. Introduction

A database of spoken Australian English has been developed under the Australian National Database of Spoken Language (ANDOSL) initiative (Millar et al., 1990a,b). The purpose of the database is to provide a representative sample of spoken Australian English. This sample includes data from speakers born in Australia as well as speakers born in countries with other language backgrounds. The project is a collaboration between four Australian laboratories: the Speech, Hearing and Language Research Centre (SHLRC), Macquarie University; Australian Hearing Services, Sydney; the Department of Electrical Engineering, Sydney University; and the Computer Sciences Laboratory, Australian National University.

Over the last couple of years, phoneticians at SHLRC have been responsible for the annotation of the native Australian speech data that has been collected as part of the ANDOSL project. The database contains speech data from 120 Australian-born speakers. These speakers were evenly divided between the sexes and three age groups, 18-30, 31-45, and 46+, and there is an even representation of general, cultivated and broad Australian speech. The sampled speech data for each speaker consists of 200 sentences from the Spoken Corpus Recordings in British English (SCRIBE) materials, 25 vowels mainly in the hVd environment, 13 digits, 8 practice sentences, a conversational "map task" and 32 map task locations.

The database also contains hand-labelled word-level annotation files for 92 native Australian speakers (50 sentences, the digits, and the hVd words per speaker). Segmentation and labelling is accomplished using the speech signal processing package Waves+. The conventions for word-level labelling in ANDOSL are based on the criteria for acoustic-phonetic labelling which were established at SHLRC and used to label data from ten adult Australian speakers at the acoustic-phonetic level (Croot et al., 1992). This document describes the criteria for acoustic-phonetic and word-level labelling developed at SHLRC over the past four years.

2. Acoustic-Phonetic Labelling

Acoustic-phonetic labelling involves segmenting the speech signal in terms of phonetic characteristics such as stop occlusion, frication, nasality and glottalisation (Barry & Fourcin, 1992). In principle, segments which have the same acoustic-phonetic characteristics are given the same label, and each assigned label links the acoustic-phonetic segment to the phonemic entity it represents. The particular difficulties of placing temporally-based labels on a continuous speech signal are considered in more detail by Barry & Fourcin (1992).

Although the acoustic-phonetic labelling carried out at SHLRC employs the above principles in most respects, in three significant ways it does not. Firstly, some allophonic realisations of a phoneme are given the same label, despite the acoustic-phonetic differences, if these differences are entirely predictable from context. Thus the clear and dark allophones of /l/ are both assigned the label [l] for economy. Secondly, phenomena in the speech signal which are entirely predictable from context are not labelled as discrete segments because, if required, these may be retrieved subsequently using the database interrogation system mu+. Thus, as "linking" [w] between words always occurs in the transition from a rounded vowel to the following vowel it is not labelled as a separate segment. In contrast, "linking" [r] is always labelled because it is not obligatory and therefore not predictable (a glottal stop may occur instead in any context where a linking [r] does).

The final difference in our acoustic-phonetic labelling from the principles developed by Barry & Fourcin (1992) and outlined above is that on occasions we may give a single acoustic-phonetic segment a label which assigns it to two segments at the broad phonetic level. This occurs when there is no evidence in the physical speech signal for the separation of two perceptible phonetic events. These "combined" labels are frequently assigned to geminate nasals and fricatives, or in any case where two acoustic events sufficiently overlap so that there is no clear acoustic or spectral evidence for their separation (as in the labelling of dark [l] and the preceding vowel).

After acoustic-phonetic labelling, the string of broad phonetic labels can be semi-automatically derived by matching the orthographic string to the acoustic-phonetic labels using a text-to-speech system (McVeigh & Harrington, 1992; Harrington, Cassidy, Fletcher & McVeigh, 1993). At this broad phonetic level the distinction between segments which could not be separated at the acoustic-phonetic level is restored. An acoustic-phonetic segment cannot be assigned to more than two broad phonetic segments using this strategy, so if three acoustic events overlap in the one segment of the speech signal (as in three abutting stop closures with no evidence for their separation), one arbitrary boundary is placed.
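The two-segment limit on combined labels can be illustrated with a small Python sketch. It is not the text-to-speech matching procedure itself: the inventory below is a hypothetical subset of the label set, the function name is invented for illustration, and a combined label is assumed to be written as two single labels run together (as in [ss] or [bb]).

```python
# Hypothetical subset of the single-segment label inventory (illustration only).
SINGLE_LABELS = {"p", "t", "k", "b", "d", "g", "m", "n", "N",
                 "f", "v", "s", "z", "S", "Z", "l"}


def expand_combined(label, inventory=SINGLE_LABELS):
    """Return the broad phonetic labels covered by one acoustic-phonetic label.

    A label found in the inventory maps to itself. Any other label is treated
    as a combined label and must split into exactly two inventory labels;
    a label that cannot be split this way raises ValueError.
    """
    if label in inventory:
        return [label]
    for i in range(1, len(label)):
        left, right = label[:i], label[i:]
        if left in inventory and right in inventory:
            return [left, right]
    raise ValueError(f"cannot expand label into one or two segments: {label!r}")
```

Under these assumptions a label such as "ptk" is rejected, which mirrors the rule that three overlapping events require an arbitrary boundary rather than a three-way combined label.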

The criteria for boundary placement in segmenting the speech signal have been adapted broadly from those used in the SCRIBE project (Hieronymus et al., 1990). Our label set utilises a machine-readable phonetic alphabet for the convenience of being able to enter labels from a standard keyboard, and so that labels files are compatible with software that recognises ASCII characters. Our full label set is given in the Appendix. The criteria for selecting a particular label and locating the boundaries of a segment are described below.

3. Acoustic-Phonetic Labelling Criteria

Our segmentation criteria are described in turn for stops, fricatives, nasals, liquids and glides, and vowels. Examples of specific labelling conventions in particular environments are also provided. A segment is defined by an initial boundary at the left side of the segment and a final boundary at the right.

The initial boundary of an utterance-initial segment is marked with [H#]. Pauses are labelled as [#], and the end of an utterance is indicated by the final boundary of the final segment.

3.1 Oral Stops: the Phonemes /p t k b d g/

The occlusion and release phases of an oral stop are labelled separately. The labels [p], [t], [k], [b], [d] and [g] are used for the occlusion phase according to place of articulation. The release is labelled [H] unless a vowel-like segment is produced after the initial frication as part of the release, in which case the label [HV] is used.

The initial boundary of the occlusion is placed where there is an abrupt drop in the energy of a preceding fricative or sonorant at all frequencies, or at the drop in energy above 300-500 Hz where voicing continues into the occlusion. The initial boundary of an utterance-initial stop is placed at the onset of voicing in the case of a voiced stop, or at an arbitrary point 60 msec before the beginning of the release phase in voiceless stops. Likewise, the final boundary of an unreleased, utterance-final stop is placed at the offset of voicing if the occlusion is voiced, or at an arbitrary 60 msec after the preceding segment.
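The utterance-edge conventions above amount to a simple rule, sketched here in Python. The function and parameter names are hypothetical, times are assumed to be in seconds, and clamping the result at the start of the file is an assumption of the sketch rather than a stated convention.

```python
ARBITRARY_OFFSET_S = 0.060  # the arbitrary 60 msec offset, in seconds


def initial_boundary_utterance_initial_stop(voiced, release_start, voicing_onset=None):
    """Initial boundary (seconds) of an utterance-initial stop occlusion."""
    if voiced:
        if voicing_onset is None:
            raise ValueError("a voiced stop needs the time of voicing onset")
        return voicing_onset
    # Voiceless stop: 60 msec before the release begins. Clamping at 0.0
    # (the start of the file) is an assumption of this sketch.
    return max(0.0, release_start - ARBITRARY_OFFSET_S)


def final_boundary_unreleased_final_stop(voiced, preceding_segment_end, voicing_offset=None):
    """Final boundary (seconds) of an unreleased, utterance-final stop."""
    if voiced:
        if voicing_offset is None:
            raise ValueError("a voiced occlusion needs the time of voicing offset")
        return voicing_offset
    # Voiceless occlusion: 60 msec after the end of the preceding segment.
    return preceding_segment_end + ARBITRARY_OFFSET_S
```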

The final boundary of the occlusion is placed at the beginning of the release. When the release is absent (typically after voiced stops, before fricatives, and when the constriction of the stop is incomplete), the final boundary is placed where a marked increase in energy at or above 300-500 Hz occurs at the onset of the following segment. The final boundary of the release is determined by the initial boundary of the following segment.

Note that where voiced and voiceless segments are adjacent, the change in voicing does not always coincide with the locations for boundary placement described above. In these cases we have adopted the convention of placing the occlusion boundaries around the area of low spectral energy rather than at points of voice onset or offset. Similarly in labelling fricatives (see below), the beginning or end of voicing is not the primary cue to boundary location.

In continuous speech (especially between fricatives), oral stops are frequently realised by partial constriction of the oral cavity rather than occlusion. Oral stops realised in this manner are labelled [pH] [tH] [kH] [bH] [dH] [gH].

Oral stops simultaneously produced with a glottal stop are labelled with the stop symbol plus the diacritic ^. The release, if present, is labelled as [H] or [HV]. Evidence of creaky voice in adjacent segments is also indicated by the diacritic ^.

A stop closure is often optional between a nasal and a fricative. If a closure is visible on the spectrogram, it is labelled according to place of articulation. English orthography may be misleading here, suggesting a closure should be present when it may not be, or failing to suggest a closure which is produced. Thus, "cents/sense" may be [s E n t s] or [s E n s], and "triumph" may be [t r ai V m p f] or [t r ai V m f].

Where there is any doubt about whether a segment has been produced or elided, it is always labelled as if produced; and even if there is no spectral or waveform evidence for a boundary, a double label should be used. Similarly, where geminate stop-closures occur, they are always labelled as double stops. For example, "bulb blew" may be transcribed as [b V l bb H l u:] or [b V l b H b H l u:] etc, but never as [b V l b H l u:].
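A consistency check for this convention might look like the following Python sketch, which counts the stop occlusions covered by a label string so that the count can be compared with the number of stops expected from the orthography. The function name is hypothetical; double labels are assumed to be written as two stop symbols run together, partially constricted stops as a stop symbol plus H, and glottalised labels carrying the diacritic ^ are not handled.

```python
STOP_SYMBOLS = set("ptkbdg")


def count_stop_occlusions(labels):
    """Count the stop occlusions covered by a sequence of labels.

    A single stop label counts once and a double label such as "bb" counts
    twice. A stop realised by partial constriction ("bH") counts once.
    Release labels ("H", "HV") and all other labels count as zero.
    """
    total = 0
    for label in labels:
        # Strip the H of a partially constricted stop such as "bH".
        core = label[:-1] if len(label) > 1 and label.endswith("H") else label
        if core and all(ch in STOP_SYMBOLS for ch in core):
            total += len(core)
    return total


# "bulb blew" contains three stops, so only the first two transcriptions pass.
accepted_double = "b V l bb H l u:".split()
accepted_separate = "b V l b H b H l u:".split()
rejected = "b V l b H l u:".split()
```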

3.2 Fricatives: The Phonemes /f v T D s z S Z h/

The initial and final boundaries of the fricatives [s], [z], [S] and [Z] are determined by the onset and offset of strong frication, which occurs above 4000 Hz for [s] and [z], and above 3000 Hz for [S] and [Z].

The fricatives [f], [v], [T], [D] and [h] show decreased intensity relative to most other segments, and weak frication at 5000-7000 Hz for [f] and [v]; 5000 Hz and above for [T] and [D]; and above 1000 Hz for [h]. If the preceding segment is a sonorant, the boundary is placed where energy above 500 Hz or F2 decreases. The fricatives /f T D/ may also be realised as an occlusion and release, rather than a continuous stream of fricative noise, in which case the labels [f-], [T-] and [D-] are used for the occlusion and [fH], [TH] and [DH] for the release. When the voiced allophone of /h/ is produced, the label [hV] is used instead of [h]. Any transition from a fricative to a following nasal is labelled [On], [Om] or [ON] according to the place of articulation of the nasal.
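For reference, the frequency regions quoted above can be collected in a small lookup table. The following Python sketch is only a convenience for checking where frication is expected on the spectrogram; the names are hypothetical, and treating the band edges as inclusive is an assumption.

```python
# Frequency regions (Hz) in which frication is expected, as (lower, upper).
# An upper value of None means "this frequency and above".
FRICATION_BAND_HZ = {
    "s": (4000, None), "z": (4000, None),   # strong frication
    "S": (3000, None), "Z": (3000, None),   # strong frication
    "f": (5000, 7000), "v": (5000, 7000),   # weak frication
    "T": (5000, None), "D": (5000, None),   # weak frication
    "h": (1000, None),                      # weak frication
}


def in_frication_band(label, frequency_hz):
    """Return True if frequency_hz lies in the band expected for this fricative."""
    lower, upper = FRICATION_BAND_HZ[label]
    return frequency_hz >= lower and (upper is None or frequency_hz <= upper)
```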

Adjacent fricatives are segmented on the basis of differences in intensity and spectral frequency. If there is no evidence of change between two fricatives, the fricated segment is given a double label. Geminate fricatives are always labelled with a double label, e.g. "this seesaw" is labelled as [D I ss i: s o:].

3.3 Affricates: The Phonemes /tS/ and /dZ/

Affricates contain a closure phase and a frication phase. The labels and labelling criteria are identical to those used for alveolar stops and post-alveolar fricatives, i.e. [t] and [d] for the occlusion and [S] and [Z] for the frication phase.

3.4 Nasals: The Phonemes /m n N/

Nasals are among the easiest phonemes to identify and segment. The boundaries are placed where there is a marked drop in intensity relative to adjacent segments. Antiformants between F1 and F3 are also visible.

Syllabic nasals are labelled [=m], [=n] and [=N] when they constitute a separate syllable and there is no spectral evidence of a vowel segment associated with that syllable. Creaky nasals are labelled with the diacritic ^. Where a release occurs after the murmur it is labelled [mH] [nH] or [NH] according to the place of articulation of the nasal.

Adjacent nasals may be differentiated by changes in the formants and/or the intensity on the spectrogram. Where there is no discernible change between two nasals, a double label is used.

3.5 Liquids and Glides: The Phonemes /l r w j/

Liquids and glides are vowel-like in appearance although usually of lower intensity than vowels. They display long formant transitions and gradual changes in intensity, and consequently have no clear boundaries adjacent to other liquids or glides, or vowels. Liquids and glides containing creaky voice are labelled with the diacritic ^. Any transition between a fricative and subsequent glide or liquid (containing weak frication and a formant structure corresponding to that of the liquid/glide) is labelled using one of the labels [Or], [Ol], [Ow] or [Oj]. The boundary criteria for realisations of /r/, /l/, /w/ and /j/ are described in turn below.

When [r] is adjacent to vowels and [l], the initial and final boundaries are placed at the midpoints of the dip in F3 which is characteristic of [r]. As "linking" [r] between words is not obligatory in Australian English (and therefore not predictable), it is always labelled. A trilled allophone of /r/ is labelled [rr].

The clear and dark allophones of /l/ are both labelled [l] because, as previously mentioned, mu+ is able to retrieve the two variants according to context. Clear [l] usually shows a marked drop in intensity relative to adjacent segments, and boundaries are placed where the intensity changes. In labelling postvocalic (dark) [l], the initial boundary is placed at the midpoint of the transition from the vowel target to the target of the [l]. The label [=l] is assigned to the allophone of /l/ which is considered on phonotactic grounds to belong to a separate syllable but for which there is no acoustic evidence of a preceding schwa. Sometimes [l] is produced with a release, which is labelled [lH].

/w/ is characterised by a dip in F2 and F3 and a corresponding drop in intensity such that F3 disappears and F2 sometimes disappears. The initial boundary is placed where F3 becomes weak, and the final boundary where it becomes strong again. In some instances, /w/ is realised with a voiceless fricated onset, which is labelled [wH]. "Linking" [w] between words is not labelled separately because it is predictable from context; it is included in the preceding vowel segment.

The glide [j] has a formant structure like that of [i:] but is frequently higher in intensity than adjacent vowels. Its boundaries are placed where the intensity changes, or, if there is no discernible change in intensity, at the point of greatest formant change. If none of these cues are apparent, the boundaries are placed at the midpoints of the formant transitions between the /j/ and adjacent vowels/liquids/glides. In some speakers, [j] may be weakly fricated.

3.6 Vowels

The acoustic-phonetic labels for short vowels are: [I U E @ O V A]. Long vowels are represented by the labels [i: u: e: @: o: a:], and diphthongs are assigned the labels [ei @u oi ai au i@ u@]. When vowels occur next to stop occlusions, fricatives and pauses, their boundaries are placed at the onset and offset of energy at F2 and above. When vowels adjoin liquids, glides or other vowels, the boundaries are placed at the midpoint of the formant transitions between targets. We use the table of formant values for Australian vowels reported by Bernard (1967) as a reference for vowel identification. When a glottal stop is produced between adjacent vowels, the occlusion is labelled with the label for the second vowel and the diacritic C. Vowels containing creaky voice are labelled with the diacritic ^.
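The vowel inventory listed above can be expressed as a lookup, for example when checking a label file for unexpected vowel symbols. This Python sketch uses hypothetical names and covers only the plain labels; labels carrying the creaky-voice or glottal-stop diacritics are not handled.

```python
SHORT_VOWELS = "I U E @ O V A".split()
LONG_VOWELS = "i: u: e: @: o: a:".split()
DIPHTHONGS = "ei @u oi ai au i@ u@".split()

# Map each plain vowel label to its class.
VOWEL_CLASS = {}
VOWEL_CLASS.update({label: "short" for label in SHORT_VOWELS})
VOWEL_CLASS.update({label: "long" for label in LONG_VOWELS})
VOWEL_CLASS.update({label: "diphthong" for label in DIPHTHONGS})


def vowel_class(label):
    """Return "short", "long" or "diphthong", or None for a non-vowel label."""
    return VOWEL_CLASS.get(label)
```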

4. Word Level Labelling

Word level labelling is based on the criteria for acoustic-phonetic labelling. A word label consists of the orthographic string for the word, placed where the final boundary of the final acoustic-phonetic segment of the word occurs. Although word level labelling is, on the whole, a relatively straightforward process, it is not entirely uncomplicated. Conventions for handling the problems arising from difficult boundaries, punctuation and spelling variations, pauses and mistakes streamline the labelling process and are necessary to ensure the consistency of the labels. Our conventions for word labelling are described below.
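The placement of a word label can be sketched in Python as follows. The data layout and function name are hypothetical, and the times in the example are invented for illustration; they are not taken from the database.

```python
def word_label_times(words):
    """Return (time, orthography) pairs, one per word.

    `words` is a list of (orthography, segments) pairs, where `segments` is a
    list of (label, final_boundary_time) tuples in temporal order. Each word
    label is placed at the final boundary of the word's final segment.
    """
    labels = []
    for orthography, segments in words:
        if not segments:
            raise ValueError(f"word {orthography!r} has no segments")
        final_boundary = segments[-1][1]
        labels.append((final_boundary, orthography))
    return labels


# Invented times (seconds), for illustration only.
example = [
    ("bulb", [("b", 0.05), ("V", 0.15), ("l", 0.22), ("b", 0.28), ("H", 0.30)]),
    ("blew", [("b", 0.36), ("H", 0.38), ("l", 0.44), ("u:", 0.62)]),
]
```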

4.1 Difficult Boundaries

Sometimes it is unclear which word label an acoustic-phonetic segment belongs to, particularly when transitions, glottal stops, voiced aspiration, assimilation and intervocalic [r] occur at or across a word boundary. Double labels at the acoustic-phonetic level also have to be handled at word level. The following rules are applied in these situations:

  1. Transitions from fricatives to liquids/glides are included with the liquid/glide. Transitions from fricatives to nasals are included with the nasals.

  2. Glottal stops at word boundaries are included in the label for the second word.

  3. Voiced aspiration [HV] is included in the previous word label except in utterance final position, when it is excluded from the label altogether.

  4. When assimilation occurs across a word boundary, and [tS] or [dZ] is produced, e.g. "did you" and "texts you," the boundary is placed halfway through the frication - [S] or [Z] respectively.

  5. Intervocalic [r] across a word boundary is included in the first word label, not the second.

  6. Where acoustic-phonetic labelling uses a double label, e.g. adjacent stops, nasals, and fricatives, the word boundary is placed in the middle of the segment. However, if there is any evidence of a change between the segments, the word boundary is placed at that change.
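Rule 6 can be sketched in Python as follows; the same midpoint calculation also applies to the frication in rule 4. The function and parameter names are hypothetical and times are assumed to be in seconds.

```python
def word_boundary_in_double_label(start, end, change_time=None):
    """Place a word boundary inside a segment that carries a double label.

    If there is evidence of a change between the two events, the boundary is
    placed at that change; otherwise it is placed in the middle of the segment.
    """
    if end < start:
        raise ValueError("segment end precedes segment start")
    if change_time is not None:
        if not start <= change_time <= end:
            raise ValueError("change_time lies outside the segment")
        return change_time
    return (start + end) / 2
```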

4.2 Punctuation and Spelling

Word labels have been kept as simple as possible. Therefore, capitalisation is not used in any of the labels, except for the pronoun "I." Apostrophes are not included unless they indicate abbreviations, e.g. "we'll" and "he'd." Inverted commas are also not included. Hyphenated words are labelled as two separate words.

Words are spelled as they are on the sentence list which was read by the speakers. To avoid any confusion with acoustic-phonetic labels, "Mr" is labelled as "mister", "Mrs" is labelled as "mrs", and "h" (in sentence 132) is labelled as "aitch."
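The punctuation and spelling conventions can be approximated by a small normalisation routine. The Python sketch below is an illustration only: the function name and the list of abbreviations are hypothetical (the list holds just the examples that appear in this document and would need extending), ASCII apostrophes are assumed, and abbreviations of "I" such as "I'll" are not handled.

```python
import re

# Respellings given in the text above.
RESPELLINGS = {"mr": "mister", "mrs": "mrs", "h": "aitch"}

# Abbreviations that keep their apostrophe. Hypothetical list built from the
# examples in this document; extend as required.
ABBREVIATIONS = {"we'll", "he'd", "it's"}


def word_labels(token, abbreviations=ABBREVIATIONS):
    """Convert one orthographic token into a list of word labels."""
    labels = []
    # Hyphenated words are labelled as separate words.
    for part in token.split("-"):
        # Drop inverted commas and other punctuation, keeping apostrophes.
        part = re.sub(r"[^A-Za-z']", "", part)
        if part == "I":
            labels.append("I")  # the only capitalised label
            continue
        lowered = part.lower()
        if lowered not in abbreviations:
            lowered = lowered.replace("'", "")
        if lowered:
            labels.append(RESPELLINGS.get(lowered, lowered))
    return labels
```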

4.3 Pauses and Gaps

As for acoustic-phonetic labelling, pauses between words are marked with the label /#/.

4.4 Mistakes, Missing Words, Extra Words and Extra Sounds

There are five problem areas caused by speakers deviating from the sentences they are asked to read. Firstly, speakers may omit a word from a sentence. In this case, the label for the missing word is not transcribed. Secondly, speakers may repeat words or add extra words to the sentence, e.g. "she returned from [her] holiday." When this occurs, the extra word is labelled according to what the speaker says. A repeated word is labelled if it is complete; if not, the incomplete part is labelled as /+/.

A third problem arises when speakers change words in the sentence to another word, e.g. "turned [a]round," and "it's" for "it is." Often this does not change the meaning of the sentence and the labels reflect whatever the speaker said. Unfortunately the labels will no longer match the original sentence list, but the priority in this case is given to providing accurate labels.

Problem number four occurs when speakers mispronounce words, e.g. "cowl" for "karl." In this situation a correction is always given preference over a mispronunciation, with the mistake labelled as /+/. If the speaker does not make a correction, then the mispronunciation is taken as their realisation of the target word and is labelled accordingly.

The final problem relating to deviations from the sentence list occurs when speakers stutter. Our solution is to label the stutter as /+/, and the complete word with the appropriate word label.

5. Summary

The various phonemic or sub-phonemic segments in speech are cued by a range of physical and perceptual parameters, which interact and which overlap temporally. Labelling is based on the information provided by the time-amplitude waveform, the spectrogram, and the transcriber's perception of the segment in isolation and in word-, phrase-, and/or sentence-context. The spectrographic information is primary. The acoustic-phonetic labelling system for ANDOSL is economical, transparent and conservative.

Economy: A minimally sufficient set of labels is used, rather than an exhaustively descriptive set. This limits the redundancy in labelling aspects of the signal which can be extracted automatically, and allows for greater speed in labelling and in training transcribers. For specific research questions, narrower labelling can be done if required.

Transparency: Transparent labelling criteria are used with the aim of achieving consistency between transcribers and repeatability of transcriptions.

Conservativeness: Choice of label in cases of possible elision, assimilation, or substitution is conservative. If there is any uncertainty about whether a sound has been elided or not, it is labelled as present in the speech. When there is uncertainty about whether assimilation or substitution processes have changed the phonemic identity of a segment, the phoneme most predictable in that context is assumed to have been produced.

Word level labelling is based on the criteria for acoustic-phonetic labelling, and is kept as simple as possible. To this end, conventions for word labelling for ANDOSL have been established and implemented.


Barry, W. J. & Fourcin, A.J. (1992) Levels of labelling. Computer Speech and Language, 6, 1-14.

Bernard, J. R. (1967) Length and the identification of Australian vowels. AUMLA, 27, 37-58.

Croot, K., Fletcher, J. & Harrington, J. (1992) Levels of segmentation and labelling in the Australian National Database of Spoken Language. Proceedings of the Fourth Australian International Conference on Speech Science and Technology. Brisbane, Australia.

Harrington, J., Cassidy, S., Fletcher, J. & McVeigh, A. (1993) The mu+ system for corpus based speech research. Computer Speech and Language, 7, 305-331.

Hieronymus, J., Alexander, M., Bennett, C., Cohen, I., Davies, D., Dalby, J., Laver, J., Barry, W., Fourcin, A. & Wells, J. (1990) Proposed speech segmentation criteria for the SCRIBE project. SCRIBE Project Report.

McVeigh, A. & Harrington, J. (1992) Acoustic, articulatory, and perceptual studies using the mu+ system for speech database analysis. Proceedings of the Fourth Australian International Conference on Speech Science and Technology. Brisbane, Australia.

Millar, J., Dermody, P., Harrington, J. & Vonwiller, J. (1990a) A national cluster of spoken language databases for Australia. Proceedings of the Third Australian International Conference on Speech Science and Technology. Melbourne, Australia.

Millar, J., Dermody, P., Harrington, J. & Vonwiller, J. (1990b) A national database of spoken language: concept, design, and implementation. Proceedings of the International Conference on Spoken Language Processing (ICSLP-90). Kobe, Japan.

This page, Copyright 1996 Belinda Taylor