Speaker-invariant and rhythm-sensitive representation of spoken words

Yousuke Ozaki, Keikichi Hirose, Nobuaki Minematsu, Donna Erickson

doi:10.1109/apsipa.2013.6694162

It is well known that human speech recognition (HSR) is much more robust than automatic speech recognition (ASR) [1], [2]. Given HSR's extremely high robustness to large acoustic variability, it is reasonable to assume that humans are able to extract invariant patterns underlying input utterances [3]. Recent work in developmental psychology has found that infants are very sensitive to distributional properties in the sounds of a language [4], [5]. Following this finding, the first author proposed a speaker-independent, or invariant, speech representation of each utterance, formed from distributional properties in the sounds of that utterance [6], [7], [8]. This representation is called speech structure and was tested in isolated word recognition experiments [7], [8]. This paper introduces another kind of sensitivity into speech structure, namely sensitivity to language rhythm. Sonority-based syllable nucleus detection is implemented, and we extract local and syllable-based structures as well as the conventional global and holistic structures. Isolated word recognition experiments show that recognition performance improves with rhythm-sensitive and local speech structures.
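To illustrate how a representation built only from distributional relations can be speaker-invariant, the following is a minimal sketch and not the authors' implementation. It assumes each sound segment is modeled as a one-dimensional Gaussian, that segments are compared with the Bhattacharyya distance, and that speaker differences act as an affine map x -> a*x + b on the feature. The abstract specifies none of these choices, and the function names (`bhattacharyya_1d`, `structure_matrix`, `apply_affine`) are hypothetical.

```python
import math


def bhattacharyya_1d(m1, v1, m2, v2):
    """Bhattacharyya distance between two 1-D Gaussians (mean, variance)."""
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term


def structure_matrix(segments):
    """Pairwise distance matrix over segments given as (mean, variance) pairs.

    Only relations between segments are kept; absolute feature values are not.
    """
    n = len(segments)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = bhattacharyya_1d(*segments[i], *segments[j])
            dist[i][j] = dist[j][i] = d
    return dist


def apply_affine(segments, a, b):
    """Assumed speaker change: x -> a*x + b, so mean -> a*m + b, var -> a^2 * v."""
    return [(a * m + b, a * a * v) for m, v in segments]


if __name__ == "__main__":
    # Toy segments (mean, variance); the values are made up for illustration.
    segs = [(0.0, 1.0), (1.5, 0.5), (-2.0, 2.0)]
    original = structure_matrix(segs)
    transformed = structure_matrix(apply_affine(segs, 3.0, -7.0))
    max_diff = max(
        abs(x - y)
        for row_o, row_t in zip(original, transformed)
        for x, y in zip(row_o, row_t)
    )
    print("largest change in any pairwise distance:", max_diff)
```

Under these assumptions the pairwise distances are unchanged by the affine map up to floating-point error, which is the sense in which such a matrix can be called speaker-invariant. A local, syllable-based variant would presumably compute such matrices over segments grouped around detected syllable nuclei rather than over the whole utterance, but the details are not given here.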
