Speaker-invariant and rhythm-sensitive representation of spoken words

Description

It is well known that human speech recognition (HSR) is much more robust than automatic speech recognition (ASR) [1], [2]. Given HSR's extreme robustness to large acoustic variability, it is reasonable to assume that humans extract invariant patterns underlying input utterances [3]. Recent work in developmental psychology has found that infants are highly sensitive to the distributional properties of the sounds of a language [4], [5]. Following this finding, the first author proposed a speaker-independent, or invariant, speech representation of each utterance, formed from the distributional properties of the sounds of that utterance [6], [7], [8]. This representation is called speech structure and was tested in isolated word recognition experiments [7], [8]. This paper introduces another kind of sensitivity into speech structure, namely sensitivity to language rhythm. Sonority-based syllable nucleus detection is implemented, and local, syllable-based structures are extracted in addition to the conventional global, holistic structures. Isolated word recognition experiments show that recognition performance improves with rhythm-sensitive, local speech structures.
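The abstract does not specify how syllable nuclei are located, so the following is only a minimal sketch of one common sonority-style approach: treating peaks of a smoothed short-time energy envelope as syllable nuclei. The frame length, smoothing width, and peak thresholds are illustrative assumptions, not parameters from the paper.

```python
import numpy as np
from scipy.signal import find_peaks


def detect_syllable_nuclei(signal, sr, frame_ms=10.0, smooth_frames=5):
    """Hypothetical sketch: locate syllable nuclei as peaks of a smoothed
    short-time log-energy envelope (a crude proxy for a sonority curve).
    All parameter values are illustrative, not those of the paper."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Short-time log energy as a stand-in for sonority.
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)

    # Moving-average smoothing to suppress spurious local maxima.
    kernel = np.ones(smooth_frames) / smooth_frames
    envelope = np.convolve(energy, kernel, mode="same")

    # Peaks well above the median envelope are taken as syllable nuclei.
    height = np.median(envelope) + 0.5 * np.std(envelope)
    peaks, _ = find_peaks(envelope, height=height, distance=smooth_frames)

    # Return nucleus positions in seconds.
    return peaks * frame_ms / 1000.0
```

Once nuclei are detected in this way, the utterance can be segmented around them so that local, syllable-based structures can be computed alongside the global, holistic structure described above.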
