Composite embedding systems for ZeroSpeech2017 Track1

Shinji Watanabet, Taku Kato, Hayato Shibata, Takahiro Shinozaki

doi:10.1109/asru.2017.8269012

This paper investigates novel composite embedding systems for language-independent high-performance feature extraction using triphone-based DNN-HMM and character-based end-to-end speech recognition systems. The DNN-HMM is trained with phoneme transcripts based on a large-scale Japanese ASR recipe included in the Kaldi toolkit from the Corpus of Spontaneous Japanese (CSJ) with some modifications. The end-to-end ASR system is based on a hybrid architecture consisting of an attention-based encoder-decoder and connectionist temporal classification. This model is trained with multi-language speech data using character transcripts in a pure end-to-end fashion without requiring phonemic representation. Posterior features, PCA-transformed features, and bottleneck features are extracted from the two systems; then, various combinations of features are explored. Additionally, a bypassed autoencoder (bypassed AE) is proposed to normalize speaker characteristics in an unsupervised manner. An evaluation using the ABX test showed that the DNN-HMM-based CSJ bottleneck features resulted in a good performance regardless of the input language. The pre-activation vectors extracted from the multilingual end-to-end system with PCA provided a somewhat better performance than did the CSJ bottleneck features. The bypassed AE yielded an improved performance over a baseline AE. The lowest error rates were obtained by composite features that concatenated the end-to-end features with the CSJ bottleneck features.

Composite embedding systems for ZeroSpeech2017 Track1

説明

収録刊行物

詳細情報詳細情報について

書き出し

問題の指摘

Composite embedding systems for ZeroSpeech2017 Track1

説明

収録刊行物

詳細情報 詳細情報について

書き出し

問題の指摘

詳細情報詳細情報について