An End-to-End Approach to Joint Social Signal Detection and Automatic Speech Recognition

Description

Social signals such as laughter and fillers are frequently observed in natural conversation and play various roles in human-to-human communication. Detecting these events allows transcription systems to generate rich transcriptions and dialogue systems to behave more like humans, for example by laughing in synchrony or listening attentively. We have previously studied an end-to-end approach that directly detects social signals from speech using connectionist temporal classification (CTC), an end-to-end sequence labelling model. In this work, we propose a unified framework that integrates social signal detection (SSD) and automatic speech recognition (ASR). We also investigate several methods for generating reference labels for social signals. Experimental evaluations demonstrate that our end-to-end framework significantly outperforms a conventional DNN-HMM system in terms of both SSD performance and the character error rate (CER).
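To make the unified framework concrete, the sketch below illustrates one common way to realize joint SSD and ASR with CTC: the output vocabulary of a CTC acoustic model is augmented with social-signal tokens (here hypothetical `<laugh>` and `<filler>` symbols) so that a single network emits characters and event labels within one label sequence. The network architecture, feature dimensions, and token set are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical character inventory; the paper's actual label set is not given
# here. Social-signal tokens are appended to the ASR vocabulary so that a
# single CTC output layer covers both transcription and event detection.
CHARS = list("abcdefghijklmnopqrstuvwxyz' ")
SOCIAL_SIGNALS = ["<laugh>", "<filler>"]          # assumed event tokens
VOCAB = ["<blank>"] + CHARS + SOCIAL_SIGNALS      # index 0 = CTC blank


class JointSSDASR(nn.Module):
    """Minimal BLSTM-CTC sketch of a joint SSD+ASR acoustic model."""

    def __init__(self, n_feats=40, hidden=320, n_labels=len(VOCAB)):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, n_labels)

    def forward(self, feats):                     # feats: (B, T, n_feats)
        enc, _ = self.encoder(feats)
        return self.output(enc).log_softmax(-1)   # (B, T, n_labels)


model = JointSSDASR()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Toy batch: 2 utterances, each 100 frames of 40-dim acoustic features.
feats = torch.randn(2, 100, 40)
log_probs = model(feats).transpose(0, 1)          # CTCLoss expects (T, B, V)

# Reference labels mix characters and social-signal tokens, e.g.
# "yes <laugh>" -> [y, e, s, ' ', <laugh>]; random targets stand in here.
targets = torch.randint(1, len(VOCAB), (2, 12))
input_lengths = torch.full((2,), 100, dtype=torch.long)
target_lengths = torch.full((2,), 12, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

At decoding time, any emitted `<laugh>` or `<filler>` token serves as a detected social signal, while the remaining tokens form the transcription, so no separate detector is required.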
