Which One Sounds More Human-like? —Comparison of Synthetic Speech Trained on Japanese Daily Conversational Data with and without Disfluency

Bibliographic Information

Publication date
2026
DOI
  • 10.5715/jnlp.33.186
Publisher
The Association for Natural Language Processing (一般社団法人 言語処理学会)

Description

<p>Because real-world conversational speech is full of disfluencies, it is vital to include them when aiming to create synthetic speech that reflects real human interactions. While recent studies have developed models capable of automatically inserting disfluencies, this study aims to clarify how five types of manually inserted disfluencies commonly observed in Japanese conversational speech affect the perceived human-likeness of synthetic speech: fillers, word-internal prolongations, phrase-final abrupt rising intonation, phrase-final abrupt rising-and-falling intonation, and phrase-final copulas with sentence-final particles. First, we trained an existing AI-based speech synthesis system on daily conversational data from a female Japanese speaker of the Kinki dialect. We then synthesised speech with the same linguistic content with and without disfluencies and asked survey respondents which sample sounded more human-like. A significantly higher percentage of respondents judged the disfluent synthetic speech to be more human-like. We also analysed the relationship between the answers and the respondents' gender, age, and place of upbringing to investigate whether any of these attributes influenced the responses. The results suggest an association between the respondents' answers and the regions where they grew up, particularly for specific stimulus sounds.</p>
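
The abstract does not say which statistical test supports the "significantly higher percentage" claim. As a minimal sketch only, assuming a two-alternative forced choice analysed with an exact two-sided binomial test against chance (50%), the check could look like the following. The function name `binomial_two_sided_p` and the counts are hypothetical and are not taken from the paper.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test p-value for k successes in n trials."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    # Probability of every possible outcome under the null hypothesis.
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Sum the outcomes that are at least as unlikely as the observed one;
    # the small tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Hypothetical counts for illustration only, NOT results from the paper.
n_respondents = 100
chose_disfluent = 65
p_value = binomial_two_sided_p(chose_disfluent, n_respondents)
print(f"{chose_disfluent}/{n_respondents} chose the disfluent sample, p = {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) would indicate that the preference for the disfluent sample is unlikely to be due to chance alone. The association with respondent attributes such as region of upbringing would need a different analysis, for example a test of independence on a contingency table.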

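Published in

  • Journal of Natural Language Processing (自然言語処理)

    Journal of Natural Language Processing 33 (1), 186-206, 2026

    The Association for Natural Language Processing (一般社団法人 言語処理学会)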
