Which One Sounds More Human-like? —Comparison of Synthetic Speech Trained on Japanese Daily Conversational Data with and without Disfluency

Bibliographic Information

Published: 2026
DOI: 10.5715/jnlp.33.186
Publisher: The Association for Natural Language Processing


Description

<p>Because real-world conversational speech is full of disfluencies, it is vital to include them when creating synthetic speech that reflects real human interactions. While recent studies have developed models capable of inserting disfluencies automatically, this study aims to clarify how five specific types of manually inserted disfluencies commonly observed in Japanese conversational speech (fillers, word-internal prolongations, phrase-final abrupt rising intonation, phrase-final abrupt rising-and-falling intonation, and phrase-final copulas with sentence-final particles) affect the perceived human-likeness of synthetic speech. First, we trained an existing AI-based speech synthesis system on daily conversational data from a female Japanese speaker of the Kinki dialect. We then synthesised speech with the same linguistic content, with and without disfluencies, and asked survey respondents which sample sounded more human-like. A significantly higher percentage of respondents judged the disfluent synthetic speech to be more human-like. We also analysed the relationship between the answers and the respondents' gender, age, and place of upbringing to investigate whether any of these attributes influenced their answers. The results suggest an association between the respondents' answers and the regions where they grew up, particularly for specific stimulus sounds.</p>
