Which One Sounds More Human-like? —Comparison of Synthetic Speech Trained on Japanese Daily Conversational Data with and without Disfluency
- Mokhtari Akiko (Toyama Prefectural University)
- Arai Jun (Kwansei Gakuin University)
- Sadanobu Toshiyuki (Kyoto University)
- Hatano Hiroaki (Kobe University)
- Campbell Nick (The University of Dublin)
Bibliographic Information
- Published: 2026
- DOI: 10.5715/jnlp.33.186
- Publisher: The Association for Natural Language Processing
Description
As real-world conversational speech is full of disfluencies, it is vital to include them when aiming to create synthetic speech that reflects real human interactions. While recent studies have developed models capable of automatically inserting disfluencies, this study aims to clarify how five specific types of manually inserted disfluencies commonly observed in Japanese conversational speech (fillers, word-internal prolongations, phrase-final abrupt rising intonation, phrase-final abrupt rising-and-falling intonation, and phrase-final copulas with sentence-final particles) affect the perceived human-likeness of synthetic speech. First, we trained an existing AI-based speech synthesis system using daily conversational data from a female Japanese speaker of the Kinki dialect. We then synthesised speech of the same linguistic content with and without disfluencies and asked survey respondents which sample sounded more human-like. The results showed that a significantly higher percentage of respondents evaluated the disfluent synthetic speech as more human-like. We also analysed the relationship between the answers and the respondents' gender, age, and place of upbringing to investigate whether any of these attributes influenced the answers. The results suggest an association between the respondents' answers and the regions where they grew up, particularly in response to specific stimulus sounds.
Journal
- Journal of Natural Language Processing 33 (1), 186-206, 2026
Details
- CRID: 1390870529360232448
- ISSN: 2185-8314, 1340-7619
- Text Lang: en
- Data Source: JaLC, Crossref
- Abstract License Flag: Disallowed
