{"@context":{"@vocab":"https://cir.nii.ac.jp/schema/1.0/","rdfs":"http://www.w3.org/2000/01/rdf-schema#","dc":"http://purl.org/dc/elements/1.1/","dcterms":"http://purl.org/dc/terms/","foaf":"http://xmlns.com/foaf/0.1/","prism":"http://prismstandard.org/namespaces/basic/2.0/","cinii":"http://ci.nii.ac.jp/ns/1.0/","datacite":"https://schema.datacite.org/meta/kernel-4/","ndl":"http://ndl.go.jp/dcndl/terms/","jpcoar":"https://github.com/JPCOAR/schema/blob/master/2.0/"},"@id":"https://cir.nii.ac.jp/crid/1390870529360232448.json","@type":"Article","productIdentifier":[{"identifier":{"@type":"DOI","@value":"10.5715/jnlp.33.186"}},{"identifier":{"@type":"URI","@value":"https://www.jstage.jst.go.jp/article/jnlp/33/1/33_186/_pdf"}}],"dc:title":[{"@language":"en","@value":"Which One Sounds More Human-like? —Comparison of Synthetic Speech Trained on Japanese Daily Conversational Data with and without Disfluency"}],"dc:language":"en","description":[{"type":"abstract","notation":[{"@language":"en","@value":"<p>As real-world conversational speech is full of disfluencies, it is vital to include them when aiming to create a speech synthesis that reflects real human interactions. While recent studies have developed models capable of automatically inserting disfluencies, this study aims to clarify how five specific types of manually inserted disfluencies commonly observed in Japanese conversational speech—fillers, word-internal prolongations, phrase-final abrupt rising intonation, phrase-final abrupt rising-and-falling intonation, and phrase-final copulas with sentence-final particles—affect the perceived human-likeness of synthetic speech. First, we train an existing AI-based speech synthesis system using daily conversational data from a female Japanese speaker in the Kinki dialect. We then synthesised speech of the same linguistic content with and without disfluencies and asked the survey respondents which sample sounded more human-like. The results showed that a significantly higher percentage of respondents evaluated disfluent synthetic speech as more human-like. We also analysed the relationship between the answers and the respondents' gender, age, and place of upbringing to investigate whether any of these attributes influenced the tendency of the answers. The results suggest that there is an association between the respondents' answers and the regions where they grew up, particularly in response to specific stimulus sounds.</p>"}],"abstractLicenseFlag":"disallow"}],"creator":[{"@id":"https://cir.nii.ac.jp/crid/1410870529360232451","@type":"Researcher","foaf:name":[{"@language":"en","@value":"Mokhtari Akiko"}],"jpcoar:affiliationName":[{"@language":"en","@value":"Toyama Prefectural University"}]},{"@id":"https://cir.nii.ac.jp/crid/1410870529360232450","@type":"Researcher","foaf:name":[{"@language":"en","@value":"Hatano Hiroaki"}],"jpcoar:affiliationName":[{"@language":"en","@value":"Kobe University"}]},{"@id":"https://cir.nii.ac.jp/crid/1410870529360232448","@type":"Researcher","foaf:name":[{"@language":"en","@value":"Arai Jun"}],"jpcoar:affiliationName":[{"@language":"en","@value":"Kwansei Gakuin University"}]},{"@id":"https://cir.nii.ac.jp/crid/1410870529360232449","@type":"Researcher","foaf:name":[{"@language":"en","@value":"Campbell Nick"}],"jpcoar:affiliationName":[{"@language":"en","@value":"The University of Dublin"}]},{"@id":"https://cir.nii.ac.jp/crid/1410870529360232452","@type":"Researcher","foaf:name":[{"@language":"en","@value":"Sadanobu Toshiyuki"}],"jpcoar:affiliationName":[{"@language":"en","@value":"Kyoto University"}]}],"publication":{"publicationIdentifier":[{"@type":"PISSN","@value":"13407619"},{"@type":"LISSN","@value":"13407619"},{"@type":"EISSN","@value":"21858314"}],"prism:publicationName":[{"@language":"en","@value":"Journal of Natural Language Processing"},{"@language":"ja","@value":"自然言語処理"},{"@language":"en","@value":"Journal of Natural Language Processing"},{"@language":"ja","@value":"自然言語処理"}],"dc:publisher":[{"@language":"en","@value":"The Association for Natural Language Processing"},{"@language":"ja","@value":"一般社団法人　言語処理学会"}],"prism:publicationDate":"2026","prism:volume":"33","prism:number":"1","prism:startingPage":"186","prism:endingPage":"206"},"reviewed":"false","url":[{"@id":"https://www.jstage.jst.go.jp/article/jnlp/33/1/33_186/_pdf"}],"availableAt":"2026","foaf:topic":[{"@id":"https://cir.nii.ac.jp/all?q=Disfluency","dc:title":"Disfluency"},{"@id":"https://cir.nii.ac.jp/all?q=Japanese%20Daily%20Conversational%20Data","dc:title":"Japanese Daily Conversational Data"},{"@id":"https://cir.nii.ac.jp/all?q=Speech%20Synthesis","dc:title":"Speech Synthesis"},{"@id":"https://cir.nii.ac.jp/all?q=Auditory%20Survey","dc:title":"Auditory Survey"}],"relatedProduct":[{"@id":"https://cir.nii.ac.jp/crid/1360026342427501568","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"Fastpitch: Parallel Text-to-Speech with Pitch Prediction"}]},{"@id":"https://cir.nii.ac.jp/crid/1360292619482917760","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"Filled pauses as cues to the complexity of upcoming phrases for native and non-native listeners"}]},{"@id":"https://cir.nii.ac.jp/crid/1360302867622413696","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"It’s the way that you, er, say it: Hesitations in speech affect language comprehension"}]},{"@id":"https://cir.nii.ac.jp/crid/1360307817933251072","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"Though this be hesitant, yet there is method in ’t: Effects of disfluency patterns in neural speech synthesis for cultural heritage presentations"}]},{"@id":"https://cir.nii.ac.jp/crid/1360307819151315712","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"Toward expressive and disfluent speech synthesis"}]},{"@id":"https://cir.nii.ac.jp/crid/1360581633346118144","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"Speech Synthesis Evaluation — State-of-the-Art Assessment and Suggestion for a Novel Research Program"}]},{"@id":"https://cir.nii.ac.jp/crid/1360865819394002688","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"Fluency and Disfluency"}]},{"@id":"https://cir.nii.ac.jp/crid/1360870766034842752","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"Expressive Speech Processing and Prosody Engineering: An Illustrated Essay on the Fragmented Nature of Real Interactive Speech"}]},{"@id":"https://cir.nii.ac.jp/crid/1360870767188571008","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"Teaching disfluency in Japanese language education and its effects on communication"}]},{"@id":"https://cir.nii.ac.jp/crid/1360870768634339584","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"Disfluencies Signal Theee, Um, New Information"}]},{"@id":"https://cir.nii.ac.jp/crid/1360870768900704384","@type":"Article","relationType":["references"],"jpcoar:relatedTitle":[{"@value":"Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion"}]}],"dataSourceIdentifier":[{"@type":"JALC","@value":"oai:japanlinkcenter.org:2015064465"},{"@type":"CROSSREF","@value":"10.5715/jnlp.33.186"}]}