English-Read-By-Japanese Speech Synthesis Preserving Speaker Individuality Based on Partial Correction of Prosody and Phonetic Sounds and Effects of English Proficiency Level on Its Performance

大島 悠司, 高道 慎之介, 戸田 智基, Sakriani Sakti, Graham Neubig, 中村 哲

Cross-lingual speech synthesis based on voice conversion and HMM-based speech synthesis, which generates natural-sounding English speech for Japanese speakers, tends to degrade speaker individuality in the synthetic speech compared with intra-lingual speech synthesis. To address this issue, we have proposed an ERJ (English Read by Japanese) speech synthesis method that preserves speaker individuality in synthetic speech and a prosody correction method that improves its naturalness. However, their effectiveness has never been evaluated by native listeners; the effects of each speaker's English proficiency level on their performance have never been evaluated; and the incorrect phonetic sounds of ERJ have never been addressed. In this paper, we address these points by applying the proposed method to multiple speakers with various English proficiency levels, and we also propose a method for correcting some incorrect phonetic sounds based on spectrum swapping for unvoiced consonants. The experimental results demonstrate that (1) the effectiveness of power correction is confirmed by native listeners; (2) the prosody correction method successfully improves the naturalness of ERJ synthetic speech across various English proficiency levels; and (3) the proposed phonetic sound correction method is also effective for further improving naturalness.
