A Study on Simplification of Input Features in Speech Generation Method from Lip Video Images

KANAZAWA Naoki, SUZUKI Motoyuki

doi:10.60274/asjsc.sc-2023-12

Bibliographic Information

Other Title

唇動画像からの音声生成法における入力特徴量の単純化に関する検討

Description

In recent years, there have been several studies on speech generation from lip video images. Many conventional methods use DNN models based on CNNs or RNNs to generate speech waveforms. In such methods, the model learns speaker-specific features such as skin color and moles, so performance degrades when the input comes from a speaker other than the training speaker. We therefore proposed a method that removes speaker-specific features from the input features so that speech waveforms can be generated with high performance for any speaker. In this paper, we generated speech waveforms using the proposed input features and evaluated them using STOI. The proposed method performed worse than the method that takes lip video as input, but we confirmed that it is effective in suppressing the degradation caused by differences between speakers.

Journal

Proceedings of the Technical Committee on Speech Communication

Proceedings of the Technical Committee on Speech Communication 3 (2), n/a-, 2023-02-24

The Technical Committee on Speech Communication, the Acoustical Society of Japan

Details

CRID: 1390580626907798016

DOI: 10.60274/asjsc.sc-2023-12

ISSN: 2758-2744

Text Lang: ja

Data Source

JaLC

Abstract License Flag: Allowed
