A Study on Simplification of Input Features in Speech Generation Method from Lip Video Images

Bibliographic Information

Other Title
  • 唇動画像からの音声生成法における入力特徴量の単純化に関する検討

Abstract

In recent years, several studies have addressed speech generation from lip video images. Many conventional methods generate speech waveforms with DNN models based on CNNs or RNNs. In such methods, the model learns speaker-specific features such as skin color and moles, so performance degrades when the input comes from a speaker other than the one used for training. We therefore proposed a method that removes speaker-specific features from the input features, with the aim of generating speech waveforms with high performance for any speaker. In this paper, we generated speech waveforms using the proposed input features and evaluated them using STOI (short-time objective intelligibility). The proposed method performed worse than the method that takes lip video as input, but we confirmed that it is effective in suppressing the degradation caused by differences between speakers.

Journal

Details

  • CRID
    1390580626907798016
  • DOI
    10.60274/asjsc.sc-2023-12
  • ISSN
    27582744
  • Text Lang
    ja
  • Data Source
    • JaLC
  • Abstract License Flag
    Allowed
