How Does End-To-End Speech Recognition Training Impact Speech Enhancement Artifacts?

Iwamoto, Kazuma, Ochiai, Tsubasa, Delcroix, Marc, Ikeshita, Rintaro, Sato, Hiroshi, Araki, Shoko, Katagiri, Shigeru

doi:10.48550/arxiv.2311.11599


Open access

Description

Jointly training a speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end has been investigated as a way to mitigate the influence of processing distortion generated by single-channel SE on ASR. In this paper, we investigate the effect of such joint training on the signal-level characteristics of the enhanced signals from the viewpoint of the decomposed noise and artifact errors. The experimental analyses provide two novel findings: 1) ASR-level training of the SE front-end reduces the artifact errors while increasing the noise errors, and 2) simply interpolating the enhanced and observed signals, which achieves a similar effect of reducing artifacts and increasing noise, improves ASR performance without jointly modifying the SE and ASR modules, even for a strong ASR back-end using a WavLM feature extractor. Our findings provide a better understanding of the effect of joint training and a novel insight for designing an ASR-agnostic SE front-end.
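The interpolation of enhanced and observed signals mentioned in the second finding can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the linear blend, the function name `interpolate_signals`, and the parameter `weight` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def interpolate_signals(observed, enhanced, weight):
    """Blend an enhanced signal with the unprocessed observation.

    Assumed convention (hypothetical, not taken from the paper):
    weight = 1.0 returns the enhanced signal unchanged,
    weight = 0.0 returns the observed signal unchanged.
    """
    observed = np.asarray(observed, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    if observed.shape != enhanced.shape:
        raise ValueError("observed and enhanced must have the same shape")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Linear interpolation: adding back part of the observation
    # reintroduces some noise while scaling down whatever the
    # SE front-end added or removed.
    return weight * enhanced + (1.0 - weight) * observed


# Toy usage with made-up sample values (not real audio).
observed = np.array([1.0, 2.0, 3.0])
enhanced = np.array([0.0, 0.0, 0.0])
blended = interpolate_signals(observed, enhanced, weight=0.75)
```

Under this assumed convention, lowering `weight` moves the output toward the noisy observation, which matches the trade-off the abstract describes: fewer artifacts at the cost of more residual noise. A suitable value would have to be chosen empirically for a given SE front-end and ASR back-end.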

5 pages, 1 figure, 1 table

Journal

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

pp. 11031-11035, 2024-04-14

IEEE
