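Discrimination between Singing and Speaking Voices Using the Spectral Envelope and the Temporal Variation of the Fundamental Frequency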

Yasunori Ohishi (大石康智)

This paper examines the discrimination between singing and speaking (read-aloud) voices using two measures: the spectral envelope and the time derivative of the fundamental frequency (F0, perceived as pitch). In preliminary listening experiments, listeners distinguished singing from speaking voices with an accuracy of 70.0% for 200 ms signals and 99.7% for 1 s signals. To examine which acoustic features affect this judgment, we then presented listeners with stimuli in which either the short-time spectral features (voice quality) or the prosody had been systematically distorted by signal processing. The results suggest that spectral and prosodic cues contribute complementarily to the perceptual judgment. Hypothesizing that different cues matter for short and long signals, we designed an automatic vocal style discriminator based on two measures: the spectral envelope, represented by MFCCs, and the F0 derivative. For input signals longer than 1 s, the F0-based measure performs better than the MFCC-based measure; in particular, it reaches 85.0% accuracy on the first 2 s of an utterance. For input signals shorter than 1 s, the MFCC-based measure performs better instead. Finally, a simple combination of the two measures raises the accuracy for 2 s signals by 2.3 percentage points, to 87.3%.
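The abstract names the two measures but does not say how the F0 derivative is computed, which statistical models score each class, or how the two measures are combined. The following is therefore only a minimal sketch under stated assumptions: the F0 track is converted to cents, its time derivative is taken with the standard regression (delta) formula, and the "simple combination" is assumed to be a weighted sum of two per-utterance log-likelihood ratios. The function names, the 55 Hz reference, the window width, and the weight are all hypothetical and do not come from the paper.

```python
import numpy as np

def hz_to_cents(f0_hz, ref_hz=55.0):
    """Convert an F0 track in Hz to cents relative to ref_hz (assumed reference)."""
    return 1200.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

def delta(track, width=2):
    """Regression-based time derivative of a 1-D per-frame track.

    Uses the standard delta formula over +/- `width` frames, with the
    edges padded by repeating the first and last values.
    """
    track = np.asarray(track, dtype=float)
    n = len(track)
    padded = np.pad(track, width, mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = np.zeros(n)
    for k in range(1, width + 1):
        ahead = padded[width + k : width + k + n]    # track[t + k]
        behind = padded[width - k : width - k + n]   # track[t - k]
        out += k * (ahead - behind)
    return out / denom

def combined_score(llr_mfcc, llr_f0, weight=0.5):
    """Weighted sum of two log-likelihood ratios (singing minus speaking).

    Assumption: each measure yields one ratio per utterance from some
    pair of class models; the models themselves are not sketched here.
    """
    return weight * llr_mfcc + (1.0 - weight) * llr_f0

def classify(llr_mfcc, llr_f0, weight=0.5):
    """Label the utterance by the sign of the combined score."""
    return "singing" if combined_score(llr_mfcc, llr_f0, weight) > 0.0 else "speaking"
```

With this layout, a flat pitch track gives a zero delta-F0 everywhere, while vibrato or note transitions give non-zero values, which is the kind of temporal cue the F0-based measure is meant to capture on longer signals.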
