Performance Evaluation of Speaker Spoofing Attack Detection in Environments Where Bona-Fide Voice and Recorded Voice Are Mixed

Shinya Marubashi (丸橋 伸也), Tomohiko Yano (矢野 智彦), Jingtao Huang (黄 敬滔), Risa Yashiro (八代 理紗), Hiroki Kunii (国井 裕樹)

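Speaker recognition systems are highly convenient, but the threat of spoofing attacks has been pointed out. In particular, replay attacks that use voice output from a loudspeaker (recorded voice) can pose a serious threat because they are easy to mount, and many spoofing detection methods have therefore been proposed. In real environments, spoofing detection performance is known to degrade when various environmental sounds are superimposed on the voice. Existing studies have evaluated spoofing detection on voice mixed with white noise or miscellaneous environmental sounds. However, the case where bona-fide voice (live human voice) and recorded voice are superimposed has not been sufficiently evaluated. This study therefore focuses on two scenarios in which bona-fide voice and recorded voice overlap. In the first, an attacker intentionally superimposes bona-fide voice on recorded voice, causing the recorded voice to be misrecognized as bona-fide voice. In the second, recorded voice from a source such as a television is unintentionally superimposed on bona-fide voice, causing the bona-fide voice to be misrecognized as recorded voice. We organized the risks and ran evaluation experiments for both scenarios, and revealed a security threat: superimposing bona-fide voice at 1/8 the sound pressure of the recorded voice causes more than half of the cases to be misrecognized as bona-fide voice.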

Speaker recognition systems are convenient, but they are vulnerable to various spoofing attacks, such as replay attacks that use recorded voice. Although many spoofing detection methods have been proposed, their performance may degrade in real environments with various background noises. Existing studies have evaluated spoofing detection under white noise or environmental sounds such as those of cafes and traffic. However, the effectiveness of detection methods when bona-fide and spoofed voice are mixed has not been sufficiently investigated. This study focuses on two scenarios in which bona-fide and spoofed voice are mixed: (1) a security scenario in which an attacker intentionally mixes bona-fide voice into recorded voice to deceive the system, and (2) a usability scenario in which recorded voice from a source such as a television is unintentionally mixed into bona-fide voice. The evaluation showed that when bona-fide voice was mixed into recorded voice at only 1/8 of the recorded voice's sound pressure (a ratio of 1:8), more than half of the cases were misrecognized as bona-fide voice. These results indicate a significant security threat.
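
To make the 1:8 sound pressure ratio concrete, the following is a minimal sketch of how two signals could be mixed at a given pressure ratio. It is not the authors' procedure: the abstract does not describe how the mixing was done, so the use of RMS amplitude as the measure of sound pressure, and the function names `rms`, `ratio_to_db`, and `mix_at_pressure_ratio`, are assumptions made for illustration only.

```python
import math


def rms(samples):
    """Root-mean-square amplitude, used here as a stand-in for sound pressure (assumption)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def ratio_to_db(ratio):
    """Convert an amplitude (sound pressure) ratio to decibels."""
    return 20.0 * math.log10(ratio)


def mix_at_pressure_ratio(recorded, bona_fide, ratio=1 / 8):
    """Overlay bona_fide on recorded so that the bona-fide RMS is `ratio` times the recorded RMS.

    Hypothetical helper: both inputs are plain lists of float samples at the same
    sampling rate. The longer signal is truncated to the length of the shorter one.
    """
    n = min(len(recorded), len(bona_fide))
    recorded, bona_fide = recorded[:n], bona_fide[:n]
    bona_rms = rms(bona_fide)
    if bona_rms == 0.0:
        raise ValueError("bona_fide signal is silent; cannot scale it")
    gain = ratio * rms(recorded) / bona_rms
    return [r + gain * b for r, b in zip(recorded, bona_fide)]


if __name__ == "__main__":
    sample_rate = 8000
    # Two synthetic tones stand in for the recorded and bona-fide voice.
    recorded = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
    bona_fide = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
    mixed = mix_at_pressure_ratio(recorded, bona_fide, ratio=1 / 8)
    print(f"level of bona-fide relative to recorded: {ratio_to_db(1 / 8):.2f} dB")
```

Under this assumption, a pressure ratio of 1/8 places the bona-fide voice roughly 18 dB below the recorded voice, which gives a sense of how quiet the overlaid voice is in the security scenario.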
