Indistinguishable Backdoor Attacks for Triggers and Models

岩花 一輝 (Kazuki Iwahana), 矢内 直人 (Naoto Yanai), 藤原 融 (Toru Fujiwara)

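A backdoor attack on machine learning embeds a hidden region in a target model so that the attacker's intended malicious output is obtained only for a specific input called a trigger. Existing attack methods have the problem that the presence of a backdoor can be detected from the trigger and from the model's behavior. In this paper, we study IBDF (Indistinguishable Backdoor in Dual Form), a new backdoor attack for which it cannot be determined whether a model contains a backdoor, from the viewpoint of both the trigger and the model's behavior. Roughly speaking, a model that generates triggered inputs whose appearance matches that of regular inputs and a victim model that receives those triggers are trained in competition with each other so that the hidden-layer values are also indistinguishable. Experiments on MNIST and GTSRB showed that IBDF satisfies indistinguishability for both triggers and models without sacrificing accuracy or attack success rate. Relatedly, satisfying indistinguishability for both triggers and models is also expected to make trigger reconstruction and backdoor removal more difficult.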

A backdoor attack on a machine learning model embeds hidden behavior so that the model produces an adversary-chosen output only for a particular input called a trigger. Existing backdoor attacks can be uncovered by analyzing the triggered inputs or the hidden layers of the model; that is, they lack indistinguishability. In this paper, we present a novel backdoor attack that provides indistinguishability for both triggers and models. Loosely speaking, a generative adversarial network (GAN) generates triggered inputs that are visually indistinguishable from regular inputs. In parallel, a victim model is trained on the GAN-generated inputs so that its hidden-layer values for those inputs are indistinguishable from its values for regular inputs. In an evaluation on the MNIST and GTSRB datasets, we demonstrate that our attack achieves high accuracy, a high attack success rate, and indistinguishability for both triggers and models. We also find that our attack can bypass current countermeasures.
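The abstract reports two headline metrics, accuracy and attack success rate, without giving formulas. The following is a minimal sketch of how these metrics are commonly computed, not the authors' evaluation code. The function names `clean_accuracy` and `attack_success_rate` are hypothetical, and excluding samples whose true label already equals the target label is an assumption that this paper may or may not follow.

```python
from typing import Sequence


def clean_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of clean (trigger-free) inputs classified correctly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def attack_success_rate(
    triggered_predictions: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Fraction of triggered inputs classified as the attacker's target label.

    Assumption (not stated in the abstract): samples whose true label already
    equals the target label are skipped, because predicting the target for
    them does not show that the backdoor fired.
    """
    if len(triggered_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    hits = 0
    total = 0
    for pred, truth in zip(triggered_predictions, true_labels):
        if truth == target_label:
            continue  # skip samples that already belong to the target class
        total += 1
        if pred == target_label:
            hits += 1
    if total == 0:
        raise ValueError("no samples outside the target class")
    return hits / total
```

A backdoored model is usually judged on both numbers together: clean accuracy close to that of an unmodified model, and an attack success rate close to 1 on triggered inputs.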
