Evaluating the Security Inherent in the Randomness of Synthetic Data Generation

三浦, 尭之, 紀伊, 真昇, 芝原, 俊樹, 市川, 敦謙, 千田, 浩司

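As data utilization becomes more active, many privacy-preserving techniques have been proposed to protect the privacy of individuals contained in the data being used. In recent years, privacy protection based on synthetic data generation has attracted particular attention. Conventional synthetic data generation guarantees theoretical security by adding noise to the values required for data generation, that is, the generation parameters, so that they become differentially private. However, when a company or similar organization uses synthetic data generation to protect the privacy of data it holds, it often publishes only the generated data and discards the generation parameters without disclosing them. In this case, because the process of generating data from the generation parameters involves randomness, estimating the generation parameters and the original data from the generated data is considered to be a hard problem even when the synthetic data generation is not differentially private. This paper examines how much privacy protection this randomness provides in synthetic data generation that is not differentially private. As a first step toward a theoretical evaluation, we evaluate, based on the idea of differential privacy, the security satisfied by data generation that follows a normal distribution whose generation parameters are the mean and the variance. We also indicate a direction for applying the evaluation method to high-dimensional data.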

With the increasing demand for data utilization, many techniques have been proposed to protect the privacy of individuals in the data. In recent years, privacy protection techniques based on synthetic data generation have attracted much attention. Conventional synthetic data generation guarantees theoretical security by making the generation parameters, which are required for data generation, differentially private. However, when enterprises use synthetic data generation to protect their data, they generate synthetic data from their generation parameters and then discard those parameters without disclosing them. In addition, since synthetic data generation has its own randomness, it is not easy to estimate the generation parameters and the original data from the generated data. In this paper, we theoretically discuss the difficulty of this estimation. As a first step in the theoretical evaluation, we evaluate the security of synthetic data generation by the normal distribution with the mean and the variance of the original data, referring to the concept of differential privacy. We also show the future direction of privacy-preserving data generation for high-dimensional data.
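The setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `generate_synthetic`, the sample values, and the use of the population standard deviation are all assumptions made for illustration. It computes the mean and variance of the original data, samples synthetic values from the corresponding normal distribution without adding any noise to the parameters, and returns only the samples, so the parameters are never disclosed.

```python
import random
import statistics


def generate_synthetic(original, n_samples, seed=None):
    """Sample synthetic values from N(mean, variance) of `original`.

    Hypothetical helper: no noise is added to the parameters, so the
    generation is not differentially private by construction.
    """
    rng = random.Random(seed)
    mean = statistics.fmean(original)
    # Assumption: population standard deviation; the abstract does not
    # say whether the population or sample variance is used.
    std = statistics.pstdev(original)
    synthetic = [rng.gauss(mean, std) for _ in range(n_samples)]
    # Only the samples are returned; mean and std are discarded here.
    return synthetic


# Two neighboring datasets that differ in a single record (made-up values).
original = [52.0, 48.5, 61.2, 55.3, 47.9, 58.4]
neighbor = original[:-1] + [90.0]

syn_a = generate_synthetic(original, 5, seed=0)
syn_b = generate_synthetic(neighbor, 5, seed=0)
```

Comparing the output distributions for neighboring inputs such as `original` and `neighbor` is the kind of question a differential-privacy-style evaluation asks. The fixed seed is used here only to make the sketch reproducible.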
