Reidentification Risk from Pseudonymized Customer Payment History
Description
Anonymizing data requires a balance between utility and security, which in turn requires an accurate assessment of the risk of compromise. This paper studies the risk of reidentification for long-term transaction records aggregated by pseudonym (pid). Given a set of pseudonymized transaction records, an adversary attempts to identify individuals by maximizing the similarity between the sets of goods purchased under the pseudonyms. Assuming that goods are chosen with uniform probability and that the number of records per individual follows Zipf’s law, we investigate the likelihood that an individual is correctly reidentified. Our model reveals that the risk of reidentification increases with the number of records associated with a pseudonym. A similar relationship between risk and the number of pseudonyms was observed in a data anonymization competition (PWS Cup 2017), in which each team securely anonymized the Online Retail dataset, comprising 40,000 records for 500 individuals, and the teams then attempted to reidentify one another's anonymized data. The competition results suggest that the reidentification rate increases the longer a pseudonym remains unchanged.
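The similarity-maximizing attack described above can be illustrated with a minimal sketch. It assumes that the adversary holds a reference set of goods for each known individual and that similarity is measured with the Jaccard index; the abstract specifies neither, so both are illustrative choices. The names `jaccard` and `reidentify` and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two collections of goods (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reidentify(known, pseudonymized):
    """Guess, for each pseudonym, the known individual with the most similar goods."""
    guesses = {}
    for pid, goods in pseudonymized.items():
        guesses[pid] = max(known, key=lambda name: jaccard(known[name], goods))
    return guesses


# Hypothetical background knowledge: goods bought by each known individual.
known = {
    "alice": {"tea", "mug", "candle", "lamp"},
    "bob": {"clock", "frame", "mug"},
    "carol": {"bag", "lamp", "tin"},
}

# Hypothetical pseudonymized records, aggregated by pid.
pseudonymized = {
    "p1": {"tea", "mug", "candle"},
    "p2": {"clock", "frame"},
    "p3": {"lamp", "tin"},
}

guesses = reidentify(known, pseudonymized)
for pid, name in sorted(guesses.items()):
    print(pid, "->", name)
```

In this toy setting, every record added under a pseudonym gives the adversary more overlap to match against, which is consistent with the intuition that long-lived pseudonyms carry higher risk.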