Privacy-preserving data mining. A literature review

Authors:

  • Joel Brynielsson
  • Fredrik Johansson
  • Magnus Jändel

Publish date: 2013-02-14

Report number: FOI-R--3633--SE

Pages: 51

Written in: English

Keywords:

  • privacy-preserving data mining

Abstract

This review of the research literature in the field of privacy-preserving data mining (PPDM) is based on a competence development project spanning over 140 hours of study. Data mining extracts information from data for the benefit of commercial enterprises or governments. There is often a conflict of interest between the advantages gained from data mining and privacy. PPDM offers a set of data mining methods that balance the discordant goals of efficiency and privacy. In the introduction we describe the PPDM problem, the main actors and issues, the different research traditions that form the field, and the relation to neighbouring research fields. The focus of this report is technical methods for PPDM. There are two main strategies. Sanitation methods modify data for the purpose of publishing information that both preserves the overall statistical features of the data and offers some degree of privacy. Distributed secure methods use cryptographic techniques to compute statistical measures without revealing privacy-sensitive details. The first step in all sanitation methods is to remove explicit identifiers such as social security numbers. This is typically not sufficient, since individuals can also be identified by quasi-identifiers that occur both in the target database and in background data. Sanitation methods increase privacy by removing quasi-identifiers. The two main approaches to this are 1) deterministic editing for the purpose of exactly fulfilling some measure of privacy and 2) randomized editing aiming at balancing statistical measures of privacy and data mining utility. Different sets of distributed secure methods apply to the cases of horizontally partitioned data, where different parties own different sets of database records of the same type, and vertically partitioned data, where data on different attributes pertaining to the same individuals are distributed between different parties.
Diverse flavours of distributed secure protocols make different assumptions about the integrity and honesty of the participants. The review of the mainstream methods is supplemented with descriptions of some less often discussed techniques and problems, including PPDM for unstructured text and network data, and the techniques of importance weighting and classifier downgrading.
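
As a rough illustration of the deterministic editing of quasi-identifiers described above, the following Python sketch coarsens one attribute until every combination of quasi-identifier values is shared by at least k records. The abstract does not name a specific privacy measure; k-anonymity is assumed here as one common example. The toy records, the attribute names (age, zip, diagnosis) and the helper functions smallest_group and generalize_age are hypothetical and are not taken from the report.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values on all quasi-identifier attributes."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())


def generalize_age(record, width=10):
    """Deterministic edit (assumed for illustration): replace an exact
    age with a range of the given width, e.g. 34 becomes '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical toy data; explicit identifiers are assumed already removed.
records = [
    {"age": 34, "zip": "111", "diagnosis": "A"},
    {"age": 36, "zip": "111", "diagnosis": "B"},
    {"age": 41, "zip": "222", "diagnosis": "A"},
    {"age": 47, "zip": "222", "diagnosis": "C"},
]
quasi_identifiers = ["age", "zip"]

# Before editing, every record is unique on (age, zip).
k_before = smallest_group(records, quasi_identifiers)

# After coarsening age, each (age range, zip) pair covers two records.
sanitized = [generalize_age(record) for record in records]
k_after = smallest_group(sanitized, quasi_identifiers)

print(k_before, k_after)
```

In this toy case the smallest group grows from one record to two, at the cost of less precise age information, which is the kind of trade-off between privacy and data mining utility that the abstract refers to.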