Estimating the success of re-identifications in incomplete datasets using generative models

Author(s):

Rocher, Luc
Hendrickx, Julien M.
de Montjoye, Yves-Alexandre

Abstract:

While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.

Document:

https://www.nature.com/articles/s41467-019-10933-3

