Author(s):
- Rocher, Luc
- Hendrickx, Julien M.
- de Montjoye, Yves-Alexandre
Abstract:
While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
Document:
https://www.nature.com/articles/s41467-019-10933-3
References:
- Poushter, J. Smartphone ownership and internet usage continues to climb in emerging economies (Pew Research Center, Washington, DC, 2016). http://www.pewglobal.org/2016/02/22/smartphone-ownership-and-internet-usage-continues-to-climb-in-emerging-economies/.
- Yang, N. & Hing, E. National electronic health records survey. https://cdc.gov/nchs/data/ahcd/nehrs/2015_nehrs_ehr_by_specialty.pdf (2015).
- Murdoch, T. B. & Detsky, A. S. The inevitable application of big data to health care. JAMA 309, 1351–1352 (2013).CAS Article Google Scholar
- Wyber, R. et al. Big data in global health: improving health in low- and middle-income countries. Bull. World Health Organ. 93, 203–208 (2015).Article Google Scholar
- Lazer, D. et al. Life in the network: the coming age of computational social science. Science 323, 721 (2009).CAS Article Google Scholar
- Halevy, A., Norvig, P. & Pereira, F. The unreasonable effectiveness of data. IEEE Intell. Syst. 24, 8–12 (2009).Article Google Scholar
- Kitchin, R. The real-time city? Big data and smart urbanism. GeoJournal 79, 1–14 (2014).Article Google Scholar
- McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J. & Barton, D. Big data: the management revolution. Harv. Bus. Rev. 90, 60–68 (2012).PubMed Google Scholar
- Hodson, H. Revealed: Google AI has access to huge haul of NHS patient data. New Scientist (29 Apr 2016).
- Cadwalladr, C. & Graham-Harrison, E. Revealed: 50 million facebook profiles harvested for Cambridge Analytica in major data breach. The Guardian (17 Mar 2018).
- Morey, T., Forbath, T. & Schoop, A. Customer data: designing for transparency and trust. Harv. Bus. Rev. 93, 96–105 (2015). Google Scholar
- Polonetsky, J., Tene, O. & Finch, K. Shades of gray: seeing the full spectrum of practical data De-Identification. Santa Clara Law Rev. 56, 593–629 (2016). Google Scholar
- Office for Civil Rights, HHS. Standards for privacy of individually identifiable health information. Federal Register. https://ncbi.nlm.nih.gov/pubmed/12180470 (2002).
- Malin, B., Benitez, K. & Masys, D. Never too old for anonymity: a statistical standard for demographic data sharing via the HIPAA privacy rule. J. Am. Med. Inform. Assoc. 18, 3–10 (2011).Article Google Scholar
- Rothstein, M. A. Is deidentification sufficient to protect health privacy in research? Am. J. Bioeth. 10, 3–11 (2010).Article Google Scholar
- Council of European Union. Regulation (EU) 2016/679. Off. J. Eur. Union L 119, 1–88 (2016).
- Hrynaszkiewicz, I., Norton, M. L., Vickers, A. J. & Altman, D. G. Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers. BMJ 340, c181 (2010).Article Google Scholar
- Opinion 05/2014 on anonymisation techniques. Technical Report, Article 29 Data Protection Working Party. http://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf (2014).
- Rubinstein, I. Framing the discussion. https://fpf.org/wp-content/uploads/2016/11/Rubinstein_framing-paper.pdf (2016).
- Cal. Civil Code. Assembly Bill No. 375 §§ 1798.100–1798.198 (2018).
- Narayanan, A. & Felten, E. W. No silver bullet: de-identification still doesn’t work. http://randomwalker.info/publications/no-silver-bullet-de-identification.pdf (2014).
- Ohm, P. Broken promises of privacy: responding to the surprising failure of anonymization. UCLA. Law Rev. 57, 1701 (2010). Google Scholar
- Hern, A. ‘Anonymous’ browsing data can be easily exposed, researchers reveal. The Guardian (1 Aug 2017).
- Culnane, C., Rubinstein, B. I. P. & Teague, V. Health data in an open world. Preprint at: https://arxiv.org/abs/1712.05627 (2017).
- Sweeney, L. Weaving technology and policy together to maintain confidentiality. J. Law Med. Ethics 25, 98–110. 82 (1997).Article Google Scholar
- Loukides, G., Denny, J. C. & Malin, B. The disclosure of diagnosis codes can breach research participants’ privacy. J. Am. Med. Inform. Assoc. 17, 322–327 (2010).Article Google Scholar
- Douriez, M., Doraiswamy, H., Freire, J. & Silva, C. T. Anonymizing NYC taxi data: does it matter? In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 140–148 (IEEE, Piscataway, NJ, 2016).
- Siddle, J. I know where you were last summer: London’s public bike data is telling everyone where you’ve been. https://vartree.blogspot.com/2014/04/i-know-where-you-were-last-summer.html (2014). Accessed 7 Feb 2019.
- Lavrenovs, A. & Podins, K. Privacy violations in Riga open data public transport system. In 2016 IEEE 4th Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE), 1–6 (IEEE, Piscataway, NJ, 2016). https://doi.org/10.1109/AIEEE.2016.7821808.
- de Montjoye, Y.-A., Hidalgo, C. A., Verleysen, M. & Blondel, V. D. Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3, 1376 (2013).Article Google Scholar
- de Montjoye, Y.-A., Radaelli, L., Singh, V. K. & Pentland, A. Unique in the shopping mall: on the reidentifiability of credit card metadata. Science 347, 536–539 (2015).ADS Article Google Scholar
- Matthews, G. J. & Harel, O. Data confidentiality: a review of methods for statistical disclosure limitation and methods for assessing privacy. Stat. Surv. 5, 1–29 (2011).MathSciNet Article Google Scholar
- Barth-Jones, D. The ‘re-identification’ of Governor William Weld’s medical information: a critical re-examination of health data identification risks and privacy protections, then and now. https://ssrn.com/abstract=2076397 (2012).
- El Emam, K. & Arbuckle, L. De-identification: a critical debate. https://fpf.org/2014/07/24/de-identification-a-critical-debate/ (2014).
- Sánchez, D., Martnez, S. & Domingo-Ferrer, J. Comment on “unique in the shopping mall: on the reidentifiability of credit card metadata”. Science 351, 1274 (2016).ADS Article Google Scholar
- Reiter, J. P. Estimating risks of identification disclosure in microdata. J. Am. Stat. Assoc. 100, 1103–1112 (2005).MathSciNet CAS Article Google Scholar
- Fienberg, S. E. & Sanil, A. P. A Bayesian approach to data disclosure: optimal intruder behavior for continuous data. J. Stat. 13, 75 (1997). Google Scholar
- Duncan, G. & Lambert, D. The risk of disclosure for microdata. J. Bus. Econ. Stat. 7, 207–217 (1989). Google Scholar
- Office of the Australian Information Commissioner. De-identification and the Privacy Act. https://www.oaic.gov.au/agencies-and-organisations/guides/de-identification-and-the-privacy-act (2018).
- Ruggles, S., King, M. L., Levison, D., McCaa, R. & Sobek, M. IPUMS-International. Hist. Methods 36, 60–65 (2003).Article Google Scholar
- Bennett, J. & Lanning, S. The Netflix prize. In Proc. KDD Cup and Workshop, 35–38 (ACM, New York, NY, 2007). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.6998.
- Sweeney, L. Simple demographics often identify people uniquely. Health 671, 1–34 (2000). Google Scholar
- Genest, C. & Mackay, J. The joy of copulas: bivariate distributions with uniform marginals. Am. Stat. 40, 280–283 (1986).MathSciNet Google Scholar
- Cherubini, U., Luciano, E. & Vecchiato, W. Copula Methods in Finance (Wiley-Blackwell, Hoboken, NJ, 2004).
- Genest, C. & Favre, A.-C. Everything you always wanted to know about copula modeling but were afraid to ask. J. Hydrol. Eng. 12, 347–368 (2007).Article Google Scholar
- Wang, W. & Wells, M. T. Model selection and semiparametric inference for bivariate failure-time data. J. Am. Stat. Assoc. 95, 62–72 (2000).MathSciNet Article Google Scholar
- Genz, A. Numerical computation of multivariate normal probabilities. J. Comput. Graph. Stat. 1, 141–149 (1992). Google Scholar
- Genz, A. & Bretz, F. Computation of Multivariate Normal and t Probabilities (Springer Science & Business Media, Berlin, 2009).
- Brier, G. W. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78, 1–3 (1950).ADS Article Google Scholar
- Golle, P. Revisiting the uniqueness of simple demographics in the US population. In 5th ACM Workshop on Privacy in Electronic Society (ACM, New York, NY, 2006). https://doi.org/10.1145/1179601.1179615.
- Fox-Brewster, T. 120 million american households exposed in ‘massive’ ConsumerView database leak. Forbes (2017).
- Waterfield, P. & Revell, T. Huge new facebook data leak exposed intimate details of 3m users. New Scientist (2018).
- El Emam, K. & Arbuckle, L. Anonymizing Health Data (O’Reilly, Newton, MA, 2013).
- D’Acquisto, G. et al. Privacy by design in big data: an overview of privacy enhancing technologies in the era of big data analytics. Technical Report. European Union Agency for Network and Information Security (2015).
- Cho, H., Wu, D. J. & Berger, B. Secure genome-wide association analysis using multiparty computation. Nat. Biotechnol. 36, 547–551 (2018).CAS Article Google Scholar
- Cavoukian, A. & Castro, D. Big data and innovation, setting the record straight: de-identification does work. http://www2.itif.org/2014-big-data-deidentification.pdf (2014).
- Sweeney, L. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 10, 557–570 (2002).MathSciNet Article Google Scholar
- Meyerson, A. & Williams, R. On the complexity of optimal k-anonymity. In Proc. 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 223–228 (2004). https://doi.org/10.1145/1055558.1055591.
- Aggarwal, C. C. On k-anonymity and the curse of dimensionality. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB ‘05, 901–909 (VLDB Endowment, 2005). http://dl.acm.org/citation.cfm?id=1083592.1083696.
- Li, N., Li, T. & Venkatasubramanian, S. t-closeness: privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering, 106–115 (IEEE, 2007). https://doi.org/10.1109/ICDE.2007.367856.
- Ewens, W. J. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3, 87–112 (1972).MathSciNet CAS Article Google Scholar
- Chen, G. & Keller-McNulty, S. Estimation of identification disclosure risk in microdata. J. Stat. 14, 79 (1998). Google Scholar
- Hoshino, N. Applying pitman’s sampling formula to microdata disclosure risk assessment. J. Stat. 17, 499 (2001). Google Scholar
- Keller, W. J. & Pannekoek, J. Disclosure control of microdata. J. Am. Stat. Assoc. 85, 38–45 (1990).Article Google Scholar
- Dankar, F. K., El Emam, K., Neisa, A. & Roffey, T. Estimating the re-identification risk of clinical data sets. BMC Med. Inform. Decis. Mak. 12, 66 (2012).Article Google Scholar
- Pitman, J. Random discrete distributions invariant under size-biased permutation. Adv. Appl. Probab. 28, 525–539 (1996).MathSciNet Article Google Scholar
- Skinner, C. J. & Holmes, D. J. Estimating the re-identification risk per record in microdata. J. Stat. 14, 361 (1998). Google Scholar
- Skinner, C. & Shlomo, N. Assessing identification risk in survey microdata using Log-Linear models. J. Am. Stat. Assoc. 103, 989–1001 (2008).MathSciNet CAS Article Google Scholar
- Vinh, N. X., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010).MathSciNet MATH Google Scholar