Author(s):

  • Rocher, Luc
  • Hendrickx, Julien M.
  • de Montjoye, Yves-Alexandre

Abstract:

While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. Here we propose a generative copula-based method that can accurately estimate the likelihood that a specific person will be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with a low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.

Document:

https://www.nature.com/articles/s41467-019-10933-3
