Much of the data individuals provide is assumed to be protected because it is anonymized, that is, stripped of any information that identifies who those individuals are. Such anonymized data is everywhere. But how safe is the underlying assumption that individuals can’t be reidentified through such data? Unfortunately, as we discuss in this blog post, there is repeated evidence that this assumption is not holding up. That raises real concerns that people can be victimized through released information that is traced back to them, and it makes reidentification an emerging law enforcement issue.
Part of the vulnerability in anonymized data is that there is so much more information available today and that information is easier to combine. In one interesting case, paparazzi ended up as “sensors”: data traced from their photographs of celebrities taking taxis around NYC was linked to an anonymized data set of NYC taxi cab rides. As a result, it was possible to determine where celebrities went and how much they tipped. Digging into the taxi data set and overlaying that information with images found through Google searches showed trips taken by Kourtney Kardashian, Jessica Alba, Bradley Cooper, Olivia Munn, Katherine Heigl, and more.
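To make the linking step concrete, here is a minimal sketch of how an outside observation can be joined to "anonymized" trip records. Everything in it is an assumption made for illustration: the field names, the cab identifiers, the sample records, and the ten-minute matching window are hypothetical and are not taken from the actual NYC data set. It assumes only that a photograph yields a cab identifier and an approximate time.

```python
from datetime import datetime, timedelta

# Hypothetical "anonymized" trip records: no rider names, but each row
# still carries a cab identifier, a pickup time, a destination, and a tip.
trips = [
    {"cab": "7X21", "pickup": datetime(2013, 7, 8, 19, 2), "dropoff_area": "Tribeca", "tip": 0.0},
    {"cab": "7X21", "pickup": datetime(2013, 7, 8, 23, 40), "dropoff_area": "Harlem", "tip": 3.0},
    {"cab": "3B55", "pickup": datetime(2013, 7, 8, 19, 1), "dropoff_area": "Midtown", "tip": 2.5},
    {"cab": "9K10", "pickup": datetime(2013, 7, 9, 12, 15), "dropoff_area": "SoHo", "tip": 4.0},
]

# Hypothetical outside observations, e.g. read off dated photographs.
sightings = [
    {"person": "Celebrity A", "cab": "7X21", "seen_at": datetime(2013, 7, 8, 19, 0)},
    {"person": "Celebrity B", "cab": "9K10", "seen_at": datetime(2013, 7, 9, 12, 20)},
]


def link_sightings(trips, sightings, window=timedelta(minutes=10)):
    """Attach a name to every trip whose cab and pickup time fit a sighting."""
    matches = []
    for sighting in sightings:
        for trip in trips:
            same_cab = trip["cab"] == sighting["cab"]
            close_in_time = abs(trip["pickup"] - sighting["seen_at"]) <= window
            if same_cab and close_in_time:
                matches.append({
                    "person": sighting["person"],
                    "dropoff_area": trip["dropoff_area"],
                    "tip": trip["tip"],
                })
    return matches


matches = link_sightings(trips, sightings)
for match in matches:
    print(match)
```

Neither input names a rider and a destination together, yet the join does. Removing the name column does not help if the remaining columns can be matched against something observable from outside.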
University of Texas researchers provide an earlier example. Using the Netflix Prize dataset, which contained the movie ratings of roughly 500,000 Netflix subscribers, they were able to identify individual users. Once ratings were traced back to a particular subscriber, the researchers could make inferences about such things as political party affiliation.
Another example comes from a recent Science publication, which reviewed a study about credit card metadata. The study looked at three months of credit card charges for a total of 1.1 million people. The data set was not publicly available, but it did contain shopping information, and researchers were able to show that as few as four pieces of data about a person's purchases could reveal who 90 percent of the individuals in the database were.
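The idea that a handful of data points can single a person out can be sketched in a few lines. The example below is a toy illustration under stated assumptions, not the study's method or data: the three shoppers, their (day, shop) points, and the function name `unicity` are all hypothetical. It computes the fraction of possible k-point sets that match exactly one person's record.

```python
from itertools import combinations

# Hypothetical purchase traces: each person is a set of (day, shop) points.
traces = {
    "u1": {("Mon", "bakery"), ("Tue", "gym"), ("Wed", "cafe")},
    "u2": {("Mon", "bakery"), ("Tue", "gym"), ("Thu", "bar")},
    "u3": {("Mon", "bakery"), ("Fri", "cinema"), ("Sat", "market")},
}


def unicity(traces, k):
    """Fraction of k-point subsets of a trace that match only that trace."""
    unique = 0
    total = 0
    for user, points in traces.items():
        # Every way an observer could know k of this user's points.
        for known in combinations(sorted(points), k):
            known = set(known)
            # Everyone whose trace contains all of the known points.
            matching = [u for u, pts in traces.items() if known <= pts]
            total += 1
            if matching == [user]:
                unique += 1
    return unique / total if total else 0.0


for k in (1, 2, 3):
    print(k, unicity(traces, k))
```

On this toy data the fraction rises from 4/9 with one known point to 7/9 with two and 1.0 with three. Each additional known point shrinks the crowd a person can hide in, which is the intuition behind the result for real credit card records.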
U.S. medical research relies on anonymized data, but it too is vulnerable. Latanya Sweeney, head of the Harvard University Data Privacy Lab, has highlighted several cases of identification through anonymized data. In 1997 she was able to discover the Governor of Massachusetts’ medical records through a release of anonymized data from the Massachusetts Group Insurance Commission. More recently her lab exposed vulnerabilities in the Personal Genome Project data set, correctly identifying over 84 percent of the participants and correlating them to their DNA information.
Overall, vulnerabilities in anonymized data concern law enforcement because the divulgence of personal information can leave a victim vulnerable to being targeted, harassed, or even blackmailed.
References:
http://gawker.com/the-public-nyc-taxicab-database-that-accidentally-track-1646724546
http://www.sciencemag.org/content/347/6221/536.full
https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
https://www.eff.org/deeplinks/2009/09/what-information-personally-identifiable
http://dataprivacylab.org/projects/pgp/index.html