Recently, Netflix released some anonymized usage data in order to seed a technical challenge (on recommending algorithms).
Bruce Schneier, a well known security expert, reports that a team of University of Texas researchers successfully de-anonymized a subset of the data through correlation with public IMDb (internet movie database) entries.
Bruce extends this by analogy to point how straightforward these forms of datamining are, and he notes the obvious analogy to book purchasing habits:
[A]s opportunities for this kind of analysis pop up more frequently, lots of anonymous data could end up at risk.
Someone with access to an anonymous dataset of telephone records, for example, might partially de-anonymize it by correlating it with a catalog merchants’ telephone order database. Or Amazon’s online book reviews could be the key to partially de-anonymizing a public database of credit card purchases, or a larger database of anonymous book reviews.
Google, with its database of users’ internet searches, could easily de-anonymize a public database of internet purchases, or zero in on searches of medical terms to de-anonymize a public health database. Merchants who maintain detailed customer and purchase information could use their data to partially de-anonymize any large search engine’s data, if it were released in an anonymized form. A data broker holding databases of several companies might be able to de-anonymize most of the records in those databases.
What the University of Texas researchers demonstrate is that this process isn’t hard, and doesn’t require a lot of data. It turns out that if you eliminate the top 100 movies everyone watches, our movie-watching habits are all pretty individual. This would certainly hold true for our book reading habits, our internet shopping habits, our telephone habits and our web searching habits.
. . .
With only eight movie ratings (of which two may be completely wrong), and dates that may be up to two weeks in error, they can uniquely identify 99 percent of the records in the dataset. After that, all they need is a little bit of identifiable data: from the IMDb, from your blog, from anywhere. The moral is that it takes only a small named database for someone to pry the anonymity off a much larger anonymous database.