“This is just as puerile. Of course Open Science and Open Data
are designed so that patient data, social data, rare species, etc. are
kept confidential.”
Actually, this idea is not puerile at all.
Openness and confidentiality are uneasy partners at best. A cursory
review of the academic literature on re-identification makes this
blindingly obvious, but if you’ve never read through it, Paul Ohm’s
article “Broken Promises of Privacy” is a good place to start (not to
mention open access, refreshingly), as is Latanya Sweeney’s work.
The short version is that we are astonishingly identifiable, and the
more data that is available about us, the more identifiable we become.
The same powers of integration that make scientific data more useful as
they are interconnected apply to the data about ourselves as well.
That’s why social media companies can give away their products: data
about people lets you mark them, fairly uniquely, and sell to them.
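To make the "more data, more identifiable" point concrete, here is a minimal sketch in Python. The records are made up for illustration (they are not real data), and `unique_fraction` is a hypothetical helper, not something taken from Ohm's or Sweeney's work. It simply counts what share of rows becomes one of a kind as more attributes are combined.

```python
from collections import Counter

# Toy, made-up records: (zip_code, birth_year, sex). Not real data.
records = [
    ("02138", 1970, "F"),
    ("02138", 1970, "M"),
    ("02138", 1985, "F"),
    ("02139", 1970, "F"),
    ("02139", 1985, "M"),
    ("02139", 1985, "M"),
]


def unique_fraction(rows, columns):
    """Return the fraction of rows whose values in `columns` match no other row."""
    keys = [tuple(row[i] for i in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Each added attribute can only split groups further, never merge them.
zip_only = unique_fraction(records, [0])        # ZIP code alone
zip_year = unique_fraction(records, [0, 1])     # ZIP code + birth year
all_three = unique_fraction(records, [0, 1, 2])  # ZIP code + birth year + sex

print(zip_only, zip_year, all_three)
```

In this toy table nobody is unique on ZIP code alone, a third of the rows are unique once birth year is added, and two thirds are unique once sex is added too. Real datasets are far larger, but the direction is the same: every extra column narrows the crowd a person can hide in.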
Open data is not your new bicycle. We can’t simply throw open at a
problem and solve it without
creating new problems. And one of those problems is a problem that
exists with Big Data generally, whether or not it’s open. Our privacy
laws are out of date, re-identification is easy, and harm is subtle and
hard to notice. Benefits of sharing personal data accrue mainly to
society at this point, while harms accrue to the individual. We have to take
this issue head on, not dismiss it as puerile.
The reality of Big Data is less anonymity. The question is why Open Data
is better than Closed Data for a society with less anonymity.