Melissa Hall, from Cypress, Texas, is a rising fifth-year student studying Electrical & Computer Engineering and Plan II. For two summers she has interned at Facebook HQ, where she applied data science to understand multilingual Internet searches and developed methods to improve the debugging of computer architecture anomalies. Read her full article below:
Applying Differential Privacy Methods on American Data
In 1997, a graduate research student shocked the American public by combining information from a public, anonymized health dataset with state voting records to identify the confidential health information of the Governor of Massachusetts. After this revelation, stricter requirements for anonymization were added to the Health Insurance Portability and Accountability Act (HIPAA) through its Privacy Rule in 2003.
However, as foreign adversaries gain both the incentives and the means to manipulate the habits and ideologies of the American public, the citizens of the United States cannot afford to have their privacy protected only retroactively. The growth of data collection and processing among public and private entities has allowed governments and corporations in America to share increasingly large swaths of data relevant to individual citizens, including transportation routes, education scores, crime rates, park locations, household power consumption, and event information. Reports of repeated data breaches at top American institutions and companies reveal how susceptible this information is to a malicious adversary. Fortunately, the recent development of “differential privacy” demonstrates strong potential as an effective measure for protecting the data of American citizens.
Differential privacy is a property of a data system that limits what can be learned about any one individual from the data the system records or releases, including in the case of a security breach or leak. A common example illustrates the idea. Suppose a researcher wants to determine public opinion about a presidential candidate. She may call 100 random people and record their answers to the question “Do you want candidate X to win?”. If this data were leaked, anyone with access to it could tie each person in the dataset to his or her answer. To make the survey differentially private, the researcher can instead flip a coin every time she calls a person. If the coin lands “heads”, a random yes/no response is recorded for that individual. If the coin lands “tails”, the true response is recorded. With this method, no one can determine whether a single individual’s recorded response was truly her answer or a randomized one, which protects her privacy in the case of a data breach. Because the rate of randomization is known (half of the recorded answers are random, and half of those are “yes”), the researcher can correct for it in aggregate and still estimate public opinion about the candidate.
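The coin-flip survey above can be simulated in a few lines of Python. This is only a minimal sketch of the idea: the function names, the sample size, the seed, and the simulated 60 percent support are all assumptions made for illustration, not part of any real survey.

```python
import math
import random
from typing import Sequence


def randomized_response(true_answer: bool, rng: random.Random) -> bool:
    """Record one answer using the coin-flip scheme from the survey example."""
    if rng.random() < 0.5:         # first coin lands "heads"
        return rng.random() < 0.5  # record a random yes/no instead
    return true_answer             # "tails": record the true answer


def estimate_yes_rate(recorded: Sequence[bool]) -> float:
    """Estimate the true share of "yes" answers from the noisy records.

    P(recorded yes) = 0.5 * 0.5 + 0.5 * true_rate, so
    true_rate = 2 * (observed_rate - 0.25), clamped to [0, 1].
    """
    observed_rate = sum(recorded) / len(recorded)
    return min(1.0, max(0.0, 2.0 * (observed_rate - 0.25)))


# A recorded "yes" has probability 0.75 after a true "yes" and 0.25 after a
# true "no", so the privacy parameter is epsilon = ln(0.75 / 0.25) = ln 3.
EPSILON = math.log(0.75 / 0.25)

rng = random.Random(2018)  # fixed seed (arbitrary) so the sketch is repeatable
true_answers = [rng.random() < 0.6 for _ in range(10_000)]  # simulated opinions
recorded = [randomized_response(answer, rng) for answer in true_answers]
print(f"true yes rate:      {sum(true_answers) / len(true_answers):.3f}")
print(f"estimated yes rate: {estimate_yes_rate(recorded):.3f}")
```

Only the `recorded` list would ever be stored, so a leak exposes answers that each individual can plausibly deny. The estimate gets closer to the true rate as more people are surveyed, which is the price of the added noise.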
When applied practically, differential privacy can sharply limit an adversary’s ability to identify an individual’s data. A variety of statistical tools can help provide this property, such as adding elements of randomness (as seen in the example above), using only a portion of the data received, or using hashing to transform information into forms that are difficult to reverse. These methods can be applied at three different points in the formulation of a predictive model: at the beginning (to transform the input data), during variable weighting (which prevents the model from fitting the underlying records exactly), and at the final outcome (slightly obscuring the true outputs in the publicly observed results).
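As an illustration of the third point, noise added at the final outcome, the sketch below releases a count with Laplace noise, a standard technique for numeric outputs. The function names, the household power data, and the choice of epsilon are hypothetical, and a production system would need a sampler hardened against floating-point attacks, which this sketch is not.

```python
import random
from typing import Callable, Iterable


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(
    records: Iterable[float],
    predicate: Callable[[float], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Return a noisy count of the records that satisfy `predicate`."""
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(2018)  # fixed seed (arbitrary) so the sketch is repeatable
household_kwh = [rng.uniform(200, 1500) for _ in range(1_000)]  # made-up data
noisy = private_count(household_kwh, lambda kwh: kwh > 1000, epsilon=0.5, rng=rng)
print(f"households above 1000 kWh (noisy count): {noisy:.1f}")
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate count and weaker privacy. Choosing that trade-off is a policy decision as much as a technical one.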
Recent efforts to adopt differentially private methods in American institutions have shown promise. The United States Census Bureau announced in late 2017 that it would begin testing differentially private methods in its 2018 End-to-End Census Test, evaluating whether they would be acceptable for full application in the 2020 Census. Furthermore, American companies are developing related methodologies to secure their data sources. For example, Google has taken steps to develop differentially private algorithms so that its deep learning models, often used for language and image processing, are not traceable to individual content. Similarly, Apple has announced updates to its operating system so that it uses “Differential Privacy technology to help discover the usage patterns of a large number of users without compromising individual privacy.” The work of these institutions lays the groundwork for increased usage of differentially private methods in protecting individuals’ data from adversaries.
There are several steps that can be taken to improve the security of American citizens’ data through differential privacy. Ethics and legal advisors should encourage companies and institutions to use differentially private methods when conducting surveys, by incentivizing research into safe data collection and storage while penalizing the acquisition and storage of data in risky forms. The United States government should establish rules requiring that any government-collected dataset be evaluated for the application of differentially private protections before it is shared with the public. American citizens should be educated on the concept of differential privacy so that they know how to evaluate the potential vulnerability of any personal information requested by an external company or institution. Through these steps, the government, institutions, and citizens of the United States will be better equipped to use differential privacy to protect the information of the American public.