Data Science and Privacy Regulations: A Storm on the Horizon
The European Union is a few short months away from finalizing a sweeping regulation that will dramatically change how data can be handled and how data science can be applied. The new regulation will affect all corporations using data from EU citizens, not just those with offices in the EU: any company collecting data from more than 5,000 EU citizens per year will be considered accountable, regardless of where it is located. The EU parliament is so serious about compliance with these new privacy and data protection laws that it has proposed a fine for violations of up to 5% of global annual turnover (1 million Euros for smaller companies). Needless to say, this massive fine has attracted serious attention to the regulation, and companies have already started preparing to comply.
Personal privacy and data protection are currently legislated and enforced in the EU through a patchwork of individual member state laws and independent supervisors. The lack of a single privacy framework complicates compliance and data transfer for multinational corporations, while also preventing EU supervisors from addressing privacy violations in a unified manner. More to the point, overly aggressive data-driven business models, ineffective lobbying strategies and underinvestment in data protection have produced a market-failure argument that is now driving a step change in regulation. That change will be delivered by the General Data Protection Regulation (GDPR).
The GDPR will become the law of the land across the EU, replacing for the most part the current member state regulations. After three years in development, the regulation is due for final ratification this year, during the current Luxembourg presidency of the EU, or, in the worst case, during the Dutch presidency (January-June 2016). Enforcement will begin within a two-year window following ratification, implemented via a One Stop Shop approach to supervision: the member state where a corporation is headquartered will act as its supervisor.
The Police and Judicial Cooperation Data Protection Directive (PJCD) will be released simultaneously and will address use of data by law enforcement agencies.
Relevance for Data Scientists
Potential Conflict of Goals: The upcoming privacy regulation will be especially challenging for data scientists because it pushes data use in precisely the opposite direction from the one in which many data scientists are pushing.
Ideally, both data scientists and privacy advocates are pursuing the best interests of the individual, but their methodologies have different goals. Data science aims to acquire new data and to find new uses for existing data. While privacy advocates strive to minimize data collection, data scientists strive to maximize it; while privacy advocates strive to reduce unexpected uses of data, data scientists strive to increase them. Compliance with the GDPR will require very careful alignment and coordination of these goals, in a way that benefits the individual from both a privacy/data protection and an economic perspective.
Generating Private Data: We are becoming increasingly aware of the ways in which the analytic techniques of data scientists are able to draw unanticipated insights from what was thought to be innocuous data. Projects have been carried out which, for example, link sensitive but anonymized data to specific individuals, reveal the gender and/or ethnicity of individuals based on Facebook likes, retrieve personal records of individuals based on a snapshot taken on the street, fingerprint cell phones based on cell tower check-ins, etc.
In a previous post, I wrote about the legal problems Netflix faced when it didn't realize how data science techniques could de-anonymize the legally protected data released during the Netflix Prize. The state of Massachusetts had a similar problem in 2002, when health care records of public employees, released as anonymous, were later partially de-anonymized.
So we see that personal data may be volunteered, observed or inferred. Although most press coverage in the last few years has focused on concerns over data observation (e.g. cookie legislation, audio/video surveillance, RFID, etc.), regulators are shifting their attention to the realms of Big Data, smart sensors, and advanced analytics.
Thus, advances in data science have expanded, and will continue to expand, the definition of Personally Identifiable Information (PII). These advances will undoubtedly influence future privacy legislation.
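The linkage risk behind these examples can be checked mechanically. The sketch below (illustrative only; the field names and the k threshold are hypothetical) counts how many records share each combination of quasi-identifiers, such as ZIP code, birth year and gender, the combination famously used to re-identify "anonymous" health records, and flags any combination held by fewer than k people:

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k records.

    Any such combination makes those records re-identifiable by linking
    against an outside data set that contains the same fields.
    """
    combos = Counter(
        tuple(rec[qi] for qi in quasi_identifiers) for rec in records)
    return {combo: n for combo, n in combos.items() if n < k}

records = [
    {"zip": "02138", "birth_year": 1945, "gender": "F", "diagnosis": "..."},
    {"zip": "02138", "birth_year": 1945, "gender": "F", "diagnosis": "..."},
    {"zip": "02139", "birth_year": 1980, "gender": "M", "diagnosis": "..."},
]
print(k_anonymity_violations(records, ["zip", "birth_year", "gender"], k=2))
# {('02139', 1980, 'M'): 1}
```

A run like this over a release candidate is a cheap first test of whether "anonymized" data is actually anonymous.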
Working with Data:
Our increasing use of cutting-edge data storage and analytic technologies puts us at even greater risk of violating privacy regulations. Modern data technologies, including an abundance of NoSQL databases, on-demand cloud storage, and in-memory processing, encourage data scientists, and corporations in general, to accumulate massive stores of raw data (data lakes). This storage raises the following challenges from a privacy compliance perspective:
- Data awareness: Companies lose oversight of what data is stored, where it is replicated, and what the risks and privacy implications of that data may be.
- Governance: Raw data may be flowing into the systems of pilot programs without mature governance models. In addition, there is concern over the security features of many cloud storage systems.
- Control: As raw data with unknown potential is retrieved, stored, copied and distributed, companies may lose oversight of where data has flowed, and with it the ability to implement the right to be forgotten/right to erasure.
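One way to keep the control problem tractable is to record, whenever data is copied, a lineage graph from each source system to its downstream copies; an erasure request can then walk the graph to find every system that must be purged. A minimal sketch (the system names are hypothetical):

```python
def erasure_targets(lineage, source):
    """Return every system holding a direct or transitive copy of `source`.

    `lineage` maps a system name to the list of systems it was copied into.
    """
    seen, stack = set(), [source]
    while stack:
        node = stack.pop()
        for copy in lineage.get(node, []):
            if copy not in seen:
                seen.add(copy)
                stack.append(copy)  # follow copies of copies as well
    return seen

lineage = {
    "crm": ["warehouse", "backup"],
    "warehouse": ["bi_extract", "ml_features"],
}
print(sorted(erasure_targets(lineage, "crm")))
# ['backup', 'bi_extract', 'ml_features', 'warehouse']
```

The hard part in practice is not the traversal but keeping the lineage map complete; every ad hoc export that bypasses it is a copy an erasure request will miss.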
Direct Impact on Data Scientists
The GDPR emphasizes the individual’s rights to understand and control how their data are used. The impact of the GDPR for data scientists includes:
- Ability to collect data. Principles of Privacy by Design/Privacy by Default will increasingly be written into legislation, minimizing the baseline level of data collection through systems and processes (think, for example, of browser default settings). Individuals will need to give express consent for what data are collected and will need to be informed as to why the data are being collected.
- Ability to use data. It will become necessary to get express consent for each application of personal data. (Details here are still under debate, and there will likely be certain exceptions.) This could severely impact the ability of data scientists to find new applications for existing data, as those applications will not have been listed in the original consent forms. Importantly, current consent will likely be grandfathered in, so it is extremely important to ensure that proper consent is in place now.
- Ability to transfer data to and from third parties. Stiff regulatory fines will certainly produce an environment in which corporations are very reluctant to buy, sell or share data that may be personal. In addition, right to privacy/erasure regulation may have strong implications for data sharing (details are still under discussion in the EU parliament). As a result, expect certain data sources to dry up.
- Customer Profiling will be specifically affected by the new regulation. In particular, customers must be informed when and how data will be used to profile them with material impact (e.g. credit scoring, fraud detection, etc.). In addition, they must have the right to opt out of automatic profiling algorithms (an opt-out that introduces selection bias, which must be dealt with in model calibration). Finally, and significantly, companies can be held in violation if their profiling algorithms are not sufficiently robust.
- Requirements in storing data. There are some significant issues here.
- Individuals will be guaranteed the right to be forgotten/right to erasure. Thus, companies will need to know the location of all copies and destinations of any data that may be tied to an individual.
- The GDPR will require not only compliance but also accountability, meaning that corporations must be prepared to demonstrate to the supervisors that they are compliant. This will require extensive preparation, undoubtedly including an extensive up-front data audit, documenting the location, type and accessibility of the three types of customer data (volunteered, observed, and inferred by data science techniques).
- Data scientists will need to be aware of the implications of passing personal data through service providers, including Cloud Storage, Cloud based BI and analytics tools, and web services.
- Much heavier emphasis on privacy in your company. A few factors will be at play here.
- The June draft proposed a fine of up to 2% of global annual turnover for violations of the regulation. This is already massive. The EU parliament subsequently proposed that this fine be increased to 5%. We are waiting to see the final figure, but, regardless, we can be sure that companies must and will make compliance with the GDPR a top priority.
- Larger companies in the EU will need to appoint a Data Protection Officer. Expect to get to know this person quite well over the coming years.
- As mentioned above, the GDPR will impose accountability, not just compliance. This means that substantial effort will need to go into producing documentation to present to the supervisor on demand, demonstrating that your company is in full control of personal data and is in compliance with the GDPR.
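The profiling opt-out mentioned above has a concrete modelling consequence: the customers who remain in the training data are no longer a representative sample. One standard correction is inverse-propensity weighting: estimate, per segment, the probability that a customer stayed in the data, and weight each remaining record by the inverse of that probability. A toy sketch (the segment labels are hypothetical):

```python
from collections import defaultdict

def inverse_propensity_weights(segments, opted_in):
    """Weight each retained record by 1 / P(stayed | segment).

    Customers who opt out of profiling drop out of the training set, so
    segments with heavy opt-out become under-represented; these weights
    restore the original segment proportions among the records that remain.
    """
    totals, stayed = defaultdict(int), defaultdict(int)
    for seg, kept in zip(segments, opted_in):
        totals[seg] += 1
        if kept:
            stayed[seg] += 1
    # one weight per record that stays in the training set, in input order
    return [totals[seg] / stayed[seg]
            for seg, kept in zip(segments, opted_in) if kept]

print(inverse_propensity_weights(["A", "A", "A", "B", "B"],
                                 [True, True, False, True, False]))
# [1.5, 1.5, 2.0]
```

Here segment B lost half its customers to opt-out, so each remaining B record counts double; in practice the propensity would be estimated with a model over richer covariates rather than a simple segment rate.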
Start Preparing Now
The GDPR is so significant that corporations are already beginning to prepare for its implementation. Compliance involves steps that cannot be taken overnight, and the accountability clause will require a documented awareness of data assets and systems, most likely including some form of data audit and risk assessment.
- Audit your entire data ecosystem now, and determine how it may expose you to privacy violations. Start with the structured data in your BI systems; look at the dark data in your operational systems; look at your Big Data, including web log data and any sensor data. Document what is there, where it is replicated, who has access, and what controls are in place. Document which data are personal and which may be made personal through data science techniques. You will most likely need to do this audit within the next year or two anyway, so it is best to do it now and introduce the necessary changes to product roadmaps early.
- Ensure that user consent is properly implemented before the GDPR takes effect. This is key because the current draft of the GDPR allows existing user consent to be grandfathered in. Without proper user consent, your ability to use any data you have already collected may be severely limited under the GDPR.
- Ensure that all product roadmaps comply with the principles of Privacy by Design. If you aren't already familiar with the concepts of privacy by design/privacy by default, become familiar now. Communicate with product owners so that future products maintain full functionality while complying with the GDPR's restrictions on data collection, and design them so that business-critical data can be collected in a way that honors privacy law while still enabling the business to be data driven to the fullest possible extent.
- Initiate dialogue with your corporate privacy officer or an external expert. The stakes have become quite high, and the subject matter is complex. There will need to be strong two-way communication between legal and technical experts, and that communication should start very soon.
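As a sketch of what "proper consent" bookkeeping might look like, the toy registry below (all names and purposes are hypothetical, not prescribed by the GDPR text) records consent per individual and per purpose, so that each new application of the data can be checked, and refused, before it runs:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class ConsentRecord:
    """One consent grant: who, for which purpose, and when."""
    subject_id: str
    purpose: str  # e.g. "order_fulfilment", "marketing_email"
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None

class ConsentRegistry:
    """Tracks per-purpose consent so each new use of data can be checked."""

    def __init__(self) -> None:
        self._records: List[ConsentRecord] = []

    def grant(self, subject_id: str, purpose: str) -> None:
        self._records.append(
            ConsentRecord(subject_id, purpose, datetime.now(timezone.utc)))

    def withdraw(self, subject_id: str, purpose: str) -> None:
        # Mark, rather than delete, so the grant history stays auditable.
        for r in self._records:
            if (r.subject_id == subject_id and r.purpose == purpose
                    and r.withdrawn_at is None):
                r.withdrawn_at = datetime.now(timezone.utc)

    def has_consent(self, subject_id: str, purpose: str) -> bool:
        return any(r.subject_id == subject_id and r.purpose == purpose
                   and r.withdrawn_at is None for r in self._records)

registry = ConsentRegistry()
registry.grant("user-42", "order_fulfilment")
print(registry.has_consent("user-42", "order_fulfilment"))  # True
print(registry.has_consent("user-42", "marketing_email"))   # False
```

The point of the sketch is the shape of the record, not the storage: consent is tied to a purpose and a timestamp, withdrawal is an event rather than a deletion, and any pipeline can ask a yes/no question before touching the data.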