How can we mitigate ethical and privacy issues in data science?

26 Oct 2017


With advances in data science creating more automated decision-making tools, how do those in the field mitigate potential problems?

Those who work in data science and analytics are at the coalface of digital transformation, and opportunities for achieving social good within the field of data science are huge. In just a few short years, data has become one of the most valuable commodities in the global economy.

The vast quantities of user data from social media accounts, apps and IoT devices have created swathes of information from which vital insights can be mined to implement new problem-solving technologies, ostensibly making the world a better place.

That promise hinges on the continuing development of an ethical framework, increased transparency with the public about their data, and constant cognisance of potential biases (both human and within datasets) that could creep into automated decision-making or other technological advances.

Alan Smeaton is professor of computing at Dublin City University (DCU) and director of the Insight Centre for Data Analytics at DCU.

He noted Google as the first major example of big-data analysis by a large company, explaining that “they had a huge amount of log files and archives of people’s previous searches that they could mine for patterns”.

This was one of the first steps in the emergence of data science and predictive analytics as we know them today.

Resurgence of deep learning

Smeaton broke down the differences between traditional machine learning (think of it as a current application of AI based on giving machines access to data, letting them learn for themselves) and a complex subset within it: deep learning.

To take an example, traditional machine learning algorithms are why we are able to view targeted advertising based on our online activity, whether it’s a YouTube pre-roll ad or a targeted ad on Facebook.

Deep learning with deep neural networks is a little more complex and a lot more scalable, but that comes with a caveat, said Smeaton: “It can scale to much larger volumes and give much better results. The thing is, we don’t know how it works.”

There is a certain level of opacity and mystery when it comes to these so-called ‘black box’ deep learning models. As Smeaton pointed out: “You can’t open up the lid and look inside and ask, ‘Why is this [the] prediction?’

“We are getting much more accurate machine learning and data analytics; we are getting much more accurate in terms of performance, more accurate in terms of even being able to outperform humans, but we have no capacity to explain why.”

Algorithms and profiling

This creates some ethical quandaries, according to Dr Robert Ross, senior lecturer in the School of Computing at Dublin Institute of Technology (DIT) and a funded investigator in the Adapt Centre.

He mentioned the creation of algorithmic models “by industry for common tasks such as financial profiling or even measuring suitability for employment for a given company”.

He continued: “Companies are creating these models for widespread uses and, while banks were historically required to use models that had the ability to be explained, this requirement is less and less enforced, leading [to] the black-box models that are subject to the whims – or, rather, the biases – that [are] present in their creation.”

Smeaton said that data scientists are trying to explain why certain black-box models reach biased outcomes by slightly changing variables and examining differing results to find the source of the bias.

He also stated that often, bias can be explained by an insufficient amount of data in the first place.
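The probing approach Smeaton describes, changing one variable slightly and watching how the output moves, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scoring function, the feature names and the applicant records below are all hypothetical stand-ins for a real opaque model and are not taken from any system mentioned in this article.

```python
from statistics import mean

def black_box_score(applicant):
    # Hypothetical stand-in for an opaque model. In practice the analyst
    # would only be able to call it, not read its internals.
    score = 0.5 * applicant["income"] / 100_000
    score += 0.3 * applicant["years_employed"] / 10
    if applicant["postcode_group"] == "B":
        score -= 0.2  # hidden penalty that the probe should surface
    return score

def sensitivity(model, records, feature, new_value):
    """Average change in model output when one feature is overwritten."""
    deltas = []
    for record in records:
        perturbed = dict(record, **{feature: new_value})
        deltas.append(model(perturbed) - model(record))
    return mean(deltas)

# Hypothetical applicants, all from postcode group A.
applicants = [
    {"income": 40_000, "years_employed": 2, "postcode_group": "A"},
    {"income": 65_000, "years_employed": 8, "postcode_group": "A"},
    {"income": 52_000, "years_employed": 5, "postcode_group": "A"},
]

# Flip only the postcode group, then only the tenure, and compare the shifts.
postcode_effect = sensitivity(black_box_score, applicants, "postcode_group", "B")
tenure_effect = sensitivity(black_box_score, applicants, "years_employed", 6)

print(f"postcode shift: {postcode_effect:+.3f}")
print(f"tenure shift:   {tenure_effect:+.3f}")
```

In this toy setup, a large shift from a feature that should be irrelevant (here, the postcode group) is the kind of signal that would prompt a closer look at where the bias entered.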

So, what can data scientists do to ensure these exciting new developments are handled in transparent and ethical ways, avoiding biases that could negatively affect people’s lives?

Data-first approach

The start of the pipeline, the data itself and how it is approached, is vital here.

A recent example is Google’s Cloud Natural Language API, which a Motherboard journalist found to have negative associations with the words ‘gay’ and ‘Jew’, reflecting the historical bias in society against these groups of people. The data used by the sentiment analyser was simply a mirror reflecting long-held social prejudices.

Dr Luca Longo, lecturer in the School of Computing in DIT and TEDx speaker, explained that to mitigate such biases, data scientists need to understand the data and its contexts before they even begin modelling algorithmic patterns.

If the data collected is biased, then it follows that the algorithm will carry these same biases. This is not ideal, particularly when some algorithms could be used to solve complex social problems.

“Therefore, the key step before training an algorithm is to understand available data and pre-process it in a way that biases are reduced. Afterwards, a more balanced training of models can take place and automated decision-making might become more neutral.”
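Longo does not name a specific pre-processing method. As one possible illustration, the sketch below uses a reweighting idea from the fairness literature: each training row gets a weight so that, after weighting, group membership and outcome label look statistically independent. The group names, labels and counts are invented for the example.

```python
from collections import Counter

def reweigh(rows):
    """rows is a list of (group, label) pairs; returns one weight per row.

    weight = P(group) * P(label) / P(group, label), estimated from counts.
    """
    n = len(rows)
    group_counts = Counter(group for group, _ in rows)
    label_counts = Counter(label for _, label in rows)
    pair_counts = Counter(rows)
    return [
        (group_counts[group] * label_counts[label]) / (n * pair_counts[(group, label)])
        for group, label in rows
    ]

def weighted_positive_rate(rows, weights, group):
    """Share of weighted rows in `group` that carry the positive label 1."""
    total = sum(w for (g, _), w in zip(rows, weights) if g == group)
    positive = sum(w for (g, y), w in zip(rows, weights) if g == group and y == 1)
    return positive / total

# Hypothetical skewed history: group "a" was approved far more often than "b".
rows = [("a", 1)] * 6 + [("a", 0)] * 2 + [("b", 1)] * 2 + [("b", 0)] * 6
weights = reweigh(rows)

rate_a = weighted_positive_rate(rows, weights, "a")
rate_b = weighted_positive_rate(rows, weights, "b")
print(f"weighted positive rate, group a: {rate_a:.2f}")
print(f"weighted positive rate, group b: {rate_b:.2f}")
```

The weights would then be passed to whatever training routine accepts per-sample weights. Reweighting only addresses the imbalance that is visible in the recorded groups and labels, so it complements rather than replaces understanding the data's context.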

Assistant head of the School of Computing in DIT and co-founder of DIT’s Applied Intelligence Research Centre, Prof Sarah Jane Delany explained that algorithms used to produce models “assume that the training data used is representative of all possible outcomes so it can ‘learn’ or generalise well”.

Smeaton explained how “bias creep” can occur relatively easily in datasets, “unless we are aware of it, take appropriate action and be cognisant of it when we look at our outcomes from using those data sources”.

Sampling bias is one of the biggest obstacles data scientists face, according to Ross. He posed the question: “Can we be sure that the data we collect to train our model on is an accurate reflection of the full population that our model will later be applied to?” If the answer is no, gender bias, racial bias and even financial bias could all occur as a result of the initial skewed dataset.

He said that although increasing the amount of data collected can help somewhat, “a little time spent thinking out the problem can solve a lot of pain down the line”. Considering that more and more SMEs are investing in modelling users and individuals, thorough examination of sampling bias risk is a must.
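Ross's question can be turned into a quick first check: compare how each group is represented in the training sample with its share of the population the model will serve. The sketch below assumes reference population shares are available (for instance, from census figures); the groups, counts and the 10-percentage-point threshold are all made up for illustration.

```python
from collections import Counter

def representation_gaps(sample_groups, population_shares):
    """Sample share minus population share, per group.

    Negative values mean the group is under-represented in the sample.
    """
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }

# Hypothetical training sample and reference population.
sample = ["female"] * 20 + ["male"] * 80
population = {"female": 0.5, "male": 0.5}

gaps = representation_gaps(sample, population)

# Flag any group whose share is off by more than 10 percentage points
# (an arbitrary threshold chosen for this example).
flagged = [group for group, gap in gaps.items() if abs(gap) > 0.10]

for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
print("needs review:", flagged)
```

A check like this only catches skew along attributes that were recorded, which is one reason the time spent "thinking out the problem" still matters.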

The problem of ‘fairness’

Delany also mentioned the nebulous nature of the concept of ‘fairness’, citing a paper by Harvard researcher Thomas Miconi, who discussed the COMPAS recidivism algorithm that caused controversy last year.

Miconi explained that achieving a truly fair predictive model is extremely difficult: “Since several intuitive measures of fairness are mutually exclusive (when populations differ in prevalence and the predictor is neither perfect nor trivial), it follows that any predictor can always be portrayed as biased or unfair, by choosing a specific measure of fairness.” Miconi’s result applied to all forms of prediction, whether performed by algorithms or by humans.

He emphasised that this incompatibility of fairness measures “should not be used as a cover for obvious injustices”, and that a better understanding of logical constraints on the outcomes of decision systems can and should inform – and therefore assist – efforts towards making the world a fairer place.
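The incompatibility Miconi describes can be seen with simple arithmetic. In the sketch below, a predictor has identical error rates for two groups (the same true positive rate and false positive rate), yet because the groups differ in prevalence, a positive prediction is right far more often for one group than for the other. All figures are invented for illustration and are not drawn from COMPAS or from Miconi's paper.

```python
def positive_predictive_value(prevalence, tpr, fpr):
    """Probability that a positive prediction is correct (Bayes' rule)."""
    true_positives = tpr * prevalence
    false_positives = fpr * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Same predictor quality for both groups: "fair" by the error-rate measure.
TPR = 0.8  # share of actual positives that are flagged
FPR = 0.2  # share of actual negatives that are wrongly flagged

# Hypothetical groups that differ only in how common the outcome is.
ppv_high_prevalence = positive_predictive_value(0.5, TPR, FPR)
ppv_low_prevalence = positive_predictive_value(0.2, TPR, FPR)

print(f"PPV, prevalence 50%: {ppv_high_prevalence:.2f}")
print(f"PPV, prevalence 20%: {ppv_low_prevalence:.2f}")
```

Equalising the second measure instead would force the error rates apart, which is why either choice can be portrayed as unfair by someone who picks the other yardstick.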

Algorithm jocks need ethics, too

Dr Patrick Healy, senior lecturer at University of Limerick’s Computer Science and Information Systems department, said that fair and accurate analysis of data is a sine qua non for data scientists, and “although ethics can be pooh-poohed by algorithm jocks, it is a topic that many data scientists need to be aware of”.

Longo highlighted the importance of seeking advice from outside the data science field when implementing models based on datasets.

He said the implementation of ethical guidelines in AI in a general sense “is an interdisciplinary problem that should involve data scientists, philosophers [and] lawyers working together to guarantee the design and development of computational approaches that are beneficial to our societies”.

He spoke passionately about the need for a strong ethical framework. “People, institutions or entities responsible [for] the public dissemination and usage of these algorithms should be trained or made aware of their impact on human decision-making and on the objectivity of data.

“Ethics, at this level, is fundamental, and the trust placed upon the outcomes of these algorithms should be seriously taken with reservations.”

Healy echoed this sentiment and cited an all-too-common human problem: groupthink. This could cause “data scientists and/or algorithmists to see solely the algorithmic challenges in their work” as opposed to the effect such algorithms could have in the real world.

Healy also highlighted the need for data science professionals to recognise their errors, mentioning Google’s profuse apology after its facial-recognition software misidentified black people as gorillas. He added that “less arrogance, more humility and a better understanding of their own limitations” would be of immeasurable help to data scientists seeking to implement ethical work practices.

A sober view of data science

The conversation about ethics in data science, particularly its applications in AI, is getting louder, according to Ross. From the SFI-funded Adapt Centre spread across several Irish institutions, to the Enterprise Ireland-funded CeADAR Centre, he has seen many companies wishing to discuss the issue of data science ethics.

GDPR is also a major factor, as individuals are expected to become more literate in terms of their own data.

Ross said: “While the General Data Protection Regulation framework will provide a firmer legal basis for the protection of personal information than we have had before, the ethical question of whether a user has given informed consent for their data to be mined will become more salient.”

Healy noted a change in the perception around big data in general: “The ‘wow’ phase of big data appears to be coming to an end, and a more sober understanding of its power is replacing it.”

Smeaton also cited this ‘cooldown’ in excitement about the building of the algorithms themselves, saying that it is currently “more about your data sources, being aware of bias and imbalance”. He explained that in his teaching at DCU, the privacy and sensitivity of data are always stressed to students.

Looking ahead, Delany said that the tech developments themselves are not necessarily the issue; rather, close attention should be paid to the ethics and objectives of the people and companies that use them.

Through increased transparency, thorough examination of datasets, mindful building and collaboration with experts outside of data science, perhaps the code of ethics could be one that evolves alongside the rapid pace of innovation itself. A collaborative mindset will get us further as a society than argumentative and adversarial outlooks, according to Longo.

As Ross succinctly put it: “It is up to society, influencers and politicians to help make sure that we keep to the ethical path.”