‘Machine learning systems should be tested for fairness’

11 Aug 2021

A woman stands beside a large statue of explorer Tom Crean on a bright, sunny day.

Image: Sarah Jane Delany

TU Dublin’s Sarah Jane Delany discusses the need for greater gender representation in machine learning and the ongoing challenges around bias in datasets.

When we discuss new and emerging trends in technology, it’s easy to get sucked into the futuristic world of what could be and forget how current and applicable these technologies already are. For example, machine learning is employed in many systems we use on a daily basis, from predicting journey times on Google Maps to giving recommendations on streaming services.

However, while there’s plenty happening in the area of machine learning and AI, this sector has been known for having poor representation of women – despite many incredible women working in the field.

Sarah Jane Delany, a full professor of inclusive computer science at TU Dublin, is particularly interested in moving the needle on gender representation, having co-founded the Women Leaders in Higher Education Network.

“Our aim is to empower and enable women in all roles in TU Dublin and to encourage time for reflection on their personal and professional development and career enhancement,” she told Siliconrepublic.com.

“Our School of Computer Science at TU Dublin has been involved in a number of initiatives to encourage and retain female students. Building on this, I was successful in getting funding from the HEA Gender Equality Enhancement Fund towards a PhD student who is working on a project we call TechMate, a toolkit of best practice techniques and methods for recruiting and retaining female students on technology courses.”

Delany graduated with first-class honours in mathematics and worked as an IT consultant for nine years before returning to academia. She said this meant she has always worked in male-dominated areas, in both IT and academia.

“Things are gradually changing now, but very slowly. I also welcome the very recent appointments of Prof Linda Doyle as provost of TCD and Dr Maggie Cusack [as president] of Munster TU.”

Delany was awarded her current position as full professor in inclusive computer science in January of this year under the Higher Education Authority’s senior academic leadership initiative.

She is also an active researcher in the area of machine learning and is the TU Dublin lead for the SFI Centre for Research Training in Machine Learning.

‘Machine learning systems should be evaluated and tested for fairness and inclusion’
– SARAH JANE DELANY

She said that her interest in machine learning was initially due to the technical challenge it presented.

“I was drawn to computer science and programming due to the logical, computational aspects, and machine learning, with its emphasis on learning from data, examples of incidents or events, really appealed to me,” she said.

“My PhD is in computer science and explored solutions for email spam filtering back in the early days of spam. Since then, I have worked a lot with text data, extending into SMS spam filtering, identifying abusive content online and now am working in the area of identifying bias in text data, mainly gender bias at the moment.”

Machine learning bias

As well as the issue of gender representation in the tech sector, machine learning itself also has problems around bias. Because this technology is built on datasets, bias and other ethical problems can be carried into systems. One example is an MIT image library used to train AI, which was found to contain racist and misogynistic terms.

If the data going into machine learning systems is biased, the systems will also learn this bias. For example, algorithms in the US criminal justice system have been found to be racially biased, with people from minority groups more likely to be falsely flagged as future criminals.

While a variety of human biases can enter datasets, such as stereotypical bias, prejudice, overgeneralisation and confirmation bias, Delany said there can also be algorithmic bias, where the bias is manifested in the machine learning algorithm itself.

“There is ongoing research in both areas looking at how to identify bias in datasets, how to de-bias data and also how to ensure algorithms are not biased against minority characteristics in the data,” she said.

“But there are other avenues that can be taken to handle bias – machine learning systems should be evaluated and tested for fairness and inclusion. Systems should be tested with groups of data that cover different genders, race, religion etc, and performance results for the groups should be the same. There should not be differences in how systems predict for one group against another group.”
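As a rough illustration of the kind of test Delany describes, the sketch below compares a model’s accuracy across groups and reports the largest gap. It is a minimal example under stated assumptions rather than an established fairness procedure: the function names, the group labels and the toy data are all hypothetical, and accuracy is only one of several metrics that could be compared.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for each distinct value in `groups`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def max_accuracy_gap(per_group):
    """Largest difference in accuracy between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy, made-up labels and predictions (assumption: not real data).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = accuracy_by_group(y_true, y_pred, groups)
gap = max_accuracy_gap(per_group)

print(per_group)  # accuracy for group "a" and group "b"
print(gap)        # a gap well above zero suggests unequal performance
```

In this toy case the model is right every time for group "a" but only half the time for group "b", which is the sort of difference such a test is meant to surface.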

In addition to testing for bias, Delany said there is an argument to be made for having data statements or model cards that give greater transparency into the data that is used to train machine learning systems.

Although there are no standards in this area yet, Google released its Model Card Toolkit last year, a toolset designed to facilitate AI model transparency reporting, and Salesforce joined in with its own Simulation Cards later in the year.
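To give a sense of what such a document might record, here is a minimal sketch of a model card as a plain data structure. It does not use Google’s Model Card Toolkit or any other real library; the class, its field names and the example values are all hypothetical, chosen only to show the kind of information a card could make visible.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class ModelCard:
    """Hypothetical, simplified model card (not a standard schema)."""
    model_name: str
    intended_use: str
    training_data: str
    known_limitations: List[str] = field(default_factory=list)
    # Performance broken down by group, e.g. {"group a": 0.91}.
    group_results: Dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialise the card so it can be published with the model."""
        return json.dumps(asdict(self), indent=2)


# Example values are invented for illustration only.
card = ModelCard(
    model_name="toy-spam-filter",
    intended_use="Flagging likely spam in English-language email",
    training_data="Made-up example: 10,000 labelled emails from one provider",
    known_limitations=["Not evaluated on SMS or non-English text"],
    group_results={"group a": 1.0, "group b": 0.5},
)

print(card.to_json())
```

Publishing per-group results alongside a description of the training data is one way a card like this could support the transparency Delany argues for.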

Challenges within the data

While there is still work to be done in the area of bias, both conscious and unconscious, steps are being taken to address some of these issues.

Delany said another challenge around machine learning is about the data itself as well as how it is used. “For researchers like me, access to data can be difficult. Data quality is also an issue, it tends to be incomplete and noisy. Companies have access to their own data but need to build systems to collect, capture, clean and store the data,” she said.

Another issue is around data privacy. Even when data is anonymised in machine learning systems, privacy can still be a problem. An example of this came in 2010, when Netflix was forced to cancel a $1m competition that aimed to improve its algorithms after it faced a lawsuit over insufficiently anonymised data.

“The Cambridge Analytica data scandal also shows how data can be misused,” said Delany. “It showed how consumer and lifestyle data on voters bought from data aggregators were used to build machine learning models to target voters in the 2016 presidential election.”

While there are plenty of problems and challenges yet to be solved within the area of machine learning, Delany did highlight her favourite applications of the technology that make life easier for many people.

“Speech recognition, speech to text, text summarisation, image recognition and automatic captioning of videos and images are some applications of machine learning that are used in assistive technology. Although some of these applications work better than others, these are difficult problems to solve.”

However, she also believes that deep learning, a subset of machine learning built on multi-layered neural networks loosely modelled on the human brain, is overhyped.

“[It] is being used to solve problems that can be solved by traditional machine learning techniques,” she said.
