Big data has long been heralded as the ultimate 21st-century game changer. But as the development of data science rushes ahead, we are increasingly expected to put our faith in things we can’t control, writes Elaine Burke.
Since the term ‘big data’ was coined, we’ve been hearing from technology evangelists how great masses of data will be used to vastly improve how we run the world. Now that tools such as algorithms, artificial intelligence (AI) and machine learning are more and more readily at our disposal, that potential is beginning to be realised. Unfortunately, however, some of the stories emerging about how data is being used don’t yet paint a hopeful picture.
Right now, Netflix documentary The Social Dilemma is racking up viewers interested to learn how social media makes use of data, with insights from the very people who built these platforms. The film’s dramatisations seek to demonstrate how the data we feed into social media is used to trigger addictive behaviour, casting Mad Men’s Vincent Kartheiser as a multifaceted algorithm programmed to continually optimise engagement and advertising.
The power of persuasion is nothing new in the advertising world, but the issue raised by The Social Dilemma and other critiques of the online industry is this: advertising in other media can be strictly regulated, while advertising delivered at the scale and level of personalisation of online engines cannot feasibly be controlled.
Social media has borne the brunt of most coverage of data abuses, but data is being used in problematic ways elsewhere. There are regular reports of biased algorithms let loose in the wild to make flawed decisions, particularly against people who don’t fall into the rich, white male demographic.
Earlier this year, Abeba Birhane of Science Foundation Ireland’s software research centre Lero helped uncover how MIT’s much-cited ‘80 Million Tiny Images’ dataset may have contaminated AI systems with racist and misogynistic terms.
This dataset has been available to train machine learning systems since 2008, so that’s more than a decade of problematic data being used blindly by the community responsible for advancing this decision-making technology. Because who has time to vet the quality of tens of millions of data entries?
In response to this report, MIT has retired the dataset and discouraged its use. Birhane, meanwhile, has called for “radical ethics” and the careful consideration of direct and indirect impacts of using such datasets in future, particularly for vulnerable groups.
‘In the age of big data, the fundamentals of informed consent, privacy or agency of the individual have gradually been eroded’
– ABEBA BIRHANE
The risk of social bias being embedded in data came to the fore in debates over how to allocate grades to students who could not sit traditional exams during the Covid-19 pandemic. In the UK, a grading system built to make use of the data available on students and their schools was widely criticised for how it could impact students from disadvantaged communities and, overall, the roll-out of predicted grades determined by an algorithm was a disaster for the UK government.
The Irish Government tried to sidestep this pitfall by relying only on teacher assessments and students’ previous performance in State examinations for its calculated grades. However, there wasn’t enough time to code and sufficiently test such a system, and the result was that grave errors were discovered after the grades had already been published.
In this case, the Government’s decision to have the calculated grading code developed in secrecy only exacerbated its issues with time and testing. Had it taken a transparent and open-source approach to development, it would have benefitted from the help of many experienced hands on a project of significant public importance.
‘It is remarkable how quickly technology or the algorithm is blamed’
– PAUL CLOUGH
Another issue that Birhane and co-author Vinay Uday Prabhu discovered in the MIT image dataset was that images of children and others scraped from Google and other search engines had been acquired without consent. “In the age of big data, the fundamentals of informed consent, privacy or agency of the individual have gradually been eroded,” the pair warned in their paper.
The question of consent in the context of valuable research datasets was also raised in extensive investigative reporting by Noteworthy and the Business Post on genetics research in Ireland. Few sectors have promised as much from deep dives into data as genomics and personalised medicine. However, the research ongoing in this area has raised numerous red flags in terms of consent, data protection, regulation and commercial interest.
The big danger here is the erosion of public trust in genomics research practices, which would ultimately be detrimental to all. One community’s mistrust in health science can have wide-ranging impacts – just look at the anti-vax movement.
The overwhelming consensus among people the Noteworthy investigation team spoke to was that a public genome project and strategy are needed in Ireland. However, if we want public bodies to be able to handle the regulation, and indeed the use of datasets, we’ll need to overcome both a substantial knowledge gap and a technology deficit.
Just look at how Public Health England flubbed its Covid-19 statistics reporting with legacy IT. By relying on an outdated Excel file format that simply couldn’t hold the volume of data to be processed, the UK state agency missed nearly 16,000 confirmed Covid-19 cases in its reporting. Its solution? Use more Excel spreadsheets.
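The failure mode here is simple to state: the legacy .xls format caps each worksheet at 65,536 rows, so any records written beyond that limit are simply never recorded. A minimal sketch of the arithmetic (the function below is illustrative, not PHE’s actual pipeline, and assumes one record per row):

```python
# The legacy BIFF .xls format holds at most 65,536 rows per worksheet;
# the newer .xlsx format allows 1,048,576. A pipeline that pours case
# records into one .xls sheet silently loses everything past the cap.

XLS_MAX_ROWS = 65_536  # hard limit of the legacy .xls format


def rows_lost(record_count: int, max_rows: int = XLS_MAX_ROWS) -> int:
    """Number of records silently dropped if written to a single sheet."""
    return max(0, record_count - max_rows)


# A file with fewer records than the cap loses nothing...
print(rows_lost(50_000))           # 0
# ...but once the cap is exceeded, the overflow just disappears.
print(rows_lost(XLS_MAX_ROWS + 15_841))  # 15841
```

In the PHE case the loss was reportedly compounded by each test result occupying multiple rows, so each template filled up after far fewer cases than the raw row limit suggests.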
As University of Sheffield professor in search and analytics Paul Clough put it: “The bigger issue is that, in light of the data-driven and technologically advanced age in which we live, a system based on shipping around Excel templates was even deemed suitable in the first place.”
How can we trust bodies that approach data management with such ignorance to police others with regulations?
‘It’s time to reflect on where data science is going to take us and how’
Writing for The Conversation, Clough hit upon another common issue. “It is also remarkable how quickly technology or the algorithm is blamed (especially by politicians), but herein lies another fundamental issue – accountability and taking responsibility,” he wrote.
The fact is that the big problem with data-driven systems is not really the data but the people making use of it. Just as The Social Dilemma illustrated with Vincent Kartheiser’s algorithm portrayal, there are humans at the centre of the machine.
Sometimes these people are overzealous in their technological development, taking advantage of regulators’ complete inability to keep up. Sometimes they decide to take shortcuts with a wealth of materials that are readily available but unchecked. Often, they are people who have been told to ask forgiveness, not permission, when it comes to developing far-reaching technology. And then there are others who are clumsily operating a powerful tool without properly understanding it.
For the most part, these people are also working behind closed doors with no obligation for transparency in how they are using data. They may not even be able to explain their own systems, having left the machines to learn independently and interpret the data for themselves.
We need to be able to trust in data and trust the science behind it, but we’re not there yet. As we set upon another Data Science Week on the pages of Siliconrepublic.com, it’s time to reflect on where this science is going to take us and how. For a start, the cogs of the machine should be visible and subject to scrutiny.
If you get shortchanged at the till, you can check the data on your receipt and right the wrong there and then. But if you get shortchanged by a decision made by a black-box algorithm even the people who built it can’t explain, you can’t see the mistake, let alone correct it.