80m images used to train AI pulled after researchers find string of racist terms

13 Jul 2020


Lero and UCD PhD candidate Abeba Birhane. Image: Abeba Birhane

Lero and UCD researcher Abeba Birhane has helped uncover racist and misogynistic terms in an MIT image library that was used to train AI.

Abeba Birhane of University College Dublin and the SFI software research centre Lero, who was recently featured on Siliconrepublic.com, has helped uncover how the much-cited ‘80 Million Tiny Images’ dataset may have contaminated AI systems with racist, misogynistic and other slurs.

While still awaiting peer review, a pre-print paper from Birhane and her co-author Vinay Uday Prabhu – chief scientist at UnifyID – found that the database of 80m images developed at MIT contained thousands of offensive terms.

Birhane said linking images to slurs and offensive language infuses prejudice and bias into AI and machine learning models. This helps perpetuate stereotypes and prejudices, inflicting “incalculable harm” on those already on the margins of society.

‘Ephemeral and vacuous’

“Lack of scrutiny has played a role in the creation of monstrous and secretive datasets without much resistance, prompting further questions such as what other secretive datasets currently exist hidden and guarded under the guise of proprietary assets?” Birhane said.

Birhane and Prabhu also discovered that the images used to populate the dataset were “non-consensual”, with images of children and others scraped from Google and other search engines.

Writing in their paper, the pair said: “In the age of big data, the fundamentals of informed consent, privacy or agency of the individual have gradually been eroded.

“Institutions, academia, and industry alike amass millions of images of people without consent and often for unstated purposes under the guise of anonymisation, a claim that is both ephemeral and vacuous.”

The Tiny Images dataset was created in 2006 and contained more than 53,000 different nouns directly copied from WordNet. Those nouns were then used as search terms to automatically download matching images from internet search engines.
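
As a rough illustration of the kind of check an automated pipeline like this could include, the following minimal Python sketch screens a list of labels against a blocklist before any images are downloaded. It is not the method used by the dataset's creators or by Birhane and Prabhu; the function name flag_labels and the placeholder terms are hypothetical.

```python
def flag_labels(labels, blocklist):
    """Return the labels that appear in the blocklist (case-insensitive)."""
    # Normalise the blocklist once so each lookup is a cheap set membership test.
    blocked = {term.casefold() for term in blocklist}
    return [label for label in labels if label.casefold() in blocked]


if __name__ == "__main__":
    # Placeholder data only: a real audit would use the actual noun list
    # and a curated list of offensive terms (both assumed here).
    labels = ["apple", "bicycle", "PLACEHOLDER_SLUR_A", "river"]
    blocklist = {"placeholder_slur_a", "placeholder_slur_b"}
    print(flag_labels(labels, blocklist))
```

A simple exact-match check like this would only catch terms already on the list, so it would complement human review of the vocabulary rather than replace it.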

MIT withdraws dataset

Following the publication of the paper, MIT researchers apologised and said they would be withdrawing the dataset. They added that because the images are so small (32×32 pixels), it can be difficult for people to visually recognise their content.

“It has been taken offline and it will not be put back online,” said Antonio Torralba, Rob Fergus and Bill Freeman. “We ask the community to refrain from using it in future and also delete any existing copies of the dataset that may have been downloaded.”

Looking to the future, Birhane said that she hopes this incident makes people more aware of the potential damage caused by ill-considered datasets.

“We believe radical ethics that challenge deeply ingrained traditions need to be incentivised and rewarded in order to bring about a shift in culture that centres justice and the welfare of disproportionately impacted communities,” she said.

“I would urge the machine learning community to pay close attention to the direct and indirect impact of our work on society, especially on vulnerable groups.”