We’re stuck in a swamp of online content – how do we get out?

4 Jan 2021

A hand is reaching out of a swamp-like image of binary code.

Image: © Photobank/Stock.adobe.com

Now that even computers are getting lost in the weeds of online content, we need to start thinking about how a curated internet might benefit us all, writes Elaine Burke.

The phrase “content is king”, while surely in circulation for some time beforehand, is primarily attributed to a 1996 essay by Bill Gates. In it, he compared the rapidly evolving internet age to the advent of television – where “the long-term winners were those who used the medium to deliver information and entertainment”.

Gates identified the internet’s “broad opportunities” to provide information and entertainment in new forms, and the democratising effect of a low barrier to entry. His assertion that “no company is too small to participate” correctly forecasted the growth in businesses of all stripes becoming online content producers in a multitude of ways, but did not foretell a very 21st-century phenomenon: user-generated content.

“Over time, the breadth of information on the internet will be enormous, which will make it compelling,” Gates wrote in the mid-90s, at the time assuming that this proliferation of content would be led by businesses and organisations, not each and every citizen of the digital age. “Enormous”, for this measure of content, is a whopping understatement. In fact, there doesn’t seem to be a word capable of capturing the magnitude of online content today. (I could say it’s unprecedented, but that’s 2020’s word and we are all keen to move on.)

AI’s great content challenge

Impossible seems to be a suitable term. We are living in an age of impossible levels of content. There is simply too much. Not just for human consumption, but even for machines.

Artificial intelligence researchers hungry for data to feed the machine have revelled in the proliferation of online information, but it is fool’s gold they are mining. The size of the datasets makes vetting and curation nearly impossible, and many now find themselves facing the reality that the oceans of data used to train machine learning models also contain the dregs.

When everyone is a publisher and everything is published, the online dataset of ‘content’ becomes polluted with misinformation, bias and, in all certainty, a plentiful amount of absolute garbage. And one of the first principles of data analysis is garbage in, garbage out.

‘There’s a risk that racist, sexist and otherwise abusive language ends up in the training data’
– KAREN HAO

Prof Vincent Wade, director of the Adapt centre for digital media research, addressed this issue at Future Human in October 2020. “We talk about data lakes but actually a number of them have turned toxic because they can’t be used, because of that provenance problem, because of those issues of trust,” he told the virtual audience at the event. He also advocated for researchers to do “more AI with less data”.

Wade is by no means alone in his concerns. The recent controversial departure of Timnit Gebru from Google’s ethical AI team was reportedly precipitated by a paper that raised many issues around training language models on the data gleaned from the broadest (and, likely, cheapest and easiest to access) segments of the internet.

“Researchers have sought to collect all the data they can from the internet, so there’s a risk that racist, sexist and otherwise abusive language ends up in the training data,” wrote Karen Hao, senior AI reporter at MIT Technology Review, who has seen the paper co-authored by Gebru.

The potential problem with this requires no further explanation, but Hao said the paper expands on the even more subtle issues that arise from these muddied datasets, such as the inability of a model trained on normalised brutal language to progress as language in society does.

“Shifts in language play an important role in social change; the MeToo and Black Lives Matter movements, for example, have tried to establish a new anti-sexist and anti-racist vocabulary. An AI model trained on vast swathes of the internet won’t be attuned to the nuances of this vocabulary and won’t produce or interpret language in line with these new cultural norms,” wrote Hao.

“It will also fail to capture the language and the norms of countries and peoples that have less access to the internet and thus a smaller linguistic footprint online. The result is that AI-generated language will be homogenised, reflecting the practices of the richest countries and communities.”

Unfortunately, voices sharing these concerns appear largely as outsiders in the race to build technologies based on these AI models.

A path forward

Gates’ essay is a quarter of a century old at this point, yet it outlines a challenge that publishers are still struggling with today: how to financially support online content.

Gates correctly predicted that an online payments infrastructure fit for micro-transactions would be necessary for publishers to secure sustainable business models, and users are finally starting to adapt to the idea of paying for online content. A more modern forecast, however, might suggest that what’s worth paying for is online curation.

There will always be an appetite for free content, but anyone who spent the recent festive period enjoying copious amounts of chocolates and booze (I mean, you deserved it) will know that what you desire isn’t necessarily what’s good for you.

We, as humans, don’t have the capacity to sift through the impossible amount of information freely available to us, especially as some of it now comes intentionally packaged to mislead. And the AI-led technologies being designed to filter this information are becoming stuck in the same quagmire, unable to tell the good from the bad, and marred by some of the same ills that plague society at large.

The real business opportunity gleaming amid the swamp of online content is a signpost for the way out. And, ideally, those signs will be legible to all.

The freedom of information online has many benefits, not least that it is accessible to those who can’t afford subscriptions to knowledgeable sources. A future where those who can pay support access for those who can’t is not only desirable, but quite likely. And there are further opportunities for those curators of content who can offer safe passage through it all. Maybe that future will be assisted by AI, but it will certainly need humans at its core.


Elaine Burke is the host of For Tech’s Sake, a co-production from Silicon Republic and The HeadStuff Podcast Network. She was previously the editor of Silicon Republic.
