Generative AI has already caused confusion over copyright for artistic content, but the copyright issues may extend to popular AI models themselves.
The AI sector may face a new wave of copyright disputes, as a new report claims many large language models (LLMs) are trained on copyrighted content from news organisations.
The report comes from the News Media Alliance (NMA), a nonprofit whose members include more than 2,200 publishers in the US. The association claims to have examined datasets that are used to train LLMs such as the popular ChatGPT.
In a white paper, the NMA claims that these datasets contain copyrighted content from news, magazine and digital media organisations. The report also claims that some of the most “widely used LLMs” have a preference for publisher content over “generic” content scraped from the internet.
The report also suggests that these AI models “copy and use” copyrighted publisher content when generating outputs. The NMA said the “pervasive copying” infringes copyright law as it is not excused under the fair use doctrine.
“There is no question that creating such models relies on copying – indeed, many rounds of copying – of third-party works, such as the protected expression of our members,” the NMA said in its white paper.
The NMA said generative AI has the potential to benefit society by imitating and copying content “quickly and cheaply”, but argued that these models can only do so because they have been trained on “the fruits of human creativity at massive scale and largely without consent or compensation”.
“The works these models can imitate and copy in this way include prize-winning landmarks of culture produced at great cost to news, magazine and digital publishers – and often at great peril to the journalists they employ,” the NMA said.
While the NMA analysed publicly available datasets, a recent report claims leading AI companies are becoming less transparent around their creations. In March, OpenAI faced criticism for keeping various details of GPT-4 private, such as the model’s architecture, hardware and training methods.
The NMA has submitted comments to the US Copyright Office in relation to a study of the copyright issues raised by generative AI.
Earlier this year, thousands of authors including Margaret Atwood and Jodi Picoult signed a letter calling on the likes of OpenAI, Alphabet and Meta to stop using their work to train AI models without “consent, credit or compensation”.
Meanwhile, AI-generated artwork remains a contentious issue in copyright law. A copyright infringement case against several prominent AI art generators developed recently, when the judge dismissed the claims against DeviantArt and Midjourney but allowed a case against Stability AI to continue.
In August, a US district court judge ruled that artwork generated by AI cannot be copyrighted, arguing that copyright has never been granted to work that was “absent any guiding human hand” and that human beings are an “essential part of a valid copyright claim”.