Why transparency is needed to protect against AI copyright infringement

26 Nov 2024

Image: © Pavel Iarunichev/Stock.adobe.com

The real issue is that copyrighted work is being ‘folded into a proprietary product’ by LLMs without consent or compensation, said Dr Sean Ren.

Earlier this month, two news outlets lost a copyright lawsuit against OpenAI in which they alleged that the ChatGPT-maker “intentionally” removed copyright management information from their work and used it to train its artificial intelligence (AI) models.

The plaintiffs, Raw Story Media and AlterNet Media, were unable to prove “concrete injury”, the judge presiding over the case said, adding that the likelihood that ChatGPT – an AI model that processes large swaths of data from all across the internet – would output plagiarised content from one of the plaintiffs is “remote”.

The news outlets’ loss “highlights how hard it is for publishers to prove copyright infringement in the context of today’s AI,” said Dr Sean Ren, an associate professor of computer science at the University of Southern California and the CEO of Sahara AI, a decentralised AI blockchain platform.

OpenAI – much like the creators of most large language models (LLMs) – does not reveal the exact data it uses to train its proprietary AI model. Moreover, modern LLMs are made up of billions or even trillions of parameters – adjustable dials that define the behaviour of AI models, including the randomness of their responses.

The enormous scale of such models can make it nearly impossible even for their developers to comprehend the models’ inner workings beyond a basic understanding of what they do – turning them into a “black box”.

In an attempt to prove their case, the plaintiffs alleged that an “extensive review” of publicly available information revealed that “thousands” of their copyrighted works were included in OpenAI’s datasets without the author, title and copyright information – details that the plaintiffs had made available. However, their attempt to demonstrate that OpenAI had plagiarised their work proved unsuccessful.

Speaking with SiliconRepublic.com, Ren said that this case underscores the “urgent need” for transparency in how AI models work, highlighting that current copyright laws aren’t equipped to handle such situations.

He added: “The real issue isn’t just about collecting, transforming or repurposing copyright content – it’s that the work of creators is essentially being folded into a proprietary product without their consent or any form of compensation.”

In a similar lawsuit filed last year, The New York Times launched a legal battle against OpenAI for alleged copyright infringement, claiming that ChatGPT is trained on millions of articles published by the outlet. Business Insider reported last month that the Times’ lawyers were poring over ChatGPT’s source code to try to figure out how AI trains on creative work.

Many of today’s popular LLMs are centralised systems, where the model’s control and ownership lie with a central authority. These systems “operate with little transparency, leaving publishers guessing about how models are trained or what data they rely on,” said Ren, adding that “this lack of clarity undermines trust and makes it harder to enforce copyright protections”.

On the other hand, decentralised AI systems distribute processing, storage and decision-making across a number of nodes in a network – often using blockchain technology. This approach promotes transparency and accountability by ensuring that actions taken by an AI system are logged on the blockchain.
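To illustrate the kind of on-chain record being described here – this is a conceptual sketch in Python, not any specific platform’s implementation, and all names in it are hypothetical – each piece of training data could be hashed and appended to a tamper-evident log so that its later use can be audited:

# Conceptual sketch only: an append-only provenance log of the sort a
# decentralised AI platform might keep, where each training-data record
# is hashed and chained so use of a copyrighted work can be audited later.
import hashlib
import json
import time

class ProvenanceLog:
    def __init__(self):
        self.entries = []  # in a real system these entries would live on-chain

    def record(self, work_id: str, author: str, licence: str) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        payload = {
            "work_id": work_id,    # identifier of the copyrighted work
            "author": author,      # creator credited in the record
            "licence": licence,    # terms under which the work may be used
            "timestamp": time.time(),
            "prev_hash": prev_hash,
        }
        payload["hash"] = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        self.entries.append(payload)
        return payload

    def verify(self) -> bool:
        # Recompute every hash to confirm no entry was altered after the fact
        prev = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if body["prev_hash"] != prev or recomputed != entry["hash"]:
                return False
            prev = entry["hash"]
        return True

log = ProvenanceLog()
log.record("article-123", "Example Publisher", "licensed-for-training")
print(log.verify())  # True while the log is untampered

Because each entry includes the hash of the one before it, any attempt to quietly rewrite the record of what a model was trained on would break verification – which is what gives creators and publishers an auditable trail.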

Supporting decentralised AI platforms, Ren said that they “offer a promising solution, providing clear records of how AI models are trained and giving creators the ability to control and monetise their intellectual property more effectively”.

“Transparency in training data will be key, and decentralised platforms can help by enabling clear ownership records, automating royalties, and creating an ecosystem where creators and AI developers can work together more equitably.”

Earlier this year, Stability AI founder Emad Mostaque resigned from his role as company CEO and from the company’s board of directors in pursuit of decentralised AI.

Explaining his reasoning behind the departure, he said that we are “not going to beat centralised AI with more centralised AI”, adding that he wanted to push for “more transparent and distributed governance in AI”.

Future regulation around AI will also need to rethink copyright protections to ensure fairness and transparency for content creators, Ren argued, saying “AI development needs to move toward collaboration rather than extraction, ensuring creators have a clear path to be rewarded for their contributions”.


Suhasini Srinivasaragavan is a sci-tech reporter for Silicon Republic

editorial@siliconrepublic.com