US start-up Arthur claims its Bench tool has a range of scoring metrics to evaluate and compare large language models.
With the surge of AI models launched this year, it seems inevitable that comparison tools would be developed.
Now, AI start-up Arthur has released a new open-source tool to help companies compare different large language models (LLMs). The US company claims this tool – Arthur Bench – lets users compare how AI models perform in real-world scenarios.
LLMs are systems that are trained on a massive volume of data and are able to answer reading comprehension questions, solve basic maths problems and generate text.
These systems power leading AI products such as ChatGPT, whose launch sparked a surge in the number of AI models being released.
Arthur said its open-source tool has a range of scoring metrics to help companies decide which LLM they want to use, such as summarisation quality and “hallucinations” – when an AI model generates false or fabricated information and presents it as fact.
“Arthur Bench helps companies compare the different LLM options available using consistent metrics so they can determine the best fit for their application in a rapidly evolving AI landscape,” the start-up says on its website.
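The idea of scoring different models against a shared set of reference answers can be illustrated with a minimal sketch. Note this is not Arthur Bench's actual API – the function names, the token-overlap metric and the example data below are all hypothetical, chosen only to show the general pattern of applying one consistent metric across candidate models:

```python
# Hypothetical sketch of benchmarking LLM outputs with a consistent metric.
# NOT Arthur Bench's real API; names and scoring logic are illustrative.

def token_overlap_score(candidate: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the candidate."""
    cand_tokens = set(candidate.lower().replace(".", "").split())
    ref_tokens = set(reference.lower().replace(".", "").split())
    if not ref_tokens:
        return 0.0
    return len(cand_tokens & ref_tokens) / len(ref_tokens)

def compare_models(outputs_by_model: dict[str, list[str]],
                   references: list[str]) -> dict[str, float]:
    """Score each model's outputs against shared references; return averages."""
    return {
        model: sum(token_overlap_score(out, ref)
                   for out, ref in zip(outputs, references)) / len(references)
        for model, outputs in outputs_by_model.items()
    }

# Two hypothetical models answering the same prompt, scored identically.
references = ["Paris is the capital of France."]
outputs = {
    "model_a": ["The capital of France is Paris."],
    "model_b": ["France is a country in Europe."],
}
scores = compare_models(outputs, references)
```

Real benchmarking tools use far richer metrics (semantic similarity, hallucination detection, summarisation quality), but the shape is the same: identical inputs, identical scoring, comparable numbers per model.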
The start-up said new metrics and features will become available as the project and its community expand, a process helped by the tool being open source.
Arthur was founded in 2019 and has raised $60m in venture capital to create tools that measure AI performance. The start-up has also unveiled its Generative Assessment Project, a research initiative that ranks strengths and weaknesses of different models from companies like OpenAI, Anthropic and Meta.
This research suggests that Anthropic’s models may have a slight edge over OpenAI’s GPT-4 in terms of reliability within specific domains. Recent reports suggest Anthropic is raising $100m from SK Telecom, the largest telco in South Korea, to boost the company’s communications-focused AI business.
Arthur CEO and founder Adam Wenchel claims there is “an incredible amount of nuance” to understanding the differences in performance between different LLMs, based on the company’s research.
“With Bench, we’ve created an open-source tool to help teams deeply understand the differences between LLM providers, different prompting and augmentation strategies, and custom training regimes,” Wenchel said.