UPDATED 22:13 EDT / JULY 01 2024

AI

Anthropic launches new program to fund creation of more reliable AI benchmarks

Generative artificial intelligence startup Anthropic PBC wants to prove that its large language models are the best in the business. To do that, it has announced the launch of a new program that will incentivize researchers to create new industry benchmarks that can better evaluate AI performance and impact.

The new program was announced in a blog post published today. The company explained that it’s willing to dish out grants to any third-party organization that can come up with a better way to “measure advanced capabilities in AI models.”

Anthropic’s initiative stems from the growing criticism of existing benchmark tests for AI models, such as the MLPerf evaluations that are carried out twice annually by the nonprofit entity MLCommons. It’s generally agreed that the most popular benchmarks used to rate AI models do a poor job of assessing how the average person actually uses AI systems on a day-to-day basis.

For instance, most benchmarks are too narrowly focused on single tasks, whereas AI models such as Anthropic’s Claude and OpenAI’s ChatGPT are designed to perform a multitude of tasks. There’s also a lack of decent benchmarks capable of assessing the dangers posed by AI.

Anthropic wants to encourage the AI research community to come up with more challenging benchmarks, focused on their societal implications and their security. It’s calling for a complete overhaul of existing methodologies.

“Our investment in these evaluations is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem,” the company stated. “Developing high-quality, safety-relevant evaluations remains challenging, and the demand is outpacing the supply.”

As an example, the startup said, it wants to see the development of a benchmark that’s better able to assess an AI model’s ability to get up to no good, such as by carrying out cyberattacks, manipulating or deceiving people, enhancing weapons of mass destruction and more. It said it wants to help develop an “early warning system” for potentially dangerous models that could pose national security risks.

It also wants to see more focused benchmarks that can rate AI system’s potential for aiding scientific studies, mitigating ingrained biases, self-censoring toxicity and conversing in multiple languages, it says.

The company believes that this will entail the creation of new tooling and infrastructure that will enable subject-matter experts to create their own evaluations for specific tasks, followed by large-scale trials that involve hundreds or even thousands of users. To get the ball rolling, it has hired a full-time program coordinator, and in addition to providing grants, it will give researchers the opportunity to discuss their ideas with its own domain experts, such as its red team, fine-tuning, trust and safety teams.

Additionally, it said it may even invest in or acquire the most promising projects that arise from the initiative. “We offer a range of funding options tailored to the needs and stage of each project,” the company said.

Anthropic isn’t the only AI startup pushing for the adoption of newer, better benchmarks. Last month, a company called Sierra Technologies Inc. announced the creation of a new benchmark test called “𝜏-bench” that’s designed to evaluate the performance of AI agents, which are models that go further than simply engaging in conversation, performing tasks on behalf of users when they’re requested to do so.

But there are reasons to be distrustful of any AI company that’s looking to establish new benchmarks, because it’s clear that there are commercial benefits to be had if it can use those tests as proof of its AI models’ superiority over others.

With regard to Anthropic’s initiative, it said in its blog post that it wants researchers’ benchmarks to align with its own AI safety classifications, which were developed by itself with input from third-party AI researchers. As a result, there’s a risk that AI researchers might be forced to accept definitions of AI safety that they don’t necessarily agree with.

Still, Anthropic insists that the initiative is meant to serve as a catalyst for progress across the wider AI industry, paving the way for a future where more comprehensive evaluations become the norm.

Image: SiliconANGLE/Microsoft Designer

A message from John Furrier, co-founder of SiliconANGLE:

Your vote of support is important to us and it helps us keep the content FREE.

One click below supports our mission to provide free, deep, and relevant content.  

Join our community on YouTube

Join the community that includes more than 15,000 #CubeAlumni experts, including Amazon.com CEO Andy Jassy, Dell Technologies founder and CEO Michael Dell, Intel CEO Pat Gelsinger, and many more luminaries and experts.

“TheCUBE is an important partner to the industry. You guys really are a part of our events and we really appreciate you coming and I know people appreciate the content you create as well” – Andy Jassy

THANK YOU