Large Language Models Benchmarks

MUO on MSN

AI benchmark numbers are meaningless — here's what to look for instead

Numbers go up, AI gets better.

OpenAI, Mistral AI release new hardware-efficient language models

OpenAI Group PBC and Mistral AI SAS today introduced new artificial intelligence models optimized for cost-sensitive use cases. OpenAI is rolling out two algorithms called GPT-5.4 mini and GPT 5.4 ...

Tech Xplore on MSN

New 'renewable' benchmark streamlines LLM jailbreak safety tests with minimal human effort

As new large language models, or LLMs, are rapidly developed and deployed, existing methods for evaluating their safety and discovering potential vulnerabilities quickly become outdated. To identify ...

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

ZDNet

With AI models clobbering every benchmark, it's time for human evaluation

Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...

Live Science

AI benchmarking platform is helping top companies rig their model performances, study claims

LMArena, a popular benchmark for large language models, has been accused of giving preferential treatment to AIs made by big tech firms, potentially enabling them to game their results. When you ...

VentureBeat

Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data

Every AI model release inevitably includes charts touting how it outperformed its competitors in this benchmark test or that evaluation matrix. However, these benchmarks often test for general ...

Morning Overview on MSN

OpenAI releases GPT-5.4 mini and nano small models

OpenAI has released two compact models in its GPT‑5.4 family, branded GPT‑5.4 mini and GPT‑5.4 nano, and is promoting them as ...

InfoQ

Google Researchers Propose Bayesian Teaching Method for Large Language Models

Google Research has proposed a training method that teaches large language models to approximate Bayesian reasoning by learning from the predictions of an optimal Bayesian system. The approach focuses ...

The Economist

Top AI models underperform in languages other than English

This illustrates a widespread problem affecting large language models (LLMs): even when an English-language version passes a safety test, it can still hallucinate dangerous misinformation in other ...
