Leading AI models accused of cheating benchmark tests
Able to regurgitate test sets verbatim
Some of the world’s most prominent AI models have been accused of cheating on industry-standard benchmarking systems.
Allegations of data contamination and potential manipulation have surfaced against models developed by Alibaba, Meta, Google, Microsoft, OpenAI and Mistral AI, according to The Stack.
Analysts have evidence suggesting that several state-of-the-art AI models can reproduce test sets for popular benchmarks such as MMLU (Massive Multitask Language Understanding) and GSM8K (Grade School Math 8K).
Such allegations raise questions about the reliability of the models’ benchmark scores, and the implications for the pursuit of artificial general intelligence (AGI).
Benchmarks are designed to test a model's capabilities in specific tasks. So-called "AGI leaderboards" are created by combining these benchmarks to measure a model's performance across multiple domains.
Critics say the models' ability to reproduce test sets could be due to contamination of the training data, a scenario in which test data ends up, unintentionally or intentionally, in a model's training set.
Documents released by Louis Hunt, former CFO of LiquidAI, indicate that open-source models like Alibaba's Qwen 2.5 14B, Microsoft's Phi 3 Medium 128K Instruct, Meta's Llama 3 70B, Google's Gemma 2 and Mistral AI's Mistral 7B Instruct were able to output benchmark test sets verbatim.
Hunt's release even included the code needed to get the models to reproduce the test sets, fuelling speculation of widespread intentional manipulation.
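For readers wondering what "outputting a test set verbatim" looks like in practice, one common style of check is to show a model the first part of a benchmark question and see whether it completes the rest word for word. The Python sketch below illustrates that idea only; it is not Hunt's code. The function names, the 50% split and the sample question are all hypothetical, and the model's continuation is passed in as a plain string rather than fetched from any real API.

```python
def split_item(item: str, fraction: float = 0.5) -> tuple[str, str]:
    """Split a benchmark item into a prefix (shown to the model)
    and the held-back remainder. The 0.5 default is an arbitrary choice."""
    words = item.split()
    cut = max(1, int(len(words) * fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def verbatim_score(continuation: str, remainder: str) -> float:
    """Share of the held-back words that the continuation reproduces
    exactly, in order, counting from the first word until the first mismatch."""
    got = continuation.split()
    want = remainder.split()
    if not want:
        return 0.0
    matched = 0
    for g, w in zip(got, want):
        if g != w:
            break
        matched += 1
    return matched / len(want)


# Hypothetical item written for this example; it is not taken from GSM8K.
item = ("A farmer has 12 hens and each hen lays 3 eggs a day. "
        "How many eggs does the farmer collect in a week?")
prefix, remainder = split_item(item)

# In a real test, `continuation` would be the model's output for `prefix`.
exact = verbatim_score(remainder, remainder)
partial = verbatim_score("a day. How much does it cost?", remainder)
unrelated = verbatim_score("Hens are birds.", remainder)
```

A score near 1.0 across many test items would suggest memorisation rather than reasoning, although a high score on a single short or formulaic question proves little on its own.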
"It is impossible for them to not know," said Yoan Sallami, CEO of cognitive architecture firm SynaLinks, who described data contamination as both pervasive and deliberate.
He attributed the issue to competitive pressures to dominate leaderboard rankings, a key metric in attracting both funding and industry acclaim.
Experts also questioned the legitimacy of benchmarks as indicators of intelligence.
"Even if you combine all these narrow tasks into one test, it is still not testing for intelligence," Sallami said, pointing to the risks of overfitting and data leakage.
Ruchir Puri, chief scientist at IBM Research, said, "Good benchmarks are a reflection of reality, but never reality."
Closed models such as OpenAI's o1 have also come under scrutiny.
Research by Alejandro Cuadron, a scholar at Berkeley EECS, revealed discrepancies in the performance of o1 on OpenAI's SWE-Bench Verified benchmark.
In independent testing, o1 scored only 30%, well below OpenAI's claimed 50% performance.
Cuadron questioned whether OpenAI's preferred testing framework, "Agentless," was specifically chosen to favour models that rely on memorisation of benchmark datasets.
The alleged benchmark manipulation has far-reaching implications. Governments, regulatory bodies and investors often rely on these scores to evaluate AI capabilities and inform policy decisions.
A December 2024 report by Stanford's Institute for Human-Centered Artificial Intelligence (HAI) warned of misleading evaluations in AI benchmarking.
The UK's AI Safety Institute and the EU's AI Act both use benchmarks like MMLU to assess model performance. If these benchmarks are compromised, the fallout could undermine public trust in AI development and complicate efforts to ensure the technology's safe deployment.