Researchers expose flaws in LLM reasoning abilities

AI models falter when the wording of a query changes slightly, the researchers claim

Image: LLMs made mistakes on simple problems when given extraneous information

A new study from Apple's AI researchers has exposed significant limitations in the reasoning capabilities of large language models (LLMs).

In a newly released paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," the researchers argued that LLMs, despite their impressive language skills, demonstrate a troubling degree of inconsistency when solving mathematical problems.

The study found that the models struggle even with simple mathematical problems when the wording of a query is changed only slightly.

The researchers tested LLMs on various math problems and introduced minor, seemingly insignificant changes to the wording of the questions. However, these changes caused a dramatic decline in the models' accuracy.

One example illustrated the problem: "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday."

The query asked, "How many kiwis does Oliver have?"

The correct answer is straightforward: 44 + 58 + (44 * 2) = 190.

The next query adds an unrelated detail, noting that five of the kiwis picked on Sunday were smaller than average. It then repeats the question, "How many kiwis does Oliver have?"

Although the size of the kiwis has no bearing on the total count, both OpenAI's model and Meta's Llama3-8B incorrectly subtracted the five smaller kiwis from the total.

The researchers suggest that LLMs may not be capable of true reasoning, but rather rely on pattern recognition to generate responses. This means that they can be easily misled by irrelevant information, even when it is clearly unrelated to the problem at hand.
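The kiwi example can be reduced to a simple probe. The sketch below is illustrative Python, not the authors' code: the prompt strings and the structure of the perturbation are assumptions based on the example quoted above. It builds the original question and the perturbed version, and shows that the ground-truth answer is identical for both, since the extra clause is irrelevant to the count.

```python
# Illustrative sketch (not the paper's code): build the base question and a
# perturbed version with an irrelevant clause, and show that the ground-truth
# answer is unchanged.

friday, saturday = 44, 58
sunday = 2 * friday                          # "double the number he did on Friday"
ground_truth = friday + saturday + sunday    # 44 + 58 + 88 = 190

base_question = (
    f"Oliver picks {friday} kiwis on Friday. Then he picks {saturday} kiwis "
    f"on Saturday. On Sunday, he picks double the number of kiwis he did on "
    f"Friday. How many kiwis does Oliver have?"
)

# The perturbation inserts a detail that has no bearing on the total.
perturbed_question = base_question.replace(
    "How many",
    "Five of the kiwis picked on Sunday were smaller than average. How many",
)

# A model that truly reasons should answer 190 for both prompts; the study
# found that models instead subtracted the five smaller kiwis (answering 185).
for prompt in (base_question, perturbed_question):
    print(prompt)
    print("expected answer:", ground_truth)
```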

The researchers wrote: "[W]e investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data."

The study also takes a closer look at the GSM8K benchmark, a widely used tool for evaluating the mathematical reasoning of AI models.

While LLMs have shown marked improvements on this benchmark over the years, Apple's researchers questioned whether these advancements reflect true mathematical reasoning or simply better pattern-matching.

The team argues that existing benchmarks may be overstating the models' mathematical abilities, and that performance gains could mask underlying issues in logical reasoning.

They also introduce GSM-Symbolic, a new benchmark designed to provide a more detailed and reliable assessment of LLMs' mathematical reasoning by generating questions from symbolic templates.

The new approach allows for more controlled testing and reveals deeper insights into the models' strengths and weaknesses.
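The article does not reproduce the paper's template format, but the idea can be sketched as follows: a question is written with symbolic placeholders for names and quantities, and many concrete instances are generated along with their ground-truth answers. Everything in the snippet below, including the template string, the sampling ranges, and the function name, is an illustrative assumption rather than GSM-Symbolic's actual implementation.

```python
import random

# Hypothetical sketch of template-based question generation in the spirit of
# GSM-Symbolic: placeholders for names and numbers are filled in to produce
# many variants of the same underlying problem, each with a known answer.

TEMPLATE = (
    "{name} picks {x} {fruit} on Friday. Then {pronoun} picks {y} {fruit} on "
    "Saturday. On Sunday, {pronoun} picks double the number of {fruit} "
    "{pronoun} did on Friday. How many {fruit} does {name} have?"
)

NAMES = [("Oliver", "he"), ("Sofia", "she"), ("Liam", "he")]
FRUITS = ["kiwis", "apples", "oranges"]

def make_instance(rng: random.Random) -> tuple[str, int]:
    """Sample one concrete question and compute its ground-truth answer."""
    name, pronoun = rng.choice(NAMES)
    fruit = rng.choice(FRUITS)
    x, y = rng.randint(10, 99), rng.randint(10, 99)
    question = TEMPLATE.format(name=name, pronoun=pronoun, fruit=fruit, x=x, y=y)
    answer = x + y + 2 * x          # the template fixes the solution structure
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, a = make_instance(rng)
    print(q, "->", a)
```

Because every generated instance shares the same underlying solution structure, differences in accuracy across instances isolate a model's sensitivity to surface wording rather than to genuine changes in problem difficulty.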

The researchers state: "We believe further research is essential to develop AI models capable of formal reasoning, moving beyond pattern recognition to achieve more robust and generalizable problem-solving skills. This remains a critical challenge for the field as we strive to create systems with human-like cognitive abilities or general intelligence."