LLMs Don't Really Understand Math.
A deep dive into recent research showing that current state-of-the-art AI cannot understand and solve math the way it is popularly claimed to.
We are in an AI bubble right now.
There’s a lot of hype around the current capabilities of AI.
Some say it will make highly skilled professions like software engineering and medicine obsolete.
Others warn of an upcoming AI apocalypse within the next few years.
These claims are far from the truth.
Yes, AI might someday take over our jobs, but the current AI architecture is incapable of doing that and has already been squeezed for most of what it can give.
LLMs, based on the Transformer architecture, are great next-word predictors and language generators, but there is mounting evidence that they cannot reliably solve math problems.
They can reliably fake doing so but lack true logical reasoning capabilities at their core.
In this story, we discuss how the current state of AI performs on mathematical tasks, the reasons behind this, and debunk the lies sold to us by big tech.
Linda Problem Breaks LLMs When It Becomes A Bob Problem
Have you heard about the classical Linda problem?
It is an example from cognitive psychology that illustrates the conjunction fallacy.
In simple terms, this fallacy occurs when people mistakenly judge two events happening together (in conjunction) as more likely than one of those events alone, which can never be mathematically true.
The problem goes like this:
Linda is 31 years old, single, outspoken, and very bright.
She majored in philosophy.
As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.
Which is more probable?
A. Linda is a bank teller.
B. Linda is a bank teller and is active in the feminist movement.
The answer?
Statement A must be more likely than statement B because, mathematically, the probability of a conjunction, P(A and B), is always less than or equal to the probability of either single event, P(A) or P(B).
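To make this concrete, here is a tiny Python sketch using made-up illustrative probabilities (the numbers are my own assumptions, not from the original study):

```python
# Made-up illustrative probabilities for the Linda description.
p_bank_teller = 0.05            # P(A): Linda is a bank teller
p_feminist_given_teller = 0.95  # P(B | A): active feminist, given she is a bank teller

# P(A and B) = P(A) * P(B | A), so the conjunction can never exceed P(A).
p_both = p_bank_teller * p_feminist_given_teller
print(f"P(A and B) = {p_both} <= P(A) = {p_bank_teller}")
assert p_both <= p_bank_teller
```

However strongly the description suggests Linda is a feminist, the conjunction can only shrink the probability, never raise it above P(A).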
Take a look at what happens when GPT-4 answers this.
When prompted in a one-shot fashion, GPT-4 recognises the conjunction fallacy and answers the question correctly.
But changing the name of Linda to Bob confuses it, and its logical reasoning goes down the drain.
(I tested the same on GPT-4o, and yes, it answered incorrectly.)
The researchers of this arXiv paper generate several other tweaked questions and statistically analyse the performance of LLMs on them.
They consistently (with statistical significance) find that LLMs have a huge token bias.
This means that LLMs largely rely on specific patterns or tokens in the input text when solving problems rather than truly understanding them.
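As a rough illustration of this kind of name-swapping perturbation test (my own sketch, not the paper's code; the model name and exact prompt wording are assumptions), one could query the OpenAI API with the same question under different names:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The Linda problem as a template whose only varying token is the name.
TEMPLATE = (
    "{name} is 31 years old, single, outspoken, and very bright. "
    "{name} majored in philosophy and, as a student, was deeply concerned with "
    "issues of discrimination and social justice. Which is more probable? "
    "A. {name} is a bank teller. "
    "B. {name} is a bank teller and is active in the feminist movement. "
    "Answer with A or B only."
)

for name in ["Linda", "Bob", "Priya", "Carlos"]:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{"role": "user", "content": TEMPLATE.format(name=name)}],
    )
    # A token-biased model may flip its answer when only the name changes.
    print(name, "->", response.choices[0].message.content.strip())
```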
Take a look at another example.
The “Twenty-five Horses” problem goes like this:
There are 25 horses.
The horses can only race five at a time, and you cannot measure their actual speeds;
you can only measure their relative rankings in a race.
The challenge is to find the minimum number of races needed to find the top three horses.
Changing this problem to a “Thirty-Six Bunnies” problem again confuses GPT-4 and Claude 3 Opus, and they solve it incorrectly.
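For reference, the classic answer to the original twenty-five-horse puzzle is seven races. Here is a minimal Python sketch (my own illustration, not from the paper) that simulates the seven-race strategy and checks it against the true ranking:

```python
import random

def race(horses, speed):
    """Rank up to five horses from fastest to slowest using their hidden speeds."""
    return sorted(horses, key=lambda h: speed[h], reverse=True)

def top_three_in_seven_races(speed):
    horses = list(speed)
    groups = [horses[i * 5:(i + 1) * 5] for i in range(5)]

    ranked_groups = [race(g, speed) for g in groups]             # races 1-5
    winners_order = race([g[0] for g in ranked_groups], speed)   # race 6
    first = winners_order[0]

    # Only five horses can still be 2nd or 3rd overall:
    # the runner-up and 3rd of the overall winner's group,
    # the 2nd-place group winner and its group's runner-up,
    # and the 3rd-place group winner.
    group_of = {g[0]: g for g in ranked_groups}
    candidates = [
        group_of[first][1], group_of[first][2],
        winners_order[1], group_of[winners_order[1]][1],
        winners_order[2],
    ]
    second, third = race(candidates, speed)[:2]                  # race 7
    return [first, second, third]

# Check the strategy against the true ranking on random hidden speeds.
speed = {f"horse_{i}": random.random() for i in range(25)}
assert top_three_in_seven_races(speed) == sorted(speed, key=speed.get, reverse=True)[:3]
```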
Apple Breaks The Ice With GSM-Symbolic
The GSM8K (Grade School Math 8K) benchmark is popularly used to assess LLMs' mathematical reasoning.
This dataset comprises 8.5 thousand high-quality, linguistically diverse grade school math word problems.
The questions here are relatively simple for humans and require knowing only the four basic arithmetic operations (+, −, ×, ÷) to reach the final answer.
These questions require multi-step reasoning, but a bright middle school student should still be able to solve every problem in this dataset.
Check out an example:
{
  'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
  'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'
}
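Note how the answer field is structured: intermediate steps carry <<...>> calculator annotations, and the final numeric answer follows the #### marker. Evaluation scripts typically compare only that final number; a minimal extraction sketch (my own, not the benchmark's official code) could look like this:

```python
import re

def extract_final_answer(answer_text: str) -> str:
    """Return the number that follows the '####' marker in a GSM8K-style answer."""
    match = re.search(r"####\s*([-\d,.]+)", answer_text)
    return match.group(1).replace(",", "") if match else ""

gsm8k_answer = "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72"
print(extract_final_answer(gsm8k_answer))  # -> 72
```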
All state-of-the-art LLMs (including Claude, GPT-4o, o1, and Gemini) perform exceptionally well on GSM8K, but Apple researchers have questioned these metrics.
To test their hypothesis, they turned the benchmark questions into templates and generated variations of each question from them.
Their modifications include changing names and numerical values and adding or removing clauses from the original GSM8K questions.
They called their new benchmark GSM-Symbolic.
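As a rough sketch of what such templating might look like (my own illustration of the idea, not the paper's released code), the Natalia question above can be parameterized and re-instantiated with fresh names and numbers:

```python
import random

# The GSM8K example above, turned into a template with symbolic slots.
TEMPLATE = (
    "{name} sold clips to {n} of her friends in April, and then she sold "
    "half as many clips in May. How many clips did {name} sell altogether "
    "in April and May?"
)

def make_variant():
    name = random.choice(["Natalia", "Sofia", "Mei", "Amara"])
    n = random.choice([24, 36, 48, 60])   # keep n even so n/2 stays a whole number
    question = TEMPLATE.format(name=name, n=n)
    ground_truth = n + n // 2             # answer to score the model against
    return question, ground_truth

question, answer = make_variant()
print(question, "->", answer)
```

Because every variant has a known ground-truth answer, a model that merely memorized the surface form of GSM8K questions can be separated from one that actually performs the arithmetic.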