Evidence that Recent AI Gains are Mostly from Inference-Scaling
Toby Ord
In the last year or two, the most important trend in modern AI came to an end. The scaling-up of computational resources used to train ever-larger AI models through next-token prediction (pre-training) stalled out. Since late 2024, we’ve seen a new trend of using reinforcement learning (RL) in the second stage of training (post-training). Through RL, the AI models learn to do superior chain-of-thought reasoning about the problem they are being asked to solve.
This new era involves scaling up two kinds of compute:
1. the amount of compute used in RL post-training
2. the amount of compute used every time the model answers a question
Industry insiders are excited about the first new kind of scaling, because the amount of compute needed for RL post-training started off being small compared to the tremendous amounts already used in next-token prediction pre-training. Thus, one could scale the RL post-training up by a factor of 10 or 100 before even doubling the total compute used to train the model.
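To make that arithmetic concrete, here is a minimal sketch with purely hypothetical numbers (the 1% starting share for RL post-training is my assumption for illustration, not a reported figure):

```python
# Hypothetical illustration: if RL post-training initially uses 1% of the
# compute spent on pre-training, scaling it up 10x or 100x barely moves the
# total training bill.
pretraining_compute = 1.0      # normalised so pre-training = 1 unit of compute
rl_compute_initial = 0.01      # assumed starting share for RL post-training (1%)

for scale_factor in (10, 100):
    total = pretraining_compute + rl_compute_initial * scale_factor
    print(f"RL scaled {scale_factor:>3}x -> total training compute = {total:.1f}x pre-training")
# RL scaled  10x -> total training compute = 1.1x pre-training
# RL scaled 100x -> total training compute = 2.0x pre-training
```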
But the second new kind of scaling is a problem. Major AI companies were already starting to spend more compute serving their models to customers than in the training phase. So if it costs a factor of 10 or 100 times as much to answer each question, this really does affect their bottom line. And unlike training costs, these costs can’t be made up in volume. This kind of scaling is known as inference-scaling (since it is scaling the compute used in the output, or ‘inference’, stage).
It is critical to find out how much of the benefits are coming from each kind of scaling. If further improvements in AI capabilities are mainly going to come from inference-scaling, this would have many implications for the trajectory of AI, including for AI companies, AI governance, and AI risk.
In past writings, I’ve mainly focused on the inference-scaling, treating RL as primarily enabling larger and larger amounts of inference to be productively used to answer a question — improving capabilities, but at a steep cost every time the AI system is deployed. However, it is probably more common in the AI discourse for people to focus on the training costs of RL — treating it as a relatively cheap way to train a smarter model. In support of this view, people often point to the fact that even when reasoning models have thinking turned off, they still have superior performance on maths or science benchmarks compared with the base model that they were trained from.
For example, Anson Ho and Arden Berg from Epoch AI focus on this angle in their attempt to quantify the algorithmic improvement from reasoning. However, they ultimately suggest that about half the benefit comes from the RL training (the shift from the base model to the no-thinking reasoning model) and half comes from scaling up the number of tokens used in reasoning. But as they note, despite producing zero hidden ‘reasoning tokens’, the no-thinking version of the reasoning model still had a chain of thought about twice as long as the base model’s. Ho and Berg address this qualitatively, noting that it creates some uncertainty about their claim that half the gains come from each source.
But it is possible to do better — we can use this data as part of the analysis to find out what fraction of the performance increase is due to each type of scaling.
While it would be easy for people inside the leading AI companies to perform this analysis, it is more challenging with open data. The key requirement is access to both a reasoning model and the base model it was trained from. In all my analysis I’ll follow Ho and Berg in using Anthropic’s Sonnet 3.7 as the reasoning model and assuming that Anthropic’s October 2024 release of Sonnet 3.5 (commonly called Sonnet 3.6 by the community) was its base model. (If that’s wrong and Sonnet 3.7’s base model incorporated improvements beyond ‘Sonnet 3.6’, that would only help my main conclusion.)
As always, it is instructive to first look at the raw scatter plot. Using the same Epoch AI dataset, here is a chart showing how benchmark performance (for MATH level 5) varies with the number of output tokens:
On this chart you can see a clear linear trend connecting the red data points of the RL-trained model. Since the x-axis is on a log-scale (as it often is on inference-scaling charts), this straight line actually represents performance increasing logarithmically with the number of tokens. Or, put another way, we require exponential growth in the number of tokens in the chain of thought to keep performance increasing linearly. This is typical for inference-scaling and is generally regarded by the industry as good scaling behaviour.
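To illustrate what ‘performance increasing logarithmically with tokens’ means in practice, here is a minimal sketch of such a log-linear fit. The data points are invented for illustration; the real values come from the Epoch AI dataset.

```python
# Sketch of a log-linear inference-scaling trend: accuracy rises roughly
# linearly in log10(tokens), i.e. each doubling of the chain of thought buys a
# similar number of percentage points. All numbers below are invented.
import numpy as np

tokens = np.array([500, 1_000, 2_000, 4_000, 8_000, 16_000])   # output tokens (hypothetical)
accuracy = np.array([0.52, 0.58, 0.63, 0.69, 0.74, 0.80])      # benchmark accuracy (hypothetical)

# Fit accuracy ~ slope * log10(tokens) + intercept
slope, intercept = np.polyfit(np.log10(tokens), accuracy, deg=1)
gain_per_doubling = slope * np.log10(2)
print(f"Roughly {gain_per_doubling * 100:.1f} percentage points per doubling of tokens")
```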
Note how the orange square representing the base model is slightly below this trend. This same pattern appears for the two other benchmarks I checked:
We can use these charts to find out how much of the performance boost is coming from scaling up inference compute versus how much is coming from RL making the model perform better for a given compute budget.
The key thing is to find the inference-scaling trend and then see how far below that trend the base model falls. That performance gap between what the base model achieves and what the RLed model would achieve using the same number of tokens gives us the ‘RL boost’ — the portion of the improvement that wasn’t driven by spending more compute at test time. We can see below that the RL boost for MATH level 5 is quite small at just 6 percentage points. But the RL training also teaches the model how to productively use reasoning chains up to 30 times as long. This costs 30x as much money every time the model is used, but enables a further boost of 28 percentage points via inference-scaling. So in this case, the total performance boost is 34 percentage points, with 82% of this coming from inference-scaling.
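To make the decomposition explicit, here is a sketch of the calculation under stated assumptions: the token counts and accuracies below are invented (they are not the actual MATH level 5 numbers), and the trend is the same simple log-linear fit as above.

```python
# Decompose the total gain into an 'RL boost' (gap between the base model and
# the reasoning model's trend at the *same* token count) and an
# 'inference-scaling boost' (movement along that trend to longer chains).
# All data points are hypothetical, for illustration only.
import numpy as np

# Reasoning model: (output tokens, accuracy) at several thinking budgets
rl_tokens = np.array([1_000, 3_000, 10_000, 30_000])
rl_accuracy = np.array([0.55, 0.63, 0.72, 0.80])

# Base model: its accuracy and the number of output tokens it used
base_tokens, base_accuracy = 1_000, 0.46

slope, intercept = np.polyfit(np.log10(rl_tokens), rl_accuracy, deg=1)

def trend(tokens):
    """Fitted inference-scaling trend for the reasoning model."""
    return slope * np.log10(tokens) + intercept

rl_boost = trend(base_tokens) - base_accuracy                    # gain at matched compute
inference_boost = trend(rl_tokens.max()) - trend(base_tokens)    # gain from ~30x more tokens
total = rl_boost + inference_boost

print(f"RL boost:                {rl_boost * 100:.0f} percentage points")
print(f"Inference-scaling boost: {inference_boost * 100:.0f} percentage points")
print(f"Share from inference-scaling: {inference_boost / total:.0%}")
```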
We see similar results on the other two benchmarks I analysed. There was a somewhat better showing from the RL boost (at a fixed compute budget) for GPQA Diamond, but most of the gains were still coming from inference-scaling, requiring more than 10x as much compute each time the model is used:
For GPQA Diamond, the boosts are 9 percentage points and 15 percentage points respectively, with 63% of the total coming from inference-scaling.
And for the OTIS Mock AIME benchmark, an even larger fraction of the benefit was coming from the inference-scaling boost:
Here there are just 4 percentage points from the RL boost and 45 percentage points from inference-scaling, so inference-scaling is producing 92% of the total gain.
In this case, some of this extreme ratio is being driven by the straightness of the trend-line. Since performance is capped between 0% and 100%, we know the trend cannot be straight forever. When we are so close to one of the bounds it is probably more accurate to use some kind of sigmoid trend-line. But no matter what kind of sigmoid we fit, the RL boost is still substantially smaller than the inference-scaling boost (and smaller in proportional terms too):
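As a sketch of the alternative fit, here is the same decomposition with a logistic (sigmoid) trend in log-token space, so that the curve respects the 0–100% bounds. Again, the data points are invented for illustration.

```python
# Fit a logistic (sigmoid) trend in log10(tokens) instead of a straight line,
# so performance saturates as it approaches 100%. Hypothetical data only.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_tokens, midpoint, steepness):
    """Accuracy as a logistic function of log10(tokens), capped at 100%."""
    return 1.0 / (1.0 + np.exp(-steepness * (log_tokens - midpoint)))

rl_tokens = np.array([1_000, 3_000, 10_000, 30_000])     # hypothetical
rl_accuracy = np.array([0.60, 0.75, 0.86, 0.93])         # hypothetical, near the ceiling
base_tokens, base_accuracy = 1_000, 0.48                 # hypothetical base model point

params, _ = curve_fit(logistic, np.log10(rl_tokens), rl_accuracy, p0=[3.0, 2.0])

rl_boost = logistic(np.log10(base_tokens), *params) - base_accuracy
inference_boost = (logistic(np.log10(rl_tokens.max()), *params)
                   - logistic(np.log10(base_tokens), *params))
print(f"RL boost: {rl_boost * 100:.0f} pts; inference-scaling boost: {inference_boost * 100:.0f} pts")
```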
So that’s the simple case for the claim that most of the gains from scaling up RL are driven by the inference-scaling it opens up.
One obvious piece of further work would be to extend this analysis to models from other companies.
And an even more important piece of further work would be to examine the changes between generations of reasoning models. In particular, how would scaling up the RL post-training by a factor of 10 affect the inference-scaling trend-line? It would presumably move it up or to the left, and might change its slope (though the ARC-AGI leaderboard suggests the slope usually stays the same). This extra RL training would presumably also enable longer chains of thought and thus more inference-scaling. The relationship between these two kinds of changes would determine whether further scaling of RL changes the importance of the RL boost compared to that of inference-scaling.
Alas, it is challenging for people outside the major labs to get the data to check this, since simply using data from publicly released models will include gains from all the other non-RL sources, inflating the apparent RL boost.
For now, I think the graphs we’ve seen here make a strong case that most of the gains from using RL to train reasoning models have been coming from scaling-up the number of tokens used to solve the task. This creates a flow of costs that have to be paid every time you use the model, substantially changing the business model, the risk profile, and the policy response.