
Are the Costs of AI Agents Also Rising Exponentially?

Toby Ord

There is an extremely important question about the near future of AI that almost no one is asking.

We’ve all seen the graphs from METR showing that the length of tasks AI agents can perform has been growing exponentially over the last 7 years. While GPT-2 could only do software engineering tasks that would take someone a few seconds, the latest models can (50% of the time) do tasks that would take a human a few hours.

As this trend shows no signs of stopping, people have naturally taken to extrapolating it out, to forecast when we might expect AI to be able to do tasks that take an engineer a full work-day, week, or year.

But we are missing a key piece of information — the cost of performing this work.

Over those 7 years AI systems have grown exponentially. The size of the models (parameter count) has grown by 4,000x and the number of times they are run in each task (tokens generated) has grown by about 100,000x. AI researchers have also found massive efficiencies, but it is eminently plausible that the cost for the peak performance measured by METR has been growing — and growing exponentially.

This might not be so bad. For example, if the best AI agents are able to complete tasks that are 3x longer each year and the costs to do so are also increasing by 3x each year, then the cost to have an AI agent perform tasks would remain the same multiple of what it costs a human to do those tasks. Or if the costs have a longer doubling time than the time horizons, then the AI systems would be getting cheaper compared with humans.
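To make this arithmetic concrete, here is a minimal sketch in Python. All the numbers are hypothetical (an assumed \$120/hour human rate and made-up starting costs and horizons); the point is only how the AI-to-human cost ratio evolves under different growth rates:

```python
HUMAN_HOURLY = 120.0  # assumed human software-engineer rate in $/hour

def cost_ratio_after(years, ai_cost0, horizon0, g_cost, g_task):
    """Ratio of AI hourly cost to human hourly cost after `years` years,
    when per-task cost grows by g_cost per year and the task length (time
    horizon, in human-hours) grows by g_task per year."""
    ai_cost = ai_cost0 * g_cost ** years      # total cost of one task
    horizon = horizon0 * g_task ** years      # human-hours per task
    return (ai_cost / horizon) / HUMAN_HOURLY

# If costs and horizons both grow 3x per year, the ratio stays constant:
print(cost_ratio_after(0, 10.0, 1.0, 3, 3))
print(cost_ratio_after(5, 10.0, 1.0, 3, 3))  # same value

# If costs grow 4x per year but horizons only 3x, the ratio climbs:
print(cost_ratio_after(5, 10.0, 1.0, 4, 3))
```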

But what if the costs are growing more quickly than the time horizons? In that case, these cutting-edge AI systems would be getting less cost-competitive with humans over time. If so, the METR time-horizon trend could be misleading. It would be showing how the state of the art is improving, but part of this progress would be due to more and more lavish expenditure on compute, so it would be diverging from what is economical. It would be becoming more like the Formula 1 of AI performance — showing what is possible, but not what is practical.

So in my view, a key question we need to ask is:

        How is the ‘hourly’ cost of AI agents changing over time?

By ‘hourly’ cost I mean the financial cost of using an LLM to complete a task right at the model’s 50% time horizon, divided by the length of that time horizon. So as with the METR time horizons themselves, the durations are measured not by how long the task takes the model, but by how long it typically takes humans. For example, Claude 4.1 Opus’s 50% time horizon is 2 hours: it can succeed in 50% of tasks that take human software engineers 2 hours. So we can look at how much it costs for it to perform such a task and divide by 2 to find its hourly rate for this work.
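The definition is just a division, but a tiny sketch makes the units explicit (the \$9 task cost here is a made-up number, not a measured one):

```python
def hourly_rate(task_cost_usd: float, human_hours: float) -> float:
    """'Hourly' cost of an AI agent: dollars the agent spends completing a
    task at its 50% time horizon, divided by how long that task takes a
    human (in hours)."""
    return task_cost_usd / human_hours

# Hypothetical: a task at a 2-hour time horizon costing the agent $9 total.
print(hourly_rate(9.0, 2.0))  # → 4.5 (dollars per human-hour)
```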

I’ve found that very few people are asking this question. And when I ask people what they think is happening to these costs over time, their opinions vary wildly. Some assume the total cost of a task is staying the same, even as the task length increases exponentially. That would imply an exponentially declining hourly rate. Others assume the total cost is also growing exponentially — after all, we’ve seen dramatic increases in the costs to access cutting-edge models. And most people (myself included) had little idea of how much it currently costs for AI agents to do an hour’s software engineering work. Are we talking cents? Dollars? Hundreds of dollars? An AI agent can’t cost more per hour than a human to complete these tasks, can it? Can it?

A couple of months ago I asked METR if they could share the cost data for their benchmarking. I figured it would be easy — just take the cost of running their benchmark for each model, plot it against release date and see how it is growing. Or plot the cost of each model vs its time horizon and see the relationship.

But they helpfully pointed out that it isn’t so easy at all. Their headline time-horizon numbers are meant to show the best possible performance that can be attained with a model (regardless of cost). So they run their models inside an agent scaffold until the performance has plateaued. Since they really want to make sure it has plateaued, they use a lot of compute on this and don’t worry too much about whether they’ve used too much. After all, if you are just trying to find the eventual height of a plateau, there is no problem in going far into the flat part of the graph.

But if you are trying to find out when the plateau begins, there is a problem with this strategy. Their total spend for each model is sometimes just enough to get onto the plateau and sometimes many times more than is needed. So total spend can’t be used as a direct estimate of the cost of achieving that performance.

Fortunately, they released a chart that can be used to shed some light on the key question of how hourly costs of LLM agents are changing over time:

This chart (from METR’s page for GPT-5) shows how performance increases with cost. The cost in question is the cost of using more and more tokens to complete the task (and thus more and more compute).

The yellow curve is the best human performance for each task. It steadily marches onwards and upwards, transforming more wages into longer tasks. Since it is human performance that is used to define the vertical axis for METR’s time horizon work, it isn’t surprising that this curve is fairly linear — it costs about 8 times as much to get a human software engineer to perform an 8-hour task as a 1-hour task.

The other colours are the curves for a selection of LLM-based agents. Unlike the humans, they all show diminishing returns, with the time horizon each one can achieve eventually stalling out and plateauing as more and more compute is added.

The short upticks at the end of some of these curves are an artefact of some models not being prepared to give an answer until the last available moment. This suggests that those models must have still been making progress during the apparent flatline before the uptick (just not showing it). Indeed, this chart was originally displayed on METR’s page for GPT-5 to show that they may have stopped its run before its performance had truly plateaued. These upticks do make analysis harder, and hopefully future versions of this chart will be able to avoid these glitches.

So what can this chart tell us about our key question concerning the hourly cost of AI agents?

To tease out the lessons that lie hidden in the chart, we’ll need to add a number of annotations. The first step is to add lines of constant hourly cost. On a log-log plot like this, every constant hourly cost will be a straight line with slope 1. Lower hourly costs will appear as lines that are located further to the left.

For each curve I’ve added a line of constant hourly cost that just grazes it. That is the cheapest hourly cost the model achieves. We can call the point where the line touches the curve the sweet spot for that model. Before a model’s sweet spot, its time horizon is growing super-linearly in cost — it is getting increasing marginal returns. The sweet spot is exactly the point at which diminishing marginal returns set in (which would correspond to the point of inflection if this was replotted on linear axes). It is thus a key point on any model’s performance curve.
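The sweet spot described above can also be found numerically: since a line of constant hourly cost has slope 1 on the log-log plot, the line that just grazes the curve touches it exactly where cost divided by horizon is minimised. A sketch, using made-up curve data (costs in \$, horizons in human-hours):

```python
import numpy as np

def sweet_spot(costs, horizons):
    """Find the sweet spot of a performance curve: the point where the
    hourly rate (cost / horizon) is minimised. On a log-log plot this is
    where a slope-1 line of constant hourly cost just grazes the curve."""
    costs = np.asarray(costs, dtype=float)
    horizons = np.asarray(horizons, dtype=float)
    rates = costs / horizons
    i = int(np.argmin(rates))
    return i, float(rates[i])

# Hypothetical performance curve for one model:
costs = [0.1, 0.3, 1.0, 3.0, 10.0]
horizons = [0.05, 0.4, 1.2, 1.6, 1.7]
i, best_rate = sweet_spot(costs, horizons)
print(i, round(best_rate, 3))  # → 1 0.75
```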

We can see that the human software engineer is at best \$120 per hour, while the sweet spots for the AI agents range from \$40 per hour for o3, all the way down to 40 cents per hour for Grok 4 and Sonnet 3.5. That’s quite a range of costs. While the horizon lengths of these models vary by about a factor of 15 (judged either at the end-points or at the sweet spots), their sweet-spot costs vary by a factor of 100.

And these are the best hourly rates for these models. On many task lengths (including those near their plateau) they cost 10 to 100 times as much per hour. For instance, Grok 4 is at \$0.40 per hour at its sweet spot, but \$13 per hour at the start of its final plateau. GPT-5 is about \$13 per hour for tasks that take about 45 minutes, but \$120 per hour for tasks that take 2 hours. And o3 actually costs \$350 per hour (more than the human price) to achieve tasks at its full 1.5 hour task horizon. This is a lot of money to pay for an agent that fails at the task you’ve just paid for 50% of the time — especially in cases where failure is much worse than not having tried at all.

However, I do want to note that I’m a bit puzzled by how much higher the costs are here for the reasoning models from OpenAI compared to models from Anthropic and xAI. The METR page suggests that the price data for those models was still an estimate at that point (based on o1 costs), so I wouldn’t be surprised if these curves should really be shifted somewhat to the left, making them several times cheaper. We therefore shouldn’t lean too heavily on the fact that they cost as much or more than human labour at their full time-horizon.

As well as the sweet spot, ideally we could add a saturation point for each curve — a point to represent the location where the plateau begins. We can’t simply use the end of the curve since some have run longer into the plateau than others. What I’ll do is find the point where the slope has diminished to 1/10th that of the sweet spot. This is the point at which it requires a 10% increase in cost just to increase the time horizon by 1%. Or equivalently, the time horizon is only growing as the 1/10th power of compute.

Of course the number 1/10 is somewhat arbitrary, but unlike for the sweet spot, any definition of a saturation point will be arbitrary to some degree. As you can see below, this definition of saturation point does roughly correspond with the intuitive location, though it is still not quite clear how best to deal with the final upticks.
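This saturation-point definition can be sketched numerically too. The sketch below uses made-up data and approximates the log-log slope with finite differences (`numpy.gradient`), which is only one reasonable way to estimate the slope of a noisy curve:

```python
import numpy as np

def saturation_point(costs, horizons, factor=10.0):
    """Find the index where the log-log slope of horizon vs cost has
    fallen to 1/factor of its value at the sweet spot (the point of
    minimum hourly rate), i.e. where a 10% cost increase buys only a
    1% increase in time horizon. Returns None if never saturated."""
    log_c = np.log(np.asarray(costs, dtype=float))
    log_h = np.log(np.asarray(horizons, dtype=float))
    slopes = np.gradient(log_h, log_c)     # d log(horizon) / d log(cost)
    sweet = int(np.argmin(log_c - log_h))  # minimum of log(cost/horizon)
    threshold = slopes[sweet] / factor
    for i in range(sweet, len(slopes)):
        if slopes[i] <= threshold:
            return i
    return None  # curve never saturated within the measured range

# Hypothetical curve: cost doubles each step, horizon plateaus near 3.7 hours.
costs = [1, 2, 4, 8, 16, 32]
horizons = [0.5, 1.6, 3.0, 3.6, 3.7, 3.72]
print(saturation_point(costs, horizons))  # → 4
```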

Armed with our sweet spots and saturation points, we can start to tease out the relationship between time horizon and cost.

Let’s start with a sweet spot scatter plot:

We can see that there is a weak, but clear, positive correlation between task duration and cost in this dataset. Moreover, we see that higher task durations (at the sweet spot) are associated with higher hourly costs (and recall that these hourly costs at the sweet spot are the best hourly cost achievable with that model).

What about if we instead look at the models’ saturation points, which are a little arbitrary in their definition, but closer to what METR is measuring in their headline results about time horizons:

Again, there is a correlation between time horizon and cost, and again the hourly costs seem to be increasing with time horizon too. Indeed, it suggests we are nearing the point where the models’ peak performance comes at an impractically high cost. If this relationship were to continue, then forecasting when certain time horizons will be reached by extrapolating the headline METR trend would be misleading, as the models would be impractically expensive when they first reach those capabilities. We would need to wait some additional period of time for their costs to come down sufficiently.

That said, there are some significant limitations to the analysis above. Ideally one would want to:

  • include curves for a larger and more representative set of models

  • find a way of addressing the uptick problem

  • check if there is an issue with the costs of the OpenAI models

  • explicitly plot hourly cost against release date

  • numerically determine the trend-lines and correlation coefficients

Fortunately, it should be fairly easy for METR to perform such analysis, and I hope they will follow up on this.

Conclusions

  • Too few people are asking about how the costs of AI agents are growing

  • The key question is: how is the ‘hourly’ cost of LLM agents changing over time?

  • We can use METR’s chart to shed some light on this.

  • We need to add lines of constant hourly cost, sweet spots, and saturation points.

  • This provides moderate evidence that:

    • the costs to achieve the time horizons are growing exponentially,

    • even the hourly costs are rising exponentially,

    • the hourly costs for some models are now close to human costs.

  • Thus, there is evidence that:

    • the METR trend is partly driven by unsustainably increasing inference compute

    • there will be a divergence between what time horizon is possible in-principle and what is economically feasible

    • real-world applications of AI agents will lag behind the METR time-horizon trend by increasingly large amounts

Appendix

METR has a similar graph on their page for GPT-5.1 codex. It includes more models and compares them by token counts rather than dollar costs:

It suggests:

  1. the correlation between time horizon and cost holds for these other models too

  2. reasoning models with more RL post-training don’t always dominate their predecessors (e.g. o1 is better at small token budgets than o3 or GPT-5)

  3. the horizontal gap between the OpenAI reasoning models and the rest is smaller, supporting the idea that their costs were a bit high in the main chart
