Hazard Rates for AI Agents Decline as a Task Goes On
Toby Ord
Contrary to my earlier hypothesis, AI agents probably don't have a constant hazard rate / half-life. Instead, their hazard rates systematically decline as the task goes on.
This means that AI agents' success rates on tasks beyond their 50% horizon are better than my constant-hazard-rate model suggests, but those for tasks shorter than the 50% horizon are worse.
I have previously suggested that a constant hazard rate was a good starting assumption for how AI agents' success rates on tasks decay as the tasks get longer. It is the simplest model and fits the data reasonably well. So I thought there was a strong case for it being the default, or null hypothesis, that would take some evidence to overturn.
But some excellent new analysis by Gus Hamilton provides this evidence. He used the standard second-simplest model from survival analysis (the Weibull distribution rather than the exponential distribution). It has a second parameter, k, which represents how the hazard rate changes with time (if at all). If k = 1, there is a constant hazard rate (so the exponential distribution is a special case of the Weibull). But if k < 1, then hazard decreases over time (like the Lindy effect), and if k > 1, hazard increases (like aging).
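For concreteness, here is the Weibull survival function and its hazard rate in one common parametrisation (a scale parameter λ and shape parameter k; Gus's exact parametrisation may differ, but the role of k is the same):

$$S(t) = \exp\left(-\left(\frac{t}{\lambda}\right)^{k}\right), \qquad h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}.$$

When k = 1 this reduces to the exponential distribution with constant hazard 1/λ. When k < 1 the exponent k − 1 is negative, so the hazard h(t) falls as t grows.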
Gus found that the estimated values for k were below 1 for all the models, showing that all of them had decreasing hazard rates as the task went on.
A distribution that generalises another will always fit the data at least as well, so fit alone wouldn't be decisive when comparing the Weibull to the exponential. But the fact that every single model has k statistically significantly below 1 convinces me he is right. His analysis extends my approach of using survival analysis as a lens through which to view the METR time horizons, and sees further through that lens than I did.
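To illustrate the model-comparison logic, here is a simplified sketch (not Gus's actual procedure, which fits binary success/failure outcomes on tasks of known length): if we could observe failure times directly, a likelihood-ratio test would tell us whether the Weibull's extra parameter earns its keep.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic failure times from a Weibull with decreasing hazard (k = 0.6)
times = stats.weibull_min.rvs(c=0.6, scale=1.0, size=2000, random_state=rng)

# Fit both models by maximum likelihood (location pinned at 0)
k_hat, _, lam_hat = stats.weibull_min.fit(times, floc=0)
_, scale_hat = stats.expon.fit(times, floc=0)

ll_weibull = stats.weibull_min.logpdf(times, c=k_hat, scale=lam_hat).sum()
ll_expon = stats.expon.logpdf(times, scale=scale_hat).sum()

# The exponential is the k = 1 special case of the Weibull, so a
# likelihood-ratio test with one degree of freedom applies
lr = 2 * (ll_weibull - ll_expon)
p = stats.chi2.sf(lr, df=1)
print(f"fitted k = {k_hat:.2f}, LR statistic = {lr:.1f}, p = {p:.3g}")
```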
Note that the estimated values of k for the different models are all quite clustered around 0.6, with little change in k as models improved over the years. In contrast, the improvement in METR time horizons was being driven by changes in the Weibull distribution's other parameter, λ, which uniformly shrinks the hazard rate at all times.
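In the parametrisation above, this uniform shrinking is easy to see: λ enters the hazard only as an overall multiplicative factor,

$$h(t) = \frac{k \, t^{k-1}}{\lambda^{k}},$$

so increasing λ by a factor c multiplies the hazard by c^(−k) at every time t, leaving the shape of the decline untouched.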
So what does all this imply?
One thing is that it implies very different estimated success rates for tasks much shorter or much longer than the 50% horizon (which METR focuses on because it is the easiest to reliably estimate). So one should use the Weibull distribution rather than the exponential distribution to estimate the 99% horizon (or the 10% horizon).
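As a rough illustration (using a made-up one-hour 50% horizon, and calibrating the Weibull so the success rate at that horizon is exactly 50%), here is how much the estimated 99% horizon moves when you switch from k = 1 to the fitted k ≈ 0.6:

```python
import numpy as np

def weibull_horizon(t50, k, p):
    """Task length at which the Weibull success curve equals p,
    calibrated so that the success rate at t50 is exactly 50%."""
    lam = t50 / np.log(2) ** (1 / k)       # scale implied by the 50% horizon
    return lam * np.log(1 / p) ** (1 / k)  # solve S(t) = p for t

t50 = 60.0  # hypothetical 50% horizon: one hour (in minutes)
for k in (1.0, 0.6):  # constant hazard vs the typical fitted AI-agent shape
    print(f"k = {k}: 99% horizon ~ {weibull_horizon(t50, k, 0.99):.2f} minutes")
```

Under constant hazard, the 99% horizon for this hypothetical model comes out at just under a minute; with k = 0.6 it collapses to about three seconds.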
Another thing is that the AI agents mostly have a k of about 0.6, while the human value of k is significantly lower, at about 0.37. This means that even if an AI agent and a human had the same 50% horizon, the human would have a higher success rate on really long tasks (but a lower rate on really short ones).
As this shows, for a fixed 50% horizon length, it isn't clearly better or worse to have a lower value of k. Lower values give a higher success rate on really long tasks, but a lower one on short tasks. And as k gets lower, tasks need to be even shorter to reach high reliability levels like 99%. Since the exponential distribution is a Weibull with k = 1, but the measured values of k are lower, the 99% time horizon of these models is probably shorter than you'd expect if you were previously using my constant-hazard-rate model to estimate it.
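A quick sketch of that trade-off, holding the 50% horizon fixed and comparing the fitted shapes mentioned above (k ≈ 0.6 for AI agents, k ≈ 0.37 for humans) against the constant-hazard baseline:

```python
import numpy as np

def success_rate(t, t50, k):
    """Weibull success probability at task length t, calibrated so the
    success rate at the 50% horizon t50 is exactly 50%."""
    return np.exp(-np.log(2) * (t / t50) ** k)

multiples = (0.01, 0.1, 10, 100)  # task length as a multiple of the 50% horizon
for k, label in ((1.0, "constant hazard"), (0.6, "AI agents"), (0.37, "humans")):
    row = ", ".join(f"{m}x: {success_rate(m, 1.0, k):.1%}" for m in multiples)
    print(f"k = {k:<4} ({label}): {row}")
```

In this toy calculation, the human-like shape wins decisively on tasks 100 times the horizon (around 2% success versus essentially zero under constant hazard) but loses on tasks 1/100th of the horizon (about 88% versus over 99%).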
As a word of warning: I'd seen Gus's preliminary results before METR added the data for Opus 4.5, and I was quite sure it would have a more human-like value of k, since it had a great 50% horizon length but only an average showing at the 80% reliability threshold. But the estimates are that its value of k is similar to other AI agents', while its value of λ is similar to humans'. I'm pretty puzzled by this, as a change in k would perfectly explain that surprising discrepancy between its 50% and 80% thresholds. I'm not sure what is going on, but it might be because there isn't much data to go on here and things are quite noisy for any individual model.
So, from Gus's results, it still looks like there is an important gap between how humans' success rates drop off on longer tasks and how AI agents' do.
Finally, Gus also compares his two-parameter Weibull model of the data to METR's two-parameter log-logistic model. He finds that they are similar, but with the log-logistic fitting slightly better. So it isn't clear which of these to use if you have the choice. They differ quite a lot in the tails of the distribution (i.e. in estimated success rates for very short or very long tasks).
For example, the Weibull says the 99% horizon is 1/20th as long as the log-logistic predicts. That's a crucial difference, and the data doesn't tell us which to favour! I'd slightly favour the Weibull, on the grounds that it is more plausible ex ante. But maybe the bigger lesson is that we don't know which is right, and thus the 99%-reliability horizons (a reliability level necessary for much useful work) are deeply uncertain.
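To see how far apart the tails can be, here is a toy calibration (made-up horizons, not Gus's fitted parameters): pin both distributions to the same 50% and 80% horizons, then compare the 99% horizons they imply.

```python
import numpy as np

t50, t80 = 60.0, 12.0  # hypothetical 50% and 80% horizons, in minutes

# Weibull: S(t) = exp(-(t/lam)^k), solved to pass through both horizons
k = np.log(np.log(2) / np.log(1 / 0.8)) / np.log(t50 / t80)
lam = t50 / np.log(2) ** (1 / k)
t99_weibull = lam * np.log(1 / 0.99) ** (1 / k)

# Log-logistic: S(t) = 1 / (1 + (t/alpha)^beta); alpha is the 50% horizon
alpha = t50
beta = np.log(1 / 0.8 - 1) / np.log(t80 / t50)  # solves S(t80) = 0.8
t99_loglogistic = alpha * (1 / 0.99 - 1) ** (1 / beta)

print(f"Weibull 99% horizon:      {t99_weibull:.2f} min")
print(f"Log-logistic 99% horizon: {t99_loglogistic:.2f} min")
```

Even in this crude version, where the two curves agree exactly at the 50% and 80% horizons, the Weibull's 99% horizon comes out at roughly half the log-logistic's; with the actual fitted parameters, the gap Gus reports is about 20x. The disagreement lives almost entirely in the tail, which is exactly where we have the least data.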
4 February 2026