I didn't really understand the "long task" thing until I actually experienced it. The problem is finding a task you can set an agent that justifies working for that long. I finally hit one when I tried porting that Python HTML5 parser to JavaScript by pointing Codex CLI at the 9,200 html5lib-tests test suite: https://simonwillison.net/2025/Dec/15/porting-justhtml/
It's pretty amazing to watch tools-in-a-loop crunch away for >4 hours to solve a generally difficult problem through sheer brute-force.
To be clear, this doesn't mean it takes the AI > 4 hours to do the task. METR measures the difficulty of tasks by how long it takes a human to do the same task. This benchmark is saying that Opus 4.5 can now do tasks (related to AI R&D, coding foremost among them) that take human experts > 4 hours, at a 50% reliability level; whether that's actually useful depends, of course, on the cost of failure. It is silent on how long it takes AI systems to do those tasks. In theory an AI system could take longer than that (in practice it's usually significantly shorter).
This is of course quite highly correlated with an AI system being able to churn through a task for a long time. But it's not necessarily the same thing.
Of course the big questions will arise if/when we start crossing lines like 8 hours (a whole work day) or 40 hours (a whole work week).
I think you might be misunderstanding the article actually, this is about AI solving tasks as measured by how long it takes a human to solve the task. The AI could potentially solve it much quicker, but the use of "human time to solve" is an attempt to create a metric that reveals long horizon complexity (as I understand it anyway).
It's interesting because like the article notes, AI is really smashing benchmarks, but actual usefulness in automation of thought work is proving much more elusive. I think that collective experience of AI just not being that useful, or as useful as benchmarks suggest it should be, is captured in this metric.
I've maintained a healthy skepticism of the recent boom, but I can't see why the long-horizon time wouldn't stretch to 8 hours, or a week's worth of effort, by next year. After Opus 4.5, governments and organizations should really figure out a path out of this storm, because we're in it now.
It's significantly accelerated to 4 months since the beginning of 2025, which puts 1 week within reach if things stay on trend. But yes 7 months is the more reliable long-term trend.
Can we attribute the acceleration to something specific, that might not actually continue growth? For example, agentic coding and reasoning models seem to have made a huge leap in abilities, but wouldn't translate to an ongoing exponential growth.
There's a fair amount of uncertainty on this point. In general it's unclear when or whether things will plateau (although there are, again, indications that the trend is accelerating, not decelerating).
That being said, if by "agentic coding" you are implying that a leap in capabilities is due to novel agentic frameworks/scaffolding that have appeared in 2025, I believe you are confusing cause and effect.
In particular, the agentic frameworks and scaffolding are by and large not responsible for the jump in capabilities. It is rather that the underlying models have improved sufficiently such that these frameworks and scaffolding work. None of the frameworks and scaffolding approaches of 2025 are new. All of them had been tried as early as 2023 (and indeed most of them had been tried in 2020 when GPT-3 came out). It's just that 2023-era models such as GPT-4 were far too weak to support them. Only in 2025 have models become sufficiently powerful to support these workflows.
Hence agentic frameworks and scaffolding are symptoms of ongoing exponential growth, not one-time boosts of growth.
Likewise reasoning models do not seem to be a one-time boost of growth. In particular, reasoning models (or more accurately, RLVR) seem to be an ongoing source of new pretraining data (where the reasoning traces of models created during the process of RLVR serve as pretraining data for the next generation of models).
I remain uncertain, but I think there is a very real chance (>= 50%) that we are on an exponential curve that doesn't top out anytime soon (which gets really crazy really fast). If you want to do something about it, whether that's stopping the curve, flattening the curve, preparing yourself for the curve etc., you better do it now.
METR is using hours of equivalent human effort, not actual hours the agent itself spends, so by their methodology, your task might qualify as one where it pulls off much more than 4h of human work.
"Human hours equivalent" itself is an interesting metric, because: which human? Or rather, I'm sure they had a coherent definition in mind: presumably a human reasonably competent at whatever the specific task is. But hours the abstract human standard would spend is different from the hours any specific person, say you or I, would spend.
In particular, some of the appeal (and risk!!) of these things is precisely that you can ask for help with things that would be quick work for someone (who knows jq, or a certain corner of the PyPI library ecosystem, or modern CSS, or TypeScript annotations, or something else) but not for you.
The “50% time horizon” feels most actionable when you pair it with an expected-value model.
For a given task: EV ≈ (human_time_saved × $/hour) − (p_fail × cost_of_failure) − (iteration/oversight cost).
A model crossing 4h-at-50% might be hugely useful for low failure-cost work, and still net-negative for anything where rollback/debug is expensive. The missing piece is how p_fail scales with task length + how recoverable failures are.
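As a rough sketch of that EV model (every number here is invented purely for illustration):

```javascript
// Expected value of delegating a task to an agent, per the formula above:
// EV ≈ (human_time_saved × $/hour) − (p_fail × cost_of_failure) − oversight cost.
function taskEV({ humanHoursSaved, hourlyRate, pFail, costOfFailure, oversightCost }) {
  return humanHoursSaved * hourlyRate - pFail * costOfFailure - oversightCost;
}

// Low failure-cost work: 4h saved at $100/h, 50% failure rate, cheap rollback.
const cheap = taskEV({
  humanHoursSaved: 4, hourlyRate: 100,
  pFail: 0.5, costOfFailure: 50, oversightCost: 75,
});
// 400 - 25 - 75 = 300: net positive despite only 50% reliability.

// Same task, but a failure costs $2000 to roll back and debug.
const risky = taskEV({
  humanHoursSaved: 4, hourlyRate: 100,
  pFail: 0.5, costOfFailure: 2000, oversightCost: 75,
});
// 400 - 1000 - 75 = -675: net negative at the same reliability level.
```

The same 4h-at-50% capability flips sign purely on the cost of failure, which is why the benchmark alone can't tell you whether the model is useful for your work.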
Yeah--it's difficult to go from a benchmark involving the model attempting things alone to the effect assisting people on real tasks because, well, ideally you'd measure that with real people doing real tasks. Last time METR tried that (in early '25) they found a net slowdown rather than any speedup at all. Go figure!
The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years...
Yeah--I wanted a short way to gesture at the subsequent "tasks that are fast for someone but not for you are interesting," and did not mean it as a gotcha on METR, but I should've taken a second longer and pasted what they said rather than doing the "presumably a human competent at the task" handwave that I did.
I agree. After all, benchmarks don't mean much, but I guess they are fine as long as they keep measuring the same thing every time.
Also, the context matters. In my case, I see a huge difference between the gains at work vs those at home on a personal project, where I don't have to worry about corporate policies, security, correctness, standards, etc. I can let the LLM fly and not worry about losing my job in record time.
My problem with the OpenAI models (GPT-5.2 in particular) recently is an extreme aversion to doing more than the smallest step in a task before asking for user input. Even if I explicitly instruct it to continue without input until the task is complete, it ignores the instruction.
I cannot imagine GPT-5.2 working on a task for more than 2 minutes, let alone 4 hours. I'm curious if you've run into this and figured out a way around it?
I've not had that problem at all with GPT-5.2 running in Codex CLI.
I use prompts like this:
Build a pure JavaScript library (no dependencies) for encoding and
decoding this binary format. Start by looking at how the lite3-python
library works - the JavaScript one should have the same API and probably the
same code design too. Build the JS one in lite3-javascript - it should be a
single JavaScript module which works in both Node.js and in the browser.
There should be a test script that runs with Node.js which runs against the
files in the lite3-python/format_suite folder. Write the test script first,
run it and watch it fail, then build the JavaScript library and keep running
the tests until they pass.
I find that surprising. GPT 5.2 is the model I've had working the longest. It frequently works more than 4 hours nonstop, while earlier models would stop to ask if they should continue every 10 minutes. 5.1 and earlier ignores it if I ask it to continue until a task is done, but 5.2 will usually finish it.
How are you guys even doing long tasks with plain Codex or Claude code?
I use Claude code and I get hit with a permissions prompt every 2 seconds for anything I try to do.
Sure I can turn off all dangerous permissions but it'd probably honestly stop and claim it's finished well before it actually is in most cases from my experience.
To be fair I haven't tried Codex, so maybe it's better at this, but in my experience almost every model stops at some point and claims victory, or stops and tells me something like "next we'll continue on with XYZ", at which point I have to prompt it to continue.
You have to use the --yolo or --dangerously-skip-permissions options.
Thankfully the cloud versions (Claude Code for web, Codex Cloud) run like that already, and are relatively safe in that if anything goes wrong it happens on someone else's computer.
Codex (at least 5 and 5.1) is bad at asking for permission. Whenever it wants to run pre-commit or platformio, it tries to do that, that fails because of the sandbox, and then Codex decides something is wrong with the cache directory and keeps asking for permission to sudo chown ~/.cache, every time.
I have to specifically tell it to request permission for the command it wants to run, and then it works. Very annoying, and very annoying that it can't persist the permission, like Claude Code can, so it doesn't have to ask again every single time.
Quickly looking at the source code, mostly treeBuilder and tokenizer, I do see several possible improvements:
- Use TypeScript instead of JavaScript
- Use perfect hashes instead of ["a", "b", "c"].includes() idioms, string equalities, Sets, etc.
- Use a single perfect hash to match all tags/attribute names and then use enums in the rest of the codebase
- Use a single if (token.kind === Tag.START) instead of repeating that check across 10 consecutive conditionals
- Don't return the "reprocess" constant, but use an enum or perhaps nothing if "reprocess" is the only option
- Try tail recursion instead of a switch over the state in the tokenizer
- Use switches (best after a perfect hash lookup) instead of multiple ifs on characters in the tokenizer
- "treeBuilder.openElements = treeBuilder.open_elements;" can't possibly be good code
Perhaps the agent can find these itself if told to make the code perfect and not just pass the tests
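A minimal sketch of the tag-lookup idea (all names here are invented, not the actual parser's identifiers): intern tag names to integer enums once, then switch on the integer everywhere else instead of repeating string comparisons.

```javascript
// Hypothetical enum of tag IDs; a real parser would cover the full HTML tag set.
const Tag = Object.freeze({ UNKNOWN: 0, A: 1, B: 2, C: 3, TABLE: 4 });

// One precomputed lookup table replaces scattered ["a", "b", "c"].includes(name) checks.
const TAG_LOOKUP = new Map([
  ["a", Tag.A],
  ["b", Tag.B],
  ["c", Tag.C],
  ["table", Tag.TABLE],
]);

function tagId(name) {
  return TAG_LOOKUP.get(name) ?? Tag.UNKNOWN;
}

// Downstream code switches on small integers instead of comparing strings:
function isFormattingTag(name) {
  switch (tagId(name)) {
    case Tag.A:
    case Tag.B:
      return true;
    default:
      return false;
  }
}
```

A Map lookup on an interned string plus an integer switch is generally cheaper than repeated array scans, and the enum makes the tree-builder's dispatch logic easier to read.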
I didn't take the TypeScript suggestion though - it didn't use TypeScript because I don't like adding a build step to my JavaScript projects if I can possibly avoid it. The agent would happily have used TypeScript if I had let it.
I don't like that openElements = open_elements pattern either - it did that because I asked it for a port of a Python library and it decided to support the naming conventions for both Python and JavaScript at once. I told it to remove all of those.
It pushed back against the tail recursion suggestion:
> The current implementation uses a switch statement in step(). JavaScript doesn’t have proper tail call optimization (only Safari implements it), so true tail recursion would cause stack overflow on large documents.
You should take into consideration the time it took to make those 9200 tests originally. If you have good test coverage the agent can go much farther ahead.
AI has become very good at writing pointless and bad tests, at least. It remains difficult to compel it to write good tests consistently.
But even if it wrote great tests every time, the trouble is that testing was designed around the idea of "double entry accounting". Even great tests can test the wrong thing. In the old world you would write a test case and then implement something to satisfy the same. If both sides of the ledger agree, so to speak, you can be pretty confident that both are correct.
In other words, going through the process of implementation gives an opportunity to make sure the test you wrote isn't ill-conceived or broken itself. If you only write the tests, or only write the implementation, or write none of it, there is no point at which you can validate your work.
If you have already built up an application and are reusing its test suite to reimplement the software in another language, like above, that is one thing, but in greenfield work it remains an outstanding problem of how to validate the work when you start to involve AI agents. Another article posted here recently suggests that we can go back to manual testing to validate the work... But that seems like a non-solution.
Every error is a signal that you need better tests. You can let the LLM create tests for every error it stumbles into, besides all the regular tests it can write on its own. Add all the test scenarios you can think of, since you are not implementing them by hand. A bad test is invalidated by the code, and bad code is invalidated by the tests, so between them the AI agent can become reliable.
>> Maybe this is the negative effects of excessive LLM usage that are spoken about.
> I upvoted them for pointing that out.
I'm also curious about what you think about the GP's question. TBH, responding after reading half an article was a common thing for most people pre-LLM anyway.
Yeah, show me a Hacker News user who's never posted a comment on a story without properly reading it (or even without clicking the link). LLMs have nothing to do with it.
If I had piped the article through an LLM first, I wouldn't have made the embarrassing mistake in that comment!