I didn't really understand the "long task" thing until I actually experienced it. The problem is finding a task you can set an agent that justifies working for that long. I finally hit one when I tried porting that Python HTML5 parser to JavaScript by pointing Codex CLI at the 9,200 html5lib-tests test suite: https://simonwillison.net/2025/Dec/15/porting-justhtml/
It's pretty amazing to watch tools-in-a-loop crunch away for >4 hours to solve a generally difficult problem through sheer brute-force.
To be clear, this doesn't mean it takes the AI > 4 hours to do the task. METR measures the difficulty of tasks by how long it takes a human to do the same task. This benchmark is saying that Opus 4.5 can now do tasks (related to AI R&D, coding foremost among them) that take human experts > 4 hours, at a 50% reliability level; whether that's actually useful depends, of course, on the cost of failure. It is silent on how long it takes AI systems to do those tasks. In theory an AI system could take longer than that (in practice it's usually significantly shorter).
This is of course quite highly correlated with an AI system being able to churn through a task for a long time. But it's not necessarily the same thing.
Of course the big questions will arise if/when we start crossing lines like 8 hours (a whole work day) or 40 hours (a whole work week).
I think you might be misunderstanding the article actually, this is about AI solving tasks as measured by how long it takes a human to solve the task. The AI could potentially solve it much quicker, but the use of "human time to solve" is an attempt to create a metric that reveals long horizon complexity (as I understand it anyway).
It's interesting because like the article notes, AI is really smashing benchmarks, but actual usefulness in automation of thought work is proving much more elusive. I think that collective experience of AI just not being that useful, or as useful as benchmarks suggest it should be, is captured in this metric.
I've maintained a healthy skepticism of the recent boom, but I can't see why the long-horizon time wouldn't stretch to 8 hours, or a week's worth of effort, by next year. After Opus 4.5, governments and organizations should really figure out a path out of this storm, because we're in it now.
It's significantly accelerated to 4 months since the beginning of 2025, which puts 1 week within reach if things stay on trend. But yes 7 months is the more reliable long-term trend.
Can we attribute the acceleration to something specific, that might not actually continue growth? For example, agentic coding and reasoning models seem to have made a huge leap in abilities, but wouldn't translate to an ongoing exponential growth.
There's a fair amount of uncertainty on this point. In general it's unclear when or whether things will plateau (although there are, again, indications that the trend is accelerating, not decelerating).
That being said, if by "agentic coding" you are implying that a leap in capabilities is due to novel agentic frameworks/scaffolding that have appeared in 2025, I believe you are confusing cause and effect.
In particular, the agentic frameworks and scaffolding are by and large not responsible for the jump in capabilities. It is rather that the underlying models have improved sufficiently such that these frameworks and scaffolding work. None of the frameworks and scaffolding approaches of 2025 are new. All of them had been tried as early as 2023 (and indeed most of them had been tried in 2020 when GPT-3 came out). It's just that 2023-era models such as GPT-4 were far too weak to support them. Only in 2025 have models become sufficiently powerful to support these workflows.
Hence agentic frameworks and scaffolding are symptoms of ongoing exponential growth, not one-time boosts of growth.
Likewise reasoning models do not seem to be a one-time boost of growth. In particular, reasoning models (or more accurately, RLVR) seem to be an ongoing source of new pretraining data (where the reasoning traces of models created during the process of RLVR serve as pretraining data for the next generation of models).
I remain uncertain, but I think there is a very real chance (>= 50%) that we are on an exponential curve that doesn't top out anytime soon (which gets really crazy really fast). If you want to do something about it, whether that's stopping the curve, flattening the curve, preparing yourself for the curve etc., you better do it now.
METR is using hours of equivalent human effort, not actual hours the agent itself spends, so by their methodology, your task might qualify as one where it pulls off much more than 4h of human work.
"Human hours equivalent" itself is an interesting metric, because: which human? Or rather, I'm sure they had a coherent definition in mind: presumably a human reasonably competent at whatever the specific task is. But hours the abstract human standard would spend is different from the hours any specific person, say you or I, would spend.
In particular, some of the appeal (and risk!!) of these things is precisely that you can ask for help with things that would be quick work for someone (who knows jq, or a certain corner of the PyPI library ecosystem, or modern CSS, or TypeScript annotations, or something else) but not for you.
The “50% time horizon” feels most actionable when you pair it with an expected-value model.
For a given task: EV ≈ (human_time_saved × $/hour) − (p_fail × cost_of_failure) − (iteration/oversight cost).
A model crossing 4h-at-50% might be hugely useful for low failure-cost work, and still net-negative for anything where rollback/debug is expensive. The missing piece is how p_fail scales with task length + how recoverable failures are.
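As a rough sketch of that EV model (every number here is invented purely for illustration):

```javascript
// Expected value of delegating a task to an agent, per the formula above:
// EV ≈ (human_time_saved × $/hour) − (p_fail × cost_of_failure) − oversight cost.
function taskEV({ humanHoursSaved, hourlyRate, pFail, costOfFailure, oversightCost }) {
  return humanHoursSaved * hourlyRate - pFail * costOfFailure - oversightCost;
}

// Low failure-cost work: 4h saved at $100/h, 50% failure rate, cheap rollback.
const cheap = taskEV({
  humanHoursSaved: 4, hourlyRate: 100,
  pFail: 0.5, costOfFailure: 50, oversightCost: 75,
});
// 400 - 25 - 75 = 300: net positive despite only 50% reliability.

// Same task, but a failure costs $2000 to roll back and debug.
const risky = taskEV({
  humanHoursSaved: 4, hourlyRate: 100,
  pFail: 0.5, costOfFailure: 2000, oversightCost: 75,
});
// 400 - 1000 - 75 = -675: net negative at the same reliability level.
```

The same 4h-at-50% capability flips sign purely on the cost of failure, which is why the benchmark alone can't tell you whether the model is useful for your work.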
Yeah--it's difficult to go from a benchmark involving the model attempting things alone to the effect assisting people on real tasks because, well, ideally you'd measure that with real people doing real tasks. Last time METR tried that (in early '25) they found a net slowdown rather than any speedup at all. Go figure!
The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years...
Yeah--I wanted a short way to gesture at the subsequent "tasks that are fast for someone but not for you are interesting," and did not mean it as a gotcha on METR, but I should've taken a second longer and pasted what they said rather than doing the "presumably a human competent at the task" handwave that I did.
I agree. After all, benchmarks don't mean much, but I guess they are fine as long as they keep measuring the same thing every time.
Also, the context matters. In my case, I see a huge difference between the gains at work vs those at home on a personal project, where I don't have to worry about corporate policies, security, correctness, standards, etc. I can let the LLM fly and not worry about losing my job in record time.
My problem with the OpenAI models (GPT-5.2 in particular) recently is an extreme aversion to doing more than the smallest step in a task before asking for user input. Even if I explicitly instruct it to continue without input until the task is complete, it ignores the instruction.
I cannot imagine GPT-5.2 working on a task for more than 2 minutes, let alone 4 hours. I'm curious if you've run into this and figured out a way around it?
I've not had that problem at all with GPT-5.2 running in Codex CLI.
I use prompts like this:
Build a pure JavaScript library (no dependencies) for encoding and
decoding this binary format. Start by looking at how the lite3-python
library works - the JavaScript one should have the same API and probably the
same code design too. Build the JS one in lite3-javascript - it should be a
single JavaScript module which works in both Node.js and in the browser.
There should be a test script that runs with Node.js which runs against the
files in the lite3-python/format_suite folder. Write the test script first,
run it and watch it fail, then build the JavaScript library and keep running
the tests until they pass.
I find that surprising. GPT 5.2 is the model I've had working the longest. It frequently works more than 4 hours nonstop, while earlier models would stop to ask if they should continue every 10 minutes. 5.1 and earlier ignores it if I ask it to continue until a task is done, but 5.2 will usually finish it.
How are you guys even doing long tasks with plain Codex or Claude code?
I use Claude code and I get hit with a permissions prompt every 2 seconds for anything I try to do.
Sure I can turn off all dangerous permissions but it'd probably honestly stop and claim it's finished well before it actually is in most cases from my experience.
To be fair I haven't tried Codex, so maybe it's better at this, but in my experience almost every model stops at some point and claims victory, or stops and tells me something like "next we'll continue on with XYZ", at which point I have to prompt it to continue.
You have to use the --yolo or --dangerously-skip-permissions options.
Thankfully the cloud versions (Claude Code for web, Codex Cloud) run like that already, and are relatively safe in that if anything goes wrong it happens on someone else's computer.
Codex (at least 5 and 5.1) is bad at asking for permission. Whenever it wants to run pre-commit or platformio, it tries to do that, that fails because of the sandbox, and then Codex decides something is wrong with the cache directory and keeps asking for permission to sudo chown ~/.cache, every time.
I have to specifically tell it to request permission for the command it wants to run, and then it works. Very annoying, and very annoying that it can't persist the permission, like Claude Code can, so it doesn't have to ask again every single time.
Quickly looking at the source code, mostly treeBuilder and tokenizer, I do see several possible improvements:
- Use TypeScript instead of JavaScript
- Use perfect hashes instead of ["a", "b", "c"].includes() idioms, string equalities, Sets, etc.
- Use a single perfect hash to match all tags/attribute names and then use enums in the rest of the codebase
- Use a single if (token.kind === Tag.START) instead of repeating that check across 10 consecutive conditionals
- Don't return the "reprocess" constant, but use an enum or perhaps nothing if "reprocess" is the only option
- Try tail recursion instead of a switch over the state in the tokenizer
- Use switches (best after a perfect hash lookup) instead of multiple ifs on characters in the tokenizer
- "treeBuilder.openElements = treeBuilder.open_elements;" can't possibly be good code
Perhaps the agent can find these itself if told to make the code perfect and not just pass the tests
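A minimal sketch of the tag-lookup idea (all names here are invented, not the actual parser's identifiers): intern tag names to integer enums once, then switch on the integer everywhere else instead of repeating string comparisons.

```javascript
// Hypothetical enum of tag IDs; a real parser would cover the full HTML tag set.
const Tag = Object.freeze({ UNKNOWN: 0, A: 1, B: 2, C: 3, TABLE: 4 });

// One precomputed lookup table replaces scattered ["a", "b", "c"].includes(name) checks.
const TAG_LOOKUP = new Map([
  ["a", Tag.A],
  ["b", Tag.B],
  ["c", Tag.C],
  ["table", Tag.TABLE],
]);

function tagId(name) {
  return TAG_LOOKUP.get(name) ?? Tag.UNKNOWN;
}

// Downstream code switches on small integers instead of comparing strings:
function isFormattingTag(name) {
  switch (tagId(name)) {
    case Tag.A:
    case Tag.B:
      return true;
    default:
      return false;
  }
}
```

A Map lookup on an interned string plus an integer switch is generally cheaper than repeated array scans, and the enum makes the tree-builder's dispatch logic easier to read.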
I didn't take the TypeScript suggestion though - it didn't use TypeScript because I don't like adding a build step to my JavaScript projects if I can possibly avoid it. The agent would happily have used TypeScript if I had let it.
I don't like that openElements = open_elements pattern either - it did that because I asked it for a port of a Python library and it decided to support the naming conventions for both Python and JavaScript at once. I told it to remove all of those.
It pushed back against the tail recursion suggestion:
> The current implementation uses a switch statement in step(). JavaScript doesn’t have proper tail call optimization (only Safari implements it), so true tail recursion would cause stack overflow on large documents.
You should take into consideration the time it took to make those 9200 tests originally. If you have good test coverage the agent can go much farther ahead.
AI has become very good at writing pointless and bad tests, at least. It remains difficult to compel it to write good tests consistently.
But even if it wrote great tests every time, the trouble is that testing was designed around the idea of "double entry accounting". Even great tests can test the wrong thing. In the old world you would write a test case and then implement something to satisfy the same. If both sides of the ledger agree, so to speak, you can be pretty confident that both are correct.
In other words, going through the process of implementation gives an opportunity to make sure the test you wrote isn't ill-conceived or broken itself. If you only write the tests, or only write the implementation, or write none of it, there is no point at which you can validate your work.
If you have already built up an application and are reusing its test suite to reimplement the software in another language, like above, that is one thing, but in greenfield work it remains an outstanding problem of how to validate the work when you start to involve AI agents. Another article posted here recently suggests that we can go back to manual testing to validate the work... But that seems like a non-solution.
Every error is a signal that you need better tests. You can let the LLM create tests for every error it stumbles into, besides all the regular tests it can write on its own. Add all the test scenarios you can think of, since you are not implementing them by hand. A bad test is invalidated by the code, and bad code is invalidated by the tests, so between them the AI agent can become reliable.
>> Maybe this is the negative effects of excessive LLM usage that are spoken about.
> I upvoted them for pointing that out.
I'm also curious about what you think about the GP's question. TBH, responding after reading half an article was a common thing for most people pre-LLM anyway.
Yeah, show me a Hacker News user who's never posted a comment on a story without properly reading it (or even without clicking the link). LLMs have nothing to do with it.
If I had piped the article through an LLM first, I wouldn't have made the embarrassing mistake in that comment!