Hacker News | MattRix's comments

Only if you didn’t read the article…

How can you say “without bringing in proof” when there is literally proof in the article?

The problem isn’t that the tasks are impossible to solve, it’s that they’re underspecified and/or impossible to solve consistently (ex. because a test is expecting the solution function to have a specific name that wasn’t specified in the task itself).
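A minimal sketch of that failure mode, with every name hypothetical (`flip`, `reverse_string`, and `grade` are not from any real benchmark). It assumes the task text only says "write a function that reverses a string" while the hidden test looks the function up by one particular name:

```python
def flip(text: str) -> str:
    """A correct solution, under a name the task never ruled out."""
    return text[::-1]


def grade(namespace: dict) -> bool:
    """Hypothetical hidden test: it requires the name `reverse_string`."""
    func = namespace.get("reverse_string")
    if func is None:
        # The solution may be correct, but the test cannot find it.
        return False
    return func("abc") == "cba"


print(grade({"flip": flip}))            # correct code, unexpected name
print(grade({"reverse_string": flip}))  # same code, expected name
```

The same function fails or passes depending only on whether its name happens to match what the test expects.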

So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?
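To show why repeated runs matter for tasks that only pass some of the time, here is a rough sketch. It assumes each attempt at a task passes independently with probability `p`; both that assumption and the numbers are illustrative, not anything reported in the article:

```python
def pass_at_least_once(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent runs passes."""
    return 1.0 - (1.0 - p) ** attempts


# A task that passes only 5% of the time per attempt.
for attempts in (1, 10, 100):
    print(attempts, round(pass_at_least_once(0.05, attempts), 3))
```

Under that assumption, even a task that is rarely solved on a single attempt is very likely to be solved at least once given enough attempts.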


We actually know that a "100% pass rate" is trivially possible: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.

But if that hypothesis is the explanation, the interesting part is that Opus 4.7 (but not 4.6) seems to be doing the same.


>Mythos figuring out how to cheat at the benchmark strikes me as much more likely.

Define "cheat". If it's just hacking the test harness to return "PASSED", surely this would be easily detected with some human auditing? It sounds far more likely that their solutions are designed to pass the incorrect tests. That might be considered bad in a SWE context, but it's not exactly cheating either. It might even be considered a good thing, e.g. in the context of backwards compatibility.

[1] https://learn.microsoft.com/en-us/troubleshoot/microsoft-365...


They mention this in the article. This is why private (non-public) benchmark tasks that have been made from scratch are necessary.

I don’t see why this would happen when the modern models already use MoE, which gives them most of the benefits of having specialized models.

The open models of similar scales (ex. the new 1T deepseek model) are a fraction of the cost per token, so I don’t see how that can be the case. Inference is profitable, it’s the training that makes it unprofitable.

Yeah but in a discussion about technology it’s a little silly. It’s like someone complaining about their phone and then finding out they still use a Nokia.

Overwatch also has “kill cams”, which basically create an entire alternate game state to show you how the enemy killed you, and they have the “Play of the Game” system that replays the coolest moment of the game at the end. It’s impressive tech.

Log out and log in again? That usually fixes these kinds of issues for me.

From what I’ve seen, once people start using these, they will do the font size thing. Then all your changes go through the same interface.
