The problem isn’t that the tasks are impossible to solve, it’s that they’re underspecified and/or impossible to solve consistently (ex. because a test is expecting the solution function to have a specific name that wasn’t specified in the task itself).
So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?
Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.
But if that hypothesis is the explanation the interesting part is Opus 4.7 (but not 4.6) seems to be doing the same.
>Mythos figuring out how to cheat at the benchmark strikes me as much more likely.
Define "cheat". If it's just hacking the test harness to return "PASSED", surely this would be easily detected with some human auditing? It sounds far more likely their solution are designed to pass the incorrect tests. That might be considered bad in a SWE context, but it's not exactly cheating either. It might even be considered a good thing, eg. in the context of backwards compatibility.
The open models of similar scales (ex. the new 1T deepseek model) are a fraction of the cost per token, so I don’t see how that can be the case. Inference is profitable, it’s the training that makes it unprofitable.
Yeah but in a discussion about technology it’s a little silly. It’s like someone complaining about their phone and then finding out they still use a Nokia.
Overwatch also has “kill cams”, which basically create an entire alternate game state to show you how the enemy killed you, and they have the “Play of the Game” system that replays the coolest moment of the game at the end. It’s impressive tech.
reply