More

MattRix · 2026-04-26T15:43:21 1777218201

Only if you didn’t read the article…

MattRix · 2026-04-26T15:43:00 1777218180

How can you say “without bringing in proof” when there is literally proof in the article?

MattRix · 2026-04-26T15:41:48 1777218108

The problem isn’t that the tasks are impossible to solve, it’s that they’re underspecified and/or impossible to solve consistently (ex. because a test is expecting the solution function to have a specific name that wasn’t specified in the task itself).

So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?

gpm · 2026-04-26T15:52:52 1777218772

We actually know that a "100% pass rate" is trivially possible: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.

But if that hypothesis is the explanation the interesting part is Opus 4.7 (but not 4.6) seems to be doing the same.

gruez · 2026-04-26T15:59:50 1777219190

>Mythos figuring out how to cheat at the benchmark strikes me as much more likely.

Define "cheat". If it's just hacking the test harness to return "PASSED", surely this would be easily detected with some human auditing? It sounds far more likely their solution are designed to pass the incorrect tests. That might be considered bad in a SWE context, but it's not exactly cheating either. It might even be considered a good thing, eg. in the context of backwards compatibility.

[1] https://learn.microsoft.com/en-us/troubleshoot/microsoft-365...

MattRix · 2026-04-26T15:38:12 1777217892

They mention this in the article. This is why private (non public) benchmark tasks that have been made from scratch are necessary.

MattRix · 2026-04-25T21:34:38 1777152878

I don’t see why this would happen when the modern models already use MoE, which gives them most of the benefits of having specialized models.

MattRix · 2026-04-25T21:32:07 1777152727

The open models of similar scales (ex. the new 1T deepseek model) are a fraction of the cost per token, so I don’t see how that can be the case. Inference is profitable, it’s the training that makes it unprofitable.

MattRix · 2026-04-25T13:00:41 1777122041

Yeah but in a discussion about technology it’s a little silly. It’s like someone complaining about their phone and then finding out they still use a Nokia.

MattRix · 2026-04-19T13:07:08 1776604028

Overwatch also has “kill cams”, which basically create an entire alternate game state to show you how the enemy killed you, and they have the “Play of the Game” system that replays the coolest moment of the game at the end. It’s impressive tech.

MattRix · 2026-04-17T13:50:47 1776433847

Log out and log in again? That usually fixes these kind of issues for me.

MattRix · 2026-04-17T13:49:21 1776433761

From what I’ve seen, once people start using these, they will do the font size thing. Then all your changes go through the same interface.