Hacker News

I put the examples he gave into Claude 4 (Sonnet), asking only to evaluate the code, and it pointed out every single issue in the code snippets (N+1 query, race condition, memory leak). The article doesn't mention which model was used, how exactly it was used, or in which environment/IDE it was used.
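For readers unfamiliar with the first issue named above, here is a minimal, hypothetical sketch of the N+1 query pattern (the function names and a fake query counter are invented here; they stand in for ORM/database calls):

```python
# Hypothetical sketch of an N+1 query. `fetch_orders` and `fetch_user`
# stand in for real database calls; QUERY_COUNT counts round trips.

QUERY_COUNT = 0

def fetch_orders():
    global QUERY_COUNT
    QUERY_COUNT += 1  # one query for the whole list
    return [{"id": i, "user_id": i} for i in range(100)]

def fetch_user(user_id):
    global QUERY_COUNT
    QUERY_COUNT += 1  # one query per order: the "+1" repeated N times
    return {"id": user_id, "name": f"user-{user_id}"}

def order_report():
    # 1 query for the orders, then 100 more queries, one per user,
    # instead of 2 total (orders + one batched user lookup).
    return [(o["id"], fetch_user(o["user_id"])["name"])
            for o in fetch_orders()]

order_report()
print(QUERY_COUNT)  # prints 101
```

The fix is usually a join or a single batched lookup of all the user IDs, which any recent model flags readily when asked to review the snippet.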

The rest of the advice in there is sound, but without more specifics I don't know how actionable the section "The spectrum of AI-appropriate tasks" really is.



It's not about "model quality". Most models can improve their output when asked, but the problem is the lack of introspection by the user.

Basically the same problem as copy-paste coding, except an LLM can (sometimes) use your exact variable names and types, so it's easier to forget that you still need to understand and check the code.


My experience hasn't changed between models, given the core issue mentioned in the article. Primarily I have used Gemini and Claude 3.x and 4. Some GPT 4.1 here and there.

All via Cursor, some internal tools and Tines Workbench


My experience changes just throughout the day on the same model, it seems pretty clear that during peak hours (lately most of the daytime) Anthropic is degrading their models in order to meet demand. Claude becomes a confident idiot and the difference is quite noticeable.


I too have noticed variability, and it's impossible to know for sure, but late one Friday or Saturday night (PST) it seemed brilliant, several iterations in a row. Some of my best output has come in very short windows.


Is this on paid plans?


This is through providers such as Cursor, but the consistency of this experience has put me off from directly subscribing to Anthropic since I'm already subscribed up to my eyeballs in various AI services.

Last I checked, Anthropic would not admit that they were degrading models, for obvious scummy business reasons, but they are probably quantizing them, reducing beam search, lowering precision/sampling, etc., because the model goes from superpowered to completely unusable: constantly dropping code and mangling files, getting caught in loops, taking the weirdest detours, and sometimes completely ignoring my instructions from just one message prior.

At first I wondered if Cursor was mishandling the context, and while they indeed aren't doing the best job with context stuffing, the rest of the issues are not context-related.


As you pointed out, the examples in the blog post are not an LLM failure. The real failure is asking too little.

Engineers think "the LLM can handle the simple code change, but if I ask for too much it'll fall over." Wrong. Modern LLMs can easily handle a 50-line function plus 50 lines of detailed comments explaining assumptions, performance implications, and what changes would invalidate this approach.

But most engineers are either asking for solutions without enough context or failing to ask the LLM to document its assumptions.

Then they're shocked when they have to reverse-engineer the fact that the code assumes 100 users when they have 100k, or work out why it's doing individual API calls when they needed batch processing.

Most engineers have never seen good comments, so they don't know they can ask LLMs to write them.

The default LLM comment is just English pseudo-code: "this function takes a user ID and sends them a notification." Completely useless. But that's because most engineers have never experienced comments that explain trade-offs, performance implications, or future system evolution.
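To make the contrast concrete, here is a hypothetical notification sender written both ways (the function names, endpoints, and numbers are invented for illustration):

```python
# Typical LLM default: English pseudo-code that restates the signature.
def send_notification(user_id: int) -> None:
    """Takes a user ID and sends them a notification."""
    ...

# The kind of comment worth asking for: assumptions and breaking points.
def send_notification_documented(user_id: int) -> None:
    """Send a single in-app notification via one synchronous HTTP call.

    Assumptions:
    - Called at most a few times per second; past roughly 100 calls/s,
      switch to a batch endpoint to avoid per-call overhead.
    - user_id was validated upstream; no existence check is done here.

    Breaks if:
    - Notifications must be strictly ordered per user (no sequencing).
    - The downstream service starts rate-limiting (no retry/backoff).
    """
    ...
```

The second docstring is exactly the material (assumptions, performance limits, invalidating conditions) the parent comment says you can simply ask the model to produce.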

Writing clear technical explanations is genuinely difficult. Almost no one does it well. So when you ask an LLM for "comments," you get the same terrible pattern you've seen everywhere else.

But you can literally ask for explanations of assumptions, performance characteristics, and scenarios where this approach would break. The LLM handles it perfectly. You just have to know that's even possible. Makes the code review so much easier.

Most engineers don't, because they've never seen it done.

[1] https://peoplesgrocers.com/en/writing/asking-llms-the-right-...


This seems to be the default for Gemini 2.5 Pro now


Did it detect the N+1 in the first one, the race condition in the second one, and the memory leak in the third one?


It did, yeah.


Could it be that it just found this article and suggested issues based on it? The issues are somewhat arbitrary: one could make a case for several different issues in any of the snippets, and yet the model chose the exact ones mentioned by the article.



