I’ve followed your blog for a while, and I have been meaning to unsubscribe beca...

simonw · 2025-03-28T09:49:17 1743155357

"my experience is that in the context of working in a large systems codebase with years of history, it’s just not that useful."

It's possible that changed this week with Gemini 2.5 Pro, which is equivalent to Claude 3.7 Somnet in terms of code quality but has a 1 million token context (with excellent scores on long context benchmarks) and an increased output limit too.

I've been dumping hundreds of thousands of times of codebase into it and getting very impressive results.

mplanchard · 2025-03-28T12:26:08 1743164768

See this is one of the things that’s frustrating about the whole endeavor. I give it an honest go, it’s not very good, but I’m constantly exhorted to try again because maybe now that Model X 7.5qrz has been released, it’ll be really different this time!

It’s exhausting. At this point I’m mostly just waiting for it to stabilize and plateau, at which point it’ll feel more worth the effort to figure out whether it’s now finally useful for me.

simonw · 2025-03-28T13:27:30 1743168450

Not going to disagree that it's exhausting! I've been trying to stay on top of new developments for the past 2.5 years and there are so many days when I'll joke "oh, great, it's another two new models day".

Just on Tuesday this week we got the first widely available high quality multi-modal image output model (GPT-4o images) and a new best-overall model (Gemini 2.5) within hours of each other. https://simonwillison.net/2025/Mar/25/

jjani · 2025-03-28T03:06:14 1743131174

> One, your projects are small enough that you can reasonably provide enough context for the language model to be useful. Two, you’re using the most common languages in the training data. Three, because of those factors, you’re willing to put much more work into learning how to use it effectively, since it can actually produce useful content for you.

Take a look at the 2024 StackOverflow survey.

70% of professional developer respondents had only done extensive work over the last year in one of:

JS 64.6% SQL 54.1% JTML/CSS 52.9% PY 46.9% TS 43.4% Bash/Shell 34.2% Java 30%

LLMs are of course very strong in all of these. 70% of developers only code in languages LLMs are very strong at.

If anything, for the developer population at large, this number is even higher than 70%. The survey respondents are overwhelmingly American (where the dev landscape is more diverse), and self-select to those who use niche stuff and want to let the world know.

Similar argument can be made for median codebase size, in terms of LOC written every year. A few days ago he also gave Gemini Pro 2.5 a whole codebase (at ~300k tokens) and it performed well. Even in huge codebases, if any kind of separation of concerns is involved, that's enough to give all context relevant to the part of the code you're working on. [1]

[1] https://simonwillison.net/2025/Mar/25/gemini/

mplanchard · 2025-03-28T12:34:03 1743165243

What’s 300k tokens in terms of lines of code? Most codebases I’ve worked on professionally have easily eclipsed 100k lines, not including comments and whitespace.

But really that’s the vision of actual utility that I imagined when this stuff first started coming out and that I’d still love to see: something that integrates with your editor, trains on your giant legacy codebase, and can actually be useful answering questions about it and maybe suggesting code. Seems like we might get there eventually, but I haven’t seen that we’re there yet.

simonw · 2025-03-28T13:35:06 1743168906

We hit "can actually be useful answering questions about it" within the last ~6 months with the introduction of "reasoning" models with 100,000+ token contest limits (and the aforementioned Gemini 1 million/2 million models).

The "reasoning" thing is important because it gives models the ability to follow execution flow and answer complex questions that down many different files and classes. I'm finding it incredible for debugging, eg: https://gist.github.com/simonw/03776d9f80534aa8e5348580dc6a8...

I built a files-to-prompt tool to help dump entire codebases into the larger models and I use it to answer complex questions about code (including other people's projects written in languages I don't know) several times a week. There's a bunch of examples of that here: https://simonwillison.net/search/?q=Files-to-prompt&sort=dat...

jjani · 2025-03-28T16:07:52 1743178072

How much lines of context and understanding can we, human developers, keep in our heads, taken into account and refer to when implementing something?

Whatever the amount may be, it definitely fits into 300k tokens.

mplanchard · 2025-03-29T02:04:54 1743213894

After more than a few years working on a codebase? Quite a lot. I know which interfaces I need and from where, what the general areas of the codebase are, and how they fit together, even if I don’t remember every detail of every file.

simoncion · 2025-03-28T07:28:59 1743146939

> But it’s not fair to say that we’re just holding it wrong.

<troll>Have you considered that asking it to solve problems in areas it's bad at solving problems is you holding it wrong?</troll>

But, actually seriously, yeah, I've been massively underwhelmed with the LLM performance I've seen, and just flabbergasted with the subset of programmer/sysadmin coworkers who ask it questions and take those answers as gospel. It's especially frustrating when it's a question about something that I'm very knowledgeable about, and I can't convince them that the answer they got is garbage because they refuse to so much as glance at supporting documentation.