
> Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air. Not every use case is like this, but there are many.

The problem is that I feel I am constantly being bombarded by people bullish on AI saying "look how great this is," but when I try to do the exact same things they are doing, it doesn't work very well for me.

Of course I am skeptical of positive claims as a result.



I don't know what you are doing or why it's failed. Maybe my primary use cases really are in the top whatever percentile for AI usefulness, but it doesn't feel like it. All I know is that frontier models have already been good enough for more than a year to increase my productivity by a fair bit.


Your use case is in fact in the top whatever percentile for AI usefulness: short, simple scripting that won't have to be relied on because it will never be widely deployed. No large codebase it has to comb through, no need for thorough maintenance and update management, no need for efficient (and potentially rare) solutions.

The only use case that would beat yours is the office worker who cannot write professional-sounding emails but has to send them out manually on a regular basis.


I fully believe it's far better at the kind of coding/scripting that I do than the kind that real SWEs do. If for no other reason than the coding itself that I do is far far simpler and easier, so of course it's going to do better at it. However, I don't really believe that coding is the only use case. I think that there are a whole universe of other use cases that probably also get a lot of value from LLMs.

I think that HN has a lot of people who are working on large software projects that are incredibly complex and have a huge number of interdependencies, and LLMs aren't quite to the point that they can very usefully contribute to that except around the edges.

But I don't think that generalizing from that failure is very useful either. Most things humans do aren't that hard. There is a reason that SWE is one of the best-paid jobs in the country.


Even a 1 month project with one good senior engineer working on it will get 20+ different files and 5,000+ loc.

Real programming is on a totally different scale than what you're describing.

I think that's true for most jobs. Superficially, an AI looks like it can do well.

But LLMs:

1. Hallucinate all the time. If they were human, we'd call them compulsive liars.

2. Are consistently inconsistent, so are useless for automation.

3. Are only good at things they can copy from their data set. They can't create, only regurgitate other people's work.

4. AI influencing hasn't happened yet, but it will very soon start making LLMs useless, much like SEO has ruined search. You can bet a load of people are already seeding the internet with advertising and misinformation aimed solely at AIs and AI reinforcement.


> Even a 1 month project with one good senior engineer working on it will get 20+ different files and 5,000+ loc.

For what it's worth, I mostly work on projects in the 100-200 files range, at 20-40k LoC. When using proper tooling with appropriate models, it boosts my productivity by at least 2x (being conservative). I've experimented with this by going a few days without using them, then using them again.

Definitely far from the massive codebases many on here work on, small beans by HN standards. But also decidedly not just writing one-off scripts.


> Real programming is on a totally different scale than what you're describing.

How "real" are we talking?

When I think of "real programming" I think of flight control software for commercial airplanes and, I can assure you, 1 month != 5,000 LoC in that space.


It's not about the size; it's more about whether the task is trivial.


And... I know people who now use AI to write their professional-sounding emails, and they often don't sound as professional as they think they do. If you aren't careful, it's easy to skim what an AI generates and think it's okay to send; but the people you send those emails to actually have to read what was written and try to understand it, and doing that makes you notice things a brief skim doesn't catch.

It's actually extremely irritating that I'm only half talking to the person when I email with these people.


It's kinda like machine-translated novels. You have to be really passionate about a novel to endure those kinds of translations. That's when you realize how much work novel translators do to get a coherent result.


It's especially jarring when you have read translations that put thought into them. I noticed this in xianxia, i.e. Chinese power fantasy, where the choice of what to translate versus what to transliterate can have a huge impact. Editorial work also becomes important when something earlier needs to be changed based on later information.


I literally had a developer of an open source package I’m working with tell me “yeah that’s a known problem, I gave up on trying to fix it. You should just ask ChatGPT to fix it, I bet it will immediately know the answer.”

Annoying response of course. But I’d never used an LLM to debug before, so I figured I’d give it a try.

First: it regurgitated a bunch of documentation and basic debugging tips, which might have actually been helpful if I had just encountered this problem and had put no thought into debugging it yet. In reality, I had already spent hours on the problem. So: not helpful.

Second: I provided some further info on environment variables I thought might be the problem. It latched on to that. “Yes that’s your problem! These environment variables are (causing the problem) because (reasons that don’t make sense). Delete them and that should fix things.” I deleted them. It changed nothing.

Third: It hallucinated a magic numpy function that would solve my problem. I informed it this function did not exist, and it wrote me a flowery apology.
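One cheap defence against this failure mode is to check that a function an LLM names actually exists before building on it. A minimal sketch using only the standard library (the module and attribute names below are stand-ins for illustration, not the actual hallucinated numpy function):

```python
import importlib


def api_exists(module_name, attr):
    """Cheap sanity check: does the function the LLM named actually exist?"""
    try:
        return hasattr(importlib.import_module(module_name), attr)
    except ImportError:
        return False


api_exists("json", "dumps")       # True: a real function
api_exists("json", "magic_fill")  # False: the kind of name LLMs invent
```

Thirty seconds with a check like this (or just `help()` in a REPL) is far cheaper than implementing around a function that was never there.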

Clearly AI coding works great for some people, but this was purely an infuriating distraction. Not only did it not solve my problem, it wasted my time and energy, and threw tons of useless and irrelevant information at me. Bad experience.


The biggest thing I've found is that if you give any hint at all as to what you think the problem is, the LLM will immediately and enthusiastically agree, no matter how wildly incorrect your suggestion is.

If I give it all my information and add "I think the problem might be X, but I'm not sure", the LLM always agrees that the problem is X and will reinterpret everything else I've said to 'prove' me right.

Then the conversation is forever poisoned and I have to restart an entirely new chat from scratch.

98% of the utility I've found in LLMs is getting it to generate something nearly correct, but which contains just enough information for me to go and Google the actual answer. Not a single one of the LLMs I've tried have been any practical use editing or debugging code. All I've ever managed is to get it to point me towards a real solution, none of them have been able to actually independently solve any kind of problem without spending the same amount of time and effort to do it myself.


> The biggest thing I've found is that if you give any hint at all as to what you think the problem is, the LLM will immediately and enthusiastically agree, no matter how wildly incorrect your suggestion is.

I'm seeing this sentiment a lot in these comments, and frankly it shows that very few here have actually gone and tried the variety of models available. Which is totally fine, I'm sure they have better stuff to do, you don't have to keep up with this week's hottest release.

To be concrete - the symptom you're talking about is very typical of Claude (or earlier GPT models). o3-mini is much less likely to do this.

Secondly, prompting absolutely goes a huge way to avoiding that issue. Like you're saying - if you're not sure, don't give hints, keep it open-minded. Or validate the hint before starting, in a separate conversation.


I literally got this problem earlier today on ChatGPT, which claims to be based on o4-mini. So no, does not sound like it's just a problem with Claude or older GPTs.

And on "prompting", I think this is a point of friction between LLM boosters and haters. To the uninitiated, most AI hype sounds like "it's amazing magic!! just ask it to do whatever you want and it works!!" When they try it and it's less than magic, hearing "you're prompting it wrong" seems more like a circular justification of a cult follower than advice.

I understand that it's not - that, genuinely, it takes some experience to learn how to "prompt good" and use LLMs effectively. I buy that. But some more specific advice would be helpful. Cause as is, it sounds more like "LLMs are magic!! didn't work for you? oh, you must be holding it wrong, cause I know they infallibly work magic".


> I understand that it's not - that, genuinely, it takes some experience to learn how to "prompt good" and use LLMs effectively

I don't buy this at all.

At best, "learning to prompt" is just hitting the slot machine over and over until you get something close to what you want, which is not a skill. This is what I see when people "have a conversation with the LLM".

At worst, you are a victim of the sunk cost fallacy, believing that because you spent time on a thing you have developed a skill for it, when really no skill is involved. As a result you delude yourself into thinking the output is better: not because it actually is, but because you spent time on it, so it must be.


On the other hand, when it works it's darn near magic.

I spent like a week trying to figure out why a livecd image I was working on wasn't initializing devices correctly. Read the docs, read source code, tried strace, looked at the logs, found forums of people with the same problem but no solution, you know the drill. In desperation I asked ChatGPT. ChatGPT said "Use udevadm trigger". I did. Things started working.

For some problems it's just very hard to express them in a googleable form, especially if you're doing something weird almost nobody else does.


I started (re)using AI recently. It (and I) mostly failed until I decided on a rule:

if it's "dumb and annoying", I ask the AI; otherwise I do it myself.

Since then, AI has been saving me a lot of time on dumb and annoying things.

Also, a few models are pretty good for basic physics/modeling stuff (getting basic formulas, fetching constants, doing some calculations). These are also pretty useful. I recently used one for ventilation/CO2-related stuff in my room, and the calculations matched observed values pretty well. Then it produced a formula in broken Desmos syntax, I fixed that by hand, and we were good to go!

---

(dumb and annoying thing -> time-consuming to generate with no "deep thought" involved, easy to check)
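For what it's worth, the ventilation/CO2 calculation being described boils down to first-order decay toward the outdoor concentration in a well-mixed room. A rough sketch with made-up numbers (these are illustrative values, not the commenter's actual room or figures):

```python
import math

V = 30.0       # room volume, m^3 (assumed)
Q = 60.0       # ventilation airflow, m^3/h (assumed)
c_out = 420.0  # outdoor CO2, ppm
c0 = 1500.0    # starting indoor CO2, ppm


def co2(t_hours):
    """Indoor CO2 after t hours with no occupants: exponential decay
    toward the outdoor level at rate Q/V (air changes per hour)."""
    return c_out + (c0 - c_out) * math.exp(-(Q / V) * t_hours)


co2(1.0)  # ~566 ppm after an hour at two air changes per hour
```

This is exactly the "easy to check" kind of output: the formula fits on one line, and the result can be compared against a CO2 monitor.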


> For some problems it's just very hard to express them in a googleable form

I had an issue where my Mac would report that my tethered iPhone's batteries were running low when the battery was in fact fine. I had tried googling an answer, and found many similar-but-not-quite-the-same questions and answers. None of the suggestions fixed the issue.

I then asked the 'MacOS Guru' model for chatGPT my question, and one of the suggestions worked. I feel like I learned something about chatGPT vs Google from this - the ability of an LLM to match my 'plain English question without a precise match for the technical terms' is obviously superior to a search engine. I think google etc try synonyms for words in the query, but to me it's clear this isn't enough.


If the solution to "devices not setting up" was "udevadm trigger" and it took an LLM suggestion to get there, I question your google skills.

When I google "linux device not initializing correctly", someone suggesting "udevadm trigger" is the 5th result


Google isn't the same for everyone. Your results could be very different from mine. They're probably not quite the same as months ago either.

I may also have accidentally made it harder by using the wrong word somewhere. A good part of the difficulty of googling for a vague problem is figuring out how to even word it properly.

Also of course it's much easier now that I tracked down what the actual problem was and can express it better. I'm pretty sure I wasn't googling for "devices not initializing" at the time.

But this is where I think LLMs offer a genuine improvement -- being able to deal with vagueness better. Google works best if you know the right words, and sometimes you don't.


There is a difference between a directly correct answer and a “5th result”


There is, but if it's the 5th result then either that exact wording is magic or something is wrong with the story.

And it might not have been the first and only thing ChatGPT said. It got there fast but 5th result isn't too slow either.


Honestly this says more about how bad Google has become than about how good GPT is


This morning I was using an LLM to develop some SQL queries against a database it had never seen before. I gave it a starting point, and outlined what I wanted to do. It proposed a solution, which was a bit wrong, mostly because I hadn't given it the full schema to work with. Small nudges and corrections, and we had something that worked. From there, I iterated and added more features to the outputs.

At many points, the code would have an error; to deal with this, I just supply the error message, as-is to the LLM, and it proposes a fix. Sometimes the fix works, and sometimes I have to intervene to push the fix in the right direction. It's OK - the whole process took a couple hours, and probably would have been a whole day if I were doing it on my own, since I usually only need to remember anything about SQL syntax once every year or three.

A key part of the workflow, imo, was that we were working in the medium of the actual code. If the code is broken, we get an error, and can iterate. Asking for opinions doesn't really help...
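The "supply the error message as-is" loop is easy to mechanise, because the code itself produces the feedback. A minimal sketch of the pattern using sqlite3 (the table and the deliberately broken query are invented for illustration):

```python
import sqlite3


def try_sql(conn, sql):
    """Run a candidate query. Return (rows, None) on success, or
    (None, error_text) - the error text is exactly what you'd paste
    back to the LLM verbatim."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as e:
        return None, str(e)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# A typo of the kind an LLM (or a human) produces:
rows, err = try_sql(conn, "SELECT nmae FROM users")
# err reads something like: no such column: nmae
```

Feeding `err` straight back, unedited, is usually enough signal for the model to propose a fix; the iteration stays grounded in the actual code rather than in opinions.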


I often wonder if people who report that LLMs are useless for code haven't cracked the fact that you need to have a conversation with it - expecting a perfect result after your first prompt is setting it up for failure; the real test is whether you can get to a working solution after iterating with it for a few rounds.


As someone who has finally found a way to increase productivity by adding some AI, my lesson has sort of been the opposite. If the initial response after you've provided the relevant context isn't obviously useful: give up. Maybe start over with slightly different context. A conversation after a bad result won't provide any signal you can do anything with, there is no understanding you can help improve.

It will happily spin forever responding in whatever tone is most directly relevant to your last message: provide an error and it will suggest you change something (it may even be correct every once in a while!), suggest a change and it'll tell you you're obviously right, suggest the opposite and you will be right again, ask if you've hit a dead end and yeah, here's why. You will not learn anything or get anywhere.

A conversation will only be useful if the response you got just needs tweaks. If you can't tell what it needs feel free to let it spin a few times, but expect to be disappointed. Use it for code you can fully test without much effort, actual test code often works well. Then a brief conversation will be useful.


Why would I do this, when I can just write it from scratch in less time than it takes you to have this conversation with the LLM?


Because once you get good at using LLMs you can write it with 5 rounds with an LLM in way less time than it would have taken you to type out the whole thing yourself, even if you got it exactly right first time coding it by hand.


I suspect this is only true if you are lousy at writing code or have a very slow typing speed


I suspect the opposite is only true if you haven't taken the time to learn how to productively use LLMs for coding.

(I've written a fair bit about this: https://simonwillison.net/tags/ai-assisted-programming/ and https://simonwillison.net/2025/Mar/11/using-llms-for-code/ and 80+ examples of tools I've built mostly with LLMs on https://tools.simonwillison.net/colophon )


Maybe I've missed it, but what did you use to perform the actual code changes on the repo?


You mean for https://tools.simonwillison.net/colophon ?

I've used a whole bunch of techniques.

Most of the code in there is directly copied and pasted in from https://claude.ai or https://chatgpt.com - often using Claude Artifacts to try it out first.

Some changes are made in VS Code using GitHub Copilot

I've used Claude Code for a few of them https://docs.anthropic.com/en/docs/agents-and-tools/claude-c...

Some were my own https://llm.datasette.io tool - I can run a prompt through that and save the result straight to a file

The commit messages usually link to either a "share" transcript or my own Gist showing the prompts that I used to build the tool in question.


So the main advantage is that LLMs can type faster than you?


Yes, exactly.


Burning down the rainforests so I don’t have to wait for my fingers.


The environmental impact of running prompts through (most) of these models is massively over-stated.

(I say "most" because GPT-4.5 is 1000x the price of GPT-4o-mini, which implies to me that it burns a whole lot more energy.)


If you do a basic query to GPT-4o every ten seconds it uses a blistering... hundred watts or so. More for long inputs, less when you're not using it that rapidly.
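The arithmetic behind that figure, assuming a commonly cited rough estimate of about 0.3 Wh per short query (an assumption for illustration, not a measured number):

```python
wh_per_query = 0.3      # rough per-query energy estimate, Wh (assumed)
seconds_between = 10    # one query every ten seconds

# Average power draw: energy per query divided by time per query.
# 0.3 Wh / (10 s / 3600 s per h) = 108 W
avg_watts = wh_per_query / (seconds_between / 3600)
```

So continuous, rapid-fire querying averages on the order of a hundred watts, i.e. roughly one bright incandescent bulb.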


This is honestly really unimpressive

Typing speed is not usually the constraint for programming, for a programmer that knows what they are doing

Creating the solution is the hard work, typing it out is just a small portion of it


I know. That's why I've consistently said that LLMs give me a 2-5x productivity boost on the portion of my job which involves typing code into a computer... which is only about 10% of what I do. (One recent example: https://simonwillison.net/2024/Sep/10/software-misadventures... )

(I get boosts from LLMs to a bunch of activities too, like researching and planning, but those are less obvious than the coding acceleration.)


> That's why I've consistently said that LLMs give me a 2-5x productivity boost on the portion of my job which involves typing code into a computer... which is only about 10% of what I do

This explains it then. You aren't a software developer

You get a productivity boost from LLMs when writing code because it's not something you actually do very much

That makes sense

I write code for probably between 50-80% of any given week, which is pretty typical for any software dev I've ever worked with at any company I've ever worked at

So we're not really the same. It's no wonder LLMs help you, you code so little that you're constantly rusty


I'm a software developer: https://github.com/simonw

I very much doubt you spend 80% of your working time actively typing code into a computer.

My other activities include:

- Researching code. This is a LOT of my time - reading my own code, reading other code, reading through documentation, searching for useful libraries to use, evaluating if those libraries are any good.

- Exploratory coding in things like Jupyter notebooks, Firefox developer tools etc. I guess you could call this "coding time", but I don't consider it part of that 10% I mentioned earlier.

- Talking to people about the code I'm about to write (or the code I've just written).

- Filing issues, or updating issues with comments.

- Writing documentation for my code.

- Straight up thinking about code. I do a lot of that while walking the dog.

- Staying up-to-date on what's new in my industry.

- Arguing with people about whether or not LLMs are useful on Hacker News.


"typing code is a small portion of programming"

"I agree, only 10% of what I do is typing code"

"that explains it, you aren't a software developer"

What the hell?


You should check out Simon’s wikipedia and github pages, when you have time between your coding sprints.


You must not be learning very many new things then if you can't see a benefit to using an LLM. Sure, for the normal crud day-to-day type stuff, there is no need for an LLM. But when you are thrown into a new project, with new tools, new code, maybe a new language, new libraries, etc., then having an LLM is a huge benefit. In this situation, there is no way that you are going to be faster than an LLM.

Sure, it often spits out incomplete, non-ideal, or plain wrong answers, but that's where having SWE experience comes into play: recognizing them.


> But when you are thrown into a new project, with new tools, new code, maybe a new language, new libraries, etc., then having an LLM is a huge benefit. In this situation, there is no way that you are going to be faster than an LLM.

In the middle of this thought, you changed the context from "learning new things" to "not being faster than an LLM"

It's easy to guess why. When you use the LLM you may be productive quicker, but I don't think you can argue that you are really learning anything

But yes, you're right. I don't learn new things from scratch very often, because I'm not changing contexts that frequently.

I want to be someone who had 10 years of experience in my domain, not 1 year of experience repeated 10 times, which means I cannot be starting over with new frameworks, new languages and such over and over


"When you use the LLM you may be productive quicker, but I don't think you can argue that you are really learning anything"

Here's some code I threw together without even looking at yesterday: https://github.com/simonw/tools/blob/main/incomplete-json-pr... (notes here: https://simonwillison.net/2025/Mar/28/incomplete-json-pretty... )

Reading it now, here are the things it can teach me:

    :root {
      --primary-color: #3498db;
      --secondary-color: #2980b9;
      --background-color: #f9f9f9;
      --card-background: #ffffff;
      --text-color: #333333;
      --border-color: #e0e0e0;
    }
    body {
      font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
      line-height: 1.6;
      color: var(--text-color);
      background-color: var(--background-color);
      padding: 20px;
    }
That's a very clean example of CSS variables, which I've not used before in my own projects. I'll probably use that pattern myself in the future.

    textarea:focus {
      outline: none;
      border-color: var(--primary-color);
      box-shadow: 0 0 0 2px rgba(52, 152, 219, 0.2);
    }
Really nice focus box shadow effect there, another one for me to tuck away for later.

        <button id="clearButton">
          <svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
            <rect x="3" y="3" width="18" height="18" rx="2" ry="2"></rect>
            <line x1="9" y1="9" x2="15" y2="15"></line>
            <line x1="15" y1="9" x2="9" y2="15"></line>
          </svg>
          Clear
        </button>
It honestly wouldn't have crossed my mind that embedding a tiny SVG inline inside a button could work that well for simple icons.

      // Copy to clipboard functionality
      copyButton.addEventListener('click', function() {
        const textToCopy = outputJson.textContent;
        
        navigator.clipboard.writeText(textToCopy).then(function() {
          // Success feedback
          copyButton.classList.add('copy-success');
          copyButton.textContent = ' Copied!';
          
          setTimeout(function() {
            copyButton.classList.remove('copy-success');
            copyButton.innerHTML = '<svg xmlns="http://www.w3.org/2000/svg" width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="9" y="9" width="13" height="13" rx="2" ry="2"></rect><path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"></path></svg> Copy to Clipboard';
          }, 2000);
        });
      });
Very clean example of clipboard interaction using navigator.clipboard.writeText

And the final chunk of code on the page is a very pleasing implementation of a simple character-by-character non-validating JSON parser which indents as it goes: https://github.com/simonw/tools/blob/1b9ce52d23c1335777cfedf...
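The indent-as-you-go idea can be sketched in a few lines. This is a simplified Python analogue of the approach described, not the linked implementation:

```python
def pretty(json_text, indent=2):
    """Character-by-character, non-validating JSON indenter: track nesting
    depth, break lines on braces/brackets/commas, and pass string contents
    through untouched."""
    out, depth, in_str, esc = [], 0, False, False
    for ch in json_text:
        if in_str:                      # inside a string: copy verbatim
            out.append(ch)
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
            out.append(ch)
        elif ch in "{[":                # open: newline and indent deeper
            depth += 1
            out.append(ch + "\n" + " " * (indent * depth))
        elif ch in "}]":                # close: dedent, then newline
            depth -= 1
            out.append("\n" + " " * (indent * depth) + ch)
        elif ch == ",":
            out.append(ch + "\n" + " " * (indent * depth))
        elif ch == ":":
            out.append(": ")
        elif ch.isspace():              # drop existing whitespace
            continue
        else:
            out.append(ch)
    return "".join(out)
```

Because it never validates, it happily indents truncated JSON too, which is the whole point of the original tool.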

That's half a dozen little tricks I've learned from just one tiny LLM project which I only spent a few minutes on.

My point here is that if you actively want to learn things, LLMs are an extraordinary gift.


Exactly! I learn all kinds of things besides coding-related things, so I don't see how it's any different. ChatGPT 4o does an especially good job of walking thru the generated code to explain what it is doing. And, you can always ask for further clarification. If a coder is generating code but not learning anything, they are either doing something very mundane or they are being lazy and just copy/pasting without any thought--which is also a little dangerous, honestly.


It really depends on what you're trying to achieve.

I was trying to prototype a system and created a one-pager describing the main features, objectives, and restrictions. This took me about 45 minutes.

Then I feed it into Claude and asked to develop said system. It spent the next 15 minutes outputting file after file.

Then I ran "npm install" followed by "npm run" and got a "fully" (API was mocked) functional, mobile-friendly, and well documented system in just an hour of my time.

It'd have taken me an entire day of work to reach the same point.


Yeah nah. The endless loop of useless suggestions or "solutions" is very easily achievable and common, at least in my use cases, no matter how much you iterate with it. Iterating gets counter-productive pretty fast, imo. (Using 4o.)


When I use Claude to iterate/troubleshoot, I do it in a project and in multiple chats. If I test something and it throws an error or gives an unexpected result, I'll start a new chat to deal with that problem, correct the code, update it in the project, then go back to my main thread and say "I've updated this" and provide it the file, "now let's do this". When I started doing this it massively reduced the LLM getting lost or going off on weird quests. Iteration in side chats, regroup in the main thread. And then possibly another overarching "this is what I want to achieve" thread where I update it on the progress and ask what we should do next.


I have been thinking about this a lot recently. I have a colleague who simply can’t use LLMs for this reason - he expects them to work like a logical and precise machine, and finds interacting with them frustrating, weird and uncomfortable.

However, he has a very black and white approach to things and he also finds interacting with a lot of humans frustrating, weird and uncomfortable.

The more conversations I see about LLMs the more I’m beginning to feel that “LLM-whispering” is a soft skill that some people find very natural and can excel at, while others find it completely foreign, confusing and frustrating.


It really requires self-discipline to ignore the enthusiasm of the LLM as a signal for whether you are moving in the direction of a solution. I blame myself for lazy prompting, but I have a hard time not just jumping in with a quick prompt and hoping the LLM can get somewhere with it, rather than scoping the problem carefully and not attempting things that are impossible.


> OK - the whole process took a couple hours, and probably would have been a whole day if I were doing it on my own, since I usually only need to remember anything about SQL syntax once every year or three

If you have any reasonable understanding of SQL, I guarantee you could brush up on it and write it yourself in less than a couple of hours unless you're trying to do something very complex

SQL is absolutely trivial to write by hand


Obviously to a mega super genius like yourself an LLM is useless. But perhaps you can consider that others may actually benefit from LLMs, even if you’re way too talented to ever see a benefit?

You might also consider that you may be over-indexing on your own capabilities rather than evaluating the LLM’s capabilities.

Let's say an LLM is only 25% as good as you but is 10% the cost. Surely you'd acknowledge there may be tasks that are better outsourced to the LLM than to you, strictly from an ROI perspective?

It seems like your claim is that since you’re better than LLMs, LLMs are useless. But I think you need to consider the broader market for LLMs, even if you aren’t the target customer.


Knowing SQL isn't being a "mega super genius" or "way talented". SQL is flawed, but being hard to learn is not among its flaws. It's designed for untalented COBOL mainframe programmers on the theory that Codd's relational algebra and relational calculus would be too hard for them and prevent the adoption of relational databases.

However, whether SQL is "trivial to write by hand" very much depends on exactly what you are trying to do with it.


Sure, I could do that. But I would learn where to put my join statements relative to the where statements, and then forget it again in a month because I have lots of other things I actually need to know on a daily basis. I can easily outsource the boilerplate to the LLM and get to a reasonable starting place for free.

Think of it as managing cognitive load. Wandering off to relearn SQL boilerplate is a distraction from my medium-term goal.

edit: I also believe I'm less likely to get a really dumb 'gotcha' if I start from the LLM rather than cobbling together knowledge from some random docs.
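The boilerplate in question is mostly clause ordering: joins live in the FROM clause, and WHERE comes after them. A sketch via Python's sqlite3, with an invented two-table schema purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER, name TEXT);
CREATE TABLE orders (id INTEGER, user_id INTEGER, total REAL);
INSERT INTO users  VALUES (1, 'ada'), (2, 'bob');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

rows = conn.execute("""
    SELECT u.name, SUM(o.total) AS spent
    FROM orders AS o
    JOIN users  AS u ON u.id = o.user_id   -- joins belong here, in FROM
    WHERE o.total > 1                      -- filtering comes after the joins
    GROUP BY u.name                        -- then grouping,
    ORDER BY spent DESC                    -- then ordering
""").fetchall()
# rows: [('ada', 15.0), ('bob', 7.5)]
```

Exactly the kind of clause-order scaffolding that is easy to forget between once-a-year SQL sessions and easy to verify once it runs.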


If you don’t take care to understand what the LLM outputs, how can you be confident that it works in the general case, edge cases and all? Most of the time that I spend as a software engineer is reasoning about the code and its logic to convince myself it will do the right thing in all states and for all inputs. That’s not something that can be offloaded to an LLM. In the SQL case, that means actually understanding the semantics and nuances of the specific SQL dialect.


That makes sense, and from what I've heard this sort of simple quick prototyping is where LLM coding works well. The problem with my case was that I'm working with multiple large code bases, and couldn't pinpoint the problem to a specific line, or even a specific file. So I wasn't going to just copy multiple git repos into the chat.

(The details: I was working with running a Bayesian sampler across multiple compute nodes with MPI. There seemed to be a pathological interaction between the code and MPI where things looked like they were working, but never actually progressed.)


I wonder if it breaks like this: people who don't know how to code find LLMs very helpful and don't realize where they are wrong. People who do know immediately see all the things they get wrong and they just give up and say "I'll do it myself".


> Small nudges and corrections, and we had something that worked. From there, I iterated and added more features to the outputs.

FWIW, I've seen people online refer to this as "vibe coding".


This is exactly my experience, every time! If I offer it the slightest bit of context it will say 'Ah! I understand now! Yes, that is your problem, …' and proceed to spit out some non-existent function, sometimes the same one it has just suggested a few prompts ago which we already decided doesn't exist/work. And it just goes on and on giving me 'solutions' until I finally realise it doesn't have the answer (which it will never admit unless you specifically ask it to – forever looking to please) and give up.


My experiences have all been like this too. I am puzzled by how some people say it works for them


I wrote this article precisely for people who are having trouble getting good results out of LLMs for coding: https://simonwillison.net/2025/Mar/11/using-llms-for-code/


I’ve followed your blog for a while, and I have been meaning to unsubscribe because the deluge of AI content is not what I’m looking for.

I read the linked article when it was posted, and I suspect a few things that are skewing your own view of the general applicability of LLMs for programming. One, your projects are small enough that you can reasonably provide enough context for the language model to be useful. Two, you’re using the most common languages in the training data. Three, because of those factors, you’re willing to put much more work into learning how to use it effectively, since it can actually produce useful content for you.

I think it’s great that it’s a technology you’re passionate about and that it’s useful for you, but my experience is that in the context of working in a large systems codebase with years of history, it’s just not that useful. And that’s okay, it doesn’t have to be all things to all people. But it’s not fair to say that we’re just holding it wrong.


"my experience is that in the context of working in a large systems codebase with years of history, it’s just not that useful."

It's possible that changed this week with Gemini 2.5 Pro, which is equivalent to Claude 3.7 Sonnet in terms of code quality but has a 1 million token context (with excellent scores on long context benchmarks) and an increased output limit too.

I've been dumping hundreds of thousands of tokens of codebase into it and getting very impressive results.


See this is one of the things that’s frustrating about the whole endeavor. I give it an honest go, it’s not very good, but I’m constantly exhorted to try again because maybe now that Model X 7.5qrz has been released, it’ll be really different this time!

It’s exhausting. At this point I’m mostly just waiting for it to stabilize and plateau, at which point it’ll feel more worth the effort to figure out whether it’s now finally useful for me.


Not going to disagree that it's exhausting! I've been trying to stay on top of new developments for the past 2.5 years and there are so many days when I'll joke "oh, great, it's another two new models day".

Just on Tuesday this week we got the first widely available high quality multi-modal image output model (GPT-4o images) and a new best-overall model (Gemini 2.5) within hours of each other. https://simonwillison.net/2025/Mar/25/


> One, your projects are small enough that you can reasonably provide enough context for the language model to be useful. Two, you’re using the most common languages in the training data. Three, because of those factors, you’re willing to put much more work into learning how to use it effectively, since it can actually produce useful content for you.

Take a look at the 2024 StackOverflow survey.

70% of professional developer respondents had only done extensive work over the last year in one of:

JavaScript 64.6%, SQL 54.1%, HTML/CSS 52.9%, Python 46.9%, TypeScript 43.4%, Bash/Shell 34.2%, Java 30%

LLMs are of course very strong in all of these. 70% of developers only code in languages LLMs are very strong at.

If anything, for the developer population at large, this number is even higher than 70%. The survey respondents are overwhelmingly American (where the dev landscape is more diverse), and self-select to those who use niche stuff and want to let the world know.

A similar argument can be made for median codebase size, in terms of LOC written every year. A few days ago he also gave Gemini Pro 2.5 a whole codebase (at ~300k tokens) and it performed well. Even in huge codebases, if any kind of separation of concerns is involved, that's enough to give it all the context relevant to the part of the code you're working on. [1]

[1] https://simonwillison.net/2025/Mar/25/gemini/


What’s 300k tokens in terms of lines of code? Most codebases I’ve worked on professionally have easily eclipsed 100k lines, not including comments and whitespace.
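As a rough back-of-the-envelope answer (the chars-per-token ratio is the common ~4:1 heuristic, an approximation; real tokenizers vary by language and style):

```python
# Estimate how many tokens a codebase consumes in a model's context
# window, using the rough heuristic of ~4 characters per token.
def estimate_tokens(num_lines: int, avg_chars_per_line: int = 35) -> int:
    return num_lines * avg_chars_per_line // 4

# A 100k-line codebase at ~35 chars/line is on the order of 875k tokens,
# well past a 300k window; a ~30k-line subsystem fits comfortably.
print(estimate_tokens(100_000))  # → 875000
print(estimate_tokens(30_000))   # → 262500
```

So 300k tokens is very roughly 30-40k lines of code, which is why "whole codebase" claims depend heavily on how big your codebase actually is.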

But really that’s the vision of actual utility that I imagined when this stuff first started coming out and that I’d still love to see: something that integrates with your editor, trains on your giant legacy codebase, and can actually be useful answering questions about it and maybe suggesting code. Seems like we might get there eventually, but I haven’t seen that we’re there yet.


We hit "can actually be useful answering questions about it" within the last ~6 months with the introduction of "reasoning" models with 100,000+ token context limits (and the aforementioned Gemini 1 million/2 million models).

The "reasoning" thing is important because it gives models the ability to follow execution flow and answer complex questions that span many different files and classes. I'm finding it incredible for debugging, eg: https://gist.github.com/simonw/03776d9f80534aa8e5348580dc6a8...

I built a files-to-prompt tool to help dump entire codebases into the larger models and I use it to answer complex questions about code (including other people's projects written in languages I don't know) several times a week. There's a bunch of examples of that here: https://simonwillison.net/search/?q=Files-to-prompt&sort=dat...
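The shape of the technique is simple enough to sketch generically (this is not files-to-prompt's actual code, just a hypothetical minimal version of the idea): walk a repo, concatenate each file with a path header so the model can cite locations, and feed the result to a large-context model.

```python
from pathlib import Path

def dump_codebase(root: str, exts=(".py", ".md")) -> str:
    """Concatenate source files under `root` into one prompt string,
    labelling each file with its path so the model can reference
    specific files when answering questions about the code."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"--- {path} ---\n{path.read_text(errors='replace')}")
    return "\n\n".join(parts)
```

The path headers matter: without them the model can still answer "what does this do" questions, but it can't tell you where to look.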


How many lines of context and understanding can we, human developers, keep in our heads, take into account, and refer to when implementing something?

Whatever the amount may be, it definitely fits into 300k tokens.


After more than a few years working on a codebase? Quite a lot. I know which interfaces I need and from where, what the general areas of the codebase are, and how they fit together, even if I don’t remember every detail of every file.


> But it’s not fair to say that we’re just holding it wrong.

<troll>Have you considered that asking it to solve problems in areas it's bad at solving problems is you holding it wrong?</troll>

But, actually seriously, yeah, I've been massively underwhelmed with the LLM performance I've seen, and just flabbergasted with the subset of programmer/sysadmin coworkers who ask it questions and take those answers as gospel. It's especially frustrating when it's a question about something that I'm very knowledgeable about, and I can't convince them that the answer they got is garbage because they refuse to so much as glance at supporting documentation.


LLMs need to stay bad. What is going to happen if we have another few GPT-3.5 to Gemini 2.5 sized steps? You're telling people who need to keep the juicy SWE gravy train running for another 20 years to recognize that the threat is indeed very real. The writing is on the wall and no one here (here on HN especially) is going to celebrate those pointing to it.


I don't think people really realize the danger of mass unemployment.

Go look up what happens in history when tons of people are unemployed at the same time with no hope of getting work. What happens when the unemployed masses become desperate?

Naw I'm sure it will be fine, this time will be different


Just wanted to chime in and say how appreciative I’ve been about all your replies here, and overall content on AI. Your takes are super reasonable and well thought out.


Exactly which model did you use? You talk about LLMs as though they are all the same.

Alien 1: I gave Jeff Dean a giant complex system to build, he crushed it! Humans are so smart.

Alien 2: I gave a random human a simple programming problem and he just stared at me like an idiot. Humans suck.


It's worse than that.

I see people say, "Look how great this is," and show me an example, and the example they show me is just not great. We're literally looking at the same thing, and they're excited that this LLM can do a college grad's job to the level of a third grader, and I'm just not excited about that.


What changed my point of view regarding LLMs was when I realized how crucial context is in increasing output quality.

Treat the AI as a freelancer working on your project. How would you ask a freelancer to create a Kanban system for you? By simply asking "Create a Kanban system", or by providing them a 2-3 pages document describing features, guidelines, restrictions, requirements, dependencies, design ethos, etc?

Which approach will get you closer to your objective?

The same applies to LLM (when it comes to code generation). When well instructed, it can quickly generate a lot of working code, and apply the necessary fixes/changes you request inside that same context window.

It still can't generate senior-level code, but it saves hours when doing grunt work or prototyping ideas.

"Oh, but the code isn't perfect".

Nor is the code of the average jr dev, but their code still makes it to production in thousands of companies around the world.


I see it as a knowledge multiplier. You still need to know enough about the subject to verify the output.


They're sophisticated tools, as much as any other software.

About 2 weeks ago I started on a streaming markdown parser for the terminal because none really existed. I've switched to human coding now but the first version was basically all LLM prompting and a bunch of the code is still LLM generated (maybe 80%). It's a parser, those are hard. There's stacks, states, lookaheads, look-behinds, feature flags, color spaces, support for things like links and syntax highlighting... all forward streaming. Not easy.

https://github.com/kristopolous/Streamdown


Exactly. Thanks to all the money riding on the hype, the incentives will always skew toward spamming naive optimism about its capabilities.

