I spent about a week building a greenfield app as an "experiment". I saw 4 types of issues:
0. It runs way too fast and far ahead. You need to slow it down, force planning only, and explicitly lay out a multi-step (i.e. numbered) plan, saying "we'll do #1 first, then do the rest in future steps".
take-away:
This is likely solved with experience and changing how I work - or maybe caring less? The problem is the model can produce much faster than you can consume, and it runs down dead ends that destroy YOUR context. If you were running a bunch of autonomous agents this would be less noticeable, but it would make issues 1-3 worse and get very expensive.
1. Lots of "just plain wrong" details. You catch these while developing or testing because they don't work, or you know from experience they're wrong just by looking at them. Or you've already corrected the same mistake and need to point the model back at the earlier context.
take-away:
If you were vibe coding you'd solve all of these eventually. Addressing this (and #0) with "MORE AI" would probably help (i.e. AI to play with / validate the app, etc.).
2. Serious runtime issues that are not necessarily bugs. Examples: it made a lot of client-side API endpoints public that didn't even need to exist, or at least needed to be scoped to the current authenticated user. It missed basic filtering and the SQL clauses that constrained data. It hardcoded important values (not necessarily secrets) like ports. It made assumptions that worked fine in development but could be big issues in production (see the sketch after the take-away).
take-away:
AI starts to build traps here. Vibe coders are in big trouble because everything appears to work, but "it works" isn't really the end goal. Problems could range from 3am downtime call-outs to getting your infrastructure owned or suffering data breaches. More serious: experienced devs who go all-in on autonomous coding might be three months out from their last manual code review and end up in the same position as a vibe coder. You'd need a week or more to onboard, figure out what was going on, and fix it, which is probably too late.
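To make #2 concrete, here's a minimal sketch in TypeScript/Express (the original stack isn't stated, and the route, table, and helper names are all hypothetical) of the kind of fix it calls for: require auth, push the ownership filter into the SQL itself, and read the port from the environment instead of hardcoding it:

    import express, { Request, Response, NextFunction } from "express";

    // Stub DB client so the sketch is self-contained; swap in pg/knex/etc.
    const db = {
      query: async (_sql: string, _params: unknown[] = []): Promise<unknown[]> => [],
    };

    interface AuthedRequest extends Request {
      user?: { id: string }; // assume upstream middleware populates this from a session/JWT
    }

    function requireAuth(req: AuthedRequest, res: Response, next: NextFunction) {
      if (!req.user) {
        res.status(401).end(); // reject anonymous callers instead of leaving the endpoint public
        return;
      }
      next();
    }

    const app = express();

    // The trap: a public endpoint returning unfiltered data, e.g.
    //   app.get("/api/orders", async (_req, res) =>
    //     res.json(await db.query("SELECT * FROM orders")));

    // The fix: scope the route to the authenticated user and constrain the query.
    app.get("/api/orders", requireAuth, async (req: AuthedRequest, res: Response) => {
      const rows = await db.query(
        "SELECT id, total, created_at FROM orders WHERE user_id = $1",
        [req.user!.id],
      );
      res.json(rows);
    });

    // Port from the environment (dev-only fallback) instead of a hardcoded value.
    app.listen(Number(process.env.PORT ?? 3000));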
3. It made (at least) one huge architectural mistake (this is a pretty simple project so I'm not sure there's space for more). I saw it coming but kept going in the spirit of my experiment.
take-away:
TBD. I'm going to try to use AI to refactor this, but it is non-trivial. It could take as long as the initial app did to fix. If you followed the current pro-AI narrative you'd only notice it when your app started to intermittently fail - or when you got your cloud provider's bill.
I'm a product manager, and a lot of what I see people do wrong comes from not having any product management experience. It takes quite a bit of work to develop a really good theory of what should be in your functional spec. Edge cases come up all the time in real software engineering, and handling them is often spread across multiple engineers. A good product manager has a view of all of it, expects many of those issues from the agent, and plans for coaching it through them.
I'm an engineer and I totally agree. Engineers + LLMs exacerbate the timeless problem of not understanding the reality behind the problem. Validating solutions against reality is hard and LLMs just hallucinate their way around unknowns.
I think that's an incredibly reductionist and sarcastic take. I'm also in Product, but was an engineer for over a decade prior. I find that having strong structured functional specifications and a good holistic understanding of the solution you're trying to build goes a long way with AI tooling. Just like any software project, eliminating false starts and getting a clear set of requirements up front can minimize engineering time required to complete something, as long as things don't change in the middle. When your cycle time is an afternoon instead of two quarters, that type of up front investment pays off much better.
I still think AI tooling is lacking, but you can get significantly better results by structuring your plans appropriately.
I understand that you are serious. I am also serious here.
Have you built anything purely with an LLM that is novel and is used by people who expect their data to be managed securely, and the application to be well maintained so they can trust it?
I have been writing specifications, RFCs, and ADRs, and conducting architecture reviews, code reviews, and whatnot for quite a bit of time now. I've also driven cross-organizational product initiatives, etc. I'm experimenting with OpenSpec with my team now on a brownfield project and have some good results.
Having said all that, I seriously think that if you treat the English-language spec and your PM oversight as the sole QA pillars for a stochastic transformer model, you are making a mistake.
The issue is that validation needs presence, and that is the limiting factor - common knowledge, but it's part of the "physics". Maintenance also gets really tricky if the codebase has warts in it - which it will. I get a much easier-to-understand architecture out of an LLM-driven code generation process if I follow it closely and course-correct / update the spec process based on learnings.
Example: yesterday I introduced a batch job and realized during the implementation phase that some refactoring was needed so the error boundary from the main backend could be reused in the batch application. This was unplanned and definitely not a functional requirement - it could be documented as a non-functional one. There was a gap between the agent's knowledge and mine even though the error handling pattern is well documented in the repository. Of course this can be documented better next time if we update the OpenSpec writing process, but gaps like these are inevitable unless formal or semi-formal definitions are introduced - and even then someone with "fresh eyes" needs to stay in the loop. (A rough sketch of the refactor follows.)
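For illustration, roughly the shape of that refactor, sketched in TypeScript with names of my own invention (the actual repo and its error handling pattern aren't shown here): extract the error boundary from the HTTP backend into a shared module so the batch job can wrap each unit of work in the same handling:

    // error-boundary.ts (hypothetical shared module, extracted from the HTTP backend)
    export class AppError extends Error {
      constructor(message: string, public readonly retryable = false) {
        super(message);
      }
    }

    // Runs `work`, funneling failures through the same logging/alerting
    // path the request handlers already use.
    export async function withErrorBoundary<T>(
      label: string,
      work: () => Promise<T>,
    ): Promise<T | undefined> {
      try {
        return await work();
      } catch (err) {
        const retryable = err instanceof AppError && err.retryable;
        console.error(`[${label}] failed (retryable=${retryable})`, err);
        if (!retryable) throw err; // fatal errors still propagate to the caller
        return undefined;          // retryable ones are logged and skipped
      }
    }

    // batch-job.ts: the batch application reuses the boundary instead of
    // re-implementing the backend's error handling pattern.
    import { withErrorBoundary } from "./error-boundary";

    async function processRecord(id: string): Promise<void> {
      /* hypothetical per-record work */
    }

    export async function runBatch(ids: string[]): Promise<void> {
      for (const id of ids) {
        await withErrorBoundary(`batch:${id}`, () => processRecord(id));
      }
    }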
I think it's just sarcasm coming from the stereotypical HN attitude that Product Managers only get in the way of the real work of engineering. Ignore it; they're basically proving your point.