Hi HN, I'm a senior MLE. I've been watching the hype around agent swarms and tools like Gas Town clash with the reality of productizing them. I wrote this because I see companies burning millions on a "reasoning tax" (LLM-as-a-judge) instead of building deterministic safety rails. Would love to hear how others are handling output verification in their pipelines.
Doing post-mortems on my agent's failures over the holidays made me realize the problem isn't the model. It is the lack of a deterministic inference-time verification layer.
I spent the break reading the recent Stanford/Harvard paper on agentic adaptation [1]. Their research provides mathematical proof for what I experienced in Q4: supervising only final outputs is a dead end. Agents learn to "ignore tools and improve likelihood," meaning they learn to lie more convincingly to pass evaluations while the underlying logic rots.
I call this the Agent Lobotomy.
The agent I have in production today is significantly dumber than the one I demoed in December. I was forced to strip autonomy, remove context, and add human checkpoints because I could not trust the probabilistic output. We are stuck in an Autonomy Retreat, creating an Authority Bottleneck [2] where agents are relegated to assistive tasks because the tail risk of autonomous action is too high.
I built Steer (open source) to stop the bleeding. In v0.4.0, I moved the architecture to an Agent Service Mesh pattern. Instead of decorating every function, you patch the framework (e.g. PydanticAI) at the entry point. It auto-discovers tools and enforces a reliability policy globally via deterministic Reality Locks.
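Conceptually, the pattern looks something like this (a rough sketch; `patch_framework` and the policy name are illustrative placeholders, not Steer's actual API):

```python
# Hypothetical sketch of the Agent Service Mesh pattern.
# `steer.patch_framework` and the policy name are illustrative only.
import steer

# Patch once at the process entry point instead of decorating every tool.
steer.patch_framework("pydantic_ai", policy="block_on_failure")

# From here on, every PydanticAI agent constructed in this process has its
# tool calls and outputs routed through deterministic Reality Locks.
```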
The real unlock is the data. By capturing the delta between a Blocked Response and a Taught Fix, Steer acts as a synthetic data factory for DPO. It moves reliability from a runtime tax to a training asset, allowing you to eventually refactor your prompt monolith into fine-tuned model weights.
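To make that concrete, a single captured pair might serialize to something like this (field names follow the common prompt/chosen/rejected DPO convention; Steer's actual export schema may differ):

```python
import json

# One DPO training record: the Taught Fix becomes "chosen",
# the Blocked Response becomes "rejected".
record = {
    "prompt": "What is the price of SKU-1042?",
    "rejected": "The price is -$12.99.",           # blocked: failed a price > 0 verifier
    "chosen": "The price of SKU-1042 is $12.99.",  # taught fix from the dashboard
}
print(json.dumps(record))
```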
I’ve realized that a 3,000-token system prompt isn't "logic", it's legacy code that no one wants to touch. It’s brittle, hard to test, and expensive to run. It is Technical Debt.
My thesis is that we need to stop treating prompts as the "program" and start treating them as temporary specs that eventually get compiled into the model weights via fine-tuning.
I built Steer (open source) to automate this "refactoring" process. It helps you climb the "Deliberation Ladder":
1. The Floor (Validity): Use Steer's deterministic verifiers (regex, AST, JSON Schema) to block objective failures in real-time. Don't ask an LLM if JSON is valid; check it with code (see the sketch after this list).
2. The Ceiling (Quality): Use `steer export` to turn those captured failures into a fine-tuning dataset, training the model to handle nuance and "vibes" without a massive prompt.
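Here's the kind of Floor check I mean: a deterministic JSON validity/schema verifier in plain Python (using the `jsonschema` package; the schema itself is just an example):

```python
import json
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {"price": {"type": "number", "exclusiveMinimum": 0}},
    "required": ["price"],
}

def verify_json(raw: str) -> bool:
    """Deterministic validity check: parse, then validate against a schema."""
    try:
        validate(instance=json.loads(raw), schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```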
Curious if others are seeing this "Prompt Bloat" in production?
OP here. Last week I posted a discussion ("The Confident Idiot Problem") about why we need deterministic checks instead of just "LLM vibes" for reliability.
That thread [1] blew up, so I’m sharing the open-source implementation (v0.2) that solves it.
Steer is an active reliability layer for Python agents. It sits between your LLM and the user to enforce hard constraints.
Unlike passive observability tools that just log errors, Steer creates a feedback loop:
1. Catch: It uses deterministic verifiers (like Regex, AST parsing, JSON Schema) to block hallucinations in real-time.
2. Teach: You fix the behavior in a local dashboard (`steer ui`).
3. Train: v0.2 adds a "Data Engine" that exports these runtime failures into an OpenAI-ready fine-tuning dataset.
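For reference, the export target is OpenAI's chat-format fine-tuning JSONL, where each blocked failure plus its fix becomes one record roughly like this (contents illustrative):

```python
import json

# One line of an OpenAI-ready fine-tuning dataset (chat format).
# The behavior that was blocked at runtime is replaced by the corrected answer.
example = {
    "messages": [
        {"role": "user", "content": "weather in Springfield"},
        {"role": "assistant", "content": "Which Springfield do you mean? There are several (IL, MA, MO, ...)."},
    ]
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```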
The goal isn't just to block errors; it's to use those errors to bootstrap a model that stops making them.
It is Python-native, local-first, and framework agnostic.
Exactly. We treat them like databases, but they are hallucination machines.
My thesis isn't that we can stop the hallucinating (non-determinism), but that we can bound it.
If we wrap the generation in hard assertions (e.g., assert response.price > 0), we turn 'probability' into 'manageable software engineering.' The generation remains probabilistic, but the acceptance criterion becomes binary and deterministic.
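Concretely, something like this sketch, where `generate_quote` is a stand-in for whatever LLM call you're actually making:

```python
import random

def generate_quote(prompt: str) -> dict:
    # Stand-in for your actual LLM call; returns a parsed response.
    return {"price": random.choice([-1, 12.99])}

def get_verified_quote(prompt: str, max_retries: int = 3) -> dict:
    """Probabilistic generation, deterministic acceptance."""
    for _ in range(max_retries):
        response = generate_quote(prompt)
        # Binary acceptance criterion: price must be a positive number.
        if isinstance(response.get("price"), (int, float)) and response["price"] > 0:
            return response
    raise ValueError("no response passed the acceptance checks")
```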
> but the acceptance criterion becomes binary and deterministic.
Unfortunately, the use-case for AI is often where the acceptance criteria are not easily defined; it's a matter of judgment. For example, "Does this patient have cancer?"
In cases where the criteria can be easily and clearly stipulated, AI often isn't really required.
You're 100% right. For a "judgment" task like "Does this patient have cancer?", the final acceptance check has to come from a human expert. A purely deterministic verifier is impossible.
My thesis is that even in those "fuzzy" workflows, the agent's process is full of small, deterministic sub-tasks that can and should be verified.
For example, before the AI even attempts to analyze the X-ray for cancer, it must:
1/ Verify it has the correct patient file (PatientIDVerifier).
2/ Verify the image is a chest X-ray and not a brain MRI (ModalityVerifier).
3/ Verify the date of the scan is within the relevant timeframe (DateVerifier).
These are "boring," deterministic checks. But a failure on any one of them makes the final "judgment" output completely useless.
Steer isn't designed to automate the final, high-stakes judgment. It's designed to automate the pre-flight checklist, ensuring the agent has the correct, factually grounded information before it even begins the complex reasoning task. It's about reducing the "unforced errors" so the human expert can focus only on the truly hard part.
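In code, that checklist is just a chain of boring predicates that run before the expensive reasoning step ever starts (a sketch using the verifier names from my example above; the interfaces are illustrative, not Steer's actual API):

```python
from dataclasses import dataclass

# Illustrative pre-flight checklist; every check is deterministic code,
# not an LLM judgment. Interfaces here are hypothetical.
@dataclass
class Scan:
    patient_id: str
    modality: str
    scan_date: str  # ISO date, e.g. "2025-01-15"

def patient_id_verifier(scan: Scan, expected_id: str) -> bool:
    return scan.patient_id == expected_id

def modality_verifier(scan: Scan) -> bool:
    return scan.modality == "chest_xray"

def date_verifier(scan: Scan, earliest: str) -> bool:
    return scan.scan_date >= earliest  # ISO dates compare lexicographically

def preflight(scan: Scan, expected_id: str, earliest: str) -> None:
    checks = [
        ("PatientIDVerifier", patient_id_verifier(scan, expected_id)),
        ("ModalityVerifier", modality_verifier(scan)),
        ("DateVerifier", date_verifier(scan, earliest)),
    ]
    failed = [name for name, ok in checks if not ok]
    if failed:
        raise ValueError(f"pre-flight failed: {failed}")  # block before any LLM call
```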
I don't agree that users see them as databases. Sure, there are those who expect LLMs to be infallible and punish the technology when it disappoints them, but it seems to me that the overwhelming majority quickly learn what AI's shortcomings are, and treat them instead like intelligent entities who will sometimes make mistakes.
They didn't claim to know it, they said "it seems to me". Presumably they're extrapolating from their experience, or their expectations of how an average user would behave.
OP here. I wrote this because I got tired of agents confidently guessing answers when they should have asked for clarification (e.g. guessing "Springfield, IL" instead of asking "Which state?" when asked "weather in Springfield").
This approach kind of reminds me of taking an open-book test. Performing mandatory verification against a ground truth is like taking the test, then going back to your answers and looking up whether they match.
Unlike a student, the LLM never arrives at a sort of epistemic coherence, where they know what they know, how they know it, and how true it's likely to be. So you have to structure every problem into a format where the response can be evaluated against an external source of truth.
Thanks a lot for this. Also one question in case anyone could shed a bit of light: my understanding is that setting temperature=0, top_p=1 would cause deterministic output (identical output given identical input). For sure it won't prevent factually wrong replies/hallucination; it only maintains generation consistency (e.g. for classification tasks). Is this universally correct or is it dependent on the model used? (Or is my understanding just plain wrong?)
> my understanding is that setting temperature=0, top_p=1 would cause deterministic output (identical output given identical input).
That's typically correct. Many models are implemented this way deliberately. I believe it's true of most or all of the major models.
> Is this universally correct or is it dependent on the model used?
There are implementation details that lead to uncontrollable non-determinism if they're not prevented within the model implementation.
See e.g. the Pytorch docs for CUDA convolution determinism: https://docs.pytorch.org/docs/stable/notes/randomness.html#c...
That documents settings like this:
torch.backends.cudnn.deterministic = True  # force cuDNN to select deterministic convolution algorithms
Parallelism can also be a source of non-determinism if it's not controlled for, either implicitly (e.g. via dependencies) or explicitly.
You should use structured output rather than checking and rechecking for valid JSON. It can't solve all of your problems, but it can enforce a schema on the output format.
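For example, with a recent version of the OpenAI Python SDK you can pass a Pydantic model and the API constrains decoding to that schema (a sketch; the model name and schema are just examples):

```python
from pydantic import BaseModel
from openai import OpenAI

class Quote(BaseModel):
    sku: str
    price: float

client = OpenAI()
# Decoding is constrained to the schema, so the output is guaranteed to
# parse as a Quote (though the *values* can still be wrong).
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Quote me SKU-1042"}],
    response_format=Quote,
)
quote = completion.choices[0].message.parsed  # a Quote instance
```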