Hi HN, I'm a senior MLE. I've been watching the hype around agent swarms and tools like Gas Town clash with the reality of productizing them. I wrote this because I see companies burning millions on a "reasoning tax" (LLM-as-a-judge) instead of building deterministic safety rails. Would love to hear how others are handling output verification in their pipelines.
Doing post-mortems on my agent's failures over the holidays made me realize the problem isn't the model. It is the lack of a deterministic inference-time verification layer.
I spent the break reading the recent Stanford/Harvard paper on agentic adaptation [1]. Their research provides mathematical proof for what I experienced in Q4: supervising only final outputs is a dead end. Agents learn to "ignore tools and improve likelihood," meaning they learn to lie more convincingly to pass evaluations while the underlying logic rots.
I call this the Agent Lobotomy.
The agent I have in production today is significantly dumber than the one I demoed in December. I was forced to strip autonomy, remove context, and add human checkpoints because I could not trust the probabilistic output. We are stuck in an Autonomy Retreat, creating an Authority Bottleneck [2] where agents are relegated to assistive tasks because the tail risk of autonomous action is too high.
I built Steer (open source) to stop the bleeding. In v0.4.0, I moved the architecture to an Agent Service Mesh pattern. Instead of decorating every function, you patch the framework (e.g. PydanticAI) at the entry point. It auto-discovers tools and enforces a reliability policy globally via deterministic Reality Locks.
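Conceptually, the pattern looks something like this (a rough sketch; `patch_framework` and the policy name are illustrative placeholders, not Steer's actual API):

```python
# Hypothetical sketch of the Agent Service Mesh pattern.
# `steer.patch_framework` and the policy name are illustrative only.
import steer

# Patch once at the process entry point instead of decorating every tool.
steer.patch_framework("pydantic_ai", policy="block_on_failure")

# From here on, every PydanticAI agent constructed in this process has its
# tool calls and outputs routed through deterministic Reality Locks.
```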
The real unlock is the data. By capturing the delta between a Blocked Response and a Taught Fix, Steer acts as a synthetic data factory for DPO. It moves reliability from a runtime tax to a training asset, allowing you to eventually refactor your prompt monolith into fine-tuned model weights.
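To make that concrete, a single captured pair might serialize to something like this (field names follow the common prompt/chosen/rejected DPO convention; Steer's actual export schema may differ):

```python
import json

# One DPO training record: the Taught Fix becomes "chosen",
# the Blocked Response becomes "rejected".
record = {
    "prompt": "What is the price of SKU-1042?",
    "rejected": "The price is -$12.99.",           # blocked: failed a price > 0 verifier
    "chosen": "The price of SKU-1042 is $12.99.",  # taught fix from the dashboard
}
print(json.dumps(record))
```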
I’ve realized that a 3,000-token system prompt isn't "logic", it's legacy code that no one wants to touch. It’s brittle, hard to test, and expensive to run. It is Technical Debt.
My thesis is that we need to stop treating prompts as the "program" and start treating them as temporary specs that eventually get compiled into the model weights via fine-tuning.
I built Steer (open source) to automate this "refactoring" process. It helps you climb the "Deliberation Ladder":
1. The Floor (Validity): Use Steer's deterministic verifiers (regex, AST, JSON Schema) to block objective failures in real-time. Don't ask an LLM if JSON is valid; check it with code (see the sketch after this list).
2. The Ceiling (Quality): Use `steer export` to turn those captured failures into a fine-tuning dataset, training the model to handle nuance and "vibes" without a massive prompt.
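Here's the kind of Floor check I mean: a deterministic JSON validity/schema verifier in plain Python (using the `jsonschema` package; the schema itself is just an example):

```python
import json
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {"price": {"type": "number", "exclusiveMinimum": 0}},
    "required": ["price"],
}

def verify_json(raw: str) -> bool:
    """Deterministic validity check: parse, then validate against a schema."""
    try:
        validate(instance=json.loads(raw), schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```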
Curious if others are seeing this "Prompt Bloat" in production?
OP here. Last week I posted a discussion ("The Confident Idiot Problem") about why we need deterministic checks instead of just "LLM vibes" for reliability.
That thread [1] blew up, so I’m sharing the open-source implementation (v0.2) that solves it.
Steer is an active reliability layer for Python agents. It sits between your LLM and the user to enforce hard constraints.
Unlike passive observability tools that just log errors, Steer creates a feedback loop:
1. Catch: It uses deterministic verifiers (like Regex, AST parsing, JSON Schema) to block hallucinations in real-time.
2. Teach: You fix the behavior in a local dashboard (`steer ui`).
3. Train: v0.2 adds a "Data Engine" that exports these runtime failures into an OpenAI-ready fine-tuning dataset.
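For reference, the export target is OpenAI's chat-format fine-tuning JSONL, where each blocked failure plus its fix becomes one record roughly like this (contents illustrative):

```python
import json

# One line of an OpenAI-ready fine-tuning dataset (chat format).
# The behavior that was blocked at runtime is replaced by the corrected answer.
example = {
    "messages": [
        {"role": "user", "content": "weather in Springfield"},
        {"role": "assistant", "content": "Which Springfield do you mean? There are several (IL, MA, MO, ...)."},
    ]
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```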
The goal isn't just to block errors; it's to use those errors to bootstrap a model that stops making them.
It is Python-native, local-first, and framework agnostic.
Exactly. We treat them like databases, but they are hallucination machines.
My thesis isn't that we can stop the hallucinating (non-determinism), but that we can bound it.
If we wrap the generation in hard assertions (e.g., assert response.price > 0), we turn 'probability' into 'manageable software engineering.' The generation remains probabilistic, but the acceptance criterion becomes binary and deterministic.
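Concretely, something like this sketch, where `generate_quote` is a stand-in for whatever LLM call you're actually making:

```python
import random

def generate_quote(prompt: str) -> dict:
    # Stand-in for your actual LLM call; returns a parsed response.
    return {"price": random.choice([-1, 12.99])}

def get_verified_quote(prompt: str, max_retries: int = 3) -> dict:
    """Probabilistic generation, deterministic acceptance."""
    for _ in range(max_retries):
        response = generate_quote(prompt)
        # Binary acceptance criterion: price must be a positive number.
        if isinstance(response.get("price"), (int, float)) and response["price"] > 0:
            return response
    raise ValueError("no response passed the acceptance checks")
```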
> but the acceptance criterion becomes binary and deterministic.
Unfortunately, the use-case for AI is often where the acceptance criteria are not easily defined; it's a matter of judgment. For example, "Does this patient have cancer?"
In cases where the criteria can be easily and clearly stipulated, AI often isn't really required.
You're 100% right. For a "judgment" task like "Does this patient have cancer?", the final acceptance check has to come from a human expert. A purely deterministic verifier is impossible.
My thesis is that even in those "fuzzy" workflows, the agent's process is full of small, deterministic sub-tasks that can and should be verified.
For example, before the AI even attempts to analyze the X-ray for cancer, it must:
1/ Verify it has the correct patient file (PatientIDVerifier).
2/ Verify the image is a chest X-ray and not a brain MRI (ModalityVerifier).
3/ Verify the date of the scan is within the relevant timeframe (DateVerifier).
These are "boring," deterministic checks. But a failure on any one of them makes the final "judgment" output completely useless.
Steer isn't designed to automate the final, high-stakes judgment. It's designed to automate the pre-flight checklist, ensuring the agent has the correct, factually grounded information before it even begins the complex reasoning task. It's about reducing the "unforced errors" so the human expert can focus only on the truly hard part.
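In code, that checklist is just a chain of boring predicates that run before the expensive reasoning step ever starts (a sketch using the verifier names from my example above; the interfaces are illustrative, not Steer's actual API):

```python
from dataclasses import dataclass

# Illustrative pre-flight checklist; every check is deterministic code,
# not an LLM judgment. Interfaces here are hypothetical.
@dataclass
class Scan:
    patient_id: str
    modality: str
    scan_date: str  # ISO date, e.g. "2025-01-15"

def patient_id_verifier(scan: Scan, expected_id: str) -> bool:
    return scan.patient_id == expected_id

def modality_verifier(scan: Scan) -> bool:
    return scan.modality == "chest_xray"

def date_verifier(scan: Scan, earliest: str) -> bool:
    return scan.scan_date >= earliest  # ISO dates compare lexicographically

def preflight(scan: Scan, expected_id: str, earliest: str) -> None:
    checks = [
        ("PatientIDVerifier", patient_id_verifier(scan, expected_id)),
        ("ModalityVerifier", modality_verifier(scan)),
        ("DateVerifier", date_verifier(scan, earliest)),
    ]
    failed = [name for name, ok in checks if not ok]
    if failed:
        raise ValueError(f"pre-flight failed: {failed}")  # block before any LLM call
```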
I don't agree that users see them as databases. Sure, there are those who expect LLMs to be infallible and punish the technology when it disappoints them, but it seems to me that the overwhelming majority quickly learn what AI's shortcomings are, and treat them instead like intelligent entities who will sometimes make mistakes.
They didn't claim to know it, they said "it seems to me". Presumably they're extrapolating from their experience, or their expectations of how an average user would behave.
OP here. I wrote this because I got tired of agents confidently guessing answers when they should have asked for clarification (e.g. guessing "Springfield, IL" instead of asking "Which state?" when asked "weather in Springfield").
This approach kind of reminds me of taking an open-book test. Performing mandatory verification against a ground truth is like taking the test, then going back to your answers and looking up whether they match.
Unlike a student, the LLM never arrives at a sort of epistemic coherence, where they know what they know, how they know it, and how true it's likely to be. So you have to structure every problem into a format where the response can be evaluated against an external source of truth.
Thanks a lot for this. Also one question in case anyone could shed a bit of light: my understanding is that setting temperature=0, top_p=1 would cause deterministic output (identical output given identical input). For sure it won't prevent factually wrong replies/hallucination; it only maintains generation consistency (e.g. for classification tasks). Is this universally correct or is it dependent on the model used? (Or is my understanding just plain wrong?)
> my understanding is that setting temperature=0, top_p=1 would cause deterministic output (identical output given identical input).
That's typically correct. Many models are implemented this way deliberately. I believe it's true of most or all of the major models.
> Is this universally correct or is it dependent on the model used?
There are implementation details that lead to uncontrollable non-determinism if they're not prevented within the model implementation.
See e.g. the Pytorch docs for CUDA convolution determinism: https://docs.pytorch.org/docs/stable/notes/randomness.html#c...
That documents settings like this:
torch.backends.cudnn.deterministic = True  # force cuDNN to select deterministic convolution algorithms
Parallelism can also be a source of non-determinism if it's not controlled for, either implicitly (e.g. via dependencies) or explicitly.
You should use structured output rather than checking and rechecking for valid JSON. It can't solve all of your problems, but it can enforce a schema on the output format.
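For example, with a recent version of the OpenAI Python SDK you can pass a Pydantic model and the API constrains decoding to that schema (a sketch; the model name and schema are just examples):

```python
from pydantic import BaseModel
from openai import OpenAI

class Quote(BaseModel):
    sku: str
    price: float

client = OpenAI()
# Decoding is constrained to the schema, so the output is guaranteed to
# parse as a Quote (though the *values* can still be wrong).
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Quote me SKU-1042"}],
    response_format=Quote,
)
quote = completion.choices[0].message.parsed  # a Quote instance
```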