the worktree isolation per agent is clever. i've been running claude code as my sole dev partner on a production platform (10 repos) and the biggest unlock was treating context as infrastructure — curated reference docs the agent reads on-demand rather than dumping everything into context. re: the longevity question in the thread — i think the orchestration layer stays relevant as long as you're coordinating across repos/services, not just within a single codebase.
the DAG decomposition approach is interesting — curious how it handles goals that span multiple services/repos. i build a multi-service platform solo with claude code and the hardest part isn't the coding, it's knowing which files across which repos need to change for a given goal. do you see sgai supporting multi-repo goals, or is it scoped to single-repo for now?
author here! It supports multi-repo goals. You'd create a directory with both git repositories cloned inside it and save the GOAL.md at the parent level. This UX could use some polish, for sure. It works, but it needs that extra step.
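Rough sketch of the layout, in case it helps (repo URLs and the workspace name are placeholders, not anything sgai-specific):

```python
# Parent workspace with both repos cloned side by side; GOAL.md sits at the parent.
import subprocess
from pathlib import Path

workspace = Path("my-goal-workspace")
workspace.mkdir(exist_ok=True)

# Clone both repositories inside the parent directory.
for url in ["git@github.com:me/service-a.git", "git@github.com:me/service-b.git"]:
    subprocess.run(["git", "clone", url], cwd=workspace, check=True)

# GOAL.md lives at the parent, next to the two clones.
(workspace / "GOAL.md").write_text("Describe the cross-repo goal here.\n")
```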
the context isolation approach is smart — cascading drift between agents is a real problem. i run 10 microservices with claude code and solved a similar issue by maintaining curated reference docs that agents read on-demand per task area instead of loading everything. the model escalation on failure (haiku → sonnet) is a nice touch too. do you find the lancedb memory layer actually helps with repeated similar tasks, or is it more useful for the code knowledge graph side?
Interesting to see the evolution mapped out like this. For those building on top of these models (RAG systems, agent frameworks), the real inflection point wasn't just model count but the shift from completion-only to reasoning and structured output capabilities. Are you planning to add annotations for capability changes alongside release dates?
How does Emdash handle state management when running multiple agents on the same codebase? Particularly interested in how you prevent conflicts when agents are making concurrent modifications to dependencies or config files. Also, does it support custom agent wrappers, or do you require the native CLI?
Thanks for your questions! You can isolate agents in Emdash by running each one on its own git worktree, so they can make concurrent modifications without interfering. We don't support custom agent wrappers currently, but that's an interesting idea. Have you written your own? What's your use case for them over the native CLIs?
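The worktree mechanics underneath are plain git, roughly like this (branch and path naming is just an example, not exactly what Emdash does internally):

```python
# Hypothetical helper: give each agent its own git worktree on a fresh branch,
# so concurrent edits never touch the same working directory.
import subprocess

def add_agent_worktree(repo_dir: str, agent_name: str) -> str:
    """Create ../<agent_name> as a worktree on a new branch of the repo."""
    worktree_path = f"../{agent_name}"
    subprocess.run(
        ["git", "worktree", "add", "-b", f"agent/{agent_name}", worktree_path],
        cwd=repo_dir,
        check=True,
    )
    return worktree_path

# e.g. add_agent_worktree("my-repo", "agent-a"); add_agent_worktree("my-repo", "agent-b")
```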
This matches my experience. I work across a multi-repo microservice setup with Claude Code and the .env file is honestly the least of it.
The cases that bite me:
1. Docker build args — tokens passed to Dockerfiles for private package installs live in docker-compose.yml, not .env. No .env-focused tool catches them.
2. YAML config files with connection strings and API keys — again, not .env format, invisible to .env tooling.
3. Shell history — even if you never cat the .env, you've probably exported a var or run a curl with a key at some point in the session.
The proxy/surrogate approach discussed upthread seems like the only thing that actually closes the loop, since it works regardless of which file or log the secret would have ended up in.
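To sketch what "closes the loop" means in practice (the placeholder format and names here are made up, not any specific tool): the agent-visible side only ever holds placeholders, and the outbound hop swaps in the real value.

```python
# Rough sketch: agent-visible config, files, and logs only ever contain
# placeholder tokens; the real secret lives in the proxy process and is
# substituted at request time.
import os
import urllib.request

PLACEHOLDERS = {
    # Real secret lives only in the proxy's environment (assumed variable name).
    "{{STRIPE_KEY}}": os.environ.get("REAL_STRIPE_KEY", ""),
}

def resolve(value: str) -> str:
    """Replace any known placeholder tokens with the real secret."""
    for token, secret in PLACEHOLDERS.items():
        value = value.replace(token, secret)
    return value

def proxied_request(url: str, headers: dict[str, str]) -> bytes:
    # Whatever the agent wrote (compose files, YAML, shell history) only shows
    # "{{STRIPE_KEY}}"; the real value exists only inside this call.
    real_headers = {k: resolve(v) for k, v in headers.items()}
    req = urllib.request.Request(url, headers=real_headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```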
The multi-agent budget problem you're describing gets even harder when the
services are heterogeneous. In a RAG pipeline, a single user query might hit:
query analysis (LLM call), embedding generation (different model/pricing),
reranking (yet another model), and response generation (LLM call) — each
potentially in a different process.
Per-call monkey-patching sees each call in isolation. What I ended up doing was
a trace-based approach: every request gets a trace ID, each service appends cost
spans asynchronously, and a separate enrichment step aggregates the total. The
hard part was deduplication — when service A reports an aggregate cost and
service B reports the individual calls that compose it, you need to reconcile
or you double-count.
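The shape of it, heavily simplified (in-process dict instead of a real span store, and the names are made up):

```python
# Trace-scoped cost spans: each service appends a span tagged with the request's
# trace ID; an enrichment step later sums them, skipping double counts.
import uuid
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CostSpan:
    trace_id: str
    service: str
    model: str
    usd: float
    is_aggregate: bool = False  # True if this span already rolls up child calls

SPANS: dict[str, list[CostSpan]] = defaultdict(list)

def record(span: CostSpan) -> None:
    SPANS[span.trace_id].append(span)   # fire-and-forget in the real system

def total_cost(trace_id: str) -> float:
    spans = SPANS[trace_id]
    # Naive dedup: if a service reported an aggregate span, ignore that
    # service's individual spans so they aren't counted twice.
    aggregated = {s.service for s in spans if s.is_aggregate}
    return sum(s.usd for s in spans if s.is_aggregate or s.service not in aggregated)

trace = str(uuid.uuid4())
record(CostSpan(trace, "query-analysis", "small-model", 0.0004))
record(CostSpan(trace, "embedding", "embed-small", 0.0001))
record(CostSpan(trace, "generation", "large-model", 0.012))
print(total_cost(trace))
```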
Your atomic disk writes for halt state are a nice pattern. I went with
fire-and-forget (never block the request path, accept eventual consistency on
cost data) but that means you can't do hard enforcement mid-request like
AgentBudget does.
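(For concreteness, the atomic-write pattern I mean is just write-temp-then-rename; a minimal sketch, not AgentBudget's actual code:)

```python
# Write the halt state to a temp file in the same directory, fsync, then rename.
# Readers never observe a half-written file; os.replace is atomic on POSIX
# within the same filesystem.
import json
import os
import tempfile

def write_halt_state(path: str, state: dict) -> None:
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise

write_halt_state("halt_state.json", {"halted": True, "reason": "budget exceeded"})
```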
The deduplication problem is the part I haven't worked out cleanly. The hierarchy in veronica-core sidesteps it as long as you declare parent-child relationships upfront — B's spend rolls directly into A's ceiling without a separate aggregation step. But in a dynamic pipeline where you don't know the call graph until runtime, that assumption breaks.
The fire-and-forget tradeoff makes sense. I went with blocking enforcement because the original use case was preventing runaway agents, not auditing after the fact. For RAG you're probably right that eventual consistency is the better fit — you care more about the trace than cutting off a half-finished response.
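Roughly the idea, heavily simplified (not the actual veronica-core API):

```python
# Toy budget node: a child's spend is charged up the chain, so the parent's
# ceiling is enforced without a separate aggregation step, but only if the
# parent/child link is declared before the spend happens.
class Budget:
    def __init__(self, ceiling_usd: float, parent: "Budget | None" = None):
        self.ceiling = ceiling_usd
        self.spent = 0.0
        self.parent = parent

    def charge(self, usd: float) -> None:
        # First check every ancestor's ceiling, then commit the spend to all.
        node = self
        while node is not None:
            if node.spent + usd > node.ceiling:
                raise RuntimeError("budget exceeded, halting")
            node = node.parent
        node = self
        while node is not None:
            node.spent += usd
            node = node.parent

pipeline = Budget(1.00)
rag_step = Budget(0.25, parent=pipeline)
rag_step.charge(0.10)   # counts against both rag_step and pipeline
```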
Re: when to add a WebSocket gateway vs keeping it in the monolith —
I've built multi-channel chat infrastructure and the honest answer is: keep the
monolith until you have a specific scaling bottleneck, not a theoretical one.
One pattern that helped was normalizing all channel-specific message formats into
a single internal message type early. Each channel adapter handles its own quirks
(some platforms give you 3 seconds to respond, others 20, some need deferred
responses) but they all produce the same normalized message that the core
processing pipeline consumes. This decoupling is what made it possible to split
later without rewriting business logic.
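Concretely, the normalized type can be as small as this (field names are just an example of the shape, not a prescription):

```python
# One internal message shape; every channel adapter produces it, and the core
# pipeline only ever sees this type.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class InboundMessage:
    channel: str          # "slack", "sms", "web", ...
    sender_id: str
    text: str
    received_at: datetime
    reply_deadline_s: float | None = None   # channels with 3s vs 20s windows set this
    raw: dict | None = None                 # original payload kept for debugging

def from_web_socket(payload: dict) -> InboundMessage:
    return InboundMessage(
        channel="web",
        sender_id=payload["user_id"],
        text=payload["body"],
        received_at=datetime.now(timezone.utc),
        raw=payload,
    )

# Other adapters (Slack, SMS, ...) handle their quirks internally and return
# the same InboundMessage, so the pipeline never branches on channel.
```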
On Redis pub/sub specifically: for a solo dev, skip it until you actually have
multiple server instances that need to share state. A single process with
WebSocket sessions in memory is fine for early users. The complexity cost of
pub/sub isn't worth it until you need horizontal scaling or have a separate
worker process pushing messages.
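And "sessions in memory" really can be a plain dict; rough sketch, with the framework details omitted:

```python
# Single-process session registry: fine until a second server instance or a
# separate worker needs to push messages to connections it doesn't hold.
import asyncio

class SessionRegistry:
    def __init__(self) -> None:
        self._sessions: dict[str, asyncio.Queue] = {}

    def connect(self, user_id: str) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._sessions[user_id] = q
        return q

    def disconnect(self, user_id: str) -> None:
        self._sessions.pop(user_id, None)

    async def push(self, user_id: str, message: str) -> bool:
        q = self._sessions.get(user_id)
        if q is None:
            return False        # user not connected to *this* process
        await q.put(message)    # the websocket handler drains this queue
        return True

# The moment push() needs to reach a connection held by another process is the
# actual trigger for Redis pub/sub or similar.
```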
What's your current message volume like? That usually determines timing better
than architecture diagrams.
This is super helpful — thank you. And I agree with the “pressure before architecture” principle.
Right now Falcon is still very early — message volume is basically zero outside of local testing. The service split isn’t driven by traffic yet, it’s more about separating identity/trust from messaging so I don’t entangle community membership logic with transport.
The internal normalization point you mentioned is something I’m trying to do early: the goal is a single internal message/event model that adapters (WebSocket, future federation, etc.) translate into, so the core pipeline stays stable if/when the runtime topology changes.
On Redis/pub-sub: totally fair. I’m not running multi-instance yet. JetStream is more experimental at this stage — mostly exploring how identity-aware events propagate, not solving scale today.
Nice project, especially given the VRAM constraints. A few things I've learned
building production RAG that might help:
1. Separate your query analysis from retrieval. A single LLM call can classify
the query type, decide whether to use hybrid search, and pick search parameters
all at once. This saves a round-trip vs doing them sequentially.
2. If you add BM25 alongside vector search, the blend ratio matters a lot by
query type. Exact-match queries need heavy keyword weighting, while conceptual
questions need more embedding weight. A static 50/50 split leaves performance
on the table.
3. For your evaluator/generator being the same model — one practical workaround
is to skip LLM-as-judge evaluation entirely and use a small cross-encoder
reranker between retrieval and generation instead. It catches the cases where
vector similarity returns semantically related but not actually useful chunks,
and it gives you a relevance score you can threshold on without needing a
separate evaluation model.
4. Consider a two-level cache: exact match (hash the query, short TTL) plus a
semantic cache (cosine similarity threshold on the query embedding, longer TTL).
The semantic layer catches "how do I X" vs "what's the way to X" without hitting
the retriever again (rough sketch of this below).
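Rough sketch of point 4, assuming you already have some embed(query) -> vector function; the thresholds and TTLs are placeholders to tune:

```python
# Two-level query cache: exact match on a hash of the query, then a semantic
# check against embeddings of previously answered queries.
import hashlib
import time

import numpy as np

EXACT_TTL_S = 300
SEMANTIC_TTL_S = 3600
SIM_THRESHOLD = 0.92   # placeholder; tune on your own query distribution

exact_cache: dict[str, tuple[float, str]] = {}             # hash -> (ts, answer)
semantic_cache: list[tuple[float, np.ndarray, str]] = []   # (ts, embedding, answer)

def lookup(query: str, embed) -> str | None:
    now = time.time()
    key = hashlib.sha256(query.encode()).hexdigest()

    hit = exact_cache.get(key)
    if hit and now - hit[0] < EXACT_TTL_S:
        return hit[1]

    q_vec = np.asarray(embed(query))
    for ts, vec, answer in semantic_cache:
        if now - ts < SEMANTIC_TTL_S:
            sim = float(np.dot(q_vec, vec) / (np.linalg.norm(q_vec) * np.linalg.norm(vec)))
            if sim >= SIM_THRESHOLD:
                return answer
    return None

def store(query: str, answer: str, embed) -> None:
    now = time.time()
    exact_cache[hashlib.sha256(query.encode()).hexdigest()] = (now, answer)
    semantic_cache.append((now, np.asarray(embed(query)), answer))
```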
What model are you using for generation on the 8GB? That constraint probably
shapes a lot of the architecture choices downstream.