zozbot234 · 2026-04-02T22:42:01 1775169721

But wait, I thought FrEe SpEeCh Is NoT FrEeDoM fRoM cOnSeQuEnCeS and FrEe SpEeCh DoEs NoT gUaRaNtEe YoU a PlAtFoRm. Now you're telling us that these are bad takes and freedom from arbitrary interference based on one's public opinions actually matters? That's so confusing!

zozbot234 · 2026-04-02T22:34:35 1775169275

Right, options go underwater precisely when the company is not doing well and you are at greatest risk of losing the job. That's not a great risk profile.

zozbot234 · 2026-04-02T22:09:45 1775167785

Kimi 2.5 is relatively sparse at 1T/32B; GLM 5 does 744B/40B so only slightly denser. Maybe you could try reducing active expert count on those to artificially increase sparsity, but I'm sure that would impact quality.
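To put rough numbers on "slightly denser", here is a quick sketch that computes the active fraction from the total/active parameter counts quoted above. The counts are taken as given in this comment; the function name is just for illustration.

```python
# Total and active parameter counts in billions, as quoted in the comment above.
models = {
    "Kimi 2.5": (1000, 32),
    "GLM 5": (744, 40),
}

def active_fraction(total_b, active_b):
    """Fraction of the model's parameters that are used for each token."""
    return active_b / total_b

for name, (total_b, active_b) in models.items():
    frac = active_fraction(total_b, active_b)
    print(f"{name}: {frac:.1%} of parameters active per token")
```

That works out to roughly 3% versus 5% active per token.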

coder543 · 2026-04-02T22:55:12 1775170512

Reducing the expert count after training causes catastrophic loss of knowledge and skills. Cerebras does this with their REAP models (although it is applied to the total set of experts, not just routing to fewer experts each time), and it can be okay for very specific use cases if you measure which experts are needed for your use case and carefully choose to delete the least used ones, but it doesn't really provide any general insight into how a higher sparsity model would behave if trained that way from scratch.
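As a rough illustration of the "measure which experts are needed and delete the least used ones" idea, here is a minimal sketch. It ranks experts by plain routing counts on a sample workload, which is a simplifying assumption and not REAP's actual scoring; the trace format and function name are hypothetical.

```python
from collections import Counter

def least_used_experts(routing_trace, num_experts, num_to_prune):
    """Return the ids of the experts that were routed to least often.

    routing_trace: one list of chosen expert ids per token, recorded
    while running the model on data from your own use case.
    """
    counts = Counter()
    for chosen in routing_trace:
        counts.update(chosen)
    # Experts that never appear in the trace have a count of 0.
    ranked = sorted(range(num_experts), key=lambda e: (counts[e], e))
    return ranked[:num_to_prune]

# Toy trace: 3 tokens, 2 experts chosen per token, 4 experts in total.
trace = [[0, 1], [0, 2], [0, 1]]
print(least_used_experts(trace, num_experts=4, num_to_prune=2))
```

The result only reflects the workload in the trace, which is why this kind of pruning is tied to a specific use case.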

zozbot234 · 2026-04-02T22:04:17 1775167457

Large MoE models are too heavily bottlenecked on typical discrete GPUs. You end up pushing just a few common/non-shared layers to GPU and running the MoE part on CPU, because the bandwidth of PCIe transfers to a discrete GPU is a killer bottleneck. Platforms with reasonable amounts of unified memory are more balanced despite the lower VRAM bandwidth, and can more easily run even larger models by streaming inactive weights from SSD (though this quickly becomes overkill as you get increasingly bottlenecked by storage bandwidth: you'd be better off then with a plain HEDT accessing lots of fast storage in parallel via abundant PCIe lanes).
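A back-of-the-envelope sketch of why the link matters: if every active weight has to cross some link once per token, decode speed is capped at link bandwidth divided by the bytes read per token. The bandwidth figures below are illustrative placeholders, not measurements, and the model ignores compute, caching, and KV-cache traffic.

```python
def decode_tokens_per_second(active_params_b, bytes_per_param, bandwidth_gb_s):
    """Upper bound on decode speed when all active weights cross one link per token.

    active_params_b: active parameters per token, in billions
    bytes_per_param: storage per parameter (0.5 for 4-bit quantization)
    bandwidth_gb_s:  bandwidth of the limiting link, in GB/s
    """
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token

# Placeholder bandwidths chosen only to show the scaling (assumptions).
links = {
    "fast SSD": 8,
    "PCIe link to discrete GPU": 64,
    "unified memory": 256,
}

for name, gb_s in links.items():
    rate = decode_tokens_per_second(32, 0.5, gb_s)
    print(f"{name}: at most {rate:.1f} tokens/s for 32B active params at 4-bit")
```

The bound scales linearly with the slowest link the weights have to cross, which is the whole point.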

zozbot234 · 2026-04-02T20:41:58 1775162518

The models are not technically comparable: the Qwen is dense, the Gemma is MoE. The ~33B models are the other way around!

zozbot234 · 2026-04-02T20:32:38 1775161958

If you want the model to have function calls available, you need to run it in an agentic harness that does the proper sandboxing etc. to keep things safe, and you need to provide the spec and syntax in your system prompt. This is true of any model: AI inference on its own can only involve guessing, not exact compute.
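A toy sketch of the harness side, to show the shape of it. The JSON call format and the `calculate` tool are made up for this example (you would describe whatever format you pick in the system prompt), and a restricted arithmetic evaluator stands in for real sandboxing, which it is not.

```python
import ast
import json
import operator

# Only plain arithmetic is allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculate(expression):
    """Evaluate an arithmetic expression exactly, without using eval()."""
    def ev(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

TOOLS = {"calculate": calculate}  # hypothetical tool registry

def handle_model_output(text):
    """Run the tool if the model emitted a JSON tool call, else return None."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None  # ordinary chat text, nothing to execute
    if not isinstance(call, dict) or call.get("tool") not in TOOLS:
        return None
    return TOOLS[call["tool"]](**call.get("arguments", {}))

reply = '{"tool": "calculate", "arguments": {"expression": "12 * (3 + 4)"}}'
print(handle_model_output(reply))
```

A real harness would feed the result back to the model as a new message and loop until the model stops calling tools.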

neonstatic · 2026-04-02T20:35:48 1775162148

Thanks, I am very new to this and just run models in LM Studio. I think it would be very useful to have a system prompt telling the model to write Python scripts for the calculations LLMs are particularly bad at, and to run those scripts. Can you recommend a harness that you like to use? I suppose safety of these solutions is its own can of worms, but I am willing to try it.

Computer0 · 2026-04-02T20:53:16 1775163196

I use Claude Code. Codex and OpenCode both work too. You could even do it with VS Code Copilot.

zozbot234 · 2026-04-02T20:57:13 1775163433

These are typically coding oriented as opposed to general chat, so their system prompts may be needlessly heavy for that use case. I think the closest thing to a general solution is the emerging "claw" ecosystem, as silly as that sounds. Some of the newer "claws" do provide proper sandboxing.

zozbot234 · 2026-04-02T20:17:02 1775161022

As a matter of fact, there have been multiple reports of the Chinese doing informal, heavy "policing" of their own citizens abroad. Even if you aren't Chinese or linked to China yourself, this does affect the strength of that particular argument.

justinclift · 2026-04-02T21:25:50 1775165150

> ... this does affect the strength of that particular argument.

It doesn't really affect the strength of that particular argument.

And you're being misleading, seemingly on purpose. Please don't.

zozbot234 · 2026-04-02T18:40:34 1775155234

Expert streaming is something that has to be implemented by the inference engine/library; the model architecture itself has very little to do with it. It's a great idea (for local inference; it uses too much power at scale), but making it work really well is actually not that easy.

(I've mentioned this before but AIUI it would require some new feature definitions in GGUF, to allow for coalescing model data about any one expert-layer into a single extent, so that it can be accessed in bulk. That's what seems to make the new Flash-MoE work so well.)
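To illustrate what "coalescing into a single extent" could mean, here is a sketch that orders tensors so everything belonging to one expert-layer is contiguous and can be fetched with a single read. This is not GGUF's format or API; the tensor tuples and function name are hypothetical.

```python
def coalesce_layout(tensors):
    """Lay out tensors so each (layer, expert) pair occupies one byte range.

    tensors: list of (name, layer, expert, size_bytes)
    Returns (offsets, extents), where offsets maps a tensor name to its
    file offset and extents maps (layer, expert) to (start, length).
    """
    offsets = {}
    extents = {}
    cursor = 0
    # Sorting by (layer, expert) first keeps each expert-layer contiguous.
    for name, layer, expert, size in sorted(tensors, key=lambda t: (t[1], t[2], t[0])):
        offsets[name] = cursor
        start, length = extents.get((layer, expert), (cursor, 0))
        extents[(layer, expert)] = (start, length + size)
        cursor += size
    return offsets, extents

# Toy input, grouped by tensor kind rather than by expert.
tensors = [
    ("up.0.0", 0, 0, 10),
    ("up.0.1", 0, 1, 10),
    ("down.0.0", 0, 0, 20),
    ("down.0.1", 0, 1, 20),
]
offsets, extents = coalesce_layout(tensors)
print(extents)
```

With a layout like this, loading expert 1 of layer 0 is one sequential read of its extent instead of several scattered ones.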

vessenes · 2026-04-02T19:31:17 1775158277

I’ve been doing some low-key testing on smaller models, and it looks to me like it’s possible to train an MoE model with characteristics that are helpful for streaming. For instance, you could add a loss term that penalizes expert swapping both within a single forward pass and across multiple forward passes. So I believe there is a place for thinking about this on the model training side.
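One possible shape for such a penalty, as a sketch only: measure how much the router's distribution over experts changes between consecutive tokens and add that (scaled) to the training loss. The exact form is an assumption made for illustration; a real version would be written in a differentiable framework and applied per layer.

```python
def switch_penalty(router_probs):
    """Mean total-variation distance between consecutive routing distributions.

    router_probs: one probability vector over experts per token,
    all taken from the same MoE layer.
    Returns 0.0 when routing never changes and 1.0 when every step
    moves all probability mass to different experts.
    """
    if len(router_probs) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(router_probs, router_probs[1:]):
        total += 0.5 * sum(abs(p - q) for p, q in zip(prev, cur))
    return total / (len(router_probs) - 1)

stable = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
jumpy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(switch_penalty(stable), switch_penalty(jumpy))
```

The weight on this term would have to be traded off against the usual load-balancing loss, which pushes in the opposite direction.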

zozbot234 · 2026-04-02T19:45:05 1775159105

Penalizing expert swaps doesn't seem like it would help much, because experts vary by layer and are picked layer-wise. There's no guarantee that expert X in layer Y that was used for the previous token will still be available for this token's load from layer Y. The optimum would vary depending on how much memory you have at any given moment, and such. It's not obviously worth optimizing for.
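A small simulation of the point about memory: treat resident experts as an LRU cache keyed by (layer, expert) and see how the hit rate for the same access trace depends on how many expert-layers fit. The LRU policy and the trace are assumptions for illustration, not how any particular engine behaves.

```python
from collections import OrderedDict

def cache_hit_rate(trace, capacity):
    """Fraction of accesses served from memory under an LRU policy.

    trace: sequence of (layer, expert) keys in access order
    capacity: number of expert-layers that fit in memory at once
    """
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace) if trace else 0.0

# Three tokens, two layers each; layer 0 keeps reusing expert 1.
trace = [(0, 1), (1, 2), (0, 1), (1, 3), (0, 1), (1, 2)]
for capacity in (1, 3):
    print(capacity, cache_hit_rate(trace, capacity))
```

Even though layer 0 reuses the same expert for every token here, with a capacity of 1 it is evicted by layer 1's load before the next token arrives.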

zozbot234 · 2026-04-02T18:15:39 1775153739

You could always offload some layers to the NPU for lower power use and leave the rest to the GPU. If the latter is power throttled (common for prefill, not for decode) that will be a performance improvement.

zozbot234 · 2026-04-02T17:27:52 1775150872

Qwen is actually a pretty strong player in the Chinese market. There is an implied "salt the ground" play but it's mostly from hardware makers, who are trying to keep the big AI players honest and also stand to gain if local inference becomes popular.