If you don't care about how it's architected, why do you care about size? Compare it to Q3.5 397B-A17B.
Just as smaller models are a speed/cost optimization, so is MoE.
G4 26B-A4B does 150 t/s on a 4090/5090 and 80 t/s on an M5 Max. Q3.5 35B-A3B is comparably fast. They are flash-lite/nano class models.
G4 31B, despite a small increase in total parameter count, is over 5x slower. Q3.5 27B is comparably slow. They approximate flash/mini class models (I believe the sizes of proprietary models in this class are closer to Q3.5 122B-A10B or Llama 4 Scout's 109B-A17B).
The implication is that there is (or should be) a major speed difference: naively you'd expect the MoE to be roughly 10x faster and cheaper, which can be pretty relevant on real-world tasks.
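A back-of-the-envelope sketch of that naive expectation, under the (oversimplified) assumption that decode throughput is inversely proportional to active parameters per token. Real gains are smaller because of routing overhead and memory traffic for the total parameter count:

```python
# Naive speed ratio: assumes decode cost scales with ACTIVE parameters only.
# Ignores expert-routing overhead and total-parameter memory bandwidth.
def naive_speedup(dense_params_b: float, moe_active_params_b: float) -> float:
    return dense_params_b / moe_active_params_b

# Using the parameter counts from the comment above:
print(naive_speedup(31, 4))  # G4 31B dense vs 26B-A4B -> 7.75x
print(naive_speedup(27, 3))  # Q3.5 27B dense vs 35B-A3B -> 9.0x
```

Both ratios land near the "10x faster" ballpark, which is the whole point of the comparison.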
If you do that, it expands your test matrix quadratically.
So, it makes sense if you have infinite testing budgets.
Personally, I prefer exhaustively testing the upgrade path, and investing in reducing the time it takes to push out a hot fix. Chicken bits are also good.
I haven’t heard of any real-world situations where supporting downgrades of persistent formats led to best-in-class product stability.
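The quadratic blowup above can be sketched with a hypothetical version list: upgrade-only testing covers older-to-newer pairs, while supporting downgrades means every ordered pair of versions is a migration you might have to test.

```python
from itertools import combinations, permutations

versions = ["1.0", "1.1", "1.2", "2.0"]  # hypothetical release list

# Upgrade-only: each older -> newer pair, n*(n-1)/2 of them.
upgrades = list(combinations(versions, 2))

# Up + down: every ordered pair, n*(n-1) of them - quadratic growth.
both_ways = list(permutations(versions, 2))

print(len(upgrades))   # 6 pairs for 4 versions
print(len(both_ways))  # 12 pairs for 4 versions
```

At 4 versions the difference looks tame; at 20 versions it's 190 vs 380 migrations, and each one needs its own test data.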
So someone is debugging something with git bisect, stumbles on the old commit, and gets pwned. Maybe that's why they force-killed it: to avoid people going back in history and stumbling on it.
CPU/network throttling needs to be turned on for the product manager and management - that's the only way you might see real change.
We have some egregious slowness in our app that only shows up for our largest customers in production, but none of our organizations in development have that much data. I created a load-testing organization, and I keep considering adding management to it so they implicitly get the idea that fixing the slowness is important.
I made the argument multiple times that the right answer to many prompts would be a question. It was allowed under some rare circumstances, but far too few.
I suspect that's in part because the provider also didn't want to create an easy cop-out for the people working on the fine-tuning side. A lot of my work was auditing and reviewing output, and there was indeed a lot of really sloppy work, up to and including copy-pasting output from other LLMs - we know, because on more than one occasion I caught people who had managed to include part of Claude's website footer in their answer...
https://github.com/Nano-Collective/nanocoder