
I built TurboQuant+ (https://github.com/TheTom/llama-cpp-turboquant), the llama.cpp implementation of this paper, with extensions: asymmetric K/V compression, boundary-layer protection, sparse V dequant, and, as of this week, weight compression (TQ4_1S) that shrinks models 28-42% on disk with minimal quality loss. It has 5k+ stars and 50+ community testers across Metal, CUDA, and AMD HIP.
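For anyone unfamiliar with llama.cpp's quant formats: a blockwise 4-bit format stores a couple of floats per block plus packed nibbles. Here's a minimal Q4_1-style sketch (per-block scale + minimum); this is a simplification for illustration, not the actual TQ4_1S layout, and the names are mine:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    constexpr int QK = 32;      // values per block, as in llama.cpp's Q4_1

    struct BlockQ4 {
        float   d;              // scale
        float   m;              // block minimum
        uint8_t qs[QK / 2];     // 32 4-bit codes, two per byte
    };

    void quantize_block(const float *x, BlockQ4 *b) {
        float mn = x[0], mx = x[0];
        for (int i = 1; i < QK; ++i) {
            mn = std::min(mn, x[i]);
            mx = std::max(mx, x[i]);
        }
        b->m = mn;
        b->d = (mx - mn) / 15.0f;                 // 16 levels for 4 bits
        const float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
        for (int i = 0; i < QK; i += 2) {
            uint8_t q0 = (uint8_t)std::min(15.0f, std::round((x[i]     - mn) * id));
            uint8_t q1 = (uint8_t)std::min(15.0f, std::round((x[i + 1] - mn) * id));
            b->qs[i / 2] = q0 | (q1 << 4);        // pack two nibbles per byte
        }
    }

    void dequantize_block(const BlockQ4 *b, float *y) {
        for (int i = 0; i < QK; i += 2) {
            y[i]     = b->m + b->d * (float)(b->qs[i / 2] & 0x0F);
            y[i + 1] = b->m + b->d * (float)(b->qs[i / 2] >> 4);
        }
    }

The real llama.cpp formats store the scale/min as fp16 and pack differently, but the shape of the idea is the same.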

Cool to see the same WHT + Lloyd-Max math applied to vector search. The data-oblivious codebook property is exactly what makes it work for online KV cache compression too. No calibration, no training, just quantize and go.
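The core loop really is tiny. A simplified 2-bit sketch of the rotate-then-snap idea (n must be a power of two; random sign flips, bit packing, and the asymmetric K/V handling are omitted, and the names here are mine, not the repo's):

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // In-place fast Walsh-Hadamard transform with orthonormal scaling.
    void fwht(float *x, int n) {
        for (int h = 1; h < n; h <<= 1)
            for (int i = 0; i < n; i += 2 * h)
                for (int j = i; j < i + h; ++j) {
                    const float a = x[j], b = x[j + h];
                    x[j]     = a + b;
                    x[j + h] = a - b;
                }
        const float s = 1.0f / std::sqrt((float)n);
        for (int i = 0; i < n; ++i) x[i] *= s;
    }

    // Fixed 2-bit Lloyd-Max levels for N(0,1) -- the data-oblivious part.
    // After rotation the coordinates are near-Gaussian, so one codebook
    // serves every vector with no calibration pass.
    const float LM2[4] = {-1.510f, -0.4528f, 0.4528f, 1.510f};

    std::vector<uint8_t> quantize(std::vector<float> x, float *scale) {
        fwht(x.data(), (int)x.size());
        float ss = 0.0f;
        for (float v : x) ss += v * v;
        *scale = std::sqrt(ss / (float)x.size());  // per-vector RMS ~ sigma
        if (*scale == 0.0f) *scale = 1.0f;         // guard all-zero input
        std::vector<uint8_t> codes(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            const float v = x[i] / *scale;
            uint8_t best = 0;                      // nearest of the 4 levels
            for (uint8_t k = 1; k < 4; ++k)
                if (std::fabs(v - LM2[k]) < std::fabs(v - LM2[best])) best = k;
            codes[i] = best;
        }
        return codes;
    }

    std::vector<float> dequantize(const std::vector<uint8_t> &codes, float scale) {
        std::vector<float> y(codes.size());
        for (size_t i = 0; i < codes.size(); ++i) y[i] = scale * LM2[codes[i]];
        fwht(y.data(), (int)y.size());             // H/sqrt(n) is self-inverse
        return y;
    }

Since H/sqrt(n) is orthogonal and symmetric, the same fwht call undoes the rotation on dequant, which keeps the hot path branch-free.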

If anyone is running local LLMs and wants to try it: https://github.com/TheTom/turboquant_plus/blob/main/docs/get...


TQ4_1S on model weights with minimal quality loss is really great. The MR discussion thread with results is especially interesting: some models see a much larger PPL increase than others, so model size and architecture probably play a part. Are there consolidated learnings from all the experiments? Thanks for this!


Hey, that's me! AMA

