
I built TurboQuant+ (https://github.com/TheTom/llama-cpp-turboquant), the llama.cpp implementation of this paper, with extensions: asymmetric K/V compression, boundary-layer protection, sparse V dequant, and, as of this week, weight compression (TQ4_1S) that shrinks models 28-42% on disk with minimal quality loss. It has 5k+ stars and 50+ community testers across Metal, CUDA, and AMD HIP.
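For anyone unfamiliar with llama.cpp's quant formats: a blockwise 4-bit format stores a couple of floats per block plus packed nibbles. Here's a minimal Q4_1-style sketch (per-block scale + minimum); this is a simplification for illustration, not the actual TQ4_1S layout, and the names are mine:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    constexpr int QK = 32;      // values per block, as in llama.cpp's Q4_1

    struct BlockQ4 {
        float   d;              // scale
        float   m;              // block minimum
        uint8_t qs[QK / 2];     // 32 4-bit codes, two per byte
    };

    void quantize_block(const float *x, BlockQ4 *b) {
        float mn = x[0], mx = x[0];
        for (int i = 1; i < QK; ++i) {
            mn = std::min(mn, x[i]);
            mx = std::max(mx, x[i]);
        }
        b->m = mn;
        b->d = (mx - mn) / 15.0f;                 // 16 levels for 4 bits
        const float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
        for (int i = 0; i < QK; i += 2) {
            uint8_t q0 = (uint8_t)std::min(15.0f, std::round((x[i]     - mn) * id));
            uint8_t q1 = (uint8_t)std::min(15.0f, std::round((x[i + 1] - mn) * id));
            b->qs[i / 2] = q0 | (q1 << 4);        // pack two nibbles per byte
        }
    }

    void dequantize_block(const BlockQ4 *b, float *y) {
        for (int i = 0; i < QK; i += 2) {
            y[i]     = b->m + b->d * (float)(b->qs[i / 2] & 0x0F);
            y[i + 1] = b->m + b->d * (float)(b->qs[i / 2] >> 4);
        }
    }

The real llama.cpp formats store the scale/min as fp16 and pack differently, but the shape of the idea is the same.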

Cool to see the same WHT + Lloyd-Max math applied to vector search. The data-oblivious codebook property is exactly what makes it work for online KV cache compression too. No calibration, no training, just quantize and go.
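The core loop really is tiny. A simplified 2-bit sketch of the rotate-then-snap idea (n must be a power of two; random sign flips, bit packing, and the asymmetric K/V handling are omitted, and the names here are mine, not the repo's):

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // In-place fast Walsh-Hadamard transform with orthonormal scaling.
    void fwht(float *x, int n) {
        for (int h = 1; h < n; h <<= 1)
            for (int i = 0; i < n; i += 2 * h)
                for (int j = i; j < i + h; ++j) {
                    const float a = x[j], b = x[j + h];
                    x[j]     = a + b;
                    x[j + h] = a - b;
                }
        const float s = 1.0f / std::sqrt((float)n);
        for (int i = 0; i < n; ++i) x[i] *= s;
    }

    // Fixed 2-bit Lloyd-Max levels for N(0,1) -- the data-oblivious part.
    // After rotation the coordinates are near-Gaussian, so one codebook
    // serves every vector with no calibration pass.
    const float LM2[4] = {-1.510f, -0.4528f, 0.4528f, 1.510f};

    std::vector<uint8_t> quantize(std::vector<float> x, float *scale) {
        fwht(x.data(), (int)x.size());
        float ss = 0.0f;
        for (float v : x) ss += v * v;
        *scale = std::sqrt(ss / (float)x.size());  // per-vector RMS ~ sigma
        if (*scale == 0.0f) *scale = 1.0f;         // guard all-zero input
        std::vector<uint8_t> codes(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            const float v = x[i] / *scale;
            uint8_t best = 0;                      // nearest of the 4 levels
            for (uint8_t k = 1; k < 4; ++k)
                if (std::fabs(v - LM2[k]) < std::fabs(v - LM2[best])) best = k;
            codes[i] = best;
        }
        return codes;
    }

    std::vector<float> dequantize(const std::vector<uint8_t> &codes, float scale) {
        std::vector<float> y(codes.size());
        for (size_t i = 0; i < codes.size(); ++i) y[i] = scale * LM2[codes[i]];
        fwht(y.data(), (int)y.size());             // H/sqrt(n) is self-inverse
        return y;
    }

Since H/sqrt(n) is orthogonal and symmetric, the same fwht call undoes the rotation on dequant, which keeps the hot path branch-free.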

If anyone is running local LLMs and wants to try it: https://github.com/TheTom/turboquant_plus/blob/main/docs/get...


TQ4_1S on model weights with minimal quality loss is really great. The MR discussion thread with results is especially interesting: some models see a much larger PPL increase than others, so model size and architecture probably play a part. Are there consolidated learnings from all the experiments? Thanks for this!


Hey, that's me! AMA

