Patrick_Devine's comments

They are nvidia-fp4 weights, but CUDA support isn't _quite_ ready yet; we've got that cooking.

The 35b-a3b-coding-nvfp4 model has the recommended hyperparameters set for coding, not chatting. If you want to use it to chat, you can pull the `35b-a3b-nvfp4` model (it doesn't need to re-download the weights, so it will pull quickly), which has the presence penalty turned on; that will stop it from thinking so much. You can also try `/set nothink` in the CLI, which will turn off thinking entirely.
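For anyone curious what a presence penalty actually does during sampling, here's a toy sketch. This follows the common OpenAI-style definition (a flat penalty subtracted from the logit of any token that has already appeared); it is an illustration, not Ollama's actual sampler code.

```python
def apply_presence_penalty(logits, generated_tokens, penalty=1.5):
    """Subtract `penalty` from the logit of every token id that has
    already been generated, discouraging the model from repeating itself."""
    seen = set(generated_tokens)
    return [l - penalty if i in seen else l
            for i, l in enumerate(logits)]

# Vocabulary of 4 tokens; tokens 0 and 3 were already generated.
logits = [2.0, 0.5, 1.0, 3.0]
adjusted = apply_presence_penalty(logits, generated_tokens=[0, 3])
print(adjusted)  # → [0.5, 0.5, 1.0, 1.5]
```

Because the penalty is applied once per distinct token (not per occurrence), it nudges the model toward fresh tokens, which is why it cuts down on long repetitive "thinking" loops.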

Try it with mxfp8 or bf16. It's a decent model for doing tool calling, but I wouldn't recommend using it with 4-bit quantization.

I noticed the same thing. I'm assuming they forgot to Photoshop out the Chinese characters.


The Departing / Arrival airports plus a full track would be absolutely amazing.


5 years is a normal-ish depreciation time frame. I know they're gaming GPUs, but the RTX 3090 came out ~4.5 years before the RTX 5090. The 5090 has double the performance and 1/3 more memory (24 GB → 32 GB). The 3090 is still a useful card even after 5 years.


RTX 3090 MSRP: 1500 USD

RTX 5090 MSRP: 2000 USD


The instruct models are available on Ollama (e.g. `ollama run ministral-3:8b`); the reasoning models, however, are still a WIP. I was trying to get them to work last night: single-turn works, but multi-turn is still very flaky.


The default ones on Ollama are MXFP4 for the feed-forward network and use BF16 for the attention weights. The default weights for llama.cpp quantize those tensors as q8_0, which is why llama.cpp can eke out a little bit more performance at the cost of worse output. If you are using this for coding, you definitely want better output.

You can use the command `ollama show -v gpt-oss:120b` to see the datatype of each tensor.
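To see why fewer bits cost output quality, here's a toy absmax block-quantization sketch. This is a deliberately simplified illustration, not the actual ggml q8_0 or MXFP4 kernels (those use fixed block sizes and different scale encodings), but it shows the basic precision trade-off.

```python
def quantize_block(values, bits):
    """Absmax-quantize a block of floats to signed `bits`-bit integers,
    then dequantize; returns the reconstructed values."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) * scale for v in values]

weights = [0.12, -0.87, 0.45, 0.03]
err8 = max(abs(a - b) for a, b in zip(weights, quantize_block(weights, 8)))
err4 = max(abs(a - b) for a, b in zip(weights, quantize_block(weights, 4)))
print(err4 > err8)  # → True: 4-bit loses noticeably more precision
```

With only 15 representable levels per block at 4 bits (vs. 255 at 8 bits), small weights get rounded much more coarsely, which is the kind of error that shows up as degraded code generation.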


We uploaded gemma3:270m-it-q8_0 and gemma3:270m-it-fp16 late last night which have better results. The q4_0 is the QAT model, but we're still looking at it as there are some issues.


Ollama only uses llama.cpp for running legacy models. gpt-oss runs entirely in the Ollama engine.

You don't need to use Turbo mode; it's just there for people who don't have capable enough GPUs.

