The 35b-a3b-coding-nvfp4 model has the recommended hyperparameters set for coding, not chatting. If you want to use it to chat, you can pull the `35b-a3b-nvfp4` model instead (it shares the same weights, so there's no re-download and it will pull quickly); that one has the presence penalty turned on, which stops it from thinking so much. You can also try `/set nothink` in the CLI, which turns off thinking entirely.
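Concretely, that workflow looks something like this in the terminal (a sketch; `/set nothink` is typed at the interactive prompt inside `ollama run`, not in the shell):

```shell
# Pull the chat-tuned variant; it reuses the already-downloaded
# weights, so only the small config layers transfer.
ollama pull 35b-a3b-nvfp4

# Or keep the coding variant and disable thinking interactively:
ollama run 35b-a3b-coding-nvfp4
# then, at the >>> prompt:
#   >>> /set nothink
```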
5 years is a normal-ish depreciation time frame. I know they are gaming GPUs, but the RTX 3090 came out ~4.5 years before the RTX 5090, which has double the performance and a third more memory. The 3090 is still a useful card even after 5 years.
The instruct models are available on Ollama (e.g. `ollama run ministral-3:8b`); however, the reasoning models are still a work in progress. I was trying to get them running last night: single turn works, but multi-turn is still flaky.
The default ones on Ollama use MXFP4 for the feed-forward network and BF16 for the attention weights. The default weights for llama.cpp quantize those attention tensors as q8_0, which is why llama.cpp can eke out a little more performance, at the cost of worse output. If you are using this for coding, you definitely want the better output.
You can use the command `ollama show -v gpt-oss:120b` to see the datatype of each tensor.
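For example, to compare the two builds you can dump the per-tensor datatypes and filter for the attention weights (the `blk.N.attn_*` tensor names assumed here follow the usual GGUF convention; the exact names may differ per model):

```shell
# List every tensor with its datatype for the default quantization
ollama show -v gpt-oss:120b

# Narrow to the attention tensors, where the Ollama (BF16) and
# llama.cpp (q8_0) defaults differ
ollama show -v gpt-oss:120b | grep attn
```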
We uploaded gemma3:270m-it-q8_0 and gemma3:270m-it-fp16 late last night, which give better results. The q4_0 is the QAT model, but we're still looking into it as there are some issues.