The 35b-a3b-coding-nvfp4 model has the recommended hyperparameters set for coding, not chatting. If you want to use it to chat, you can pull the `35b-a3b-nvfp4` model instead (it shares the same weights, so there's no re-download and it will pull quickly); that one has the presence penalty turned on, which stops it from thinking so much. You can also try `/set nothink` in the CLI, which turns off thinking entirely.
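Concretely, that workflow looks something like this in the terminal (a sketch; `/set nothink` is typed at the interactive prompt inside `ollama run`, not in the shell):

```shell
# Pull the chat-tuned variant; it reuses the already-downloaded
# weights, so only the small config layers transfer.
ollama pull 35b-a3b-nvfp4

# Or keep the coding variant and disable thinking interactively:
ollama run 35b-a3b-coding-nvfp4
# then, at the >>> prompt:
#   >>> /set nothink
```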
5 years is a normal-ish depreciation time frame. I know they are gaming GPUs, but the RTX 3090 came out ~4.5 years before the RTX 5090, which has double the performance and a third more memory. The 3090 is still a useful card even after 5 years.
The instruct models are available on Ollama (e.g. `ollama run ministral-3:8b`); however, the reasoning models are still a work in progress. I was trying to get them running last night: single turn works, but multi-turn is still flaky.
The default ones on Ollama use MXFP4 for the feed-forward network and BF16 for the attention weights. The default weights for llama.cpp quantize those attention tensors as q8_0, which is why llama.cpp can eke out a little more performance, at the cost of worse output. If you are using this for coding, you definitely want the better output.
You can use the command `ollama show -v gpt-oss:120b` to see the datatype of each tensor.
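For example, to compare the two builds you can dump the per-tensor datatypes and filter for the attention weights (the `blk.N.attn_*` tensor names assumed here follow the usual GGUF convention; the exact names may differ per model):

```shell
# List every tensor with its datatype for the default quantization
ollama show -v gpt-oss:120b

# Narrow to the attention tensors, where the Ollama (BF16) and
# llama.cpp (q8_0) defaults differ
ollama show -v gpt-oss:120b | grep attn
```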
We uploaded gemma3:270m-it-q8_0 and gemma3:270m-it-fp16 late last night, which give better results. The q4_0 is the QAT model, but we're still looking into it as there are some issues.