I have an M4 Max with 48GB RAM. Anyone have any tips for good local models? Cont...

zozbot234 · 2026-03-31T11:18:58 1774955938

> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

Qwen thinking likes to second-guess itself a LOT when faced with simple/vague prompts like that. (I'll answer it this way. Generating output. Wait, I'll answer it that way. Generating output. Wait, I'll answer it this way... lather, rinse, repeat.) I suppose this is their version of "super smart fancy thinking mode". Try something more complex instead.

drob518 · 2026-03-31T11:28:59 1774956539

Indeed. Qwen doesn’t just second guess itself, it third and fourth guesses itself.

Kichererbsen · 2026-03-31T15:15:50 1774970150

Solid Terry Pratchett reference right there.

domh · 2026-03-31T11:41:10 1774957270

OK thanks! That's helpful. I ignorantly assumed simpler prompt == faster first response.

functional_dev · 2026-03-31T14:39:11 1774967951

I did not know, that NVFP4 was handled at the silicon level... until I dug deeper here - https://vectree.io/c/llm-quantization-from-weights-to-bits-g...

duffyjp · 2026-03-31T16:56:33 1774976193

I still don't think I understand it. I saw those nvfp4 models up by chance yesterday and tried them on my Linux PC with a 5060TI 16gb. Ollama refused to pull them saying they were macOS only.

I assumed it was a meta-data bug and posted an issue, but apparently nvfp4 doesn't necessarily mean nvidia-fp4.

https://github.com/ollama/ollama/issues/15149

Patrick_Devine · 2026-03-31T22:41:35 1774996895

They are nvidia-fp4 weights, but CUDA support isn't _quite_ ready yet, but we've got that cooking.

kylehotchkiss · 2026-03-31T17:34:28 1774978468

I made my M2 Max generate a biryani recipe for me last night with 64gb ram and the baseline qwen3.5:35b model. I used the newest ollama with MLX.

https://gist.github.com/kylehotchkiss/8f28e6c75f22a56e8d2d31...

Under 3 minutes to get all that. The thinking is amusing, my laptop got quite warm, but for a 35b model on nearly 4 year old hardware, I see the light. This is the future.

Patrick_Devine · 2026-03-31T22:39:22 1774996762

The 35b-a3b-coding-nvfp4 model has the recommended hyperparameters set for coding, not chatting. If you want to use it to chat you can pull the `35b-a3b-nvfp4` model (it doesn't need to re-download the weights again so it will pull quickly) which has the presence penalty turned on which will stop it from thinking so much. You can also try `/set nothink` in the CLI which will turn off thinking entirely.

Octoth0rpe · 2026-03-31T11:11:34 1774955494

> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".

That's not an unsurprising result given the pretty ambiguous query, hence all the thinking. Asking "write a simple hello world program in python3" results in a much faster response for me (m4 base w/ 24gb, using qwen3.6:9b).

xienze · 2026-03-31T11:31:47 1774956707

Well, two things. First, “hi” isn’t a good prompt for these thinking models. They’ll have an identity crisis trying to answer it. Stupid, but it’s how it is. Stick to real questions.

Second, for the best performance on a Mac you want to use an MLX model.

domh · 2026-03-31T11:42:13 1774957333

Thanks! I assumed simpler == faster, but my ignorance is showing itself.

I am using the model they recommended in the blog post - which I assumed was using MLX?

fooker · 2026-03-31T13:42:52 1774964572

Avoid reasoning models in any situation where you have low tokens/second

EagnaIonat · 2026-03-31T14:47:10 1774968430

When MLX comes out you will see a huge difference. I currently moved to LMStudio as it currently supports MLX.