I have an M4 Max with 48GB RAM. Anyone have tips for good local models? Context length? Using the model recommended in the blog post (qwen3.5:35b-a3b-coding-nvfp4) with Ollama 0.19.0, and it can take anywhere between 6 and 25 seconds to respond (after lots of thinking) to me asking "Hello world". Is this the best that's currently achievable with my hardware, or is there something that can be configured to get better results?
> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".
Qwen thinking likes to second-guess itself a LOT when faced with simple/vague prompts like that. (I'll answer it this way. Generating output. Wait, I'll answer it that way. Generating output. Wait, I'll answer it this way... lather, rinse, repeat.) I suppose this is their version of "super smart fancy thinking mode". Try something more complex instead.
I still don't think I understand it. I saw those nvfp4 models go up by chance yesterday and tried them on my Linux PC with a 5060 Ti 16GB. Ollama refused to pull them, saying they were macOS-only.
I assumed it was a metadata bug and posted an issue, but apparently nvfp4 doesn't necessarily mean nvidia-fp4.
Under 3 minutes to get all that. The thinking is amusing and my laptop got quite warm, but for a 35b model on nearly 4-year-old hardware, I see the light. This is the future.
The 35b-a3b-coding-nvfp4 model has the recommended hyperparameters set for coding, not chatting. If you want to chat with it, pull the `35b-a3b-nvfp4` model instead (it shares the same weights, so it won't download them again and will pull quickly); that one has the presence penalty turned on, which stops it from thinking so much. You can also try `/set nothink` in the CLI, which turns off thinking entirely.
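For reference, a sketch of what that switch looks like in practice. The tag names are the ones mentioned above, and `/set nothink` assumes a recent Ollama CLI that supports thinking models:

```shell
# Pull the chat-tuned variant; the weights are shared with the
# coding tag, so only the small differing layers are fetched.
ollama pull qwen3.5:35b-a3b-nvfp4

# Run it interactively...
ollama run qwen3.5:35b-a3b-nvfp4

# ...and inside the interactive session, optionally disable
# thinking entirely:
#   >>> /set nothink
```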
> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world".
That's not a surprising result given the pretty ambiguous query, hence all the thinking. Asking "write a simple hello world program in python3" gets a much faster response for me (M4 base w/ 24GB, using qwen3.6:9b).
Well, two things. First, “hi” isn’t a good prompt for these thinking models. They’ll have an identity crisis trying to answer it. Stupid, but it’s how it is. Stick to real questions.
Second, for the best performance on a Mac you want to use an MLX model.
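If you want to try the MLX route, the `mlx-lm` package is the usual entry point (a sketch; the model repo name below is a placeholder, not a real tag — pick an actual converted model from the `mlx-community` namespace on Hugging Face):

```shell
# Install the MLX language-model tooling (Apple Silicon only).
pip install mlx-lm

# Generate with an MLX-converted model from Hugging Face.
# "mlx-community/SOME-QWEN-MODEL" is a placeholder; substitute
# a real repo from the mlx-community namespace.
mlx_lm.generate \
  --model mlx-community/SOME-QWEN-MODEL \
  --prompt "write a simple hello world program in python3"
```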