
Spending more time than I should on a Sunday playing with r1/o1/sonnet code generation, my impressions are:

1. Sonnet is still the best model for me. It makes fewer mistakes than o1 and r1, and one can ask it to make a plan and think about the request before writing code. I am not sure the whole "reasoning/thinking" process of o1/r1 is as much of an advantage as it is supposed to be. And even if sonnet makes mistakes too, iterations with sonnet are faster than with o1/r1 at least.

2. r1 is good (better than previous deepseek models imo, and especially better at following instructions, which was my problem with deepseek models so far). The smaller models are very interesting. But its thought process often overcomplicates things, and it thinks more than imo it should. I am not sure all that thinking always helps build a better context for writing the code, which is what the thinking is actually for, if we are honest.

3. My main problem with deepseek is that the thinking blocks are huge and it runs out of context (I think? Or is kagi's provider just unstable?) after a few iterations. Maybe if the thinking blocks from previous answers were not used for computing new answers it would help. I am not sure what o1 does here; I doubt the previous thinking carries over in the context.

4. o1 seems around the same level as r1 imo when r1 does nothing weird, but r1 does more weird things (though I use it through github copilot and it does not give me the thinking blocks). I am pretty sure one can find tasks where o1 performs better and tasks where r1 performs better. That does not mean much to me.

Maybe other use cases give different results than code generation. Maybe web/js code generation would also give different results than mine. But I do not see anything that really impresses me in what I actually need these tools for (beyond the current SOTA baseline, which is sonnet).

I would like to play more with the r1 distillations locally though, and in general I would probably try to handle the thinking-block context differently. Or maybe use aider with the dual-model approach, where an r1/sonnet combo seems to give great results. I think there is potential, but not just as-is.

In general I do not understand the whole "panicking" thing. I do not think anybody panics over r1; it is very good, but nothing more exceptional than what we have already seen, unless they thought that only American companies could produce SOTA-level models, which was wrong already (previous deepseek and qwen models were already at similar levels). If anything, openai's and anthropic's models are more polished. It sounds a bit sensational to me, but then again, who knows; I do not trust the grounding in reality that AI companies have, so they may be panicking indeed.



> Maybe if the thinking blocks from previous answers where not used for computing new answers it would help

Deepseek specifically recommends users ensure their setups do not feed the thinking portion back into the context because it can confuse the AI.
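In practice that cleanup is a small transform on the conversation history. A minimal sketch, assuming an OpenAI-style messages list and R1's `<think>…</think>` delimiters (the exact tags depend on the serving stack):

```python
import re

# R1-style reasoning is wrapped in <think>...</think> in the assistant output.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(messages):
    """Drop reasoning blocks from prior assistant turns before re-sending
    the history, so only the final answers stay in context."""
    cleaned = []
    for m in messages:
        if m["role"] == "assistant":
            m = {**m, "content": THINK_RE.sub("", m["content"]).strip()}
        cleaned.append(m)
    return cleaned
```

Besides following the recommendation, this also frees a lot of context window, since the thinking blocks are often much longer than the answers themselves.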

They also recommend against prompt engineering. Just make your request as simple and specific as possible.

I need to go try Claude now because everyone is raving about it. I’ve been throwing hard, esoteric coding questions at R1 and I’ve been very impressed. The distillations though do not hold a candle to the real R1 given the same prompts.


Does R1 code actually compile and work as expected? Even small local models are great at answering confidently and plausibly. Luckily, coding responses are easily verifiable, unlike fuzzier topics.


The panic is because a lot of beliefs have been challenged by r1, and those who made investments based on these beliefs will now face losses.


Based on my personal testing for coding, I still find Claude Sonnet the best, and it's easy to understand the code written by Claude (I like their code structure, or maybe at this point I am just used to Claude's style).


I also feel the same. I like the way sonnet answers and writes code, and I think I liked qwen 2.5 coder because it reminded me of sonnet (I highly suspect it was trained on sonnet's output). Moreover, having worked with sonnet for several months, I have system prompts for specific languages/uses that help produce the output I want and work well with it; e.g. I can get it to produce functions together with unit tests and examples written in a way very similar to what I would have written, which helps a lot in understanding and debugging the code (because I find manual changes inevitable in general). It is not easy to then switch to o1/r1, when their guidelines are to avoid exactly this kind of thing (system prompts, examples etc). And this matches my limited experience with them; plus, going back and forth to fix details is painful (here I actually like zed's approach, where you are able to edit their outputs directly).

Maybe a way to use them would be to pair them with a second model, like aider does: I could see r1 producing something and a second model starting from its output, or maybe having more control over when it thinks and when it does not.
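The two-model split that aider popularized (it calls the roles "architect" and "editor") can be sketched roughly like this; the `chat()` helper here is hypothetical, standing in for whatever API client you use, and the model names are placeholders:

```python
def architect_editor(task: str, chat) -> str:
    """Two-pass generation: a reasoning model plans, a coding model implements.

    `chat(model=..., prompt=...)` is a hypothetical helper returning the
    model's text response; swap in your actual client."""
    # Pass 1: the reasoning model (e.g. r1) produces a plan, not code.
    plan = chat(
        model="reasoning-model",  # placeholder name
        prompt=f"Plan, step by step, how to implement: {task}. Do not write code.",
    )
    # Pass 2: the coding model (e.g. sonnet) implements the plan.
    return chat(
        model="coding-model",  # placeholder name
        prompt=f"Implement this plan as code:\n{plan}",
    )
```

The appeal is that the reasoning model's thinking never enters the coder's context directly; only the distilled plan does.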

I believe these models must be pretty useful for some kinds of tasks different from what I use sonnet for right now.


Sonnet isn't just better, it actually succeeds where R1 utterly fails after many minutes of "thinking" and back-and-forth prompting on a simple task: writing a go cli that does icmp ping without requiring root or suid, or calling the external ping cmd.

Faster too.
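For reference, the non-root approach that task usually comes down to on Linux is an unprivileged datagram ICMP socket (allowed when `net.ipv4.ping_group_range` includes your group); the commenter wanted Go, but the wire format is the same everywhere. A sketch of the echo-request construction in Python, with the socket part left as a comment since it needs that sysctl and network access:

```python
import struct

def icmp_checksum(data: bytes) -> int:
    # RFC 1071 internet checksum: one's-complement sum of 16-bit words.
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold carry back in
    return ~total & 0xFFFF

def build_echo_request(ident: int, seq: int, payload: bytes = b"ping") -> bytes:
    # ICMP echo request header: type=8, code=0, checksum, identifier, sequence.
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload

# Sending without root on Linux (when ping_group_range permits):
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_ICMP)
#   sock.sendto(build_echo_request(0, 1), ("192.0.2.1", 0))
```

A handy property for verifying the output: recomputing the checksum over a complete, valid packet yields zero.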



