Claude is only better in some cherry-picked standard eval benchmarks, which are becoming more useless every month due to the likelihood of these tests leaking into training data. If you look at the Chatbot Arena rankings, where actual users blindly select the best answer from a random choice of models, the top 3 models are all from OpenAI. And the next best ones are from Google and X.
I'm subscribed to all of Claude, Gemini, and ChatGPT. Benchmarks aside, my go-to is always Claude. Subjectively speaking, it consistently gives better results than anything else out there. The only reason I keep the other subscriptions is to check in on them occasionally to see if they've improved.
I don't pay any attention to leaderboards. I pay for both Claude and ChatGPT and use them both daily for anything from Python coding to the most random questions I can think of. In my experience Claude is better (much better) than ChatGPT in almost all use cases. Where ChatGPT shines is the voice assistant - it still feels almost magical having a "human-like" conversation with the AI agent.
Anecdotally, I disagree. Since the release of the "new" 3.5 Sonnet, it has given me consistently better results than Copilot based on GPT-4o.
I've been using LLMs as my rubber duck when I get stuck debugging something and have exhausted my standard avenues. GPT-4o tends to give me very general advice that I have almost always already tried or considered, while Claude is happy to say "this snippet looks potentially incorrect; please verify XYZ" and it has gotten me back on track in maybe 4/5 cases.
Bullshit. Claude 3.5 Sonnet owns the competition according to the most useful benchmark: operating a robot body in the real world. No other model comes close.
This seems incorrect. I don't need Claude 3.5 Sonnet to operate a robot body for me, and I don't know anyone else who does. And general-purpose robotics is never going to be the most efficient way to have robots do many tasks, and certainly not in the short term.
Of course not, but the task requires excellent image understanding, a large context window, a mix of structured and unstructured output, high-level and spatial reasoning, and a conversational layer on top.
I find it’s predictive of relative performance in other tasks I use LLMs for. Claude is the best. The only shortcoming is its peculiar verbosity.
Definitely superior to anything OpenAI has and miles beyond the “open weights” alternatives like Llama.
The problem is that it also fails on fairly simple logic puzzles that ChatGPT can do just fine.
For example, even the new 3.5 Sonnet can't solve this reliably:
> Doom Slayer needs to teleport from Phobos to Deimos. He has his pet bunny, his pet cacodemon, and a UAC scientist who tagged along. The Doom Slayer can only teleport with one of them at a time. But if he leaves the bunny and the cacodemon together alone, the bunny will eat the cacodemon. And if he leaves the cacodemon and the scientist alone, the cacodemon will eat the scientist. How should the Doom Slayer get himself and all his companions safely to Deimos?
In fact, not only is its solution wrong, it can't figure out why it's wrong on its own even if you ask it to self-check.
In contrast, GPT-4o consistently gives the correct response.
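For what it's worth, the puzzle is just a river-crossing problem in disguise (bunny/cacodemon/scientist instead of wolf/goat/cabbage), so it's trivial to brute-force. Here's a quick sketch (my own throwaway code, not anything the models produced) that confirms the classic 7-trip answer, with the cacodemon moved first:

```python
from collections import deque

# State: sides of (slayer, bunny, cacodemon, scientist); 0 = Phobos, 1 = Deimos.
START, GOAL = (0, 0, 0, 0), (1, 1, 1, 1)
NAMES = (None, "bunny", "cacodemon", "scientist")

def safe(state):
    """A side without the Slayer must not hold bunny+cacodemon or cacodemon+scientist."""
    slayer, bunny, caco, sci = state
    if bunny == caco != slayer:   # bunny eats the cacodemon
        return False
    if caco == sci != slayer:     # cacodemon eats the scientist
        return False
    return True

def solve():
    """BFS over teleport moves; returns the shortest list of trips."""
    queue = deque([(START, [])])
    seen = {START}
    while queue:
        state, path = queue.popleft()
        if state == GOAL:
            return path
        slayer = state[0]
        # Teleport alone (p is None) or with one companion on the Slayer's side.
        for p in (None, 1, 2, 3):
            if p is not None and state[p] != slayer:
                continue
            nxt = list(state)
            nxt[0] ^= 1
            if p is not None:
                nxt[p] ^= 1
            nxt = tuple(nxt)
            if nxt not in seen and safe(nxt):
                seen.add(nxt)
                queue.append((nxt, path + [NAMES[p] if p else "alone"]))
    return None

print(solve())
```

The search shows the first move is forced: only taking the cacodemon leaves both sides safe, and the full solution takes 7 teleports (cacodemon over, return alone, take one of the others, bring the cacodemon back, take the remaining one, return alone, cacodemon over).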