What type of prompts were you feeding it? My limited understanding is that reasoning models will outperform LLMs like GPT-4/Claude at certain tasks but not others. Prompts whose answers are fuzzier and less deterministic (i.e., soft sciences) will see reasoning models underperform, because their training revolves around RL with verifiable rewards.
It's nowhere close to Claude, and it's also not better than OpenAI.
I'm so confused as to how people judge these things.