I was reading the DeepSeek-R1 paper to understand the nitty-gritty of improving performance through RL on the base model instead of SFT. I love that this reduces the reliance on labeled data for tasks that occur rarely. However, I couldn't help but notice the mention of the "aha moment" in the paper. Can someone mathematically explain why there is a checkpoint during training at which the model learns to allocate more thinking time to a problem by reevaluating its initial approach? Is this behavior repeatable, or is it simply one of the "local minima" they encountered?
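For context on what I mean by "RL on the base model": the paper uses GRPO, which scores a group of sampled responses per prompt and standardizes each reward against the group, so no learned value model is needed. A minimal sketch of that group-relative advantage (my own toy illustration, not the paper's code):

```python
def group_advantages(rewards, eps=1e-8):
    """Standardize each response's reward against its sampled group.

    Responses scoring above their siblings get a positive advantage
    and are reinforced; below-average ones are pushed down.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Toy example: 4 sampled answers to one prompt, reward 1 if correct else 0.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```

So if longer, self-checking responses tend to land in the correct (above-average) part of the group, their behavior gets reinforced, which is roughly where my question about the "aha moment" comes from.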