piecerough on Jan 26, 2025 | on: DeepSeek-R1: Incentivizing Reasoning Capability in...

SFT forces the model to output _that_ exact reasoning trace you have in the data. RL allows any reasoning trace and only penalizes it if it does not reach the same final answer as the data.
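
To make the contrast concrete, here is a toy sketch, not anyone's actual training code. Everything in it is an assumption made for illustration: the `Answer:` marker convention, the helper names `extract_answer`, `outcome_reward` and `trace_match`, and the example traces. `trace_match` is only a crude positional proxy for the per-token supervision SFT applies (real SFT uses cross-entropy on the reference tokens), and `outcome_reward` stands in for an RL reward that looks only at the final answer.

```python
def extract_answer(completion: str) -> str:
    """Hypothetical convention: the final answer follows the last 'Answer:' marker."""
    return completion.rsplit("Answer:", 1)[-1].strip()


def outcome_reward(completion: str, reference_answer: str) -> float:
    """RL-style signal (sketch): 1.0 if the final answer matches, else 0.0.

    The reasoning trace before the answer is never inspected.
    """
    return 1.0 if extract_answer(completion) == reference_answer.strip() else 0.0


def trace_match(completion: str, reference_trace: str) -> float:
    """Crude proxy for SFT supervision: fraction of reference token positions
    that the completion reproduces exactly (whitespace tokenization)."""
    ref_tokens = reference_trace.split()
    out_tokens = completion.split()
    matches = sum(1 for r, o in zip(ref_tokens, out_tokens) if r == o)
    return matches / len(ref_tokens)


# Made-up example: compute (2 + 3) * 4.
REFERENCE = "2 + 3 = 5, times 4 is 20. Answer: 20"
ALTERNATIVE = "4 * 2 = 8 and 4 * 3 = 12, sum 20. Answer: 20"
WRONG = "2 + 3 = 6, times 4 is 24. Answer: 24"

for name, trace in [("reference", REFERENCE), ("alternative", ALTERNATIVE), ("wrong", WRONG)]:
    print(name, "reward:", outcome_reward(trace, "20"),
          "trace match:", round(trace_match(trace, REFERENCE), 2))
```

The alternative trace takes a different route to the same answer, so it gets full outcome reward while scoring poorly on the trace-matching proxy. The wrong trace copies most of the reference wording but gets zero reward.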