
SFT forces the model to output _that_ exact reasoning trace you have in the data. RL allows whatever reasoning trace the model comes up with and only penalizes it if it does not reach the same answer.
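To make the contrast concrete, here is a rough sketch in plain Python. Everything in it is made up for illustration: `sft_loss`, `outcome_reward`, and the toy traces are hypothetical names, and the binary exact-match reward is just one simple assumption about how an RL setup might score answers. The point is only that the SFT objective scores every token of the reference trace, while the outcome reward never looks at the trace at all.

```python
import math


def sft_loss(token_probs):
    """Mean negative log-likelihood of the reference trace.

    token_probs[i] is the probability the model assigned to the i-th token
    of the reference trace (hypothetical values). Every token of that one
    trace is scored, so any deviation from it raises the loss.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def outcome_reward(sampled_answer, reference_answer):
    """Assumed outcome-only reward: 1.0 if the final answers match, else 0.0.

    The reasoning trace is not an input, so it cannot be penalized directly.
    """
    return 1.0 if sampled_answer.strip() == reference_answer.strip() else 0.0


# Toy samples: (reasoning trace, final answer) for the question "3 * 4 + 2 = ?"
samples = [
    ("3 * 4 = 12, 12 + 2 = 14", "14"),      # matches a typical reference trace
    ("4 + 4 + 4 = 12, then add 2", "14"),   # different trace, same answer
    ("3 + 4 = 7, 7 + 2 = 9", "9"),          # wrong answer
]
reference_answer = "14"

# Only the final answer is passed to the reward; the trace is ignored.
rewards = [outcome_reward(answer, reference_answer) for _trace, answer in samples]
```

Under these assumptions the first two samples get the same reward even though their traces differ, whereas an SFT loss computed against the first trace would score the second one poorly.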

