How Reasoning Evolves from Post-Training Data in Sequential Domains
We outperformed state-of-the-art open-source reasoning models at chess by applying SFT and RL to a 7B-parameter language model. The key focus of this work was studying how the choice of fine-tuning data influences post-RL reasoning, both quantitatively and qualitatively, using custom, theoretically inspired datasets.