> Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017).
GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly
reducing training resources. By solely using a subset of English instruction tuning data, GRPO
obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both
in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical
tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase
> Similarly, for code competition prompts, a compiler can be utilized to evaluate the model’s responses against a suite of predefined test cases, thereby generating objective feedback on
correctness
None of those are novel domains w/ their own novel syntax & semantic validators, not to mention the dearth of readily available sources of examples for sampling the baselines. So again, where does it say it works for a programming language with nothing but a grammar & a compiler?
You're not going to get less confused by doubling down. None of your claims are valid & this is because you haven't actually tried to do what you're suggesting. Taking a grammar & compiler & RLing will get you nowhere.
That GRPO works?
> Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase
Page 2 of https://arxiv.org/pdf/2402.03300
That GRPO on code works?
> Similarly, for code competition prompts, a compiler can be utilized to evaluate the model’s responses against a suite of predefined test cases, thereby generating objective feedback on correctness
Page 4 of https://arxiv.org/pdf/2501.12948