Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

In the blog post, learned verifiers are mentioned. Are these learned offline using data, and is the intent to learn a scoring heuristic to help the search?


Verifier is trained with soft values of reward-to-go for each solution-prefix, obtained from monte-carlo rollouts of step-by-step solutions sampled from the "base" model.

In other words: 1) sample step-by-step solutions from "base" model; 2) do it at non-zero temperature so that you can get multiple continuation from each solution-prefix; 3) use MATH-labels to decide if full solution (leaf/terminal node in MC rolloout) has reward `1` or `0`; 4) roll up these rewards to calculate reward-to-go for each intermediate step.

Yes, verifier trained in this manner can be used to score solution-prefixes (as a process verifier) or a full-solution (as an outcome verifier).

In the original paper (https://arxiv.org/abs/2408.03314) they fine-tune a fresh verifier. HF's replication uses an off-the-shelf verifier based on another paper: https://arxiv.org/abs/2312.08935




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: