Full blog is here: https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-...

dinp · on Dec 20, 2024

Great work! When I use models like o1, they work better than Sonnet and 4o for tasks that require some thinking, but the output is often very verbose. Is it possible to get the best of both worlds, where the thinking still takes place and improves performance, but the output is as straightforward to work with as it is with Sonnet and 4o? Did you observe similar behaviour with the 1B and 3B models? How does the model behaviour change when used for normal tasks that don't require thinking?

Also, how well do these models work for extracting structured output? E.g., perform OCR on some handwritten text with math, convert it to HTML, format the formulas correctly, etc. Single-shot prompting doesn't work well for such problems, but splitting the steps into consecutive API calls does.

srush · on Dec 20, 2024

That's a good point. We don't see that in our experiments because it's all in the math domain. However, for OAI it's plausible that training for o1 might conflict with standard instruction training, leading to a less human-preferred output style.

dimitry12 · on Dec 20, 2024

In this paper and HF's replication the model used to produce solutions to MATH problems is off-the-shelf. It is induced to produce step-by-step CoT-style solutions by few-shot ICL prompts or by instructions.

Yes, the search process (beam search or best-of-N) does produce verbose traces because there is branching involved when sampling "thoughts" from the base model. These branched traces (including incomplete "abandoned" branches) can be shown to the user or hidden, if the approach is deployed as-is.
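To make the branching concrete, here is a minimal sketch of both search strategies. It is a toy, not the code from the paper or HF's replication: `sample`, `extend`, `score`, and `is_done` are hypothetical callables standing in for the base model and the verifier, and the beam search returns the abandoned branches separately so they can be shown or hidden.

```python
from typing import Any, Callable, List, Tuple

Steps = List[Any]  # a (partial) solution is just a list of steps


def best_of_n(sample: Callable[[], Any], score: Callable[[Any], float], n: int) -> Any:
    """Draw n full solutions independently and keep the highest-scoring one."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)


def beam_search(
    extend: Callable[[Steps], Any],    # hypothetical: samples one next step for a prefix
    score: Callable[[Steps], float],   # hypothetical: verifier score for a prefix
    is_done: Callable[[Steps], bool],  # hypothetical: True once a prefix is a full solution
    beam_width: int,
    expansions: int,
    max_steps: int,
) -> Tuple[Steps, List[Steps]]:
    """Return the best solution found plus every branch that was dropped along the way."""
    beams: List[Steps] = [[]]
    abandoned: List[Steps] = []
    for _ in range(max_steps):
        candidates: List[Steps] = []
        for prefix in beams:
            if is_done(prefix):
                candidates.append(prefix)  # finished beams are carried over unchanged
            else:
                # Branch: sample several alternative next steps for this prefix.
                candidates.extend(prefix + [extend(prefix)] for _ in range(expansions))
        candidates.sort(key=score, reverse=True)
        beams, dropped = candidates[:beam_width], candidates[beam_width:]
        abandoned.extend(dropped)  # these are the "abandoned" branches
        if all(is_done(b) for b in beams):
            break
    return beams[0], abandoned
```

A deployment that wants terse output would return only the first element of the tuple and discard (or log) the second.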

amitness · on Dec 21, 2024

OpenAI recommends using o1 to generate the verbose plan and then chaining that verbose output to a cheaper model (e.g. gpt-4o-mini) to convert it into structured data / function calls / a summary etc. They call it the planner-executor pattern. [1]
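A rough sketch of that two-stage chaining, assuming nothing about the real API: `planner` and `executor` are hypothetical callables that take a prompt string and return text (in practice they would wrap calls to the expensive and the cheap model), and the JSON keys `steps` and `summary` are made up for illustration.

```python
import json
from typing import Any, Callable, Dict


def plan_then_execute(
    task: str,
    planner: Callable[[str], str],   # hypothetical wrapper around the reasoning model
    executor: Callable[[str], str],  # hypothetical wrapper around the cheaper model
) -> Dict[str, Any]:
    """Get a verbose plan from the planner, then have the executor compress it to JSON."""
    plan = planner(f"Think step by step and write a detailed plan for: {task}")
    raw = executor(
        "Convert the plan below into a JSON object with keys "
        "'steps' (list of strings) and 'summary' (string). Output JSON only.\n\n"
        + plan
    )
    # The caller only ever sees the structured result, not the verbose plan.
    return json.loads(raw)
```

The verbose reasoning stays internal to the pipeline, and the user only handles the executor's compact output.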

[1] https://vimeo.com/showcase/11333741/video/1018737829

d4rkp4ttern · on Dec 22, 2024

The big question is whether or not o3 is using any type of “meta-generation” algorithm at inference time, i.e., are there multiple invocations of the LLM generation at all, or does it generate an insanely long reasoning trace in a single autoregressive stream that somehow implicitly has search-like behavior? In other words, is the search-like behavior learned entirely in post-training and only implicitly exhibited at inference time, or is it explicitly done at inference time?

Given the enormous compute costs of o3, my speculation has been that search is explicit, but I’ve seen this post from Nathan Lambert, for example, that speculates (in the context of o1) that it’s possible for search to be entirely “baked into” a single-stream rollout (which would depend on significant long-context innovations):

https://www.interconnects.ai/p/openais-o1-using-search-was-a...

If true this would be extremely interesting.

mccoyb · on Dec 20, 2024

In the blog post, learned verifiers are mentioned. Are these learned offline using data, and is the intent to learn a scoring heuristic to help the search?

dimitry12 · on Dec 20, 2024

The verifier is trained with soft values of reward-to-go for each solution-prefix, obtained from Monte Carlo rollouts of step-by-step solutions sampled from the "base" model.

In other words: 1) sample step-by-step solutions from the "base" model; 2) do it at non-zero temperature so that you can get multiple continuations from each solution-prefix; 3) use MATH labels to decide whether a full solution (leaf/terminal node in an MC rollout) has reward `1` or `0`; 4) roll up these rewards to calculate the reward-to-go for each intermediate step.

Yes, a verifier trained in this manner can be used to score solution-prefixes (as a process verifier) or a full solution (as an outcome verifier).

In the original paper (https://arxiv.org/abs/2408.03314) they fine-tune a fresh verifier. HF's replication uses an off-the-shelf verifier based on another paper: https://arxiv.org/abs/2312.08935

OakNinja · on Dec 20, 2024

Excellent and interesting post!

Minor gripe - The best-of-n | beam search illustration is not compatible with red-green color blindness. I can literally not see the difference between the Rejected and the Selected dots even if I zoom in.

srush · on Dec 20, 2024

Thanks for the feedback, and not minor. Sorry about that.