The model is trained to re-evaluate the soundness of the tokens it produces during the "thinking phase".
The model's state vector is kept in a state of open exploration: it is still influenced by the already emitted tokens, but less strongly so.
The non-reasoning models were trained only with the goal of producing useful output on the first try, and they did their best to maximize that fitness function.