Improving mathematical reasoning with process supervision

We have trained a model to achieve a new state of the art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.
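To make the distinction concrete, here is a minimal sketch of how the two reward schemes might differ for a single solution. It is an illustration only, not the training setup described in the post: the function names `outcome_rewards` and `process_rewards`, the boolean per-step labels, and the 0/1 reward values are all assumptions made for this example.

```python
from typing import List


def outcome_rewards(num_steps: int, final_correct: bool) -> List[float]:
    """Outcome supervision (hypothetical sketch): a single reward signal,
    attached to the last step and based only on the final answer."""
    rewards = [0.0] * num_steps
    if rewards:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


def process_rewards(step_correct: List[bool]) -> List[float]:
    """Process supervision (hypothetical sketch): every reasoning step
    receives its own reward based on whether that step is correct."""
    return [1.0 if ok else 0.0 for ok in step_correct]


# Assumed example: a three-step solution whose middle step is flawed
# but which still lands on the right final answer.
step_labels = [True, False, True]

print(outcome_rewards(len(step_labels), final_correct=True))
print(process_rewards(step_labels))
```

In this toy case the outcome-based signal cannot see the flawed middle step, while the step-level signal marks it directly, which is the intuition behind rewarding each correct step of reasoning.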
