Process Reward Models That Think

Step-by-step verifiers, also known as process reward models (PRMs), are a
key ingredient for test-time scaling. PRMs require step-level supervision,
making them expensive to train. This work aims to build data-efficient PRMs as
verbalized step-wise reward models that verify every step in the solution by
generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long
CoT verifier fine-tuned on orders of magnitude fewer process labels than those
required by discriminative PRMs. Our approach capitalizes on the inherent
reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and
discriminative verifiers — using only 1% of the process labels in PRM800K —
across several challenging benchmarks. Specifically, ThinkPRM beats the
baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and
reward-guided search. In an out-of-domain evaluation on a subset of
GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers
trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the
same token budget, ThinkPRM scales up verification compute more effectively
than LLM-as-a-Judge, outperforming it by 7.2% on a subset of
ProcessBench. Our work highlights the value of generative, long CoT PRMs that
can scale test-time compute for verification while requiring minimal
supervision for training. Our code, data, and models will be released at
https://github.com/mukhal/thinkprm.
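
To make the best-of-N setting mentioned above concrete, here is a minimal sketch of how a step-wise verifier could be used to pick one solution out of several candidates. It is not the ThinkPRM implementation: the `StepVerifier` type, the `toy_verifier` stand-in, and the choice of the minimum step score as the solution score are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a verifier maps the steps of one solution to one
# correctness score in [0, 1] per step. It stands in for any PRM and is NOT
# the ThinkPRM API.
StepVerifier = Callable[[Sequence[str]], List[float]]


def solution_score(step_scores: Sequence[float]) -> float:
    """Aggregate step scores into one solution score.

    Assumption: a solution is only as good as its weakest step, so we take
    the minimum. Other aggregations (product, last step) are equally valid.
    """
    return min(step_scores) if step_scores else 0.0


def best_of_n(candidates: Sequence[Sequence[str]], verifier: StepVerifier) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    scores = [solution_score(verifier(steps)) for steps in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_verifier(steps: Sequence[str]) -> List[float]:
    # Toy stand-in for a real PRM: penalizes steps tagged with "WRONG".
    return [0.1 if "WRONG" in step else 0.9 for step in steps]


candidates = [
    ["2 + 2 = 4", "4 * 3 = 11 WRONG"],
    ["2 + 2 = 4", "4 * 3 = 12"],
]
best_index = best_of_n(candidates, toy_verifier)
```

In practice `toy_verifier` would be replaced by a call to a trained verifier. A generative verifier would first produce a verification chain-of-thought and then have its per-step verdicts converted into scores.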

Download PDF: 2504.16828v1