Scaling test-time compute, or affording a generator large language model
(LLM) extra compute during inference, typically employs the help of external
non-generative evaluators (i.e., reward models). Concurrently, LLM-judges,
models trained to generate evaluations and critiques (explanations) in natural
language, are becoming increasingly popular in automatic evaluation. Despite
judges' empirical successes, their effectiveness as evaluators in test-time
scaling settings is largely unknown. In this paper, we introduce the Judge
Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge
performance in three domains (math reasoning, code generation, and instruction
following) under three task settings: response reranking, step-level beam
search, and critique-based response refinement. We evaluate 10 different judge
models (7B-70B parameters) for 8 different base generator models (6.7B-72B
parameters). Our benchmark shows that while judges are competitive with outcome
reward models in reranking, they are consistently worse than process reward
models in beam search procedures. Furthermore, though unique to LLM-judges,
their natural language critiques are currently ineffective in guiding the
generator towards better responses.
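
As an illustration of the response reranking setting named above, the following is a minimal sketch, not the benchmark's implementation. It assumes a judge exposed as a scoring function; `rerank_best_of_n`, `judge_score`, and `toy_judge` are hypothetical names introduced here, and the toy judge is a stand-in for a real LLM-judge call.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    prompt: str,
    candidates: Sequence[str],
    judge_score: Callable[[str, str], float],
) -> str:
    """Return the candidate response that the judge scores highest.

    `judge_score(prompt, response)` is a hypothetical interface: any callable
    that maps a (prompt, response) pair to a scalar quality score.
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    # max() keeps the first candidate in case of tied scores.
    return max(candidates, key=lambda response: judge_score(prompt, response))


def toy_judge(prompt: str, response: str) -> float:
    # Stand-in for a real judge: rewards responses containing the answer "4".
    return 1.0 if "4" in response else 0.0


best = rerank_best_of_n(
    "What is 2 + 2?",
    ["5", "The answer is 4.", "22"],
    toy_judge,
)
```

In this sketch the generator's N sampled responses are passed in as `candidates`; a pairwise judge would need a different aggregation scheme than the pointwise scoring assumed here.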



