Scaling test-time computation, or affording a generator large language model
(LLM) extra compute during inference, typically relies on external
non-generative evaluators (i.e., reward models). Concurrently, LLM-judges,
models trained to generate evaluations and critiques (explanations) in natural
language, are becoming increasingly popular in automatic evaluation. Despite
their empirical successes, the effectiveness of judges as evaluators in test-time
scaling settings is largely unknown. In this paper, we introduce the Judge
Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge
performance in three domains (math reasoning, code generation, and instruction
following) under three task settings: response reranking, step-level beam
search, and critique-based response refinement. We evaluate 10 different judge
models (7B-70B parameters) for 8 different base generator models (6.7B-72B
parameters). Our benchmark shows that while judges are competitive with outcome
reward models in reranking, they are consistently worse than process reward
models in beam search procedures. Furthermore, although natural language
critiques are unique to LLM-judges, they are currently ineffective in guiding
the generator towards better responses.
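
As a rough illustration of the reranking setting described above, the sketch below selects the best of n sampled responses using a judge's scalar score. The callables generate and judge_score are hypothetical stand-ins for a generator LLM and an LLM-judge, not the benchmark's actual interface.

    # A minimal sketch of judge-based best-of-n reranking (one of the three
    # JETTS task settings). `generate` and `judge_score` are hypothetical
    # placeholders for a generator LLM and an LLM-judge.
    from typing import Callable, List

    def rerank_best_of_n(
        prompt: str,
        generate: Callable[[str], str],            # samples one candidate response
        judge_score: Callable[[str, str], float],  # maps (prompt, response) to a score
        n: int = 8,
    ) -> str:
        """Sample n candidate responses and return the one the judge scores highest."""
        candidates: List[str] = [generate(prompt) for _ in range(n)]
        scores = [judge_score(prompt, c) for c in candidates]
        best = max(range(n), key=lambda i: scores[i])
        return candidates[best]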