Reinforcement learning (RL) has become the core post-training technique for
large language models (LLMs). RL for LLMs involves two stages: generation and
training. The LLM first generates samples online, which are then used to derive
rewards for training. The conventional view holds that the colocated
architecture, where the two stages share resources via temporal multiplexing,
outperforms the disaggregated architecture, in which dedicated resources are
assigned to each stage. However, in real-world deployments, we observe that the
colocated architecture suffers from resource coupling, where the two stages are
constrained to use the same resources. This coupling compromises the
scalability and cost-efficiency of colocated RL in large-scale training. In
contrast, the disaggregated architecture allows for flexible resource
allocation, supports heterogeneous training setups, and facilitates
cross-datacenter deployment.
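To make the two-stage structure concrete, here is a minimal Python sketch of one RL iteration under disaggregation. The generate, reward, and train_step stubs are hypothetical stand-ins for the generation service, reward derivation, and training cluster; the sketch illustrates only the stage split, not the paper's actual interfaces.

    def generate(prompt: str, version: int) -> str:
        # Stand-in for the rollout/inference engine producing a sample online.
        return f"{prompt} -> sample@v{version}"

    def reward(prompt: str, sample: str) -> float:
        # Stand-in for reward derivation (e.g., a reward model or rule-based check).
        return float(len(sample))

    def train_step(samples, rewards) -> int:
        # Stand-in for a policy update on the training cluster; returns the new version.
        return 1

    def rl_iteration(prompts, policy_version: int) -> int:
        # Stage 1: generation workers sample responses online for each prompt.
        samples = [generate(p, policy_version) for p in prompts]
        rewards = [reward(p, s) for p, s in zip(prompts, samples)]
        # Stage 2: training workers update the policy with the rewarded samples.
        # Under disaggregation, the two stages run on dedicated resource pools.
        return train_step(samples, rewards)

    new_version = rl_iteration(["q1", "q2"], policy_version=0)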
StreamRL is designed with disaggregation from first principles and fully
unlocks its potential by addressing two types of performance bottlenecks in
existing disaggregated RL frameworks: pipeline bubbles, caused by stage
dependencies, and skewness bubbles, resulting from long-tail output length
distributions. To address pipeline bubbles, StreamRL breaks the traditional
stage boundary in synchronous RL algorithms through stream generation and
achieves full overlap between the two stages in asynchronous RL. To address skewness bubbles,
StreamRL employs an output-length ranker model to identify long-tail samples
and reduces generation time via skewness-aware dispatching and scheduling.
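As an illustration of the dispatching idea, the following is a minimal Python sketch of skewness-aware dispatching, assuming a ranker that predicts output lengths. The rank_output_length stub and the greedy longest-first assignment are illustrative simplifications, not StreamRL's actual ranker or scheduler.

    from typing import List

    def rank_output_length(prompt: str) -> int:
        # Placeholder for the output-length ranker model; a real ranker is learned.
        return len(prompt)

    def dispatch(prompts: List[str], num_workers: int) -> List[List[str]]:
        # Sort prompts by predicted output length, longest first, so that
        # long-tail samples start generating as early as possible.
        ordered = sorted(prompts, key=rank_output_length, reverse=True)
        # Greedy assignment: give each prompt to the worker with the smallest
        # predicted load, balancing generation time across workers.
        loads = [0] * num_workers
        batches: List[List[str]] = [[] for _ in range(num_workers)]
        for p in ordered:
            w = loads.index(min(loads))
            batches[w].append(p)
            loads[w] += rank_output_length(p)
        return batches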
Experiments show that StreamRL improves throughput by up to 2.66x compared to
existing state-of-the-art systems, and improves cost-effectiveness by up to
1.33x in a heterogeneous, cross-datacenter setting.