The rise of compound AI serving — integrating multiple operators in a
pipeline that may span edge and cloud tiers — enables end-user applications
such as autonomous driving, generative AI-powered meeting companions, and
immersive gaming. Achieving high service goodput — i.e., meeting service level
objectives (SLOs) for pipeline latency, accuracy, and costs — requires
effective planning of operator placement, configuration, and resource
allocation across infrastructure tiers. However, the diverse SLO requirements,
varying edge capabilities, and high query volumes create an enormous planning
search space, rendering current solutions fundamentally limited for real-time
serving and cost-efficient deployments.
This paper presents Circinus, an SLO-aware query planner for large-scale
compound AI workloads. Circinus introduces a novel decomposition of multi-query
planning and multi-dimensional SLO objectives that preserves global decision quality. By
exploiting plan similarities within and across queries, it significantly
reduces search steps. It further improves per-step efficiency with a
precision-aware plan profiler that incrementally profiles and strategically
applies early stopping based on imprecise estimates of plan performance. At
scale, Circinus selects query-plan combinations to maximize global SLO goodput.
Evaluations in real-world settings show that Circinus improves service goodput
by 3.2-5.0$\times$ and accelerates query planning by 4.2-5.8$\times$, achieving
query responses in seconds, while reducing deployment costs by 3.2-4.0$\times$
over state-of-the-art systems, even in their intended single-tier deployments.