In human-computer interaction, head pose estimation plays a central role in
application functionality. Facial landmarks are a valuable input for this
task, but existing landmark-based methods prioritize precision over
simplicity and model size, limiting their deployment on edge devices and in
compute-poor environments. To bridge this gap, we propose \textbf{Grouped
Attention Deep Sets (GADS)}, a novel architecture based on the Deep Set
framework. By grouping landmarks into regions and employing small Deep Set
layers, we reduce computational complexity. Our multihead attention mechanism
extracts and combines inter-group information, resulting in a model that is
$7.5\times$ smaller and executes $25\times$ faster than the current lightest
state-of-the-art model. Notably, our method achieves an impressive reduction,
being $4321\times$ smaller than the best-performing model. We introduce vanilla
GADS and Hybrid-GADS (landmarks + RGB) and evaluate our models on three
benchmark datasets — AFLW2000, BIWI, and 300W-LP. We envision our architecture
as a robust baseline for resource-constrained head pose estimation methods.
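To make the architectural pattern concrete, the following is a minimal PyTorch sketch of the grouped-Deep-Set-plus-attention idea the abstract describes. The abstract gives no implementation details, so the five-region grouping of 68 landmarks, the mean pooling, all layer widths, and the Euler-angle pose head are illustrative assumptions, not the authors' actual GADS implementation.

# Minimal sketch of the GADS idea: per-region Deep Set encoders followed by
# multi-head attention across group embeddings. All dimensions, group
# assignments, and layer choices below are illustrative assumptions.
import torch
import torch.nn as nn

class DeepSetBlock(nn.Module):
    """Permutation-invariant encoder for one landmark group: a shared
    per-landmark MLP (phi), mean pooling over the set, then rho."""
    def __init__(self, in_dim=2, hidden=32, out_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Linear(hidden, out_dim)

    def forward(self, x):                   # x: (batch, landmarks_in_group, in_dim)
        pooled = self.phi(x).mean(dim=1)    # pool over the set dimension
        return self.rho(pooled)             # (batch, out_dim)

class GADSSketch(nn.Module):
    """Groups landmarks into facial regions, encodes each region with a
    small Deep Set block, mixes the group embeddings with multi-head
    self-attention, and regresses yaw/pitch/roll (assumed output)."""
    def __init__(self, groups, dim=32, heads=4):
        super().__init__()
        self.groups = groups                # list of index tensors, one per region
        self.blocks = nn.ModuleList(DeepSetBlock(out_dim=dim) for _ in groups)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 3)       # yaw, pitch, roll

    def forward(self, landmarks):           # landmarks: (batch, 68, 2)
        # Encode each region independently with its own small Deep Set.
        emb = torch.stack([blk(landmarks[:, idx])
                           for blk, idx in zip(self.blocks, self.groups)],
                          dim=1)            # (batch, n_groups, dim)
        # Attention is the only place inter-group information is combined.
        mixed, _ = self.attn(emb, emb, emb)
        return self.head(mixed.mean(dim=1)) # (batch, 3) predicted angles

# Hypothetical grouping of 68-point landmarks into five facial regions.
groups = [torch.arange(0, 17),    # jawline
          torch.arange(17, 27),   # eyebrows
          torch.arange(27, 36),   # nose
          torch.arange(36, 48),   # eyes
          torch.arange(48, 68)]   # mouth
model = GADSSketch(groups)
pose = model(torch.randn(8, 68, 2))        # -> (8, 3)

In this sketch, the permutation-invariant pooling inside each region keeps every per-group encoder small, and the attention step is the single point where information crosses region boundaries, which is one plausible reading of how grouping reduces computational cost relative to encoding all landmarks jointly.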