The rapid growth of long-duration, high-definition videos has made efficient
video quality assessment (VQA) a critical challenge. Existing research
typically tackles this problem through two main strategies: reducing model
parameters and resampling inputs. However, lightweight Convolutional Neural
Networks (CNNs) and Transformers often struggle to balance efficiency with high
performance because VQA requires long-range modeling capabilities.
Recently, state-space models, particularly Mamba, have emerged as a promising
alternative, offering linear complexity with respect to sequence length.
Meanwhile, efficient VQA heavily depends on resampling long sequences to
minimize computational costs, yet current resampling methods are often weak in
preserving essential semantic information. In this work, we present MVQA, a
Mamba-based model designed for efficient VQA, along with a novel Unified
Semantic and Distortion Sampling (USDS) approach. USDS combines semantic patch
sampling from low-resolution videos and distortion patch sampling from
original-resolution videos. The former captures semantically dense regions,
while the latter retains critical distortion details. To avoid the extra
computation that dual inputs would incur, we propose a fusion mechanism using pre-defined
masks, enabling a unified sampling strategy that captures both semantic and
quality information without additional computational burden. Experiments show
that the proposed MVQA, equipped with USDS, achieves comparable performance to
state-of-the-art methods while being $2\times$ as fast and requiring only $1/5$
of the GPU memory.
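
The idea behind USDS, as described above, can be illustrated with a minimal NumPy sketch for a single frame. This is not the authors' implementation. The function names (`semantic_patches`, `distortion_patches`, `fuse`), the nearest-neighbour downsampling, the random per-cell crops, and the checkerboard mask are all assumptions chosen for illustration. The abstract only states that semantic patches come from low-resolution video, distortion patches come from original-resolution video, and the two are fused with pre-defined masks.

```python
import numpy as np


def semantic_patches(frame, grid, patch):
    """Downsample the whole frame to (grid*patch, grid*patch).

    Nearest-neighbour indexing is an assumption; it keeps a coarse,
    low-resolution view of the content in every grid cell.
    """
    h, w = frame.shape[:2]
    out = grid * patch
    rows = np.arange(out) * h // out
    cols = np.arange(out) * w // out
    return frame[rows][:, cols]


def distortion_patches(frame, grid, patch, rng):
    """Crop one original-resolution patch per grid cell and stitch them.

    Random crop positions inside each cell are an assumption. Pixels are
    copied unscaled, so local distortions are left untouched.
    """
    h, w = frame.shape[:2]
    cell_h, cell_w = h // grid, w // grid
    out_shape = (grid * patch, grid * patch) + frame.shape[2:]
    out = np.empty(out_shape, dtype=frame.dtype)
    for i in range(grid):
        for j in range(grid):
            top = i * cell_h + rng.integers(0, cell_h - patch + 1)
            left = j * cell_w + rng.integers(0, cell_w - patch + 1)
            out[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = (
                frame[top:top + patch, left:left + patch]
            )
    return out


def fuse(semantic, distortion, mask, patch):
    """Select patches with a fixed (grid, grid) boolean mask.

    True picks the semantic patch, False picks the distortion patch.
    The output has the size of a single input, not of both combined.
    """
    pixel_mask = mask.repeat(patch, axis=0).repeat(patch, axis=1)
    if semantic.ndim == 3:
        pixel_mask = pixel_mask[..., None]  # broadcast over channels
    return np.where(pixel_mask, semantic, distortion)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # placeholder frame
    grid, patch = 7, 32
    # Checkerboard mask: a hypothetical choice of pre-defined mask.
    mask = np.indices((grid, grid)).sum(axis=0) % 2 == 0
    fused = fuse(
        semantic_patches(frame, grid, patch),
        distortion_patches(frame, grid, patch, rng),
        mask,
        patch,
    )
    print(fused.shape)
```

Because the mask selects one patch per grid cell, the fused frame is the same size as either branch alone, which is how a single input can carry both kinds of information without growing the sequence the model has to process.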



