Humans naturally share information with those they are connected to, and
video has become one of the dominant mediums for communication and expression
on the Internet. To support the creation of high-quality large-scale video
content, a modern pipeline requires a comprehensive understanding of both the
raw input materials (e.g., the unedited footage captured by cameras) and the
editing components (e.g., visual effects). In video editing scenarios, models
must process multiple modalities (e.g., vision, audio, text) with strong
background knowledge and handle flexible input lengths (e.g., hour-long raw
videos), which poses significant challenges for traditional models. In this
report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a
wide range of video understanding and editing scenarios. The first release focuses on
temporal retrieval, i.e., identifying the time ranges within the input videos
corresponding to a given text query, which plays a critical role in intelligent
editing. The model is capable of processing hour-long videos with strong
temporal understanding capability, e.g., retrieving time ranges for specific
queries. To support a comprehensive evaluation in real-world scenarios, we also
present the VUE-TR benchmark, which introduces five key advancements: 1) Video
duration: significantly longer than the videos in existing temporal retrieval
datasets; 2) Audio support: includes audio-based queries; 3) Query format:
diverse query lengths and formats; 4) Annotation quality: ground-truth time ranges
are manually annotated; 5) Evaluation metric: a refined IoU metric that supports
evaluation over multiple time ranges. Remarkably, Vidi significantly
outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the
temporal retrieval task, indicating its superiority in video editing scenarios.
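As a concrete illustration of scoring retrieval over multiple time ranges, the sketch below computes an IoU between predicted and ground-truth sets of (start, end) intervals as total intersection length over total union length. This minimal formulation and the function names are illustrative assumptions, not necessarily the refined IoU metric defined by VUE-TR.

```python
# Hedged sketch: IoU over multiple time ranges, where intervals are (start, end)
# pairs in seconds. This is one plausible formulation, not the paper's exact metric.

from typing import List, Tuple

Interval = Tuple[float, float]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals into a disjoint, sorted list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> float:
    """Summed duration of a disjoint interval list."""
    return sum(end - start for start, end in intervals)


def intersection_length(a: List[Interval], b: List[Interval]) -> float:
    """Total overlap between two disjoint, sorted interval lists (two-pointer sweep)."""
    overlap, i, j = 0.0, 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        overlap += max(0.0, hi - lo)
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap


def multi_range_iou(pred: List[Interval], gt: List[Interval]) -> float:
    """IoU = |intersection| / |union| over the merged predicted and ground-truth ranges."""
    pred, gt = merge(pred), merge(gt)
    inter = intersection_length(pred, gt)
    union = total_length(pred) + total_length(gt) - inter
    return inter / union if union > 0 else 0.0


# Example: two predicted ranges scored against two ground-truth ranges.
print(multi_range_iou([(10.0, 20.0), (35.0, 40.0)], [(12.0, 22.0), (30.0, 40.0)]))
```

Unlike a single-interval IoU, this treats each set of ranges as a union of segments, so a prediction that covers only one of several ground-truth segments is penalized proportionally.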