Humans naturally share information with those they are connected to, and
video has become one of the dominant mediums for communication and expression
on the Internet. To support the creation of high-quality large-scale video
content, a modern pipeline requires a comprehensive understanding of both the
raw input materials (e.g., the unedited footage captured by cameras) and the
editing components (e.g., visual effects). In video editing scenarios, models
must process multiple modalities (e.g., vision, audio, text) with strong
background knowledge and handle flexible input lengths (e.g., hour-long raw
videos), which poses significant challenges for traditional models. In this
report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a
wide range of video understanding and editing scenarios. The first release focuses on
temporal retrieval, i.e., identifying the time ranges within the input videos
corresponding to a given text query, which plays a critical role in intelligent
editing. The model is capable of processing hour-long videos with strong
temporal understanding capability, e.g., retrieving time ranges for certain
queries. To support a comprehensive evaluation in real-world scenarios, we also
present the VUE-TR benchmark, which introduces five key advancements: 1) Video
duration: significantly longer than videos in existing temporal retrieval
datasets; 2) Audio support: includes audio-based queries; 3) Query format:
diverse query lengths/formats; 4) Annotation quality: ground-truth time ranges
are manually annotated; 5) Evaluation metric: a refined IoU metric to support
evaluation over multiple time ranges. Remarkably, Vidi significantly
outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the
temporal retrieval task, indicating its superiority in video editing scenarios.
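
As an illustration of what an IoU over multiple time ranges can look like, the sketch below computes intersection over union between two sets of time ranges by treating each set as a union of intervals. This is only one plausible generalization, written under our own assumptions; it is not necessarily the refined metric used in VUE-TR, and the names `merge_ranges`, `intersection_length`, and `multi_range_iou` are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Sort ranges and merge overlapping ones; empty ranges are dropped."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if end <= start:
            continue  # skip empty or inverted ranges
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(ranges: List[Range]) -> float:
    """Total duration covered by a list of non-overlapping ranges."""
    return sum(end - start for start, end in ranges)


def intersection_length(a: List[Range], b: List[Range]) -> float:
    """Overlap duration between two merged, sorted range lists (two-pointer sweep)."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def multi_range_iou(pred: List[Range], gt: List[Range]) -> float:
    """IoU between the union of predicted ranges and the union of ground-truth ranges."""
    p, g = merge_ranges(pred), merge_ranges(gt)
    inter = intersection_length(p, g)
    union = total_length(p) + total_length(g) - inter
    return inter / union if union > 0 else 0.0
```

For example, predicted ranges `[(0, 10), (20, 30)]` against a single ground-truth range `[(5, 25)]` overlap for 10 seconds out of a 30-second union, giving an IoU of 1/3 under this definition.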



