Many existing video inpainting algorithms rely on optical flow to construct
correspondence maps and then propagate pixels from adjacent frames into
missing regions. Despite the effectiveness of this propagation mechanism,
they may produce blurriness and inconsistencies when dealing with
inaccurate optical flows or large masks. Recently, Diffusion Transformer (DiT)
has emerged as a revolutionary technique for video generation tasks. However,
pretrained DiT models for video generation contain a large number of
parameters, which makes them time-consuming to apply to video inpainting
tasks. In this paper, we present DiTPainter, an end-to-end video inpainting
model based on Diffusion Transformer (DiT). DiTPainter uses an efficient
transformer network designed for video inpainting, which is trained from
scratch instead of initializing from any large pretrained models. DiTPainter
can handle videos of arbitrary length and can be applied to video
decaptioning and video completion tasks with an acceptable time cost.
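As a rough illustration of how a fixed-length model can be extended to videos of arbitrary length, the sketch below processes a long clip in overlapping temporal windows. The window size, the overlap, and the inpaint_clip interface are illustrative assumptions, not DiTPainter's actual implementation.

    import torch

    def inpaint_long_video(frames, masks, inpaint_clip, clip_len=16, overlap=4):
        """frames, masks: (T, C, H, W) tensors; inpaint_clip fills the masked pixels of one window."""
        T = frames.shape[0]
        out = frames.clone()
        start = 0
        while start < T:
            end = min(start + clip_len, T)
            # The model only sees the (partially inpainted) frames and masks of this window.
            out[start:end] = inpaint_clip(out[start:end], masks[start:end])
            if end == T:
                break
            # Overlapping adjacent windows share context frames, which helps keep
            # the filled content temporally consistent across window borders.
            start = end - overlap
        return out

Overlapped windows are one common way to trade a small amount of extra computation for consistency at window boundaries.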
Experiments show that DiTPainter outperforms existing video inpainting
algorithms with higher quality and better spatial-temporal consistency.
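For intuition, a minimal sketch of one mask-conditioned denoising step is given below, assuming the common recipe of concatenating the noisy latents with the masked video and the mask along the channel axis and using a diffusers-style scheduler. The tensor layout and names are assumptions for illustration, not the architecture described in the paper.

    import torch

    def denoising_step(dit, scheduler, latents, masked_video, mask, t):
        """One reverse-diffusion step for mask-conditioned video inpainting.

        latents:      (B, T, C, H, W) noisy video latents at timestep t
        masked_video: (B, T, C, H, W) encoded input video with missing regions zeroed out
        mask:         (B, T, 1, H, W) 1 where pixels must be filled, 0 elsewhere
        """
        # Concatenate the noisy latents, the masked video and the mask along the
        # channel axis so the transformer is conditioned on the known content.
        model_input = torch.cat([latents, masked_video, mask], dim=2)
        noise_pred = dit(model_input, t)
        # A diffusers-style scheduler turns the noise prediction into the
        # less-noisy latents of the previous timestep.
        return scheduler.step(noise_pred, t, latents).prev_sample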