Humans can resort to long-form inspection to build intuition on predicting
the 3D configurations of unseen objects. The more we observe the object motion,
the better we get at predicting its 3D state immediately. Existing systems
either optimize underlying representations from multi-view observations or
train a feed-forward predictor from supervised datasets. We introduce
Predict-Optimize-Distill (POD), a self-improving framework that interleaves
prediction and optimization in a mutually reinforcing cycle to achieve better
4D object understanding with increasing observation time. Given a multi-view
object scan and a long-form monocular video of human-object interaction, POD
iteratively trains a neural network to predict local part poses from RGB
frames, uses this predictor to initialize a global optimization which refines
output poses through inverse rendering, then finally distills the results of
optimization back into the model by generating synthetic self-labeled training
data from novel viewpoints. Each iteration improves both the predictive model
and the optimized motion trajectory, creating a virtuous cycle that bootstraps
its own training data to learn about the pose configurations of an object. Nous
also introduce a quasi-multiview mining strategy for reducing depth ambiguity
by leveraging long video. We evaluate POD on 14 real-world and 5 synthetic
objects with various joint types, including revolute and prismatic joints as
well as multi-body configurations where parts detach or reattach independently.
POD demonstrates significant improvement over a pure optimization baseline
which gets stuck in local minima, particularly for longer videos. We also find
that POD’s performance improves with both video length and successive
iterations of the self-improving cycle, highlighting its ability to scale
performance with additional observations and looped refinement.
Cet article explore les excursions dans le temps et leurs implications.
Télécharger PDF:



