Subject-driven image generation aims to synthesize novel scenes that
faithfully preserve subject identity from reference images while adhering to
textual guidance, yet existing methods struggle with a critical trade-off
between fidelity and efficiency. Tuning-based approaches rely on time-consuming
and resource-intensive subject-specific optimization, while zero-shot methods
fail to maintain adequate subject consistency. In this work, we propose
FreeGraftor, a training-free framework that addresses these limitations through
cross-image feature grafting. Specifically, FreeGraftor employs semantic
matching and position-constrained attention fusion to transfer visual details
from reference subjects to the generated image. Additionally, our framework
incorporates a novel noise initialization strategy to preserve geometry priors
of reference subjects for robust feature matching. Extensive qualitative and
quantitative experiments demonstrate that our method enables precise subject
identity transfer while maintaining text-aligned scene synthesis. Without
requiring model fine-tuning or additional training, FreeGraftor significantly
outperforms existing zero-shot and training-free approaches in both subject
fidelity and text alignment. Furthermore, our framework can seamlessly extend
to multi-subject generation, making it practical for real-world deployment. Our
code is available at https://github.com/Nihukat/FreeGraftor.
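To make the core idea concrete, below is a minimal, hypothetical sketch of cross-image feature grafting: patch features of the image being generated are matched to reference-image patches by cosine similarity, and reference key/value features are grafted into the attention computation only at confidently matched positions. This is an illustrative approximation under assumed tensor shapes and threshold values, not the authors' implementation; all function names (`semantic_match`, `graft_attention`) and parameters are invented for this example.

```python
# Hypothetical sketch of cross-image feature grafting (not the authors' code).
import torch
import torch.nn.functional as F

def semantic_match(gen_feats, ref_feats, sim_threshold=0.5):
    """For each generated-image patch, find the most similar reference patch.

    gen_feats: (N_gen, C) patch features from the image being generated
    ref_feats: (N_ref, C) patch features from the reference image
    Returns best-match reference indices and a validity mask.
    """
    gen_n = F.normalize(gen_feats, dim=-1)
    ref_n = F.normalize(ref_feats, dim=-1)
    sim = gen_n @ ref_n.t()                  # (N_gen, N_ref) cosine similarities
    best_sim, best_idx = sim.max(dim=-1)     # best reference patch per generated patch
    valid = best_sim > sim_threshold         # keep only confident matches
    return best_idx, valid

def graft_attention(q, k_gen, v_gen, k_ref, v_ref, match_idx, valid, scale=None):
    """Graft reference key/value features at matched positions, then attend.

    q, k_gen, v_gen: (N_gen, C) query/key/value of the generated image
    k_ref, v_ref:    (N_ref, C) key/value extracted from the reference image
    """
    scale = scale or q.shape[-1] ** -0.5
    k = k_gen.clone()
    v = v_gen.clone()
    # Constrained fusion: overwrite only where the semantic match is confident.
    k[valid] = k_ref[match_idx[valid]]
    v[valid] = v_ref[match_idx[valid]]
    attn = (q @ k.t() * scale).softmax(dim=-1)
    return attn @ v

# Toy usage with random tensors standing in for diffusion features.
if __name__ == "__main__":
    torch.manual_seed(0)
    gen = torch.randn(64, 32)   # 64 generated-image patches, 32-dim features
    ref = torch.randn(48, 32)   # 48 reference-image patches
    idx, ok = semantic_match(gen, ref)
    out = graft_attention(gen, gen, gen, ref, ref, idx, ok)
    print(out.shape)            # torch.Size([64, 32])
```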