Pose-Diverse Multi-View Virtual Try-on from a Single Frontal Image via Diffusion Transformer
Abstract
We study multi-view virtual try-on from a single reference image of a user and a single frontal image of a garment. Most existing approaches focus on single-view synthesis, and their reliance on a single, fixed viewpoint limits their application in immersive environments that require diverse poses and viewpoints. The ability to generate multi-view virtual try-on images is crucial for a comprehensive user experience: it lets the user inspect the garment from multiple views, including the back and sides, much as in a real fitting room. In this paper, we propose a novel framework for pose-controllable, multi-view virtual try-on from a single image. Our method incorporates a proposed cross-attention injection mechanism for high-quality synthesis and allows fine-grained pose control. Our framework consists of two stages: the first stage generates a clean frontal try-on result using an off-the-shelf model, and the second stage synthesizes diverse viewpoints using a diffusion transformer. Additionally, the attention injection mechanism ensures consistent identity and garment preservation across synthesized views. Unlike conventional methods that require multiple images of the user or the garment from various angles, our model removes this requirement by synthesizing multi-view results from a single input image pair. Our method not only generates realistic images but also enables users to virtually inspect the fit and arrangement of the garment from multiple angles without additional data. Extensive experiments demonstrate that our framework outperforms existing 3D-based multi-view virtual try-on methods in terms of image quality and pose diversity.