Diffusion in Diffusion:
Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation

By investigating the space-wise diffusion phenomenon within the step-wise reverse denoising process, we propose a training-free VT2I generation method that efficiently introduces the visual condition into a pretrained T2I model.


Abstract

Text-to-Image (T2I) generation with diffusion models allows users to control the semantic content in the synthesized images given text conditions. As a further step toward a more customized image creation application, we introduce a new multi-modality generation setting that synthesizes images based on not only the semantic-level textual input, but also on the pixel-level visual conditions. Existing literature first converts the given visual information to semantic-level representation by connecting it to languages, and then incorporates it into the original denoising process. Seemingly intuitive, such methodological design loses the pixel values during the semantic transition, thus failing to fulfill the task scenario where the preservation of low-level vision is desired (e.g., ID of a given face image). To this end, we propose Cyclic One-Way Diffusion (COW), a training-free framework for creating customized images with respect to semantic-text and pixel-visual conditioning. Notably, we observe that sub-regions of an image impose mutual interference, just like physical diffusion, to achieve ultimate harmony along the denoising trajectory. Thus we propose to repetitively utilize the given visual condition in a cyclic way, by planting the visual condition as a high-concentration "seed" at the initialization step of the denoising process, and "diffuse" it into a harmonious picture by controlling a one-way information flow from the visual condition. We repeat the destroy-and-construct process multiple times to gradually but steadily impose the internal diffusion process within the image. Experiments on the challenging one-shot face and text-conditioned image synthesis task demonstrate our superiority in terms of speed, image quality, and conditional fidelity compared to learning-based text-vision conditional methods.

Internal Diffusion Phenomenon

To illustrate the internal diffusion within the image during the denoising process, we invert pictures of pure gray and pure white back to x_t, merge them together with different layouts, and then regenerate them back to x_0 via deterministic denoising. Different columns indicate different replacement steps t. The resulting images show how regions within an image diffuse into each other during denoising. Interestingly, the diffused area between gray and white can sometimes form a blurry face, which we attribute to the tendency of the face model to converge toward human faces.
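The following is a minimal sketch of how such a probe could be implemented with deterministic DDIM inversion and denoising; it is not the authors' exact script. The linear noise schedule, image size, layout, and replacement step are illustrative, and eps_model stands in for a pretrained noise-prediction UNet (e.g., an unconditional face diffusion model), which is not loaded here.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # standard linear DDPM schedule (assumed)
abar = torch.cumprod(1.0 - betas, dim=0)     # cumulative product alpha-bar_t


def ddim_step(x, eps, t_from, t_to):
    """Deterministic DDIM move from timestep t_from to t_to (eta = 0)."""
    a_from, a_to = abar[t_from], abar[t_to]
    x0_pred = (x - (1 - a_from).sqrt() * eps) / a_from.sqrt()
    return a_to.sqrt() * x0_pred + (1 - a_to).sqrt() * eps


@torch.no_grad()
def ddim_invert(eps_model, x0, t_target, stride=20):
    """Deterministically invert a clean image x_0 up to the latent x_t."""
    x = x0
    for t in range(0, t_target, stride):
        eps = eps_model(x, torch.tensor([t]))
        x = ddim_step(x, eps, t, min(t + stride, t_target))
    return x


@torch.no_grad()
def ddim_denoise(eps_model, xt, t_start, stride=20):
    """Deterministically denoise a latent x_t back to an image."""
    x = xt
    for t in range(t_start, 0, -stride):
        eps = eps_model(x, torch.tensor([t]))
        x = ddim_step(x, eps, t, max(t - stride, 0))
    return x


# Probe: invert a pure-gray and a pure-white image to the same replacement
# step t, splice them side by side, then denoise the merged latent.
t_replace = 600                               # the replacement step varied per column
gray = torch.zeros(1, 3, 256, 256)            # mid-gray in the [-1, 1] pixel range
white = torch.ones(1, 3, 256, 256)
# eps_model = ...                             # load the pretrained face UNet here
# xt_gray = ddim_invert(eps_model, gray, t_replace)
# xt_white = ddim_invert(eps_model, white, t_replace)
# xt = torch.cat([xt_gray[..., :128], xt_white[..., 128:]], dim=-1)
# image = ddim_denoise(eps_model, xt, t_replace)
```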

Method

We paste the given visual condition onto a predefined background and invert the composite to serve as the seed initialization at the denoising starting point. In the Cyclic One-Way Diffusion process, we "destroy" and "construct" the image in a cyclic way, and ensure one-way diffusion by consistently replacing the visual-condition region with its corresponding x_t.
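The sketch below shows one way this cyclic destroy-and-construct loop could be wired up, reusing ddim_step, ddim_denoise, and abar from the probe above. The helper names (invert_with_cache, renoise, cow_generate) and the cycle endpoints t_hi, t_lo, n_cycles are illustrative choices, not the paper's exact schedule; region_mask is assumed to be 1 inside the visual-condition region and 0 elsewhere.

```python
import torch


@torch.no_grad()
def invert_with_cache(eps_model, x0, stride=20):
    """Invert x_0 toward x_T, caching the latent x_t at every visited step."""
    x, cache = x0, {0: x0}
    for t in range(0, T - 1, stride):
        eps = eps_model(x, torch.tensor([t]))
        t_next = min(t + stride, T - 1)
        x = ddim_step(x, eps, t, t_next)
        cache[t_next] = x
    return cache


def renoise(x, t_lo, t_hi, noise):
    """"Destroy" step: diffuse a latent at t_lo forward to t_hi with fresh noise."""
    ratio = abar[t_hi] / abar[t_lo]
    return ratio.sqrt() * x + (1 - ratio).sqrt() * noise


@torch.no_grad()
def cow_generate(eps_model, face_x0, background_x0, region_mask,
                 t_hi=700, t_lo=300, n_cycles=5, stride=20):
    # Seed initialization: paste the visual condition onto the predefined
    # background, then invert the composite and cache its latents x_t.
    composite = background_x0 * (1 - region_mask) + face_x0 * region_mask
    ref = invert_with_cache(eps_model, composite, stride)
    x = ref[t_hi]                             # start the first cycle at t_hi

    for cycle in range(n_cycles):
        # "Construct": denoise from t_hi down to t_lo; at every step the
        # condition region is overwritten with its reference x_t, so
        # information flows one way, out of the seed into the background.
        for t in range(t_hi, t_lo, -stride):
            x = region_mask * ref[t] + (1 - region_mask) * x
            eps = eps_model(x, torch.tensor([t]))
            x = ddim_step(x, eps, t, max(t - stride, t_lo))
        # "Destroy": re-noise back up to t_hi and repeat the cycle.
        if cycle < n_cycles - 1:
            x = renoise(x, t_lo, t_hi, torch.randn_like(x))

    # After the last cycle, denoise from t_lo all the way to a clean image.
    return ddim_denoise(eps_model, x, t_lo, stride)
```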

Grow a Background for the Visual Condition

To demonstrate the effect of COW on the internal diffusion within the image, we conduct experiments with η set to 0 from the T-th to the t1-th step to slow down the generation process. At the beginning of the cycles, the visual-condition face is isolated while the rest of the image starts to fit the textual condition by forming richer scenes. As the cycles proceed, the generated images show a clear progression of growing a body from the given face condition. The background and the visual condition become more and more harmonious as the human body is completed. This vividly demonstrates that the information in the visual condition keeps spreading and diffusing into the surrounding region over the cycles. In addition, the model is able to properly understand the semantics of the implanted visual condition.
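For concreteness, the following sketch shows how such an η schedule could be applied in a DDIM-style sampling loop, again reusing abar and T from the probe above. The cutoff t1 = 400 and the step stride are illustrative values, not those used in the paper.

```python
import torch


def ddim_step_eta(x, eps, t_from, t_to, eta):
    """DDIM update with tunable stochasticity (eta = 0 is fully deterministic)."""
    a_from, a_to = abar[t_from], abar[t_to]
    x0_pred = (x - (1 - a_from).sqrt() * eps) / a_from.sqrt()
    sigma = eta * ((1 - a_to) / (1 - a_from)).sqrt() * (1 - a_from / a_to).sqrt()
    direction = (1 - a_to - sigma ** 2).clamp(min=0).sqrt() * eps
    return a_to.sqrt() * x0_pred + direction + sigma * torch.randn_like(x)


@torch.no_grad()
def sample_with_eta_schedule(eps_model, xT, t1=400, stride=20):
    """Use eta = 0 from step T down to t1, then stochastic steps afterwards."""
    x = xT
    for t in range(T - 1, 0, -stride):
        eta = 0.0 if t > t1 else 1.0          # deterministic early, stochastic late
        eps = eps_model(x, torch.tensor([t]))
        x = ddim_step_eta(x, eps, t, max(t - stride, 0), eta)
    return x
```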

Tradeoffs between Text and Visual Conditions

Our approach demonstrates the ability to trade off between visual and text conditions. For example, given a photo of a young woman but the text "an old person", our method can age the woman to meet the text description by adding wrinkles and changing skin elasticity and hair color, while maintaining her identity. Moreover, even though we diffuse the visual condition via constant replacement, we can still obtain Van Gogh-style and blurred face pictures guided by the text condition.

Generalized Visual Condition

We directly apply our method to visual conditions other than the human face to show its scalability. Here we show results for trees and cats with different text conditions. Our method harmoniously grows a whole body out of the visual condition under the guidance of the text condition. Moreover, our method can achieve style-transfer results (such as "Van Gogh" and "oil picture") and transplant one class onto another (e.g., "trees" to "potted plant"), striking a balance between preserving the visual condition and meeting the given text condition.