Project Description
In this project, we experiment with diffusion models: first using a pretrained model to denoise, sample, and edit images, and then training our own diffusion models from scratch on MNIST.
Sample Prompts and Generated Images
Here are a few model outputs for 3 text prompts. The first set of 3 images uses 5 inference steps. For the first two, the text prompt is reflected reasonably well, though the images still look a bit noisy and do not perfectly match the prompt. For the last one, the image captures the texture of a rocket ship but does not quite resemble an actual rocket ship. It's clear that more inference steps are needed for better output.
As you can see, with more inference steps we get much clearer outputs that match the text prompts far more closely.
Forward Process
The forward process involves taking a clean image and adding noise to it. Given a clean image \( x_0 \), we obtain a noisy image \( x_t \) by sampling from a Gaussian with mean \( \sqrt{\bar{\alpha}_t} \, x_0 \) and variance \( 1 - \bar{\alpha}_t \); equivalently, \( x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon \) with \( \epsilon \sim \mathcal{N}(0, \mathbf{I}) \).
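As a concrete reference, here is a minimal sketch of this forward step in PyTorch. The name `forward_noise` and the `alphas_cumprod` lookup table (one \( \bar{\alpha}_t \) per timestep, as exposed by typical diffusers schedulers) are assumptions for illustration.

```python
import torch

def forward_noise(x0, t, alphas_cumprod):
    """Sample x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) from a clean image x0.

    alphas_cumprod is assumed to be a 1-D tensor of cumulative alpha products
    (one bar-alpha value per timestep).
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)                                    # epsilon ~ N(0, I)
    return torch.sqrt(abar_t) * x0 + torch.sqrt(1 - abar_t) * eps
```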
Denoising with Gaussian Blur Filtering
Here we show the results of trying to denoise the noisy images from the previous part (noise timesteps 250, 500, and 750) using Gaussian blur filtering.
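For reference, this classical baseline is just a low-pass filter; a minimal sketch using torchvision is below, where the kernel size and sigma are illustrative choices rather than the exact values used.

```python
import torchvision.transforms.functional as TF

def blur_denoise(noisy_image, kernel_size=5, sigma=2.0):
    """Classical baseline: low-pass filter the noisy image with a Gaussian blur.

    kernel_size and sigma are illustrative, not tuned values.
    """
    return TF.gaussian_blur(noisy_image, kernel_size=kernel_size, sigma=sigma)
```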
One-Step Denoising
In this part, we use a pretrained diffusion model to estimate the Gaussian noise in an image and then remove that estimated noise to recover the clean image. This is one-step denoising because it is not iterative: we simply pass the noise timestep into the model, and it removes all of the noise, conditioned on that timestep, in a single step.
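A rough sketch of this single-step estimate is below, assuming a diffusers-style UNet whose output's first three channels are the noise prediction; the call signature and helper names are assumptions.

```python
import torch

def one_step_denoise(unet, x_t, t, prompt_embeds, alphas_cumprod):
    """Estimate the noise at timestep t with the UNet, then solve the forward-process
    equation for the clean image x_0 in a single step."""
    with torch.no_grad():
        eps = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    abar_t = alphas_cumprod[t]
    x0_est = (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
    return x0_est
```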
Iterative Denoising
I implemented iterative denoising for a diffusion model using **strided timesteps** to skip redundant steps, improving efficiency while preserving image quality. Starting with a noisy image at a high timestep, I used a U-Net to predict the noise and estimate the clean image, iteratively refining it across progressively less noisy timesteps. The denoising process interpolates between the noisy input and the predicted clean signal using diffusion-specific parameters like \( \alpha_t \), \( \beta_t \), and \( \bar{\alpha}_t \). By skipping steps with strided timesteps, the model efficiently reconstructs high-quality images without the need for full-timestep computations. I visualized the gradual denoising process and compared the results to single-step denoising and Gaussian blurring, demonstrating the superiority of the iterative approach. This project highlights the power of diffusion models in balancing computational efficiency and high-quality image restoration.
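The core loop might look roughly like the sketch below; the learned-variance correction is omitted for brevity, and the UNet interface and helper names (e.g. `strided_timesteps`) are assumptions carried over from the sketches above.

```python
import torch

def iterative_denoise(unet, x, strided_timesteps, prompt_embeds, alphas_cumprod, i_start=0):
    """Denoise x from strided_timesteps[i_start] down to a clean image.

    strided_timesteps is assumed to be ordered from noisiest to cleanest;
    the learned-variance term is omitted for brevity.
    """
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_next = strided_timesteps[i], strided_timesteps[i + 1]        # t is noisier than t_next
        with torch.no_grad():
            eps = unet(x, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha_t = abar_t / abar_next                                       # per-stride alpha
        beta_t = 1 - alpha_t
        x0_est = (x - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)   # clean-image estimate
        # Interpolate between the clean estimate and the current noisy image.
        x = (torch.sqrt(abar_next) * beta_t / (1 - abar_t)) * x0_est \
            + (torch.sqrt(alpha_t) * (1 - abar_next) / (1 - abar_t)) * x
    return x
```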
Diffusion Model Sampling
If we repeat the iterative denoising from part 1.4, but start from pure random noise at the noisiest strided timestep (i_start = 0), we can effectively generate images from scratch, i.e., "sample" from the diffusion model. Here are 5 results for the text prompt "a high quality photo".
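Under those assumptions, sampling reduces to calling the iterative-denoising sketch above on pure Gaussian noise; the shapes and variable names here are illustrative.

```python
import torch

# Start from pure Gaussian noise and denoise from the noisiest strided timestep.
x = torch.randn(5, 3, 64, 64, device="cuda")   # 5 samples at an assumed 64x64 stage-1 resolution
samples = iterative_denoise(unet, x, strided_timesteps, prompt_embeds, alphas_cumprod, i_start=0)
```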
CFG (Classifier-Free Guidance)
In this part, I implemented **Classifier-Free Guidance (CFG)** to improve the quality of generated images by combining conditional and unconditional noise estimates. The denoising process involves calculating a weighted combination of these estimates, controlled by a guidance scale \( \gamma \), where higher values amplify the conditional guidance for better image fidelity. I created a function, `iterative_denoise_cfg`, that iteratively refines noisy inputs using both a meaningful text prompt and an empty ("null") prompt to guide the generation process. Using this function, I generated five images with a CFG scale of \( \gamma = 7 \), which balances conditional alignment and diversity. This approach produces sharper, more detailed images compared to standard diffusion methods, demonstrating the effectiveness of classifier-free guidance.
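A sketch of the CFG combination inside one denoising step is below, reusing the assumed UNet interface from the earlier sketches; the default scale of 7 mirrors the \( \gamma = 7 \) used here.

```python
import torch

def cfg_noise_estimate(unet, x, t, cond_embeds, uncond_embeds, scale=7.0):
    """Combine conditional and unconditional noise estimates:
    eps = eps_uncond + gamma * (eps_cond - eps_uncond)."""
    with torch.no_grad():
        eps_cond = unet(x, t, encoder_hidden_states=cond_embeds).sample[:, :3]
        eps_uncond = unet(x, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    return eps_uncond + scale * (eps_cond - eps_uncond)
```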
Image-to-Image Translation
In this part, I used **SDEdit** to edit images by adding varying levels of noise and then iteratively denoising them using **Classifier-Free Guidance (CFG)**. For each noise level (\(1, 3, 5, 7, 10, 20\)), I added noise to the original image to push it away from the image manifold and used a diffusion model with the prompt "a high quality photo" to guide it back. The denoising process forced the noisy images to align with the natural image manifold, gradually restoring details while preserving edits induced by the noise. I repeated this procedure for three test images, producing a range of outputs for each noise level, showing progressively refined versions as the noise level decreased. This approach demonstrates the model’s ability to creatively edit and restore images while balancing between the original content and new generative adjustments.
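In code, SDEdit is essentially the forward-process sketch followed by CFG iterative denoising started partway through the schedule. The sketch below reuses the hypothetical `forward_noise` helper from above and assumes a signature for `iterative_denoise_cfg` (the CFG variant named earlier).

```python
def sdedit(unet, x_orig, i_start, strided_timesteps, prompt_embeds, uncond_embeds, alphas_cumprod):
    """Noise the original image to strided_timesteps[i_start], then guide it back to the
    natural image manifold with CFG iterative denoising."""
    t = strided_timesteps[i_start]
    x_noisy = forward_noise(x_orig, t, alphas_cumprod)     # push the image off the manifold
    return iterative_denoise_cfg(unet, x_noisy, strided_timesteps, prompt_embeds,
                                 uncond_embeds, alphas_cumprod, i_start=i_start)
```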
Editing Hand-Drawn Images and Web Images
I experimented with projecting non-realistic images onto the natural image manifold using iterative denoising with Classifier-Free Guidance. I edited a web-sourced image of a car and two hand-drawn images—a sports car and a pair of sunglasses—at various noise levels, transforming them into photorealistic versions.
Inpainting
I implemented an inpainting function based on the RePaint paper to reconstruct missing regions in images using a binary mask. By iteratively denoising the input image, I replaced the masked regions with new content generated by the diffusion model while keeping the unmasked areas unchanged. I applied this technique to inpaint the top of the Campanile and experimented with two additional custom images and masks for creative edits.
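The core RePaint trick is a single masking step applied after every denoising update; here is a sketch using the hypothetical `forward_noise` helper from above.

```python
def repaint_mask_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep the generated content inside the binary mask and force every pixel outside it
    to agree with the original image, re-noised to the current timestep t."""
    return mask * x_t + (1 - mask) * forward_noise(x_orig, t, alphas_cumprod)
```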
Text-Conditional Image-to-Image Translation
In this task, I extended the SDEdit approach to include text prompts, enabling guided transformations of images based on specific descriptions. By varying noise levels (\([1, 3, 5, 7, 10, 20]\)), I projected test images onto the natural image manifold while aligning them with a provided text prompt, such as "a rocket ship." I applied this method to the test image and two additional custom images, creating results that blend the original structure with the semantic guidance of the text prompt.
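This reuses the SDEdit sketch from above with a different prompt embedding; a purely hypothetical usage, where `rocket_embeds` is assumed to be a precomputed embedding of "a rocket ship":

```python
# Run the SDEdit sketch at each starting index with the text-prompt embedding.
outputs = [sdedit(unet, test_image, i_start, strided_timesteps, rocket_embeds,
                  uncond_embeds, alphas_cumprod)
           for i_start in (1, 3, 5, 7, 10, 20)]
```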
Visual Anagrams
This code implements a **visual anagram creation** process, producing an image whose semantic content changes with its orientation, based on prompts like "an oil painting of an old man" and "an oil painting of people around a campfire." The key idea is to iteratively denoise an image with **Classifier-Free Guidance (CFG)** for two different text prompts: one for the original image orientation and one for its flipped version.

For each timestep \(t\), the code computes the conditional noise estimates for both prompts using the U-Net, along with unconditional noise estimates obtained by passing the same images through the model with an empty prompt. Each pair is combined using CFG to produce a guided noise estimate:
\[ \text{noise\_est\_cfg} = \text{uncond\_noise\_est} + \text{scale} \cdot (\text{cond\_noise\_est} - \text{uncond\_noise\_est}). \]
For the second prompt, the process is repeated after flipping the image vertically, and the resulting noise estimate is flipped back to align with the original image orientation. The final noise estimate is the average of the two guided estimates, harmonizing the two semantic interpretations.

The averaged estimate is used to compute the clean-image estimate for the current timestep:
\[ \text{clean\_est} = \frac{\text{image} - \text{noise\_est\_final} \cdot \sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}. \]
The predicted image for the previous (less noisy) timestep is computed by interpolating between this clean estimate and the current noisy image, with a variance correction added using the predicted variance from the U-Net. Iterating this process yields a final image that reads as one prompt right-side up and as the other prompt when flipped, demonstrating how diffusion models can be composed to generate creative visual anagrams.
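A compact sketch of the anagram noise estimate, built on the `cfg_noise_estimate` sketch above; the flip axis and the averaging step are the essential parts, everything else is an assumed interface.

```python
import torch

def anagram_noise_estimate(unet, x, t, embeds_a, embeds_b, uncond_embeds, scale=7.0):
    """CFG noise for prompt A on the image, CFG noise for prompt B on the flipped image
    (flipped back to the original orientation), averaged into one estimate."""
    eps_a = cfg_noise_estimate(unet, x, t, embeds_a, uncond_embeds, scale)
    x_flip = torch.flip(x, dims=[2])                   # flip the image upside-down (height axis)
    eps_b = cfg_noise_estimate(unet, x_flip, t, embeds_b, uncond_embeds, scale)
    eps_b = torch.flip(eps_b, dims=[2])                # undo the flip on the noise estimate
    return (eps_a + eps_b) / 2
```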
Hybrid Images
This code implements **Factorized Diffusion** to create hybrid images that combine the low-frequency content of one noise estimate with the high-frequency content of another, generated from two different prompts. Specifically, it creates an image that appears as "a lithograph of a skull" when viewed from a distance but resembles "a lithograph of waterfalls" when viewed up close.

For each timestep \(t\), the code calculates the **conditional noise estimates** (\(\epsilon_{1}, \epsilon_{2}\)) for the two prompts and their respective **unconditional noise estimates** using the U-Net. Each pair is combined using **Classifier-Free Guidance (CFG)**, where the guided noise for each prompt is:
\[ \text{noise\_est} = \text{uncond\_noise} + \text{scale} \cdot (\text{cond\_noise} - \text{uncond\_noise}). \]
The low-frequency component is extracted by applying a Gaussian blur (kernel size 33, sigma 2) to the first guided estimate, and the high-frequency component is obtained by subtracting a blurred copy of the second guided estimate from the original. The two components are summed to form the final noise estimate:
\[ \text{noise\_est\_final} = \text{low\_pass} + \text{high\_pass}. \]

The denoised image for the current timestep is computed using
\[ \text{clean\_est} = \frac{\text{image} - \text{noise\_est\_final} \cdot \sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}, \]
where \(\bar{\alpha}_t\) is the cumulative noise-schedule parameter (with \(\beta_t = 1 - \alpha_t\) per step). The predicted image for the previous timestep is calculated by interpolating between the clean estimate and the noisy image, while incorporating the predicted variance to stabilize the results. Repeating this across the timesteps gradually refines the image into a visually compelling hybrid that blends the low frequencies of one prompt with the high frequencies of the other, leveraging diffusion models for creative compositional outputs.
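A sketch of the factorized noise estimate, again built on `cfg_noise_estimate`; the kernel size of 33 and sigma of 2 are taken from the description above, while the function names are assumptions.

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x, t, embeds_far, embeds_near, uncond_embeds, scale=7.0):
    """Low-pass the noise estimate for the 'viewed from afar' prompt, high-pass the estimate
    for the 'viewed up close' prompt, and sum them into the final estimate."""
    eps1 = cfg_noise_estimate(unet, x, t, embeds_far, uncond_embeds, scale)
    eps2 = cfg_noise_estimate(unet, x, t, embeds_near, uncond_embeds, scale)
    low_pass = TF.gaussian_blur(eps1, kernel_size=33, sigma=2)
    high_pass = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2)
    return low_pass + high_pass
```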
Project 5B: Diffusion Models from Scratch
Training a Single-Step Denoising UNet
This part involves training a single-step denoising UNet on the MNIST dataset to map noisy images back to clean ones using an \( L_2 \) loss. The **UNet architecture** is built from standard tensor operations such as **nn.Conv2d** (convolution), **nn.ConvTranspose2d** (upsampling), **nn.BatchNorm2d** (normalization), **nn.GELU** (activation), and **nn.AvgPool2d** (pooling). Key building blocks include:

1. **Conv**: Performs convolution to adjust channel dimensions while keeping height and width constant.
2. **DownConv**: Reduces the spatial resolution by a factor of 2 using strided convolutions.
3. **UpConv**: Upsamples the spatial resolution by a factor of 2 using transposed convolutions.
4. **Flatten** and **Unflatten**: Flatten 7x7 tensors to 1x1 and reverse the process, respectively.
5. **ConvBlock**, **DownBlock**, and **UpBlock**: Deeper versions of Conv, DownConv, and UpConv with additional layers to increase the number of learnable parameters.

The UNet's hidden dimension \( D = 128 \) sets the number of channels in intermediate layers, which controls the model's capacity. Skip connections concatenate features from earlier downsampling layers onto later upsampling layers, preserving fine-grained spatial detail during reconstruction.

To train the model, clean MNIST digits are noised with Gaussian noise controlled by a parameter \( \sigma \); each batch is noised dynamically during training to improve generalization. Training runs for 5 epochs with a batch size of 256 using the **Adam optimizer** and a learning rate of \( 1 \times 10^{-4} \). Loss values are recorded throughout training, and sample denoising results are visualized after the 1st and 5th epochs. The denoiser is also tested on **out-of-distribution noise levels**, with \( \sigma \) varied beyond the training values, to evaluate its generalization; results are visualized as digits at varying noise levels alongside their denoised outputs. To prevent data loss, checkpoints storing the model and optimizer states are saved to Google Drive after each epoch so training can be resumed. This part demonstrates how a denoiser can reconstruct clean images and explores its robustness to unseen noise distributions.
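A condensed sketch of this training loop is below; the UNet definition itself is omitted, `unet` is assumed to map a noisy 1x28x28 image to a clean one, and \( \sigma = 0.5 \) is shown only as an example training value.

```python
import torch

def train_denoiser(unet, dataloader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    """Train the single-step denoiser with an L2 loss, noising each batch on the fly."""
    unet.to(device)
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    losses = []
    for epoch in range(epochs):
        for x, _ in dataloader:                        # MNIST labels are unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)        # dynamic noising per batch
            loss = torch.nn.functional.mse_loss(unet(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses
```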
Time-Conditioned UNet
I implemented a **time-conditioned UNet** to train a **Denoising Diffusion Probabilistic Model (DDPM)** that can denoise noisy MNIST images and reconstruct clean ones. To include time conditioning, I added **FCBlocks**, which embed the scalar timestep \( t \) into a higher-dimensional vector using linear layers and GELU activations. My UNet architecture has a downsampling path, a bottleneck where I injected time-conditioning, and an upsampling path where I added skip connections and additional time-conditioning layers. The final layer outputs a single-channel \( 28 \times 28 \) image. For training, I used a forward diffusion process to add Gaussian noise to clean images using a noise schedule (\( \beta \)) and trained the UNet to predict the noise, optimizing with Mean Squared Error (MSE) loss. I trained the model on the MNIST dataset for 20 epochs with a batch size of 128 using the Adam optimizer and an exponentially decayed learning rate. To ensure stability, I saved model checkpoints and visualized the loss curve, which showed consistent improvement. I also implemented the reverse diffusion algorithm to generate new samples, starting from random noise and iteratively reconstructing clean images. After 5 and 20 epochs, I visualized the generated samples, which closely resembled MNIST digits. This project demonstrated how diffusion models can effectively denoise images and how time conditioning enhances the model's ability to handle noise dynamically across timesteps.
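A sketch of the two pieces described above, the time-embedding FCBlock and the DDPM training objective, is below; the layer dimensions, normalized-timestep input, and number of timesteps are assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

class FCBlock(nn.Module):
    """Embed a scalar conditioning signal (e.g. a normalized timestep t) into a feature vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return self.net(x)

def ddpm_loss(unet, x0, alphas_cumprod, num_ts=300):
    """Noise clean images to a random timestep and regress the UNet onto the injected noise."""
    t = torch.randint(0, num_ts, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps
    t_norm = (t.float() / num_ts).view(-1, 1)          # normalized timestep fed to the UNet
    return F.mse_loss(unet(x_t, t_norm), eps)
```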
Class-Conditioned UNet
I extended my UNet to include **class-conditioning** to improve control over image generation, enabling it to generate specific digits (0-9) based on a one-hot encoded class vector. I added two new **FCBlocks** to handle the class-conditioning vector alongside the time-conditioning vector, allowing the model to incorporate both class and time information. To maintain flexibility, I implemented **dropout**, where the class-conditioning vector is randomly set to zero 10% of the time, enabling the model to function without class-conditioning if needed. During training, I modulated key parts of the UNet, such as the bottleneck (`unflatten`) and upsampling path (`up1`), by combining class-conditioning (\( c \)) and time-conditioning (\( t \)) vectors. For sampling, I used **classifier-free guidance**, where both conditional and unconditional outputs were combined with a scaling factor to improve results. I trained the class-conditioned UNet on the MNIST dataset for 20 epochs, following the same procedure as the time-conditioned UNet. The training loss curve demonstrated consistent improvement, and I visualized generated samples for each digit after every 5 epochs. The results showed better control over digit generation, with clear differentiation between the classes. This approach provided enhanced flexibility for targeted image generation while maintaining the robustness of the diffusion model.
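A sketch of the two pieces specific to class conditioning, the 10% conditioning dropout during training and the classifier-free-guidance combination at sampling time, is below; the UNet call signature and guidance scale are assumptions.

```python
import torch

def drop_class_conditioning(c_onehot, p_uncond=0.1):
    """Zero out the one-hot class vector for ~10% of the batch so the model also learns
    the unconditional (null-class) case."""
    keep = (torch.rand(c_onehot.shape[0], 1, device=c_onehot.device) > p_uncond).float()
    return c_onehot * keep

def cfg_sample_eps(unet, x_t, t, c_onehot, gamma=5.0):
    """Classifier-free guidance at sampling time:
    eps = eps_uncond + gamma * (eps_cond - eps_uncond)."""
    eps_cond = unet(x_t, t, c_onehot)
    eps_uncond = unet(x_t, t, torch.zeros_like(c_onehot))
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```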