What is Stable Diffusion?
Stable Diffusion is a generative artificial intelligence (generative AI) model that produces unique photorealistic images from text and image prompts. It originally launched in 2022. Besides images, you can also use the model to create videos and animations. The model is based on diffusion technology and operates in latent space, which significantly reduces processing requirements and lets you run the model on desktops or laptops equipped with GPUs. Through transfer learning, Stable Diffusion can be fine-tuned to meet your specific needs with as few as five images.
Stable Diffusion is available to everyone under a permissive license. This differentiates Stable Diffusion from its predecessors.
Why is Stable Diffusion important?
Stable Diffusion is important because it’s accessible and easy to use. It can run on consumer-grade graphics cards. For the first time, anyone can download the model and generate their own images. You also have control over key hyperparameters, such as the number of denoising steps and the degree of noise applied.
Stable Diffusion is user-friendly, and you don't need specialized knowledge to create images. It has an active community, so ample documentation and how-to tutorials are available. The software is released under the Creative ML OpenRAIL-M license, which lets you use, change, and redistribute modified software. If you release derivative software, you have to release it under the same license and include a copy of the original Stable Diffusion license.
How does Stable Diffusion work?
As a diffusion model, Stable Diffusion differs from many other image generation models. In principle, diffusion models use Gaussian noise to encode an image. Then, they use a noise predictor together with a reverse diffusion process to recreate the image.
Beyond these general characteristics of diffusion models, Stable Diffusion is unusual in that it doesn’t operate in the pixel space of the image. Instead, it works in a reduced-definition latent space.
The reason for this is that a color image with 512x512 resolution has 786,432 values (512 x 512 pixels with three color channels). By comparison, Stable Diffusion works on a compressed representation that is 48 times smaller, at 16,384 values. This significantly reduces processing requirements, and it’s why you can use Stable Diffusion on a desktop with an NVIDIA GPU with 8 GB of video memory. The smaller latent space works because natural images aren't random. Stable Diffusion uses variational autoencoder (VAE) files in the decoder to paint fine details such as eyes.
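As a quick check of the pixel-versus-latent arithmetic above, the short Python sketch below computes both representation sizes, assuming three color channels for the pixel image and the 4-channel, 64x64 latent that Stable Diffusion uses:

```python
# Pixel space: a 512x512 image with 3 color channels (RGB)
pixel_values = 512 * 512 * 3   # 786,432 values

# Latent space: a 64x64 grid with 4 latent channels
latent_values = 64 * 64 * 4    # 16,384 values

print(pixel_values, latent_values, pixel_values // latent_values)
# 786432 16384 48  -> the latent representation is 48 times smaller
```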
Stable Diffusion V1 was trained using three datasets collected by LAION through the Common Crawl. This includes the LAION-Aesthetics v2.6 dataset of images with an aesthetic rating of 6 or higher.
What architecture does Stable Diffusion use?
The main architectural components of Stable Diffusion include a variational autoencoder, forward and reverse diffusion, a noise predictor, and text conditioning.
Variational autoencoder
The variational autoencoder consists of a separate encoder and decoder. The encoder compresses the 512x512 pixel image into a smaller 64x64 representation in latent space that's easier to manipulate. The decoder restores that representation from latent space into a full-size 512x512 pixel image.
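Here is a rough sketch of that round trip using the Hugging Face diffusers library; the model ID is illustrative, and the latent scaling factor applied by the full pipeline is omitted for brevity:

```python
import torch
from diffusers import AutoencoderKL

# Load the VAE that ships with a Stable Diffusion v1.x checkpoint (illustrative model ID)
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed 512x512 RGB image

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # shape: (1, 4, 64, 64)
    reconstruction = vae.decode(latents).sample       # shape: (1, 3, 512, 512)

print(latents.shape, reconstruction.shape)
```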
Forward diffusion
Forward diffusion progressively adds Gaussian noise to an image until all that remains is random noise. It’s not possible to identify what the original image was from the final noisy version. During training, every image goes through this process. After training, forward diffusion is only used again when you perform an image-to-image conversion.
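A minimal sketch of one forward diffusion step using a standard noise schedule from the diffusers library (the timestep and tensor shapes are illustrative):

```python
import torch
from diffusers import DDPMScheduler

# A typical training noise schedule with 1,000 timesteps
scheduler = DDPMScheduler(num_train_timesteps=1000)

clean_latents = torch.randn(1, 4, 64, 64)  # stand-in for an encoded training image
noise = torch.randn_like(clean_latents)    # Gaussian noise to mix in
timestep = torch.tensor([999])             # a late timestep: almost pure noise

# Mixes the clean latents with noise according to the schedule at this timestep
noisy_latents = scheduler.add_noise(clean_latents, noise, timestep)
```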
Reverse diffusion
Reverse diffusion is a parameterized process that iteratively undoes the forward diffusion. For example, if you trained the model with only two images, such as a cat and a dog, the reverse process would drift toward either a cat or a dog and nothing in between. In practice, model training involves billions of images and uses prompts to create unique images.
Noise predictor (U-Net)
A noise predictor is key for denoising images. Stable Diffusion uses a U-Net model to perform this task. U-Net models are convolutional neural networks originally developed for image segmentation in biomedicine. In particular, Stable Diffusion's U-Net uses residual neural network (ResNet) blocks, an architecture developed for computer vision.
The noise predictor estimates the amount of noise in the latent representation and subtracts it. It repeats this process for a user-specified number of steps, reducing noise with each iteration. The noise predictor is sensitive to conditioning prompts that help determine the final image.
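A simplified sketch of that denoising loop with diffusers components; the model ID and step count are assumptions, and classifier-free guidance (which real pipelines apply) is omitted:

```python
import torch
from diffusers import UNet2DConditionModel, DDIMScheduler

model_id = "runwayml/stable-diffusion-v1-5"  # illustrative model ID
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

scheduler.set_timesteps(25)                  # user-specified number of denoising steps
latents = torch.randn(1, 4, 64, 64)          # start from pure Gaussian noise
text_embeddings = torch.randn(1, 77, 768)    # stand-in for CLIP text conditioning

with torch.no_grad():
    for t in scheduler.timesteps:
        # The U-Net predicts the noise present in the current latents
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        # The scheduler removes a portion of that predicted noise
        latents = scheduler.step(noise_pred, t, latents).prev_sample
```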
Text conditioning
The most common form of conditioning is text prompts. A CLIP tokenizer converts each word in a textual prompt into tokens, and the text encoder embeds each token into a 768-value vector. You can use up to 75 tokens in a prompt. Stable Diffusion feeds these embeddings from the text encoder to the U-Net noise predictor using a text transformer. By changing the seed of the random number generator, you can generate different images in the latent space from the same prompt.
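The sketch below shows that tokenization and embedding step using the CLIP model that Stable Diffusion v1 relies on; the 77-token sequence is the 75 prompt tokens plus start and end markers:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# The CLIP text encoder used by Stable Diffusion v1.x
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photograph of an astronaut riding a horse"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]) -> one 768-value vector per token
```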
What can Stable Diffusion do?
Stable Diffusion represents a notable improvement in text-to-image model generation. It’s broadly available and needs significantly less processing power than many other text-to-image models. Its capabilities include text-to-image, image-to-image, graphic artwork, image editing, and video creation.
Text-to-image generation
This is the most common way people use Stable Diffusion. Stable Diffusion generates an image from a textual prompt. You can create different images by adjusting the seed for the random number generator or changing the denoising schedule for different effects.
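For example, with the diffusers library you might generate an image like this (the model ID, seed, and step count are illustrative, and a CUDA-capable GPU is assumed):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

generator = torch.Generator("cuda").manual_seed(42)  # fixed seed -> reproducible image
image = pipe(
    "a watercolor painting of a lighthouse at sunset",
    num_inference_steps=30,  # number of denoising steps
    guidance_scale=7.5,      # how strongly the prompt steers denoising
    generator=generator,
).images[0]
image.save("lighthouse.png")
```

Changing the seed while keeping the prompt fixed produces a different image each time.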
Image-to-image generation
Using an input image together with a text prompt, you can create new images based on the original. A typical case is to use a sketch and a suitable prompt.
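A sketch of that workflow with the diffusers image-to-image pipeline (the model ID, file names, and strength value are illustrative):

```python
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

sketch = Image.open("rough_sketch.png").convert("RGB").resize((512, 512))
image = pipe(
    prompt="a detailed fantasy castle on a cliff, digital art",
    image=sketch,
    strength=0.7,  # how strongly the input image is noised before denoising
).images[0]
image.save("castle.png")
```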
Creation of graphics, artwork and logos
Using a selection of prompts, it’s possible to create artwork, graphics and logos in a wide variety of styles. Naturally, it's not possible to predetermine the output, although you can guide logo creation using a sketch.
Image editing and retouching
You can use Stable Diffusion to edit and retouch photos. In AI Editor, load an image and use an eraser brush to mask the area you want to edit. Then write a prompt defining what you want to achieve, and edit or inpaint the picture. For example, you can repair old photos, remove objects from pictures, change subject features, and add new elements to the picture.
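The same inpainting workflow can be sketched with the diffusers inpainting pipeline; the model ID and file names are illustrative, and the mask marks the area to repaint in white:

```python
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting").to("cuda")

photo = Image.open("old_photo.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))  # white = area to repaint

result = pipe(
    prompt="clear blue sky with no power lines",
    image=photo,
    mask_image=mask,
).images[0]
result.save("retouched.png")
```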
Video creation
Using features such as Deforum from GitHub, you can create short video clips and animations with Stable Diffusion. Another application is to apply different styles to a movie. You can also animate photos by creating an impression of motion, such as flowing water.
How can AWS help with Stable Diffusion?
Amazon Bedrock is the easiest way to build and scale generative AI applications with foundation models. Amazon Bedrock is a fully managed service that makes leading foundation models, including Stable Diffusion, available through an API, so you can choose from various FMs to find the model that's best suited for your use case. With Amazon Bedrock, you can speed up the development and deployment of scalable, reliable, and secure generative AI applications without managing infrastructure.
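As a rough sketch, invoking a Stability AI model through the Amazon Bedrock runtime API with boto3 might look like this; the region, model ID, and request fields are assumptions, so check the Bedrock documentation for the models and request formats enabled in your account:

```python
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed region

response = bedrock.invoke_model(
    modelId="stability.stable-diffusion-xl-v1",  # assumed model ID
    body=json.dumps({
        "text_prompts": [{"text": "a photorealistic red fox in a snowy forest"}],
        "cfg_scale": 7,
        "steps": 30,
        "seed": 42,
    }),
)

payload = json.loads(response["body"].read())
image_bytes = base64.b64decode(payload["artifacts"][0]["base64"])
with open("fox.png", "wb") as f:
    f.write(image_bytes)
```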
Amazon SageMaker JumpStart, an ML hub offering models, algorithms, and solutions, provides access to hundreds of foundation models, including top-performing publicly available foundation models such as Stable Diffusion. New foundation models continue to be added, including Stable Diffusion XL 1.0, the latest version of the image generation model.