
Sunday, 12 May 2024

How to Use Stable Diffusion Effectively

From the prompt to the picture, Stable Diffusion is a pipeline with many components and parameters. All of these components work together to create the output, and if any one of them behaves differently, the output changes. Therefore, a bad setting can easily ruin your picture. In this post, you will see:

  • How the different components of the Stable Diffusion pipeline affect your output
  • How to find the best configuration to help you generate a high-quality picture

Let’s get started.

How to Use Stable Diffusion Effectively.
Photo by Kam Idris. Some rights reserved.

Overview

This post is in three parts; they are:

  • Importance of a Model
  • Selecting a Sampler and Scheduler
  • Size and the CFG Scale

Importance of a Model

If there is one component in the pipeline with the most impact, it must be the model. In the Web UI, it is called the “checkpoint”, named after the checkpoint files saved while training a deep learning model.

The Web UI supports multiple Stable Diffusion model architectures. The most common architecture nowadays is version 1.5 (SD 1.5). Indeed, all 1.x versions share a similar architecture (each model has 860M parameters) but are trained or fine-tuned under different strategies.

Architecture of Stable Diffusion 1.x. Figure from Rombach et al (2022)

There is also Stable Diffusion 2.0 (SD 2.0) and its updated version, 2.1. This is not a “revision” of version 1.5 but a model trained from scratch. It uses a different text encoder (OpenCLIP instead of CLIP); therefore, it understands keywords differently. One noticeable difference is that OpenCLIP knows fewer names of celebrities and artists. Hence, a prompt written for Stable Diffusion 1.5 may not work well in 2.1. Because the encoders differ, SD 2.x and SD 1.x models are incompatible, even though they share a similar architecture.

Next comes Stable Diffusion XL (SDXL). While version 1.5 has a native resolution of 512×512 and version 2.0 increased it to 768×768, SDXL is at 1024×1024. You should avoid sizes vastly different from a model’s native resolution. SDXL is a different architecture, with a much larger 6.6B-parameter pipeline. Most notably, it comes in two parts: the Base model and the Refiner model. They come as a pair, but you can swap either one for a compatible counterpart, or skip the Refiner if you wish. The text encoder combines CLIP and OpenCLIP, so it should understand your prompt better than any older architecture. Running SDXL is slower and requires much more memory, but it usually produces better quality.

Architecture of SDXL. Figure from Podell et al (2023)

What matters to you is that you should classify your models into three incompatible families: SD 1.5, SD 2.x, and SDXL. They behave differently with your prompt. You will also find that SD 1.5 and SD 2.x need a negative prompt for a good picture, but it is less important in SDXL. If you are using SDXL models, you will also notice that you can select a Refiner in the Web UI.

Images generated with the prompt, ‘A fast food restaurant in a desert with name “Sandy Burger”’, using SD 1.5 with different random seed. Note that none of them spelled the name correctly.

Images generated with the prompt, ‘A fast food restaurant in a desert with name “Sandy Burger”’, using SD 2.0 with different random seed. Note that not all of them spelled the name correctly.

Images generated with the prompt, ‘A fast food restaurant in a desert with name “Sandy Burger”’, using SDXL with different random seed. Note that all of them spelled the name correctly.

One characteristic of Stable Diffusion is that the original models are less capable out of the box but highly adaptable. Therefore, many third-party fine-tuned models have been produced. The most significant are models specializing in certain styles, such as Japanese anime, Western cartoons, Pixar-style 2.5D graphics, or photorealistic pictures.

You can find models on Civitai.com or the Hugging Face Hub. Searching with keywords such as “photorealistic” or “2D” and sorting by rating usually helps.

Selecting a Sampler and Scheduler

Image diffusion starts with noise and strategically replaces it with pixels until the final picture is produced. It was later found that this process can be modeled as a stochastic differential equation. Solving the equation numerically is possible, and there are different algorithms of varying accuracy.
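To see why samplers are essentially numerical solvers, here is a toy one-dimensional sketch of the Euler update a sampler performs: at each step, it estimates a derivative at the current noise level and moves the sample toward the next, lower level. The “perfect” denoiser and the specific sigma values are illustrative assumptions, not part of any real pipeline.

```python
def euler_sample(x, sigmas, denoise):
    # each step estimates a derivative at the current noise level
    # and takes one Euler step toward the next (lower) noise level
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoise(x, sigma)) / sigma
        x = x + (sigma_next - sigma) * d
    return x

# toy "perfect" denoiser that always predicts the clean value 1.5
clean = 1.5
denoiser = lambda x, sigma: clean
sigmas = [10.0, 5.0, 2.0, 1.0, 0.0]   # decreasing noise levels, ending at 0
x0 = clean + sigmas[0] * 0.37         # start: clean value plus scaled noise
print(euler_sample(x0, sigmas, denoiser))  # → 1.5 (the clean value is recovered)
```

With a real model, the denoiser is the U-Net’s noise prediction and x is a latent tensor, but the structure of the loop is the same.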

The most commonly used sampler is Euler. It is traditional but still useful. Then there is the family of DPM samplers. Some newer samplers, such as UniPC and LCM, have been introduced recently. Each sampler is an algorithm: it runs for multiple steps, and different parameters are used in each step. Those parameters are set by a scheduler, such as Karras or exponential. Some samplers have an alternative “ancestral” mode, which adds randomness to each step. This is useful if you want more creative output. Such samplers usually bear the suffix “a” in their name, such as “Euler a” instead of “Euler”. The non-ancestral samplers converge, i.e., they cease changing the output after a certain number of steps, whereas ancestral samplers give a different output every time you increase the step count.
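The Karras scheduler, for example, spaces the noise levels using a formula from Karras et al. (2022): it interpolates in sigma^(1/rho) space, which packs more steps near the low-noise end, where fine detail emerges. A minimal sketch (the sigma_min, sigma_max, and rho values are illustrative defaults):

```python
def karras_sigmas(steps, sigma_min=0.1, sigma_max=10.0, rho=7.0):
    # interpolate in sigma^(1/rho) space, then raise back to the rho power;
    # with rho=7 this concentrates steps at low noise levels
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (steps - 1) * (lo - hi)) ** rho for i in range(steps)]

sigmas = karras_sigmas(20)
print(round(sigmas[0], 4), round(sigmas[-1], 4))  # → 10.0 0.1
```

The sampler then walks this decreasing sequence step by step, which is why the scheduler and the step count jointly control how much compute is spent on detail.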

Selecting sampler, scheduler, steps, and other parameters in the Stable Diffusion Web UI

As a user, you can assume Karras as the scheduler in nearly all cases. However, the sampler and the step count need some experimentation. Either Euler or DPM++ 2M is a good choice because they balance quality and speed well. You can start with a step count of around 20 to 30; more steps give better output quality in terms of detail and accuracy, but run proportionally slower.

Size and CFG Scale

Recall that the image diffusion process starts from a noisy picture, gradually placing pixels conditioned by the prompt. How much the conditioning can impact the diffusion process is controlled by the parameter CFG scale (classifier-free guidance scale).

Unfortunately, the optimal value of the CFG scale depends on the model. Some models work best with a CFG scale of 1 to 2, while others are optimized for 7 to 9. The default is 7.5 in the Web UI. As a general rule, the higher the CFG scale, the more strongly the output image conforms to your prompt.
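The CFG scale enters the pipeline as a simple linear extrapolation: at each step, the model makes one prediction without the prompt and one with it, and the scale amplifies the difference between the two. A toy sketch on plain Python lists (the values are made up):

```python
def cfg(uncond, cond, scale):
    # classifier-free guidance: push the unconditional prediction
    # toward (and past) the prompt-conditioned one
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# scale 1.0 would return the conditional prediction unchanged;
# larger scales exaggerate the prompt's influence
print(cfg([0.0, 0.0], [1.0, 2.0], 7.5))  # → [7.5, 15.0]
```

This is why a very high scale can over-saturate or distort the image: the prediction is pushed well beyond what the model actually computed.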

If your CFG scale is too low, the output image may not be what you expected. However, there is another reason you may not get what you expected: the output size. For example, if you prompt for a picture of a man standing, you may get a headshot or a half-body shot instead, unless you set the image height significantly greater than the width. The diffusion process sets the picture composition in the early steps, and it is easier to devise a standing man on a taller canvas.

Generating a half-body shot if provided a square canvas.

Generating a full body shot with the same prompt, same seed, and only the canvas size is changed.

Similarly, if you give too much detail to something that occupies a small part of the image, those details will be ignored because there are not enough pixels to render them. That is why SDXL, for example, is generally better than SD 1.5: you usually generate at a larger pixel size.
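One way to see this pixel budget: Stable Diffusion does not diffuse in pixel space but in a latent space that the VAE downscales by a factor of 8, so a small object gets very few latent cells to carry its detail. A quick sketch (4 latent channels and the 8× factor hold for both SD 1.5 and SDXL):

```python
def latent_shape(width, height, channels=4, factor=8):
    # the VAE compresses each 8x8 pixel patch into one latent cell
    return (channels, height // factor, width // factor)

print(latent_shape(512, 512))    # SD 1.5 native canvas → (4, 64, 64)
print(latent_shape(1024, 1024))  # SDXL native canvas → (4, 128, 128), 4x the area
```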

As a final remark, generating pictures using image diffusion models involves randomness. Always start with a batch of several pictures to make sure the bad output is not merely due to the random seed.
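The seed’s role can be sketched in plain Python: the seed deterministically fixes the initial noise, so the same seed with the same settings reproduces the same picture, and a batch is just several seeds. (Python’s random module stands in here for the pipeline’s latent-noise generator.)

```python
import random

def initial_noise(seed, n=4):
    # the seed fully determines the starting noise
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# a "batch of 4": four seeds, four different starting points
batch = {seed: initial_noise(seed) for seed in range(42, 46)}
assert initial_noise(42) == initial_noise(42)  # same seed, same noise
assert initial_noise(42) != initial_noise(43)  # different seed, different noise
```

If one picture in the batch is good, re-running with that seed alone reproduces it, and you can then tune the other settings around it.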

Further Readings

This section provides more resources on the topic if you want to go deeper.

Summary

In this post, you learned about some subtle details that affect image generation in Stable Diffusion. Specifically, you learned:

  • The difference between different versions of Stable Diffusion
  • How the scheduler and sampler affect the image diffusion process
  • How the canvas size may affect the output

Using OpenPose with Stable Diffusion

We have just learned about ControlNet. Now, let’s explore the most effective way to control your character based on human pose. OpenPose is a great tool that can detect body keypoint locations in images and videos. By integrating OpenPose with Stable Diffusion, we can guide the AI to generate images that match specific poses.

In this post, you will learn about ControlNet’s OpenPose and how to use it to generate similar pose characters. Specifically, we will cover:

  • What is OpenPose, and how does it work?
  • How to use ControlNet’s Hugging Face Spaces demo to generate precise images from a reference image.
  • How to set up OpenPose in the Stable Diffusion Web UI and use it to create high-quality images.
  • How various OpenPose preprocessors focus on certain parts of the body.

Let’s get started.

Using OpenPose with Stable Diffusion
Photo by engin akyurt. Some rights reserved.

Overview

This post is in four parts; they are:

  • What is ControlNet OpenPose?
  • ControlNet in Hugging Face Space
  • OpenPose Editor in Stable Diffusion Web UI
  • Image to Image Generation

What is ControlNet OpenPose?

OpenPose is a deep learning model that detects human poses from an image. Its output is the positions of several keypoints (such as elbows, wrists, and knees) of the humans in the picture. The OpenPose model in ControlNet accepts these keypoints as additional conditioning to the diffusion model and produces an output image with a human aligned with those keypoints. Once you can specify the precise positions of keypoints, you can generate realistic images of human poses from a skeleton image. You can use it to create artistic photos, animations, or illustrations of different poses.
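Concretely, the body model emits one (x, y, confidence) triple per joint; a common layout is the 18-keypoint COCO-style skeleton. A small sketch of parsing such output into named points (the flat list below is fabricated data, and real tooling such as the controlnet_aux package handles this for you):

```python
# 18-keypoint COCO-style body layout used by OpenPose (assumed order)
BODY_18 = [
    "nose", "neck",
    "r_shoulder", "r_elbow", "r_wrist",
    "l_shoulder", "l_elbow", "l_wrist",
    "r_hip", "r_knee", "r_ankle",
    "l_hip", "l_knee", "l_ankle",
    "r_eye", "l_eye", "r_ear", "l_ear",
]

def parse_keypoints(flat, min_conf=0.1):
    # flat = [x0, y0, conf0, x1, y1, conf1, ...]; drop undetected joints
    points = {}
    for name, i in zip(BODY_18, range(0, len(flat), 3)):
        x, y, conf = flat[i:i + 3]
        if conf >= min_conf:
            points[name] = (x, y)
    return points

flat = [0.50, 0.20, 0.97] + [0.0, 0.0, 0.0] * 17  # fabricated: only a nose detected
print(parse_keypoints(flat))  # → {'nose': (0.5, 0.2)}
```

The skeleton images you will see in the Web UI and the editor below are simply these keypoints drawn with the standard limb connections.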

ControlNet in Hugging Face Spaces

To try out the capability of ControlNet OpenPose model, you can use the free online demo on Hugging Face Spaces:

To start, you need to create the pose keypoints. This can be done easily by uploading an image and letting the OpenPose model detect them. First, download Yogendra Singh‘s photo and upload it to the ControlNet Space. This ControlNet helps you pin down the pose, but you still need to provide a text prompt to generate a picture. Let’s write the simple prompt “A woman is dancing in the rain.” and press the run button.

Using OpenPose ControlNet model on Hugging Face Spaces

Due to the random nature of image generation, you may want to do multiple attempts. You may also polish the prompt to give more details, such as the lighting, the scene, and the outfit that the woman is wearing. You can even expand the “Advanced options” panel at the bottom to provide more settings, such as negative prompts.

Settings in the “Advanced options” panel

In the example above, you can see that a high-quality image of a woman dancing in the rain is generated from the skeleton image, in a pose similar to your uploaded image. Below are three other generations under the same prompt; all are exceptional and accurately follow the pose of the reference image.

Other generated images from the same prompt

OpenPose Editor in Stable Diffusion Web UI

You can also use the OpenPose ControlNet model from the Stable Diffusion Web UI. Indeed, not only can you upload an image to get the pose, you can also edit the pose before applying it to the diffusion model. In this section, you will learn how to set up OpenPose locally and generate images using the OpenPose Editor.

Before you start using the OpenPose editor, you have to install it and download the model file.

  1. Make sure you have installed the ControlNet extension; if not, please check the previous post.
  2. Install the OpenPose Editor extension: at the “Extensions” tab in the Web UI, click on “Install from URL” and enter the following URL to install:
    • https://github.com/fkunn1326/openpose-editor
  3. Go to the Hugging Face repository: https://hf.co/lllyasviel/ControlNet-v1-1/tree/main
  4. Download the OpenPose model “control_v11p_sd15_openpose.pth”
  5. Put the model file in the SD Web UI directory, in stable-diffusion-webui/extensions/sd-webui-controlnet/models or stable-diffusion-webui/models/ControlNet

Now you have everything set up, and a new tab named “OpenPose Editor” has been added to the Web UI. Navigate to the “OpenPose Editor” tab and adjust the canvas width and height to your preference. Next, you can start modifying the skeleton image on the right using your mouse. It’s a straightforward process.

Let’s try to create a picture of a man carrying a large gun. You can make changes to the skeleton image to make it look like the following:

Creating a pose with the OpenPose Editor

Then, click on the “Send to text2img” button. It will take you to text2img with the skeleton image added to the ControlNet panel.

The created pose on the ControlNet panel

Then, select “Enable” for this ControlNet model and make sure the “OpenPose” option is checked. You can also check “Low VRAM” and “Pixel Perfect”: the former is useful if your GPU does not have enough memory, and the latter asks the ControlNet model to use the optimal resolution to match the output.

Next, set up the positive and negative prompts, and adjust the output image size, the sampling method, and the number of sampling steps. For example, the positive prompt can be

detailed, masterpiece, best quality, Astounding, Enchanting, Striking, tom clancy’s the division, man_holding_gun, us_marine, beach background

and the negative prompt can be

worst quality, low quality, lowres, monochrome, greyscale, multiple views, comic, sketch, bad anatomy, deformed, disfigured, watermark, multiple_views, mutation hands, watermark, bad facial

The image below, generated at size 912×512 with the DDIM sampler for 30 steps, turned out to match the pose well, with good details.

Output using OpenPose ControlNet model

Image to Image Generation

If you tried the ControlNet model in the Web UI, you should have noticed there are multiple OpenPose preprocessors. In the following, let’s explore some of them that focus on the face and upper body.

We will use the photo by Andrea Piacquadio from Pexels.com as the reference image. In the Web UI, switch to the “img2img” tab and upload the reference image. Then, in the ControlNet panel, enable it and select “OpenPose” as the control type. By default in img2img, the reference image is shared with ControlNet. Next, change the preprocessor to “openpose_face” in the ControlNet panel, as follows:

Using “openpose_face” as the preprocessor

Afterward, set the positive prompt to match the style of the reference image and generate the image. Instead of a picture of a woman holding a tablet, let’s make her hold a phone:

detailed, best quality, Astounding, Enchanting, Striking, new_york, buildings, city, phone on the ear

Below is what you might get:

Image generated with img2img

We got a high-quality result with a similar pose. You may have to play around with the prompt to match the pose. The preprocessor used here is “openpose_face”, which detects the body pose as well as the face. Therefore, the generated picture matches the reference in both limb positions and facial expression.

Let’s change the preprocessor to “openpose_faceonly” to focus on facial features only. This way, only the keypoints on the face are recognized, and no information about the body pose is applied by the ControlNet model. Now, set the prompt to

detailed, best quality, Astounding, Enchanting, Striking, new_york, buildings, city

A result is generated that accurately follows each keyword in the prompt, but the body pose is vastly different from the previous one:

Image generated with the ControlNet provided only the facial keypoints

To understand why this is the case, you can check the output image from the preprocessor, as follows. The top image was generated using the “openpose_face” preprocessor, while the bottom image was generated using “openpose_faceonly”. Similarly, you can understand the behavior of other preprocessors by examining the skeleton structures they produce.
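Conceptually, the difference between the two preprocessors is just which keypoints survive: “openpose_faceonly” discards everything except the facial points, so the diffusion model is free to invent the body pose. A toy sketch with fabricated keypoints (the real face preprocessors actually emit a much denser face mesh of roughly 70 points):

```python
# fabricated keypoints mixing body and face parts, for illustration only
points = {
    "nose": (0.50, 0.20), "l_eye": (0.47, 0.18), "r_eye": (0.53, 0.18),
    "neck": (0.50, 0.35), "l_wrist": (0.30, 0.55), "r_wrist": (0.70, 0.55),
}
FACE = {"nose", "l_eye", "r_eye"}

def preprocess(pts, mode):
    # "openpose_faceonly" keeps facial points only;
    # "openpose_face" keeps the body pose plus the face
    if mode == "openpose_faceonly":
        return {k: v for k, v in pts.items() if k in FACE}
    return dict(pts)

print(sorted(preprocess(points, "openpose_faceonly")))  # → ['l_eye', 'nose', 'r_eye']
```

Whatever keypoints are dropped here are exactly the degrees of freedom the diffusion model fills in on its own, which explains the varying body poses above.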

Keypoints generated from different OpenPose preprocessors

Further Readings

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this post, we delved deeper into the world of ControlNet OpenPose and how we can use it to get precise results. Specifically, we covered:

  • What OpenPose is, and how it can generate images immediately without setting up anything
  • How to use the Stable Diffusion Web UI and OpenPose Editor to generate an image of a custom pose by modifying the prompt and skeleton image
  • How to use multiple OpenPose preprocessors, including full-face and face-only, in the Stable Diffusion Web UI

 