Sunday 12 May 2024

Applied Deep Learning in Python Mini-Course

Deep learning is a fascinating field of study, and its techniques are achieving world-class results on a range of challenging machine learning problems.

It can be hard to get started in deep learning.

Which library should you use and which techniques should you focus on?

In this post, you will discover a 14-part crash course on deep learning in Python with the easy-to-use and powerful Keras library.

This mini-course is intended for Python machine learning practitioners who are already comfortable with scikit-learn on the SciPy ecosystem.


How to Use Stable Diffusion Effectively

From the prompt to the picture, Stable Diffusion is a pipeline with many components and parameters. All these components working together create the output, and if one component behaves differently, the output changes. Therefore, a bad setting can easily ruin your picture. In this post, you will see:

  • How the different components of the Stable Diffusion pipeline affect your output
  • How to find the best configuration to help you generate a high-quality picture

Let’s get started.

How to Use Stable Diffusion Effectively.
Photo by Kam Idris. Some rights reserved.

Overview

This post is in three parts; they are:

  • Importance of a Model
  • Selecting a Sampler and Scheduler
  • Size and the CFG Scale

Importance of a Model

If there is one component in the pipeline that has the most impact, it must be the model. In the Web UI, it is called the “checkpoint”, a name borrowed from the snapshots saved while training a deep learning model.

The Web UI supports multiple Stable Diffusion model architectures. The most common architecture nowadays is version 1.5 (SD 1.5). Indeed, all version 1.x models share a similar architecture (each has 860M parameters) but are trained or fine-tuned under different strategies.

Architecture of Stable Diffusion 1.x. Figure from Rombach et al (2022)

There is also Stable Diffusion 2.0 (SD 2.0) and its updated version, 2.1. This is not a “revision” of version 1.5 but a model trained from scratch. It uses a different text encoder (OpenCLIP instead of CLIP); therefore, it understands keywords differently. One noticeable difference is that OpenCLIP knows fewer names of celebrities and artists. Hence, a prompt written for Stable Diffusion 1.5 may be obsolete in 2.1. Because the encoder is different, SD 2.x and SD 1.x are incompatible, even though they share a similar architecture.

Next comes Stable Diffusion XL (SDXL). While version 1.5 has a native resolution of 512×512 and version 2.0 increased it to 768×768, SDXL is at 1024×1024. You are advised not to use a size vastly different from the native resolution. SDXL is a different architecture, with a much larger pipeline of 6.6B parameters. Most notably, the model comes in two parts: the Base model and the Refiner model. They come as a pair, but you can swap one of them for a compatible counterpart or skip the refiner if you wish. The text encoder combines CLIP and OpenCLIP; hence, it should understand your prompt better than any older architecture. Running SDXL is slower and requires much more memory, but it usually produces better quality.

Architecture of SDXL. Figure from Podell et al (2023)

What matters to you is that you should classify your models into three incompatible families: SD 1.5, SD 2.x, and SDXL. They behave differently with your prompt. You will also find that SD 1.5 and SD 2.x need a negative prompt for a good picture, while it is less important in SDXL. If you are using SDXL models, you will also notice that you can select a refiner in the Web UI.

Images generated with the prompt, ‘A fast food restaurant in a desert with name “Sandy Burger”’, using SD 1.5 with different random seeds. Note that none of them spelled the name correctly.

Images generated with the prompt, ‘A fast food restaurant in a desert with name “Sandy Burger”’, using SD 2.0 with different random seeds. Note that not all of them spelled the name correctly.

Images generated with the prompt, ‘A fast food restaurant in a desert with name “Sandy Burger”’, using SDXL with different random seeds. Note that all of them spelled the name correctly.
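If you prefer to script such a comparison rather than use the Web UI, here is a minimal sketch using the Hugging Face diffusers library. The repository names are well-known public checkpoints but are assumptions on my part; substitute whichever checkpoints you actually use.

import torch
from diffusers import (StableDiffusionPipeline, StableDiffusionXLPipeline,
                       StableDiffusionXLImg2ImgPipeline)

prompt = 'A fast food restaurant in a desert with name "Sandy Burger"'

# SD 1.x family: 512x512 native resolution, CLIP text encoder
sd15 = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image_15 = sd15(prompt, height=512, width=512).images[0]

# SDXL family: 1024x1024 native resolution, a Base model plus an optional Refiner
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

# The Base model handles the first 80% of the denoising steps and hands its
# latents to the Refiner, which finishes the remaining 20%
latents = base(prompt, denoising_end=0.8, output_type="latent").images
image_xl = refiner(prompt, denoising_start=0.8, image=latents).images[0]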

One characteristic of Stable Diffusion is that the original models are less capable but highly adaptable. Therefore, a lot of third-party fine-tuned models have been produced. Most significant are the models specializing in certain styles, such as Japanese anime, western cartoons, Pixar-style 2.5D graphics, or photorealistic pictures.

You can find models on Civitai.com or the Hugging Face Hub. Searching with keywords such as “photorealistic” or “2D” and sorting by rating usually helps.

Selecting a Sampler and Scheduler

Image diffusion starts with noise and strategically replaces the noise with pixels until the final picture is produced. It was later found that this process can be represented as a stochastic differential equation. Solving the equation numerically is possible, and there are different algorithms of varying accuracy.

The most commonly used sampler is Euler. It is traditional but still useful. Then there is the family of DPM samplers. Some newer samplers, such as UniPC and LCM, have been introduced recently. Each sampler is an algorithm that runs for multiple steps, with different parameters used at each step. The parameters are set by a scheduler, such as Karras or exponential. Some samplers have an alternative “ancestral” mode, which adds randomness at each step. This is useful if you want more creative output. Those samplers usually bear the suffix “a” in their name, such as “Euler a” instead of “Euler”. The non-ancestral samplers converge, i.e., they cease changing the output after a certain number of steps. Ancestral samplers give a different output whenever you increase the number of steps.

Selecting sampler, scheduler, steps, and other parameters in the Stable Diffusion Web UI

As a user, you can assume Karras as the scheduler in all cases. However, the sampler and the number of steps need some experimentation. Either Euler or DPM++ 2M is a good choice because they balance quality and speed well. You can start with around 20 to 30 steps; the more steps, the better the output quality in terms of detail and accuracy, but generation is proportionally slower.
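As a rough equivalent outside the Web UI, here is a minimal sketch of swapping samplers with the diffusers library, assuming an SD 1.5 checkpoint; “DPM++ 2M Karras” in the Web UI roughly corresponds to DPMSolverMultistepScheduler with Karras sigmas enabled.

import torch
from diffusers import (StableDiffusionPipeline, EulerDiscreteScheduler,
                       EulerAncestralDiscreteScheduler, DPMSolverMultistepScheduler)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Euler: non-ancestral, converges as the number of steps grows
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

# "Euler a": the ancestral variant, which adds randomness at every step
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

# "DPM++ 2M Karras": DPM-Solver++ (2M) with the Karras noise schedule
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

image = pipe("a fast food restaurant in a desert", num_inference_steps=25).images[0]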

Size and CFG Scale

Recall that the image diffusion process starts from a noisy picture, gradually placing pixels conditioned by the prompt. How much the conditioning can impact the diffusion process is controlled by the parameter CFG scale (classifier-free guidance scale).

Unfortunately, the optimal value of the CFG scale depends on the model. Some models work best with a CFG scale of 1 to 2, while others are optimized for 7 to 9. The default value in the Web UI is 7.5. As a general rule, the higher the CFG scale, the more strongly the output image conforms to your prompt.

If your CFG scale is too low, the output image may not be what you expected. However, there is another reason you may not get what you expected: the output size. For example, if you prompt for a picture of a man standing, you may get a headshot or a half-body shot instead, unless you set the image height significantly greater than the width. The diffusion process sets the picture composition in the early steps, and it is easier to devise a standing man on a taller canvas.

Generating a half-body shot if provided a square canvas.

Generating a full body shot with the same prompt, same seed, and only the canvas size is changed.

Similarly, if you give too much detail to something that occupies a small part of the image, those details will be ignored because there are not enough pixels to render them. That is also why SDXL is generally better than SD 1.5: you usually work at a larger pixel size.

As a final remark, generating pictures with image diffusion models involves randomness. Always start with a batch of several pictures to make sure a bad output is not merely due to the random seed.
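To make these settings concrete, here is a minimal sketch in diffusers (an assumption, since the post itself works in the Web UI) that sets the CFG scale, uses a taller canvas, and generates a small batch of candidates in one call.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a man standing on a mountain top, looking at the mountains below him"

images = pipe(
    prompt,
    height=768, width=512,        # a portrait canvas makes a standing figure easier
    guidance_scale=7.5,           # the CFG scale; higher follows the prompt more closely
    num_inference_steps=25,
    num_images_per_prompt=4,      # a small batch to rule out an unlucky random seed
).images

for i, image in enumerate(images):
    image.save(f"candidate_{i}.png")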

Further Readings

This section provides more resources on the topic if you want to go deeper.

Summary

In this post, you learned about some subtle details that affect image generation in Stable Diffusion. Specifically, you learned:

  • The difference between different versions of Stable Diffusion
  • How the scheduler and sampler affect the image diffusion process
  • How the canvas size may affect the output

More Prompting Techniques for Stable Diffusion

The image diffusion model, in its simplest form, generates an image from a prompt. The prompt can be text or an image, as long as a suitable encoder is available to convert it into a tensor that the model can use as a condition to guide the generation process. Text prompts are probably the easiest way to provide conditioning, but you may not find it easy to generate a picture that matches your expectations. In this post, you will learn:

  • How to construct your prompt
  • Elements of an effective prompt

Let’s get started.

More Prompting Techniques for Stable Diffusion
Photo by Simon English. Some rights reserved.

Overview

This post is in three parts; they are:

  • Using an Interrogator
  • Creating an Effective Prompt
  • Experimenting with Prompts

Using an Interrogator

If you start from scratch, it may not be easy to describe the picture in your mind, because not everyone can effectively convey their ideas in words. Moreover, the Stable Diffusion model may not understand your prompt as you expect.

Undeniably, starting with something and modifying it is easier. You can copy prompts from other people’s success stories online. You can also provide a sample picture and let the Stable Diffusion Web UI build a prompt for it. This feature is called the “interrogator”.

Let’s download an image to the hard disk. Go to the “img2img” tab on Web UI, upload that image, and click the “Interrogate CLIP” button with a paperclip icon.

The interrogate buttons at the img2img tab in the Web UI

You should see a prompt generated like this:

a man standing on a mountain top looking at the mountains below him and a backpack on his back, with a backpack on his shoulder, Constant Permeke, a stock photo, sense of awe, postminimalism

This helps a lot in kickstarting your prompt engineering. You can see that the first part of the prompt describes the picture. “Constant Permeke” was a painter, and “postminimalism” is an art movement; together with “a stock photo”, their role is to control the style. The term “sense of awe” controls the feeling, which hints that the man has his back to the camera and is facing the wonder of nature.

Indeed, next to “Interrogate CLIP”, there is another interrogate button in the Web UI. The one with a cardboard-box icon is “Interrogate Deepbooru”, based on a different image captioning model. For the same picture, you would see a prompt generated as:

1boy, backpack, bag, blue_sky, boots, building, city, cityscape, cliff, cloud, cloudy_sky, day, facing_away, field, from_behind, grass, hill, horizon, house, island, lake, landscape, male_focus, mountain, mountainous_horizon, ocean, outdoors, river, rock, scenery, sky, snow, solo, standing, tree, water, waterfall, waves

This time you get a sequence of keywords rather than a sentence. You can edit the prompt for your own use, or use the generated prompt as inspiration.
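If you want to interrogate images outside the Web UI, the third-party clip-interrogator package offers something similar. The sketch below follows that package’s published examples and is an assumption about its API, not the exact mechanism behind the Web UI button.

from PIL import Image
from clip_interrogator import Config, Interrogator

# ViT-L-14/openai is the CLIP variant matching SD 1.x (an assumption)
ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))
image = Image.open("hiker.jpg").convert("RGB")   # hypothetical local file
print(ci.interrogate(image))                     # prints a caption-plus-style prompt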

How good are the interrogator models? You should not expect to recover the original image from the prompt, but something close. Reusing the prompt in the txt2img tab gives you this:

Picture generated using the prompt suggested by CLIP interrogator

Not too bad. But if you use the prompt created by Deepbooru, you will probably find it less accurate:

Picture generated using the prompt suggested by Deepbooru interrogator

Creating an Effective Prompt

The CLIP model works well for photographs, while the Deepbooru model is better for illustration, anime, and comics. However, using the prompt with an appropriate model is important. For example, if you intend to produce anime-style pictures, using an anime checkpoint such as Counterfeit helps.

Let’s revisit the prompt generated by the CLIP model. Why is the original picture not generated?

A good prompt should cover three S’s:

  • Subject: What is in the foreground, and its setting
  • Scene: What is in the background, including the composition and the use of color
  • Style: The abstract description of the picture, including the medium

Indeed, there is a fourth S: be specific. You should describe in detail what you see, not what you know, and you should not mention what is not shown in the picture. For example, do not mention what is in the backpack, because you cannot see it in the photo. You should mention not just a man but also his outfit. Describing the invisible and intangible (such as the man’s emotion) is usually unhelpful. If you need a thesaurus to help you, you can try an online prompt builder or even ChatGPT.

Using ChatGPT to help brainstorm about a text prompt for image generation

Let’s try to enrich the prompt:

  • Subject: a man standing on a mountain top, looking at the mountains below him, with a backpack, red jacket, shorts, back to viewer
  • Scene: bright blue sky, white cloud, next to a stack of rocks, sense of awe
  • Style: photorealistic, high details, wide angle, postminimalism, Constant Permeke

Combining all of these, you may find the output looks like:

A picture generated by Stable Diffusion but not accurately following the prompt

Not perfect. The prompt provides many details, but the model does not match all of them. Of course, increasing the “CFG Scale” parameter can help, since it asks the model to follow your prompt more closely. Another way to improve is to see what your model produces and emphasize the keywords that the model missed. You can use the syntax (keyword:weight) to adjust the weight of a keyword; the default weight is 1.0.

Several details are missing from the picture above: it is a close-up of the man rather than a wide-angle shot, and the man is not wearing black shorts. Let’s emphasize both of these. Usually, increasing the weight from 1.0 to 1.1 helps; try a heavier weight only when you confirm you need it.

A better picture after adjusting the weights of keywords in the prompt

The picture above was generated with (black_shorts:1.1) in the prompt. The underscore is intentional: it is interpreted as a space, but it enforces that the two words are read together, so it is more likely that “black” is treated as an adjective for the noun “shorts”.

Sometimes, you try very hard, but the model does not follow your prompt accurately. You can work on the negative prompt to enforce what you do not want. For example, you see that the man does not have his back fully to the camera. You can add “face” to the negative prompt, meaning you do not want to see his face.

Using negative prompts helps generate a better picture
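Outside the Web UI, the negative prompt is just another parameter of the pipeline call. Below is a minimal diffusers sketch (an assumption; note that the (keyword:weight) syntax above is specific to the Web UI and is not parsed by the basic diffusers pipeline).

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = ("a man standing on a mountain top, looking at the mountains below him, "
          "with a backpack, red jacket, black shorts, back to viewer")
image = pipe(
    prompt,
    negative_prompt="face",        # things you do not want to see in the output
    num_inference_steps=25,
).images[0]
image.save("no_face.png")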

Experimenting with Prompts

Creating pictures with Stable Diffusion may require patience and a lot of experimentation. This is because different models may respond differently to the same prompt, and there is randomness in the image diffusion process. You may want to try different models, try different prompts, or even repeat the generation multiple times.

Some tools can save you time during this experimentation. The easiest is to generate multiple pictures at once, each with a different random seed. If you set the batch size to greater than 1 and leave the seed at -1 (which means a new seed is generated each time), you can create multiple pictures in one click. Note that this consumes more GPU memory. If you run out of memory, you can increase the batch count instead, which runs multiple iterations of image generation. It is slower but uses less memory.

Setting batch size and batch count, while keeping the seed to -1, generates multiple images at once
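In code, the same idea looks roughly like the sketch below (again using diffusers as an assumption): “batch size” maps to num_images_per_prompt, “batch count” is a plain loop, and recording each random seed lets you reproduce a good candidate later.

import random
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a man standing on a mountain top, looking at the mountains below him"

candidates = []
for _ in range(4):                                    # "batch count": four iterations
    seed = random.randrange(2**32)                    # equivalent of leaving the seed at -1
    generator = torch.Generator("cuda").manual_seed(seed)
    images = pipe(prompt, generator=generator,
                  num_images_per_prompt=2).images     # "batch size": two images per iteration
    candidates.append((seed, images))                 # keep the seed so you can reproduce it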

Once you find a good candidate among the many generated, you can click on the picture to find the seed used. Then, to polish the picture further, fix the seed while modifying the prompt. Alter the prompt slightly each time so that you can slowly steer the generation toward the image you want.

The Web UI will report the parameters used to generate a picture, from which you can find the seed.

But how should you modify the prompt? One way is to try different combinations of keywords. In the Web UI, you can use the “prompt matrix” script to speed up this experimentation. You separate the prompt into different parts with the pipe character (|), e.g.,

a man standing on a mountain top, looking at the mountains below him, with a backpack, red jacket, (black_shorts:1.1), (back to viewer:1.1), bright blue sky, white cloud, next to a stack of rocks, sense of awe, (wide angle:1.1) | photorealistic, high details | postminimalism, Constant Permeke

Then, at the bottom of the txt2img tab, select “Prompt matrix” in the Script section. Because the above prompt is set as the positive prompt, pick “positive” in the “Select prompt” section. Click “Generate,” and you will see multiple pictures generated:

Experimentation of different prompts using the prompt matrix script

The “prompt matrix” script enumerates all combinations of the parts of your prompt, treating each part as a unit. Note that the seed and all other parameters are fixed; only the prompt varies. This is essential for a fair comparison of the effect of different prompts.
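Conceptually, the script does something like the following sketch (in diffusers, as an assumption): enumerate every combination of the optional prompt parts while keeping the seed and every other parameter fixed.

import itertools
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

base_part = ("a man standing on a mountain top, looking at the mountains below him, "
             "with a backpack, red jacket, bright blue sky, white cloud")
optional_parts = ["photorealistic, high details", "postminimalism, Constant Permeke"]

# every subset of the optional parts, from none of them to all of them
combos = [c for n in range(len(optional_parts) + 1)
          for c in itertools.combinations(optional_parts, n)]

for i, combo in enumerate(combos):
    prompt = ", ".join((base_part,) + combo)
    generator = torch.Generator("cuda").manual_seed(42)   # fixed seed: only the prompt varies
    image = pipe(prompt, generator=generator, num_inference_steps=25).images[0]
    image.save(f"matrix_{i}.png")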

Further Readings

This section provides more resources on the topic if you want to go deeper.

Summary

In this post, you learned about some techniques that help you create a better picture with Stable Diffusion. Specifically, you learned:

  • How to use an interrogator to generate a prompt from an existing image
  • The three S’s for an effective prompt: subject, scene, and style
  • How to experiment with prompts effectively