Stable Diffusion General

Sometimes you still see Dreamboothed models, but it's mostly just another thing that LoRAs replaced. That said, LoRA training still uses Dreambooth to generate a new checkpoint, with all the same constraints.

The trick with LoRA is that it's a diff against the base checkpoint. So instead of storing the whole new model you trained, you store just a summary of the biggest changes, with a customizable cutoff point. That's why LoRA models are so much smaller, though it's also why you can only use one LoRA at a time; in effect you're loading a different model, for generating this specific image only.
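To put rough numbers on that, here's a hand-wavy sketch (not real training code; the 320x320 layer size and rank 16 are just illustrative assumptions):

Python:
import torch

d = 320        # e.g. one attention projection in the SD1.x UNet
rank = 16      # the customizable cutoff: higher rank = bigger, more faithful LoRA

W_base = torch.randn(d, d)        # the weight as it ships in the base checkpoint
lora_up = torch.randn(d, rank)    # the two skinny factors a LoRA actually stores
lora_down = torch.randn(rank, d)  # for this layer

# A full finetune would store a whole replacement for W (d*d values);
# the LoRA stores only the factors of the diff (2*d*rank values).
print(d * d, 2 * d * rank)        # 102400 vs 10240 for this one layer

# At load time the diff is rebuilt and added on top of the base weight,
# which is why it behaves like loading a different model for this image only.
W_effective = W_base + lora_up @ lora_down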
 

That would explain why my attempt to use two different LoRAs in ComfyUI by chaining them together didn't work very well. I wonder how A1111's "additional networks" panel, which allows using multiple LoRAs, works.
 
Generally speaking you just apply both. As Baughn notes, LoRAs are very similar to little models, just shrunk down until only their deltas are left. That means most of the stuff you can do to a model you can also do to a LoRA, including merging LoRAs into models, merging them with other LoRAs, and so on. In fact, using a LoRA at all is basically just applying/merging its weight deltas onto the network.

Check the additional_networks code.
Python:
    for lora in self.text_encoder_loras + self.unet_loras:
      if type(lora) == LoRAModule:
        lora.apply_to()                           # ensure remove reference to original Linear: reference makes key of state_dict
        self.add_module(lora.lora_name, lora)
      else:
        # SD2.x MultiheadAttention merge weights to MHA weights
        lora_info: LoRAInfo = lora
        if lora_info.module_name not in mha_loras:
          mha_loras[lora_info.module_name] = {}

        lora_dic = mha_loras[lora_info.module_name]
        lora_dic[lora_info.lora_name] = lora_info
        if len(lora_dic) == 4:
          # calculate and apply
          w_q_dw = state_dict.get(lora_info.module_name + '_q_proj.lora_down.weight')
          if w_q_dw is not None:                       # corresponding LoRa module exists
            w_q_up = state_dict[lora_info.module_name + '_q_proj.lora_up.weight']
            w_k_dw = state_dict[lora_info.module_name + '_k_proj.lora_down.weight']
            w_k_up = state_dict[lora_info.module_name + '_k_proj.lora_up.weight']
            w_v_dw = state_dict[lora_info.module_name + '_v_proj.lora_down.weight']
            w_v_up = state_dict[lora_info.module_name + '_v_proj.lora_up.weight']
            w_out_dw = state_dict[lora_info.module_name + '_out_proj.lora_down.weight']
            w_out_up = state_dict[lora_info.module_name + '_out_proj.lora_up.weight']

            sd = lora_info.module.state_dict()
            qkv_weight = sd['in_proj_weight']
            out_weight = sd['out_proj.weight']
            dev = qkv_weight.device

            def merge_weights(weight, up_weight, down_weight):
              # calculate in float
              scale = lora_info.alpha / lora_info.dim
              dtype = weight.dtype
              weight = weight.float() + lora_info.multiplier * (up_weight.to(dev, dtype=torch.float) @ down_weight.to(dev, dtype=torch.float)) * scale
              weight = weight.to(dtype)
              return weight

            q_weight, k_weight, v_weight = torch.chunk(qkv_weight, 3)
            if q_weight.size()[1] == w_q_up.size()[0]:
              q_weight = merge_weights(q_weight, w_q_up, w_q_dw)
              k_weight = merge_weights(k_weight, w_k_up, w_k_dw)
              v_weight = merge_weights(v_weight, w_v_up, w_v_dw)
              qkv_weight = torch.cat([q_weight, k_weight, v_weight])

              out_weight = merge_weights(out_weight, w_out_up, w_out_dw)

              sd['in_proj_weight'] = qkv_weight.to(dev)
              sd['out_proj.weight'] = out_weight.to(dev)

              lora_info.module.load_state_dict(sd)
            else:
              # different dim, version mismatch
              print(f"shape of weight is different: {lora_info.module_name}. SD version may be different")

            for t in ["q", "k", "v", "out"]:
              del state_dict[f"{lora_info.module_name}_{t}_proj.lora_down.weight"]
              del state_dict[f"{lora_info.module_name}_{t}_proj.lora_up.weight"]
              alpha_key = f"{lora_info.module_name}_{t}_proj.alpha"
              if alpha_key in state_dict:
                del state_dict[alpha_key]
          else:
            # corresponding weight not exists: version mismatch
            pass
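Stripped of the SD2.x MultiheadAttention special-casing, the core operation that loop performs boils down to a couple of lines. A rough sketch of the idea (a hypothetical helper, not the extension's actual API), using the same alpha/dim scaling as merge_weights above:

Python:
import torch

def apply_lora_to_weight(base_weight, lora_up, lora_down, alpha, dim, multiplier=1.0):
    # Rebuild the low-rank delta and add it onto the original checkpoint weight,
    # doing the matmul in float32 and casting back, just like merge_weights above.
    scale = alpha / dim
    delta = (lora_up.float() @ lora_down.float()) * scale
    return (base_weight.float() + multiplier * delta).to(base_weight.dtype)

# Applying a second LoRA is just running this again on the already-merged weight,
# which is why stacking "just works" even though each LoRA was trained on its own.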
As you can see, it's literally just a for loop over all your LoRAs, merging the weights and storing them in your state dict. Later on, once you convert the state dict to the CompVis format, you just swap in the altered weights. The same LoRA mechanism can also be applied to ControlNets. See:

So the ControlNet model becomes an SD1.x LoRA.

And if you're wondering: yes, that means you can also turn models into LoRAs. Generally speaking they're kind of shitty LoRAs unless you're extracting from a Dreambooth-finetuned model (and even then, tbh, they're still kind of shitty, because LoRAs tend to work better when you caption your training data), but they'll work.
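If you're curious why extraction works at all, it's essentially a truncated SVD of the weight difference between the two models. A rough sketch of the bare math (not Kohya's actual extraction script; names are hypothetical):

Python:
import torch

def extract_lora_factors(base_weight, tuned_weight, rank=64):
    # Approximate (tuned - base) with two low-rank factors via truncated SVD.
    delta = (tuned_weight - base_weight).float()
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    # Keep only the `rank` largest singular values; everything below the cutoff
    # is thrown away, which is where the quality loss of extracted LoRAs comes from.
    lora_up = U[:, :rank] * S[:rank]   # (out_features, rank)
    lora_down = Vh[:rank, :]           # (rank, in_features)
    return lora_up, lora_down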
 
A potentially interesting new way to run Stable Diffusion, now entirely on the CPU! The benchmarks show that it's slower than using a GPU and may not support all CPUs (need to look into that more), but it should make Stable Diffusion accessible to a wider range of people.
GitHub - bes-dev/stable_diffusion.openvino

The big downside is that it only comes with a command-line interface, which isn't terribly user-friendly. That's manageable if you have experience with a Python IDE and fixable if you're experienced with one of the GUI libraries for Python.

Edit: The other big downside is that it's set up to only work with downloading models from HuggingFace. The insistence some people have on forcing models to be downloaded from online every single time the script is run instead of using locally stored files baffles me.
 
I like LoRAs, but just to ask, what are their limitations compared to a fully trained Dreambooth model?
There aren't really any. There's a setting that determines how large the LoRA should be; in the limit it'll be 4GB, and basically an entire model. A poor one, because LoRA was never designed for that, but there's no need to go that far.

Somewhere around the 100MB mark you're unlikely to notice any difference between a LoRA and its corresponding checkpoint.

That's of course only assuming you're applying it to the same model you trained against. LoRAs famously don't need to be used that way, but YMMV when you don't.
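For a rough sense of how that size setting plays out, here's a back-of-the-envelope sketch (illustrative numbers only; a real LoRA file also carries text-encoder modules, alpha values and many more layers):

Python:
# Parameter count of the two LoRA factors for a single linear layer.
def lora_params(in_features, out_features, rank):
    return rank * (in_features + out_features)

layer = (768, 320)   # e.g. a cross-attention projection in the SD1.x UNet
for rank in (4, 32, 128):
    p = lora_params(*layer, rank)
    print(f"rank {rank:>3}: {p:,} params (~{2 * p / 1e6:.2f} MB at fp16) for this layer")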
 
A potentially interesting new way to run Stable Diffusion, now entirely on the CPU! The benchmarks show that it's slower than using a GPU and may not support all CPUs (need to look into that more), but it should make Stable Diffusion accessible to a wider range of people.
GitHub - bes-dev/stable_diffusion.openvino

The big downside is that it only comes with a command-line interface, which isn't terribly user-friendly. That's manageable if you have experience with a Python IDE and fixable if you're experienced with one of the GUI libraries for Python.

Edit: The other big downside is that it's set up to only work with downloading models from HuggingFace. The insistence some people have on forcing models to be downloaded from online every single time the script is run instead of using locally stored files baffles me.

Is using the CPU to run Stable Diffusion really that new? Cmdr2 has had the ability to run on the CPU for months, and it also has a GUI to boot. Not to mention that, like most other UIs, Cmdr2 uses locally stored files for models.

(It does run slow on the CPU, but I usually limit the speed on the CPU when using it instead of the GPU for image generation. Otherwise CPU power draw goes wild and the CPU will get to temperatures above 80 degrees Celsius. Doubling the clock speed when boosting can quadruple the power consumption. An interesting note: You can set the exact same seed number, prompt, step number and guidance strength for CPU and GPU image generation, but the resulting images will not be identical.)
 
I wasn't aware of Cmdr2, so it was new to me. Cmdr2 looks like it might be the better method.
Edit: The link to Cmdr2, for anyone who's interested: GitHub - cmdr2/stable-diffusion-ui: Easiest 1-click way to install and use Stable Diffusion on your own computer. Provides a browser UI for generating images from text prompts and images. Just enter your text prompt, and see the generated image.
 
Cmdr2 looks interesting as a way to run Stable Diffusion on my home server, and would let me avoid shelling out 500+ AUD for an 8-12GB NVIDIA GPU for said machine; currently it only has a CPU. It also might melt my home server, though, or ratchet the power use sky-high, neither of which are... enjoyable consequences.

I've recently completed an IRC bot that talks to SD through the A1111 web UI, see, then realised I have nothing to run it on beyond my gaming desktop; not ideal for general use amongst friends on an IRC server I frequent. Hmm...

The field is moving crazy fast, though. Been wild following it all.
 
Do note that using Cmdr2's CPU mode for image generation will involve minutes per image.

Also, the CPU will boost to full speed during generation, unless you decide to limit clock speeds or power draw beforehand. Something to take note of if you want to avoid thermal throttling, or if you just hate the idea of CPU temperatures approaching 100 degrees Celsius.
 
I just shelled out for a new GPU instead, after thinking on things, yeah. My shiny RTX 3060 12GB should arrive via delivery in a few days, after which I'll swap the current gaming GPU into my home server for Stable Diffusion usage.
 
What is a UNet?
WHAT IS A UNET (Without going into the math because I don't understand it)

Recall this from the OP?
Like most latent diffusion models, Stable Diffusion consists of three parts: 1. a Variational AutoEncoder (if you have worked with video encoding in any way this might sound a bit familiar), 2. a UNet, which is a type of artificial neural network originally designed to do image segmentation for biomedical purposes at the University of Freiburg, and 3. a Contrastive Language–Image Pre-training (CLIP) encoder, aka a neural language model.
That's what we're going to talk about.

UNets are the primary thing responsible for creating a coherent latent space for your image generation. Not a coherent image, because that's the VAE's job, but the UNet is the part responsible for resolving that tangled mess of latent space into something the VAE can process without it just seeming like jumbled noise. The first thing we have to understand is that a UNet was primarily built and designed for image segmentation. What is image segmentation? It's when you segment (i.e. partition) an image into its constituent components.


(A picture of a cat before and after being run through image segmentation)

Basically what we're doing is removing a lot of detail from the image and reducing it down to just the thing we want. This might sound counterintuitive: why are we removing details instead of adding them? Well, we don't exactly start from a white canvas like an actual artist would. If you remember once more from the OP, Stable Diffusion is a denoising process, like the type used for making images less blurry, which means we are actually removing noise, i.e. details, from a noise-filled latent space in order to get to the final image. This means what our UNet is doing is actually removing excess noise until we get to the image we want.



Okay, so now that we understand the role of the UNet, let's discuss what it is. A UNet is composed of two paths: an encoder, which is responsible for contracting the image, and a decoder, which is responsible for expanding it. As the image goes through contraction and expansion it loses fidelity, noise, details, or whatever you want to call it. However, the UNet's architecture is also different from traditional encoder-decoders, and that difference is the reason it's called a U-Net.



It's shaped like a giant 'U'. The UNet layers not only follow their 'path' of encoding and decoding but also 'talk' directly to their counterparts on the other side. This means we are more capable of preserving the general shape and details of the original while continuing to shed noise. This is why UNets are a lot stronger than many other image segmentation architectures when it comes to Stable Diffusion's needs.
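A toy sketch of that U shape (nothing like Stable Diffusion's actual UNet, which adds residual blocks, attention and timestep conditioning, but it shows the skip connections that let each decoder stage 'talk' to its encoder counterpart):

Python:
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Conv2d(3, ch, 3, padding=1)                 # shallow encoder block
        self.enc2 = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)  # downsample (contract)
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)         # the "backbone"
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)      # upsample (expand)
        self.dec1 = nn.Conv2d(ch * 2, 3, 3, padding=1)             # sees upsample + skip

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))
        e2 = torch.relu(self.enc2(e1))
        m = torch.relu(self.mid(e2))
        d = self.up(m)
        # The skip connection: hand the encoder features of the same depth straight
        # across the 'U' to the decoder, so detail lost while contracting isn't gone for good.
        return self.dec1(torch.cat([d, e1], dim=1))

out = TinyUNet()(torch.randn(1, 3, 64, 64))   # -> shape (1, 3, 64, 64)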



Understanding the basic components of a UNet is pretty useful because it's used in a lot of places, not just Stable Diffusion. But when it comes to Stable Diffusion checkpoints, specifically checkpoint merging (averaging the weights from different checkpoints to hybridize the strengths of both), it helps us understand and conceptualize an extremely powerful merging technique. You see, the UNet of Stable Diffusion is organized into shallow and deep blocks. And this changes everything.

BLOCK WEIGHT MERGING

Block weight merging was first conceptualized, as far as I can tell, by the amateur Stable Diffusion researcher Kohya S., who also wrote most of the scripts people use to train LoRAs these days (although, as far as the cutting edge goes, plain LoRA has since been superseded by LoCon and subsequently LyCORIS as of this post).
note.com

Merging Stable Diffusion models with ratios that vary by U-Net depth | Kohya S. | note

Overview: With Stable Diffusion, it is known that merging the weights of multiple models yields intermediate outputs. The comparison results below are very informative. Merging scripts are also published below. Also, in my rough understanding, Stable Diffusion broadly consists of a Text Encoder (CLIP), a Denoising Auto Encoder (U-Net) and an Auto Encoder. Of these, the U-Net is the part in charge of generating the image from noise. The U-Net is a widely used network architecture whose structure is, as the name suggests, U-shaped...

Kohya S. proposed the idea of merging deep and shallow layers separately. This lets us somewhat fine-tune the merging process by selecting exactly what we want from each model. Say we want more of the compositional direction from one model and more of the details from the other; we can do that. Take a look at this completely non-authoritative experimental musing about what each block layer possibly does.





As block-weight merging is, even now, experimental, and what exactly any given part of a UNet does is a black box, it's best not to treat these images as authoritative, since they're completely inferred. But one basic concept can be understood: the closer a layer sits to the backbone of the UNet, aka the middle block, the more authority it possesses over the image; the shallower it is, the more it's responsible for fine, granular details. The results of this concept are best seen in the popular anime model 'AbyssOrangeMix'. AbyssOrangeMix uses AnythingV3, a popular Chinese checkpoint merged out of the NovelAI leak, as a base, and applies a block merge of a realistic model onto it. This partially fuses the UNet layers of the realistic model responsible for color direction and fine, granular details (shading, texture, etc.) onto AnythingV3. The result is an anime model with incredibly detailed textures, and it's primarily responsible for the aesthetic style that has become the defining quality of 'AI anime art' these days.
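To make the technique concrete, here's a rough sketch of a block-weighted merge (a simplification with only three coarse groups; the real merge-block-weighted tools expose a ratio per block, and the key names here just follow the usual SD1.x UNet layout of input_blocks / middle_block / output_blocks):

Python:
def block_weighted_merge(unet_a, unet_b, in_ratio=0.2, mid_ratio=0.8, out_ratio=0.5):
    # ratio = how much of model B to take for a block (0 = pure A, 1 = pure B).
    merged = {}
    for key, wa in unet_a.items():
        wb = unet_b[key]
        if "middle_block" in key:
            ratio = mid_ratio    # the deep backbone: broad compositional authority
        elif "output_blocks" in key:
            ratio = out_ratio    # decoder side
        else:
            ratio = in_ratio     # encoder / shallower blocks: finer, granular detail
        merged[key] = (1 - ratio) * wa + ratio * wb
    return merged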

 
Did... did the people naming these training models get inspired by Lycoris Recoil for the third one? Or is it a coincidence?

But yeah, neat to know how you can influence models to take X from model A and y from model B.
 
Yeah, it's a Lycoris Recoil meme.

github.com

GitHub - KohakuBlueleaf/LyCORIS: Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion.

 
Incidentally, besides Gaussian noise, did you know there are other kinds of latent 'noise' that researchers have been experimenting with?

github.com

GitHub - arpitbansal297/Cold-Diffusion-Models: Official implementation of Cold-Diffusion for different transformations in pytorch.


My favorite is animorphs because it's just so ridiculous. It's the third one:
How TF do 2 and 4 qualify as noise?

I admit I don't really understand how using static lets it generate images, but I always figured the static required some level of, well, static, for the whole thing to work.
 
How I look at the tech is very anthropomorphized, but here is my explanation:
The vital point of the recent innovations in image AI tech is that we figured out that asking an AI to draw something from scratch is a bad idea, somehow, and we instead need to frame the task as 'Someone accidentally spilled some noise all over this image of an anime lady with big boobs. Could you clean it for me?'. That the image we then hand the AI is pure useless randomness with no image beneath doesn't actually matter. The magic is entirely in pretending that the image already exists and can be found if the AI looks hard enough and takes full advantage of our """helpful hints""" (which are actually just our wishes for what image we want).

So what is shown is that we can instead go for 'Someone accidentally spilled some cat pictures all over this...' without messing up the trick.
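For the standard Gaussian case, the 'spilling noise on the image' step really is just one blend of the clean latent with random noise; cold diffusion swaps that degradation for something else (blur, masking, apparently cat pictures) and trains the model to undo that instead. A rough sketch of the usual forward process, not the actual scheduler code from any of these repos:

Python:
import torch

def add_gaussian_noise(x0, t, alphas_cumprod):
    # DDPM-style forward process: the higher t is, the less of the original survives.
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return x_t, noise   # the UNet learns to recover the noise (and thus x0) from x_t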
 
Don't worry, you can still be in for a miserable time if you have the wrong pytorch version. God, a fucking week later, this takes the piss.
Uh, any advice on this? I'm currently beating my head against it, and I seem to be making progress, but I'd still love to hear what worked for you. :/
 
Oh, a fellow Linux user. You have any links or guide for starting out? I have the double-whammy of having Linux and AMD, so it's been a bit hard trying to find a clear and easy guide.

Not really. I've been trying to follow the instructions in the GitHub readmes and in the threadmarks here, but I haven't gotten anything working yet. :/
 
Linux and webui! Though I'm thinking I might just try building from source or whatever. :/
If you're using Linux, shouldn't you only need to install Python with whatever your distro's install method is? For some reason Stable Diffusion webui uses virtual environments instead of just throwing the whole thing into a container, so generally you can resolve the issue by just deleting the venv. But if you are actually having weird problems with venv, someone threw all of the major UIs into a Docker container:
github.com

GitHub - AbdBarho/stable-diffusion-webui-docker: Easy Docker setup for Stable Diffusion with user-friendly UI


Oh, a fellow Linux user. You have any links or guide for starting out? I have the double-whammy of having Linux and AMD, so it's been a bit hard trying to find a clear and easy guide.

View: https://twitter.com/PellyNV/status/1661035100581113858
>me when my GPU provider has no ML support.
 