LoRAs still use Dreambooth to generate a new checkpoint, though, with all the same constraints.
The trick with LoRA is that it's a diff against the base checkpoint. So instead of storing the whole new thing you trained, you can store just a summary of the biggest changes, with a customizable cutoff point. That's why LoRA models are so much smaller, though also why you can only use one LoRA at a time: in effect you're loading a different model, just for generating this specific image.
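To make the "diff with a cutoff" idea concrete, here's a toy sketch (made-up shapes, not SD's real layers): the LoRA stores two skinny factors whose product approximates the weight change, and the rank r is the adjustable cutoff.
Python:
# Toy illustration only, not SD's actual code. Shapes and values are invented.
import torch

d_out, d_in, r = 320, 320, 8          # r is the LoRA rank / "dim" setting
W_base = torch.randn(d_out, d_in)     # one weight matrix from the base checkpoint

# instead of storing the full finetuned weight, store two skinny factors
lora_up = torch.randn(d_out, r)       # "lora_up"
lora_down = torch.randn(r, d_in)      # "lora_down"
alpha, multiplier = 8, 1.0

# applying the LoRA = adding the reconstructed delta back onto the base weight
W_patched = W_base + multiplier * (lora_up @ lora_down) * (alpha / r)

# full delta would be d_out*d_in numbers; the LoRA only stores r*(d_out + d_in)
print(W_base.numel(), lora_up.numel() + lora_down.numel())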
That would explain why my attempt to use two different LoRAs in ComfyUI by chaining them together didn't work very well. I wonder how A1111's "additional networks" panel, which allows using multiple LoRAs, works.
Generally speaking you just apply both. As Baughn notes, LoRAs are very similar to little models, just shrunk down until you only get their deltas. This means that most of the stuff you could do to a model, you can also do to a LoRA. That includes things like merging LoRAs with models, with other LoRAs, and so on. In fact, using a LoRA is basically just applying/merging its weight deltas onto the network.
Check the additional_networks code.
Python:
for lora in self.text_encoder_loras + self.unet_loras:
    if type(lora) == LoRAModule:
        lora.apply_to()  # ensure remove reference to original Linear: reference makes key of state_dict
        self.add_module(lora.lora_name, lora)
    else:
        # SD2.x MultiheadAttention merge weights to MHA weights
        lora_info: LoRAInfo = lora
        if lora_info.module_name not in mha_loras:
            mha_loras[lora_info.module_name] = {}

        lora_dic = mha_loras[lora_info.module_name]
        lora_dic[lora_info.lora_name] = lora_info
        if len(lora_dic) == 4:
            # calculate and apply
            w_q_dw = state_dict.get(lora_info.module_name + '_q_proj.lora_down.weight')
            if w_q_dw is not None:  # corresponding LoRA module exists
                w_q_up = state_dict[lora_info.module_name + '_q_proj.lora_up.weight']
                w_k_dw = state_dict[lora_info.module_name + '_k_proj.lora_down.weight']
                w_k_up = state_dict[lora_info.module_name + '_k_proj.lora_up.weight']
                w_v_dw = state_dict[lora_info.module_name + '_v_proj.lora_down.weight']
                w_v_up = state_dict[lora_info.module_name + '_v_proj.lora_up.weight']
                w_out_dw = state_dict[lora_info.module_name + '_out_proj.lora_down.weight']
                w_out_up = state_dict[lora_info.module_name + '_out_proj.lora_up.weight']

                sd = lora_info.module.state_dict()
                qkv_weight = sd['in_proj_weight']
                out_weight = sd['out_proj.weight']
                dev = qkv_weight.device

                def merge_weights(weight, up_weight, down_weight):
                    # calculate in float
                    scale = lora_info.alpha / lora_info.dim
                    dtype = weight.dtype
                    weight = weight.float() + lora_info.multiplier * (up_weight.to(dev, dtype=torch.float) @ down_weight.to(dev, dtype=torch.float)) * scale
                    weight = weight.to(dtype)
                    return weight

                q_weight, k_weight, v_weight = torch.chunk(qkv_weight, 3)
                if q_weight.size()[1] == w_q_up.size()[0]:
                    q_weight = merge_weights(q_weight, w_q_up, w_q_dw)
                    k_weight = merge_weights(k_weight, w_k_up, w_k_dw)
                    v_weight = merge_weights(v_weight, w_v_up, w_v_dw)
                    qkv_weight = torch.cat([q_weight, k_weight, v_weight])

                    out_weight = merge_weights(out_weight, w_out_up, w_out_dw)

                    sd['in_proj_weight'] = qkv_weight.to(dev)
                    sd['out_proj.weight'] = out_weight.to(dev)

                    lora_info.module.load_state_dict(sd)
                else:
                    # different dim, version mismatch
                    print(f"shape of weight is different: {lora_info.module_name}. SD version may be different")

                for t in ["q", "k", "v", "out"]:
                    del state_dict[f"{lora_info.module_name}_{t}_proj.lora_down.weight"]
                    del state_dict[f"{lora_info.module_name}_{t}_proj.lora_up.weight"]
                    alpha_key = f"{lora_info.module_name}_{t}_proj.alpha"
                    if alpha_key in state_dict:
                        del state_dict[alpha_key]
            else:
                # corresponding weight not exists: version mismatch
                pass
As you can see, it's literally just a for loop over all your LoRAs that merges the weights and stores them in your state dict. Later on, once you convert the state dict to the CompVis format, you just swap in the altered weights from your state_dict. The LoRA approach can also be applied to ControlNets. See:
So the ControlNet model becomes an SD1.x LoRA.
And if you're wondering, yes that means you can also turn models into loras. Generally speaking they're kind of shitty LoRAs unless you're extracting from a Dreambooth finetuned model (and even then tbh they're kind of shitty because LoRAs tend to work better if you caption your training data) but they'll work.
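To make the "turn a model into a LoRA" extraction concrete, here's a minimal sketch of the usual approach: diff the finetuned weights against the base and keep only the top-r singular directions. Real extraction scripts (kohya's included) do this per layer with a lot more bookkeeping; the names and shapes here are placeholders.
Python:
# Minimal sketch of extracting a low-rank LoRA from a weight difference.
import torch

def extract_lora(W_finetuned, W_base, rank):
    delta = (W_finetuned - W_base).float()
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    lora_up = U[:, :rank] * S[:rank]      # (d_out, r)
    lora_down = Vh[:rank, :]              # (r, d_in)
    return lora_up, lora_down

# toy weights standing in for one layer of a base and a Dreambooth-finetuned checkpoint
W_base = torch.randn(320, 320)
W_tuned = W_base + 0.01 * torch.randn(320, 320)
up, down = extract_lora(W_tuned, W_base, rank=16)
# up @ down is the best rank-16 approximation of the weight difference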
A potentially interesting new way to run Stable Diffusion, now entirely on the CPU! The benchmarks show that it's slower than using a GPU and may not support all CPUs (need to look into that more), but it should make Stable Diffusion accessible to a wider range of people. GitHub - bes-dev/stable_diffusion.openvino
The big downside is that it only comes with a command-line interface, which isn't terribly user friendly. That's manageable if you have experience with a Python IDE and fixable if you're experienced with one of the GUI libraries for Python.
Edit: The other big downside is that it's set up to only work with downloading models from HuggingFace. The insistence some people have on forcing models to be downloaded from online every single time the script is run instead of using locally stored files baffles me.
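If the script really does pull everything through huggingface_hub (my assumption here, not something I've verified in bes-dev's code), one common workaround is to download the files once into the local cache and then force the hub offline on later runs. The repo id below is a guess at what the project uses; check its README.
Python:
# Hedged workaround sketch, assuming the tool fetches weights via huggingface_hub.
# The repo id is a guess; substitute whatever the project's README actually points at.
import os
from huggingface_hub import snapshot_download

local_dir = snapshot_download("bes-dev/stable-diffusion-v1-4-openvino")  # one-time download
print("cached at:", local_dir)

# before later runs, force the hub to reuse the cache instead of hitting the network
os.environ["HF_HUB_OFFLINE"] = "1"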
There aren't really any. There's a setting you use to determine how large the LoRA should be; in the limit it'll be 4GB and basically an entire model. A poor one, because it was never designed for that, and there's no reason to try it.
Somewhere around the 100MB mark you're unlikely to notice any difference between a LoRA and its corresponding checkpoint.
That's of course only assuming you're applying it to the same model you trained against. LoRAs famously don't need to be used that way, but YMMV when you don't.
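For a rough sense of how that size setting (the rank, or "dim") turns into file size, here's a back-of-the-envelope sketch. The layer count and widths are made-up placeholders, not SD's real architecture.
Python:
# Rough estimate of LoRA file size as a function of rank. All numbers are placeholders.
def lora_params(d_in, d_out, rank):
    # a LoRA for one linear layer stores two low-rank factors:
    # down (rank x d_in) and up (d_out x rank)
    return rank * d_in + d_out * rank

layers = 300   # pretend the model has 300 adapted layers
width = 768    # pretend they're all ~768 wide
for rank in (4, 32, 128):
    params = layers * lora_params(width, width, rank)
    mb = params * 2 / 1024**2   # fp16 = 2 bytes per parameter
    print(f"rank {rank}: ~{mb:.0f} MB")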
Is using the CPU to run Stable Diffusion really that new? Cmdr2 has had the ability to run from the CPU for months, and it also has a GUI to boot. Not to mention that like most other UIs, Cmdr2 uses locally stored files for models.
(It does run slow on the CPU, but I usually limit the speed on the CPU when using it instead of the GPU for image generation. Otherwise CPU power draw goes wild and the CPU will get to temperatures above 80 degrees Celsius. Doubling the clock speed when boosting can quadruple the power consumption. An interesting note: You can set the exact same seed number, prompt, step number and guidance strength for CPU and GPU image generation, but the resulting images will not be identical.)
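If you want to see that last point for yourself, here's a hedged sketch using the diffusers library and the SD1.5 weights (my own choices for illustration; Cmdr2 doesn't necessarily work this way internally): run the exact same seed and settings on the CPU and on the GPU and compare the outputs.
Python:
# Sketch: same seed, same settings, different device -> generally different images.
import torch
from diffusers import StableDiffusionPipeline

prompt = "a lighthouse at sunset"
seed = 1234

def generate(device):
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
    ).to(device)
    # seed the RNG on the same device the initial latents are sampled on
    generator = torch.Generator(device=device).manual_seed(seed)
    return pipe(prompt, num_inference_steps=20, guidance_scale=7.5,
                generator=generator).images[0]

cpu_image = generate("cpu")
gpu_image = generate("cuda")
# Even with identical settings, the two won't be pixel-identical: the initial noise
# comes from different RNG implementations and the math kernels differ between devices.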
Cmdr2 looks interesting, as a way to run Stable Diffusion on my home server, and would let me avoid shelling out 500+ AUD for an 8-12GB NVIDIA GPU for said machine; currently it only has a CPU. It also might melt my home server, though, or ratchet the power use sky high, neither of which are... enjoyable consequences.
I've recently completed an IRC bot that talks to the A1111 web UI for SD, see, then realised I have nothing to run it on beyond my gaming desktop; not ideal for general use amongst friends on an IRC server I frequent. Hmm...
The field is moving crazy fast, though. Been wild following it all.
Do note that using Cmdr2's CPU mode for image generation will involve minutes per image.
Also, the CPU will boost to full speed during generation, unless you decide to limit clock speeds or power draw beforehand. Something to take note of if you want to avoid thermal throttling, or if you just hate the idea of CPU temperatures approaching 100 degrees Celsius.
I just shelled out for a new GPU instead, after thinking on things, yeah. My shiny RTX 3060 12GB should arrive in a few days, after which I'll swap the current gaming GPU into my home server for Stable Diffusion usage.
WHAT IS A UNET (Without going into the math because I don't understand it)
Recall this from the OP?
Like most latent diffusion models, Stable Diffusion consists of three parts: 1. A Variational AutoEncoder (if you have worked with video encoding in any way this might sound a bit familiar), 2. a UNet, which is a type of artificial neural network originally designed to do image segmentation for biomedical purposes at the University of Freiburg, and 3. a Contrastive Language-Image Pre-training (CLIP) encoder, aka a neural language model.
UNets are the primary thing responsible for creating coherent latent space for your image generation. Not a coherent image, because that's what the VAE does, but the UNet is the part responsible for resolving that tangled mess of latent space into something the VAE can process without it just seeming like jumbled noise. The first thing we have to understand is that a UNet was primarily built and designed for image segmentation. What is image segmentation? It's when you segment (i.e. partition) an image into its constituent components.
(A picture of a cat before and after run through image segmentation)
Basically what we're doing is removing a lot of details from the image and reducing it down to just the thing we want. This might sound counterintuitive. Why are we removing details instead of adding them? Well, we don't exactly start from a white canvas like an actual artist would. If you remember once more from the OP, Stable Diffusion is a denoising process, like the kind used to clean up noisy, grainy images, which means we are actually removing noise (i.e. excess detail) from a noise-filled latent space in order to get to the final image. This means what our UNet is doing is actually removing excess noise until we get to the image we want.
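To make "removing excess noise until we get to the image" concrete, here's a minimal sketch of that sampling loop built from the diffusers components. The model id, step count, and the omission of classifier-free guidance are my own simplifications for readability, not how any particular UI actually does it.
Python:
# Bare-bones denoising loop: UNet predicts noise, scheduler removes a bit of it each step.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

tokens = tokenizer(["a photo of a cat"], padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
text_embeddings = text_encoder(tokens.input_ids)[0]

scheduler.set_timesteps(25)
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma  # pure noise to start

with torch.no_grad():
    for t in scheduler.timesteps:
        # the UNet looks at the noisy latents (plus the text conditioning) and
        # predicts which part of what it sees is noise...
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        # ...and the scheduler subtracts a little of it, giving slightly cleaner latents
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    # only at the very end does the VAE turn the cleaned-up latents into pixels
    # (classifier-free guidance is left out here to keep the loop readable)
    image = vae.decode(latents / vae.config.scaling_factor).sample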
Okay so now that we understand what the role of the Unet is, let's discuss what it is. A UNet is composed of two paths, an encoder, which is responsible for contracting the image, and a decoder which is responsible for expanding the image. As the image goes through contraction and expansion it loses fidelity, noise, details, or whatever you want to call it. However, the UNet's architecture is also different from traditional encoders and decoders. It's also the reason it's called a U-net.
It's shaped like a giant 'U'. The UNet layers not only follow their 'path' of encoding and decoding but also 'talk' directly to their counterparts on the other side. This means we're better able to preserve the general shape and details of the original while continuing to shed noise. This is why UNets are a lot stronger than many other image segmentation architectures when it comes to Stable Diffusion's needs.
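To show the 'U' and those cross-connections in code, here's a toy, heavily shrunken UNet sketch. It's nothing like SD's real UNet, just the skeleton: an encoder going down, a bottleneck, a decoder going back up, and a skip connection letting a decoder stage see its encoder counterpart.
Python:
# Toy UNet skeleton; dimensions and depth are invented for illustration.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        # encoder: progressively downsample (the left side of the 'U')
        self.enc1 = nn.Conv2d(3, ch, 3, padding=1)
        self.enc2 = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)
        # bottleneck / backbone (the bottom of the 'U')
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)
        # decoder: upsample back (the right side of the 'U')
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        self.dec1 = nn.Conv2d(ch * 2, 3, 3, padding=1)  # takes upsampled features + skip

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))        # fine details live here
        e2 = torch.relu(self.enc2(e1))       # coarser, more abstract features
        m = torch.relu(self.mid(e2))
        d1 = torch.relu(self.up(m))
        # the skip connection: concatenate the encoder's output with the decoder's,
        # so fine detail isn't lost on the way down and back up
        return self.dec1(torch.cat([d1, e1], dim=1))

out = TinyUNet()(torch.randn(1, 3, 64, 64))  # same spatial size out as in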
Understanding the basic components of a UNet is pretty useful because UNets are used in a lot of places, not just Stable Diffusion. But when it comes to Stable Diffusion checkpoints, specifically checkpoint merging (averaging the weights from different checkpoints to hybridize the strengths of both), it helps us understand and conceptualize an extremely powerful merging technique. You see, the UNet of Stable Diffusion is organized into shallow and deep blocks. And this changes everything.
BLOCK WEIGHT MERGING
Block weight merging was first conceptualized, as far as I can tell, by the Stable Diffusion amateur researcher Kohya S., who also wrote most of the scripts people use nowadays to train LoRAs (which, as far as the cutting edge goes, have since been superseded by LoCon and subsequently LyCORIS as of this post).
Kohya S. proposed the idea of merging deep and shallow layers separately. This allows us to somewhat fine-tune the merging process by selecting what exactly we want from each model. Say we want more of the compositional direction from one model and more of the details from the other. We can do this. Take a look at this completely non-authoritative experimental musing about what each block layer possibly does.
As block-weight merging is, even now, experimental, and what exactly any part of a particular UNet does is a black box, it's best not to treat these images as authoritative; they're completely inferred. But one basic concept can be understood: the closer a layer sits to the backbone of the UNet (aka the middle layer), the more authority it possesses over the overall composition; the shallower it is, the more it's responsible for fine, granular details. This concept's results are best seen in the popular anime model 'AbyssOrangeMix'. AbyssOrangeMix uses AnythingV3, a popular Chinese checkpoint merged out of the NovelAI leak, as a base, and applies a block merge of a realistic model onto it. This partially fuses the UNet layers of the realistic model responsible for color direction and fine granular detail (shading, texture, etc.) onto AnythingV3. The result is an anime model with incredibly detailed textures, and it's primarily responsible for the aesthetic that has become the defining quality of 'AI anime art' these days.
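If you're curious what a block merge looks like mechanically, here's a bare-bones sketch over two checkpoints in the CompVis layout. The per-block ratios are invented for illustration, and all the bookkeeping real merge tools handle (the usual 25-weight IN/MID/OUT presets, proper treatment of the VAE and text encoder, and so on) is left out.
Python:
# Sketch of block-weighted merging: a different A/B ratio per UNet block.
import re
import torch

def block_ratio(key, ratios):
    # decide how much of model B to use for this parameter, based on which UNet block it's in
    m = re.search(r"diffusion_model\.(input_blocks|middle_block|output_blocks)\.?(\d*)", key)
    if m is None:
        return ratios["default"]
    block, idx = m.group(1), m.group(2)
    if block == "middle_block":
        return ratios["middle"]
    return ratios[block][int(idx)] if idx else ratios["default"]

ratios = {
    "default": 0.5,
    "middle": 0.8,               # deep layers: take more composition from model B
    "input_blocks": [0.2] * 12,  # shallow layers: keep model A's fine detail
    "output_blocks": [0.2] * 12,
}

sd_a = torch.load("modelA.ckpt", map_location="cpu")["state_dict"]
sd_b = torch.load("modelB.ckpt", map_location="cpu")["state_dict"]

merged = {}
for key, wa in sd_a.items():
    wb = sd_b.get(key)
    if wb is None or wa.shape != wb.shape or not wa.is_floating_point():
        merged[key] = wa                 # missing/mismatched/non-float keys: keep A's copy
        continue
    r = block_ratio(key, ratios)
    merged[key] = (1 - r) * wa + r * wb  # plain weighted average, per block

torch.save({"state_dict": merged}, "merged.ckpt")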
I admit I don't really understand how using static lets it generate images, but I always figured the static required some level of, well, static, for the whole thing to work.
How I look at the tech is very anthropomorphized, but here is my explanation:
The vital point of the recent innovations in image AI tech is that we figured out that asking an AI to draw something from scratch is a bad idea, somehow, and we instead need to frame the task as 'Someone accidentally spilled some noise all over this image of an anime lady with big boobs. Could you clean it for me?'. That the image we then hand the AI is pure useless randomness with no image beneath doesn't actually matter. The magic is entirely in pretending that the image already exists and can be found if the AI looks hard enough and takes full advantage of our """helpful hints"""(which are actually just our wishes for what image we want).
So what's being shown there is that we can instead go for 'Someone accidentally spilled some cat pictures all over this...' without messing up the trick.
Oh, a fellow Linux user. You have any links or guide for starting out? I have the double-whammy of having Linux and AMD, so it's been a bit hard trying to find a clear and easy guide.
Not really. I've been trying to follow the instructions in the github readmes and in the threadmarks here, but I haven't gotten anything working yet. :/
If you're using Linux, shouldn't you only need to install Python with whatever your distro's install method is? For some reason the Stable Diffusion webui uses virtual environments instead of just throwing the whole thing into a container, so generally you can resolve the issue by just deleting the venv and letting it rebuild. But if you are actually having weird problems with venv, someone threw all of the major UIs into a docker container: