Enhancing Diffusers Package for High-Quality Image Generation | by Andrew Zhu | Apr, 2023

Goodbye Babel, generated by Andrew Zhu using Diffusers in pure Python

Stable Diffusion WebUI from AUTOMATIC1111 has proven to be a powerful tool for generating high-quality images using the Diffusion model. However, while the WebUI is easy to use, data scientists, machine learning engineers, and researchers often require more control over the image generation process. This is where the diffusers package from huggingface comes in, providing a way to run the Diffusion model in Python and allowing users to customize their models and prompts to generate images that fit their specific needs.

Despite its potential, the Diffusers package has several limitations that prevent it from generating images as good as those produced by the Stable Diffusion WebUI. The most significant of these limitations include:

  • The inability to use custom models in the .safetensor file format;
  • The 77 prompt token limitation;
  • A lack of LoRA support;
  • The absence of image scale-up functionality (also known as HighRes in Stable Diffusion WebUI);
  • Low performance and high VRAM usage by default.

This article aims to address these limitations and enable the Diffusers package to generate high-quality images comparable to those produced by the Stable Diffusion WebUI. With the enhancement solutions provided, data scientists, machine learning engineers, and researchers can enjoy greater control and flexibility in their image generation processes while also achieving exceptional results. In the following sections, we will explore the various techniques and strategies that can be used to overcome these limitations and unlock the full potential of the Diffusers package.

Note: if it’s your first time running Stable Diffusion, please follow this link to install all required CUDA and Python packages.

1. Load Up Local Model Files in .safetensor Format

Users can easily spin up diffusers to generate an image like this:

from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")

You may not be satisfied with either the output image or the performance. Let’s deal with the problems one by one. First, let’s load up a custom model in .safetensor format located anywhere on your machine. You cannot simply load the model file like this:

pipeline = DiffusionPipeline.from_pretrained("/model/custom_model.safetensors")

Here are the detailed steps to convert a .safetensor file to diffusers format:

Step 1. Pull all diffusers code from GitHub

git clone

Step 2. Under the scripts folder, locate the file:

In your terminal, run this command to convert the .safetensor file to Diffusers format. Remember to change the --checkpoint_path value to match your case.

python --from_safetensors --checkpoint_path="D:\stable-diffusion-webui\models\Stable-diffusion\deliberate_v2.safetensors" --dump_path="D:\sd_models\deliberate_v2" --device="cuda:0"

Step 3. Now you can load up the pipeline using the newly converted model file. Here is the complete code:

from diffusers import DiffusionPipeline
# model_path: placeholder for the converted model folder (the --dump_path used in Step 2)
pipeline = DiffusionPipeline.from_pretrained(model_path)
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")

You should be able to convert and use any models you download from huggingface or

Cat playing piano generated by the above code

2. Boost the Performance of Diffusers

Generating high-quality images can be a time-consuming process, even on the latest 3xxx and 4xxx Nvidia RTX GPUs. By default, the Diffusers package comes with non-optimized settings. Two solutions can be applied to greatly improve performance.

Here is the iteration speed before applying the following solutions: only about 2.x iterations per second on an RTX 3070 Ti with 8 GB of memory when generating a 512×512 image.

  • Use Half Precision Weights

The first solution is to use half precision weights. Half precision weights use 16-bit floating-point numbers instead of the standard 32-bit numbers. This halves the memory required for storing weights and speeds up computation, which can significantly improve the performance of the Diffusers package.
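As a rough illustration of the memory side of this claim, the sketch below uses Python's standard struct module to compare the storage size of a 32-bit and a 16-bit float, and shows the small rounding error that FP16 introduces. The parameter count is a made-up round number chosen for easy arithmetic, not the size of any particular model, and the helper names are illustrative.

```python
import struct

# Bytes needed to store one number in each format.
FP32_BYTES = struct.calcsize("f")  # IEEE 754 single precision
FP16_BYTES = struct.calcsize("e")  # IEEE 754 half precision


def weight_memory_mib(num_params: int, bytes_per_param: int) -> float:
    """Return the memory needed to store the weights, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)


def round_trip_fp16(value: float) -> float:
    """Store a Python float as FP16 and read it back."""
    return struct.unpack("e", struct.pack("e", value))[0]


# Hypothetical parameter count, used only to make the comparison concrete.
NUM_PARAMS = 1_000_000_000

fp32_mib = weight_memory_mib(NUM_PARAMS, FP32_BYTES)
fp16_mib = weight_memory_mib(NUM_PARAMS, FP16_BYTES)
print(f"FP32: {fp32_mib:.0f} MiB, FP16: {fp16_mib:.0f} MiB")
print(f"0.1 stored as FP16 reads back as {round_trip_fp16(0.1)}")
```

The weights take exactly half the space in FP16, while the value read back differs from 0.1 only slightly, which matches the observation that image quality is barely affected.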

According to this video, reducing float precision from FP32 to FP16 will also enable the Tensor Cores.

I wrote another article testing how much GPU Tensor Cores can improve computation speed.

Here is how to enable FP16 in diffusers. Adding just two lines of code makes generation about 5 times faster, with almost no impact on image quality.

from diffusers import DiffusionPipeline
import torch  # <----- Line 1 added
pipeline = DiffusionPipeline.from_pretrained(
    model_path  # placeholder for the model path or id, e.g. the converted folder from section 1
    ,torch_dtype = torch.float16  # <----- Line 2 added
)
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")

Now the iteration speed rises to 10.x iterations per second, about 5 times faster.

  • Use Xformers

Xformers is an open-source library that provides optimized building blocks for Transformer models, including a memory-efficient attention implementation. It is built on top of PyTorch and can be plugged into existing pipelines with little effort. (These days, are there any models that don’t use Transformer? :P)

Install Xformers with pip install xformers, then switch diffusers to use xformers with one line of code.

pipeline.enable_xformers_memory_efficient_attention()  # <--- one line added

This one line of code boosts performance by another 20%.
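Since xformers is an optional dependency, one defensive pattern is to try enabling it and fall back to the default attention if the call fails. This is only a minimal sketch: the helper name try_enable_xformers is made up for illustration, and the only call taken from the text above is enable_xformers_memory_efficient_attention().

```python
def try_enable_xformers(pipeline) -> bool:
    """Try to switch the pipeline to xformers attention.

    Returns True on success and False if the call fails, for example
    because xformers is not installed. try_enable_xformers is a
    hypothetical helper name, not part of diffusers.
    """
    try:
        pipeline.enable_xformers_memory_efficient_attention()
        return True
    except Exception as error:  # broad on purpose: the failure type depends on the setup
        print(f"xformers not enabled, using default attention: {error}")
        return False
```

It would be called once, right after the pipeline is created, for example try_enable_xformers(pipeline).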

3. Remove the 77 Prompt Token Limitation

In the current version of Diffusers, at most 77 prompt tokens can be used when generating an image.

Fortunately, there is a solution to this problem. By using the “lpw_stable_diffusion” pipeline provided by the community, you can lift the 77 prompt token limitation and generate high-quality images with longer prompts.

To use the “lpw_stable_diffusion” pipeline, you can use the following code:

pipeline = DiffusionPipeline.from_pretrained(
    model_path  # placeholder for the model path or id
    ,custom_pipeline = "lpw_stable_diffusion"  #<--- code added
)

In this code, we initialize a new DiffusionPipeline object using the “from_pretrained” method. We specify the path to the pre-trained model and set the “custom_pipeline” argument to “lpw_stable_diffusion”. This tells Diffusers to use the “lpw_stable_diffusion” pipeline, which lifts the 77 prompt token limitation.
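To give an intuition for how a prompt longer than 77 tokens can be handled at all, here is a simplified sketch of the general chunking idea: split the token ids into windows of 75, frame each window with the start and end tokens so that it is 77 long, and encode the windows separately. This is not the actual lpw_stable_diffusion code. The function name and the plain-list representation are illustrative, and the two special token ids are assumed to be those of the CLIP tokenizer used by Stable Diffusion v1 models.

```python
from typing import List

MAX_LENGTH = 77   # tokens the text encoder accepts per pass
BOS_ID = 49406    # assumed start-of-text id (CLIP tokenizer)
EOS_ID = 49407    # assumed end-of-text id, also used here for padding


def chunk_token_ids(token_ids: List[int]) -> List[List[int]]:
    """Split raw prompt token ids into 77-token windows.

    Each window holds up to 75 prompt tokens, framed by BOS and EOS and
    padded with EOS so every window has exactly MAX_LENGTH entries.
    """
    body = MAX_LENGTH - 2  # room left after BOS and EOS
    chunks = []
    for start in range(0, max(len(token_ids), 1), body):
        window = token_ids[start:start + body]
        padded = [BOS_ID] + window + [EOS_ID]
        padded += [EOS_ID] * (MAX_LENGTH - len(padded))
        chunks.append(padded)
    return chunks


# 100 fake token ids stand in for a long tokenized prompt.
chunks = chunk_token_ids(list(range(1000, 1100)))
print(len(chunks), [len(c) for c in chunks])
```

With 100 prompt tokens, the sketch produces two windows of 77 entries each, so no single pass through the text encoder exceeds its limit.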

Now, let’s use a long prompt string to try it out. Here is the complete code:

from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
    model_path  # placeholder for the model path or id
    ,custom_pipeline = "lpw_stable_diffusion"  #<--- code added
    ,torch_dtype = torch.float16
)
pipeline.to("cuda")
prompt = """
Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine
"""
image = pipeline(prompt).images[0]
image.save("goodbye_babel_tower.png")

And you’ll get an image like this:

Goodbye Babel, generated by Andrew Zhu using diffusers

If you still see a warning message like: Token indices sequence length is longer than the specified maximum sequence length for this model ( *** > 77 ) . Running this sequence through the model will result in indexing errors. It’s normal, and you can just ignore it.

4. Use Custom LoRA with Diffusers

Despite the claims of LoRA support in Diffusers, users still face limitations when it comes to loading local LoRA files in the .safetensor file format. This can be a significant obstacle for users who want to use LoRAs from the community.

To overcome this limitation, I have created a function that allows users to load LoRA files with weights in real time. This function loads a LoRA file and applies it, scaled by the given weight, to a Diffusers model, enabling the generation of high-quality images with LoRA data.

Here is the function body:

import torch
from safetensors.torch import load_file

def __load_lora(
    pipeline
    ,lora_path
    ,lora_weight
):
    state_dict = load_file(lora_path)
    LORA_PREFIX_UNET = 'lora_unet'
    LORA_PREFIX_TEXT_ENCODER = 'lora_te'

    alpha = lora_weight
    visited = []

    # directly update weight in diffusers model
    for key in state_dict:

        # as we have set the alpha beforehand, so just skip
        if '.alpha' in key or key in visited:
            continue

        if 'text' in key:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_TEXT_ENCODER+'_')[-1].split('_')
            curr_layer = pipeline.text_encoder
        else:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_UNET+'_')[-1].split('_')
            curr_layer = pipeline.unet

        # find the target layer: module names may contain '_' themselves,
        # so keep joining name parts until the attribute lookup succeeds
        temp_name = layer_infos.pop(0)
        while len(layer_infos) > -1:
            try:
                curr_layer = curr_layer.__getattr__(temp_name)
                if len(layer_infos) > 0:
                    temp_name = layer_infos.pop(0)
                elif len(layer_infos) == 0:
                    break
            except Exception:
                if len(temp_name) > 0:
                    temp_name += '_'+layer_infos.pop(0)
                else:
                    temp_name = layer_infos.pop(0)

        # org_forward(x) + lora_up(lora_down(x)) * multiplier
        pair_keys = []
        if 'lora_down' in key:
            pair_keys.append(key.replace('lora_down', 'lora_up'))
            pair_keys.append(key)
        else:
            pair_keys.append(key)
            pair_keys.append(key.replace('lora_up', 'lora_down'))

        # update weight (4-D tensors are 1x1 convolutions, handled as matrices)
        if len(state_dict[pair_keys[0]].shape) == 4:
            weight_up = state_dict[pair_keys[0]].squeeze(3).squeeze(2).to(torch.float32)
            weight_down = state_dict[pair_keys[1]].squeeze(3).squeeze(2).to(torch.float32)
            curr_layer.weight.data += alpha * torch.mm(weight_up, weight_down).unsqueeze(2).unsqueeze(3)
        else:
            weight_up = state_dict[pair_keys[0]].to(torch.float32)
            weight_down = state_dict[pair_keys[1]].to(torch.float32)
            curr_layer.weight.data += alpha * torch.mm(weight_up, weight_down)

        # update visited list
        for item in pair_keys:
            visited.append(item)

    return pipeline

The logic is extracted from code in the diffusers git repo.
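The core of the function is the update W' = W + alpha * (up x down), where up and down are the two low-rank matrices stored in the LoRA file. The toy example below reproduces that arithmetic with plain Python lists so it can be run without a GPU. The matrix values and the helper names are made up for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (m x r) matrix by an (r x n) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, up: Matrix, down: Matrix, alpha: float) -> Matrix:
    """Return weight + alpha * (up @ down) without modifying the inputs."""
    delta = matmul(up, down)
    return [
        [w + alpha * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# A 2x2 layer weight and a rank-1 LoRA pair (up: 2x1, down: 1x2).
weight = [[1.0, 0.0], [0.0, 1.0]]
up = [[1.0], [2.0]]
down = [[0.5, 0.25]]

merged = merge_lora(weight, up, down, alpha=0.8)
print(merged)
```

Because the update is a plain addition, merging a second LoRA pair into the result stacks its effect on top of the first one.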

Take one of the well-known LoRAs, MoXin, as an example. You can use the __load_lora function like this:

from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
    model_path  # placeholder for the model path or id
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
)
lora = (r"D:\sd_models\Lora\Moxin_10.safetensors",0.8)
pipeline = __load_lora(pipeline=pipeline,lora_path=lora[0],lora_weight=lora[1])
pipeline.to("cuda")

prompt = """
shukezouma,negative space,shuimobysim
a branch of flower, traditional chinese ink painting
"""
image = pipeline(prompt).images[0]
image.save("a branch of flower.png")

The prompt will generate an image like this:

a branch of flower, generated by Andrew Zhu using diffusers

You can call __load_lora() multiple times to load multiple LoRAs for one generation.

With this function, you can now load LoRA files with weights in real time and use them to generate high-quality images with Diffusers. LoRA loading is pretty fast, usually taking only 1–2 seconds, which is far better than converting the LoRA and using the result (which generates another model file that is gigabytes in size).

5. Use Custom Textual Inversions with Diffusers

Using custom Textual Inversions with the Diffusers package can be a powerful way to generate high-quality images. However, the official documentation of Diffusers suggests that users need to train their own Textual Inversions, which can take up to an hour on a V100 GPU. This may not be practical for many users who want to generate images quickly.

So I investigated it and found a solution that enables diffusers to use a textual inversion just like in Stable Diffusion WebUI. Below is the function I created to load a custom Textual Inversion.

import torch

def load_textual_inversion(
    learned_embeds_path
    , text_encoder
    , tokenizer
    , token = None
    , weight = 0.5
):
    """
    Use this function to load textual inversion model in model initialization stage
    or image generation stage.
    """
    loaded_learned_embeds = torch.load(learned_embeds_path, map_location="cpu")
    string_to_token = loaded_learned_embeds['string_to_token']
    string_to_param = loaded_learned_embeds['string_to_param']

    # separate token and the embeds
    trained_token = list(string_to_token.keys())[0]
    embeds = string_to_param[trained_token]
    embeds = embeds[0] * weight

    # cast to dtype of text_encoder
    dtype = text_encoder.get_input_embeddings().weight.dtype
    embeds = embeds.to(dtype)

    # add the token in tokenizer
    token = token if token is not None else trained_token
    num_added_tokens = tokenizer.add_tokens(token)
    if num_added_tokens == 0:
        #print(f"The tokenizer already contains the token {token}.The new token will replace the previous one")
        raise ValueError(f"The tokenizer already contains the token {token}. Please pass a different `token` that is not already in the tokenizer.")

    # resize the token embeddings
    text_encoder.resize_token_embeddings(len(tokenizer))

    # get the id for the token and assign the embeds
    token_id = tokenizer.convert_tokens_to_ids(token)
    text_encoder.get_input_embeddings().weight.data[token_id] = embeds
    return (tokenizer,text_encoder)

In the load_textual_inversion() function, you need to provide the following arguments:

  • learned_embeds_path: Path to the pre-trained textual inversion model file in .pt or .bin format.
  • text_encoder: Text encoder object obtained from the Diffusion Pipeline.
  • tokenizer: Tokenizer object obtained from the Diffusion Pipeline.
  • token: Optional argument specifying the prompt token, i.e. the keyword that triggers the textual inversion in your prompt. By default, it is set to None, in which case the token stored in the file is used.
  • weight: Optional argument specifying the weight of the textual inversion. By default, I set it to 0.5. You can change it to another value as needed.
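Conceptually, the function does three things: it registers a new token, grows the embedding table by one row, and writes the learned vector (scaled by weight) into that row. The following toy version uses a dict and a list of lists instead of a real tokenizer and text encoder, so every name in it is an illustrative stand-in rather than part of diffusers or transformers.

```python
from typing import Dict, List


def add_token_embedding(
    vocab: Dict[str, int],
    embedding_table: List[List[float]],
    token: str,
    learned_vector: List[float],
    weight: float = 0.5,
) -> int:
    """Register `token` and store its scaled embedding; return the new token id."""
    if token in vocab:
        raise ValueError(f"The vocabulary already contains the token {token}.")
    token_id = len(embedding_table)  # the new row goes at the end of the table
    vocab[token] = token_id
    embedding_table.append([weight * v for v in learned_vector])
    return token_id


# Tiny stand-ins for the tokenizer vocabulary and the embedding matrix.
vocab = {"cat": 0, "piano": 1}
table = [[0.1, 0.2], [0.3, 0.4]]

new_id = add_token_embedding(vocab, table, "styleempire", [2.0, -4.0], weight=0.5)
print(new_id, table[new_id])  # prints: 2 [1.0, -2.0]
```

Once the token has its own row, any prompt containing that keyword picks up the learned embedding, which is what makes the trigger word work.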

You can now use the function with a diffusers pipeline like this:

from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
    model_path  # placeholder for the model path or id
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
    ,safety_checker = None
)
pipeline.to("cuda")

textual_inversion_path = r""  # set this to your textual inversion file (.pt or .bin)

tokenizer = pipeline.tokenizer
text_encoder = pipeline.text_encoder
load_textual_inversion(
    learned_embeds_path = textual_inversion_path
    , tokenizer = tokenizer
    , text_encoder = text_encoder
    , token = 'styleempire'
)

prompt = """
styleempire,award winning beautiful street, storm,((dark storm clouds))
, fluffy clouds in the sky, shaded flat illustration, digital art
, trending on artstation, highly detailed, fine detail, intricate
, ((lens flare)), (backlighting), (bloom)
"""
neg_prompt = """
cartoon, 3d, ((disfigured)), ((bad art)), ((deformed)), ((poorly drawn))
, ((extra limbs)), ((close up)), ((b&w)), weird colors, blurry
, hat, cap, glasses, sunglasses, lightning, face
"""

generator = torch.Generator("cuda").manual_seed(1)
image = pipeline(
    prompt
    ,negative_prompt =neg_prompt
    ,generator = generator
).images[0]

Here is the result of applying an Empire Style Textual Inversion.

The modern street on the left turns into an old London style.

6. Upscale Images

The Diffusers package is great for generating high-quality images, but image upscaling is not its primary function. However, the Stable-Diffusion-WebUI offers a feature called HighRes, which allows users to upscale their generated images to 2x or 4x. It would be great if Diffusers users could enjoy the same feature. After some research and testing, I found that the SwinIR model is an excellent option for image upscaling, and it can easily upscale images to 2x or 4x after they are generated.

To use the SwinIR model for image upscaling, we can use the code from the GitHub repository of JingyunLiang/SwinIR. If you just want the code, downloading models/, utils/ and is enough. Following the readme guideline, you can upscale images like magic.

Here is a sample of how well SwinIR can scale up an image.

Left: original image, Right: 4x SwinIR upscaled image

Many other open-source solutions can be used to improve image quality. Here are three other models that I tried and that return wonderful results.

RealSR can scale up an image 4 times almost as well as SwinIR, and its execution performance is the fastest. Instead of invoking PyTorch and CUDA, the author compiles the code and CUDA usage directly into a binary. In my observations, RealSR can upscale an image in just about 2–4 seconds.

CodeFormer is great at restoring blurred or broken faces. It can also remove noise and enhance background details. This solution and algorithm are widely used in other applications, including Stable-Diffusion-WebUI.

GFPGAN is another powerful open-source solution that achieves amazing results in face restoration, and it is fast too. GFPGAN is also integrated into Stable-Diffusion-WebUI.

7. Optimize Diffusers CUDA Memory Usage

When using Diffusers to generate images, it’s important to consider CUDA memory usage, especially when you want to load other models to further process the generated images. If you try to load another model like SwinIR to upscale images, you might encounter a RuntimeError: CUDA out of memory, because the Diffusers model is still occupying the CUDA memory.

To mitigate this issue, there are several ways to optimize CUDA memory usage. The following two solutions are the ones I found to work best:

  • Sliced Attention for Additional Memory Savings

Sliced attention is a technique that reduces the memory usage of the self-attention mechanism in transformers. By splitting the attention computation into smaller slices that are processed one after another, the peak memory requirement is lowered. This technique can be used with the Diffusers package to reduce the memory footprint of the Diffusers model.

To use it in Diffusers, simply add one line of code:
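A sketch of that call, assuming the pipeline provides the enable_attention_slicing() method found on diffusers pipelines, could look like the wrapper below. The small estimate function is only an illustration of why slicing helps: it compares the size of the attention score matrices when all heads are processed at once with the size when a single head is processed per step. The wrapper name, the head count, and the sequence length are assumed example values.

```python
def enable_sliced_attention(pipeline):
    """Turn on sliced attention; assumes a diffusers-style pipeline object."""
    pipeline.enable_attention_slicing()
    return pipeline


def attention_scores_mib(seq_len: int, heads: int, bytes_per_value: int = 2) -> float:
    """Memory for `heads` attention score matrices of size seq_len x seq_len, in MiB."""
    return heads * seq_len * seq_len * bytes_per_value / (1024 ** 2)


# Assumed example: 64 x 64 latent positions (a 512 x 512 image) and 8 heads in FP16.
SEQ_LEN = 64 * 64
HEADS = 8

all_heads = attention_scores_mib(SEQ_LEN, HEADS)
one_head = attention_scores_mib(SEQ_LEN, 1)
print(f"all heads at once: {all_heads:.0f} MiB, one head per step: {one_head:.0f} MiB")
```

The trade-off is that the slices are computed one after another, so peak memory drops at the cost of some speed.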


  • Offload the Model to CPU Memory

Usually, you won’t have two models running at the same time. The idea is to offload the model data to CPU memory temporarily to free up CUDA memory for other models, and to load it back into VRAM only when you start using the model.

To dynamically offload data to CPU memory in Diffusers, use this line of code:
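A minimal sketch of that call, assuming the pipeline provides the enable_model_cpu_offload() method found on diffusers pipelines (which relies on the accelerate package being installed), might look like this. The wrapper name is hypothetical and only there to make the example self-contained.

```python
def enable_cpu_offload(pipeline):
    """Ask the pipeline to keep idle sub-models in CPU memory.

    Assumes a diffusers-style pipeline; enable_cpu_offload is a
    hypothetical wrapper name used only in this example.
    """
    pipeline.enable_model_cpu_offload()
    return pipeline
```

It would be called once after the pipeline is created, before the first image is generated.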


After applying this, whenever Diffusers finishes an image generation task, the model data is automatically offloaded to CPU memory until the next call.


This article discusses how to improve the performance and capabilities of the Diffusers package. It covers several solutions to common issues faced by Diffusers users, including loading local .safetensor models, boosting performance, removing the 77 prompt token limitation, using custom LoRA and Textual Inversion, upscaling images, and optimizing CUDA memory usage.

By applying these solutions, Diffusers users can generate high-quality images with better performance and more control over the process. The article also includes code snippets and detailed explanations for each solution.

If you can successfully apply these solutions and code in your case, there may be an additional benefit, one I have gained a lot from: by reading the Diffusers source code, you can implement your own solutions and understand better how Stable Diffusion works. To me, learning, discovering, and implementing these solutions has been a fun journey. I hope these solutions help you too, and I wish you a lot of enjoyment with Stable Diffusion and the diffusers package.

Here is the prompt that generates the heading image:

Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine

Size: 600 * 800
Seed: 3977059881
Scheduler (or Sampling method): DPMSolverMultistepScheduler
Sampling steps: 25
CFG Scale (or Guidance Scale): 7.5
SwinIR model: 003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth
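For readers who want to apply these settings in code, the sketch below collects them in a dict and passes them to a pipeline call. It assumes that the size means width 600 and height 800, that the pipeline accepts the usual Stable Diffusion call arguments, and that torch and diffusers are installed. The function name is made up for illustration, and the SwinIR upscaling step is not included.

```python
# Settings copied from the list above; the width/height order is an assumption.
GENERATION_SETTINGS = {
    "width": 600,
    "height": 800,
    "num_inference_steps": 25,
    "guidance_scale": 7.5,
}
SEED = 3977059881


def generate_heading_image(pipeline, prompt: str):
    """Run one generation with the settings above (needs torch, diffusers and a CUDA GPU)."""
    # Imported lazily so the settings can be inspected without these packages.
    import torch
    from diffusers import DPMSolverMultistepScheduler

    # Swap the sampler while keeping the rest of the scheduler configuration.
    pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
    generator = torch.Generator("cuda").manual_seed(SEED)
    return pipeline(prompt, generator=generator, **GENERATION_SETTINGS).images[0]
```

Even with the same seed, results can differ between GPUs, library versions, and models, so an exact match with the heading image should not be expected.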

License and Code Reuse

The solutions provided in this article were achieved through extensive source reading, late-night testing, and logical design. It is important to note that at the time of writing (April 2023), the LoRA and Textual Inversion loading solutions and code included in this article are, to my knowledge, the only working versions across the internet.

If you find the code presented in this article useful and want to reuse it in your project, paper, or article, please reference this Medium article. The code presented here is licensed under the MIT license, which permits you to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the conditions of the license.

Please note that the solutions presented in this article may not be the optimal or most efficient way to achieve the desired results, and are subject to change as new developments and improvements are made. It is always advisable to thoroughly test and validate any code before using it in a production environment.

