Accelerating Image Generation in Stable Diffusion with TensorRT and Alibaba Cloud ACK

This article explains how to leverage TensorRT to speed up image generation in Stable Diffusion using the Alibaba Cloud ACK cloud-native AI suite.

By Jing Gu (Zibai)

How Does TensorRT Accelerate Stable Diffusion?

Generative AI's image generation technology has seen rapid development in recent years. It can create images based on human language descriptions and is extensively used in industries such as fashion, architecture, animation, advertising, and gaming.

Stable Diffusion WebUI is one of the most popular GitHub projects that use generative AI to create images. Stable Diffusion works in three steps: it encodes the text with ClipText, then uses UNet and a scheduler to perform diffusion in the latent space, and finally uses an autoencoder decoder to turn the diffused latent produced in the second step into an image.

Stable Diffusion Pipeline

The main challenge with Stable Diffusion is slow image generation. Stable Diffusion already applies several techniques to speed it up, with the goal of near-real-time generation. In particular, it uses an encoder to compress an image from 3×512×512 to 4×64×64 and performs diffusion in this latent space, which greatly reduces the amount of computation while preserving image quality. Even so, generating an image from a complex prompt takes about 4 seconds on a GPU, which is still slow for many consumer applications.
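To put the size of that compression in numbers, the short sketch below counts the values in each tensor. It uses only the two shapes quoted above; the variable names are illustrative.

```python
from math import prod

pixel_shape = (3, 512, 512)   # RGB image: channels, height, width
latent_shape = (4, 64, 64)    # latent tensor produced by the encoder

pixel_values = prod(pixel_shape)     # number of values in pixel space
latent_values = prod(latent_shape)   # number of values in latent space
reduction = pixel_values / latent_values

print(f"pixel space:  {pixel_values} values")
print(f"latent space: {latent_values} values")
print(f"reduction:    {reduction:.0f}x")
```

Each diffusion step therefore operates on 48 times fewer values than it would in pixel space.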

TensorRT, provided by NVIDIA, is a high-performance deep learning inference framework. Its optimizing compiler and runtime deliver low latency and high throughput for latency-sensitive applications. It can optimize almost all deep neural networks, including CNNs, RNNs, and Transformers. The specific optimizations include:

• Reduced and mixed precision: supports FP32, TF32, FP16, and INT8
• Layer and tensor fusion: optimizes GPU memory and bandwidth usage by fusing nodes into a single kernel
• Kernel auto-tuning: selects the best kernels and algorithms for the target GPU
• Dynamic tensor memory: allocates memory for tensors only while they are in use, which reduces the memory footprint
• Multi-stream execution: processes multiple input streams in parallel
• Time fusion: optimizes RNNs across time steps
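As a rough illustration of the reduced-precision trade-off, the sketch below rounds a value to FP32 and FP16 with Python's standard `struct` module and compares the storage size and rounding error. It does not call TensorRT; the helper names `to_fp16` and `to_fp32` are made up for this example.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 0.1
fp32_bytes = struct.calcsize("<f")   # bytes needed to store one FP32 value
fp16_bytes = struct.calcsize("<e")   # bytes needed to store one FP16 value
err32 = abs(to_fp32(value) - value)  # rounding error introduced by FP32
err16 = abs(to_fp16(value) - value)  # rounding error introduced by FP16

print(f"FP32: {fp32_bytes} bytes per value, rounding error {err32:.2e}")
print(f"FP16: {fp16_bytes} bytes per value, rounding error {err16:.2e}")
```

FP16 halves the storage per value at the cost of a larger rounding error, which is why lower precision is usually validated against image quality before it is adopted.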

The following figure shows the basic TensorRT workflow, which is divided into a build phase and a runtime phase.

TensorRT Pipeline
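The two phases can be sketched with the TensorRT Python API as follows. This is a minimal sketch, assuming TensorRT 8.x, an ONNX model with static input shapes (dynamic shapes would also need an optimization profile), and hypothetical file names such as `unet.onnx` and `unet.plan`. It is not the code that the pipeline used later in this article runs internally.

```python
def build_engine(onnx_path, plan_path, fp16=True):
    """Build phase: parse an ONNX model and write a serialized engine (plan file)."""
    import tensorrt as trt  # imported lazily so this file loads without TensorRT

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parsing failed: " + "; ".join(errors))

    config = builder.create_builder_config()
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels where available

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT failed to build the engine")
    with open(plan_path, "wb") as f:
        f.write(serialized)


def load_engine(plan_path):
    """Runtime phase: deserialize a plan file into an executable engine."""
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(plan_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())


# Hypothetical usage on a machine with a GPU and TensorRT installed:
# build_engine("unet.onnx", "unet.plan")
# engine = load_engine("unet.plan")
```

The build phase is slow because TensorRT tries many kernel candidates, so the plan file is normally built once and reused at runtime.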

Practice Based on Alibaba Cloud ACK

Cloud-native AI Suite

The cloud-native AI suite is a Container Service for Kubernetes (ACK) offering provided by Alibaba Cloud, utilizing cloud-native AI technologies and products to assist enterprises in deploying cloud-native AI systems swiftly and efficiently.

Environment Configuration

  1. Install the cloud-native AI suite by referring to the documentation.
  2. Log on to the Container Service console. In the left-side navigation pane, choose Clusters > Applications > Cloud-native AI Suite. When the development console is ready, click AI Developer Console.
  3. In the left-side navigation pane of the AI Developer Console, click Notebook. In the upper-right corner of the Notebook page, click Create Notebook to create a new notebook environment. Use the following notebook resource specifications: 16 CPU cores, 32 GB of memory, and 16 GB of GPU memory.


Prepare the Stable Diffusion and TensorRT Environment

1.  Run the following commands in the notebook that you created to install the required dependencies.

!pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
!pip install --upgrade "torch <2.0.0"
!pip install --upgrade "tensorrt>=8.6"
!pip install --upgrade "accelerate" "diffusers==0.21.4" "transformers"
!pip install --extra-index-url https://pypi.ngc.nvidia.com --upgrade "onnx-graphsurgeon" "onnxruntime" "polygraphy"
!pip install polygraphy==0.47.1 -i https://pypi.ngc.nvidia.com
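After the installation finishes, it can help to confirm which versions actually ended up in the environment, because several of the commands above use `--upgrade` without a pinned version. The following sketch uses only the standard library; the `installed_versions` helper is illustrative, and the package list simply mirrors the commands above.

```python
from importlib import metadata

PACKAGES = [
    "torch", "tensorrt", "diffusers", "transformers", "accelerate",
    "onnx-graphsurgeon", "onnxruntime", "polygraphy",
]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:20s} {version or 'NOT INSTALLED'}")
```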

2.  Download the model.

import diffusers
import torch
import tensorrt
from diffusers.pipelines.stable_diffusion import StableDiffusionPipeline
from diffusers import DDIMScheduler

# By default, the model is downloaded from Hugging Face. If your machine cannot access Hugging Face, you can use a local model instead.
# To use a local model, replace runwayml/stable-diffusion-v1-5 with the path to the local model.
model_path = "runwayml/stable-diffusion-v1-5"
scheduler = DDIMScheduler.from_pretrained(model_path, subfolder="scheduler")
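If the same notebook has to run both with and without access to Hugging Face, the choice between a local folder and the Hub ID can be made in code. The sketch below is one possible approach: `resolve_model_path` and the `SD_MODEL_DIR` environment variable are hypothetical names, and the check relies on the `model_index.json` file that diffusers pipeline folders contain.

```python
import os

def resolve_model_path(local_dir, hub_id="runwayml/stable-diffusion-v1-5"):
    """Return local_dir if it looks like a diffusers model folder, else the Hub ID."""
    if local_dir and os.path.isfile(os.path.join(local_dir, "model_index.json")):
        return local_dir
    return hub_id

# SD_MODEL_DIR is a hypothetical variable; leave it unset to download from the Hub.
model_path = resolve_model_path(os.environ.get("SD_MODEL_DIR"))
print("loading model from:", model_path)
```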

3.  Use TensorRT to generate a serialized network (the internal representation of the TensorRT computation graph).

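# Use a custom pipeline. The arguments below are assumed from the diffusers community TensorRT text-to-image example (stable_diffusion_tensorrt_txt2img).
pipe_trt = StableDiffusionPipeline.from_pretrained(
    model_path,
    custom_pipeline="stable_diffusion_tensorrt_txt2img",
    revision='fp16',
    torch_dtype=torch.float16,
    scheduler=scheduler,
)

# Set the cache address.
# An engine folder is generated under the cache address. It contains the clip.plan, unet.plan, and vae.plan files. Generating the plan files for the first time takes about 35 minutes on an A10.
pipe_trt.set_cached_folder(model_path, revision='fp16')
pipe_trt = pipe_trt.to("cuda")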
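Because the first build is slow, it is worth checking whether the three plan files already exist before waiting on a rebuild. The sketch below is an illustrative check only: `missing_plans` is a made-up helper, and the `engine` path is a placeholder for the engine folder under your own cache address.

```python
from pathlib import Path

EXPECTED_PLANS = ("clip.plan", "unet.plan", "vae.plan")

def missing_plans(engine_dir):
    """Return the expected plan files that are not present in engine_dir."""
    engine_dir = Path(engine_dir)
    return [name for name in EXPECTED_PLANS if not (engine_dir / name).is_file()]

# Placeholder location; substitute the engine folder under your cache address.
engine_dir = Path("engine")
missing = missing_plans(engine_dir)
if missing:
    print("plan files not found:", ", ".join(missing))
else:
    print("all plan files are present")
```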

4.  Use the compiled model for inference.

# Generate an image.
prompt = "A beautiful ship is floating in the clouds, unreal engine, cozy indoor lighting, artstation, detailed, digital painting, cinematic"
neg_prompt = "ugly"

import time
start_time = time.time()
image = pipe_trt(prompt, negative_prompt=neg_prompt).images[0]
end_time = time.time()
print("time: "+str(round(end_time-start_time, 2))+"s")

Generating a single image takes 2.31 seconds.
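A single timed call can be skewed by one-time costs such as engine loading and CUDA initialization. A more stable number comes from a warm-up call followed by several timed repetitions, as in the sketch below. The `benchmark` helper is illustrative, and the stand-in workload is a placeholder; in the notebook it would be replaced by a call such as `lambda: pipe_trt(prompt, negative_prompt=neg_prompt)`.

```python
import statistics
import time

def benchmark(fn, warmup=1, repeats=5):
    """Call fn warmup times untimed, then repeats times timed.

    Returns (mean_seconds, stdev_seconds) over the timed calls.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    mean_s = statistics.mean(timings)
    stdev_s = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return mean_s, stdev_s

# Stand-in workload so the sketch runs anywhere; replace it with the pipeline call.
mean_s, stdev_s = benchmark(lambda: sum(range(100_000)))
print(f"mean: {mean_s:.4f}s, stdev: {stdev_s:.4f}s")
```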

Performance Test

The performance test is based on the lambda-diffusers project on GitHub. The test uses one prompt with a batch size of 50 and is repeated 100 times. The GPU is an NVIDIA A10, and the corresponding ECS instance type is ecs.gn7i-c8g1.2xlarge.


The experimental results indicate that enabling xformers and TensorRT optimization reduces the average image generation time in Stable Diffusion by 44.7% and decreases GPU memory usage by 37.6%.
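A percentage reduction in time can also be read as a speedup factor. The sketch below does that conversion for the two figures reported above; the helper name is illustrative, and no measurements other than those two percentages are used.

```python
def speedup_from_reduction(reduction_pct):
    """Convert a percentage reduction in time into a speedup factor."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)

time_reduction_pct = 44.7     # reported reduction in average generation time
memory_reduction_pct = 37.6   # reported reduction in GPU memory usage

speedup = speedup_from_reduction(time_reduction_pct)
memory_remaining_pct = 100.0 - memory_reduction_pct

print(f"speedup: {speedup:.2f}x")
print(f"GPU memory relative to baseline: {memory_remaining_pct:.1f}%")
```

In other words, a 44.7% shorter generation time corresponds to roughly 1.8 times the throughput of the unoptimized baseline.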

