Accelerating Large Language Model Inference: High-performance TensorRT-LLM Inference Practices

This article introduces how TensorRT-LLM improves the efficiency of large language model inference by using quantization, in-flight batching, attention, and graph rewriting.

By Zibai

1. How TensorRT-LLM Improves LLM Inference Efficiency

Large language models (LLMs) are very large deep learning models pre-trained on extensive datasets. The underlying Transformer is a set of neural networks consisting of an encoder and a decoder with self-attention capabilities. These components extract meaning from a sequence of text and capture the relationships between the words and phrases in it.

The key bottleneck of LLM inference lies in the shortage of GPU memory resources. Thus, a variety of acceleration frameworks primarily emphasize reducing peak GPU memory usage and enhancing GPU utilization.

TensorRT-LLM[1] is an LLM inference optimization framework launched by NVIDIA. It provides a set of Python APIs for defining LLMs and uses the latest optimization techniques to convert LLMs to TensorRT Engines. The optimized TensorRT Engines are directly used for inference.

TensorRT-LLM primarily relies on the following four optimization techniques to improve LLM inference efficiency.

1.1 Quantization

Model quantization is a technique that reduces the GPU memory usage during model inference by decreasing the precision of the original model.
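To make the memory saving concrete, the arithmetic below estimates how much GPU memory the weights alone occupy at different precisions. It is a back-of-the-envelope sketch that assumes a nominal 7 billion parameters; real checkpoints differ slightly, and activations, the KV cache, and runtime buffers come on top.

```python
# Rough weight-memory estimate at different weight precisions.
# Assumption: 7e9 parameters (a nominal "7B" model), weights only.
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_WEIGHT[precision] / 2**30


for precision in BYTES_PER_WEIGHT:
    print(f"{precision}: {weight_memory_gib(7e9, precision):.2f} GiB")
# FP16: 13.04 GiB
# INT8: 6.52 GiB
# INT4: 3.26 GiB
```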

TensorRT-LLM supports multiple precisions, and the supported combinations vary by model. The supported quantization precisions for some mainstream models are listed below.


W8A8 SQ uses the SmoothQuant technique[2], which quantizes both the model weights and the activations to INT8 while largely preserving inference accuracy, significantly reducing GPU memory consumption.
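The core idea of SmoothQuant is to move quantization difficulty from the activations to the weights: each input channel j of the activations is divided by a factor s_j = max|X_j|^α / max|W_j|^(1-α), and the matching weight row is multiplied by the same factor, so the product is unchanged. The sketch below illustrates this on tiny hand-written matrices in pure Python. The matrices and helper names are made up for illustration, and this is not TensorRT-LLM's implementation.

```python
# Sketch of SmoothQuant-style per-channel smoothing (illustrative values only).
# x: activations (tokens x in_channels), w: weights (in_channels x out_channels).

def matmul(a, b):
    """Plain nested-list matrix multiplication."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smooth(x, w, alpha=0.5):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    n_in = len(w)
    scales = []
    for j in range(n_in):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    x_s = [[row[j] / scales[j] for j in range(n_in)] for row in x]
    w_s = [[w[j][k] * scales[j] for k in range(len(w[0]))] for j in range(n_in)]
    return x_s, w_s, scales


x = [[0.1, 40.0], [-0.2, -60.0]]   # channel 1 carries activation outliers
w = [[0.5, -0.3], [0.02, 0.01]]
x_s, w_s, scales = smooth(x, w)

print("largest activation before:", max(abs(v) for row in x for v in row))
print("largest activation after: ", max(abs(v) for row in x_s for v in row))
print("output unchanged:", all(
    abs(p - q) < 1e-9
    for row_p, row_q in zip(matmul(x, w), matmul(x_s, w_s))
    for p, q in zip(row_p, row_q)))
```

After smoothing, the activation outliers are far smaller, so an INT8 grid covers the activations with much less rounding error, while the layer output stays mathematically the same.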

W4A16 or W8A16 means that the model weights are quantized to INT4 or INT8, while the activations remain in FP16.
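A minimal sketch of what weight-only INT8 quantization does to a single row of weights is shown below. It uses symmetric quantization with one floating-point scale per row. The sample weights are invented, and production kernels (including those in TensorRT-LLM) may use different scale granularity and fuse dequantization into the matrix multiplication, so treat this purely as an illustration of the idea.

```python
def quantize_row_int8(row):
    """Symmetric quantization: one float scale per row, integers in [-127, 127]."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate floating-point weights before the FP16 matmul."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.05, 0.9]          # illustrative values
q, scale = quantize_row_int8(weights)
restored = dequantize_row(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 weights:", q)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, which is why rows with large outliers (and therefore large scales) lose more precision.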

W4A16 AWQ and W4A16 GPTQ implement the quantization methods described in AWQ[3] and GPTQ[4], respectively. The model weights are quantized to INT4, and the activations remain in FP16.

1.2 In-flight Batching

Traditional batching is static: a batch finishes only after every sequence in it has completed inference. The following figure shows static batching with a maximum of 8 output tokens and a batch size of 4. The S3 sequence completes inference at T5, but the batch cannot be released until the S2 sequence completes at T8, which wastes a significant amount of resources.


In-flight batching, also known as continuous batching or iteration-level batching, can improve inference throughput and reduce inference latency. The following figure shows the process: as soon as the S3 sequence completes, a new sequence S5 is inserted in its place, which improves resource utilization. For more information, see Orca: A Distributed Serving System for Transformer-Based Generative Models[5].
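The difference between the two scheduling strategies can be illustrated with a small simulation. The sketch below counts decoding steps for a queue of requests under both policies. It assumes one generated token per sequence per step and ignores the prefill phase, and the output lengths are hypothetical, so it models the scheduling idea only, not TensorRT-LLM's actual scheduler.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def inflight_batching_steps(lengths, batch_size):
    """A finished sequence frees its slot for the next queued request."""
    queue = deque(lengths)
    slots = []                      # remaining tokens of each running sequence
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())
        steps += 1                  # one decoding iteration for all slots
        slots = [r - 1 for r in slots if r > 1]
    return steps


# Hypothetical output lengths for eight requests, batch size 4.
output_lengths = [3, 8, 5, 6, 7, 4, 8, 2]
print("static:   ", static_batching_steps(output_lengths, 4))    # 16
print("in-flight:", inflight_batching_steps(output_lengths, 4))  # 14
```

With static batching, short sequences idle until the longest one in their batch finishes; with in-flight batching, the freed slots start the next requests immediately, so the same work completes in fewer iterations.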


1.3 Attention

The attention mechanism extracts the most important information from a sequence and plays a vital role in tasks such as emotion recognition, translation, and question answering. In order of development, attention mechanisms include MHA (Multi-head Attention), MQA (Multi-query Attention)[6], and GQA (Grouped-query Attention)[7]. Both MQA and GQA are variants of MHA.


MHA is the standard multi-head mechanism. Each query head has its own key/value (KV) head, which requires a large amount of GPU memory. In MQA, all query heads share a single KV head, so some detail can be lost during inference. GQA divides the query heads into groups, and each group shares one KV head, which balances the memory cost of MHA against the quality loss of MQA.
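The memory effect of sharing KV heads can be estimated directly. The sketch below computes the KV cache size per token for the three variants, using the layer count, head count, and hidden size reported in the benchmark log in section 2.4 (32 layers, 32 heads, hidden size 4096) and FP16 values. The GQA group count of 8 is a hypothetical choice for illustration.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (factor 2) are cached for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


NUM_LAYERS, NUM_HEADS, HIDDEN_SIZE = 32, 32, 4096   # from the benchmark log
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS                  # 128

variants = [
    ("MHA", NUM_HEADS),             # one KV head per query head
    ("GQA", 8),                     # hypothetical: 8 groups of 4 query heads
    ("MQA", 1),                     # a single shared KV head
]
for name, kv_heads in variants:
    kib = kv_cache_bytes_per_token(NUM_LAYERS, kv_heads, HEAD_DIM) / 1024
    print(f"{name}: {kib:.0f} KiB per token")
# MHA: 512 KiB per token
# GQA: 128 KiB per token
# MQA: 16 KiB per token
```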

TensorRT-LLM supports MHA, MQA, and GQA. For the implementation, see tensorrt_llm.functional.gpt_attention.

1.4 Graph Rewriting

TensorRT-LLM optimizes neural networks to improve execution efficiency when compiling LLMs to TensorRT Engines.
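One common kind of graph rewriting is operator fusion, where a sequence of small operations is replaced by a single fused kernel to cut memory traffic and launch overhead. The toy pass below shows the pattern-matching idea on a flat list of operation names. The operation names and the fused pattern are hypothetical and are not taken from TensorRT-LLM's own rewriting passes.

```python
def fuse_ops(ops, pattern, fused_name):
    """Replace every occurrence of `pattern` (a tuple of op names) with `fused_name`."""
    result, i, n = [], 0, len(pattern)
    while i < len(ops):
        if tuple(ops[i:i + n]) == pattern:
            result.append(fused_name)   # one fused op replaces the whole match
            i += n
        else:
            result.append(ops[i])
            i += 1
    return result


# Hypothetical linearized graph of a feed-forward block.
graph = ["matmul", "add_bias", "gelu", "matmul", "add_bias", "layernorm"]
fused = fuse_ops(graph, ("matmul", "add_bias", "gelu"), "fused_matmul_bias_gelu")
print(fused)
# ['fused_matmul_bias_gelu', 'matmul', 'add_bias', 'layernorm']
```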

2. Practice Based on Alibaba Cloud ACK

2.1 Cloud-native AI Suite

The cloud-native AI suite is a solution provided by Alibaba Cloud Container Service for Kubernetes (ACK) that integrates cloud-native AI technologies and products. It helps enterprises efficiently implement cloud-native AI systems.

This article describes how to use TensorRT-LLM to optimize LLM inference on the ACK cloud-native AI suite.

2.2 Environment Configuration

  1. Refer to the documentation to install the cloud-native AI suite[8].
  2. Log on to the ACK console[9]. In the left-side navigation pane, choose Clusters > Applications > Cloud-native AI Suite. After the AI Developer Console is ready, click AI Developer Console.
  3. In the left-side navigation pane of the AI Developer Console, click Notebook. In the upper-right corner of the Notebook page, click Create Notebook to create a new notebook environment. The notebook requires the following resources: a 12-core CPU, 40 GB memory, and 24 GB GPU memory. (The required node specification is ecs.gn7i-c16g1.4xlarge[10].)


2.3 Preparing the TensorRT-LLM Environment

1) Build the image required for the notebook.

FROM docker.io/nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
    libgl1 libglib2.0-0 wget git curl vim \
    python3.10 python3-pip python3-dev build-essential \
    openmpi-bin libopenmpi-dev jupyter-notebook jupyter

RUN pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
# Quote the version specifier so the shell does not treat ">" as a redirect.
RUN pip3 install --upgrade jinja2==3.0.3 "pynvml>=11.5.0"

RUN rm -rf /var/cache/apt/ && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
    rm -rf /root/.cache/pip/ && rm -rf /*.whl

RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git --branch v0.7.1

ENTRYPOINT ["sh","-c","jupyter notebook --allow-root --notebook-dir=/root --port=8888 --ip= --ServerApp.token=''"]

2) Download the model. In this article, Baichuan2-7B-Chat is used as an example.

a) Confirm that tensorrt_llm is installed.

! python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# 0.7.1

b) Install baichuan dependencies.

# "! cd" runs in a throwaway subshell; %cd changes the notebook's working directory.
%cd /root/TensorRT-LLM/examples/baichuan
! pip3 install -r requirements.txt

c) Download the Baichuan2-7B-Chat model.

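# The image is based on Ubuntu, so install git-lfs with apt-get.
! apt-get update && apt-get install -y git-lfs
! GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/baichuan-inc/Baichuan2-7B-Chat.git
# Chain the commands so that git lfs pull runs inside the cloned repository.
! cd Baichuan2-7B-Chat/ && git lfs pull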

d) Compile the model to TensorRT Engines and set the model weight to the quantization precision of INT8. Model conversion takes about 5 minutes.

%cd /root/TensorRT-LLM/examples/baichuan
# Build the Baichuan V2 7B model using a single GPU and apply INT8 weight-only quantization.
! python3 build.py --model_version v2_7b \
                --model_dir ./Baichuan2-7B-Chat \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --use_weight_only \
                --output_dir ./tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu/

e) Use the TensorRT Engines that you created for inference.

# With INT8 weight-only quantization inference
! python3 ../run.py --input_text "What is the second-highest mountain in the world?" \
                 --max_output_len=50 \
                 --tokenizer_dir=./Baichuan2-7B-Chat \
                 --engine_dir=./tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu/

Expected output:

Input [Text 0]: "What is the second-highest mountain in the world?"
Output [Text 0 Beam 0]: "The second-highest mountain in the world is Chogori Peak (K2) of the Karakoram Mountains, with an altitude of 8,611 meters. "

2.4 Performance Testing

1) Use the built-in benchmark of TensorRT-LLM.

Add the baichuan2_7b_chat configuration to _allowed_configs dict. For code, see Reference[11].

Note: The 0.7.1 benchmark does not support the baichuan2 model. Therefore, you need to modify the allowed_configs configuration.

%cd /root/TensorRT-LLM/benchmarks/python
# Open allowed_configs.py in an editor (for example, vim in a terminal) and add the entry:
#   "baichuan2_7b_chat":

Run the benchmark:

! python3 benchmark.py \
    -m baichuan2_7b_chat \
    --mode plugin \
    --engine_dir /root/TensorRT-LLM/examples/baichuan/tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu \
    --batch_size 1 \
    --input_output_len "32,50;128,50"
# batch_size refers to the concurrency.
# input_output_len refers to the length of the input and output, and multiple test cases are separated with semicolons.

Expected outputs:

[BENCHMARK] model_name baichuan2_7b_chat world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 125696 precision float16 batch_size 1 input_length 32 output_length 50 gpu_peak_mem(gb) 8.682 build_time(s) 0 tokens_per_sec 60.95 percentile95(ms) 821.977 percentile99(ms) 822.093 latency(ms) 820.348 compute_cap sm86 generation_time(ms) 798.45 total_generated_tokens 49.0 generation_tokens_per_second 61.369
[BENCHMARK] model_name baichuan2_7b_chat world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 125696 precision float16 batch_size 1 input_length 128 output_length 50 gpu_peak_mem(gb) 8.721 build_time(s) 0 tokens_per_sec 59.53 percentile95(ms) 841.708 percentile99(ms) 842.755 latency(ms) 839.852 compute_cap sm86 generation_time(ms) 806.571 total_generated_tokens 49.0 generation_tokens_per_second 60.751
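Because each result is a single line of alternating keys and values, it is easy to turn into a dictionary for further analysis. The helper below is a small sketch under that assumption about the log format (inferred from the output above). The function name is made up, and the sample line is an abbreviated copy of the first result.

```python
def parse_benchmark_line(line):
    """Turn '[BENCHMARK] key value key value ...' into a dict, converting numbers."""
    tokens = line.split()
    if not tokens or tokens[0] != "[BENCHMARK]":
        raise ValueError("not a benchmark line")
    fields = tokens[1:]
    result = {}
    for key, value in zip(fields[0::2], fields[1::2]):
        try:
            result[key] = float(value) if "." in value else int(value)
        except ValueError:
            result[key] = value     # non-numeric fields such as model_name
    return result


# Abbreviated copy of the first result line above.
sample = ("[BENCHMARK] model_name baichuan2_7b_chat precision float16 batch_size 1 "
          "input_length 32 output_length 50 gpu_peak_mem(gb) 8.682 latency(ms) 820.348")
row = parse_benchmark_line(sample)
print(row["gpu_peak_mem(gb)"], row["latency(ms)"])
print("ms per output token:", row["latency(ms)"] / row["output_length"])
```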

2) Compare the performance of the INT8 quantization model with that of the original model.

Inference code for the original model:

def normal_inference():
    # model_path and prompt are expected to be defined earlier in the notebook.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.generation.utils import GenerationConfig
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
    model.generation_config = GenerationConfig.from_pretrained(model_path)
    messages = [{"role": "user", "content": prompt}]
    response = model.chat(tokenizer, messages)
    return response

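Inference code for the INT8 quantization model:

def tensorrt_llm_inference():
    # prompt is expected to be defined earlier in the notebook.
    from subprocess import Popen, PIPE
    script = f'''python3 /root/TensorRT-LLM/examples/run.py --input_text \"{prompt}\"  \
                 --max_output_len=50 \
                 --tokenizer_dir=/root/TensorRT-LLM/examples/baichuan/Baichuan2-7B-Chat \
                 --engine_dir=/root/TensorRT-LLM/examples/baichuan/tmp/baichuan_v2_7b/trt_engines/int8_weight_only/1-gpu/'''
    # Capture stderr as well so that the error message is available on failure.
    p = Popen(['sh', '-c', script], stdout=PIPE, stderr=PIPE)
    output, err = p.communicate()
    if p.returncode != 0:
        print(f"tensorrt_llm_inference() error:{err}")
        return None
    return output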


Compared with the default Baichuan2-7B-Chat model, the INT8 quantization model used in the TensorRT-LLM acceleration solution reduces peak GPU memory usage by 43.8% and latency by 61.1%.
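A comparison of this kind can be scripted with a small timing harness. The sketch below is one possible way to do it and is not the script used to produce the figures above: it measures the median wall-clock latency of a callable and computes the relative reduction between a baseline and an optimized measurement. Both helper names are hypothetical.

```python
import statistics
import time


def measure_latency_ms(fn, runs=5):
    """Call fn() `runs` times and return the median wall-clock latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def percent_reduction(baseline, optimized):
    """Relative reduction of `optimized` versus `baseline`, in percent."""
    return (baseline - optimized) / baseline * 100
```

For example, percent_reduction(measure_latency_ms(normal_inference), measure_latency_ms(tensorrt_llm_inference)) would report the latency reduction for the two functions defined above. Note that both functions as written also include model or engine loading time in every call, so a careful measurement would load once and time only the generation step.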

3. References

[1] Reference
[2] SmoothQuant technique
[3] AWQ
[4] GPTQ
[5] Orca: A Distributed Serving System for Transformer-Based Generative Models
[6] MQA (Multi-query Attention)
[7] GQA (Grouped-query Attention)
[8] Install the cloud-native AI suite
[9] ACK console
[10] ecs.gn7i-c16g1.4xlarge
[11] TensorRT-LLM
[12] Reference
