Platform For AI: BladeLLM advanced operations

Last Updated: Jun 18, 2025

BladeLLM supports various advanced operations. This topic describes how to configure these operations.

Multi-GPU inference deployment

BladeLLM supports using multiple GPUs for model inference and text generation to improve inference speed and support a wider range of scenarios. The deployment process depends on whether the model is quantized:

  • For non-quantized models, you can directly start the service by using blade_llm_server.

  • For quantized models with multi-GPU deployment, you need to first use blade_llm_split to split the quantized model, and then start the service by using blade_llm_server.

This section demonstrates how to use multi-GPU inference by deploying a Llama model on two GPUs.

Model splitting

Use the blade_llm_split command-line tool to split the model. Currently, various Hugging Face model architectures are supported, including Bloom, Llama, ChatGLM, and OPT.

The procedure for creating a model splitting task is basically the same as that for creating a quantization task: you only need to set Command to the model splitting command shown below. For more information, see Create a quantization task. After the task runs successfully, the split model is saved to the specified output path by rank.

blade_llm_split \
  --tensor_parallel_size 2 \
  --model ./local/llama_model_dir \
  --output_dir ./llama_split_2_2

The following list describes the main parameters. You can view the complete help information by using the -h parameter.

  • tensor_parallel_size (int): The number of GPUs in each parallel group after the model is split with tensor parallelism.

  • model (str): The directory of the original floating-point model. The directory must match the OSS path mounted in the model configuration.

  • output_dir (str): The path to which the split model is saved. The path must match the OSS path mounted in the model configuration.

Model deployment

Use blade_llm_server to directly load the split model path for inference. For more information, see Get started with BladeLLM.

blade_llm_server \
  --port 8081 \
  --tensor_parallel_size 2 \
  --model ./llama_split_2_2

Speculative decoding

Speculative decoding (also known as speculative sampling) uses a smaller draft model to propose candidate tokens for the original model, which improves overall throughput and generation speed while maintaining model accuracy.

Using speculative decoding

To deploy a service by using the BladeLLM engine, configure the following key parameters in the Advanced Settings section. For information about other parameters, see Scenario-based model deployment.

  • Speculative Decoding: Turn on Speculative Decoding.

  • Draft Model: Select a public model or a custom model. Only smaller models that share the vocabulary of the original model are supported.

  • Speculative Step Size: The length of the token sequence that the draft model generates in each speculation step. Default value: 4. The generated sequences are verified and filtered by the target model. Do not predict too many tokens at a time; a value of 5 or less is recommended.

  • Maximum GPU Memory Usage: Because speculative sampling calls the draft model and the original model in sequence, appropriately reduce the GPU memory ratio that this parameter reserves for kv_cache.

Command Preview

You can click Switch to Free Edit Mode in the upper-right corner of the Advanced Settings section to modify the run command. Sample code:

blade_llm_server --model Qwen2-72B --attn_cls ragged_flash --draft.model Qwen2-1.5B --draft.attn_cls ragged_flash --decode_algo sps  --gamma 2

Parameters:

  • --decode_algo sps: indicates that the speculative sampling feature is enabled.

  • --draft.model: the draft model.

  • --gamma: the speculative step size.

  • --max_gpu_memory_utilization: the maximum GPU memory usage.

Precautions

  • Because speculative decoding (speculative sampling) calls the draft model and the original model in sequence, appropriately reduce the GPU memory ratio that --max_gpu_memory_utilization reserves for kv_cache (see the example after this list).

  • Do not predict too many tokens at a time. A value of 5 or less is recommended for --gamma (the speculative step size).
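
For example, the command preview above can be extended with --max_gpu_memory_utilization to lower the kv_cache memory ratio. The following is only an illustrative sketch: the model names are the same placeholders as in the command preview, and the 0.85 ratio is an assumed value that you should tune for your GPUs.

# The 0.85 memory ratio below is an assumed, illustrative value, not a recommended default.
blade_llm_server --model Qwen2-72B --attn_cls ragged_flash \
  --draft.model Qwen2-1.5B --draft.attn_cls ragged_flash \
  --decode_algo sps --gamma 2 \
  --max_gpu_memory_utilization 0.85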

LoRA service deployment and calling

Low-Rank Adaptation (LoRA) is a large language model (LLM) fine-tuning technique that achieves efficient and fast model adaptation by introducing low-rank matrices. The principle of LoRA is to add a bypass to a pre-trained weight matrix W ∈ R^(d×k) that consists of a dimensionality reduction matrix A and a dimensionality increase matrix B, where:

A ∈ R^(r×k), B ∈ R^(d×r), r ≪ min(d, k)

r can be viewed as the rank. During training, the pre-trained parameters are fixed, and only A and B are trained. During inference, the parameters of the two paths are superimposed (that is, W + BA is used).

The method for using LoRA models in BladeLLM is as follows:

During deployment, specify the main model by using --model. During requests, specify the LoRA model by using the model field. The service loads the LoRA model according to the following priority: it first tries to obtain the LoRA model from the GPU; if the model is not in the GPU, it obtains the model from the CPU; if the model is not in the CPU, it loads the model from the disk. When the number of LoRA modules in the GPU reaches max_loras, or the number of LoRA modules in the CPU reaches max_cpu_loras, the system unloads the least recently used LoRA model and then loads the new one.

The following modules and models are supported for deploying LoRA services:

  • Module: ColParaLinearWithLoRA, RowParaLinearWithLoRA, QKVProjectionWithLoRA, and ColParaLinearWithSliceLoRA.

  • Model: Qwen, Qwen1.5, Qwen2, and Qwen2.5.

Deploying a service with a LoRA adapter

When you deploy a service by using the BladeLLM engine, configure LoRA-related parameters in Command of the Advanced Settings section. For information about other parameters, see Scenario-based model deployment. For example, you can run the following sample command to enable and load LoRA parameters:

# Load the LoRA path by using --model. Enable the LoRA feature by using --enable_lora.
blade_llm_server --model ~/workspace/test_models/Qwen--Qwen2-7B/ --enable_lora

Other optional LoRA parameters are as follows (a combined example follows this list):

  • max_lora_rank: the maximum LoRA rank. Default value: 16.

  • max_loras: the maximum number of LoRA modules that can be stored on the GPU.

  • max_cpu_loras: the maximum number of LoRA modules that can be stored on the CPU. This value is usually greater than the value of max_loras.

  • lora_dtype: the data type used in LoRA. Valid values: float16, bfloat16, and float32.
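
For example, a deployment command that sets these optional parameters might look like the following sketch. The model path is the one from the example above, the flag forms assume the same -- prefix as the other blade_llm_server parameters, and the specific values (rank 16, 4 GPU adapters, 8 CPU adapters, bfloat16) are illustrative assumptions that you should adjust to your workload.

# Illustrative values only; tune max_loras and max_cpu_loras to the number of adapters you serve.
blade_llm_server --model ~/workspace/test_models/Qwen--Qwen2-7B/ --enable_lora \
  --max_lora_rank 16 --max_loras 4 --max_cpu_loras 8 --lora_dtype bfloat16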

Calling a LoRA adapter service

When the client calls the service, it specifies the LoRA path in the model field. Sample code:

# Call local service; replace <EAS_Service_URL> with the service endpoint. 
curl -X POST \
    -H "Content-Type: application/json" \ 
    -d '{
        "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Hello!"
        }
        ],
        "model": "~/workspace/test_models/Ko-QWEN-7B-Chat-LoRA/"
    }' \
    <EAS_Service_URL>/v1/chat/completions

Structured output (JSON mode)

JSON is a widely used data exchange format. If you want the model to strictly generate outputs in JSON format while avoiding designing overly complex prompts, you can use the JSON mode feature provided by BladeLLM.

JSON mode is a structured output method based on guided decoding (also known as constrained decoding). It constrains the decoded output according to the JSON schema that you provide, so that outputs are generated strictly in JSON format. In addition to JSON schemas, regular expressions are also supported to standardize outputs.

Based on the JSON schema or regular expression that you provide, BladeLLM constructs a finite-state machine (FSM). The FSM is a mapping table in the form of Map<token, List<token>>, which indicates, for any token, the List<token> candidate set that conforms to the constraints. Based on this candidate set, BladeLLM filters out tokens that do not meet the requirements from the output logits, thereby constraining the generation of the next token.

Guided decoding supports the following formats:

  • JSON schema

  • Regex expression

Models that use the default Hugging Face tokenizer (such as the Llama series, Qwen1.5, and Qwen2) are supported. Models that rewrite the tokenizer to a bytes type (such as Qwen1 and GLM) are not supported.

How to use the JSON mode

No changes are needed for server-side model deployment. When the client calls a service, it can specify guided_json or guided_regex based on requirements.

Sample code:

#!/usr/bin/env python

import json
import requests

TEST_REGEX = r"((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.){3}" + r"(25[0-5]|(2[0-4]|1\d|[1-9]|)\d)"

TEST_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "skills": {"type": "array", "items": {"type": "string", "maxLength": 10}, "minItems": 3},
        "work history": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "company": {"type": "string"},
                    "duration": {"type": "string"},
                    "position": {"type": "string"},
                },
                "required": ["company", "position"],
            },
        },
    },
    "required": ["name", "age", "skills", "work history"],
}

url = "http://localhost:8081/v1/completions"
# prompt = f"Give an example IPv4 address with this regex: {TEST_REGEX}\n"
prompt = f"Give an example JSON for an employee profile that fits this schema: {TEST_SCHEMA}\n"

req = {
    "prompt": prompt,
    "stream": True,
    "temperature": 0.0,
    "top_p": 0.5,
    "top_k": 10,
    "max_tokens": 200,
    # "guided_regex": TEST_REGEX,
    "guided_json": json.dumps(TEST_SCHEMA),
}
response = requests.post(
    url,
    json=req,
    headers={"Content-Type": "application/json"},
    stream=True,
)
for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
    msg = chunk.decode("utf-8")
    if msg.startswith('data'):
        info = msg[6:]
        if info == '[DONE]':
            break
        else:
            resp = json.loads(info)
            print(resp['choices'][0]['text'], end='', flush=True)
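
In addition to the Python client above, guided_regex (or guided_json) can also be passed directly in an HTTP request. The following curl sketch assumes the same local /v1/completions endpoint on port 8081 as the Python sample and reuses its IPv4 regular expression; adjust the endpoint and request parameters to your own service.

# A minimal sketch; backslashes in the regular expression are escaped for JSON.
curl -X POST http://localhost:8081/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Give an example IPv4 address.\n",
        "max_tokens": 50,
        "guided_regex": "((25[0-5]|(2[0-4]|1\\d|[1-9]|)\\d)\\.){3}(25[0-5]|(2[0-4]|1\\d|[1-9]|)\\d)"
    }'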

Precautions

  • Only one guided_json or guided_regex can be specified in a single request. Different concurrent requests can each use their own guided_json or guided_regex.

  • When the first JSON mode request is sent to the server, constructing the FSM takes a relatively long time; the duration depends on the complexity of guided_json or guided_regex. The FSM is then cached, so the second and subsequent requests with the same guided_json or guided_regex do not need to reconstruct it, which significantly reduces the time.

Using CUDA Graph

The Compute Unified Device Architecture (CUDA) Graph feature reduces kernel launch overhead on both the host and the device by pre-creating or capturing a graph, so that many kernel launches in the graph are replaced by a single graph launch. BladeLLM runs the decoding phase of the model through CUDA Graph, which improves decoding speed.

Configuring CUDA Graph

When you deploy a service by using the BladeLLM engine, add the following key configurations to Command in the Advanced Settings section. For information about other parameters, see Scenario-based model deployment.

  • disable_cuda_graph (bool): Disables the CUDA Graph feature. Default value: False. At runtime, the first capture writes a log message such as caching graph of graph_batch_size=xx, which indicates that the CUDA Graph feature is enabled.

  • cuda_graph_max_batch_size (int): The maximum batch size that can be captured. CUDA Graph is used only when the batch size is less than 64.

    Note: If cuda_graph_max_batch_size conflicts with cuda_graph_batch_sizes, the system ignores cuda_graph_max_batch_size.

  • cuda_graph_batch_sizes (List[int]): Specifies the capture range and padding method as an array of integers. For example, with --cuda_graph_batch_sizes 8 16 64, a batch size that does not exceed 64 is padded to the closest value in the list that is not less than the batch size. If the batch size exceeds the maximum value in the list, CUDA Graph is not used.

Precautions

Saving CUDA graphs consumes GPU memory. Therefore, you need to reduce the GPU memory ratio specified by --max_gpu_memory_utilization for kv_cache.
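
For example, a deployment command that pins the captured batch sizes and lowers the kv_cache memory ratio might look like the following sketch. The model path reuses the Llama directory from the multi-GPU example, and the 0.85 ratio is an assumed, illustrative value.

# Illustrative only: the batch sizes follow the example in the parameter description above; 0.85 is an assumed memory ratio.
blade_llm_server --model ./local/llama_model_dir \
  --cuda_graph_batch_sizes 8 16 64 \
  --max_gpu_memory_utilization 0.85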

Deployment environment check

BladeLLM provides the blade_llm_check tool to check whether your environment meets the deployment requirements, so that you can identify potential issues in advance.

  1. Perform the check.

    When you deploy a service by using the BladeLLM engine, configure the following run command in the Advanced Settings section. For information about other parameters, see Scenario-based model deployment.

    blade_llm_check
  2. View the output results.

    Click the name of the desired service. Then, switch to the Logs tab to view the output results. Sample output:

           check_shm       PASS  -  shared Memory is 198269968384 bytes
        check_pb_env       PASS  -  Env var PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=None is OK
    check_flops_diff       PASS  -  max flops diff amoung 2 devices is 1.00
    
    PASS 3, WARNING 0, FAILED 0
    

    Take note of the following items:

    • Each line of the output represents a check item and includes the check item name, the check result (PASS, WARNING, or FAILED), and detailed information.

    • The last line of output summarizes the results of all check items.

    You can also use the program's exit code to confirm whether all checks have passed. When all checks pass, blade_llm_check returns 0; otherwise, it returns 1. Example:

    time="2025-05-22T09:58:02Z" level=info msg="program stopped with status:exit status 0" program=/bin/sh

After the check passes, you can stop the check service and redeploy the service by using the BladeLLM engine. For more information, see Get started with BladeLLM.