
Platform For AI: Service deployment parameter configuration

Last Updated: May 28, 2025

When deploying a service with the BladeLLM engine, you start the service by running the blade_llm_server command. This topic describes the configuration parameters supported by blade_llm_server.

Usage example

The following command loads a HuggingFace-format Qwen3-4B model from the specified directory and, by default, listens for requests on port 8081. For more information about how to deploy a service, see BladeLLM Quick Start.

blade_llm_server --model /mnt/model/Qwen3-4B/
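
You can combine multiple parameters in the same command. The following sketch deploys the model across two GPUs with tensor parallelism, raises the GPU memory utilization limit, and switches the precision to bfloat16. The model path and parameter values are illustrative assumptions, not recommended settings:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --port 8081 \
    --tensor_parallel_size 2 \
    --max_gpu_memory_utilization 0.9 \
    --dtype bfloat16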

Parameter description

The parameters supported by blade_llm_server are as follows:

usage: blade_llm_server [-h] [--tensor_parallel_size int] [--pipeline_parallel_size int] [--pipeline_micro_batch int] [--attention_dp_size int] [--host str] [--port int]
                        [--worker_socket_path str] [--log_level {DEBUG,INFO,WARNING,ERROR}] [--max_gpu_memory_utilization float]
                        [--preempt_strategy {AUTO,RECOMPUTE,SWAP}] [--ragged_flash_max_batch_tokens int] [--decode_algo {sps,look_ahead,normal}] [--gamma int]
                        [--disable_prompt_cache] [--prompt_cache_enable_swap] [--decoding_parallelism int] [--metric_exporters None [None ...]] [--max_queue_time int]
                        [--enable_custom_allreduce] [--enable_json_warmup] [--enable_llumnix] [--llumnix_config str] [--model [str]] [--tokenizer_dir [str]]
                        [--chat_template [str]] [--dtype {half,float16,bfloat16,float,float32}] 
                        [--kv_cache_quant {no_quant,int8,int4,int8_affine,int4_affine,fp8_e5m2,fp8_e4m3,mix_f852i4,mix_f843i4,mix_i8i4,mix_i4i4}]
                        [--kv_cache_quant_sub_heads int] [--tokenizer_special_tokens List] [--enable_triton_mla] [--disable_cuda_graph] [--cuda_graph_max_batch_size int]
                        [--cuda_graph_batch_sizes [List]] [--with_visual bool] [--use_sps bool] [--use_lookahead bool] [--look_ahead_window_size int]
                        [--look_ahead_gram_size int] [--guess_set_size int] [--draft.model [str]] [--draft.tokenizer_dir [str]] [--draft.chat_template [str]]
                        [--draft.dtype {half,float16,bfloat16,float,float32}] 
                        [--draft.kv_cache_quant {no_quant,int8,int4,int8_affine,int4_affine,fp8_e5m2,fp8_e4m3,mix_f852i4,mix_f843i4,mix_i8i4,mix_i4i4}]
                        [--draft.kv_cache_quant_sub_heads int] [--draft.tokenizer_special_tokens List] [--draft.enable_triton_mla] [--draft.disable_cuda_graph]
                        [--draft.cuda_graph_max_batch_size int] [--draft.cuda_graph_batch_sizes [List]] [--draft.with_visual bool] [--draft.use_sps bool]
                        [--draft.use_lookahead bool] [--draft.look_ahead_window_size int] [--draft.look_ahead_gram_size int] [--draft.guess_set_size int]
                        [--temperature [float]] [--top_p [float]] [--top_k [int]] [--cat_prompt [bool]] [--repetition_penalty [float]] [--presence_penalty [float]]
                        [--max_new_tokens [int]] [--stop_sequences [List]] [--stop_tokens [List]] [--ignore_eos [bool]] [--skip_special_tokens [bool]]
                        [--enable_disagg_metric bool] [--enable_export_kv_lens_metric bool] [--enable_hybrid_dp bool] [--enable_quant bool] [--asymmetry bool]
                        [--block_wise_quant bool] [--enable_cute bool] [--no_scale bool] [--quant_lm_head bool] [--rotate bool] [--random_rotate_matrix bool]
                        [--skip_ffn_fc2 bool]
                        ...

Detailed descriptions of the blade_llm_server parameters are as follows:

| Parameter | Value type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| --tensor_parallel_size (-tp) | int | No | 1 | Tensor parallel size. |
| --pipeline_parallel_size (-pp) | int | No | 1 | Pipeline parallel size. |
| --pipeline_micro_batch (-ppmb) | int | No | 1 | Micro batch size for pipeline parallelism. |
| --attention_dp_size (-dp) | int | No | 1 | Data parallel size. |
| --host | str | No | 0.0.0.0 | Server hostname. |
| --port | int | No | 8081 | Server port number. |
| --worker_socket_path | str | No | /tmp/blade_llm.sock | Socket path for worker processes. |
| --log_level | enumeration | No | INFO | Log level to print. Valid values: DEBUG, INFO, WARNING, ERROR. |
| --max_gpu_memory_utilization | float | No | 0.85 | Maximum GPU memory utilization for the continuous batching scheduler. |
| --preempt_strategy | enumeration | No | AUTO | Strategy for handling preempted requests when KV cache space is insufficient. Valid values: AUTO, RECOMPUTE, SWAP. |
| --ragged_flash_max_batch_tokens | int | No | 2048 | Maximum batch tokens in ragged flash memory. |
| --decode_algo | enumeration | No | normal | Efficient decoding algorithm. Valid values: sps, look_ahead, normal. |
| --gamma | int | No | 0 | Gamma step size for speculative decoding. |
| --disable_prompt_cache | None | No | False | Disable prompt prefix caching. |
| --prompt_cache_enable_swap | None | No | False | Swap the prompt cache from GPU memory to CPU memory. |
| --decoding_parallelism | int | No | min(max(get_cpu_number() // 2, 1), 2) | Decoding parallelism setting. |
| --metric_exporters | None [None ...] | No | logger | Metric export methods. Valid values: logger (print metrics to logs), eas (push metrics to EAS). |
| --max_queue_time | int | No | 3600 | Maximum waiting time (in seconds) for requests in the queue. |
| --enable_custom_allreduce | None | No | False | Use custom all-reduce instead of NCCL all-reduce. |
| --enable_json_warmup | None | No | False | Enable finite-state machine compilation for JSON Schema. |
| --enable_llumnix | None | No | False | Enable Llumnix. |
| --llumnix_config | str | No | None | Path to the Llumnix configuration file. |
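
As a sketch of how the scheduling and metric parameters above might be combined, the following hypothetical command enables pipeline parallelism across two GPUs, exports metrics to both the logs and EAS, and shortens the queue timeout. The values are illustrative, and passing two exporters as space-separated values is an assumption based on the usage string:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --pipeline_parallel_size 2 \
    --metric_exporters logger eas \
    --max_queue_time 600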

The following are model loading parameters.

| Parameter | Value type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| --model | [str] | Yes | None | Directory containing model files. |
| --tokenizer_dir | [str] | No | None | Tokenizer path. Defaults to the model directory. |
| --chat_template | [str] | No | None | Chat template configuration. |
| --dtype | enumeration | No | half | Data precision used for the model and activation parts that are not quantized. Valid values: half, float16, bfloat16, float, float32. |
| --kv_cache_quant | enumeration | No | no_quant | Enable KV cache quantization. Valid values: no_quant, int8, int4, int8_affine, int4_affine, fp8_e5m2, fp8_e4m3, mix_f852i4, mix_f843i4, mix_i8i4, mix_i4i4. |
| --kv_cache_quant_sub_heads | int | No | 1 | Number of sub heads for KV cache quantization. |
| --tokenizer_special_tokens | List | No | [] | Special tokens for the tokenizer, for example: --tokenizer_special_tokens bos_token=<s> eos_token=</s>. |
| --enable_triton_mla | None | No | False | Use the Triton MLA implementation; otherwise, Bladnn MLA is used. |
| --disable_cuda_graph | None | No | False | Disable CUDA Graph. |
| --cuda_graph_max_batch_size | int | No | 64 | Maximum batch size for CUDA Graph. |
| --cuda_graph_batch_sizes | [List] | No | None | Batch sizes to be captured by CUDA Graph. |
| --with_visual, --nowith_visual | bool | No | True | Enable support for visual models. |
| --use_sps, --nouse_sps | bool | No | False | Enable speculative decoding. |
| --use_lookahead, --nouse_lookahead | bool | No | False | Enable LookAhead decoding. |
| --look_ahead_window_size | int | No | 2 | LookAhead window size. |
| --look_ahead_gram_size | int | No | 2 | LookAhead n-gram size. |
| --guess_set_size | int | No | 3 | LookAhead guess set size. |
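
For example, LookAhead decoding can be enabled at startup with the parameters above. The sketch below sets both --decode_algo look_ahead and --use_lookahead together with the default window, gram, and guess set sizes; whether both switches are required may depend on the engine version, and the model path is an assumption:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --decode_algo look_ahead \
    --use_lookahead \
    --look_ahead_window_size 2 \
    --look_ahead_gram_size 2 \
    --guess_set_size 3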

The following are draft model loading parameters, which are only effective when speculative decoding is enabled.

| Parameter | Value type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| --draft.model | [str] | No | None | Directory containing the draft model files. |
| --draft.tokenizer_dir | [str] | No | None | Tokenizer path. Defaults to the model directory. |
| --draft.chat_template | [str] | No | None | Chat template configuration. |
| --draft.dtype | enumeration | No | half | Data precision of the draft model parts that are not quantized. Valid values: half, float16, bfloat16, float, float32. |
| --draft.kv_cache_quant | enumeration | No | no_quant | Enable KV cache quantization. Valid values: no_quant, int8, int4, int8_affine, int4_affine, fp8_e5m2, fp8_e4m3, mix_f852i4, mix_f843i4, mix_i8i4, mix_i4i4. |
| --draft.kv_cache_quant_sub_heads | int | No | 1 | Number of sub heads for KV cache quantization. |
| --draft.tokenizer_special_tokens | List | No | [] | Special tokens for the tokenizer. |
| --draft.enable_triton_mla | None | No | False | Use the Triton MLA implementation; otherwise, Bladnn MLA is used. |
| --draft.disable_cuda_graph | None | No | False | Disable CUDA Graph. |
| --draft.cuda_graph_max_batch_size | int | No | 64 | Maximum batch size for CUDA Graph. |
| --draft.cuda_graph_batch_sizes | [List] | No | None | Batch sizes to be captured by CUDA Graph. |
| --draft.with_visual, --draft.nowith_visual | bool | No | True | Enable support for visual models. |
| --draft.use_sps, --draft.nouse_sps | bool | No | False | Enable speculative decoding. |
| --draft.use_lookahead, --draft.nouse_lookahead | bool | No | False | Enable LookAhead decoding. |
| --draft.look_ahead_window_size | int | No | 2 | LookAhead window size. |
| --draft.look_ahead_gram_size | int | No | 2 | LookAhead n-gram size. |
| --draft.guess_set_size | int | No | 3 | LookAhead guess set size. |
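
For example, speculative decoding with a separate draft model could be configured as in the following sketch. The target and draft model paths and the gamma value are illustrative assumptions:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --decode_algo sps \
    --use_sps \
    --gamma 4 \
    --draft.model /mnt/model/Qwen3-0.6B/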

The following are LoRA-related parameters.

| Parameter | Value type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| --max_lora_rank | int | No | 16 | Maximum rank value for LoRA weights. |
| --max_loras | int | No | 2 | Maximum number of LoRAs. |
| --max_cpu_loras | int | No | None | Maximum CPU resource usage limit. |
| --lora_dtype | str | No | None | LoRA data type. |
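
For example, the LoRA limits can be adjusted at startup as in the following sketch; the model path, rank, and count values are illustrative, and bfloat16 as a --lora_dtype value is an assumption:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --max_lora_rank 32 \
    --max_loras 4 \
    --lora_dtype bfloat16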

The following are service sampling parameters. They correspond to the options described in service invocation parameter configuration. If a request does not specify a value for one of these parameters, the default value set at service startup is used.

| Parameter | Value type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| --temperature | [float] | No | None | Temperature used to reshape the logits distribution. |
| --top_p | [float] | No | None | Keep the most likely tokens whose cumulative probability reaches top_p. |
| --top_k | [int] | No | None | Keep the top_k tokens with the highest probability. |
| --cat_prompt | [bool] | No | None | Detokenize the output token IDs together with the prompt token IDs. |
| --repetition_penalty | [float] | No | None | Degree to which the model avoids repeating words when generating text. Higher values penalize repetition more strongly. |
| --presence_penalty | [float] | No | None | Degree of penalty applied to tokens that appear in the original input text. Higher values make the generated text more consistent with the original text, but may reduce its diversity. |
| --max_new_tokens | [int] | No | None | Maximum number of tokens to generate. |
| --stop_sequences | [List] | No | None | Stop generation on certain text, for example: --stop_sequences a b c. |
| --stop_tokens | [List] | No | None | Stop generation on certain token IDs or token sequences, for example: --stop_tokens 1 2 3. |
| --ignore_eos | [bool] | No | None | Do not stop at the EOS token during generation. |
| --skip_special_tokens | [bool] | No | None | Skip special tokens when converting token IDs to tokens during decoding. |
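
For example, default sampling behavior can be set at service startup and then overridden per request, as in the following sketch; the model path and sampling values are illustrative assumptions:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --temperature 0.7 \
    --top_p 0.8 \
    --top_k 50 \
    --max_new_tokens 1024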