When you deploy a service with the BladeLLM engine, you start the service by running the blade_llm_server command. This topic describes the configuration parameters that blade_llm_server supports.
Usage example
The following command loads a Qwen3-4B model in Hugging Face format from the specified directory; the server listens on port 8081 by default to receive requests. For more information about how to deploy a service, see BladeLLM Quick Start.
blade_llm_server --model /mnt/model/Qwen3-4B/

Parameter description
The parameters supported by blade_llm_server are as follows:
usage: blade_llm_server [-h] [--tensor_parallel_size int] [--pipeline_parallel_size int] [--pipeline_micro_batch int] [--attention_dp_size int] [--host str] [--port int]
[--worker_socket_path str] [--log_level {DEBUG,INFO,WARNING,ERROR}] [--max_gpu_memory_utilization float]
[--preempt_strategy {AUTO,RECOMPUTE,SWAP}] [--ragged_flash_max_batch_tokens int] [--decode_algo {sps,look_ahead,normal}] [--gamma int]
[--disable_prompt_cache] [--prompt_cache_enable_swap] [--decoding_parallelism int] [--metric_exporters None [None ...]] [--max_queue_time int]
[--enable_custom_allreduce] [--enable_json_warmup] [--enable_llumnix] [--llumnix_config str] [--model [str]] [--tokenizer_dir [str]]
[--chat_template [str]] [--dtype {half,float16,bfloat16,float,float32}]
[--kv_cache_quant {no_quant,int8,int4,int8_affine,int4_affine,fp8_e5m2,fp8_e4m3,mix_f852i4,mix_f843i4,mix_i8i4,mix_i4i4}]
[--kv_cache_quant_sub_heads int] [--tokenizer_special_tokens List] [--enable_triton_mla] [--disable_cuda_graph] [--cuda_graph_max_batch_size int]
[--cuda_graph_batch_sizes [List]] [--with_visual bool] [--use_sps bool] [--use_lookahead bool] [--look_ahead_window_size int]
[--look_ahead_gram_size int] [--guess_set_size int] [--draft.model [str]] [--draft.tokenizer_dir [str]] [--draft.chat_template [str]]
[--draft.dtype {half,float16,bfloat16,float,float32}]
[--draft.kv_cache_quant {no_quant,int8,int4,int8_affine,int4_affine,fp8_e5m2,fp8_e4m3,mix_f852i4,mix_f843i4,mix_i8i4,mix_i4i4}]
[--draft.kv_cache_quant_sub_heads int] [--draft.tokenizer_special_tokens List] [--draft.enable_triton_mla] [--draft.disable_cuda_graph]
[--draft.cuda_graph_max_batch_size int] [--draft.cuda_graph_batch_sizes [List]] [--draft.with_visual bool] [--draft.use_sps bool]
[--draft.use_lookahead bool] [--draft.look_ahead_window_size int] [--draft.look_ahead_gram_size int] [--draft.guess_set_size int]
[--temperature [float]] [--top_p [float]] [--top_k [int]] [--cat_prompt [bool]] [--repetition_penalty [float]] [--presence_penalty [float]]
[--max_new_tokens [int]] [--stop_sequences [List]] [--stop_tokens [List]] [--ignore_eos [bool]] [--skip_special_tokens [bool]]
[--enable_disagg_metric bool] [--enable_export_kv_lens_metric bool] [--enable_hybrid_dp bool] [--enable_quant bool] [--asymmetry bool]
[--block_wise_quant bool] [--enable_cute bool] [--no_scale bool] [--quant_lm_head bool] [--rotate bool] [--random_rotate_matrix bool]
[--skip_ffn_fc2 bool]
...

Detailed descriptions of the blade_llm_server parameters are as follows. Example launch commands are shown after each group of parameters.

Parameter | Value type | Required | Default value | Description |
--- | --- | --- | --- | --- |
--tensor_parallel_size (-tp) | int | No | 1 | Tensor parallel size. |
--pipeline_parallel_size (-pp) | int | No | 1 | Pipeline parallel size. |
--pipeline_micro_batch (-ppmb) | int | No | 1 | Micro batch size for pipeline parallelism. |
--attention_dp_size (-dp) | int | No | 1 | Data parallel size. |
--host | str | No | 0.0.0.0 | Server hostname. |
--port | int | No | 8081 | The port number of the server. |
--worker_socket_path | str | No | | Socket path for worker processes. |
--log_level | enumeration | No | INFO | Log level to print. Valid values: DEBUG, INFO, WARNING, and ERROR. |
--max_gpu_memory_utilization | float | No | 0.85 | Maximum GPU memory utilization for continuous batch scheduler. |
--preempt_strategy | enumeration | No | AUTO | Strategy for handling preempted requests when KV cache space is insufficient. Valid values: AUTO, RECOMPUTE, and SWAP. |
--ragged_flash_max_batch_tokens | int | No | 2048 | Maximum number of batch tokens for ragged flash attention. |
--decode_algo | enumeration | No | normal | Efficient decoding algorithm. Valid values: normal, sps (speculative decoding), and look_ahead (LookAhead decoding). |
--gamma | int | No | 0 | Gamma step size for speculative decoding. |
--disable_prompt_cache | None | No | False | Disable prompt prefix caching. |
--prompt_cache_enable_swap | None | No | False | Swap prompt cache from GPU memory to CPU memory. |
--decoding_parallelism | int | No | min(max(get_cpu_number() // 2, 1), 2) | Decoding parallelism setting. |
--metric_exporters | None [None ...] | No | logger | Metric export methods. |
--max_queue_time | int | No | 3600 | Maximum waiting time (in seconds) for requests in the queue. |
--enable_custom_allreduce | None | No | False | Use custom all-reduce instead of NCCL all-reduce. |
--enable_json_warmup | None | No | False | Enable finite-state machine compilation for JSON Schema. |
--enable_llumnix | None | No | False | Enable llumnix. |
--llumnix_config | str | No | None | Path to llumnix configuration file. |
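
For example, a multi-GPU deployment that uses the scheduling and parallelism parameters above might be started as follows. This is an illustrative sketch: the model path and values are placeholders, it assumes two GPUs are available, and only flags documented in this topic are used.

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --tensor_parallel_size 2 \
    --port 8081 \
    --max_gpu_memory_utilization 0.9 \
    --log_level INFO
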
The following are model loading parameters.

Parameter | Value type | Required | Default value | Description |
--- | --- | --- | --- | --- |
--model | [str] | Yes | None | Directory containing model files. |
--tokenizer_dir | [str] | No | None | Tokenizer path, defaults to model directory. |
--chat_template | [str] | No | None | Chat template configuration. |
--dtype | enumeration | No | half | Data precision used for the parts of the model and activations that are not quantized. Valid values: half, float16, bfloat16, float, and float32. |
--kv_cache_quant | enumeration | No | no_quant | KV cache quantization method. Valid values: no_quant, int8, int4, int8_affine, int4_affine, fp8_e5m2, fp8_e4m3, mix_f852i4, mix_f843i4, mix_i8i4, mix_i4i4. |
--kv_cache_quant_sub_heads | int | No | 1 | Number of sub heads for kv cache quantization. |
--tokenizer_special_tokens | List | No | [] | Specify special tokens for the tokenizer. |
--enable_triton_mla | None | No | False | Enable the Triton MLA implementation; otherwise, the Bladnn MLA implementation is used. |
--disable_cuda_graph | None | No | False | Disable CUDA Graph. |
--cuda_graph_max_batch_size | int | No | 64 | Maximum batch size for CUDA Graph. |
--cuda_graph_batch_sizes | [List] | No | None | Batch sizes to be captured by CUDA Graph. |
--with_visual, --nowith_visual | bool | No | True | Enable support for visual models. |
--use_sps, --nouse_sps | bool | No | False | Enable speculative decoding. |
--use_lookahead, --nouse_lookahead | bool | No | False | Enable LookAhead decoding. |
--look_ahead_window_size | int | No | 2 | LookAhead window size. |
--look_ahead_gram_size | int | No | 2 | LookAhead n-gram size. |
--guess_set_size | int | No | 3 | LookAhead guess set size. |
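
For example, the model loading parameters might be combined as follows to load a model in bfloat16 with INT8 KV cache quantization and a smaller CUDA Graph batch size limit. This is an illustrative sketch; the values are placeholders, and whether a given quantization mode is appropriate depends on your model and hardware.

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --dtype bfloat16 \
    --kv_cache_quant int8 \
    --cuda_graph_max_batch_size 32
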
The following are draft model loading parameters, which are only effective when speculative decoding is enabled.

Parameter | Value type | Required | Default value | Description |
--- | --- | --- | --- | --- |
--draft.model | [str] | No | None | Directory containing model files. |
--draft.tokenizer_dir | [str] | No | None | Tokenizer path, defaults to model directory. |
--draft.chat_template | [str] | No | None | Chat template configuration. |
--draft.dtype | enumeration | No | half | Data precision used for the parts of the draft model and activations that are not quantized. Valid values: half, float16, bfloat16, float, and float32. |
--draft.kv_cache_quant | enumeration | No | no_quant | KV cache quantization method for the draft model. Valid values: no_quant, int8, int4, int8_affine, int4_affine, fp8_e5m2, fp8_e4m3, mix_f852i4, mix_f843i4, mix_i8i4, mix_i4i4. |
--draft.kv_cache_quant_sub_heads | int | No | 1 | Number of sub heads for kv cache quantization. |
--draft.tokenizer_special_tokens | List | No | [] | Special tokenizer tokens. |
--draft.enable_triton_mla | None | No | False | Enable the Triton MLA implementation; otherwise, the Bladnn MLA implementation is used. |
--draft.disable_cuda_graph | None | No | False | Disable CUDA Graph. |
--draft.cuda_graph_max_batch_size | int | No | 64 | Maximum batch size for CUDA Graph. |
--draft.cuda_graph_batch_sizes | [List] | No | None | Batch sizes for CUDA Graph. |
--draft.with_visual, --draft.nowith_visual | bool | No | True | Enable support for visual models. |
--draft.use_sps, --draft.nouse_sps | bool | No | False | Enable speculative decoding. |
--draft.use_lookahead, --draft.nouse_lookahead | bool | No | False | Enable LookAhead decoding. |
--draft.look_ahead_window_size | int | No | 2 | LookAhead window size. |
--draft.look_ahead_gram_size | int | No | 2 | LookAhead n-gram size. |
--draft.guess_set_size | int | No | 3 | LookAhead guess set size. |
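
For example, speculative decoding with a separate draft model might be configured as follows. This is an illustrative sketch: the draft model path and the gamma value are placeholders, and the exact combination of flags required to enable speculative decoding (for example, --decode_algo sps together with --use_sps) may differ across BladeLLM versions, so verify it against your version's documentation.

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --use_sps \
    --decode_algo sps \
    --gamma 4 \
    --draft.model /mnt/model/Qwen3-0.6B/
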
The following are LoRA-related parameters.

Parameter | Value type | Required | Default value | Description |
--- | --- | --- | --- | --- |
--max_lora_rank | int | No | 16 | Maximum rank value for LoRA weights. |
--max_loras | int | No | 2 | Maximum number of LoRA adapters that can be used simultaneously. |
--max_cpu_loras | int | No | None | Maximum number of LoRA adapters cached in CPU memory. |
--lora_dtype | str | No | None | Specify LoRA data type. |
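
For example, the LoRA capacity limits can be raised at startup as follows. This is an illustrative sketch with placeholder values; how individual LoRA adapters are registered and selected at request time is not covered by this table.

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --max_lora_rank 32 \
    --max_loras 4
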
The following are service sampling parameters. These parameters correspond to the options in the service invocation parameter configuration. If a request does not specify a value for one of these options, the default value set at service startup is used.

Parameter | Value type | Required | Default value | Description |
--- | --- | --- | --- | --- |
--temperature | [float] | No | None | Temperature parameter used to change the logits distribution. |
--top_p | [float] | No | None | Keep the most likely tokens whose cumulative probability reaches top_p. |
--top_k | [int] | No | None | Keep the top_k tokens with the highest probability. |
--cat_prompt | [bool] | No | None | Detokenize the output token IDs together with the prompt token IDs. |
--repetition_penalty | [float] | No | None | Controls how strongly the model avoids repeating words when generating text. Higher values penalize repetition more heavily. |
--presence_penalty | [float] | No | None | Penalty applied to tokens that have already appeared in the text. Higher values discourage the model from reusing those tokens, which can increase the diversity of the generated text. |
--max_new_tokens | [int] | No | None | Limit the maximum number of tokens generated. |
--stop_sequences | [List] | No | None | Stop generation on certain text, for example: "--stop_sequences a b c". |
--stop_tokens | [List] | No | None | Stop generation on certain token IDs or token sequences, for example: "--stop_tokens 1 2 3". |
--ignore_eos | [bool] | No | None | Do not stop at eos token during generation. |
--skip_special_tokens | [bool] | No | None | Skip special tokens when converting token_id to token during decoding. |
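
For example, server-side default sampling values can be set at startup as follows; requests that omit these options then fall back to the defaults configured here. This is an illustrative sketch with placeholder values.

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --temperature 0.7 \
    --top_p 0.8 \
    --top_k 20 \
    --repetition_penalty 1.05 \
    --max_new_tokens 1024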