
Platform For AI: Service deployment parameter configuration

Last Updated: May 28, 2025

When deploying a service with the BladeLLM engine, you start the service by running the blade_llm_server command. This topic describes the configuration parameters supported by blade_llm_server.

Usage example

The following command loads a HuggingFace-format Qwen3-4B model from the specified directory and, by default, listens for requests on port 8081. For more information about how to deploy a service, see BladeLLM Quick Start.

blade_llm_server --model /mnt/model/Qwen3-4B/
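
You can combine multiple parameters in the same command. The following sketch deploys the model across two GPUs with tensor parallelism, raises the GPU memory utilization limit, and switches the precision to bfloat16. The model path and parameter values are illustrative assumptions, not recommended settings:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --port 8081 \
    --tensor_parallel_size 2 \
    --max_gpu_memory_utilization 0.9 \
    --dtype bfloat16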

Parameter description

The parameters supported by blade_llm_server are as follows:

usage: blade_llm_server [-h] [--tensor_parallel_size int] [--pipeline_parallel_size int] [--pipeline_micro_batch int] [--attention_dp_size int] [--host str] [--port int]
                        [--worker_socket_path str] [--log_level {DEBUG,INFO,WARNING,ERROR}] [--max_gpu_memory_utilization float]
                        [--preempt_strategy {AUTO,RECOMPUTE,SWAP}] [--ragged_flash_max_batch_tokens int] [--decode_algo {sps,look_ahead,normal}] [--gamma int]
                        [--disable_prompt_cache] [--prompt_cache_enable_swap] [--decoding_parallelism int] [--metric_exporters None [None ...]] [--max_queue_time int]
                        [--enable_custom_allreduce] [--enable_json_warmup] [--enable_llumnix] [--llumnix_config str] [--model [str]] [--tokenizer_dir [str]]
                        [--chat_template [str]] [--dtype {half,float16,bfloat16,float,float32}] 
                        [--kv_cache_quant {no_quant,int8,int4,int8_affine,int4_affine,fp8_e5m2,fp8_e4m3,mix_f852i4,mix_f843i4,mix_i8i4,mix_i4i4}]
                        [--kv_cache_quant_sub_heads int] [--tokenizer_special_tokens List] [--enable_triton_mla] [--disable_cuda_graph] [--cuda_graph_max_batch_size int]
                        [--cuda_graph_batch_sizes [List]] [--with_visual bool] [--use_sps bool] [--use_lookahead bool] [--look_ahead_window_size int]
                        [--look_ahead_gram_size int] [--guess_set_size int] [--draft.model [str]] [--draft.tokenizer_dir [str]] [--draft.chat_template [str]]
                        [--draft.dtype {half,float16,bfloat16,float,float32}] 
                        [--draft.kv_cache_quant {no_quant,int8,int4,int8_affine,int4_affine,fp8_e5m2,fp8_e4m3,mix_f852i4,mix_f843i4,mix_i8i4,mix_i4i4}]
                        [--draft.kv_cache_quant_sub_heads int] [--draft.tokenizer_special_tokens List] [--draft.enable_triton_mla] [--draft.disable_cuda_graph]
                        [--draft.cuda_graph_max_batch_size int] [--draft.cuda_graph_batch_sizes [List]] [--draft.with_visual bool] [--draft.use_sps bool]
                        [--draft.use_lookahead bool] [--draft.look_ahead_window_size int] [--draft.look_ahead_gram_size int] [--draft.guess_set_size int]
                        [--temperature [float]] [--top_p [float]] [--top_k [int]] [--cat_prompt [bool]] [--repetition_penalty [float]] [--presence_penalty [float]]
                        [--max_new_tokens [int]] [--stop_sequences [List]] [--stop_tokens [List]] [--ignore_eos [bool]] [--skip_special_tokens [bool]]
                        [--enable_disagg_metric bool] [--enable_export_kv_lens_metric bool] [--enable_hybrid_dp bool] [--enable_quant bool] [--asymmetry bool]
                        [--block_wise_quant bool] [--enable_cute bool] [--no_scale bool] [--quant_lm_head bool] [--rotate bool] [--random_rotate_matrix bool]
                        [--skip_ffn_fc2 bool]
                        ...

Detailed descriptions of the blade_llm_server parameters are as follows:

| Parameter | Value type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| --tensor_parallel_size (-tp) | int | No | 1 | Tensor parallel size. |
| --pipeline_parallel_size (-pp) | int | No | 1 | Pipeline parallel size. |
| --pipeline_micro_batch (-ppmb) | int | No | 1 | Micro batch size for pipeline parallelism. |
| --attention_dp_size (-dp) | int | No | 1 | Data parallel size. |
| --host | str | No | 0.0.0.0 | Server hostname. |
| --port | int | No | 8081 | Server port number. |
| --worker_socket_path | str | No | /tmp/blade_llm.sock | Socket path for worker processes. |
| --log_level | enumeration | No | INFO | Log level to print. Valid values: DEBUG, INFO, WARNING, ERROR. |
| --max_gpu_memory_utilization | float | No | 0.85 | Maximum GPU memory utilization for the continuous batching scheduler. |
| --preempt_strategy | enumeration | No | AUTO | Strategy for handling preempted requests when KV cache space is insufficient. Valid values: AUTO, RECOMPUTE, SWAP. |
| --ragged_flash_max_batch_tokens | int | No | 2048 | Maximum batch tokens in ragged flash memory. |
| --decode_algo | enumeration | No | normal | Efficient decoding algorithm. Valid values: sps, look_ahead, normal. |
| --gamma | int | No | 0 | Gamma step size for speculative decoding. |
| --disable_prompt_cache | None | No | False | Disable prompt prefix caching. |
| --prompt_cache_enable_swap | None | No | False | Swap the prompt cache from GPU memory to CPU memory. |
| --decoding_parallelism | int | No | min(max(get_cpu_number() // 2, 1), 2) | Decoding parallelism setting. |
| --metric_exporters | None [None ...] | No | logger | Metric export methods. Valid values: logger (print metrics to logs), eas (push metrics to EAS). |
| --max_queue_time | int | No | 3600 | Maximum waiting time (in seconds) for requests in the queue. |
| --enable_custom_allreduce | None | No | False | Use custom all-reduce instead of NCCL all-reduce. |
| --enable_json_warmup | None | No | False | Enable finite-state machine compilation for JSON Schema. |
| --enable_llumnix | None | No | False | Enable Llumnix. |
| --llumnix_config | str | No | None | Path to the Llumnix configuration file. |
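
As a sketch of how the scheduling and metric parameters above might be combined, the following hypothetical command enables pipeline parallelism across two GPUs, exports metrics to both the logs and EAS, and shortens the queue timeout. The values are illustrative, and passing two exporters as space-separated values is an assumption based on the usage string:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --pipeline_parallel_size 2 \
    --metric_exporters logger eas \
    --max_queue_time 600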

The following are model loading parameters.

| Parameter | Value type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| --model | [str] | Yes | None | Directory containing model files. |
| --tokenizer_dir | [str] | No | None | Tokenizer path. Defaults to the model directory. |
| --chat_template | [str] | No | None | Chat template configuration. |
| --dtype | enumeration | No | half | Data precision used for the model and activation parts that are not quantized. Valid values: half, float16, bfloat16, float, float32. |
| --kv_cache_quant | enumeration | No | no_quant | Enable KV cache quantization. Valid values: no_quant, int8, int4, int8_affine, int4_affine, fp8_e5m2, fp8_e4m3, mix_f852i4, mix_f843i4, mix_i8i4, mix_i4i4. |
| --kv_cache_quant_sub_heads | int | No | 1 | Number of sub heads for KV cache quantization. |
| --tokenizer_special_tokens | List | No | [] | Special tokens for the tokenizer, for example: --tokenizer_special_tokens bos_token=<s> eos_token=</s>. |
| --enable_triton_mla | None | No | False | Use the Triton MLA implementation; otherwise, Bladnn MLA is used. |
| --disable_cuda_graph | None | No | False | Disable CUDA Graph. |
| --cuda_graph_max_batch_size | int | No | 64 | Maximum batch size for CUDA Graph. |
| --cuda_graph_batch_sizes | [List] | No | None | Batch sizes to be captured by CUDA Graph. |
| --with_visual, --nowith_visual | bool | No | True | Enable support for visual models. |
| --use_sps, --nouse_sps | bool | No | False | Enable speculative decoding. |
| --use_lookahead, --nouse_lookahead | bool | No | False | Enable LookAhead decoding. |
| --look_ahead_window_size | int | No | 2 | LookAhead window size. |
| --look_ahead_gram_size | int | No | 2 | LookAhead n-gram size. |
| --guess_set_size | int | No | 3 | LookAhead guess set size. |
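
For example, LookAhead decoding can be enabled at startup with the parameters above. The sketch below sets both --decode_algo look_ahead and --use_lookahead together with the default window, gram, and guess set sizes; whether both switches are required may depend on the engine version, and the model path is an assumption:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --decode_algo look_ahead \
    --use_lookahead \
    --look_ahead_window_size 2 \
    --look_ahead_gram_size 2 \
    --guess_set_size 3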

The following are draft model loading parameters, which are only effective when speculative decoding is enabled.

| Parameter | Value type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| --draft.model | [str] | No | None | Directory containing the draft model files. |
| --draft.tokenizer_dir | [str] | No | None | Tokenizer path. Defaults to the model directory. |
| --draft.chat_template | [str] | No | None | Chat template configuration. |
| --draft.dtype | enumeration | No | half | Data precision of the draft model parts that are not quantized. Valid values: half, float16, bfloat16, float, float32. |
| --draft.kv_cache_quant | enumeration | No | no_quant | Enable KV cache quantization. Valid values: no_quant, int8, int4, int8_affine, int4_affine, fp8_e5m2, fp8_e4m3, mix_f852i4, mix_f843i4, mix_i8i4, mix_i4i4. |
| --draft.kv_cache_quant_sub_heads | int | No | 1 | Number of sub heads for KV cache quantization. |
| --draft.tokenizer_special_tokens | List | No | [] | Special tokens for the tokenizer. |
| --draft.enable_triton_mla | None | No | False | Use the Triton MLA implementation; otherwise, Bladnn MLA is used. |
| --draft.disable_cuda_graph | None | No | False | Disable CUDA Graph. |
| --draft.cuda_graph_max_batch_size | int | No | 64 | Maximum batch size for CUDA Graph. |
| --draft.cuda_graph_batch_sizes | [List] | No | None | Batch sizes to be captured by CUDA Graph. |
| --draft.with_visual, --draft.nowith_visual | bool | No | True | Enable support for visual models. |
| --draft.use_sps, --draft.nouse_sps | bool | No | False | Enable speculative decoding. |
| --draft.use_lookahead, --draft.nouse_lookahead | bool | No | False | Enable LookAhead decoding. |
| --draft.look_ahead_window_size | int | No | 2 | LookAhead window size. |
| --draft.look_ahead_gram_size | int | No | 2 | LookAhead n-gram size. |
| --draft.guess_set_size | int | No | 3 | LookAhead guess set size. |
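
For example, speculative decoding with a separate draft model could be configured as in the following sketch. The target and draft model paths and the gamma value are illustrative assumptions:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --decode_algo sps \
    --use_sps \
    --gamma 4 \
    --draft.model /mnt/model/Qwen3-0.6B/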

The following are LoRA-related parameters.

| Parameter | Value type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| --max_lora_rank | int | No | 16 | Maximum rank value for LoRA weights. |
| --max_loras | int | No | 2 | Maximum number of LoRAs. |
| --max_cpu_loras | int | No | None | Maximum CPU resource usage limit. |
| --lora_dtype | str | No | None | LoRA data type. |
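
For example, the LoRA limits can be adjusted at startup as in the following sketch; the model path, rank, and count values are illustrative, and bfloat16 as a --lora_dtype value is an assumption:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --max_lora_rank 32 \
    --max_loras 4 \
    --lora_dtype bfloat16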

The following are service sampling parameters. They correspond to the options described in service invocation parameter configuration. If a request does not specify a value for one of these parameters, the default value set at service startup is used.

| Parameter | Value type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| --temperature | [float] | No | None | Temperature used to reshape the logits distribution. |
| --top_p | [float] | No | None | Keep the most likely tokens whose cumulative probability reaches top_p. |
| --top_k | [int] | No | None | Keep the top_k tokens with the highest probability. |
| --cat_prompt | [bool] | No | None | Detokenize the output token IDs together with the prompt token IDs. |
| --repetition_penalty | [float] | No | None | Degree to which the model avoids repeating words when generating text. Higher values penalize repetition more strongly. |
| --presence_penalty | [float] | No | None | Degree of penalty applied to tokens that appear in the original input text. Higher values make the generated text more consistent with the original text, but may reduce its diversity. |
| --max_new_tokens | [int] | No | None | Maximum number of tokens to generate. |
| --stop_sequences | [List] | No | None | Stop generation on certain text, for example: --stop_sequences a b c. |
| --stop_tokens | [List] | No | None | Stop generation on certain token IDs or token sequences, for example: --stop_tokens 1 2 3. |
| --ignore_eos | [bool] | No | None | Do not stop at the EOS token during generation. |
| --skip_special_tokens | [bool] | No | None | Skip special tokens when converting token IDs to tokens during decoding. |
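
For example, default sampling behavior can be set at service startup and then overridden per request, as in the following sketch; the model path and sampling values are illustrative assumptions:

blade_llm_server --model /mnt/model/Qwen3-4B/ \
    --temperature 0.7 \
    --top_p 0.8 \
    --top_k 50 \
    --max_new_tokens 1024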