JSON deployment parameters

In EAS, you can define and deploy an online inference service with a JSON configuration file.

Quick start

1. Prepare a JSON configuration file

To deploy a service, you need a JSON file that defines the required configurations. For first-time users, we recommend navigating to Custom Model Deployment > Custom Deployment to configure parameters. The system automatically generates the JSON configuration, which you can use as a template.

The following code is a sample service.json file. For a complete list of parameters, see Appendix: JSON parameters.

{
    "metadata": {
        "name": "demo",
        "instance": 1,
        "workspace_id": "your-workspace-id"
    },
    "cloud": {
        "computing": {
            "instances": [
                {
                    "type": "ecs.c7a.large"
                }
            ]
        }
    },
    "containers": [
        {
            "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/python-inference:py39-ubuntu2004",
            "script": "python app.py",
            "port": 8000
        }
    ]
}
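
If you prefer to generate the file from a script instead of editing it by hand, the sample above can be produced with Python's standard json module. This is a minimal sketch: the helper name build_service_config and its arguments are hypothetical and not part of EAS, and the values simply mirror the sample.

```python
import json


def build_service_config(name, workspace_id, instance_type, image, script, port, instances=1):
    """Return a dict shaped like the sample service.json above.

    Hypothetical helper for illustration; EAS does not provide this function.
    """
    return {
        "metadata": {
            "name": name,
            "instance": instances,
            "workspace_id": workspace_id,
        },
        "cloud": {"computing": {"instances": [{"type": instance_type}]}},
        "containers": [{"image": image, "script": script, "port": port}],
    }


config = build_service_config(
    name="demo",
    workspace_id="your-workspace-id",
    instance_type="ecs.c7a.large",
    image="eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/python-inference:py39-ubuntu2004",
    script="python app.py",
    port=8000,
)

# Serialize to text; save this string as service.json or paste it into the console.
service_json = json.dumps(config, indent=4)
print(service_json)
```

The printed text can be pasted into the JSON Deployment editor described in the next step.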

2. Deploy the service with JSON

Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Inference Service tab, click Deploy Service. Then, in the Custom Model Deployment section, select JSON Deployment.
Paste your JSON configuration and click Deploy. A service status of Running indicates a successful deployment.

Appendix: JSON parameters

Parameter	Required	Description
metadata	Yes	The service's metadata. For more information, see metadata parameters.
cloud	No	The compute resource and VPC configurations. For more information, see cloud parameters.
containers	No	The image configuration. For more information, see containers parameters.
dockerAuth	No	This parameter is required to access a private repository that requires authentication. The value is the Base64-encoded string of `username:password`.
networking	No	The service invocation configuration. For more information, see networking parameters.
storage	No	Mounts data from storage services such as OSS or NAS into the container. For configuration details, see storage mount.
token	No	The access token for service authentication. If not specified, the system automatically generates one.
aimaster	No	Enables computing power check and fault tolerance for multi-node distributed inference services.
model_path	Yes	Required when you deploy a service with a processor. The model_path and processor_path parameters specify the source locations of the model and the processor, respectively. The following formats are supported: OSS path: The path can point to a specific file or a directory. HTTP URL: The URL must point to a compressed archive, such as a TAR.GZ, TAR, BZ2, or ZIP file. Local path: A local path can be used for local debugging with the `test` command.
oss_endpoint	No	The OSS endpoint, for example, oss-cn-beijing.aliyuncs.com. For other valid values, see Regions and endpoints. Note: By default, you do not need to specify this parameter. The service uses the internal OSS endpoint of the current region to download model files or processor files. You must specify this parameter when you access OSS across regions. For example, if you deploy a service in the Hangzhou region and specify an OSS address in the Beijing region for model_path, you must use this parameter to specify the public OSS endpoint of the Beijing region.
model_entry	No	The model's entry file, which can be any file within the model package. If unspecified, it defaults to the filename from model_path. The path to this entry file is passed to the initialize() function of the processor.
model_config	No	The configuration for the model, which can be any text. This value is passed as the second argument to the processor's initialize() function.
processor	No	If using a pre-built processor, specify its code. For the codes of pre-built processors available in `eascmd`, see pre-built processors. If using a custom processor, configure the processor_path, processor_entry, processor_mainclass, and processor_type parameters instead.
processor_path	No	The path to the processor package. For supported path formats, see the description of the model_path parameter.
processor_entry	No	The entry file of the processor, such as libprocessor.so or app.py. This file must implement the `initialize()` and `process()` functions required for inference. This parameter is required if processor_type is set to cpp or python.
processor_mainclass	No	The main class of the processor in the JAR package. For example, com.aliyun.TestProcessor. This parameter is required if processor_type is set to java.
processor_type	No	The implementation language of the processor. Valid values: cpp, java, and python.
warm_up_data_path	No	The path to the request file used for model warm-up. For more information about this feature, see model warm-up.
runtime.enable_crash_block	No	Specifies whether an instance that crashes due to a processor code exception automatically restarts. Valid values: true: The instance does not restart automatically, which preserves the runtime environment for troubleshooting. false (Default): The instance restarts automatically.
autoscaler	No	The configuration for horizontal auto scaling. For detailed parameter descriptions, see horizontal auto scaling.
labels	No	The labels to apply to the service. Use the `key:value` format.
unit.size	No	The number of machines per instance in a distributed inference configuration. The default value is 2.
sinker	No	Persists all service requests and responses to MaxCompute or Log Service (SLS). For detailed parameter descriptions, see sinker parameters.
confidential	No	Configures Trustee to ensure that information such as data, models, and code remains encrypted during service deployment and invocation, which enables a secure and encrypted inference service. The format is as follows: `"confidential": { "trustee_endpoint": "xxxx", "decryption_key": "xxxx" }`. trustee_endpoint: The URI of Trustee. decryption_key: The KBS URI of the decryption key, for example, `kbs:///default/key/test-key`. Note: The secure encryption environment primarily protects mounted storage files. Ensure that you have mounted these files before enabling this feature.
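
The dockerAuth value described above is the Base64 encoding of `username:password`. The following sketch shows one way to produce it with Python's standard base64 module. The function name and the credentials are placeholders chosen for illustration.

```python
import base64


def encode_docker_auth(username: str, password: str) -> str:
    """Base64-encode 'username:password' for the dockerAuth field."""
    raw = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder credentials for illustration only.
docker_auth = encode_docker_auth("example-user", "example-password")
print(docker_auth)
```

The resulting string is what you would place in the `dockerAuth` field of the configuration.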

Metadata parameters

General parameters

Parameter	Required	Description
name	Yes	The name of the service. Must be unique within a region.
instance	Yes	The number of instances for the service.
workspace_id	No	The ID of the PAI workspace. If specified, this parameter restricts the service to the workspace. For example: `1405**`.
cpu	No	The number of CPU cores required for each instance.
memory	No	The amount of memory required for each instance, in MB. The value must be an integer. For example, `"memory": 4096` indicates that each instance requires 4 GB of memory.
gpu	No	The number of GPUs required for each instance.
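gpu_memory	No	Enables GPU slicing, which allows multiple instances to share a single GPU. This parameter can be configured only with dedicated resource groups or resource quotas.
gpu_core_percentage	No	Shares its description with gpu_memory: it is part of the GPU slicing configuration and can be configured only with dedicated resource groups or resource quotas.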
qos	No	Specifies the Quality of Service (QoS) for the instance. Valid values: BestEffort or omitted. When qos is set to BestEffort, the instance enters CPU sharing mode. In this mode, scheduling is based on GPU memory and system memory, and scheduling ignores the number of CPU cores on the node. All instances on the node share the CPU resources. The cpu parameter then specifies the maximum CPU quota that a single instance can use.
resource	No	The ID of the resource group. The deployment policy is as follows: If deployed in a public resource group, omit this parameter. The service is then billed on a pay-as-you-go basis. If deployed in a dedicated resource group, set this parameter to the resource group ID. For example: eas-r-6dbzve8ip0xnzt****.
cuda	No	The CUDA version that the service requires. At runtime, the specified CUDA version is automatically mounted to the `/usr/local/cuda` directory of the instance. Supported CUDA versions: 8.0, 9.0, 10.0, 10.1, 10.2, 11.0, 11.1, and 11.2. For example: `"cuda":"11.2"`.
rdma	No	Specifies whether to enable RDMA networking for distributed inference. Set the value to 1 to enable RDMA networking. If omitted, this feature is disabled. Note: Currently, RDMA networking is available only for services that are deployed using Lingjun intelligent computing resources.
enable_grpc	No	Specifies whether to enable gRPC connections for the service gateway. Valid values: false (Default): Disables gRPC connections. The gateway supports HTTP requests by default. true: Enables gRPC connections. Note: If you deploy a service using a custom image with a gRPC-based server, you must set this parameter to true to switch the gateway protocol to gRPC.
enable_webservice	No	Specifies whether to enable a web server to deploy the service as an AI-Web application. Valid values: false (Default): The web server is not enabled. true: The web server is enabled.
type	No	Set this parameter to LLMGatewayService to deploy an LLM intelligent router service. For more information, see Deploy an LLM intelligent router.
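
Several of the constraints in the table above (required fields, memory as an integer number of MB, the supported CUDA versions) can be checked before you submit a configuration. The following is a minimal sketch of such a pre-check. The function check_metadata is hypothetical and only covers the rules stated in this table; it is not how EAS itself validates input.

```python
# CUDA versions listed in the cuda row of the table above.
SUPPORTED_CUDA = {"8.0", "9.0", "10.0", "10.1", "10.2", "11.0", "11.1", "11.2"}


def check_metadata(metadata: dict) -> list:
    """Return a list of problems found in a metadata block (empty if none)."""
    problems = []
    # name and instance are the two parameters marked as required.
    for key in ("name", "instance"):
        if key not in metadata:
            problems.append(f"missing required parameter: {key}")
    # memory must be an integer (MB); bool is excluded because it subclasses int.
    memory = metadata.get("memory")
    if memory is not None and (isinstance(memory, bool) or not isinstance(memory, int)):
        problems.append("memory must be an integer number of MB")
    cuda = metadata.get("cuda")
    if cuda is not None and cuda not in SUPPORTED_CUDA:
        problems.append(f"unsupported cuda version: {cuda}")
    return problems


print(check_metadata({"name": "demo", "instance": 1, "memory": 4096, "cuda": "11.2"}))
print(check_metadata({"name": "demo", "memory": "4g", "cuda": "12.0"}))
```

The first call reports no problems. The second reports the missing instance count, the non-integer memory value, and the unsupported CUDA version.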

Advanced parameters

Important

Modify these advanced parameters with caution.

Parameter		Required	Description
rpc	batching	No	Enables server-side batching to accelerate GPU model inference. This feature is supported only in pre-built processor mode. Valid values: false (Default): Disables server-side batching. true: Enables server-side batching.
	keepalive	No	The maximum processing time for a single request, in milliseconds. If the processing time exceeds this value, the server returns a 408 Timeout error and closes the connection. The default value is 600000 for dedicated gateways. This parameter is not supported for Application Load Balancer (ALB)-based dedicated gateways.
	io_threads	No	The number of threads used to process network I/O requests in each instance. The default value is 4.
	max_batch_size	No	The maximum size of each batch. The default value is 16. This parameter takes effect only when rpc.batching is set to true. This feature is supported only in pre-built processor mode.
	max_batch_timeout	No	The maximum timeout period for each batch, in milliseconds. The default value is 50. This parameter takes effect only when rpc.batching is set to true. This feature is supported only in pre-built processor mode.
	max_queue_size	No	The maximum length of the queue for an asynchronous inference service. The default value is 64. If the queue is full, the server returns a 450 error and closes the connection. This allows the client to retry on other instances and prevents server overload. For services with long response times (RTs), you can reduce the queue length to prevent requests from piling up and causing timeouts.
	worker_threads	No	The number of threads in each instance that are used to concurrently process requests. The default value is 5. This feature is supported only in pre-built processor mode.
	rate_limit	No	Enables QPS rate limiting and specifies the maximum QPS that an instance can process. The default value is 0, which indicates that QPS rate limiting is disabled. For example, if you set this parameter to 2000, requests are rejected with a 429 (Too Many Requests) error when the QPS exceeds 2,000.
	enable_sigterm	No	Specifies whether the system sends a SIGTERM signal to the main process when an instance enters the terminating state. Valid values: false (Default): The system does not send a SIGTERM signal when an instance enters the terminating state. true: When a service instance enters the terminating state, the system immediately sends a SIGTERM signal to the main process. The process within the service must handle this signal to perform a custom graceful termination. If the signal is not handled, the main process may exit immediately, preventing a graceful termination.
rolling_strategy	max_surge	No	The maximum number of additional instances created beyond the desired count during a rolling update. The value can be a positive integer that indicates the number of instances, or a percentage, such as 2%. The default value is 2%. A larger value accelerates service updates. For example, if the service instance count is 100 and you set this parameter to 20, 20 new instances are created immediately after the service update starts.
rolling_strategy	max_unavailable	No	The maximum number of unavailable instances during a rolling update. This parameter can free up resources for new instances during an update and prevent the update from stalling due to insufficient resources. The default value is 1 for dedicated resource groups and 0 for public resource groups. For example, if you set this parameter to N, N instances are stopped immediately after the service update starts. Note: If idle resources are sufficient, you can set this parameter to 0. A large value may affect service stability because the number of available instances decreases during the update, which increases the traffic load on a single instance. Balance service stability with resource availability when you configure this parameter.
eas.termination_grace_period		No	The graceful termination period of an instance, in seconds. The default value is 30. EAS services use a rolling update strategy. An instance first enters the Terminating state, and the service routes traffic away from the terminating instance. The instance then waits for this period (30 seconds by default) to process any received requests before it exits. If requests take a long time to process, you can increase this value to ensure that all in-flight requests are completed during a service update. Important: A smaller value may affect service stability, while a larger value may slow down service updates. Change this parameter only if necessary.
scheduling	spread.policy	No	The spread policy for scheduling service instances. The following policies are supported: host: Spreads instances across different nodes. zone: Spreads instances across different availability zones. default: Schedules instances by using the system's default placement strategy. Configuration example: `{ "metadata": { "scheduling": { "spread": { "policy": "host" } } } }`
resource_rebalancing		No	Specifies whether to enable resource rebalancing. Valid values: false (Default): This feature is disabled. true: EAS periodically creates probe instances on high-priority resources. If a probe instance is scheduled successfully, EAS creates more probe instances exponentially until scheduling fails. When a successfully scheduled probe instance completes initialization and becomes ready, it replaces an instance running on a lower-priority resource. This feature helps resolve the following issues: (1) Prevents new instances from being temporarily scheduled to a public resource group during a rolling update, which can occur when terminating instances in a dedicated resource group have not yet freed their resources. (2) When both spot and regular instances are used, the system periodically checks for available spot instances and migrates regular instances to them.
resource_burstable		No	Enables the elastic resource pool feature for an EAS service that is deployed in a dedicated resource group. true: Enables the feature. false: Disables the feature.
shm_size		No	The size of the shared memory for each instance, in GB. Shared memory allows direct read and write operations, eliminating the need for data copying or transfer.
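
When rpc.enable_sigterm is set to true, the main process is expected to handle SIGTERM itself. The following is a minimal sketch of such a handler in Python, assuming the service runs a simple polling loop. The names shutdown_requested, handle_sigterm, and serve_forever are hypothetical, and a real server framework may provide its own shutdown hooks.

```python
import signal
import threading

# Set when SIGTERM arrives so the serving loop can stop accepting new work.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the termination request instead of exiting immediately."""
    shutdown_requested.set()


# Signal handlers can only be registered from the main thread.
signal.signal(signal.SIGTERM, handle_sigterm)


def serve_forever(handle_one_request, poll_seconds=0.1):
    """Hypothetical serving loop: keep working until SIGTERM is received."""
    while not shutdown_requested.is_set():
        handle_one_request()
        shutdown_requested.wait(poll_seconds)
    # Finish in-flight work and release resources here before returning.
```

Any cleanup placed after the loop should complete within eas.termination_grace_period, because the instance exits when that period ends.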

Cloud parameters

Parameter		Required	Description
computing	instances	No	Specifies a list of instance types to use when deploying the service in a public resource group. If a bid for a spot instance fails or an instance type is out of stock, the system creates the service by using the next instance type in the list. type: The instance type. spot_price_limit: Optional. If you specify this parameter, the instance type becomes a pay-as-you-go spot instance, and this value is its maximum price in USD. If you omit this parameter, a regular pay-as-you-go instance is created. capacity: The maximum number of instances of this type to create. You can specify a number, such as "500", or a percentage in a string, such as "20%". After the capacity limit is reached, the system stops creating instances of this type, even if resources are available. For example, if the total number of instances for a service is 200 and you set the `capacity` of an instance type to `20%`, the system launches a maximum of 40 instances of this type. The remaining instances are launched by using other specified instance types.
computing	disable_spot_protection_period	No	Specifies whether to disable the protection period for a spot instance. This parameter applies only to spot instances. Valid values: false (Default): The spot instance has a 1-hour protection period after it is created. During the protection period, the system does not reclaim the instance even if the market price exceeds your bid. true: Disables the protection period. Instances without a protection period are typically about 10% cheaper than those with a protection period.
networking	vpc_id	No	The ID of the VPC.
	vswitch_id	No	The ID of the VSwitch.
	security_group_id	No	The ID of the security group.
	destination_cidrs	No	If the CIDR block of the configured VSwitch conflicts with the EAS management CIDR blocks (10.224.0.0/16 or 10.240.0.0/12), you must explicitly set this parameter to the CIDR block of your VSwitch. Example: `"cloud": { "networking": { "destination_cidrs": "10.241.28.0/22" } }` Replace `10.241.28.0/22` with the actual CIDR block of your VSwitch.
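
To decide whether destination_cidrs is needed, you can test your vSwitch CIDR block against the two EAS management blocks named above. The sketch below uses Python's standard ipaddress module and treats "conflicts" as "overlaps", which is an assumption of this example. The function name is hypothetical.

```python
import ipaddress

# EAS management CIDR blocks listed in the destination_cidrs description.
EAS_MANAGEMENT_CIDRS = [
    ipaddress.ip_network("10.224.0.0/16"),
    ipaddress.ip_network("10.240.0.0/12"),
]


def needs_destination_cidrs(vswitch_cidr: str) -> bool:
    """Return True if the vSwitch CIDR overlaps an EAS management block."""
    vswitch = ipaddress.ip_network(vswitch_cidr)
    return any(vswitch.overlaps(block) for block in EAS_MANAGEMENT_CIDRS)


print(needs_destination_cidrs("10.241.28.0/22"))  # overlaps 10.240.0.0/12
print(needs_destination_cidrs("192.168.0.0/24"))  # no overlap
```

The first call returns True, so a vSwitch with that CIDR block would need destination_cidrs set as in the table's example. The second returns False.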

Example:

{
    "cloud": {
        "computing": {
            "instances": [
                {
                    "type": "ecs.c8i.2xlarge",
                    "spot_price_limit": 1
                },
                {
                    "type": "ecs.c8i.xlarge",
                    "capacity": "20%"
                }
            ],
            "disable_spot_protection_period": false
        },
        "networking": {
            "vpc_id": "vpc-bp1oll7xawovg9*****",
            "vswitch_id": "vsw-bp1jjgkw51nsca1e****",
            "security_group_id": "sg-bp1ej061cnyfn0b*****"
        }
    }
}
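
The capacity arithmetic described for computing.instances (a number, or a percentage of the service's total instance count) can be sketched as follows. The function name is hypothetical, and rounding down for percentages that do not divide evenly is an assumption of this sketch, because the source does not state the rounding rule.

```python
def capacity_limit(capacity, total_instances: int) -> int:
    """Resolve a capacity value to a maximum instance count.

    Accepts an integer, a numeric string such as "500", or a percentage
    string such as "20%". Rounding down is an assumption of this sketch.
    """
    if isinstance(capacity, str) and capacity.endswith("%"):
        percent = int(capacity[:-1])
        return total_instances * percent // 100
    return int(capacity)


print(capacity_limit("20%", 200))  # the table's example: 20% of 200 instances
print(capacity_limit("500", 200))  # a plain number passed as a string
```

For the table's example of 200 total instances and a capacity of 20%, the function returns 40.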

Container parameters

To deploy a service using a custom image, see Custom Images.

Parameter		Required	Description
image		Yes	The image address for the model service. Required when you deploy using an image.
env	name	No	The name of the environment variable.
env	value	No	The value of the environment variable.
command		You must specify either command or script.	The entry point command for the image. This parameter supports only a single command. For complex scripts, such as `cd xxx && python app.py`, use the `script` parameter. Use the `command` parameter if the image lacks the `/bin/sh` command.
script		You must specify either command or script.	The entry point script for the image. You can specify complex scripts with multiple lines. Separate commands with `\n` or a semicolon (;).
port		No	The container port. This port must match the port configured in the xxx.py file specified by the command. Important: The EAS engine listens on fixed ports 8080 and 9090. To avoid port conflicts, ensure the container port is not 8080 or 9090.
prepare	pythonRequirements	No	A list of Python requirements to install before the instance starts. The image must have the python and pip commands available in the system PATH. For example: `"prepare": { "pythonRequirements": [ "numpy==1.16.4", "absl-py==0.11.0" ] }`
prepare	pythonRequirementsPath	No	The path to a requirements.txt file for installing Python packages before the instance starts. The image must have the python and pip commands available in the system PATH. This file can be included in the image or mounted from external storage. For example: `"prepare": { "pythonRequirementsPath": "/data_oss/requirements.txt" }`
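
The container rules in the table above (image is required, command or script must be present, and ports 8080 and 9090 are reserved) lend themselves to a quick pre-check. The following is a minimal sketch with a hypothetical check_container function. It covers only the rules stated in this table.

```python
RESERVED_PORTS = {8080, 9090}  # ports the EAS engine listens on


def check_container(container: dict) -> list:
    """Return a list of problems found in one containers entry (empty if none)."""
    problems = []
    if "image" not in container:
        problems.append("image is required")
    if "command" not in container and "script" not in container:
        problems.append("specify either command or script")
    if container.get("port") in RESERVED_PORTS:
        problems.append("port must not be 8080 or 9090")
    return problems


print(check_container({"image": "example/image:tag", "script": "python app.py", "port": 8000}))
print(check_container({"image": "example/image:tag", "port": 8080}))
```

The first entry passes. The second is flagged for the missing entry point and the reserved port.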

Networking parameters

Parameter	Required	Description
gateway	No	Specifies the dedicated gateway for the EAS service.
gateway_policy	No	Contains the following settings. rate_limit: Global rate limiting, which caps the number of requests per second (QPS). Set enable to true to enable rate limiting or false to disable it, and set limit to the maximum QPS. Note: Services on a shared gateway default to 1,000 QPS per service and 10,000 QPS per server group. Dedicated gateways have no default value. concurrency_limit: Global concurrency control, which caps the number of concurrent requests. This setting is not supported for ALB-based dedicated gateways. Set enable to true to enable concurrency control or false to disable it, and set limit to the maximum number of concurrent requests.

Example configuration:

{
    "networking": {
        "gateway_policy": {
            "rate_limit": {
                "enable": true,
                "limit": 100
            },
            "concurrency_limit": {
                "enable": true,
                "limit": 50
            }
        }
    }
}

Sinker parameters

Parameter		Required	Description
type		No	Specifies the destination storage service. Supported values: `maxcompute`: MaxCompute. `sls`: Log Service (SLS).
config	maxcompute.project	No	The MaxCompute project name.
	maxcompute.table	No	The MaxCompute table name.
	sls.project	No	The Log Service (SLS) project name.
	sls.logstore	No	The Logstore name.

Example configurations:

Sink to MaxCompute

"sinker": {
    "type": "maxcompute",
    "config": {
        "maxcompute": {
            "project": "cl****",
            "table": "te****"
        }
    }
}

Sink to SLS

"sinker": {
    "type": "sls",
    "config": {
        "sls": {
            "project": "k8s-log-****",
            "logstore": "d****"
        }
    }
}

JSON configuration example

The following is a sample JSON configuration:

{
  "token": "****M5Mjk0NDZhM2EwYzUzOGE0OGMx****",
  "processor": "tensorflow_cpu_1.12",
  "model_path": "oss://examplebucket/exampledir/",
  "oss_endpoint": "oss-cn-beijing.aliyuncs.com",
  "model_entry": "",
  "model_config": "",
  "processor_path": "",
  "processor_entry": "",
  "processor_mainclass": "",
  "processor_type": "",
  "warm_up_data_path": "",
  "runtime": {
    "enable_crash_block": false
  },
  "unit": {
    "size": 2
  },
  "sinker": {
    "type": "maxcompute",
    "config": {
      "maxcompute": {
        "project": "cl****",
        "table": "te****"
      }
    }
  },
  "cloud": {
    "computing": {
      "instances": [
        {
          "capacity": 800,
          "type": "dedicated_resource"
        },
        {
          "capacity": 200,
          "type": "ecs.c7.4xlarge",
          "spot_price_limit": 3.6
        }
      ],
      "disable_spot_protection_period": true
    },
    "networking": {
      "vpc_id": "vpc-bp1oll7xawovg9t8****",
      "vswitch_id": "vsw-bp1jjgkw51nsca1e****",
      "security_group_id": "sg-bp1ej061cnyfn0b****"
    }
  },
  "autoscaler": {
    "min": 2,
    "max": 5,
    "strategies": {
      "qps": 10
    }
  },
  "storage": [
    {
      "mount_path": "/data_oss",
      "oss": {
        "endpoint": "oss-cn-shanghai-internal.aliyuncs.com",
        "path": "oss://bucket/path/"
      }
    }
  ],
  "confidential": {
    "trustee_endpoint": "xx",
    "decryption_key": "xx"
  },
  "metadata": {
    "name": "test_eascmd",
    "resource": "eas-r-9lkbl2jvdm0puv****",
    "instance": 1,
    "workspace_id": "1405**",
    "gpu": 0,
    "cpu": 1,
    "memory": 2000,
    "gpu_memory": 10,
    "gpu_core_percentage": 10,
    "qos": "",
    "cuda": "11.2",
    "enable_grpc": false,
    "enable_webservice": false,
    "rdma": 1,
    "rpc": {
      "batching": false,
      "keepalive": 5000,
      "io_threads": 4,
      "max_batch_size": 16,
      "max_batch_timeout": 50,
      "max_queue_size": 64,
      "worker_threads": 5,
      "rate_limit": 0,
      "enable_sigterm": false
    },
    "rolling_strategy": {
      "max_surge": 1,
      "max_unavailable": 1
    },
    "eas.termination_grace_period": 30,
    "scheduling": {
      "spread": {
        "policy": "host"
      }
    },
    "resource_rebalancing": false,
    "shm_size": 100
  },
  "features": {
    "eas.aliyun.com/extra-ephemeral-storage": "100Gi",
    "eas.aliyun.com/gpu-driver-version": "tesla=550.127.08"
  },
  "networking": {
    "gateway": "gw-m2vkzbpixm7mo****"
  },
  "containers": [
    {
      "image": "registry-vpc.cn-shanghai.aliyuncs.com/xxx/yyy:zzz",
      "prepare": {
        "pythonRequirements": [
          "numpy==1.16.4",
          "absl-py==0.11.0"
        ]
      },
      "command": "python app.py",
      "port": 8000
    }
  ],
  "dockerAuth": "dGVzdGNhbzoxM*******"
}