Elastic Algorithm Service (EAS) lets you quickly deploy models as real-time inference services. This topic uses the deployment of the Qwen3-0.6B model with the vLLM framework as an example to walk you through the process of deploying and calling a service.
This topic uses the custom deployment of a large language model (LLM) as an example to help you quickly get started with EAS. For actual production deployments of LLMs, use the scenario-based LLM deployment or deploy models with one click from the Model Gallery. These methods are more convenient and faster.
Cost Considerations: This guide uses public pay-as-you-go resources. Estimated cost: ¥2-5 for the complete tutorial. Always stop or delete services when not in use to avoid unnecessary charges.
Prerequisites
Active Alibaba Cloud account with root permissions
PAI service activated in your workspace
Basic familiarity with cloud computing concepts
Use your root account to activate PAI and create a workspace. Log on to the PAI console, select a region in the top-left corner, and then complete the one-click authorization and product activation.
Billing
This topic uses public resources to create a model service. The billing method is pay-as-you-go. For more information about billing rules, see EAS billing.
Preparations
To deploy a model service, you typically need to prepare model files and code files, such as web interfaces. If none of the platform's Alibaba Cloud images meets your deployment requirements, you must also build your own runtime image.
Prepare the model file
To obtain the Qwen3-0.6B model file for this example, run the following Python code. The file is downloaded from ModelScope to the default path ~/.cache/modelscope/hub. Note that ModelScope replaces the period in the model name with ___ in the local directory name, so the downloaded folder is named Qwen3-0___6B. The OSS and mount paths later in this topic follow the same convention.
# Download the model.
# Requires the modelscope package: pip install modelscope
from modelscope import snapshot_download
model_dir = snapshot_download('Qwen/Qwen3-0.6B')
print(model_dir)  # The local directory that the model was downloaded to.

Prepare the code file
The vLLM framework makes it easy to build an OpenAI API-compatible API service. Therefore, you do not need to prepare a separate code file.
If you have complex business logic or specific API requirements, prepare your own code files. For example, the following code uses Flask to create a simple API interface.
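This is a minimal sketch only; the /predict route, the echo logic, and port 8000 are illustrative assumptions rather than values required by EAS.

# app.py: a minimal custom inference API built with Flask.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(silent=True) or {}
    # Replace this echo logic with your own model inference code.
    return jsonify({'response': f"Received: {data.get('prompt', '')}"})

if __name__ == '__main__':
    # Listen on all interfaces so that EAS can route traffic to the container.
    app.run(host='0.0.0.0', port=8000)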
Upload files to OSS
Use ossutil to upload the model and code files to OSS. You can then read the model file by mounting OSS to the service.
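For example, the following command uploads the downloaded model directory to the OSS path used later in this topic. This is a sketch: it assumes ossutil is installed and configured, that your bucket is named examplebucket, and that the local path matches the model_dir value printed by the download code above.

# Recursively upload the local model directory to OSS.
# Adjust the local path to the model_dir value returned by snapshot_download.
ossutil cp -r ~/.cache/modelscope/hub/Qwen/Qwen3-0___6B oss://examplebucket/models/Qwen/Qwen3-0___6B/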
Best Practices: Use regional OSS buckets for optimal performance and organize models in structured directory paths.
In addition to OSS, you can use other storage methods. For more information, see Storage configurations.
You can also package all required files into a runtime image for deployment. However, this method is not recommended for the following reasons:
Model updates or iterations require you to rebuild and re-upload the runtime image, which increases maintenance costs.
Large model files significantly increase the runtime image size. This leads to longer image pull times and affects the service startup efficiency.
Prepare the runtime image
Serving the Qwen3-0.6B model as an OpenAI-compatible API endpoint requires vllm>=0.8.5. The official image vllm:0.8.5.post1-mows0.2.1 provided by EAS meets this requirement, so this topic uses the official image.
If no Alibaba Cloud Image meets your requirements, you must create a custom image. If you develop and train models in a DSW instance, you can create a DSW instance image to ensure consistency between the development and deployment environments.
Service deployment
1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
3. Configure the deployment parameters. Set the key parameters as follows and keep the default values for the others. (The console settings roughly correspond to the JSON configuration sketched after these steps.)
   - Deployment Method: Image-based Deployment.
   - Image Configuration: In the Alibaba Cloud Image list, select vllm:0.8.5.post1-mows0.2.1.
   - Mount Storage: This topic stores the model file in OSS at the path oss://examplebucket/models/Qwen/Qwen3-0___6B. Therefore, select OSS and configure it as follows:
     - Uri: The OSS path where the model is located. Set this to oss://examplebucket/models/.
     - Mount Path: The destination path in the service instance where the file is mounted, such as /mnt/data/.
   - Command: The Alibaba Cloud Image has a default startup command associated with it. You can modify it as needed. For this example, change it to vllm serve /mnt/data/Qwen/Qwen3-0___6B.
   - Resource Type: Select Public Resources. Set Resource Specification to ecs.gn7i-c16g1.4xlarge. If you want to use other resource types, see Resource configurations.
4. Click Deploy. The service deployment takes about 5 minutes. When the Service Status changes to Running, the service is successfully deployed.
Tip: Monitor the deployment progress in the deployment logs for real-time status updates.
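For reference, the console configuration above roughly corresponds to a JSON service description like the following sketch. The field names follow the commonly documented EAS JSON schema, but treat this as an assumption and verify it against the EAS documentation before using it with tools such as EASCMD; the service name qwen3_demo and the port are illustrative.

{
    "metadata": {
        "name": "qwen3_demo",
        "instance": 1
    },
    "cloud": {
        "computing": {
            "instance_type": "ecs.gn7i-c16g1.4xlarge"
        }
    },
    "containers": [
        {
            "image": "<image address of vllm:0.8.5.post1-mows0.2.1 in your region>",
            "script": "vllm serve /mnt/data/Qwen/Qwen3-0___6B",
            "port": 8000
        }
    ],
    "storage": [
        {
            "mount_path": "/mnt/data/",
            "oss": {
                "path": "oss://examplebucket/models/"
            }
        }
    ]
}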
Online debugging
After the service is deployed, you can use the online debugging feature to test whether the service is running correctly. You can configure the request method, request path, and request body based on your specific model service.
The online debugging method for the service deployed in this topic is as follows:
On the Inference Service tab, click the destination service to go to the service overview page. Switch to the Online Debugging tab.
In the Online Debugging Request Parameters section of the debugging page, set the request parameters and click Send Request. The request parameters are as follows:
- Chat interface: Append /v1/chat/completions to the existing URL.
- Headers: Add a request header. Set the key to Content-Type and the value to application/json.
- Body:

{
    "model": "/mnt/data/Qwen/Qwen3-0___6B",
    "messages": [
        {
            "role": "user",
            "content": "Hello!"
        }
    ],
    "max_tokens": 1024
}
If the request succeeds, the service returns an OpenAI-style chat completion response.
Validation Checklist:
✓ HTTP 200 status code received
✓ Response contains generated text
✓ Response time < 5 seconds for simple queries
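To run these checks programmatically, you can use a small script like the following sketch. It assumes the requests package is installed and uses the <EAS_ENDPOINT> and <EAS_TOKEN> placeholders that are explained in the next section.

import time
import requests

url = "<EAS_ENDPOINT>/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "<EAS_TOKEN>"}
data = {
    "model": "/mnt/data/Qwen/Qwen3-0___6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
}

start = time.time()
resp = requests.post(url, json=data, headers=headers)
elapsed = time.time() - start

# The three checklist items above.
assert resp.status_code == 200, f"Unexpected status code: {resp.status_code}"
assert resp.json()["choices"][0]["message"]["content"], "Response contains no generated text"
print(f"OK: responded in {elapsed:.2f} seconds")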
Service invocation
Obtain the endpoint and token
This deployment uses the Shared Gateway by default. After the deployment is complete, you can obtain the endpoint and token required for the invocation from the service overview information.
On the Inference Services tab, click the target service name to go to its Overview page. In the Basic Information section, click View Endpoint Information.
In the Invocation Method panel, obtain the endpoint and token. Choose an internet or VPC endpoint as needed. The following examples use <EAS_ENDPOINT> and <EAS_TOKEN> as placeholders for these values.
Use curl or Python for invocation
The following examples call the service by using curl and the Python requests library. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the values that you obtained in the previous step.
curl <EAS_ENDPOINT>/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-H "Authorization: <EAS_TOKEN>" \
-d '{
    "model": "/mnt/data/Qwen/Qwen3-0___6B",
    "messages": [
        {
            "role": "user",
            "content": "Hello!"
        }
    ],
    "max_tokens": 1024
}'

The equivalent Python example is as follows:

import requests
# Replace <EAS_ENDPOINT> with the actual endpoint.
url = '<EAS_ENDPOINT>/v1/chat/completions'
# For the header, set the value of Authorization to the actual token.
headers = {
    "Content-Type": "application/json",
    "Authorization": "<EAS_TOKEN>",
}
# Construct the service request based on the data format required by the specific model.
data = {
    "model": "/mnt/data/Qwen/Qwen3-0___6B",
    "messages": [
        {
            "role": "user",
            "content": "Hello!"
        }
    ],
    "max_tokens": 1024
}
# Send the request.
resp = requests.post(url, json=data, headers=headers)
print(resp)
print(resp.content)
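Because the deployed vLLM service exposes an OpenAI-compatible API, you can also call it with the official OpenAI Python SDK. The following is a minimal sketch; it assumes the openai package is installed (pip install openai) and reuses the <EAS_ENDPOINT> and <EAS_TOKEN> placeholders.

from openai import OpenAI

# Point the SDK at the EAS endpoint instead of api.openai.com.
client = OpenAI(base_url="<EAS_ENDPOINT>/v1", api_key="<EAS_TOKEN>")
completion = client.chat.completions.create(
    model="/mnt/data/Qwen/Qwen3-0___6B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
)
print(completion.choices[0].message.content)

Stop or delete the service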
This topic uses public resources to create the EAS service, which is billed on a pay-as-you-go basis. When you no longer need the service, stop or delete it to avoid further charges.
Management Tips: For temporary pauses, use "Stop" to preserve configuration. For permanent removal, use "Delete" to free all resources.
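If you manage services with the EASCMD client instead of the console, stopping and deleting look roughly like the following sketch. This assumes eascmd is installed and authenticated, and qwen3_demo is a hypothetical service name.

# Stop the service but keep its configuration.
eascmd stop qwen3_demo
# Permanently delete the service and release its resources.
eascmd delete qwen3_demo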
References
To improve LLM service efficiency, see Deploy an LLM intelligent router.
For more information about EAS features, see EAS overview.