
Auto Scaling: Deploy an elastic and highly available inference model

Last Updated: May 29, 2025

You can quickly create a cross-zone, elastic, and highly available inference model by using Elastic Container Instance scaling groups with Application Load Balancer (ALB).

Solution overview

This solution employs a cross-zone, high-availability architecture to ensure resilience. Elastic container instances equipped with the inference model are deployed across multiple zones within a single region, enabling zone-level disaster recovery. Traffic is intelligently routed through ALB instances for optimal load distribution, while Object Storage Service (OSS) serves as the centralized model warehouse, guaranteeing both business continuity and data reliability. The architecture of the solution is shown in the figure below.

image

Solution benefits

  • Avoidance of SPOF: To prevent single points of failure (SPOF), multiple elastic container instances handle business workloads simultaneously, enhancing system stability and minimizing disruption risks.

  • Automatic scaling: The system uses scaling groups to manage the inference service cluster, dynamically adjusting the number of container instances to enable rapid horizontal scaling. Additionally, it supports auto scaling policies for on-demand expansion based on real-time business load.

Procedure

  1. Plan the cluster network. Set up a virtual private cloud (VPC) and create vSwitches across multiple zones to establish the cluster's foundational network environment.

  2. Create an OSS bucket. The model weight file is stored in this bucket.

  3. Configure an instance RAM role. A Resource Access Management (RAM) role is used to grant an elastic container instance permissions to access the OSS bucket created in Step 2.

  4. Prepare an image cache. Create the image cache in the Elastic Container Instance console to accelerate instance startup, and upload the model weight file to the OSS bucket.

  5. Create an ALB instance. The ALB instance serves as the service access portal.

  6. Create a scaling group. Create a scaling group and associate it with the ALB instance. Ensure that any newly added elastic container instances are automatically registered to the ALB's backend server group.

  7. Enable the inference service. Update the scaling group's expected number of instances, start the inference service, and then wait until the service is fully running.

1. Plan and set up the cluster network

To ensure a reliable cluster network, begin by planning the network infrastructure and then create a VPC and vSwitches according to the design. For high availability, distribute resources such as elastic container instances across multiple zones. This prevents service disruptions in case of a single-zone failure. In this solution, one VPC and two vSwitches are employed to achieve redundancy. To set up the cluster network, perform the following steps:

If you already have a VPC, you can reuse it and skip this step.
  1. Go to the console. Create one VPC and two vSwitches by following the steps in the figures below.

    ①②: Click VPC in the left-side navigation pane and then click Create VPC.

    image

    ③: Set the Region parameter to China (Hangzhou).

    ④: Set the Name parameter to vpc-ess-hangzhou.

    ⑤: Set the IPv4 CIDR Block parameter to 192.168.0.0/16.

    image

    vSwitch 1:

    ⑥: Set the Name parameter to vSwitch-j.

    ⑦: Set the Zone parameter to Hangzhou Zone J.

    ⑧: Set the IPv4 CIDR Block parameter to 192.168.0.0/24.

    ⑨: Click Add to create another vSwitch.

    image

    vSwitch 2:

    ⑩: Set the Name parameter to vSwitch-k.

    ⑪: Set the Zone parameter to Hangzhou Zone K.

    ⑫: Set the IPv4 CIDR Block parameter to 192.168.1.0/24.

    image

  2. Click OK and wait until the VPC is created.
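
If you prefer the command line, the following sketch shows equivalent Alibaba Cloud CLI (aliyun) calls. It assumes a configured CLI profile; the zone IDs cn-hangzhou-j and cn-hangzhou-k correspond to the zones selected above, and <vpc_id> is the VpcId returned by the first call.

    # Create the VPC (sketch; the response contains the VpcId).
    aliyun vpc CreateVpc --RegionId cn-hangzhou --VpcName vpc-ess-hangzhou --CidrBlock 192.168.0.0/16
    # Create one vSwitch in each zone for cross-zone redundancy.
    aliyun vpc CreateVSwitch --RegionId cn-hangzhou --VpcId <vpc_id> --ZoneId cn-hangzhou-j --VSwitchName vSwitch-j --CidrBlock 192.168.0.0/24
    aliyun vpc CreateVSwitch --RegionId cn-hangzhou --VpcId <vpc_id> --ZoneId cn-hangzhou-k --VSwitchName vSwitch-k --CidrBlock 192.168.1.0/24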

2. Create an OSS bucket

Once the network environment is set up, create an OSS bucket to store the model weight file, which will later be accessed by elastic container instances. To create an OSS bucket, perform the following steps:

If you already have an OSS bucket, you can reuse it and skip this step.
  1. Go to the OSS console to create an OSS bucket.

    The following figures show the key parameter settings for this example. Retain the default values for any parameters not displayed.

    ①: Click Create Bucket.

    ②: Enter a name in the Bucket Name text box. You will need this bucket name later when you mount the bucket.

    ③: Set the Region parameter based on your business requirements. Select Specific Region and make sure that the selected region matches that of elastic container instances. In this example, the region is China (Hangzhou).

    Elastic container instances can access OSS buckets within the same region over an internal network, and there are no charges for traffic on this network. For more information, see Access to OSS resources from an ECS instance by using an internal endpoint of OSS.

    image

  2. Click Create.
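
Alternatively, the bucket can be created from the command line. The following is a minimal sketch using the aliyun CLI's ossutil-compatible oss subcommand; the region and credentials come from the CLI profile, and <bucket_name> is your bucket name.

    # Create the bucket (sketch; the CLI profile determines region and credentials).
    aliyun oss mb oss://<bucket_name>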

3. Create an instance RAM role

In this step, an instance RAM role will be created, and access permissions for the OSS bucket will be assigned to it. Elastic container instances can later assume this role to read the model weight file from the OSS bucket. To create an instance RAM role, perform the following steps:

  1. Go to the RAM console to create a RAM role. The following figures show the key parameter settings for this example.

    ①: Click Create Role.

    image

    ②: Set the Principal Type parameter to Cloud Service.

    ③: Set the Principal Name parameter to Elastic Compute Service/ECS.

    ④: Click OK and configure a RAM role name as prompted.

    image

  2. Go to the RAM console to create a custom policy as shown in the following figures.

    ①: Click Create Policy.

    ②: Click the JSON tab.

    image

    image

    This policy grants all the necessary permissions to access an OSS bucket. The policy script is as follows:

    Important

    When configuring the custom policy, replace <bucket_name> with the name of the bucket you actually created.

    {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "oss:*",
                "Resource": [
                    "acs:oss:*:*:<bucket_name>",
                    "acs:oss:*:*:<bucket_name>/*"
                ]
            }
        ]
    }

    Click OK and configure the policy name as prompted.

  3. Go to the console to assign the custom policy to the RAM role.

    ①②: Choose Grants > Grant Permission.

    ③: Set the Principal parameter to the RAM role that is created.

    ④: Set the Policy parameter to the custom policy that is created.

    image

    Click Grant permissions.
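
For reference, the following CLI sketch performs the same RAM setup. The role and policy names are examples; the trust policy lets the ECS service (which elastic container instances use for instance RAM roles) assume the role, and <bucket_name> is replaced as in the policy above.

    # Create the instance RAM role with a trust policy for the ECS service (sketch).
    aliyun ram CreateRole --RoleName eci-oss-role --AssumeRolePolicyDocument '{"Version":"1","Statement":[{"Action":"sts:AssumeRole","Effect":"Allow","Principal":{"Service":["ecs.aliyuncs.com"]}}]}'
    # Create the custom OSS policy shown above and attach it to the role.
    aliyun ram CreatePolicy --PolicyName oss-bucket-access --PolicyDocument '{"Version":"1","Statement":[{"Effect":"Allow","Action":"oss:*","Resource":["acs:oss:*:*:<bucket_name>","acs:oss:*:*:<bucket_name>/*"]}]}'
    aliyun ram AttachPolicyToRole --PolicyType Custom --PolicyName oss-bucket-access --RoleName eci-oss-role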

4. Prepare an image cache and a model weight file

In this solution, the image file used by elastic container instances is large. To speed up instance startup, create an image cache in the Elastic Container Instance console. For faster model loading, you must also download the model weight file to the OSS bucket.

  1. Go to the console to create an image cache.

    ①②: Choose Image Cache > Create Image Cache.

    image

    ③④: Set the Region and Zone parameters. Ensure that the region matches the region of the VPC created in Step 1 and that the zone matches the zone of one of the vSwitches created in Step 1.

    ⑤⑥: Set the Network Type parameters. Ensure that the VPC matches the one created in Step 1 and that the vSwitch is one of the vSwitches created in Step 1.

    ⑦: Set the EIP parameter to an existing elastic IP address (EIP). If no EIP exists, create one in the EIP console. After creation is complete, click the refresh icon to update the drop-down list and select the EIP.

    ⑧: Set the Security Group parameter to an existing security group. If no security group exists, click Create Security Group to create one. Once creation is complete, return to this page and select the security group.

    ⑨: Set the Image Cache Name parameter to vllm.

    ⑩: Set the Cache Size parameter to 100 GB.

    ⑪: Set the Image parameter to egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/vllm.

    ⑫: Set the Version Number parameter to 0.6.4.post1-pytorch2.5.1-cuda12.4-ubuntu22.04.

    image

    image

    Wait for the image cache to be created. This takes about 15 minutes. You can check the creation progress on the Image Cache page in the console.

    image

  2. Launch a temporary elastic container instance in the console to download the model weight file to the OSS bucket.

    ①②: Choose Container Group > Create Container Group.

    image

    ③: Set the Billing Method parameter to Pay-as-you-go.

    ④: Set the Region parameter to match the VPC's region from Step 1. In this example, the region is China (Hangzhou).

    ⑤: Set the VPC parameter to the VPC created in Step 1.

    ⑥: Set the vSwitch parameter to one of the vSwitches created in Step 1.

    ⑦: Set the Security Group parameter as prompted.

    image

    Container Group Configurations:

    ⑧: Set the vCPU parameter to 2 vCPUs.

    ⑨: Set the Memory parameter to 4 GiB.

    ⑩: Set the After containers run and exit parameter to Upon Failure.

    image

    Advanced Settings:

    ⑪: Expand the Advanced Settings section.

    ⑫: Select the Automatically Match Image Cache check box.

    ⑬: Click the OSS Persistence tab.

    ⑭: Set the Name parameter to oss-data.

    ⑮: Set the Bucket parameter to the bucket created in Step 2.

    ⑯: Set the RAM Role parameter to the instance RAM role created in Step 3.

    ⑰: Set the Ephemeral Storage parameter to 100 GiB.

    image

    Container Configurations:

    ⑱: Set the Image parameter to egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/vllm.

    ⑲: Click Select Image Tag to select 0.6.4.post1-pytorch2.5.1-cuda12.4-ubuntu22.04.

    ⑳: Copy the following startup command to the Startup Command text box as shown in the following figure.

    /bin/bash
    -c
    git-lfs clone https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git /oss-data/DeepSeek-R1-Distill-Qwen-7B
    Command meaning: Clone the model from the ModelScope repository by using git-lfs and save it to the /oss-data/DeepSeek-R1-Distill-Qwen-7B directory.

    image

    container-1 Advanced Settings:

    ㉑: Expand the container-1 Advanced Settings section.

    ㉒: Set the vCPU parameter to 2 vCPUs.

    ㉓: Set the Memory parameter to 4 GiB.

    ㉔: Enable Storage and click Add to mount the OSS bucket.

    ㉕: Set the Select Volume parameter to oss-data.

    ㉖: Set the Mount Path parameter to /oss-data.

    ㉗: Click Next: Other Settings.

    image

    ㉘: Set the EIP parameter to Auto Create.

    ㉙: Set the Maximum Bandwidth parameter to 200 Mbps.

    ㉚: Click Confirm Configuration and complete instance creation as prompted.

    image

    Once the elastic container instance is created, the model weight file will be automatically downloaded. You can proceed with Steps 5 and 6 while waiting for the download to complete.

    How do I determine if the download is complete?

    If the download is complete, the status of the elastic container instance is Succeeded in the Elastic Container Instance console.

    image

    In the OSS console, a folder named DeepSeek-R1-Distill-Qwen-7B appears in the OSS bucket.

    image
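
You can also verify the download from the command line. The following sketch lists the model directory in the bucket; replace <bucket_name> with your bucket name.

    # List the cloned model directory to confirm the weight files are in the bucket (sketch).
    aliyun oss ls oss://<bucket_name>/DeepSeek-R1-Distill-Qwen-7B/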

5. Create an ALB instance

Before creating an Elastic Container Instance cluster, you must first set up an ALB instance as the access portal. To create an ALB instance, perform the following steps:

  1. Go to the console to create an ALB instance.

    ①②: Choose Instances > Create ALB.

    image

    ③: Set the Region parameter to match the VPC's region from Step 1. In this example, the region is China (Hangzhou).

    ④: Set the Network Type parameter to Internet-facing to enable Internet access.

    ⑤: Set the VPC parameter to the VPC created in Step 1.

    ⑥: Set the Zone parameter and select the vSwitches created in Step 1.

    ⑦: Set the Instance Name parameter to alb-eci-deepseek-7B.

    ⑧: Click Buy Now and complete instance creation as prompted.

    image

    image

  2. Go to the console to configure a listener and backend server group.

    ①②: Find the previously created ALB instance and click Create Listener in the Actions column.

    If no ALB is listed, try selecting a different region from the drop-down list in the upper-left corner of the page.

    image

    ③: Set the Listener Protocol parameter to HTTP.

    ④: Set the Listener Port parameter to 80.

    ⑤: Click Next.

    ⑥: Click Create Server Group.

    image

    image

    ⑦: Set the Server Group Type parameter to Server.

    ⑧: Set the Server Group Name parameter to eci-deepseek-7B.

    ⑨: Set the VPC parameter to the VPC created in Step 1. This value is auto-filled.

    ⑩: Set the Backend Server Protocol parameter to HTTP.

    ⑪⑫: Enable Health Check and click Modify to modify the health check settings as needed.

    ⑬: Set the Health Check Method parameter to GET.

    ⑭: Set the Health Check Path parameter to /health. After the inference service starts, ALB sends requests to the /health path to check service status.

    ⑮⑯⑰: Click Create, Create, Next, and then Submit.

    image

    image

    image

    image

    image
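
After the ALB instance is created, note its domain name, which serves as the service endpoint later. As a sketch, the domain name can also be retrieved via the CLI, assuming <alb_instance_id> is the ID of the ALB instance created above.

    # Query the ALB instance; the DNSName field in the response is the access domain name (sketch).
    aliyun alb GetLoadBalancerAttribute --LoadBalancerId <alb_instance_id>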

6. Create a scaling group

After you create a scaling group and associate it with the ALB instance, Auto Scaling automatically provisions and manages elastic container instances and adds them to the ALB backend server group for load balancing.

  1. Go to the Auto Scaling console to create a scaling group and associate it with the ALB instance.

    ①②③: Choose Scaling Groups > Create.

    image

    ④: Set the Scaling Group Name parameter to deepseek-7B-servers.

    ⑤: Set the Type parameter to ECI.

    ⑥: Set the Instance Configuration Source parameter to Create from Scratch.

    ⑦: Set the Minimum Number of Instances parameter to 0. This parameter specifies the lower limit for the number of instances in the scaling group.

    ⑧: Set the Maximum Number of Instances parameter to 10. This parameter specifies the upper limit for the number of instances in the scaling group.

    ⑨: Set the VPC parameter to the VPC created in Step 1.

    ⑩: Set the vSwitch parameter to the vSwitches created in Step 1.

    ⑪: Click Show Advanced Settings.

    image

    Advanced Settings:

    ⑫⑬: Set the Expected Number of Instances parameter to 0.

    ⑭: Click Add Server Group and set the Type parameter to ALB.

    ⑮: Set the Server Group parameter to the server group created in Step 5.

    ⑯: Set the Port Number parameter to 30000. The inference service, deployed in elastic container instances, exposes this port for external access.

    ⑰: Click Create. Wait for the scaling group to be created, and then follow the prompts to set up a scaling configuration.

    image

    image
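
    For reference, the following CLI sketch creates an equivalent scaling group and registers the ALB server group. The parameter names, including GroupType ECI, are assumptions based on the Auto Scaling API; all IDs are placeholders.

    # Create the ECI-type scaling group across both vSwitches (sketch).
    aliyun ess CreateScalingGroup --RegionId cn-hangzhou --ScalingGroupName deepseek-7B-servers --GroupType ECI --MinSize 0 --MaxSize 10 --DesiredCapacity 0 --VSwitchIds.1 <vswitch_j_id> --VSwitchIds.2 <vswitch_k_id>
    # Register the ALB server group so that new instances are attached on port 30000.
    aliyun ess AttachAlbServerGroups --RegionId cn-hangzhou --ScalingGroupId <scaling_group_id> --ForceAttach true --AlbServerGroups.1.AlbServerGroupId <server_group_id> --AlbServerGroups.1.Port 30000 --AlbServerGroups.1.Weight 100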

  2. Go to the Auto Scaling console to create a scaling configuration.

    A scaling configuration defines the template for instances in a scaling group. When scaling out, new instances are launched by using this configuration as their blueprint. To create a scaling configuration, perform the following steps:

    ①②③: Find the previously created scaling group and click its ID to enter the scaling group details page.

    image

    ④⑤⑥: Choose Instance Configuration Sources > Scaling Configurations > Create Scaling Configuration.

    image

    ⑦: Set the Billing Method parameter to Pay-as-you-go.

    ⑧: Set the Security Group parameter to an existing security group.

    image

    Container Group Configurations:

    ⑨: Click Specify Instance Type.

    ⑩: Set the Instance Type parameter to ecs.gn7i-c8g1.2xlarge.

    ⑪: Select the Automatically Match Image Cache check box.

    ⑫: Expand the Advanced Settings section.

    ⑬⑭⑮⑯: Click the OSS Persistence tab. Set the Bucket and RAM Role parameters.

    ⑰: Set the Ephemeral Storage parameter to 100 GiB.

    ⑱: Set the GPU Driver Version parameter to tesla=550.

    image

    Container Configurations:

    ⑲: Set the Image parameter to egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/vllm.

    ⑳: Click Select Image Tag to select 0.6.4.post1-pytorch2.5.1-cuda12.4-ubuntu22.04.

    ㉑: Copy the following startup command to the Startup Command text box as shown in the following figure.

    /bin/bash
    -c
    vllm serve /oss-data/DeepSeek-R1-Distill-Qwen-7B --port 30000 --served-model-name DeepSeek-R1-Distill-Qwen-7B --tensor-parallel-size 1 --max-model-len=16384 --enforce-eager --dtype=half --api-key api-key-example-abc123
    This command loads the model weight file from the OSS bucket and starts the inference service on port 30000, protected by the API key api-key-example-abc123.

    image

    ㉒: Expand the container-1 Advanced Settings section.

    ㉓: Set the vCPU parameter to 8 vCPUs.

    ㉔: Set the Memory parameter to 30 GiB.

    ㉕: Set the GPU parameter to 1.

    ㉖㉗: Enable Storage and click Add to mount the OSS bucket.

    ㉘: Set the Mount Path parameter to /oss-data.

    ㉙: Click Next: Other Settings.

    image

    ㉚: Set the EIP parameter to Auto Create.

    ㉛: Set the Maximum Bandwidth parameter to 200 Mbps.

    ㉜: Click Confirm Configuration and complete scaling configuration creation as prompted.

    image

    ㉝㉞㉟: Enable the scaling group and scaling configuration as prompted.

    image

    image

    image
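
Enabling can also be done from the command line. The following sketch assumes the scaling group and scaling configuration IDs from the previous steps.

    # Enable the scaling group with the new scaling configuration as its active template (sketch).
    aliyun ess EnableScalingGroup --ScalingGroupId <scaling_group_id> --ActiveScalingConfigurationId <scaling_configuration_id>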

7. Enable the inference service

Important

Before proceeding, ensure that the model weight file download started in Step 4 has completed.

After completing all prior steps, modify the scaling group's expected number of instances to initiate a scale-out event. The figures below illustrate how to scale out to five elastic container instances.

Note that after you adjust the expected number of instances, it may take some time for the instances to be created. You can track the progress on the Scaling Activities tab of the scaling group.

image

image
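
The same scale-out can be triggered from the command line, and the service can be probed once instances pass the ALB health check. The following sketch assumes the scaling group ID and the ALB domain name from the previous steps.

    # Raise the expected number of instances to 5 to trigger a scale-out (sketch).
    aliyun ess ModifyScalingGroup --ScalingGroupId <scaling_group_id> --DesiredCapacity 5
    # Track the resulting scaling activities.
    aliyun ess DescribeScalingActivities --ScalingGroupId <scaling_group_id>
    # Once instances are healthy, the health check path should return HTTP 200.
    curl -i http://<alb_domain_name>/health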

What to do next

Use Dify to communicate with the large model

  1. Log on to Dify.

  2. Add the model provider.

    1. Click your profile picture and then click Settings. In the left-side navigation pane, click Model Provider. Find OpenAI-API-compatible and click Add.

      If no model provider is installed, install one as prompted.

      image

      image

    2. Configure the following parameters and then click Save.

      • Model Type: LLM.

      • Model Name: DeepSeek-R1-Distill-Qwen-7B.

      • API Key: api-key-example-abc123, matching the api-key configuration set in the startup command for the container group in Step 6.

      • API endpoint URL:

        Important

        Replace <alb_domain_name> with the domain name of your ALB instance.

        View the domain name of the ALB instance

        Go to the console and find the ALB instance created in Step 5. The following figure shows the domain name of the ALB instance.

        image

        http://<alb_domain_name>/v1

  3. Create a Q&A assistant and communicate with it.

    image

    image

    image
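
You can also verify the endpoint and API key outside Dify. The following curl sketch uses the model name and API key from the startup command in Step 6; replace <alb_domain_name> with your ALB domain name.

    # List the served models through the OpenAI-compatible API (sketch).
    curl http://<alb_domain_name>/v1/models -H "Authorization: Bearer api-key-example-abc123"
    # Send a test chat completion to the deployed model.
    curl http://<alb_domain_name>/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer api-key-example-abc123" \
        -d '{"model": "DeepSeek-R1-Distill-Qwen-7B", "messages": [{"role": "user", "content": "Hello"}]}'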

Release resources

To avoid ongoing billing, you can follow these steps to release the resources used in this topic:

  1. Delete the scaling group. (If you delete the scaling group created in Step 6, the instances that were automatically created will also be deleted.)

  2. Release the ALB instance and remove the server group created in Step 5.

  3. Release the pay-as-you-go EIP created in Step 4.

  4. Delete the image cache and the elastic container instance created in Step 4.

  5. Delete the instance RAM role and the custom policy created in Step 3.

  6. Delete the OSS bucket created in Step 2.

  7. Delete the VPC created in Step 1.

Apply the solution to a production environment

For production use, consider these architecture improvements:

  • Implement an auto-scaling mechanism to automatically adjust the number of instances in the scaling group based on business load, optimizing costs.

For example, you can set up event-triggered tasks to enable custom scaling based on GPU utilization or ALB backend server QPS.
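
As a hedged sketch, an event-triggered scale-out can be wired up with a simple scaling rule plus an alarm task. The metric name CpuUtilization below is an illustrative system metric; GPU utilization or ALB QPS would be connected through custom or CloudMonitor metrics, and the exact metric names are assumptions.

    # Create a simple scaling rule that adds two instances per trigger (sketch).
    aliyun ess CreateScalingRule --ScalingGroupId <scaling_group_id> --ScalingRuleName scale-out-2 --AdjustmentType QuantityChangeInCapacity --AdjustmentValue 2
    # Create an event-triggered (alarm) task that fires the rule when the metric stays high.
    aliyun ess CreateAlarm --Name load-high --ScalingGroupId <scaling_group_id> --MetricName CpuUtilization --Statistics Average --ComparisonOperator ">=" --Threshold 80 --EvaluationCount 3 --AlarmActions.1 <scaling_rule_ari>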

  • Use a NAT gateway to avoid configuring an EIP for each instance.

In this example, the elastic container instance uses the configured EIP to pull public images. In production, a NAT gateway can be used to provide Internet access for elastic container instances.
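
The following CLI sketch outlines the NAT gateway setup; all IDs are placeholders, and a real deployment would add SNAT entries for both vSwitches.

    # Create an enhanced NAT gateway in the VPC (sketch).
    aliyun vpc CreateNatGateway --RegionId cn-hangzhou --VpcId <vpc_id> --VSwitchId <vswitch_id> --NatType Enhanced
    # Bind an EIP to the NAT gateway for Internet egress.
    aliyun vpc AssociateEipAddress --AllocationId <eip_allocation_id> --InstanceId <nat_gateway_id> --InstanceType Nat
    # Add an SNAT entry so instances in the vSwitch share the EIP for outbound traffic.
    aliyun vpc CreateSnatEntry --RegionId cn-hangzhou --SnatTableId <snat_table_id> --SourceVSwitchId <vswitch_id> --SnatIp <eip_address>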

  • Set up a domain name for the ALB instance and enable HTTPS.

We recommend using a custom domain name for the cluster access portal and enabling HTTPS for improved security. For more information, see Add a CNAME record to an ALB instance and Add an HTTPS listener.