All Products
Search
Document Center

Platform For AI:EAS FAQ

Last Updated:Mar 10, 2026

Troubleshoot common EAS deployment, resource, scaling, invocation, and permission issues.

Deployment and instance status

After deployment, check status, logs, and events for each instance in the Service Instances list on the Overview page to diagnose issues.

Service stuck in Waiting state

A service stays in Waiting while EAS schedules resources and starts instances. It moves to Running once all instances are up. Common causes:

1. Insufficient resources (instances show Pending)

The dedicated resource group lacks enough idle CPU, memory, or GPU on a single node. For example, if an instance needs 3 CPU cores and 4 GB memory, at least one node must have that much free.

8da00c3a5f2ebf7f0a110686ae473103
Important

Each node reserves 1 CPU core for system components. Schedulable resources = total node resources minus the reserved core.

Check the node list in your dedicated resource group for available resources. For details, see Use EAS resource groups.

image

2. Health check incomplete (container status shows [0/1] or [1/2])

The number before the slash is the count of successfully started containers; the number after is the total. When deploying with a custom image, EAS injects a sidecar container for traffic shaping and monitoring, so the total is 2 (your container + the sidecar). The instance receives traffic only after both containers reach Ready state.

ab1dfe90bbf7d3056c52b2dfbea196ce

3. Health check fails due to port mismatch

This happens when your code listens on one port but EAS health-checks a different one. EAS expects port 8089 by default. If your Flask, FastAPI, Sanic, or Django app runs on a different port (for example, 7000), the health check fails and the service stays in Waiting.

imageimageimage

Solution: Change the EAS service port to match the port in your code, then restart the service.

image

Service in Failed state

A service enters Failed in two scenarios:

  • During deployment: A required resource (such as a model path) does not exist. The error appears in the service status information.

  • During startup: Instances were scheduled but crashed. The status message looks like: Instance <network-test-5ff76448fd-h9dsn> not healthy: Instance crashed, please inspect instance log.

Open the Service Instances list on the Overview page and check the failed instance. Common causes:

  • Out of memory (OOM): The instance was killed during startup because it ran out of memory. Increase allocated memory and redeploy. 5c694af1e97b7d3c11cea6d6303d1540

  • Code error: Last Status shows Error with an error code. Click Log in the Actions column to view details. 93d610794f407ebfc40a862fe47a9069

  • Image pull failure: See ImagePullBackOff error below.

ImagePullBackOff error

If Last Exit Reason in the service instance list shows ImagePullBackOff, the image pull failed. Click the icon in the Status column for details.

image

Common causes and solutions:

Cause Error message Solution
System disk full no space left on device Expand the system disk.
ACR access control not configured no such host For public endpoints, enable public access for ACR. For internal endpoints: (1) Add a VPC such as eas_vpc for EAS. (2) Add eas_vpc to ACR Enterprise Edition access control settings. See Configure access control for a VPC for ACR.
EAS network misconfiguration dial tcp ***** timeout Configure Internet access for EAS.
Missing or wrong credentials 401 Unauthorized or authorization failed If the ACR Enterprise Edition instance does not allow anonymous pulls and you are pulling from another region over the Internet, configure the image repository username and password during deployment. See Configure access credentials.

Choose between public and internal image URLs:

  • Same region: Use the internal URL.

  • Different regions:

For ACR Enterprise Edition instances, configure access control for VPCs and Internet as needed. If a repository does not allow anonymous pulls and you pull from a different region over Internet, provide the image repository username and password in EAS.

Stopped service restarts automatically

This happens when Auto Scaling is configured with a minimum instance count of 0. After a period with no traffic, EAS scales instances down to 0. When a new request arrives, scale-out triggers automatically, even if the configured metric threshold has not been reached.

Check the auto scaling description in deployment events to confirm.

Prevention options:

  • Delete the service if no longer needed.

  • Manually stop the service by clicking Stop in the console or calling the StopService API. A manually stopped service does not auto-scale.

  • Set minimum instance count to a value greater than 0.

  • Disable Auto Scaling entirely.

No space left on device (StorageFull error)

Full error message:

[2024-10-21 20:59:33] serialize_file(_flatten(tensors), filename, metadata=metadata)
[2024-10-21 20:59:33] safetensors_rust.SafetensorError: Error while serializing: IoError(Os { code: 28, kind: StorageFull, message: "No space left on device" })
[2024-10-21 20:59:35] time="2024-10-21T12:59:35Z" level=info msg="program stopped with status:exit status 1" program=/bin/sh

The system disk is full because model files are too large.

Solutions:

  1. Expand the system disk for EAS instance.

  2. Move models to external storage (OSS or NAS) and load them through storage mounts.

fork/exec /bin/sh: exec format error

The exec format error means the CPU architecture of your executable or container image does not match the host system. Switch to a different resource specification.

Invalid GPU count 6, only supported: [0 1 2 4 8 16]

GPU counts must be a power of 2 for efficient multi-GPU communication. Valid values: 0, 1, 2, 4, 8, or 16.

Resources

Resource specifications and quotas

1-core, 2 GB specification unavailable

The 1-core, 2 GB resource specification is no longer available. EAS deploys system components on each node, and on small machines these components consume too large a share of available resources, leaving too little for your service.

Service deployment capacity

This depends on available resources. Check the machine list of your resource group in the console. See Use EAS resource groups.

For CPU-based allocation: (Total CPU cores on node - 1) / cores per instance.

4090 GPU equivalent specification

The ecs.gn8ia-2x.8xlarge specification offers performance similar to an NVIDIA 4090.

Model maximum concurrency

There is no fixed answer. It depends on the model, workload, and resource configuration. Use Automatic stress testing to benchmark your specific setup.

Dedicated resource groups

Dedicated resource group continuous scale-out

Usually because the current region lacks sufficient resources. For subscription-based instances, if creation fails due to insufficient resources, a refund order is automatically created and payment is returned to your account.

Delete subscription instance from resource group

Go to the Alibaba Cloud Unsubscribe page and:

  1. Set Type to Partial Refund.

  2. Set Product Name to EAS Dedicated Machine Subscription.

Click Search to find the resource, then click Unsubscribe Resource in the Actions column and follow instructions.

Service instance data retention after unsubscribe

No. Service instance data is not retained after unsubscribe.

System disk

Expand system disk

Two methods:

  1. Console: Go to Resource Information > Configure System Disk and set the System Disk size. image

  2. JSON configuration: Set the disk value under metadata:

       "metadata": {"disk": "40Gi"}
Note

For dedicated resource groups, the configured system disk size cannot exceed the node's system disk size. To get a larger system disk, release the current node and purchase a new one with a larger disk.

Scaling and updates

Supported scaling policies

Two options: horizontal auto-scaling and scheduled scaling.

Horizontal auto-scaling triggers based on metrics such as queries per second (QPS) or CPU utilization. For configuration details, see Horizontal auto-scaling.

To prevent thrashing, EAS applies a 10% tolerance to the threshold. If QPS threshold is 10, scale-out typically triggers only when QPS stays consistently above 11 (10 x 1.1). Brief fluctuations between 10 and 11 do not trigger scale-out.

Scaled-out instance placement

If you have a dedicated resource group with an elastic resource pool configured, EAS scales out to public resources when the dedicated group is full.

Update service without downtime

Combine these features:

  1. Rolling updates: Configure under Service Features > Stability Assurance. See Rolling updates and graceful exit. image

  2. Elastic resource pool: Lets overflow instances deploy to a pay-as-you-go public resource group when the dedicated group is full. See Elastic resource pools.

  3. High-priority resource descheduling: When space opens up in the dedicated group (for example, old instances are destroyed), EAS automatically moves instances back from the public resource group to save costs.

Service invocation

Invocation error troubleshooting

Check the returned status code. For a complete list, see Appendix: Service status codes and common errors.

HTTPS support

Yes. Replace http:// with https:// in the service endpoint. If the client (for example, Python requests) reports an SSL certificate validation error, that is a client-side environment issue, not an EAS problem.

Force HTTPS access

  • Shared gateway: No. Forced HTTPS is not supported.

  • Dedicated gateway: Yes. Enable HTTPS Redirection in the dedicated gateway configuration. All HTTP requests redirect to HTTPS. image

Custom domain name support

Yes. Create a fully managed dedicated gateway and configure your custom domain name in it. See Use a dedicated gateway.

Service token expiration

No. The Token generated after deployment is long-lived. Restarts, updates, and scaling do not change it. It becomes invalid only if you manually reset the token or delete the service.

Multiple tokens per service

No. Each EAS service supports only one authentication Token. For multi-user permission management or separate metering, use Alibaba Cloud RAM authentication instead.

Enable streaming responses for LLM service

EAS has no global streaming toggle. Add "stream": true to the JSON request body of each API call. For example, when calling an LLM service compatible with OpenAI format, include it in the request body.

VPC endpoint invocation vs VPC direct connection

  • VPC endpoint invocation: Requests go through an internal-facing SLB (Layer 4) and a gateway (Layer 7) before reaching your service instance. This is the standard path. The gateway has a default bandwidth limit of 1 Gbps, and the extra hops cause some performance overhead under high traffic.

  • VPC direct connection: A direct network path between your VPC and EAS service VPC. Requests use EAS service discovery to locate the service, then load-balance from the client side. No extra cost. Requires the EAS SDK with endpoint_type set to DIRECT.

Example using the Python SDK:

client = PredictClient('http://pai-eas-vpc.cn-hangzhou.aliyuncs.com', 'mnist_saved_model_example')
client.set_token('M2FhNjJlZDBmMzBmMzE4NjFiNzZhMmUxY2IxZjkyMDczNzAzYjFi****')
client.set_endpoint_type(ENDPOINT_TYPE_DIRECT)  # Direct link
client.init()

Permissions and network

RAM user cannot create or delete EAS service-linked role

Only users with the right permissions can automatically create or delete the AliyunServiceRoleForPaiEas service-linked role.

Access policy for creating or deleting a service-linked role

{
  "Statement": [
    {
      "Action": "ram:CreateServiceLinkedRole",
      "Resource": "*",
      "Effect": "Allow",
      "Condition": {
        "StringEquals": {
          "ram:ServiceName": "eas.pai.aliyuncs.com"
        }
      }
    }
  ],
  "Version": "1"
}
  1. Create a custom policy with the script above. See Create a custom permission policy.

  2. Attach the policy to the RAM user. See Manage the permissions of a RAM user.

EAS service Internet access

By default, EAS services cannot reach the public Internet. Configure a VPC with Internet access for the service. See Access public or private resources from EAS.

Service management

SSH access to EAS instance

No. EAS is a managed service and does not provide SSH access to containers. To run commands at container startup, specify them in the Run Command field in the service configuration.

EAS service statuses

View the Service Status column on the Elastic Algorithm Service (EAS) page. Possible statuses:

Status Meaning
Creating Service is being created.
Waiting Waiting for instances to start.
Starting Service is starting.
Running Service is running normally.
Updating Service is updating and instances will be updated.
HotUpdate Service is being updated. This is a hot update, and instances are not updated.
Scaling Instances are being scaled.
Stopping Service is being stopped.
Stopped Service is stopped.
Failed Service has failed.
Deleting Service is being deleted.
DeleteFailed Service deletion failed.
Pending Waiting for a specific action.
Completed Task is complete.
Preparing Service is preparing.

Find which RAM user created a service

Query events in the ActionTrail console. Set the event name to CreateService. See Query events in the ActionTrail console.

Download official EAS images

No. PAI official images are internal platform images. They are available only within the PAI platform and cannot be downloaded from outside the platform's containers.

Storage and configuration

Cannot select OSS bucket during deployment

The OSS bucket and NAS file system must be in the same region as the EAS service. If they are in a different region, you cannot select them. Ensure your storage resources are in the same region when configuring models and code using storage mounts.

TensorFlow issues

See TensorFlow FAQ.