Troubleshoot common EAS deployment, resource, scaling, invocation, and permission issues.
Deployment and instance status
After deployment, check status, logs, and events for each instance in the Service Instances list on the Overview page to diagnose issues.
Service stuck in Waiting state
A service stays in Waiting while EAS schedules resources and starts instances. It moves to Running once all instances are up. Common causes:
Service in Failed state
A service enters Failed in two scenarios:
-
During deployment: A required resource (such as a model path) does not exist. The error appears in the service status information.
-
During startup: Instances were scheduled but crashed. The status message looks like:
Instance <network-test-5ff76448fd-h9dsn> not healthy: Instance crashed, please inspect instance log.
Open the Service Instances list on the Overview page and check the failed instance. Common causes:
-
Out of memory (OOM): The instance was killed during startup because it ran out of memory. Increase allocated memory and redeploy.

-
Code error: Last Status shows Error with an error code. Click Log in the Actions column to view details.

-
Image pull failure: See ImagePullBackOff error below.
ImagePullBackOff error
If Last Exit Reason in the service instance list shows ImagePullBackOff, the image pull failed. Click the icon in the Status column for details.
Common causes and solutions:
| Cause | Error message | Solution |
|---|---|---|
| System disk full | no space left on device |
Expand the system disk. |
| ACR access control not configured | no such host |
For public endpoints, enable public access for ACR. For internal endpoints: (1) Add a VPC such as eas_vpc for EAS. (2) Add eas_vpc to ACR Enterprise Edition access control settings. See Configure access control for a VPC for ACR. |
| EAS network misconfiguration | dial tcp ***** timeout |
Configure Internet access for EAS. |
| Missing or wrong credentials | 401 Unauthorized or authorization failed |
If the ACR Enterprise Edition instance does not allow anonymous pulls and you are pulling from another region over the Internet, configure the image repository username and password during deployment. See Configure access credentials. |
Choose between public and internal image URLs:
-
Same region: Use the internal URL.
-
Different regions:
-
ACR Personal Edition: Use public image URL.
-
ACR Enterprise Edition: For better security and stability, use internal URL and connect VPCs with Cloud Enterprise Network (CEN). See Access an ACR Enterprise Edition instance from a different region or an IDC. As a temporary fallback, public URL works but downloads are slower.
-
For ACR Enterprise Edition instances, configure access control for VPCs and Internet as needed. If a repository does not allow anonymous pulls and you pull from a different region over Internet, provide the image repository username and password in EAS.
Stopped service restarts automatically
This happens when Auto Scaling is configured with a minimum instance count of 0. After a period with no traffic, EAS scales instances down to 0. When a new request arrives, scale-out triggers automatically, even if the configured metric threshold has not been reached.
Check the auto scaling description in deployment events to confirm.
Prevention options:
-
Delete the service if no longer needed.
-
Manually stop the service by clicking Stop in the console or calling the
StopServiceAPI. A manually stopped service does not auto-scale. -
Set minimum instance count to a value greater than 0.
-
Disable Auto Scaling entirely.
No space left on device (StorageFull error)
Full error message:
[2024-10-21 20:59:33] serialize_file(_flatten(tensors), filename, metadata=metadata)
[2024-10-21 20:59:33] safetensors_rust.SafetensorError: Error while serializing: IoError(Os { code: 28, kind: StorageFull, message: "No space left on device" })
[2024-10-21 20:59:35] time="2024-10-21T12:59:35Z" level=info msg="program stopped with status:exit status 1" program=/bin/sh
The system disk is full because model files are too large.
Solutions:
-
Expand the system disk for EAS instance.
-
Move models to external storage (OSS or NAS) and load them through storage mounts.
fork/exec /bin/sh: exec format error
The exec format error means the CPU architecture of your executable or container image does not match the host system. Switch to a different resource specification.
Invalid GPU count 6, only supported: [0 1 2 4 8 16]
GPU counts must be a power of 2 for efficient multi-GPU communication. Valid values: 0, 1, 2, 4, 8, or 16.
Resources
Resource specifications and quotas
1-core, 2 GB specification unavailable
The 1-core, 2 GB resource specification is no longer available. EAS deploys system components on each node, and on small machines these components consume too large a share of available resources, leaving too little for your service.
Service deployment capacity
This depends on available resources. Check the machine list of your resource group in the console. See Use EAS resource groups.
For CPU-based allocation: (Total CPU cores on node - 1) / cores per instance.
4090 GPU equivalent specification
The ecs.gn8ia-2x.8xlarge specification offers performance similar to an NVIDIA 4090.
Model maximum concurrency
There is no fixed answer. It depends on the model, workload, and resource configuration. Use Automatic stress testing to benchmark your specific setup.
Dedicated resource groups
Dedicated resource group continuous scale-out
Usually because the current region lacks sufficient resources. For subscription-based instances, if creation fails due to insufficient resources, a refund order is automatically created and payment is returned to your account.
Delete subscription instance from resource group
Go to the Alibaba Cloud Unsubscribe page and:
-
Set Type to Partial Refund.
-
Set Product Name to EAS Dedicated Machine Subscription.
Click Search to find the resource, then click Unsubscribe Resource in the Actions column and follow instructions.
Service instance data retention after unsubscribe
No. Service instance data is not retained after unsubscribe.
System disk
Expand system disk
Two methods:
-
Console: Go to Resource Information > Configure System Disk and set the System Disk size.

-
JSON configuration: Set the
diskvalue undermetadata:"metadata": {"disk": "40Gi"}
For dedicated resource groups, the configured system disk size cannot exceed the node's system disk size. To get a larger system disk, release the current node and purchase a new one with a larger disk.
Scaling and updates
Supported scaling policies
Two options: horizontal auto-scaling and scheduled scaling.
Horizontal auto-scaling triggers based on metrics such as queries per second (QPS) or CPU utilization. For configuration details, see Horizontal auto-scaling.
To prevent thrashing, EAS applies a 10% tolerance to the threshold. If QPS threshold is 10, scale-out typically triggers only when QPS stays consistently above 11 (10 x 1.1). Brief fluctuations between 10 and 11 do not trigger scale-out.
Scaled-out instance placement
If you have a dedicated resource group with an elastic resource pool configured, EAS scales out to public resources when the dedicated group is full.
Update service without downtime
Combine these features:
-
Rolling updates: Configure under Service Features > Stability Assurance. See Rolling updates and graceful exit.

-
Elastic resource pool: Lets overflow instances deploy to a pay-as-you-go public resource group when the dedicated group is full. See Elastic resource pools.
-
High-priority resource descheduling: When space opens up in the dedicated group (for example, old instances are destroyed), EAS automatically moves instances back from the public resource group to save costs.
Service invocation
Invocation error troubleshooting
Check the returned status code. For a complete list, see Appendix: Service status codes and common errors.
HTTPS support
Yes. Replace http:// with https:// in the service endpoint. If the client (for example, Python requests) reports an SSL certificate validation error, that is a client-side environment issue, not an EAS problem.
Force HTTPS access
-
Shared gateway: No. Forced HTTPS is not supported.
-
Dedicated gateway: Yes. Enable HTTPS Redirection in the dedicated gateway configuration. All HTTP requests redirect to HTTPS.

Custom domain name support
Yes. Create a fully managed dedicated gateway and configure your custom domain name in it. See Use a dedicated gateway.
Service token expiration
No. The Token generated after deployment is long-lived. Restarts, updates, and scaling do not change it. It becomes invalid only if you manually reset the token or delete the service.
Multiple tokens per service
No. Each EAS service supports only one authentication Token. For multi-user permission management or separate metering, use Alibaba Cloud RAM authentication instead.
Enable streaming responses for LLM service
EAS has no global streaming toggle. Add "stream": true to the JSON request body of each API call. For example, when calling an LLM service compatible with OpenAI format, include it in the request body.
VPC endpoint invocation vs VPC direct connection
-
VPC endpoint invocation: Requests go through an internal-facing SLB (Layer 4) and a gateway (Layer 7) before reaching your service instance. This is the standard path. The gateway has a default bandwidth limit of 1 Gbps, and the extra hops cause some performance overhead under high traffic.
-
VPC direct connection: A direct network path between your VPC and EAS service VPC. Requests use EAS service discovery to locate the service, then load-balance from the client side. No extra cost. Requires the EAS SDK with endpoint_type set to DIRECT.
Example using the Python SDK:
client = PredictClient('http://pai-eas-vpc.cn-hangzhou.aliyuncs.com', 'mnist_saved_model_example')
client.set_token('M2FhNjJlZDBmMzBmMzE4NjFiNzZhMmUxY2IxZjkyMDczNzAzYjFi****')
client.set_endpoint_type(ENDPOINT_TYPE_DIRECT) # Direct link
client.init()
Permissions and network
RAM user cannot create or delete EAS service-linked role
Only users with the right permissions can automatically create or delete the AliyunServiceRoleForPaiEas service-linked role.
-
Create a custom policy with the script above. See Create a custom permission policy.
-
Attach the policy to the RAM user. See Manage the permissions of a RAM user.
EAS service Internet access
By default, EAS services cannot reach the public Internet. Configure a VPC with Internet access for the service. See Access public or private resources from EAS.
Service management
SSH access to EAS instance
No. EAS is a managed service and does not provide SSH access to containers. To run commands at container startup, specify them in the Run Command field in the service configuration.
EAS service statuses
View the Service Status column on the Elastic Algorithm Service (EAS) page. Possible statuses:
| Status | Meaning |
|---|---|
| Creating | Service is being created. |
| Waiting | Waiting for instances to start. |
| Starting | Service is starting. |
| Running | Service is running normally. |
| Updating | Service is updating and instances will be updated. |
| HotUpdate | Service is being updated. This is a hot update, and instances are not updated. |
| Scaling | Instances are being scaled. |
| Stopping | Service is being stopped. |
| Stopped | Service is stopped. |
| Failed | Service has failed. |
| Deleting | Service is being deleted. |
| DeleteFailed | Service deletion failed. |
| Pending | Waiting for a specific action. |
| Completed | Task is complete. |
| Preparing | Service is preparing. |
Find which RAM user created a service
Query events in the ActionTrail console. Set the event name to CreateService. See Query events in the ActionTrail console.
Download official EAS images
No. PAI official images are internal platform images. They are available only within the PAI platform and cannot be downloaded from outside the platform's containers.
Storage and configuration
Cannot select OSS bucket during deployment
The OSS bucket and NAS file system must be in the same region as the EAS service. If they are in a different region, you cannot select them. Ensure your storage resources are in the same region when configuring models and code using storage mounts.
TensorFlow issues
See TensorFlow FAQ.

