This topic provides answers to frequently asked questions about online prediction.
Service deployment
What do I do if my service remains in the Waiting state for a long time?
After deployment, a service enters the Waiting state while resource scheduling and service instance startup occur. Once all service instances start successfully, the service enters the Running state. If a service remains in the Waiting state for an extended period, you can identify the cause by checking the status of service instances in the Service Instance list on the Overview page. The causes typically fall into two categories:
Insufficient resources: In the Instances list, some or all instances are in the Pending state.
Resource scheduling fails because the dedicated resource group lacks sufficient idle resources.
Verify whether the nodes in the dedicated resource group can provide sufficient idle resources, including CPU, memory, and GPU resources. For example, if a service instance requires 3 vCPUs and 4 GB of memory, the resource group must have at least one node that can provide these idle resources.
Important: A node must reserve at least 1 vCPU for system components to prevent system failures during peak hours. Therefore, the number of schedulable vCPUs on a node is one less than its total vCPU count.
You can view the list of nodes in a dedicated resource group on its details page. For more information, see Work with EAS resource groups.
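As a quick sanity check, the schedulability rule above can be expressed in a few lines of Python. This is a minimal sketch; the function name and inputs are illustrative and not part of any EAS API:

```python
def node_can_schedule(node_vcpu, node_mem_gb, req_vcpu, req_mem_gb):
    """Check whether a single node can host one service instance.

    EAS reserves 1 vCPU per node for system components, so the
    schedulable vCPU count is the node total minus one.
    """
    schedulable_vcpu = node_vcpu - 1
    return req_vcpu <= schedulable_vcpu and req_mem_gb <= node_mem_gb

# An instance that requests 3 vCPUs and 4 GB of memory fits on a
# 4-vCPU, 8 GB node (3 schedulable vCPUs remain after the reservation),
# but not on a 3-vCPU node (only 2 vCPUs are schedulable).
print(node_can_schedule(4, 8, 3, 4))   # True
print(node_can_schedule(3, 16, 3, 4))  # False
```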
Health check failed: The status of the service instances is Running but the status of the containers is [0/1] or [1/2].
The number to the left of the forward slash (/) indicates the number of containers that have started up. The number to the right indicates the total number of containers. When you deploy a service using a custom image, a sidecar container is automatically injected into each instance to control and monitor traffic. You do not need to pay attention to the sidecar container. The total number of containers displayed in the console is 2, which indicates the container from the custom image and the sidecar container. A service instance starts up only if both containers are ready, and then begins to receive traffic.
What do I do if the status of my service is Failed?
A service remains in the Failed state in the following situations:
During service deployment: If the resources, such as the model path, that you specified during deployment do not exist, an error is displayed in the status information of the service. In most cases, you can identify the cause of failure based on the error message.
During service startup: The service fails to start up after deployment and resource scheduling. The following status message appears:
Instance <network-test-5ff76448fd-h9dsn> not healthy: Instance crashed, please inspect instance log.
In this scenario, check the instance status in the Instances list on the Overview page to identify the cause of failure. A service instance fails to start up in the following situations:
The service instance is terminated by the system during the startup process due to an out-of-memory (OOM) error. In most cases, you can resolve this issue by increasing the memory allocated to the service.
The service crashes due to a code error during the startup phase. In this case, Last Status shows Error(error code). Click Logs in the Actions column of the instance to inspect the service logs and identify the cause of the startup failure.
The service fails to pull the image. See Image pull failure (ImagePullBackOff).
Image pull failure (ImagePullBackOff)
If the Last Exit Reason in the service instance list is ImagePullBackOff, the image failed to be pulled. You can click the icon in the Status column to view the specific reason.
Common reasons for image pull failures are as follows:
| Failure reason | Solutions |
| --- | --- |
| Insufficient system disk space | Expand the system disk or store large files on external storage such as OSS or NAS. For more information, see the "No space left on device" question in this topic. |
| ACR Resource Access Management not configured | To use a public image address, enable public access for ACR. To use an internal image address, configure VPC access control for the ACR instance. |
| EAS network configuration issues | When you use public image addresses, you must configure internet access for EAS. |
| Missing or incorrect authentication information | If an ACR Enterprise Edition instance is not configured for public anonymous pulling and you need to pull images across regions over the internet, configure the username and password of the image repository when you deploy the service. For information about how to obtain these credentials, see Configure access credentials. |
Based on the regions where the image service and EAS service are located, the recommendations are as follows:
Same region: It is recommended to use the internal image address to pull images.
Cross-region: ACR Personal Edition can only use public image addresses. For ACR Enterprise Edition, choose based on the following situations:
If you have high requirements for security and stability, use the internal network address of the image. You must connect the VPCs using CEN. For more information, see Access ACR Enterprise Edition instances across regions or from an IDC.
If your business scenario is relatively simple, or you cannot complete internal network connection temporarily, use public image addresses as a temporary solution. Downloading through the internet will be slower.
For ACR Enterprise Edition instances, note the following:
You need to configure access control for VPC and public network according to your situation.
If the repository is not configured for public anonymous pulling, when pulling across regions via public addresses, the EAS service needs to configure the username and password for the image repository download.
How do services deployed in EAS access the Internet?
By default, EAS services cannot access the Internet. To access the Internet, you must establish a direct network connection to a VPC that has Internet access enabled. If the VPC does not have Internet access, you can enable it by creating a NAT Gateway. To create a direct connection to connect EAS to the Internet, see the following topics:
For dedicated resource groups: Configure networking for a resource group.
For public resource groups: Configure networking for a resource group.
Why am I unable to select my OSS bucket when deploying an EAS service?
When you deploy an EAS service, you can mount storage to provide the model and code. The OSS bucket or NAS file system must reside in the same region as the EAS service. Otherwise, you cannot select the OSS bucket or the NAS file system.
Why am I unable to choose the instance type with 1 core and 2 GB memory when deploying an EAS service?
To avoid potential issues during usage, the instance type with 1 core and 2 GB memory has been removed from sale. This is because EAS deploys certain system components on each node, which occupies some resources. If the node specification is too small, the resource usage ratio of system components will be too high, resulting in a lower proportion of available resources for you.
How many services can I deploy in EAS at most?
The number of instances that you can deploy for an EAS service is determined by the remaining resource usage. You can view the amount of remaining resources in the list of machines on the Resource Groups page. For more information, see Work with EAS resource groups.
If instances are allocated based on the number of vCPUs, the upper limit on the number of instances that a node can host is (total vCPUs on the node - 1) / vCPUs per instance.
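The formula above can be sketched in Python. This is illustrative only; plug in your own node and instance specifications:

```python
def max_instances_per_node(node_vcpu, vcpu_per_instance):
    """Upper bound on the number of instances one node can host.

    One vCPU per node is reserved for EAS system components, so only
    (node_vcpu - 1) vCPUs are schedulable; integer division gives the
    number of whole instances that fit.
    """
    return (node_vcpu - 1) // vcpu_per_instance

# A 16-vCPU node running 2-vCPU instances can host at most 7 instances.
print(max_instances_per_node(16, 2))  # 7
```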
Can I download the Alibaba Cloud images from the internet?
Alibaba Cloud images are internal images that can only be used on PAI. They cannot be downloaded for use outside of PAI containers.
EAS service automatically restarts after stopping
Problem: The EAS service automatically restarts after stopping for a period of time.
Cause:
This occurs because the service instance is configured with auto scaling, and the minimum number of instances is set to 0. After a period without traffic, the number of instances automatically scales down to 0. When there are no active instances due to automatic scaling down, any incoming request triggers automatic scaling up (without needing to reach the set scaling threshold).
You can determine whether automatic scaling occurs by checking the auto scaling description in the deployment events.
Solutions:
If you no longer need this service, you can delete it directly.
If you do not want to delete the service, you can manually stop it by clicking Stop in the console or by calling the StopService API operation. A service that is manually stopped cannot be triggered by traffic to scale out.
If you do not want auto scaling to stop the service automatically, do not set the minimum number of instances to 0.
You can also disable auto scaling to prevent unexpected traffic from triggering scale-out.
PAI-EAS startup error: IoError(Os { code: 28, kind: StorageFull, message: "No space left on device" })
Problem:
PAI-EAS fails to start with the following error:
[2024-10-21 20:59:33] serialize_file(_flatten(tensors), filename, metadata=metadata)
[2024-10-21 20:59:33] safetensors_rust.SafetensorError: Error while serializing: IoError(Os { code: 28, kind: StorageFull, message: "No space left on device" })
[2024-10-21 20:59:35] time="2024-10-21T12:59:35Z" level=info msg="program stopped with status:exit status 1" program=/bin/sh
Cause: The system disk of the EAS instance is filled with model files, causing the service to fail to start normally.
Solutions:
Solution 1: Expand the system disk of the EAS instance.
Solution 2: If the model files are too large, store them on an external storage service, such as OSS or NAS, and then read them by mounting storage.
How do I expand the system disk?
You can configure this in Resource Information > Additional System Disk, or directly modify the service JSON configuration file as shown below.
"features": {
  "eas.aliyun.com/extra-ephemeral-storage": "40GB"
}
When using dedicated resources, if the additional system disk you need to configure is larger than the remaining system disk size of the machine, you need to delete the current resource, purchase a new one, and adjust the system disk size during purchase.
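For orientation, the features block sits at the top level of the service configuration JSON. The sketch below shows a minimal placement; the name, processor, and metadata values are illustrative placeholders that follow the common EAS service JSON layout, so verify them against your own configuration before use:

```json
{
  "name": "demo_service",
  "processor": "tensorflow_cpu_1.15",
  "metadata": {
    "instance": 1,
    "cpu": 2,
    "memory": 4000
  },
  "features": {
    "eas.aliyun.com/extra-ephemeral-storage": "40GB"
  }
}
```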
Service calling
How do I block HTTP access and allow only HTTPS access?
Shared gateways cannot be configured to block HTTP access. For dedicated gateways, you can enable HTTPS Redirection, which automatically redirects all HTTP requests to the HTTPS protocol.
What do I do if the "no healthy upstream" error message is returned during online debugging?
The "no healthy upstream" error message is returned when you debug a service online. The status code is 503.
Causes: Resources such as vCPU, memory, and GPU memory are overutilized. As a result, available resources are insufficient to support instance operation.
Solutions:
If your instance uses public resources, you can try again during off-peak hours, use a different instance type, or switch to another region.
If your instance uses dedicated resources (EAS resource groups), make sure that the resource groups reserve sufficient vCPU, memory, and GPU memory for the instance. We recommend that you reserve at least 20% of the total resources as a buffer.
What is the difference between calling a service over a VPC and over a VPC direct connection?
Call service over VPC: To call a service over a VPC, an internal-facing Server Load Balancer (SLB) instance and a gateway are used (for public addresses, a public SLB and gateway are used). This is a classic request model. In this mode, requests need to go through Layer 4 forwarding by SLB and Layer 7 forwarding by the gateway before reaching the service instance. In high-traffic and high-concurrency scenarios, forwarding can cause performance overhead, and the gateway also has bandwidth limitations (default 1 Gbps).
VPC direct connection: EAS provides a high-speed direct connection access mode that solves both performance and scalability issues without additional cost. After enabling VPC direct connection, it's equivalent to establishing a network path between your VPC and the EAS service VPC. Your requests use the service discovery feature provided by EAS to locate services and initiate load-balanced service requests through software load balancing in your client code. This process requires using the SDK provided by EAS and setting endpoint_type to DIRECT.
For example, in the scenario described in SDK for Python, you can add the following line to your code to change the connection mode from gateway to direct connection:
client = PredictClient('http://pai-eas-vpc.cn-hangzhou.aliyuncs.com', 'mnist_saved_model_example')
client.set_token('M2FhNjJlZDBmMzBmMzE4NjFiNzZhMmUxY2IxZjkyMDczNzAzYjFi****')
client.set_endpoint_type(ENDPOINT_TYPE_DIRECT)  # Use the direct connection mode
client.init()
How do I use cURL commands to call EAS online services?
After you deploy an online service in EAS, you can run cURL commands to call the service through the public endpoint or the VPC endpoint. Perform the following steps:
Obtain the service endpoint and token.
On the Elastic Algorithm Service (EAS) page, click the target service to go to the Overview page.
In the Basic Information section, click View Invocation Information.
In the Invocation Information dialog box, on the Public Endpoint Invocation or VPC Endpoint Invocation tab, obtain the service endpoint and token.
Run the cURL command to call the service.
Example:
$ curl <service_url> -H 'Authorization: <service_token>' -d '[{"sex":0,"cp":0,"fbs":0,"restecg":0,"exang":0,"slop":0,"thal":0,"age":0,"trestbps":0,"chol":0,"thalach":0,"oldpeak":0,"ca":0}]'
Where:
<service_url>: Replace with the service endpoint that you obtained.
<service_token>: Replace with the token that you obtained.
-d: The service request data.
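If you prefer Python over cURL, the same call can be built with the requests library. This is a minimal sketch: the endpoint URL and token below are placeholders you must replace with the values from the console, and the request is only prepared (not sent) so you can inspect it first:

```python
import requests

# Placeholders: replace with the endpoint and token from the
# Invocation Information dialog box in the console.
service_url = "http://your-endpoint.example/api/predict/your_service"
service_token = "<service_token>"

payload = [{"sex": 0, "cp": 0, "fbs": 0, "restecg": 0, "exang": 0,
            "slop": 0, "thal": 0, "age": 0, "trestbps": 0, "chol": 0,
            "thalach": 0, "oldpeak": 0, "ca": 0}]

# Build the request; the EAS token goes directly into the
# Authorization header, matching the cURL example above.
prepared = requests.Request(
    "POST",
    service_url,
    headers={"Authorization": service_token},
    json=payload,
).prepare()

# Send it with: response = requests.Session().send(prepared, timeout=10)
print(prepared.method, prepared.headers["Content-Type"])
```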
What do I do if the service log says "[WARN] connection is closed: End of file" or "Write an Invalid stream: End of file"?
The connection between the client and server is interrupted. When the server answers the request from the client, the server identifies that the connection to the client is interrupted. Then, the server generates a warning event. A connection is interrupted in the following situations:
The server times out and closes the connection: In Processor mode, the default server timeout is 5 seconds. You can use the service's metadata.rpc.keepalive parameter to modify the timeout. After the timeout period expires, the server closes the connection and records a 408 status code in the server's monitoring data.
The client times out: The client timeout period is specified in the code of your client. If the server does not return a response before the timeout period ends, the HTTP client automatically closes the connection. In this scenario, you can find a 499 status code in the monitoring data of the server.
For more information about request status codes, see Appendix: Status codes.
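To see the client-side timeout behavior in action, the following self-contained Python sketch starts a local HTTP server that responds slowly and calls it with a short client timeout. This is a local demonstration only, not an EAS API; the delay and timeout values are arbitrary:

```python
import http.server
import threading
import time

import requests

class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Responds after a 2-second delay to simulate slow inference."""
    def do_GET(self):
        time.sleep(2)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo output quiet

class QuietServer(http.server.HTTPServer):
    def handle_error(self, request, client_address):
        pass  # the client hanging up mid-request is expected here

server = QuietServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

timed_out = False
try:
    # The client gives up after 0.5 s, well before the 2 s "inference"
    # finishes; on EAS the server side would record this as a 499.
    requests.get(f"http://127.0.0.1:{port}/", timeout=0.5)
except requests.exceptions.Timeout:
    timed_out = True

server.shutdown()
print("client timed out:", timed_out)
```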
Upstream connect error or disconnect/reset before headers. reset reason: connection termination
This is typically caused by long connection timeouts that result in request failures, or by uneven instance loads. If the server processing time exceeds the HTTP timeout configured on the client, the client abandons the request and closes the connection, and the server monitoring records a 499 status code. Check the monitoring metrics for further confirmation. For cases where inference takes a long time, we recommend that you deploy an asynchronous inference service.
What do I do if I fail to call or debug a service that is deployed using a TensorFlow or PyTorch processor?
For performance reasons, the request body for a service that is deployed using a TensorFlow or PyTorch processor is in the non-plaintext protobuf format. The online debugging feature supports only plaintext input. Therefore, you cannot use the console to debug this type of service. You can use the SDKs provided by EAS to send requests. For more information about the SDKs for different programming languages, see Service Invocation SDKs.
Service management
Can I connect to an EAS instance over SSH?
No. EAS does not support remote SSH access, and you cannot log on to the container to debug an EAS instance. If you need to run commands in the instance, we recommend that you configure them in the service's run command.
What are the service statuses in EAS?
Currently, an EAS service can be in one of the following statuses. You can also go to the Elastic Algorithm Service (EAS) page to view the status in the service status column.
Creating
Waiting
Stopped
Failed
Updating
Stopping: The service is being stopped.
HotUpdate: The service is being upgraded through a hot update, which does not update the instances.
Starting
DeleteFailed: The service failed to be deleted.
Running
Scaling
Pending
Deleting
Completed
Preparing
Resource purchase/unsubscribe/delete
Dedicated resource group keeps scaling
This is usually due to insufficient resources in the current region.
For instances that use the subscription billing method, if resource creation fails due to insufficient resources, the system will automatically create a refund order, and the paid fees will be returned through the original payment method.
How do I delete nodes in a dedicated resource group that uses the subscription billing method?
Go to the Unsubscribe page to unsubscribe from the subscription-based dedicated EAS machines that you no longer use.
Type: Select Partial Refund.
Product Name: Select EAS Dedicated Machine Subscription.
Click Search to find the target resource, then click Unsubscribe Resource in the Actions column and follow the on-screen instructions to complete the unsubscription process.
After I unsubscribe an instance in a resource group, will the service instance data be retained?
No, the service instance data will not be retained.
Permissions
Why is the service-linked role of EAS not automatically created or deleted for RAM users?
The AliyunServiceRoleForPaiEas role can be automatically created or deleted only if you have the required permissions. Therefore, the AliyunServiceRoleForPaiEas role cannot be automatically created or deleted for Resource Access Management (RAM) users. If you want the system to automatically create and delete the role, attach the following custom policy to the RAM user.
Use the following script to create a custom policy. For more information, see Create a custom policy.
{
  "Statement": [
    {
      "Action": "ram:CreateServiceLinkedRole",
      "Resource": "*",
      "Effect": "Allow",
      "Condition": {
        "StringEquals": {
          "ram:ServiceName": "eas.pai.aliyuncs.com"
        }
      }
    }
  ],
  "Version": "1"
}
Attach the custom policy to the RAM user. For more information, see Grant permissions to the RAM user.