This topic provides answers to frequently asked questions about online prediction.
Service deployment
What do I do if my service remains in the Waiting state for a long time?
After deployment, a service enters the Waiting state while resource scheduling and service instance startup occur. Once all service instances start successfully, the service enters the Running state. If a service remains in the Waiting state for an extended period, you can identify the cause by checking the status of service instances in the Service Instance list on the Overview page. The causes typically fall into two categories:
Insufficient resources: In the Instances list, some or all instances are in the Pending state.
Resource scheduling fails because the dedicated resource group lacks sufficient idle resources.
Verify whether the nodes in the dedicated resource group can provide sufficient idle resources, including CPU, memory, and GPU resources. For example, if a service instance requires 3 vCPUs and 4 GB of memory, the resource group must have at least one node that can provide these idle resources.
Important: A node must reserve at least 1 vCPU for system components to prevent system failures during peak hours. Therefore, the number of schedulable vCPUs on a node is one less than its total vCPU count.
You can view the list of nodes in a dedicated resource group on its details page. For more information, see Work with EAS resource groups.
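As a quick sanity check, the schedulability rule above can be expressed in a few lines of Python. This is a minimal sketch; the function name and inputs are illustrative and not part of any EAS API:

```python
def node_can_schedule(node_vcpu, node_mem_gb, req_vcpu, req_mem_gb):
    """Check whether a single node can host one service instance.

    EAS reserves 1 vCPU per node for system components, so the
    schedulable vCPU count is the node total minus one.
    """
    schedulable_vcpu = node_vcpu - 1
    return req_vcpu <= schedulable_vcpu and req_mem_gb <= node_mem_gb

# An instance that requests 3 vCPUs and 4 GB of memory fits on a
# 4-vCPU, 8 GB node (3 schedulable vCPUs remain after the reservation),
# but not on a 3-vCPU node (only 2 vCPUs are schedulable).
print(node_can_schedule(4, 8, 3, 4))   # True
print(node_can_schedule(3, 16, 3, 4))  # False
```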
Health check failed: The status of the service instances is Running but the status of the containers is [0/1] or [1/2].
The number to the left of the forward slash (/) indicates the number of containers that have started up. The number to the right indicates the total number of containers. When you deploy a service using a custom image, a sidecar container is automatically injected into each instance to control and monitor traffic. You do not need to pay attention to the sidecar container. The total number of containers displayed in the console is 2, which indicates the container from the custom image and the sidecar container. A service instance starts up only if both containers are ready, and then begins to receive traffic.
What do I do if the status of my service is Failed?
A service remains in the Failed state in the following situations:
During service deployment: If the resources, such as the model path, that you specified during deployment do not exist, an error is displayed in the status information of the service. In most cases, you can identify the cause of failure based on the error message.
During service startup: The service fails to start up after deployment and resource scheduling. The following status message appears:
Instance <network-test-5ff76448fd-h9dsn> not healthy: Instance crashed, please inspect instance log.
In this scenario, check the instance status in the Instances list on the Overview page to identify the cause of failure. A service instance fails to start up in the following situations:
The service instance is terminated by the system during the startup process due to an out-of-memory (OOM) error. In most cases, you can resolve this issue by increasing the memory allocated to the service.
The service crashes due to a code error during the startup phase. In this case, Last Status shows Error(error code). Click Logs in the Actions column of the instance to inspect the service logs and identify the cause of the startup failure.
The service fails to pull the image. See Image pull failure (ImagePullBackOff).
Image pull failure (ImagePullBackOff)
If the Last Exit Reason in the service instance list is ImagePullBackOff, the image failed to be pulled. You can click the icon in the Status column to view the specific reason.
Common reasons for image pull failures are as follows:
| Failure reason | Solutions |
| --- | --- |
| Insufficient system disk space | Expand the system disk or store large files on external storage such as OSS or NAS. For more information, see the "No space left on device" question in this topic. |
| ACR Resource Access Management not configured | To use a public image address, enable public access for ACR. To use an internal image address, configure VPC access control for the ACR instance. |
| EAS network configuration issues | When you use public image addresses, you must configure internet access for EAS. |
| Missing or incorrect authentication information | If an ACR Enterprise Edition instance is not configured for public anonymous pulling and you need to pull images across regions over the internet, configure the username and password of the image repository when you deploy the service. For information about how to obtain these credentials, see Configure access credentials. |
Based on the regions where the image service and EAS service are located, the recommendations are as follows:
Same region: It is recommended to use the internal image address to pull images.
Cross-region: ACR Personal Edition can only use public image addresses. For ACR Enterprise Edition, choose based on the following situations:
If you have high requirements for security and stability, use the internal network address of the image. You must connect the VPCs using CEN. For more information, see Access ACR Enterprise Edition instances across regions or from an IDC.
If your business scenario is relatively simple, or you cannot complete internal network connection temporarily, use public image addresses as a temporary solution. Downloading through the internet will be slower.
For ACR Enterprise Edition instances, note the following:
You need to configure access control for VPC and public network according to your situation.
If the repository is not configured for public anonymous pulling, when pulling across regions via public addresses, the EAS service needs to configure the username and password for the image repository download.
How do services deployed in EAS access the Internet?
By default, EAS services cannot access the Internet. To access the Internet, you must establish a direct network connection to a VPC that has Internet access enabled. If the VPC does not have Internet access, you can enable it by creating a NAT Gateway. To create a direct connection to connect EAS to the Internet, see the following topics:
For dedicated resource groups: Configure networking for a resource group.
For public resource groups: Configure networking for a resource group.
Why am I unable to select my OSS bucket when deploying an EAS service?
When you deploy an EAS service, you can mount storage to provide the model and code. The OSS bucket or NAS file system must reside in the same region as the EAS service. Otherwise, you cannot select the OSS bucket or the NAS file system.
Why am I unable to choose the instance type with 1 core and 2 GB memory when deploying an EAS service?
To avoid potential issues during usage, the instance type with 1 core and 2 GB memory has been removed from sale. This is because EAS deploys certain system components on each node, which occupies some resources. If the node specification is too small, the resource usage ratio of system components will be too high, resulting in a lower proportion of available resources for you.
How many services can I deploy in EAS at most?
The number of instances that you can deploy for an EAS service is determined by the remaining resource usage. You can view the amount of remaining resources in the list of machines on the Resource Groups page. For more information, see Work with EAS resource groups.
If instances are allocated based on the number of vCPUs, the upper limit on the number of instances that a node can host is (total vCPUs on the node - 1) / vCPUs per instance.
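The formula above can be sketched in Python. This is illustrative only; plug in your own node and instance specifications:

```python
def max_instances_per_node(node_vcpu, vcpu_per_instance):
    """Upper bound on the number of instances one node can host.

    One vCPU per node is reserved for EAS system components, so only
    (node_vcpu - 1) vCPUs are schedulable; integer division gives the
    number of whole instances that fit.
    """
    return (node_vcpu - 1) // vcpu_per_instance

# A 16-vCPU node running 2-vCPU instances can host at most 7 instances.
print(max_instances_per_node(16, 2))  # 7
```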
Can I download the Alibaba Cloud images from the internet?
Alibaba Cloud images are internal images that can only be used on PAI. They cannot be downloaded for use outside of PAI containers.
EAS service automatically restarts after stopping
Problem: The EAS service automatically restarts after stopping for a period of time.
Cause:
This occurs because the service instance is configured with auto scaling, and the minimum number of instances is set to 0. After a period without traffic, the number of instances automatically scales down to 0. When there are no active instances due to automatic scaling down, any incoming request triggers automatic scaling up (without needing to reach the set scaling threshold).
You can determine whether automatic scaling occurs by checking the auto scaling description in the deployment events.
Solutions:
If you no longer need this service, you can delete it directly.
If you do not want to delete the service, you can manually stop it by clicking Stop in the console or by calling the StopService API operation. A service that is manually stopped cannot be triggered by traffic to scale out.
If you do not want auto scaling to stop the service automatically, do not set the minimum number of instances to 0.
You can also disable auto scaling to prevent unexpected traffic from triggering scale-out.
PAI-EAS startup error: IoError(Os { code: 28, kind: StorageFull, message: "No space left on device" })
Problem:
PAI-EAS fails to start with the following error:
[2024-10-21 20:59:33] serialize_file(_flatten(tensors), filename, metadata=metadata)
[2024-10-21 20:59:33] safetensors_rust.SafetensorError: Error while serializing: IoError(Os { code: 28, kind: StorageFull, message: "No space left on device" })
[2024-10-21 20:59:35] time="2024-10-21T12:59:35Z" level=info msg="program stopped with status:exit status 1" program=/bin/sh
Cause: The system disk of the EAS instance is filled with model files, causing the service to fail to start normally.
Solutions:
Solution 1: Expand the system disk of the EAS instance.
Solution 2: If the model files are too large, store them on an external storage service, such as OSS or NAS, and then read them by mounting storage.
How do I expand the system disk?
You can configure this in Resource Information > Additional System Disk, or directly modify the service JSON configuration file as shown below.
"features": {
  "eas.aliyun.com/extra-ephemeral-storage": "40GB"
}
When using dedicated resources, if the additional system disk you need to configure is larger than the remaining system disk size of the machine, you need to delete the current resource, purchase a new one, and adjust the system disk size during purchase.
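For orientation, the features block sits at the top level of the service configuration JSON. The sketch below shows a minimal placement; the name, processor, and metadata values are illustrative placeholders that follow the common EAS service JSON layout, so verify them against your own configuration before use:

```json
{
  "name": "demo_service",
  "processor": "tensorflow_cpu_1.15",
  "metadata": {
    "instance": 1,
    "cpu": 2,
    "memory": 4000
  },
  "features": {
    "eas.aliyun.com/extra-ephemeral-storage": "40GB"
  }
}
```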
Service calling
How do I block HTTP access and allow only HTTPS access?
Shared gateways cannot be configured to block HTTP access. For dedicated gateways, you can enable HTTPS Redirection, which automatically redirects all HTTP requests to the HTTPS protocol.
What do I do if the "no healthy upstream" error message is returned during online debugging?
The "no healthy upstream" error message is returned when you debug a service online. The status code is 503.
Causes: Resources such as vCPU, memory, and GPU memory are overutilized. As a result, available resources are insufficient to support instance operation.
Solutions:
If your instance uses public resources, you can try again during off-peak hours, use a different instance type, or switch to another region.
If your instance uses dedicated resources (EAS resource groups), make sure that the resource groups reserve sufficient vCPU, memory, and GPU memory for the instance. We recommend that you reserve at least 20% of the total resources as a buffer.
What is the difference between calling a service over a VPC and over a VPC direct connection?
Call service over VPC: To call a service over a VPC, an internal-facing Server Load Balancer (SLB) instance and a gateway are used (for public addresses, a public SLB and gateway are used). This is a classic request model. In this mode, requests need to go through Layer 4 forwarding by SLB and Layer 7 forwarding by the gateway before reaching the service instance. In high-traffic and high-concurrency scenarios, forwarding can cause performance overhead, and the gateway also has bandwidth limitations (default 1 Gbps).
VPC direct connection: EAS provides a high-speed direct connection access mode that solves both performance and scalability issues without additional cost. After enabling VPC direct connection, it's equivalent to establishing a network path between your VPC and the EAS service VPC. Your requests use the service discovery feature provided by EAS to locate services and initiate load-balanced service requests through software load balancing in your client code. This process requires using the SDK provided by EAS and setting endpoint_type to DIRECT.
For example, in the scenario described in SDK for Python, you can add the following line to your code to change the connection mode from gateway to direct connection:
client = PredictClient('http://pai-eas-vpc.cn-hangzhou.aliyuncs.com', 'mnist_saved_model_example')
client.set_token('M2FhNjJlZDBmMzBmMzE4NjFiNzZhMmUxY2IxZjkyMDczNzAzYjFi****')
client.set_endpoint_type(ENDPOINT_TYPE_DIRECT)  # Use the direct connection mode
client.init()
How do I use cURL commands to call EAS online services?
After you deploy an online service in EAS, you can run cURL commands to call the service through the public endpoint or the VPC endpoint. Perform the following steps:
Obtain the service endpoint and token.
On the Elastic Algorithm Service (EAS) page, click the target service to go to the Overview page.
In the Basic Information section, click View Invocation Information.
In the Invocation Information dialog box, on the Public Endpoint Invocation or VPC Endpoint Invocation tab, obtain the service endpoint and token.
Run the cURL command to call the service.
Example:
$ curl <service_url> -H 'Authorization: <service_token>' -d '[{"sex":0,"cp":0,"fbs":0,"restecg":0,"exang":0,"slop":0,"thal":0,"age":0,"trestbps":0,"chol":0,"thalach":0,"oldpeak":0,"ca":0}]'
Where:
<service_url>: Replace with the service endpoint that you obtained.
<service_token>: Replace with the token that you obtained.
-d: The service request data.
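If you prefer Python over cURL, the same call can be built with the requests library. This is a minimal sketch: the endpoint URL and token below are placeholders you must replace with the values from the console, and the request is only prepared (not sent) so you can inspect it first:

```python
import requests

# Placeholders: replace with the endpoint and token from the
# Invocation Information dialog box in the console.
service_url = "http://your-endpoint.example/api/predict/your_service"
service_token = "<service_token>"

payload = [{"sex": 0, "cp": 0, "fbs": 0, "restecg": 0, "exang": 0,
            "slop": 0, "thal": 0, "age": 0, "trestbps": 0, "chol": 0,
            "thalach": 0, "oldpeak": 0, "ca": 0}]

# Build the request; the EAS token goes directly into the
# Authorization header, matching the cURL example above.
prepared = requests.Request(
    "POST",
    service_url,
    headers={"Authorization": service_token},
    json=payload,
).prepare()

# Send it with: response = requests.Session().send(prepared, timeout=10)
print(prepared.method, prepared.headers["Content-Type"])
```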
What do I do if the service log says "[WARN] connection is closed: End of file" or "Write an Invalid stream: End of file"?
The connection between the client and server is interrupted. When the server answers the request from the client, the server identifies that the connection to the client is interrupted. Then, the server generates a warning event. A connection is interrupted in the following situations:
The server times out and closes the connection: In Processor mode, the default server timeout is 5 seconds. You can use the service's metadata.rpc.keepalive parameter to modify the timeout. After the timeout period expires, the server closes the connection and records a 408 status code in the server's monitoring data.
The client times out: The client timeout period is specified in the code of your client. If the server does not return a response before the timeout period ends, the HTTP client automatically closes the connection. In this scenario, you can find a 499 status code in the monitoring data of the server.
For more information about request status codes, see Appendix: Status codes.
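To see the client-side timeout behavior in action, the following self-contained Python sketch starts a local HTTP server that responds slowly and calls it with a short client timeout. This is a local demonstration only, not an EAS API; the delay and timeout values are arbitrary:

```python
import http.server
import threading
import time

import requests

class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Responds after a 2-second delay to simulate slow inference."""
    def do_GET(self):
        time.sleep(2)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo output quiet

class QuietServer(http.server.HTTPServer):
    def handle_error(self, request, client_address):
        pass  # the client hanging up mid-request is expected here

server = QuietServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

timed_out = False
try:
    # The client gives up after 0.5 s, well before the 2 s "inference"
    # finishes; on EAS the server side would record this as a 499.
    requests.get(f"http://127.0.0.1:{port}/", timeout=0.5)
except requests.exceptions.Timeout:
    timed_out = True

server.shutdown()
print("client timed out:", timed_out)
```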
Upstream connect error or disconnect/reset before headers. reset reason: connection termination
This is typically caused by long connection timeouts that result in request failures, or by uneven instance loads. If the server processing time exceeds the HTTP timeout configured on the client, the client abandons the request and closes the connection, and the server monitoring records a 499 status code. Check the monitoring metrics for further confirmation. For cases where inference takes a long time, we recommend that you deploy an asynchronous inference service.
What do I do if I fail to call or debug a service that is deployed using a TensorFlow or PyTorch processor?
For performance reasons, the request body for a service that is deployed using a TensorFlow or PyTorch processor is in the non-plaintext protobuf format. The online debugging feature supports only plaintext input. Therefore, you cannot use the console to debug this type of service. You can use the SDKs provided by EAS to send requests. For more information about the SDKs for different programming languages, see Service Invocation SDKs.
Service management
Can I connect to an EAS instance over SSH?
No. EAS does not support remote SSH access, and you cannot log on to the container to debug an EAS instance. If you need to run commands in the instance, we recommend that you configure them in the service's run command.
What are the service statuses in EAS?
Currently, an EAS service can be in one of the following statuses. You can also go to the Elastic Algorithm Service (EAS) page to view the status in the service status column.
Creating
Waiting
Stopped
Failed
Updating
Stopping: The service is being stopped.
HotUpdate: The service is being upgraded through a hot update, which does not update the instances.
Starting
DeleteFailed: The service failed to be deleted.
Running
Scaling
Pending
Deleting
Completed
Preparing
Resource purchase/unsubscribe/delete
Dedicated resource group keeps scaling
This is usually due to insufficient resources in the current region.
For instances that use the subscription billing method, if resource creation fails due to insufficient resources, the system will automatically create a refund order, and the paid fees will be returned through the original payment method.
How do I delete nodes in a dedicated resource group that uses the subscription billing method?
Go to the Unsubscribe page to unsubscribe from the subscription-based dedicated EAS machines that you no longer use.
Type: Select Partial Refund.
Product Name: Select EAS Dedicated Machine Subscription.
Click Search to find the target resource, then click Unsubscribe Resource in the Actions column and follow the on-screen instructions to complete the unsubscription process.
After I unsubscribe an instance in a resource group, will the service instance data be retained?
No, the service instance data will not be retained.
Permissions
Why is the service-linked role of EAS not automatically created or deleted for RAM users?
The AliyunServiceRoleForPaiEas role can be automatically created or deleted only if you have the required permissions. Therefore, the AliyunServiceRoleForPaiEas role cannot be automatically created or deleted for Resource Access Management (RAM) users. If you want the system to automatically create and delete the role, attach the following custom policy to the RAM user.
Use the following script to create a custom policy. For more information, see Create a custom policy.
{
  "Statement": [
    {
      "Action": "ram:CreateServiceLinkedRole",
      "Resource": "*",
      "Effect": "Allow",
      "Condition": {
        "StringEquals": {
          "ram:ServiceName": "eas.pai.aliyuncs.com"
        }
      }
    }
  ],
  "Version": "1"
}
Attach the custom policy to the RAM user. For more information, see Grant permissions to the RAM user.