
Platform for AI: Warm up model services

Last Updated: Apr 24, 2025

Elastic Algorithm Service (EAS) of Platform for AI (PAI) provides the model warm-up feature to reduce the time required to process the initial requests that are sent to an online model service. The feature warms up a model before the model service is published online, which ensures that the service works as expected immediately after it goes online. This topic describes how to use the model warm-up feature.

Background information

When a request is sent to a model for the first time, different runtimes may perform various initialization tasks, which leads to higher latency for the initial requests and possible timeouts. For example, with the Java processor, the cold start of the Java virtual machine (JVM) can cause the initial requests to take a long time. Similarly, some TensorFlow models must load model files or parameters into memory during the first call, which can be time-consuming, result in a high response time (RT) for the first few requests, and potentially cause a 408 timeout or a 450 error. To address this issue, EAS provides the model warm-up feature, which warms up a model service before it goes online so that the service can process requests promptly after deployment.

Before a model service goes online, the EAS service engine sends warm-up requests to itself. The initial warm-up requests may take a long time, but subsequent requests can be completed within a short period of time.

To use the EAS model warm-up feature, first generate a warm-up request file. Then, specify the request file in the JSON file that is used to deploy the model service. When the model service is deployed or updated, the EAS service engine sends the warm-up requests. The model service is considered fully started only after the warm-up requests are successfully sent.

Use the model warm-up feature

To generate a warm-up request file, construct requests based on the requests that will be sent after the model service is published online. The warm-up request file is read and sent during the warm-up process. You can use the EAS SDKs that are used to call services to construct the request file. For more information, see SDKs. The following example describes how to use the model warm-up feature to warm up a TensorFlow model.

  1. Generate a warm-up request file.

    In the following examples, the EAS SDK for Python and the EAS SDK for Java are used to construct request files for TensorFlow model services. You can use similar methods for other types of models. For model services that use strings as inputs, you can store the requests as strings in a TXT file, with each request occupying one row. EAS automatically identifies the file format and sends the warm-up requests in the corresponding format.
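
    For example, the following minimal sketch writes a warm-up file for a hypothetical string-input service. The request strings are placeholders; replace them with the request bodies that your own service expects.

      # Each row in the TXT file is treated as one warm-up request.
      requests = [
          '{"feature1": 1.0, "feature2": "a"}',  # placeholder request body
          '{"feature1": 2.0, "feature2": "b"}',  # placeholder request body
      ]
      with open("warm_up.txt", "w") as fw:
          for line in requests:
              fw.write(line + "\n")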

    Important

    The inputs and outputs of the requests used to warm up a TensorFlow model must be the same as those of the requests sent after the model service is published online.

    The following sample code provides an example on how to use SDKs to construct a request file for a TensorFlow model service.

    • Use the SDK for Python

      #!/usr/bin/env python
      
      from eas_prediction import PredictClient
      from eas_prediction import StringRequest
      from eas_prediction import TFRequest
      
      if __name__ == '__main__':
          # The sample warm-up request. Construct a warm-up request based on your actual requirements. The inputs and outputs of the warm-up requests must be the same as those of the requests that are sent after the model service is published online. 
          req = TFRequest('serving_default')
          req.add_feed('sentence1', [200, 15], TFRequest.DT_INT32, [1] * 200 * 15)
          req.add_feed('sentence2', [200, 15], TFRequest.DT_INT32, [1] * 200 * 15)
          req.add_feed('y', [200, 2], TFRequest.DT_INT32, [2] * 200 * 2)
          req.add_feed('keep_rate', [], TFRequest.DT_FLOAT, [0.2])
          req.add_feed('images', [1, 784], TFRequest.DT_FLOAT, [1] * 784)
          req.add_fetch('sorted_labels')
          req.add_fetch('sorted_probs')
          # print(req.request_data)  # Display the request data.
          # Save the serialized request to warm_up.bin and use the file as the warm-up request file.
          with open("warm_up.bin", "wb") as fw:
              fw.write(req.to_string())

    • Use the SDK for Java

      To use EAS SDK for Java in a Maven project, you must add the eas-sdk dependency to the <dependencies> section of the pom.xml file. Use the latest version that is available in the Maven repository. Sample code:

      <dependency>
        <groupId>com.aliyun.openservices.eas</groupId>
        <artifactId>eas-sdk</artifactId>
        <version>2.0.13</version>
      </dependency>

      Sample code of SDK for Java:

      import java.io.File;
      import com.aliyun.openservices.eas.predict.request.TFDataType;
      import com.aliyun.openservices.eas.predict.request.TFRequest;
      import org.apache.commons.io.FileUtils;
      
      public class TestTf {
      
          public static void main(String[] args) throws Exception{
              // The sample warm-up request. Construct a warm-up request based on your actual requirements. 
              TFRequest request = new TFRequest();
              request.setSignatureName("predict_images");
              float[] content = new float[784];
              for (int i = 0; i < content.length; i++){
                content[i] = (float)0.0;
              }
              request.addFeed("images", TFDataType.DT_FLOAT, new long[]{1, 784}, content);
              request.addFetch("scores");
              
              try {
                  // Write the serialized request to a file. If the file does not exist, it is created. 
                  File writename = new File("/path/to/warm_up1.bin");
                  FileUtils.writeByteArrayToFile(writename, request.getRequest().toByteArray());
              } catch (Exception ex) {
                  // Print the error if the file cannot be written.
                  ex.printStackTrace();
              }
          }
      }
  2. Verify the request file.

    You can use one of the following methods to verify the request file:

    • Method 1: Send a service request for verification

      Run the following command to send a request to the model service. If the returned content is too large to be printed on the terminal, you can add the --output <filePath> option to store the result in a file.

      curl  --data-binary @"</path/to/warmup.bin>" -H 'Authorization: <yourToken>' <serviceAddress>

      Replace the following parameters with actual values:

      • </path/to/warmup.bin>: the path of the warm-up request file that is generated in the preceding step.

      • <yourToken>: the token that is used to access the model service.

      • <serviceAddress>: the endpoint of the model service.

    • Method 2: Parse the request file for verification

      • Use Python

        from eas_prediction import TFRequest
        
        req = TFRequest()
        with open('/path/to/warm_up1.bin', 'rb') as wm:
            req.request_data.ParseFromString(wm.read())
            print(req.request_data)
        
      • Use Java

        import com.aliyun.openservices.eas.predict.proto.PredictProtos;
        import org.apache.commons.io.FileUtils;
        import java.io.File;
        
        public class Test {
        
            public static void main(String[] args) throws Exception {
        
                File refile = new File("/path/to/warm_up1.bin");
                byte[] data = FileUtils.readFileToByteArray(refile);
                PredictProtos.PredictRequest pb = PredictProtos.PredictRequest.parseFrom(data);
                System.out.println(pb);
            }
        }
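
    Alternatively, you can combine both methods: parse the warm-up file with the EAS SDK and send it to the deployed service. The following Python sketch assumes placeholder values for the service endpoint, service name, and token.

      from eas_prediction import PredictClient
      from eas_prediction import TFRequest
      
      if __name__ == '__main__':
          # Load the previously generated warm-up request from the file.
          req = TFRequest()
          with open('/path/to/warm_up1.bin', 'rb') as wm:
              req.request_data.ParseFromString(wm.read())
      
          # <serviceEndpoint>, <serviceName>, and <yourToken> are placeholders for
          # the endpoint, name, and token of your deployed service.
          client = PredictClient('<serviceEndpoint>', '<serviceName>')
          client.set_token('<yourToken>')
          client.init()
          resp = client.predict(req)
          print(resp)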
  3. Configure the model service.

    1. Upload the generated warm-up request file to Object Storage Service (OSS). You can use the OSS console or the OSS SDK for Python, as shown in the sketch after the parameter descriptions below.

    2. Configure the parameters of the model service.

      In the model description file in the JSON format, configure the parameters of the model service.

      {
          "name":"warm_up_demo",
          "model_path":"oss://path/to/model", 
          "warm_up_data_path":"oss://path/to/warm_up_test.bin", // The path of the warm-up request file in OSS. 
          "processor":"tensorflow_cpu_1.15",
          "metadata":{
              "cpu":2,
              "instance":1,
              "rpc": {
                  "warm_up_count": 5, // The number of times each warm-up request is sent. If you do not specify a value for this parameter, 5 is used as the default value. 
              }
          }
      }

      The following warm-up parameters are used. For information about other parameters, see Parameters for JSON deployment.

      • warm_up_data_path: the path of the warm-up request file in OSS. The system automatically searches for the file and warms up the model by using the file before the model service is published online.

      • warm_up_count: the number of times each warm-up request is sent. If you do not specify a value for this parameter, 5 is used as the default value.
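
      If you want to upload the warm-up request file programmatically in sub-step 1 instead of through the OSS console, you can use the OSS SDK for Python (oss2), as shown in the following sketch. The AccessKey pair, endpoint, and bucket name are placeholders or example values.

      import oss2
      
      # Placeholders: replace with your own AccessKey pair, OSS endpoint, and bucket name.
      auth = oss2.Auth('<yourAccessKeyId>', '<yourAccessKeySecret>')
      bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', '<yourBucketName>')
      
      # Upload the local warm-up file. The object key must match the path that is
      # referenced by warm_up_data_path in the service configuration.
      bucket.put_object_from_file('path/to/warm_up_test.bin', 'warm_up.bin')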

  4. Deploy or update the model service. For more information, see Create a service or Modify a service.

    When you deploy or update the model service, the service engine of EAS sends warm-up requests to warm up the model service.

FAQ about warming up TensorFlow models

  • Issue

    In real business scenarios, updating TensorFlow models can cause service instability. Even if warm-up logic is added to the processor, the issue may persist. Tests show that each distinct combination of inputs and outputs causes the model to reload files for warm-up. Even after the model is warmed up with all signatures, some requests may still trigger a time-consuming reload.

  • Causes

    This issue occurs because the TensorFlow function session->Run(inputs, output_tensor_names, {}, &outputs) performs hash validation on inputs and output_tensor_names. If the inputs or outputs change, a reload is triggered.

    The following sample code shows the inputs of a sample TensorFlow model:

    Inputs:
      threshold: []; DT_FLOAT
      model_id: []; DT_STRING
      input_holder: [-1]; DT_STRING

    The following sample code shows the outputs of the TensorFlow model:

    Outputs:
      model_version_id: []; DT_STRING
      sorted_labels: [-1, 3]; DT_STRING
      sorted_probs: [-1, 3]; DT_FLOAT

    The following sample warm-up requests are sent:

    request.addFeed("input_holder",TFDataType.DT_STRING, new long[]{1}, input);
    request.addFeed("threshold", TFDataType.DT_FLOAT, new long[] {}, th);
    request.addFeed("model_id", TFDataType.DT_STRING, new long[]{}, model_name);
    
    request.addFetch("sorted_labels");
    request.addFetch("sorted_probs");

    After the TensorFlow model is warmed up, the following requests are sent. Compared with the warm-up requests, these requests fetch an additional output. In this case, the model files must be reloaded:

    request.addFeed("input_holder",TFDataType.DT_STRING, new long[]{1}, input);
    request.addFeed("threshold", TFDataType.DT_FLOAT, new long[] {}, th);
    request.addFeed("model_id", TFDataType.DT_STRING, new long[]{}, model_name);
    
    request.addFetch("sorted_labels");
    request.addFetch("sorted_probs");
    request.addFetch("model_version_id"); // An additional parameter is returned.

  • Solution

    Each service must be warmed up with real business requests, and the warm-up applies only to the specific inputs and outputs of those requests. Therefore, the model warm-up feature of EAS requires you to upload actual request data. A single successful session->Run call with a real request is sufficient for warm-up. You can upload only one warm-up file, but make sure that the warm-up requests use exactly the same inputs and outputs as the actual calls.
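
    For the example model above, a warm-up request that avoids the extra reload fetches exactly the outputs that production requests fetch, including model_version_id. The following Python sketch assumes the signature name serving_default and passes string tensors as bytes; adjust these assumptions to your model and SDK version.

    from eas_prediction import TFRequest
    
    # Build a warm-up request with the same feeds and the full set of fetches that
    # production requests use, so that session->Run sees an identical signature
    # and no additional reload is triggered.
    req = TFRequest('serving_default')   # assumed signature name
    req.add_feed('input_holder', [1], TFRequest.DT_STRING, [b'example input'])
    req.add_feed('threshold', [], TFRequest.DT_FLOAT, [0.5])
    req.add_feed('model_id', [], TFRequest.DT_STRING, [b'example_model'])
    req.add_fetch('sorted_labels')
    req.add_fetch('sorted_probs')
    req.add_fetch('model_version_id')    # also fetched by production requests
    
    with open('warm_up.bin', 'wb') as fw:
        fw.write(req.to_string())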