Elastic Algorithm Service (EAS) of Machine Learning Platform for AI (PAI) provides the model warm-up feature to reduce the time required to process the initial requests that are sent to a newly published model service. The feature warms up a model before the model service is published online, so that the service can work as expected immediately after it goes online. This topic describes how to use the model warm-up feature.

Background information

When the initial request is sent to a model service, different initialization operations are performed depending on the runtime environment. As a result, processing the initial request can take a long time, and a request timeout error may occur after you upgrade the configurations of a model service or update the service. For example, when a Java processor starts for the first time, the first few requests take a long time to process because of the cold start of the Java virtual machine (JVM). For specific TensorFlow model services, the model-related files or parameters are loaded into memory the first time the service is called, which also takes a long time. In these scenarios, the response times of the first few requests are long, and a 408 timeout error or a 450 error may be returned. To resolve this issue, EAS provides the model warm-up feature, which calls a model service before the service is published online. This way, the model is warmed up in advance and the service can work as expected immediately after it goes online.

To warm up a model, the service engine of EAS sends the uploaded warm-up requests to itself before the model service is published online. A warm-up request is sent every 10 seconds, which is twice the keep-alive time, or immediately after the success response to the previous request is received. Each request file is sent five times in a row. If the initial request of a model service is processed within 10 seconds, the few subsequent requests can also be processed within a short period, and the overall warm-up period is less than 10 seconds. In general, processing the initial request takes longer than processing the subsequent requests. If the initial request takes more than 10 seconds to process, the warm-up period lasts longer, possibly tens of seconds.

To use the model warm-up feature of EAS, you must first generate a warm-up request file. Then, specify the warm-up request file in the JSON file that is used to deploy your model service. When you deploy or update the model service, the service engine of EAS sends warm-up requests. After the warm-up requests are sent, the model service is warmed up.

Use the model warm-up feature

To generate a warm-up request file, construct the file based on the actual requests that will be sent after the model service is published online. The file is read and replayed during the warm-up process. You can construct the request file by using the EAS SDKs that are used to call services. For more information, see SDKs. The following example describes how to use the model warm-up feature to warm up a TensorFlow model.

  1. Generate a warm-up request file.
    The following examples show how to use EAS SDK for Python and EAS SDK for Java to construct request files for TensorFlow model services. To warm up other types of models, construct request files with the corresponding SDKs in the same way. For model services that use strings as inputs, you can store the requests as strings in a TXT file, with one request per line; a TXT file can contain multiple requests. A minimal sketch of such a file is provided after the SDK examples below. EAS automatically identifies the file format and sends the warm-up requests in the corresponding format.
    Notice The inputs and outputs of the requests used to warm up a TensorFlow model must be the same as those of the requests sent after the model service is published online.
    The following code shows how to use SDKs to construct a request file for a TensorFlow model service:
    • Use the SDK for Python
      #!/usr/bin/env python

      from eas_prediction import PredictClient
      from eas_prediction import StringRequest
      from eas_prediction import TFRequest

      if __name__ == '__main__':
          # ...
          # The sample warm-up request. Construct a warm-up request based on your actual requirements. The inputs and outputs of the requests used for warm-up must be the same as those of the requests that are sent after the model service is published online.
          req = TFRequest('serving_default')
          req.add_feed('sentence1', [200, 15], TFRequest.DT_INT32, [1] * 200 * 15)
          req.add_feed('sentence2', [200, 15], TFRequest.DT_INT32, [1] * 200 * 15)
          req.add_feed('y', [200, 2], TFRequest.DT_INT32, [2] * 200 * 2)
          req.add_feed('keep_rate', [], TFRequest.DT_FLOAT, [0.2])
          req.add_feed('images', [1, 784], TFRequest.DT_FLOAT, [1] * 784)
          req.add_fetch('sorted_labels')
          req.add_fetch('sorted_probs')
          # print(req.request_data)  # Display the request data.
          # Save the serialized request to warm_up.bin as the warm-up request file.
          with open("warm_up.bin", "wb") as fw:
              fw.write(req.to_string())
    • Use the SDK for Java
      import java.io.File;

      import org.apache.commons.io.FileUtils;

      import com.aliyun.openservices.eas.predict.request.TFDataType;
      import com.aliyun.openservices.eas.predict.request.TFRequest;

      public class Test_TF {

          public static void main(String[] args) throws Exception {
              // The sample warm-up request. Construct a warm-up request based on your actual requirements.
              TFRequest request = new TFRequest();
              request.setSignatureName("predict_images");
              float[] content = new float[784];
              for (int i = 0; i < content.length; i++) {
                  content[i] = 0.0f;
              }
              request.addFeed("images", TFDataType.DT_FLOAT, new long[]{1, 784}, content);
              request.addFetch("scores");

              try {
                  // Serialize the request and write it to the warm-up request file.
                  File writename = new File("/path/to/warm_up1.bin"); // If the file does not exist, it is created.
                  FileUtils.writeByteArrayToFile(writename, request.getRequest().toByteArray());
              } catch (Exception ex) {
                  ex.printStackTrace();
              }
          }
      }
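    For model services that take strings as inputs (as mentioned above), no SDK serialization is required: the warm-up request file is a plain TXT file with one request per line. The following Python snippet is a minimal sketch; the request strings are hypothetical placeholders and must be replaced with requests that match the actual input format of your service.
      # A minimal sketch: write one complete request body per line to a TXT file.
      # The request strings below are hypothetical placeholders.
      requests = [
          '{"sentence": "sample request 1"}',
          '{"sentence": "sample request 2"}',
      ]
      with open("warm_up.txt", "w") as fw:
          fw.write("\n".join(requests))
      # Upload warm_up.txt and reference it in warm_up_data_path, as described in the following steps.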
  2. Verify the request file.
    You can use one of the following methods to verify the request file:
    • Method 1: Send a service request for verification
      Run the following command to send a request to the model service:
      curl --data-binary @"</path/to/warm_up.bin>" -H 'Authorization: <yourToken>' <serviceAddress>
      Replace the following placeholders with actual values:
      • </path/to/warm_up.bin>: the path of the warm-up request file that is generated in the preceding step.
      • <yourToken>: the token that is used to access the model service.
      • <serviceAddress>: the endpoint of the model service.
    • Method 2: Parse the request file for verification
      • Use Python
        from eas_prediction import TFRequest
        
        req = TFRequest()
        with open('/path/to/warm_up1.bin', 'rb') as wm:
            req.request_data.ParseFromString(wm.read())
            print(req.request_data)
      • Use Java
        
        // PredictProtos is provided by the EAS SDK for Java; import paths may vary with the SDK version.
        import java.io.File;
        import org.apache.commons.io.FileUtils;
        import com.aliyun.openservices.eas.predict.proto.PredictProtos;

        public class ParseWarmupFile {
            public static void main(String[] args) throws Exception {
                File readfile = new File("/path/to/warm_up1.bin");
                byte[] data = FileUtils.readFileToByteArray(readfile);
                PredictProtos.PredictRequest pb = PredictProtos.PredictRequest.parseFrom(data);
                System.out.println(pb);
            }
        }
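    In addition to the preceding methods, you can replay the request file against the deployed service by using the EAS SDK for Python. The following snippet is a minimal sketch; the service endpoint, service name, and token are placeholders that you must replace with your own values.
      from eas_prediction import PredictClient, TFRequest

      # Placeholders: replace with your service endpoint, service name, and token.
      client = PredictClient('<serviceEndpoint>', '<serviceName>')
      client.set_token('<yourToken>')
      client.init()

      # Load the warm-up request file and send it as a normal prediction request.
      req = TFRequest()
      with open('/path/to/warm_up1.bin', 'rb') as wm:
          req.request_data.ParseFromString(wm.read())
      resp = client.predict(req)
      print(resp)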
  3. Configure the model service.
    1. Upload the generated warm-up request file to Object Storage Service (OSS). For an example that uses the OSS SDK for Python, see the sketch at the end of this step.
    2. Set the parameters for the model service.
      In the model description file in the JSON format, set the parameters for the model service.
      {
          "name":"warm_up_demo",
          "model_path":"oss://path/to/model", 
          "warm_up_data_path":"oss://path/to/warm_up_test.bin", // The path of the warm-up request file in OSS. 
          "processor":"tensorflow_cpu_1.15",
          "metadata":{
              "cpu":2,
              "instance":1,
              "rpc": {
                  "warm_up_count": 5, // The number of times each warm-up request is to be sent. If you do not specify a value for this parameter, 5 is used as the default value. 
              }
          }
      }
      The following warm-up parameters are involved. For information about other parameters, see Create a service.
      • warm_up_data_path: the path of the warm-up request file in OSS. The system automatically searches for the file and warms up the model by using the file before the model service is published online.
      • warm_up_count: the number of times each warm-up request is to be sent. If you do not specify a value for this parameter, 5 is used as the default value.
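    You can upload the warm-up request file to OSS in the OSS console, by using ossutil, or by using the OSS SDK. The following Python snippet is a minimal sketch that uses the OSS SDK for Python (oss2); the AccessKey pair, endpoint, bucket name, and object path are placeholders that you must replace with your own values.
      import oss2

      # Placeholders: replace with your AccessKey pair, OSS endpoint (for example, https://oss-cn-hangzhou.aliyuncs.com), and bucket name.
      auth = oss2.Auth('<yourAccessKeyId>', '<yourAccessKeySecret>')
      bucket = oss2.Bucket(auth, '<yourOssEndpoint>', '<yourBucketName>')

      # Upload the local warm-up request file to the object path that warm_up_data_path points to.
      bucket.put_object_from_file('path/to/warm_up_test.bin', 'warm_up.bin')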
  4. Deploy or update the model service. For more information, see Create a service or Modify a service.
    When you deploy or update the model service, the service engine of EAS sends warm-up requests. After the warm-up requests are sent, the model service is warmed up.

FAQ about the warm-up of TensorFlow models

  • Issue description

    In actual business scenarios, updating a TensorFlow model service may affect the stability of the service. You can add inference functions that call the model to the implementation logic of a processor to implement warm-up. However, the preceding issue persists even after warm-up logic is added to the TensorFlow processor. Tests show that each time a request involves a new combination of inputs and outputs, the model must be warmed up again. Even if all inputs and outputs have been warmed up, a large amount of time may still be required when specific combinations of inputs and outputs are requested.

  • Cause

    The preceding issue occurs because hash-based verification is performed on the inputs and output_tensor_names parameters each time session->Run(inputs, output_tensor_names, {}, &outputs) is called. If the inputs or outputs change, the time-consuming loading is performed again, regardless of whether the model has already been warmed up with other inputs and outputs.

    The following code shows the inputs of a sample TensorFlow model:
    Inputs:
      threshold: []; DT_FLOAT
      model_id: []; DT_STRING
      input_holder: [-1]; DT_STRING
    The following code shows the outputs of the TensorFlow model:
    Outputs:
      model_version_id: []; DT_STRING
      sorted_labels: [-1, 3]; DT_STRING
      sorted_probs: [-1, 3]; DT_FLOAT
    The following warm-up requests are sent:
    request.addFeed("input_holder",TFDataType.DT_STRING, new long[]{1}, input);
    request.addFeed("threshold", TFDataType.DT_FLOAT, new long[] {}, th);
    request.addFeed("model_id", TFDataType.DT_STRING, new long[]{}, model_name);
    
    request.addFetch("sorted_labels");
    request.addFetch("sorted_probs");
    After the TensorFlow model is warmed up, the following requests are sent. Compared with the warm-up requests, these requests fetch an additional output. In this case, the time-consuming loading is performed again.
    request.addFeed("input_holder",TFDataType.DT_STRING, new long[]{1}, input);
    request.addFeed("threshold", TFDataType.DT_FLOAT, new long[] {}, th);
    request.addFeed("model_id", TFDataType.DT_STRING, new long[]{}, model_name);
    
    request.addFetch("sorted_labels");
    request.addFetch("sorted_probs");
    request.addFetch("model_version_id"); // An additional parameter will be returned. 
  • Solution

    Each TensorFlow model needs to be warmed up based on actual service requests. The warm-up result applies only to the inputs and outputs of the specified service requests. Therefore, you must use the model warm-up feature of EAS based on actual service requests.

    To warm up a TensorFlow model, session->Run needs to be executed only once with the inputs and outputs of the actual service requests. You can upload one warm-up request file at a time and warm up the model by using the inputs and outputs specified in that file.
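    For the example model above, this means the warm-up request file should be built with exactly the feeds and fetches that production requests use, including the model_version_id fetch. The following Python snippet is a minimal sketch; the signature name and feed values are hypothetical placeholders, and depending on the SDK version, string inputs may need to be passed as bytes.
      from eas_prediction import TFRequest

      # Hypothetical signature name and feed values; use the values from your actual service requests.
      req = TFRequest('serving_default')
      req.add_feed('input_holder', [1], TFRequest.DT_STRING, [b'sample input'])
      req.add_feed('threshold', [], TFRequest.DT_FLOAT, [0.5])
      req.add_feed('model_id', [], TFRequest.DT_STRING, [b'sample_model'])
      req.add_fetch('sorted_labels')
      req.add_fetch('sorted_probs')
      req.add_fetch('model_version_id')  # Include every fetch that production requests use.

      # Save the serialized request as the warm-up request file.
      with open('warm_up.bin', 'wb') as fw:
          fw.write(req.to_string())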