Platform for AI: Deploy inference services

Last Updated: Feb 27, 2024

PAI SDK for Python provides an easy-to-use HighLevel API that lets you deploy models to Machine Learning Platform for AI (PAI) and create inference services. This topic describes how to use the PAI SDK for Python to deploy inference services in PAI.

Introduction

You can use the following HighLevel APIs provided by the SDK to deploy models to Elastic Algorithm Service (EAS) of PAI and perform test calls: pai.model.Model and pai.predictor.Predictor.

Perform the following steps to create an inference service by using the SDK:

  • Specify the configurations of the model inference service by using pai.model.InferenceSpec. The configurations include information about the processor or image that you want to use.

  • Create a pai.model.Model object by using the configured InferenceSpec instance and the model data.

  • Call the pai.model.Model.deploy() method to specify information about the service, such as the resources and the service name, and create an inference service in PAI.

  • The deploy method returns a pai.predictor.Predictor object. Call the predict method of the Predictor object to send a prediction request to the inference service.

Sample code:

from pai.model import InferenceSpec, Model, container_serving_spec
from pai.image import retrieve, ImageScope

#1. Specify a PyTorch image that can be used for model inference.
torch_image = retrieve("PyTorch", framework_version="latest",
    image_scope=ImageScope.INFERENCE)


#2. Use InferenceSpec to describe the configurations of model inference.
inference_spec = container_serving_spec(
    # The startup command of the inference service.
    command="python app.py",
    # The on-premises directory that contains the inference code.
    source_dir="./src/",
    # The image used for the inference service.
    image_uri=torch_image.image_uri,
)


#3. Create a model object for model deployment.
model = Model(
    # Use a model file stored in the OSS bucket.
    model_data="oss://<YourBucket>/path-to-model-data",
    inference_spec=inference_spec,
)

#4. Deploy the model to EAS, create an online inference service, and obtain the Predictor object.
predictor = model.deploy(
    service_name="example_torch_service",
    instance_type="ecs.c6.xlarge",
)

#5. Test the inference service. data is the prediction input that your service code expects.
res = predictor.predict(data=data)

Configure the InferenceSpec of the model

You can deploy the inference service by using a processor or an image. pai.model.InferenceSpec describes the configuration of model inference and is used to create the inference service. The configuration includes information such as whether a processor or an image is used for deployment, the storage configuration of the service, the warm-up configuration of the model service, and the remote procedure call (RPC) batching configuration.

Deploy the inference service by using a built-in processor

A processor is a package that contains online prediction logic. It can build an inference service based on a provided model. PAI provides built-in processors that support common machine learning model formats, such as TensorFlow SavedModel, PyTorch TorchScript, XGBoost, LightGBM, and PMML. For more information about model formats, see TensorFlow SavedModel, PyTorch TorchScript, XGBoost, LightGBM, and PMML. For more information about processors provided by PAI, see Built-in processors.

Example:

# Use a built-in TensorFlow processor.
tf_infer_spec = InferenceSpec(processor="tensorflow_cpu_2.3")


# Use a built-in PyTorch processor.
tf_infer_spec = InferenceSpec(processor="pytorch_cpu_1.10")

# Use a built-in XGBoost processor.
xgb_infer_spec = InferenceSpec(processor="xgboost")

You can configure additional features for the inference service on the InferenceSpec instance, such as the warm-up file or the RPC configuration of the service. For more information about service parameters, see Parameters of model services.

# Configure the properties of InferenceSpec.
tf_infer_spec.warm_up_data_path = "oss://<YourOssBucket>/path/to/warmup-data"  # Configure the path of the service warm-up file.
tf_infer_spec.metadata.rpc.keepalive = 1000  # Configure the keepalive duration of the request link.

print(tf_infer_spec.warm_up_data_path)
print(tf_infer_spec.metadata.rpc.keepalive)

Deploy the inference service by using an image

Processors can simplify model deployment, but may not be suitable for scenarios that have complex or custom requirements, such as scenarios in which models or inference services have complex dependencies. For scenarios that require high flexibility, you can deploy your model by using an image.

You can package the code and dependencies of the model into a Docker image and push the Docker image to Alibaba Cloud Container Registry (ACR). Then, you can build an InferenceSpec based on the Docker image to deploy the model.

from pai.model import Model, InferenceSpec, container_serving_spec

# Use container_serving_spec to build an InferenceSpec from an image.
container_infer_spec = container_serving_spec(
    # The image used to run the inference service.
    image_uri="<CustomImageUri>",
    # The port on which the inference service that runs in the container listens. The prediction request is forwarded by PAI to this port of the service container.
    port=8000,
    # The environment variables to be set in the service container.
    environment_variables=environment_variables,
    # The startup command of the inference service.
    command=command,
    # The Python packages on which the inference service depends.
    requirements=[
        "scikit-learn",
        "fastapi==0.87.0",
    ],
)


print(container_infer_spec.to_dict())

m = Model(
    model_data="oss://<YourOssBucket>/path-to-tensorflow-saved-model",
    inference_spec=container_infer_spec,
)
p = m.deploy(
    instance_type="ecs.c6.xlarge"
)

In most cases, when you deploy a model by using a custom image, you need to prepare the code that runs in the container, build an image, and push the image to an image repository. PAI SDK for Python simplifies this process: you can combine your on-premises code with a base image to create an inference service, without the need to manually build an image. When you call the pai.model.container_serving_spec() method, you can specify an on-premises code directory by using the source_dir parameter. The SDK packages and uploads the code directory to an OSS bucket and mounts the OSS path to the running container, so that the startup command can start the inference service.

from pai.model import container_serving_spec

inference_spec = container_serving_spec(
    # The on-premises directory of the inference code. The directory is uploaded to an OSS bucket and mounted to the container. Default mount path: /ml/usercode/.
    source_dir="./src",
    # The command to start the service. If you specify source_dir, then /ml/usercode is automatically used as the working directory. 
    command="python run.py",
    image_uri="<ServingImageUri>",
    requirements=[
        "fastapi",
        "uvicorn",
    ]
)
print(inference_spec.to_dict())
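
For reference, the following is a minimal sketch of what run.py in the preceding example might contain. It assumes a FastAPI server that listens on port 8000, the port to which PAI forwards prediction requests in the earlier example. The route and the echo logic are illustrative placeholders that you replace with your own model loading and inference code.

# run.py: a minimal sketch of a custom inference server. The route and the
# echo response below are illustrative placeholders, not part of the SDK.
import uvicorn
from fastapi import FastAPI, Request

app = FastAPI()


@app.post("/")
async def predict(request: Request):
    # Read the JSON request body that PAI forwards to the container.
    payload = await request.json()
    # Replace this echo with real model inference logic.
    return {"prediction": payload}


if __name__ == "__main__":
    # Listen on the port to which PAI forwards prediction requests.
    uvicorn.run(app, host="0.0.0.0", port=8000)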

If you have additional code or models that you want to prepare in the container of the inference service, you can use the pai.model.InferenceSpec.mount() method to mount an on-premises directory or an OSS path to the container of the online service.

# Upload on-premises data to OSS and mount the OSS path to the /ml/tokenizers directory of the container.
inference_spec.mount("./bert_tokenizers/", "/ml/tokenizers/")

# Mount the OSS path to the /ml/data directory of the container.
inference_spec.mount("oss://<YourOssBucket>/path/to/data/", "/ml/data/")

Obtain public images provided by PAI

PAI provides public images for common machine learning frameworks, such as TensorFlow, PyTorch, and XGBoost, that you can use for model inference. You can use these images to quickly create inference services. To obtain an image that can be used for inference services, set image_scope to ImageScope.INFERENCE when you call the pai.image.list_images or pai.image.retrieve method. Then, deploy the model by using the image.

from pai.image import retrieve, ImageScope, list_images

# Obtain all PyTorch images provided by PAI that can be used for inference services.
for image_info in list_images(framework_name="PyTorch", image_scope=ImageScope.INFERENCE):
    print(image_info)


# Obtain a PyTorch 1.12 image that can be used in CPU-based inference.
retrieve(framework_name="PyTorch", framework_version="1.12", image_scope=ImageScope.INFERENCE)

# Obtain a PyTorch 1.12 image that can be used in GPU-based inference.
retrieve(framework_name="PyTorch", framework_version="1.12", accelerator_type="GPU", image_scope=ImageScope.INFERENCE)

# Obtain the most recent PyTorch image that can be used in GPU-based inference.
retrieve(framework_name="PyTorch", framework_version="latest", accelerator_type="GPU", image_scope=ImageScope.INFERENCE)

Deploy the online inference service

You can build a pai.model.Model object by using pai.model.InferenceSpec and the model data specified by model_data. Then, you can deploy the model by calling the .deploy method. model_data can be an OSS URI or an on-premises path. If you specify an on-premises path, the model files stored in the path are uploaded to an OSS bucket and then made available to the inference service.

When you call the .deploy method to deploy a model, you need to specify parameters such as the required resource configuration, the number of service instances, and the service name. For more information about advanced parameters, see Parameters of model services.

from pai.model import Model

model = Model(
    # model_data specifies the path of the model, which can be an OSS URI or an on-premises path. By default, models that are stored in on-premises paths are uploaded to OSS buckets. 
    model_data="oss://<YourBucket>/path-to-model-data",
    inference_spec=inference_spec,
)

# Deploy an online inference service in EAS.
predictor = model.deploy(
    # The name of the inference service.
    service_name="example_xgb_service",
    # The instance type used for the inference service.
    instance_type="ecs.c6.xlarge",
    # The number of instances used for the inference service.
    instance_count=2,
    # Optional. Your dedicated resource group. By default, a public resource group is used.
    # resource_id="<YOUR_EAS_RESOURCE_GROUP_ID>",
    options={
        "metadata.rpc.batching": True,
        "metadata.rpc.keepalive": 50000,
        "metadata.rpc.max_batch_size": 16,
        "warm_up_data_path": "oss://<YourOssBucketName>/path-to-warmup-data",
    },
)

You can use the resource_config parameter to specify the resources, such as the number of vCPUs and the memory size, that are used by each service instance. Sample code:

from pai.model import ResourceConfig

predictor = model.deploy(
    service_name="dedicated_rg_service",
    # Configure the CPU and memory size of a single service instance.
    # In this example, each service instance uses two vCPUs and 4000 MB of memory.
    resource_config=ResourceConfig(
        cpu=2,
        memory=4000,
    ),
)

Send requests to the inference service

pai.model.Model.deploy() calls the EAS API to create an inference service and returns a pai.predictor.Predictor object. You can use the predict and raw_predict methods of the Predictor object to send prediction requests to the inference service.

Note

The input and output of the pai.predictor.Predictor.raw_predict method are not processed by the serializer.

from pai.predictor import Predictor, EndpointType, RawResponse

# Create an inference service.
predictor = model.deploy(
    instance_type="ecs.c6.xlarge",
    service_name="example_xgb_service",
)

# Use an existing inference service.
predictor = Predictor(
    service_name="example_xgb_service",
    # Internet is used by default. If your client code is deployed in a virtual private cloud (VPC), you can set the endpoint type to INTRANET.
    # endpoint_type=EndpointType.INTRANET,
)

# The .predict method sends a request to the inference service and obtains the result. The input and output are processed by the serializer.
res = predictor.predict(data_in_nested_list)


# The .raw_predict method sends requests to the inference service in a more flexible manner.
response: RawResponse = predictor.raw_predict(
    # Data in bytes and file-like objects can be passed directly in the HTTP request body.
    # Other data is serialized into JSON-formatted data and then passed in the HTTP request body.
    data=data_in_nested_list,
    # path="predict",  # If the inference service listens on a custom path, such as /predict, you can send prediction requests to the path specified by the path parameter.
    # headers=dict(),  # Custom request headers.
    # method="POST",  # Custom HTTP method.
    # timeout=30,  # Custom request timeout period. Unit: seconds.
)

# Obtain the returned body and headers.
print(response.content, response.headers)
# Deserialize the returned JSON result into a Python object.
print(response.json())

    
# Stop the inference service.
predictor.stop_service()
# Start the inference service.
predictor.start_service()
# Delete the inference service.
predictor.delete_service()

Use a serializer to process the input and output of the inference service

When you use the SDK to send requests to an inference service, the input Python data must be serialized into a format that the service supports, and the response returned by the service must be deserialized into a readable and operable Python object. The SDK serializes request data and deserializes response data by using the serializer parameter.

  • When you call the predict(data=<PredictionData>) method, the request data passed by the data parameter is serialized into bytes by the serializer.serialize method and then sent to the prediction service in the HTTP request body.

  • When the inference service returns the response through HTTP, the Predictor object deserializes the response through the serializer.deserialize method, and returns the result in the predict method.

PAI SDK for Python provides built-in serializers that handle common data serialization tasks, including the input and output of the built-in processors.

  • JsonSerializer

JsonSerializer supports serialization and deserialization of data in the JSON format. The data passed to the predict method can be a numpy.ndarray or a list. The JsonSerializer.serialize method serializes the input data into a JSON string. The JsonSerializer.deserialize method deserializes the returned JSON string into a Python object.

Built-in processors such as the XGBoost processor and the PMML processor receive requests and return responses in the JSON format. By default, the Predictor uses JsonSerializer to process the input and output of services that are created by using these processors.

from pai.serializers import JsonSerializer

# Use the .deploy method to specify the serializer that you want to use.
p = Model(
    inference_spec=InferenceSpec(processor="xgboost"),
    model_data="oss://<YourOssBucket>/path-to-xgboost-model"
).deploy(
    instance_type="ecs.c6.xlarge",
    # Optional. By default, the service that uses the XGBoost processor uses JsonSerializer.
    serializer=JsonSerializer()
)

# You can also specify the serializer when you create a predictor.
p = Predictor(
    service_name="example_xgb_service"
    serializer=JsonSerializer(),
)

# The prediction result that is returned is a list.
res = p.predict([[2,3,4], [4,5,6]])

  • TensorFlowSerializer

PAI provides TensorFlow processors that allow you to directly deploy a TensorFlow SavedModel to PAI to create an inference service. For more information, see TensorFlow. The input and output data is in the format of Protocol Buffers. For more information about the format, see tf_predict.proto.

The SDK provides a preset TensorFlowSerializer that allows you to send requests to the inference service by using numpy.ndarray. The serializer uses numpy.ndarray to generate Protocol Buffers messages and deserializes the received Protocol Buffers messages into numpy.ndarray.

import numpy

# Create a service that uses the TensorFlow processor.
tf_predictor = Model(
    inference_spec=InferenceSpec(processor="tensorflow_cpu_2.7"),
    model_data="oss://<YourOssBucket>/path-to-tensorflow-saved-model"
).deploy(
    instance_type="ecs.c6.xlarge",
    # Optional. By default, the service that uses the TensorFlow processor uses TensorFlowSerializer.
    # serializer=TensorFlowSerializer(),
)

# If your service uses the TensorFlow processor, you can obtain the service signature by using an API.
print(tf_predictor.inspect_signature_def())

# The input of the TensorFlow processor is a dictionary. The key is the name of the input signature, and the value is the input data.
tf_result = tf_predictor.predict(data={
    "flatten_input": numpy.zeros(28*28*2).reshape((-1, 28, 28))
})

assert tf_result["dense_1"].shape == (2, 10)

  • PyTorchSerializer

PAI provides built-in PyTorch processors that allow you to use models in the TorchScript format to deploy inference services. For more information, see PyTorch and the PyTorch website. The input and output are in the format of Protocol Buffers. For more information about the format, see pytorch_predict_proto.

The SDK provides a PyTorchSerializer that allows you to send requests to the inference service by using numpy.ndarray. The serializer uses numpy.ndarray to generate Protocol Buffers messages and deserializes the received Protocol Buffers messages into numpy.ndarray.

import numpy

# Create a service that uses the PyTorch processor.
torch_predictor = Model(
    inference_spec=InferenceSpec(processor="pytorch_cpu_1.10"),
    model_data="oss://<YourOssBucket>/path-to-torch_script-model"
).deploy(
    instance_type="ecs.c6.xlarge",
    # Optional. By default, the service that uses the PyTorch processor uses PyTorchSerializer.
    # serializer=PyTorchSerializer(),
)

#1. Convert the input data into a format supported by the model. 
#2. Use list or tuple to input multiple data entries. Each entry is a numpy.ndarray.
torch_result = torch_predictor.predict(data=numpy.zeros(28 * 28 * 2).reshape((-1, 28, 28)))
assert torch_result.shape == (2, 10)

  • Custom serializer

You can use pai.serializers.SerializerBase to implement a custom serializer class based on the data formats that are supported by the inference service.

The following example shows the process in which a custom NumpySerializer is used:

  1. Client: When you call the predict method, a numpy.ndarray or pandas.DataFrame is passed in. The NumpySerializer.serialize method serializes the input data into the npy format and sends the data to the server.

  2. Server: The inference service receives data in the npy format, deserializes the data, obtains the inference result, and then serializes the output data to the npy format and returns the data.

  3. Client: Receives the response in the npy format and deserializes it into a numpy.ndarray by using the NumpySerializer.deserialize method.

import io
from typing import Union

import numpy as np
import pandas as pd

from pai.serializers import SerializerBase

class NumpySerializer(SerializerBase):

    def serialize(self, data: Union[np.ndarray, pd.DataFrame, bytes, str]) -> bytes:
        """Serialize input python object to npy format"""
        if isinstance(data, bytes):
            return data
        elif isinstance(data, str):
            return data.encode()
        elif isinstance(data, pd.DataFrame):
            data = data.to_numpy()

        res = io.BytesIO()
        np.save(res, data)
        res.seek(0)
        return res.read()

    def deserialize(self, data: bytes) -> np.ndarray:
        """Deserialize prediction response to numpy.ndarray"""
        f = io.BytesIO(data)
        return np.load(f)


# Create an inference service whose input and output use the npy format.
predictor = Model(
    inference_spec=infer_spec,
    model_data="oss://<YourOssBucket>/path-to-model"
).deploy(
    instance_type="ecs.c6.xlarge",

    # Use a custom serializer.
    serializer=NumpySerializer(),
)

# input_data is a numpy.ndarray or a pandas.DataFrame.
res = predictor.predict(data=input_data)

assert isinstance(input_data, np.ndarray)
assert isinstance(res, np.ndarray)
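
For reference, the following is a minimal sketch of the server side that matches this NumpySerializer: it reads an npy request body, runs inference, and returns an npy response. The FastAPI framework, the route, and the identity "model" are illustrative assumptions, not part of the SDK.

# A sketch of an inference server whose input and output use the npy format.
import io

import numpy as np
import uvicorn
from fastapi import FastAPI, Request, Response

app = FastAPI()


@app.post("/")
async def predict(request: Request) -> Response:
    # Deserialize the npy request body that NumpySerializer.serialize produced.
    data = np.load(io.BytesIO(await request.body()))
    # Replace this identity call with real model inference.
    result = data
    # Serialize the result back to the npy format for NumpySerializer.deserialize.
    buf = io.BytesIO()
    np.save(buf, result)
    return Response(content=buf.getvalue(), media_type="application/octet-stream")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)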

Deploy and test the inference service in an on-premises environment

Model services that are deployed by using a custom image can run in an on-premises environment by using PAI SDK for Python. To run a service in an on-premises environment, set the instance_type parameter to local in model.deploy. The SDK uses Docker to start the model service on your on-premises machine. The required model is downloaded from the OSS bucket to your on-premises machine and then mounted into the container that runs on the machine.

from pai.predictor import LocalPredictor
from pai.serializers import JsonSerializer

p: LocalPredictor = model.deploy(
    # Set instance_type to local.
    instance_type="local",
    serializer=JsonSerializer()
)

p.predict(data)

# Delete the docker container.
p.delete_service()