
Platform For AI: Deploy inference services

Last Updated: Mar 11, 2026

Deploy models to PAI and create inference services using high-level APIs from the PAI Python SDK.

Deployment workflow

The SDK provides high-level APIs pai.model.Model and pai.predictor.Predictor to deploy models to EAS and test services.

Basic flow to create an inference service:

  • Define inference service configuration in a pai.model.InferenceSpec object, including processor or runtime image information.

  • Create a pai.model.Model object using the InferenceSpec object and model file for deployment.

  • Call pai.model.Model.deploy() to create an inference service in PAI. Specify resources, service name, and other configuration.

  • The deploy method returns a pai.predictor.Predictor object. Use its predict method to send inference requests to the service.

Example:

from pai.model import InferenceSpec, Model, container_serving_spec
from pai.image import retrieve, ImageScope

# 1. Retrieve a PyTorch inference runtime image provided by PAI.
torch_image = retrieve("PyTorch", framework_version="latest",
    image_scope=ImageScope.INFERENCE)


# 2. Define inference configuration using InferenceSpec.
inference_spec = container_serving_spec(
    # Start command for the inference service.
    command="python app.py",
    source_dir="./src/",
    # Inference runtime image.
    image_uri=torch_image.image_uri,
)


# 3. Build Model object for deployment.
model = Model(
    # Model file from an OSS Bucket.
    model_data="oss://<YourBucket>/path-to-model-data",
    inference_spec=inference_spec,
)

# 4. Deploy model to PAI-EAS to create an online inference service and return a Predictor object.
predictor = model.deploy(
    service_name="example_torch_service",
    instance_type="ecs.c6.xlarge",
)

# 5. Test the inference service. `data` is the prediction input for your model.
res = predictor.predict(data=data)

The following sections describe code configurations for deploying an inference service.

Configure InferenceSpec

Deploy an inference service using a processor or runtime image. The pai.model.InferenceSpec object defines service configuration, including deployment method, storage settings, warm-up settings, and RPC batching. Use the configured InferenceSpec object to create the service.

Use a built-in processor

A processor is a PAI abstraction for an inference service package that builds an inference service directly from a user-provided model. PAI provides built-in processors supporting common machine learning model formats such as TensorFlow SavedModel, PyTorch TorchScript, XGBoost, LightGBM, and PMML. For a complete list, see Built-in processors.

  • To deploy a model using a processor, configure InferenceSpec as shown below:

    # Use a built-in TensorFlow processor.
    tf_infer_spec = InferenceSpec(processor="tensorflow_cpu_2.3")
    
    
    # Use a built-in PyTorch processor.
    torch_infer_spec = InferenceSpec(processor="pytorch_cpu_1.10")
    
    # Use a built-in XGBoost processor.
    xgb_infer_spec = InferenceSpec(processor="xgboost")
    
  • Configure additional features for the inference service on the InferenceSpec instance, such as warm-up data file or RPC settings. For a complete list of service parameters, see JSON-based deployment.

    # Configure InferenceSpec properties directly.
    tf_infer_spec.warm_up_data_path = "oss://<YourOssBucket>/path/to/warmup-data" # Path to warm-up data file.
    tf_infer_spec.metadata.rpc.keepalive = 1000 # Keep-alive duration for request connection.
    
    print(tf_infer_spec.warm_up_data_path)
    print(tf_infer_spec.metadata.rpc.keepalive)
    

Use a runtime image

While deploying with a processor is straightforward, it lacks support for flexible custom configurations. For models or inference services with complex dependencies, PAI supports deployment using a runtime image, providing greater flexibility.

  • Package model service code and dependencies into a Docker image and push to an Alibaba Cloud ACR image repository. Then build an InferenceSpec from the Docker image for deployment.

    from pai.model import InferenceSpec, Model, container_serving_spec
    
    # Build InferenceSpec for a model served by a runtime image.
    container_infer_spec = container_serving_spec(
        # Runtime image to run the inference service.
        image_uri="<CustomImageUri>",
        # Port that the inference service listens on in the container. PAI forwards prediction requests to this port.
        port=8000,
        # Environment variables for the inference service.
        environment_variables={"<EnvVarName>": "<EnvVarValue>"},
        # Start command for the inference service.
        command="<StartCommand>",
        # Python packages that the inference service depends on.
        requirements=[
            "scikit-learn",
            "fastapi==0.87.0",
        ],
    )
    
    
    print(container_infer_spec.to_dict())
    
    m = Model(
        model_data="oss://<YourOssBucket>/path-to-tensorflow-saved-model",
        inference_spec=container_infer_spec,
    )
    p = m.deploy(
        instance_type="ecs.c6.xlarge"
    )
  • Deploying with a custom image typically requires preparing inference code, integrating it into a container, and building and pushing the image to a repository. The PAI SDK simplifies this process by building an inference service from local code and a base image without manually building the image. Use the source_dir parameter in pai.model.container_serving_spec() to specify a local code directory. The SDK automatically packages and uploads this directory to an OSS Bucket and mounts it to a path in the running container. Use the specified start command to start the service.

    from pai.model import container_serving_spec
    
    inference_spec = container_serving_spec(
        # Path to local directory with your inference program. It is uploaded to an OSS Bucket and mounted into the running container. Default mount path is /ml/usercode/.
        source_dir="./src",
        # Service start command. When source_dir is specified, the command executes with /ml/usercode as the default working directory.
        command="python run.py",
        image_uri="<ServingImageUri>",
        requirements=[
            "fastapi",
            "uvicorn",
        ]
    )
    print(inference_spec.to_dict())
  • To import additional data, code, or models into the inference service container, use pai.model.InferenceSpec.mount() to mount a local directory or OSS data path to the online service container.

    # Upload local data to OSS and mount to /ml/tokenizers/ directory in the container.
    inference_spec.mount("./bert_tokenizers/", "/ml/tokenizers/")
    
    # Mount data stored in OSS to /ml/data/ directory in the container.
    inference_spec.mount("oss://<YourOssBucket>/path/to/data/", "/ml/data/")
    
  • Retrieve PAI public images

    PAI provides inference runtime images for common frameworks such as TensorFlow, PyTorch, and XGBoost. Pass image_scope=ImageScope.INFERENCE to pai.image.list_images and pai.image.retrieve to retrieve corresponding inference runtime images for model deployment.

    from pai.image import retrieve, ImageScope, list_images
    
    # Get all PyTorch inference runtime images provided by PAI.
    for image_info in list_images(framework_name="PyTorch", image_scope=ImageScope.INFERENCE):
        print(image_info)
    
    
    # Get PyTorch 1.12 CPU inference runtime image provided by PAI.
    retrieve(framework_name="PyTorch", framework_version="1.12", image_scope=ImageScope.INFERENCE)
    
    # Get PyTorch 1.12 GPU inference runtime image provided by PAI.
    retrieve(framework_name="PyTorch", framework_version="1.12", accelerator_type="GPU", image_scope=ImageScope.INFERENCE)
    
    # Get latest PyTorch GPU inference runtime image provided by PAI.
    retrieve(framework_name="PyTorch", framework_version="latest", accelerator_type="GPU", image_scope=ImageScope.INFERENCE)
    

Deploy and invoke online inference services

Deploy an inference service

Build a pai.model.Model object using a pai.model.InferenceSpec object and model data address specified by model_data. Deploy the model by calling .deploy. The model_data parameter accepts an OSS URI or local path. If a local path is specified, the model file is uploaded to an OSS Bucket and prepared in the inference service for loading and use.

When calling .deploy to deploy the model, specify service parameters such as required resource configuration, number of service instances, and service name. For advanced parameters, see JSON-based deployment.

from pai.model import Model, InferenceSpec
from pai.predictor import Predictor

model = Model(
    # Path to model_data. It can be an OSS URI or local path. For a local path, it is uploaded to an OSS Bucket by default.
    model_data="oss://<YourBucket>/path-to-model-data",
    inference_spec=inference_spec,
)

# Deploy to EAS.
predictor = model.deploy(
    # Name of the inference service.
    service_name="example_xgb_service",
    # Machine type used by the service.
    instance_type="ecs.c6.xlarge",
    # Number of machine instances/services.
    instance_count=2,
    # Dedicated resource group. This is optional. Public resource group is used by default.
    # resource_id="<YOUR_EAS_RESOURCE_GROUP_ID>",
    options={
        "metadata.rpc.batching": True,
        "metadata.rpc.keepalive": 50000,
        "metadata.rpc.max_batch_size": 16,
        "warm_up_data_path": "oss://<YourOssBucketName>/path-to-warmup-data",
    },
)

To configure the service based on specific resource requirements such as CPU and memory, use resource_config to allocate resources to each service instance:

from pai.model import ResourceConfig

predictor = m.deploy(
    service_name="dedicated_rg_service",
    # Specify CPU and memory resources used by a single service instance.
    # In this example, each service instance uses 2 CPU cores and 4000 MB of memory.
    resource_config=ResourceConfig(
        cpu=2,
        memory=4000,
    ),
)

Invoke an inference service

The pai.model.Model.deploy method calls the EAS API to create a new inference service and returns a pai.predictor.Predictor object pointing to the newly created service. The Predictor object provides predict and raw_predict methods to send prediction requests to the service.

Note

The input and output of pai.predictor.Predictor.raw_predict do not require processing by a Serializer.

from pai.predictor import Predictor, EndpointType, RawResponse

# Create a new inference service.
predictor = model.deploy(
    instance_type="ecs.c6.xlarge",
    service_name="example_xgb_service",
)

# Use an existing inference service.
predictor = Predictor(
    service_name="example_xgb_service",
    # By default, the service is accessed over the Internet. To access it over a VPC,
    # uncomment the following line; the client code must run inside the VPC.
    # endpoint_type=EndpointType.INTRANET
)

# .predict sends a data request to the corresponding service and gets the response. Input data and response are processed by the serializer.
res = predictor.predict(data_in_nested_list)


# .raw_predict provides a more flexible way to send requests to the inference service.
response: RawResponse = predictor.raw_predict(
    # If input data is bytes or a file-like object, the request data is passed directly in the HTTP request body.
    # Otherwise, it is first serialized into JSON and then passed in the HTTP request body.
    data=data_in_nested_list,
    # path="predict",           # Custom HTTP request path. By default, requests are sent to the "/" path.
    # headers=dict(),           # Custom request headers.
    # method="POST",            # Custom HTTP method.
    # timeout=30,               # Custom request timeout.
)

# Get returned body and headers.
print(response.content, response.headers)
# Deserialize returned result from JSON into a Python object.
print(response.json())


# Stop the inference service.
predictor.stop_service()
# Start the inference service.
predictor.start_service()
# Delete the inference service.
predictor.delete_service()

Process input and output with Serializers

When invoking an inference service using the SDK's pai.predictor.Predictor.predict method, the input Python data structure must be serialized into a format the service can process. Similarly, the response data from the service must be deserialized into a readable Python object. The Predictor uses the serializer parameter to serialize prediction data and deserialize prediction responses.

  • When calling predict(data=<PredictionData>), the data parameter is serialized by serializer.serialize. This produces a bytes object, which is then passed to the inference service in the HTTP request body.

  • After the inference service returns an HTTP response, the Predictor object deserializes the HTTP response body using serializer.deserialize. The result is then returned by predict.

The SDK provides built-in serializers supporting common data types and handling input/output data for PAI's built-in deep learning processors.

  • JsonSerializer

    The JsonSerializer object supports serialization and deserialization of data in JSON format. The data passed to predict can be a numpy.ndarray or a List. JsonSerializer.serialize serializes the array into a JSON string. JsonSerializer.deserialize deserializes the received JSON string into a Python object.

    PAI's built-in processors such as XGBoost and PMML accept and return data in JSON format. For services created using these processors, the Predictor uses JsonSerializer by default.

    from pai.serializers import JsonSerializer
    
    # Specify serializer for the returned predictor in the ".deploy" method.
    p = Model(
        inference_spec=InferenceSpec(processor="xgboost"),
        model_data="oss://<YourOssBucket>/path-to-xgboost-model"
    ).deploy(
        instance_type="ecs.c6.xlarge",
        # Optional: Services using XGBoost processor use JsonSerializer by default.
        serializer=JsonSerializer()
    )
    
    # Or, specify serializer directly when creating the Predictor.
    p = Predictor(
        service_name="example_xgb_service",
        serializer=JsonSerializer(),
    )
    
    # The prediction result is also a list.
    res = p.predict([[2,3,4], [4,5,6]])
  • TensorFlowSerializer

    Use a PAI built-in TensorFlow processor to deploy a TensorFlow SavedModel to PAI and create an inference service. The input and output message format for this service is Protocol Buffers. For more information about the file format, see tf_predict.proto.

    The SDK provides a built-in TensorFlowSerializer to send requests with data of the numpy.ndarray type. The serializer converts numpy.ndarray data into corresponding Protocol Buffers messages and deserializes received Protocol Buffers messages back into numpy.ndarray data type.

    import numpy

    # Create a TensorFlow processor service.
    tf_predictor = Model(
        inference_spec=InferenceSpec(processor="tensorflow_cpu_2.7"),
        model_data="oss://<YourOssBucket>/path-to-tensorflow-saved-model"
    ).deploy(
        instance_type="ecs.c6.xlarge",
        # Optional: Services using TensorFlow processor use TensorFlowSerializer by default.
        # serializer=TensorFlowSerializer(),
    )
    
    # Services started with a TensorFlow processor allow getting the model's service signature through an API.
    print(tf_predictor.inspect_signature_def())
    
    # Input for a TensorFlow processor requires a Dict. The key is the name of the model's input signature, and the value is the specific input data.
    tf_result = tf_predictor.predict(data={
        "flatten_input": numpy.zeros(28*28*2).reshape((-1, 28, 28))
    })
    
    assert tf_result["dense_1"].shape == (2, 10)
  • PyTorchSerializer

    Use a PAI built-in PyTorch processor to deploy a model in the TorchScript format as an inference service. The input and output data format for this service is Protocol Buffers. For more information about the file format, see pytorch_predict_proto.

    The SDK provides a built-in PyTorchSerializer to send requests with data of the numpy.ndarray type and converts the prediction result to a numpy.ndarray. The PyTorchSerializer handles conversion between Protocol Buffers messages and numpy.ndarray.

    import numpy

    # Create a service that uses a PyTorch processor.
    torch_predictor = Model(
        inference_spec=InferenceSpec(processor="pytorch_cpu_1.10"),
        model_data="oss://<YourOssBucket>/path-to-torch_script-model"
    ).deploy(
        instance_type="ecs.c6.xlarge",
        # Optional: Services using PyTorch processor use PyTorchSerializer by default.
        # serializer=PyTorchSerializer(),
    )
    
    # 1. Reshape input data into a shape that the model supports.
    # 2. If there are multiple inputs, pass them as a List/Tuple. Each item in the list is a numpy.ndarray.
    torch_result = torch_predictor.predict(data=numpy.zeros(28 * 28 * 2).reshape((-1, 28, 28)))
    assert torch_result.shape == (2, 10)
  • Custom Serializer

    Customize a Serializer class based on the data formats supported by the inference service. The custom Serializer class must inherit from pai.serializers.SerializerBase and implement serialize and deserialize methods.

    The following example shows a custom NumpySerializer. When predict is called, the overall flow is:

    1. The client passes a numpy.ndarray or a pandas.DataFrame as input to predict. The NumpySerializer.serialize method then serializes the input into npy format and sends it to the server.

    2. The server receives the data in npy format, deserializes it, and generates an inference result. The service then serializes the result into npy format and returns it.

    3. The client receives the returned data in npy format. The NumpySerializer.deserialize method then deserializes the data into a numpy.ndarray.
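    The client side of this flow can be sketched as follows. This is a minimal illustration, not code from the SDK: the class is shown standalone so it runs without the PAI SDK installed, but in a real project it would inherit from pai.serializers.SerializerBase and be passed to deploy or Predictor via the serializer parameter.

    ```python
    import io

    import numpy as np


    class NumpySerializer:
        """Sketch of a custom serializer that exchanges data in npy format.

        A real implementation would subclass pai.serializers.SerializerBase.
        """

        def serialize(self, data) -> bytes:
            # Convert the input (numpy.ndarray or pandas.DataFrame) to an
            # ndarray and write it to an in-memory buffer in npy format.
            buffer = io.BytesIO()
            np.save(buffer, np.asarray(data))
            return buffer.getvalue()

        def deserialize(self, data: bytes) -> np.ndarray:
            # Parse the npy-formatted response body back into a numpy.ndarray.
            return np.load(io.BytesIO(data))
    ```

    With this class in place, a hypothetical service that speaks the npy format could be invoked as `predictor = model.deploy(..., serializer=NumpySerializer())` followed by `predictor.predict(np.zeros((2, 3)))`.
    
    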

Deploy and invoke services locally

For custom image deployments, the SDK provides a local execution mode. This mode is not applicable to services deployed using a processor. Run the inference service locally by passing instance_type="local" to the model.deploy method. The SDK uses docker to start a model service locally and automatically downloads required model data from OSS and mounts it to the local container.

from pai.predictor import LocalPredictor

p: LocalPredictor = model.deploy(
    # Specify to run locally.
    instance_type="local",
    serializer=JsonSerializer()
)

p.predict(data)

# Delete the corresponding docker container.
p.delete_service()

Related information

For the complete process of training and deploying a PyTorch model using the PAI Python SDK, see Train and deploy a PyTorch model using the PAI Python SDK.