Triton Inference Server is an inference serving engine designed for deep learning and machine learning models. It supports deploying models from AI frameworks like TensorRT, TensorFlow, PyTorch, and ONNX for online inference. It also includes features like multi-model management and custom backends. This topic shows you how to deploy a Triton Inference Server model service using an image.
Deploy a service: single model
1. Prepare the model and configuration file
Create a model repository in an Object Storage Service (OSS) bucket and organize your model and configuration files using the required directory structure. For more information, see Manage directories.
Each model repository must contain at least one model version directory and a model configuration file:
Model version directory: Contains the model files. The directory name must be a number, which is the model version number. Larger numbers indicate newer versions.
Model configuration file: Provides basic information about the model. This file is typically named config.pbtxt.
Assuming the model repository is located at oss://examplebucket/models/triton/, the directory structure is as follows:
triton
└── resnet50_pt
    ├── 1
    │   └── model.pt
    ├── 2
    │   └── model.pt
    ├── 3
    │   └── model.pt
    └── config.pbtxt

The following is an example of a config.pbtxt file:
name: "resnet50_pt"
platform: "pytorch_libtorch"
max_batch_size: 128
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [ 3, -1, -1 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ 1000 ]
}
]
# Use a GPU for inference.
# instance_group [
# {
# kind: KIND_GPU
# }
# ]
# Model version configuration
# version_policy: { all { }}
# version_policy: { latest: { num_versions: 2}}
# version_policy: { specific: { versions: [1,3]}}

The following table describes the key parameters in the config.pbtxt file.
| Parameter | Required | Description |
| --- | --- | --- |
| name | No | The name of the model. By default, this is the name of the model repository directory. If you specify a name, it must match the directory name. |
| platform/backend | Yes | Specify at least one of the platform and backend parameters. For example, set platform to pytorch_libtorch for a PyTorch (LibTorch) model, or set backend to python when you use the Python backend. |
| max_batch_size | Yes | The maximum batch size for model requests. Set this parameter to 0 if the model does not support batching. |
| input | Yes | Specifies the properties of each input tensor: name (the tensor name), data_type (the data type), and dims (the dimensions, excluding the batch dimension; -1 indicates a variable-length dimension). |
| output | Yes | Specifies the properties of each output tensor: name, data_type, and dims, with the same meanings as for input. |
| instance_group | No | By default, if GPU resources are specified in your service configuration, inference runs on a GPU; otherwise, it runs on the CPU. You can also explicitly specify the inference resource by configuring instance_group and setting kind to KIND_GPU or KIND_CPU. |
| version_policy | No | Specifies which model versions to load. For example, version_policy: { all { }} loads all versions, version_policy: { latest: { num_versions: 2}} loads the latest two versions, and version_policy: { specific: { versions: [1,3]}} loads versions 1 and 3. If this parameter is not specified, the latest version is loaded by default. |
2. Deploy the Triton Inference Server service
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section, click Triton Deployment.
On the Triton Deployment page, configure the following key parameters. For information about other parameters, see Custom deployment.
| Parameter | Description |
| --- | --- |
| Service Name | Enter a custom service name. |
| Model Settings | In this solution, set Type to OSS and set OSS to the OSS storage path of the model that you prepared in Step 1, such as oss://examplebucket/models/triton/. (Optional) Enable gRPC: by default, the service starts only an HTTP service on port 8000. To support gRPC calls, click Convert to Custom Deployment in the upper-right corner of the page, change the Port Number in the Environment Information section to 8001, and enable gRPC. |
After you configure the parameters, click Deploy.
Deploy a multi-model service
The procedure for deploying a multi-model service in EAS is the same as deploying a single-model service. You only need to create model repositories as shown in the following example. The service loads all models and deploys them in a single service instance. For more information, see Deploy a service: single model.
triton
├── resnet50_pt
│   ├── 1
│   │   └── model.pt
│   └── config.pbtxt
├── densenet_onnx
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── mnist_savedmodel
    ├── 1
    │   └── model.savedmodel
    │       ├── saved_model.pb
    │       └── variables
    │           ├── variables.data-00000-of-00001
    │           └── variables.index
    └── config.pbtxt

Deploy a service: use a backend
A backend is the component that performs inference. It can call existing model frameworks (such as TensorRT, ONNX Runtime, PyTorch, or TensorFlow) or implement custom inference logic, such as model pre-processing or post-processing.
Backends can be written in C++ or Python. Python is more flexible and easier to use than C++. This section focuses on how to use the Python backend.
1. Update the model and configuration file
This section uses PyTorch as an example to show how to customize a model's computation logic using a Python backend. The model directory structure is as follows:
resnet50_pt
├── 1
│   ├── model.pt
│   └── model.py
└── config.pbtxt

Compared to a standard model directory structure, a backend requires a model.py file in the model version directory to define the custom inference logic. The config.pbtxt file must also be modified accordingly.
Customize the inference logic
The model.py file must define a class named TritonPythonModel and implement three key interface functions: initialize, execute, and finalize. The following is an example of the file content:

import json
import os

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """The class name must be "TritonPythonModel"."""

    def initialize(self, args):
        """
        The initializer function. This is optional. It is called once when the model is
        loaded and can be used to initialize information related to model properties and
        configurations.

        Parameters
        ----------
        args : A dictionary where both keys and values are strings. It includes:
          * model_config: Model configuration information in JSON format.
          * model_instance_kind: The device type.
          * model_instance_device_id: The device ID.
          * model_repository: The model repository path.
          * model_version: The model version.
          * model_name: The model name.
        """
        # Convert the model configuration content from a JSON string to a Python dictionary.
        self.model_config = model_config = json.loads(args["model_config"])

        # Get the output properties from the model configuration file.
        output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")

        # Convert Triton types to numpy types.
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])

        # Get the path of the model repository.
        self.model_directory = os.path.dirname(os.path.realpath(__file__))

        # Get the device used for model inference. This example uses a GPU.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print("device: ", self.device)

        model_path = os.path.join(self.model_directory, "model.pt")
        if not os.path.exists(model_path):
            raise pb_utils.TritonModelException("Cannot find the pytorch model")

        # Load the PyTorch model to the GPU using .to(self.device).
        self.model = torch.jit.load(model_path).to(self.device)

        print("Initialized...")

    def execute(self, requests):
        """
        The model execution function. This must be implemented. This function is called
        for every inference request. If the batch parameter is set, you must also
        implement the batch processing logic yourself.

        Parameters
        ----------
        requests : A list of requests of the pb_utils.InferenceRequest type.

        Returns
        -------
        A list of responses of the pb_utils.InferenceResponse type. The length of the
        list must be the same as the length of the request list.
        """
        output_dtype = self.output_dtype
        responses = []

        # Traverse the request list and create a corresponding response for each request.
        for request in requests:
            # Get the input tensor.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")

            # Convert the Triton tensor to a Torch tensor.
            pytorch_tensor = from_dlpack(input_tensor.to_dlpack())

            if pytorch_tensor.shape[2] > 1000 or pytorch_tensor.shape[3] > 1000:
                responses.append(
                    pb_utils.InferenceResponse(
                        output_tensors=[],
                        error=pb_utils.TritonError(
                            "Image shape should not be larger than 1000"
                        ),
                    )
                )
                continue

            # Perform inference computation on the GPU.
            prediction = self.model(pytorch_tensor.to(self.device))

            # Convert the Torch output tensor to a Triton tensor.
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(prediction))

            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
            responses.append(inference_response)

        return responses

    def finalize(self):
        """
        Called when the model is unloaded. This is optional and can be used for model
        cleanup tasks.
        """
        print("Cleaning up...")

Important

If you use a GPU for inference, setting instance_group.kind to KIND_GPU in the config.pbtxt file has no effect. You must load the model to the GPU by calling model.to(torch.device("cuda")) and move the input tensor to the GPU during request processing by calling pytorch_tensor.to(torch.device("cuda")). You can then use a GPU for inference by simply configuring GPU resources when you deploy the service.

If you use batching, setting the max_batch_size parameter in the config.pbtxt file alone is not enough. You must implement the request batching logic yourself within the execute function, as illustrated in the sketch after these notes.

You must return one response for each request.
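The following is a minimal sketch of what request batching inside execute might look like. It is not part of the original example; it assumes the same imports and initialize method as the model.py file above, and that every request carries an INPUT__0 tensor whose non-batch dimensions match.

# Sketch only: batch several requests into a single forward pass.
# Assumes the same imports and initialize() as the model.py example above.
def execute(self, requests):
    batch_tensors = []
    batch_sizes = []
    for request in requests:
        input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
        pytorch_tensor = from_dlpack(input_tensor.to_dlpack()).to(self.device)
        batch_tensors.append(pytorch_tensor)
        batch_sizes.append(pytorch_tensor.shape[0])

    # Run one forward pass over the concatenated batch.
    merged = torch.cat(batch_tensors, dim=0)
    predictions = self.model(merged)

    # Split the merged output back into one response per request.
    responses = []
    offset = 0
    for size in batch_sizes:
        out_tensor = pb_utils.Tensor.from_dlpack(
            "OUTPUT__0", to_dlpack(predictions[offset:offset + size])
        )
        responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        offset += size
    return responses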
Update the configuration file
The following is an example of the config.pbtxt file:

name: "resnet50_pt"
backend: "python"
max_batch_size: 128
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {string_value: "no"}
}

The key parameters are described as follows. Other configurations remain the same as in the single-model example.

backend: Must be set to python.

parameters: An optional configuration. When you use a GPU for inference, set the FORCE_CPU_ONLY_INPUT_TENSORS parameter to no to avoid unnecessary overhead from copying input tensors between the CPU and GPU.
2. Deploy the service
Using a Python backend requires you to configure shared memory. To do this, deploy the service with the following JSON configuration.
{
"metadata": {
"name": "triton_server_test",
"instance": 1
},
"cloud": {
"computing": {
"instance_type": "ml.gu7i.c8m30.1-gu30",
"instances": null
}
},
"containers": [
{
"command": "tritonserver --model-repository=/models",
"image": "eas-registry-vpc.<region>.cr.aliyuncs.com/pai-eas/tritonserver:23.02-py3",
"port": 8000,
"prepare": {
"pythonRequirements": [
"torch==2.0.1"
]
}
}
],
"storage": [
{
"mount_path": "/models",
"oss": {
"path": "oss://oss-test/models/triton_backend/"
}
},
{
"empty_dir": {
"medium": "memory",
// Configure the shared memory as 1 GB.
"size_limit": 1
},
"mount_path": "/dev/shm"
}
]
}

Where:
name: The custom name of the model service.
storage.oss.path: The path to the OSS bucket where your model repository is located.
containers.image: Replace <region> with the ID of the current region. For example, the region ID for China (Shanghai) is cn-shanghai.
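After the service is deployed, you can optionally verify that the Triton server inside it is reachable before sending inference requests. The following sketch uses the tritonclient HTTP client; the endpoint, token, and model name are the placeholder values from this topic and must be replaced with your own.

# Optional sanity check: confirm that the server and model are ready.
# The endpoint, token, and model name below are placeholders from this topic.
import tritonclient.http as httpclient

url = '1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test'
client = httpclient.InferenceServerClient(url=url)
headers = {"Authorization": "<test-token>"}

print("server ready:", client.is_server_ready(headers=headers))
print("model ready:", client.is_model_ready("resnet50_pt", headers=headers))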
Call the service: send service requests
To use the model service, send requests from a client. The following Python code provides an example.
Send an HTTP request
The service accepts HTTP requests on port 8000.
import numpy as np
import tritonclient.http as httpclient
# The URL is the endpoint generated after the EAS service is deployed.
url = '1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test'
triton_client = httpclient.InferenceServerClient(url=url)
image = np.ones((1,3,224,224))
image = image.astype(np.float32)
inputs = []
inputs.append(httpclient.InferInput('INPUT__0', image.shape, "FP32"))
inputs[0].set_data_from_numpy(image, binary_data=False)
outputs = []
outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False)) # Get a 1000-dimensional vector.
# Specify the model name, request token, input, and output.
results = triton_client.infer(
model_name="<model_name>",
model_version="<version_num>",
inputs=inputs,
outputs=outputs,
headers={"Authorization": "<test-token>"},
)
output_data0 = results.as_numpy('OUTPUT__0')
print(output_data0.shape)
print(output_data0)

The following table describes the key parameter settings.
Parameter | Description |
| Parameter | Description |
| --- | --- |
| url | The service endpoint. Omit the http:// prefix when you configure it. |
| model_name | The name of the model directory, such as resnet50_pt. |
| model_version | The model version number to send the request to. You can send a request to only one model version at a time. |
| headers | Replace <test-token> with your service token. |
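If you are unsure of the input and output names or data types, you can also query the model metadata with the same client. The following optional sketch reuses the triton_client object, token, and placeholders from the example above.

# Optional: query the model metadata to confirm input/output names and shapes.
# Reuses triton_client and the placeholder values from the example above.
metadata = triton_client.get_model_metadata(
    model_name="<model_name>",
    model_version="<version_num>",
    headers={"Authorization": "<test-token>"},
)
print(metadata)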
Send a gRPC request
When the port number is set to 8001 and gRPC settings are configured, the service supports gRPC requests.
#!/usr/bin/env python
import grpc
from tritonclient.grpc import service_pb2, service_pb2_grpc
import numpy as np
if __name__ == "__main__":
    # Define the endpoint of the service.
    host = (
        "service_name.115770327099****.cn-beijing.pai-eas.aliyuncs.com:80"
    )
    # Service token. Use a real token in actual applications.
    token = "test-token"
    # Model name and version.
    model_name = "resnet50_pt"
    model_version = "1"
    # Create gRPC metadata for token authentication.
    metadata = (("authorization", token),)
    # Create a gRPC channel and stub to communicate with the server.
    channel = grpc.insecure_channel(host)
    grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)
    # Build the inference request.
    request = service_pb2.ModelInferRequest()
    request.model_name = model_name
    request.model_version = model_version
    # Construct the input tensor, which corresponds to the input parameter defined in the model configuration file.
    input = service_pb2.ModelInferRequest().InferInputTensor()
    input.name = "INPUT__0"
    input.datatype = "FP32"
    input.shape.extend([1, 3, 224, 224])
    # Construct the output tensor, which corresponds to the output parameter defined in the model configuration file.
    output = service_pb2.ModelInferRequest().InferRequestedOutputTensor()
    output.name = "OUTPUT__0"
    # Add the input and output to the request.
    request.inputs.extend([input])
    request.outputs.extend([output])
    # Construct a random number array and serialize it into a byte sequence as input data.
    request.raw_input_contents.append(np.random.rand(1, 3, 224, 224).astype(np.float32).tobytes())  # Numeric type
    # Initiate the inference request and receive the response.
    response, _ = grpc_stub.ModelInfer.with_call(request, metadata=metadata)
    # Extract the output tensor from the response.
    output_contents = response.raw_output_contents[0]  # Assume there is only one output tensor.
    output_shape = [1, 1000]  # Assume the shape of the output tensor is [1, 1000].
    # Convert the output bytes to a numpy array.
    output_array = np.frombuffer(output_contents, dtype=np.float32)
    output_array = output_array.reshape(output_shape)
    # Print the model's output result.
    print("Model output:\n", output_array)

The following table describes the key parameter settings.
Parameter | Description |
| Parameter | Description |
| --- | --- |
| host | The service endpoint. Omit the http:// prefix and append port 80, as shown in the example. |
| token | Replace <test-token> with your service token. You can find the token on the Public Endpoint tab. |
| model_name | The name of the model directory, such as resnet50_pt. |
| model_version | The model version number to send the request to. You can send a request to only one model version at a time. |
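As an alternative to building the protobuf messages by hand, the tritonclient package also provides a higher-level gRPC client. The following sketch sends the same request with that client; the endpoint, token, and model name are the placeholder values from the example above.

# Alternative sketch: the same request through the higher-level gRPC client.
# The endpoint, token, and model name are placeholders from the example above.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(
    url="service_name.115770327099****.cn-beijing.pai-eas.aliyuncs.com:80"
)
headers = {"authorization": "test-token"}

image = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [grpcclient.InferInput("INPUT__0", image.shape, "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [grpcclient.InferRequestedOutput("OUTPUT__0")]

results = client.infer(
    model_name="resnet50_pt",
    model_version="1",
    inputs=inputs,
    outputs=outputs,
    headers=headers,
)
print(results.as_numpy("OUTPUT__0").shape)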
FAQ
Q: How do I debug a Triton-deployed service online?
The online debugging feature requires a request body in JSON format.
When you initialize HTTPClient, you can set verbose=True to print the JSON-formatted data of the request and response.
triton_client = httpclient.InferenceServerClient(url=url, verbose=True)

The following is an example result.
POST /api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer, headers {'Authorization': '************1ZDY3OTEzNA=='}
b'{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,1.0,.....,1.0]}],"outputs":[{"name":"OUTPUT__0","parameters":{"binary_data":false}}]}'Based on this output, specify the request path and request body to perform online debugging:

Q: How do I perform one-click stress testing on a Triton-deployed service?
For more information about how to obtain the request path and request body format, see Q: How do I debug a Triton-deployed service online?.
The following steps describe how to perform stress testing using a single piece of data as an example. For more information about stress testing, see Stress testing for services in common scenarios.
On the Stress Testing Task tab, click Add Stress Testing Task, select the deployed Triton service, and then enter the stress testing endpoint.
Set Data Source to Single Data and run the following code to convert the JSON-formatted request body into a Base64-encoded string.
import base64

# Existing JSON request body string
json_str = '{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,.....,1.0]}]}'

# Direct encoding
base64_str = base64.b64encode(json_str.encode('utf-8')).decode('ascii')
print(base64_str)
References
For information about how to deploy an EAS service using the TensorFlow Serving inference engine, see Use a TensorFlow Serving image to deploy a model service.
You can also develop a custom image and use it to deploy an EAS service. For more information, see Custom images.