Platform for AI: Deploy a model service by using a Triton Inference Server image

Last Updated: Dec 25, 2025

Triton Inference Server is an inference serving engine designed for deep learning and machine learning models. It supports deploying models from AI frameworks like TensorRT, TensorFlow, PyTorch, and ONNX for online inference. It also includes features like multi-model management and custom backends. This topic shows you how to deploy a Triton Inference Server model service using an image.

Deploy a service: single model

1. Prepare the model and configuration file

Create a model repository in an Object Storage Service (OSS) bucket and organize your model and configuration files using the required directory structure. For more information, see Manage directories.

Each model repository must contain at least one model version directory and a model configuration file:

  • Model version directory: Contains the model files. The directory name must be a number, which is the version number. A larger number indicates a newer version.

  • Model configuration file: Provides basic information about the model. This file is typically named config.pbtxt.

Assuming the model repository is located at oss://examplebucket/models/triton/, the directory structure is as follows:

triton
└── resnet50_pt
    ├── 1
    │   └── model.pt
    ├── 2
    │   └── model.pt
    ├── 3
    │   └── model.pt
    └── config.pbtxt
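
In this example, each model.pt file is a TorchScript model that the pytorch_libtorch platform can load. The following is a minimal sketch of exporting a model in this format. It assumes a recent torchvision installation and uses a pretrained ResNet-50 only as an illustration:

import torch
import torchvision.models as models

# Load a pretrained ResNet-50 (illustration only) and switch to inference mode.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Trace the model with a dummy input to produce a TorchScript module.
example_input = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

# Save the result as model.pt and upload it to a version directory such as 1/.
traced.save("model.pt")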

The following is an example of a config.pbtxt file:

name: "resnet50_pt"
platform: "pytorch_libtorch"
max_batch_size: 128
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Use a GPU for inference.
# instance_group [
#   { 
#     kind: KIND_GPU
#   }
# ]

# Model version configuration
# version_policy: { all { }}
# version_policy: { latest: { num_versions: 2}}
# version_policy: { specific: { versions: [1,3]}}

The following table describes the key parameters in the config.pbtxt file.

Parameter

Required

Description

name

No

The name of the model. By default, this is the name of the model repository directory. If you specify a name, it must match the directory name.

platform/backend

Yes

At least one of platform or backend must be configured.

  • platform: Specifies the model framework. Common frameworks include tensorrt_plan, onnxruntime_onnx, pytorch_libtorch, tensorflow_savedmodel, and tensorflow_graphdef.

  • backend: Specifies the model framework or a custom inference logic written in Python.

    • You can specify the same frameworks as platform, but with different names, such as tensorrt, onnxruntime, pytorch, and tensorflow.

    • To use custom inference logic in Python, see Deploy a service: use a backend.

max_batch_size

Yes

Specifies the maximum batch size for model requests. Set this parameter to 0 to disable batching.

input

Yes

Specifies the following properties:

  • name: The name of the input data.

  • data_type: The data type.

  • dims: The dimensions.

output

Yes

Specifies the following properties:

  • name: The name of the output data.

  • data_type: The data type.

  • dims: The dimensions.

instance_group

No

By default, if GPU resources are specified in your service configuration, inference runs on a GPU; otherwise, it runs on the CPU. You can also explicitly specify the resource for inference by configuring the instance_group parameter, as shown in the following example:

instance_group [
   { 
     kind: KIND_GPU
   }
 ]

The kind parameter can be set to KIND_GPU or KIND_CPU.

version_policy

No

Specifies the model version. The following are configuration examples:

version_policy: { all { }}
version_policy: { latest: { num_versions: 2}}
version_policy: { specific: { versions: [1,3]}}
  • If this parameter is not configured, Triton loads the model with the highest version number by default. In the example, version 3 of the resnet50_pt model is loaded.

  • all{}: Loads all versions of the model. In the example, versions 1, 2, and 3 of resnet50_pt are loaded.

  • latest{num_versions:}: Loads the specified number of latest versions. In the example, num_versions: 2 loads the two latest versions (2 and 3) of the resnet50_pt model.

  • specific{versions:[]}: Loads the specified versions. In the example, versions 1 and 3 of resnet50_pt are loaded.
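
After the service is deployed, you can check which versions the version policy actually loaded by querying model readiness from a client. The following is a minimal sketch; the endpoint, token, and version numbers are placeholders based on the example above:

import tritonclient.http as httpclient

# Placeholders: use the endpoint and token of your own EAS service.
client = httpclient.InferenceServerClient(url="<service-endpoint>")
headers = {"Authorization": "<token>"}

# Check which versions of resnet50_pt the version policy made available.
for version in ["1", "2", "3"]:
    ready = client.is_model_ready("resnet50_pt", model_version=version, headers=headers)
    print(f"resnet50_pt version {version} ready: {ready}")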

2. Deploy the Triton Inference Server service

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section, click Triton Deployment.

  3. On the Triton Deployment page, configure the following key parameters. For information about other parameters, see Custom deployment.

    Parameter

    Description

    Service Name

    Enter a custom service name.

    Model Settings

    In this solution, set Type to OSS. Set OSS to the OSS path of the model repository that you prepared in Step 1, for example, oss://examplebucket/models/triton/.

  4. (Optional) Enable gRPC. By default, the service starts only an HTTP service on port 8000. To support gRPC calls, click Convert to Custom Deployment in the upper-right corner of the page and make the following changes:

    • In the Environment Information section, change the Port Number to 8001.

    • Under Service Features > Advanced Network, enable gRPC.

  5. After you configure the parameters, click Deploy.

Deploy a multi-model service

The procedure for deploying a multi-model service in EAS is the same as that for a single-model service. You only need to organize the model repositories as shown in the following example. The service loads all the models and serves them in a single service instance. For more information, see Deploy a service: single model.

triton
├── resnet50_pt
│   ├── 1
│   │   └── model.pt
│   └── config.pbtxt
├── densenet_onnx
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── mnist_savedmodel
    ├── 1
    │   └── model.savedmodel
    │       ├── saved_model.pb
    │       └── variables
    │           ├── variables.data-00000-of-00001
    │           └── variables.index
    └── config.pbtxt
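
After a multi-model service starts, one way to confirm that every model was loaded is to query the model repository index from a client. The following is a minimal sketch; the endpoint and token are placeholders for the values of your own EAS service:

import tritonclient.http as httpclient

# Placeholders: use the endpoint and token of your own EAS service.
client = httpclient.InferenceServerClient(url="<service-endpoint>")

# The index lists every model in the repository with its versions and load state.
index = client.get_model_repository_index(headers={"Authorization": "<token>"})
print(index)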

Deploy a service: use a backend

A backend is the component that performs inference. It can call existing model frameworks (such as TensorRT, ONNX Runtime, PyTorch, or TensorFlow) or implement custom inference logic, such as model pre-processing or post-processing.

Backends can be written in C++ or Python. Python is more flexible and easier to use than C++. This section focuses on how to use the Python backend.

1. Update the model and configuration file

This section uses PyTorch as an example to show how to customize a model's computation logic using a Python backend. The model directory structure is as follows:

resnet50_pt
├── 1
│   ├── model.pt
│   └── model.py
└── config.pbtxt

Compared to a standard model directory structure, a backend requires a model.py file in the model version directory to define the custom inference logic. The config.pbtxt file must also be modified accordingly.

  • Customize the inference logic

    The model.py file must define a class named TritonPythonModel. The class must implement the execute function and can optionally implement the initialize and finalize functions. The following is an example of the file content:

    import json
    import os
    import torch
    from torch.utils.dlpack import from_dlpack, to_dlpack
    
    import triton_python_backend_utils as pb_utils
    
    
    class TritonPythonModel:
        """The class name must be "TritonPythonModel"."""
    
        def initialize(self, args):
            """
            The initializer function. This is optional. It is called once when the model is loaded and can be used to initialize information related to model properties and configurations.
            Parameters
            ----------
            args : A dictionary where both keys and values are strings. It includes:
              * model_config: Model configuration information in JSON format.
              * model_instance_kind: The device model.
              * model_instance_device_id: The device ID.
              * model_repository: The model repository path.
              * model_version: The model version.
              * model_name: The model name.
            """
    
            # Convert the model configuration content from a JSON string to a Python dictionary.
            self.model_config = model_config = json.loads(args["model_config"])
    
            # Get the properties from the model configuration file.
            output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")
    
            # Convert Triton types to numpy types.
            self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])
    
            # Get the path of the model repository.
            self.model_directory = os.path.dirname(os.path.realpath(__file__))
    
            # Get the device used for model inference. This example uses a GPU.
            self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
            print("device: ", self.device)
    
            model_path = os.path.join(self.model_directory, "model.pt")
            if not os.path.exists(model_path):
                raise pb_utils.TritonModelException("Cannot find the pytorch model")
            # Load the PyTorch model to the GPU using .to(self.device).
            self.model = torch.jit.load(model_path).to(self.device)
    
            print("Initialized...")
    
        def execute(self, requests):
            """
            The model execution function. This must be implemented. This function is called for every inference request. If the batch parameter is set, you must also implement the batch processing feature yourself.
            Parameters
            ----------
            requests : A list of requests of the pb_utils.InferenceRequest type.
    
            Returns
            -------
            A list of responses of the pb_utils.InferenceResponse type. The length of the list must be the same as the length of the request list.
            """
    
            output_dtype = self.output_dtype
    
            responses = []
    
            # Traverse the request list and create a corresponding response for each request.
            for request in requests:
                # Get the input tensor.
                input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
                # Convert the Triton tensor to a Torch tensor.
                pytorch_tensor = from_dlpack(input_tensor.to_dlpack())
    
                if pytorch_tensor.shape[2] > 1000 or pytorch_tensor.shape[3] > 1000:
                    responses.append(
                        pb_utils.InferenceResponse(
                            output_tensors=[],
                            error=pb_utils.TritonError(
                                "Image shape should not be larger than 1000"
                            ),
                        )
                    )
                    continue
    
                # Perform inference computation on the GPU.
                prediction = self.model(pytorch_tensor.to(self.device))
    
                # Convert the Torch output tensor to a Triton tensor.
                out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(prediction))
    
                inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
                responses.append(inference_response)
    
            return responses
    
        def finalize(self):
            """
            Called when the model is unloaded. This is optional and can be used for model cleanup tasks.
            """
            print("Cleaning up...")
    
    Important
    • If you use a GPU for inference, setting instance_group.kind to KIND_GPU in the config.pbtxt file has no effect. Instead, load the model onto the GPU by calling model.to(torch.device("cuda")) and move the input tensor of each request to the GPU by calling pytorch_tensor.to(torch.device("cuda")). To use a GPU for inference, you only need to configure GPU resources when you deploy the service.

    • If you use batching, setting the max_batch_size parameter in the config.pbtxt file alone has no effect. You must implement the batching logic yourself in the execute function, as shown in the sketch after this note.

    • You must return one response for each request.
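
    If you do need batching, the following sketch shows one possible shape for the execute function. It is only a sketch: it reuses the imports, self.model, and self.device from the example above and assumes that every request in the batch has the same input height and width:

    def execute(self, requests):
        # Sketch only: batch all pending requests into a single forward pass.
        inputs = [
            from_dlpack(pb_utils.get_input_tensor_by_name(r, "INPUT__0").to_dlpack())
            for r in requests
        ]
        batched = torch.cat(inputs, dim=0).to(self.device)
        predictions = self.model(batched)

        # Split the batched output back into one response per request.
        responses = []
        offset = 0
        for tensor in inputs:
            n = tensor.shape[0]
            out_tensor = pb_utils.Tensor.from_dlpack(
                "OUTPUT__0", to_dlpack(predictions[offset:offset + n])
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
            offset += n
        return responses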

  • Update the configuration file

    The following is an example of the config.pbtxt file:

    name: "resnet50_pt"
    backend: "python"
    max_batch_size: 128
    input [
      {
        name: "INPUT__0"
        data_type: TYPE_FP32
        dims: [ 3, -1, -1 ]
      }
    ]
    output [
      {
        name: "OUTPUT__0"
        data_type: TYPE_FP32
        dims: [ 1000 ]
      }
    ]
    
    parameters: {
        key: "FORCE_CPU_ONLY_INPUT_TENSORS"
        value: {string_value: "no"}
    }

    The key parameters are described as follows. Other configurations remain the same.

    • backend: Must be set to python.

    • parameters: An optional configuration. When using a GPU for inference, you can set the FORCE_CPU_ONLY_INPUT_TENSORS parameter to no to avoid unnecessary overhead from copying input tensors between the CPU and GPU.

2. Deploy the service

The Python backend exchanges tensors with the Triton server through shared memory, so you must configure shared memory for the service. To do this, go to the Custom Model Deployment > JSON On Premises Deployment section and enter the following JSON configuration.

{
  "metadata": {
    "name": "triton_server_test",
    "instance": 1
  },
  "cloud": {
        "computing": {
            "instance_type": "ml.gu7i.c8m30.1-gu30",
            "instances": null
        }
    },
  "containers": [
    {
      "command": "tritonserver --model-repository=/models",
      "image": "eas-registry-vpc.<region>.cr.aliyuncs.com/pai-eas/tritonserver:23.02-py3",
      "port": 8000,
      "prepare": {
        "pythonRequirements": [
          "torch==2.0.1"
        ]
      }
    }
  ],
  "storage": [
    {
      "mount_path": "/models",
      "oss": {
        "path": "oss://oss-test/models/triton_backend/"
      }
    },
    {
      "empty_dir": {
        "medium": "memory",
        // Configure the shared memory as 1 GB.
        "size_limit": 1
      },
      "mount_path": "/dev/shm"
    }
  ]
}

Where:

  • name: The custom name of the model service.

  • storage.oss.path: The path to the OSS bucket where your model repository is located.

  • containers.image: Replace <region> with the ID of the current region. For example, the region ID for China (Shanghai) is cn-shanghai.

Call the service: send service requests

To use the model service, send requests from a client. The following Python code provides an example.

Send an HTTP request

The service accepts HTTP requests on port 8000.

import numpy as np
import tritonclient.http as httpclient

# The URL is the endpoint generated after the EAS service is deployed.
url = '1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test'

triton_client = httpclient.InferenceServerClient(url=url)

image = np.ones((1,3,224,224))
image = image.astype(np.float32)

inputs = []
inputs.append(httpclient.InferInput('INPUT__0', image.shape, "FP32"))
inputs[0].set_data_from_numpy(image, binary_data=False)
outputs = []
outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False))  # Get a 1000-dimensional vector.

# Specify the model name, request token, input, and output.
results = triton_client.infer(
    model_name="<model_name>",
    model_version="<version_num>",
    inputs=inputs,
    outputs=outputs,
    headers={"Authorization": "<test-token>"},
)
output_data0 = results.as_numpy('OUTPUT__0')
print(output_data0.shape)
print(output_data0)

The following table describes the key parameter settings.

Parameter

Description

url

The service endpoint. Omit the http:// prefix. To view the public endpoint, go to the Elastic Algorithm Service (EAS) page, click the service name, and then click View Endpoint Information on the Service Details tab.

model_name

The name of the model directory, such as resnet50_pt.

model_version

The specific model version number. You can send a request to only one model version at a time.

headers

Replace <test-token> with your service token. You can find the token on the Public Endpoint tab.

Send a gRPC request

When the port number is set to 8001 and gRPC settings are configured, the service supports gRPC requests.

#!/usr/bin/env python
import grpc
from tritonclient.grpc import service_pb2, service_pb2_grpc
import numpy as np

if __name__ == "__main__":
    # Define the endpoint of the service.
    host = (
        "service_name.115770327099****.cn-beijing.pai-eas.aliyuncs.com:80"
    )
    # Service token. Use a real token in actual applications.
    token = "test-token"
    # Model name and version.
    model_name = "resnet50_pt"
    model_version = "1"
    
    # Create gRPC metadata for token authentication.
    metadata = (("authorization", token),)

    # Create a gRPC channel and stub to communicate with the server.
    channel = grpc.insecure_channel(host)
    grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)
    
    # Build the inference request.
    request = service_pb2.ModelInferRequest()
    request.model_name = model_name
    request.model_version = model_version
    
    # Construct the input tensor, which corresponds to the input parameter defined in the model configuration file.
    input = service_pb2.ModelInferRequest().InferInputTensor()
    input.name = "INPUT__0"
    input.datatype = "FP32"
    input.shape.extend([1, 3, 224, 224])
    # Construct the output tensor, which corresponds to the output parameter defined in the model configuration file.
    output = service_pb2.ModelInferRequest().InferRequestedOutputTensor()
    output.name = "OUTPUT__0"
    
    # Create the input request.
    request.inputs.extend([input])
    request.outputs.extend([output])
    # Construct a random number array and serialize it into a byte sequence as input data.
    request.raw_input_contents.append(np.random.rand(1, 3, 224, 224).astype(np.float32).tobytes()) # Numeric type
        
    # Initiate the inference request and receive the response.
    response, _ = grpc_stub.ModelInfer.with_call(request, metadata=metadata)
    
    # Extract the output tensor from the response.
    output_contents = response.raw_output_contents[0]  # Assume there is only one output tensor.
    output_shape = [1, 1000]  # Assume the shape of the output tensor is [1, 1000].
    
    # Convert the output bytes to a numpy array.
    output_array = np.frombuffer(output_contents, dtype=np.float32)
    output_array = output_array.reshape(output_shape)
    
    # Print the model's output result.
    print("Model output:\n", output_array)

The following table describes the key parameter settings.

Parameter

Description

host

The service endpoint. Omit http:// from the URL and append :80 to the end. To view the public endpoint, go to the Elastic Algorithm Service (EAS) page, click the service name, and then click View Endpoint Information on the Service Details tab.

token

Replace test-token with your service token. You can find the token on the Public Endpoint tab.

model_name

The name of the model directory, such as resnet50_pt.

model_version

The specific model version number. You can send a request to only one model version at a time.

FAQ

Q: How do I debug a Triton-deployed service online?

The online debugging feature requires a request body in JSON format.

When you initialize the InferenceServerClient, you can set verbose=True to print the JSON-formatted data of the request and response.

triton_client = httpclient.InferenceServerClient(url=url, verbose=True)

The following is an example result.

POST /api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer, headers {'Authorization': '************1ZDY3OTEzNA=='}
b'{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,1.0,.....,1.0]}],"outputs":[{"name":"OUTPUT__0","parameters":{"binary_data":false}}]}'

Based on this output, specify the request path and request body on the online debugging page to debug the service.
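
The same request path and body can also be sent as a plain HTTP request from a script, which mirrors what the online debugging page does. The following is a minimal sketch that uses the requests library; the endpoint, service name, and token are placeholders:

import json

import requests

# Placeholders: the endpoint, service name, model name, version, and token
# all come from your own deployment.
url = (
    "http://<service-endpoint>/api/predict/<service_name>"
    "/v2/models/resnet50_pt/versions/1/infer"
)
payload = {
    "inputs": [
        {
            "name": "INPUT__0",
            "shape": [1, 3, 32, 32],
            "datatype": "FP32",
            "data": [1.0] * (3 * 32 * 32),
        }
    ],
    "outputs": [{"name": "OUTPUT__0", "parameters": {"binary_data": False}}],
}

response = requests.post(url, headers={"Authorization": "<token>"}, data=json.dumps(payload))
print(response.json())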


Q: How do I perform one-click stress testing on a Triton-deployed service?

To obtain the request path and the request body format, see the previous question: How do I debug a Triton-deployed service online?

The following steps describe how to perform stress testing using a single piece of data as an example. For more information about stress testing, see Stress testing for services in common scenarios.

  1. On the Stress Testing Task tab, click Add Stress Testing Task, select the deployed Triton service, and then enter the stress testing endpoint.

  2. Set Data Source to Single Data and run the following code to convert the JSON-formatted request body into a Base64-encoded string.

    import base64
    
    # Existing JSON request body string
    json_str = '{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,.....,1.0]}]}'
    # Direct encoding
    base64_str = base64.b64encode(json_str.encode('utf-8')).decode('ascii')
    print(base64_str)

