Platform For AI: Deploy an asynchronous inference service

Last Updated: Apr 15, 2026

PAI asynchronous inference services decouple request submission from result retrieval, avoiding timeouts in long-running tasks such as AIGC and video processing.

Background

Features

  • Asynchronous inference

    Low-latency scenarios typically use synchronous inference, where a client sends a request and waits for the result on the same connection.

    When inference times are long or unpredictable, synchronous waiting can cause dropped HTTP connections and client timeouts. Asynchronous inference decouples submission from retrieval: the client sends a request and later polls for the result or subscribes to a notification.

  • Queue service

    Near-real-time inference scenarios, such as short video processing, video or audio stream analysis, or computationally intensive image processing, do not require an immediate response but must return results within a specific time frame. These scenarios face the following challenges:

    • A round-robin load balancing algorithm is unsuitable. Requests must be distributed based on the actual load of each instance.

    • If an instance fails, its unfinished tasks must be reassigned to other healthy instances.

    PAI provides a queue service framework to solve these challenges.

How it works

image
  • When you create an asynchronous inference service, two sub-services are integrated within it: an inference sub-service and a queue sub-service. The queue sub-service has two built-in message queues: an input queue and a sink queue. The EAS framework in each inference sub-service instance automatically subscribes to the input queue, fetches request data via streaming, calls local interfaces to process the data, and writes the result to the sink queue.

  • If the sink queue is full, the service framework stops consuming from the input queue. This prevents processing requests whose results cannot be written to the sink queue.

    If you do not need a sink queue, for example, if you write inference results directly to OSS or your own message middleware, you can return an empty response from the synchronous HTTP inference interface. In this case, the sink queue is ignored.

  • A highly available queue sub-service receives client requests. Inference sub-service instances subscribe to requests based on their concurrency capacity. The queue sub-service ensures that active requests on each instance do not exceed its subscription window.

    Note

    For example, if each inference sub-service instance can process only five voice streams, set the window size to 5. When an instance finishes processing a stream and commits the result, the queue sub-service pushes a new stream to the instance. This ensures no instance processes more than five streams at a time.

  • If an instance fails, the queue sub-service marks it as unhealthy, re-queues its active requests, and dispatches them to healthy instances, ensuring no data is lost.
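The dispatch behavior described above can be sketched with a small in-memory model. This is only an illustration, not the EAS implementation: the class name, the method names, the "least-loaded instance first" rule, and the choice to re-queue orphaned requests at the front are all assumptions made for the example.

```python
from collections import deque


class QueueSubService:
    """Toy in-memory model of an input queue with per-instance windows."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.pending = deque()  # requests waiting to be dispatched
        self.active = {}        # instance name -> in-flight requests

    def subscribe(self, instance):
        self.active[instance] = []
        self._dispatch()

    def put(self, request):
        self.pending.append(request)
        self._dispatch()

    def commit(self, instance, request):
        # The instance finished a request, so one window slot frees up.
        self.active[instance].remove(request)
        self._dispatch()

    def fail(self, instance):
        # Assumption: orphaned requests go back to the front of the queue,
        # in their original order, so that nothing is lost.
        orphaned = self.active.pop(instance)
        self.pending.extendleft(reversed(orphaned))
        self._dispatch()

    def _dispatch(self):
        # Assumption: push each pending request to the least-loaded
        # instance that still has room in its window.
        while self.pending:
            candidates = [name for name, reqs in self.active.items()
                          if len(reqs) < self.window_size]
            if not candidates:
                return
            target = min(candidates, key=lambda name: len(self.active[name]))
            self.active[target].append(self.pending.popleft())


queue = QueueSubService(window_size=2)
queue.subscribe("a")
queue.subscribe("b")
for n in range(5):
    queue.put(f"req-{n}")
# Both windows are now full, so the fifth request waits in the queue.
queue.fail("b")
# b's two requests are back in the queue, ahead of req-4, until a has room.
```

In this model no instance ever holds more than window_size requests, and a failed instance's requests stay queued until a healthy instance commits a result and frees a slot.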

Create an asynchronous inference service

Creating an asynchronous inference service also creates a service group with the same name. The queue sub-service is created automatically within this group. By default, the queue sub-service starts with one instance and scales dynamically with inference sub-service instances, up to two instances. Each instance uses 1 CPU core and 4 GB of memory by default. To customize these settings, see Queue sub-service parameters.

EAS asynchronous inference services adapt synchronous inference logic for asynchronous execution. The following deployment methods are supported:

Console

  1. Go to the Custom Deployment page and configure the following key parameters. For information about other parameters, see Custom Deployment.

    • Deployment Method: Select Image-based Deployment or Processor-based Deployment, and select the Asynchronous Queue checkbox.

    image

  2. After you configure the parameters, click Deploy.

CLI

  1. Prepare the service configuration file service.json.

    • For a processor-based deployment:

      {
        "processor": "pmml",
        "model_path": "http://example.oss-cn-shanghai.aliyuncs.com/models/lr.pmml",
        "metadata": {
          "name": "pmmlasync",
          "type": "Async",
          "cpu": 4,
          "instance": 1,
          "memory": 8000
        }
      }

      Key parameters:

      • type: Set to Async to create an asynchronous inference service.

      • model_path: Replace the value with the path to your model.

    • For an image-based deployment:

      {
          "metadata": {
              "name": "image_async",
              "instance": 1,
              "rpc.worker_threads": 4,
              "type": "Async"
          },
          "cloud": {
              "computing": {
                  "instance_type": "ecs.gn6i-c16g1.4xlarge"
              }
          },
          "queue": {
              "cpu": 1,
              "min_replica": 1,
              "memory": 4000,
              "resource": ""
          },
          "containers": [
              {
                  "image": "eas-registry-vpc.cn-beijing.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1",
                  "script": "python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat",
                  "port": 8000
              }
          ]
      }

      Key parameters:

      • type: Set to Async to create an asynchronous inference service.

      • instance: The number of inference sub-service instances, excluding queue sub-service instances.

      • rpc.worker_threads: The number of worker threads in the EAS framework. This value determines the size of the subscription window. For example, if the value is 4, an instance holds at most four messages from the queue at a time, and the queue sub-service does not push a new message to the instance until one of the current messages is processed.

        For example, for a video stream processing service where a single inference sub-service instance can process only two video streams at a time, you can set this parameter to 2. The queue sub-service will push a maximum of two video stream URLs to the inference sub-service. It will not send a new URL until the instance completes processing a previous one. This ensures the instance never handles more than two streams at once.

  2. Create the service.

    After you log on to the EASCMD client, run the create command to create the service. For information about how to log on to the EASCMD client, see Download and authenticate the client. Example:

    eascmd create service.json
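If you generate service.json from a script, a minimal sketch might look like the following. The helper name, the service name, the image, and the start command are hypothetical placeholders; only the field names and the instance type are taken from the image-based example above.

```python
import json


def build_async_service_config(name, image, script, port,
                               window_size, instance_type, instances=1):
    """Return a config dict for an image-based asynchronous service."""
    return {
        "metadata": {
            "name": name,
            "instance": instances,              # inference sub-service instances only
            "rpc.worker_threads": window_size,  # subscription window per instance
            "type": "Async",
        },
        "cloud": {"computing": {"instance_type": instance_type}},
        "containers": [{"image": image, "script": script, "port": port}],
    }


config = build_async_service_config(
    name="video_async",                           # hypothetical service name
    image="registry.example.com/demo/video:1.0",  # hypothetical image
    script="python app.py --port=8000",           # hypothetical start command
    port=8000,
    window_size=2,  # each instance handles at most two streams at a time
    instance_type="ecs.gn6i-c16g1.4xlarge",
)

service_json = json.dumps(config, indent=2)
print(service_json)
```

Saving the printed text as service.json gives a file that can be passed to the create command shown above.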

Access the asynchronous inference service

A service group with the same name as the asynchronous inference service is created by default. The queue sub-service serves as the traffic entry point. Access it using the following paths. For more information, see Access the queue service.

| Endpoint type | Format | Example |
| --- | --- | --- |
| Input queue endpoint | {domain}/api/predict/{service_name} | xxx.cn-shanghai.pai-eas.aliyuncs.com/api/predict/{service_name} |
| Sink queue endpoint | {domain}/api/predict/{service_name}/sink | xxx.cn-shanghai.pai-eas.aliyuncs.com/api/predict/{service_name}/sink |
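As a small illustration of the two formats, the helper below derives both paths from a domain and a service name. The function name is hypothetical, and the service name pmmlasync is borrowed from the processor-based example earlier in this topic.

```python
def queue_endpoints(domain, service_name):
    """Return the input and sink queue paths for an asynchronous service."""
    base = f"{domain.rstrip('/')}/api/predict/{service_name}"
    return {"input": base, "sink": f"{base}/sink"}


endpoints = queue_endpoints("xxx.cn-shanghai.pai-eas.aliyuncs.com", "pmmlasync")
print(endpoints["input"])
print(endpoints["sink"])
```

The sink path is always the input path with /sink appended, so a client only needs the domain and the service name to address both queues.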

Manage the asynchronous inference service

Managing an asynchronous inference service is the same as managing a regular service, because the system handles sub-services automatically. For example, deleting an asynchronous inference service deletes both the queue and inference sub-services. Updating the inference sub-service leaves the queue sub-service unchanged to ensure maximum availability.

Because of the sub-service architecture, configuring one service instance results in two instances appearing in the list: one for the inference sub-service and one for the queue sub-service.

image

The number of instances for an asynchronous inference service refers to the number of inference sub-service instances. The number of queue sub-service instances changes automatically with the number of inference sub-service instances. For example, when you scale out the number of inference sub-service instances to 3, the number of queue sub-service instances scales out to 2.

image

The following rules govern the ratio of instances between the two sub-services:

  • When the asynchronous inference service is stopped, the number of instances for both the queue sub-service and the inference sub-service scales in to 0. The instance list will be empty.

  • When the number of inference sub-service instances is 1, the number of queue sub-service instances is also 1, unless specified otherwise in the queue sub-service configuration.

  • When the number of inference sub-service instances is greater than 2, the number of queue sub-service instances stays at 2, unless specified otherwise in the queue sub-service configuration.

  • If you configure auto scaling with a minimum of 0 instances, the queue sub-service maintains 1 standby instance even when the inference sub-service scales to 0.
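The rules above can be condensed into one function. This is a sketch under assumptions: the function name is hypothetical, min_replica stands for the queue.min_replica setting, and the function treats min(2, number of inference instances) as the actual count, although the documentation only states it as the upper end of the default range.

```python
def queue_replicas(inference_instances, stopped=False, min_replica=1):
    """Estimate the number of queue sub-service instances."""
    if stopped:
        # A stopped service has no instances of either sub-service.
        return 0
    # Assumption: the default count follows the inference count, capped at 2.
    default_count = min(2, inference_instances)
    # At least one queue instance is kept, even after scaling in to zero.
    return max(1, min_replica, default_count)


for n in (0, 1, 3):
    print(n, queue_replicas(n))
```

With these rules, one inference instance maps to one queue instance, three map to two, and an auto-scaled service with zero inference instances still keeps one queue instance on standby.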

Queue sub-service parameters

The queue sub-service works properly with the default configuration in most cases. To customize it, configure the queue field at the top level of the service configuration JSON file. Example:

{
  "queue": {
    "sink": {
      "memory_ratio": 0.3
    },
    "source": {
      "auto_evict": true
    }
  }
}

The following sections describe the configuration items in detail.

Configure queue sub-service resources

By default, queue sub-service resources are configured according to the fields in the metadata. However, in some use cases, you may need to configure the resources for the queue sub-service separately.

  • Use queue.resource to specify the resource group for the queue sub-service.

    {
      "queue": {
        "resource": "eas-r-slzkbq4tw0p6xd****"  # By default, it follows the inference sub-service's resource group.
      }
    }
    • By default, the queue sub-service uses the same resource group as the inference sub-service.

    • To deploy the queue sub-service in a public resource group, set resource to an empty string (""). This is useful when your dedicated resource group has insufficient CPU and memory.

      Note

      We recommend deploying the queue sub-service in a public resource group.

  • Use queue.cpu and queue.memory to specify the CPU (in cores) and memory (in MB) for each queue sub-service instance.

    {
      "queue": {
         "cpu": 2,  # Default: 1.
         "memory": 8000  # Default: 4000.
      }
    }

    If you do not configure resources, each queue sub-service instance uses 1 CPU core and 4 GB of memory by default. This meets the needs of most scenarios.

    Important
    • If you have more than 200 subscribers (for example, inference sub-service instances), we recommend that you set the number of CPU cores to 2 or more.

    • We do not recommend reducing the memory configuration of the queue sub-service in a production environment.

  • Use queue.min_replica to configure the minimum number of queue sub-service instances.

    {
      "queue": {
         "min_replica": 3  # Default: 1.
      }
    }

    When you use an asynchronous inference service, the number of queue sub-service instances is automatically adjusted based on the runtime number of inference sub-service instances. The default adjustment range is [1, min(2, number of inference sub-service instances)]. In special cases, if you configure auto scaling for the asynchronous inference service and allow the number of instances to be scaled in to 0, one queue sub-service instance is automatically retained. You can also use queue.min_replica to adjust the minimum number of retained queue sub-service instances.

    Note

    Increasing the number of queue sub-service instances improves availability but does not improve performance.

Configure queue sub-service features

The queue sub-service has several configurable features.

  • Use queue.sink.auto_evict or queue.source.auto_evict to enable automatic data eviction for the sink queue or input queue, respectively.

    {
      "queue": {
         "sink": {
            "auto_evict": true  # Enable eviction for the sink queue. Default: false.
          },
          "source": {
             "auto_evict": true  # Enable eviction for the input queue. Default: false.
          }
      }
    }

    By default, automatic data eviction is disabled, and no more data can be written to a full queue. If your workload can tolerate losing the oldest messages, enable automatic eviction so that a full queue evicts its oldest message to make room for each new one.

  • Use queue.max_delivery to configure the maximum delivery attempts.

    {
       "queue": {
          "max_delivery": 10  # Max delivery attempts: 10. Default: 5. If set to 0, max delivery attempts is disabled and a message can be delivered indefinitely.
       }
    }

    If a message's delivery attempts exceed this threshold, it is marked as a dead letter. For more information, see dead-letter policy.

  • Use queue.max_idle to configure the maximum processing time for a message.

    {
        "queue": {
          "max_idle": "1m"  # Set the maximum processing time for a single message to 1 minute. If this time is exceeded, the message will be delivered to other subscribers. After delivery, the delivery count increases by 1. The default value is 0, which means there is no maximum processing time.
        }
    }

    The time duration configured in the example is 1 minute. Multiple time units are supported, such as h (hour), m (minute), and s (second). If the processing time for a single message exceeds the configured duration, one of the following occurs:

    • If the threshold set by queue.max_delivery is not exceeded, the message is delivered to other subscribers.

    • If the threshold set by queue.max_delivery is exceeded, the dead-letter policy is applied to the message.

  • Use queue.dead_message_policy to configure the dead-letter policy.

    {
        "queue": {
          "dead_message_policy": "Rear"  # Valid values: Rear (default) or Drop. Rear moves the message to the end of the queue. Drop deletes the message.
        }
    }
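The interaction between queue.max_idle, queue.max_delivery, and queue.dead_message_policy can be sketched as a single decision function. This is an illustrative model only: the function name and the message layout are hypothetical, and it assumes that a message becomes a dead letter once it has already been delivered max_delivery times and that Rear leaves the delivery count unchanged.

```python
from collections import deque


def handle_timeout(queue, message, max_delivery=5, policy="Rear"):
    """Decide what happens to a message whose processing exceeded max_idle.

    `message` is a dict with "id" and "deliveries" (attempts so far).
    Returns "redeliver", "rear", or "drop".
    """
    if max_delivery == 0 or message["deliveries"] < max_delivery:
        # Within the limit (0 disables the limit): deliver it again next.
        message["deliveries"] += 1
        queue.appendleft(message)
        return "redeliver"
    if policy == "Rear":
        queue.append(message)  # dead letter moves to the end of the queue
        return "rear"
    if policy == "Drop":
        return "drop"          # dead letter is deleted
    raise ValueError(f"unknown dead_message_policy: {policy!r}")


queue = deque([{"id": "m2", "deliveries": 0}])
msg = {"id": "m1", "deliveries": 1}

first = handle_timeout(queue, msg, max_delivery=2)   # still within the limit
msg = queue.popleft()                                # m1 is delivered again
second = handle_timeout(queue, msg, max_delivery=2)  # limit reached
print(first, second, [m["id"] for m in queue])
```

In this run the first timeout redelivers m1, and the second one treats it as a dead letter and moves it behind m2, which matches the Rear policy described above.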

Configure queue limits

Maximum queue length and maximum payload size are inversely related. They follow these formulas:

image

Queue sub-service instance memory is fixed. Therefore, if you increase the maximum payload size per message, the maximum queue length decreases.

Note
  • With the default 4 GB of memory and the default maximum payload size of 8 KB, the input queue and the sink queue can each store 230,399 messages. The system reserves 10% of the total memory. To store more messages in the queue sub-service, increase its memory as described in Configure queue sub-service resources.

  • For the same queue, you cannot configure both the maximum length and the maximum payload size.

  • Use queue.sink.max_length or queue.source.max_length to configure the maximum length of the sink queue or input queue, respectively.

    {
        "queue": {
           "sink": {
              "max_length": 8000  # Configure the maximum length of the sink queue to 8,000 messages.
           },
           "source": {
              "max_length": 2000  # Configure the maximum length of the input queue to 2,000 messages.
           }
        }
    }
  • Use queue.sink.max_payload_size_kb or queue.source.max_payload_size_kb to configure the maximum payload size per message for the sink queue or input queue, respectively.

    {
        "queue": {
           "sink": {
              "max_payload_size_kb": 10  # Configure the maximum payload size per message in the sink queue to 10 KB. The default is 8 KB.
           },
           "source": {
              "max_payload_size_kb": 1024  # Configure the maximum payload size per message in the input queue to 1024 KB (1 MB). The default is 8 KB.
           }
        }
    }
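The trade-off between queue length and payload size can be estimated with a short calculation. The formula below is a reconstruction, not the official one: it is fitted to the documented figure of 230,399 messages for 4000 MB of memory, a 10% reservation, an even split between the two queues, and an 8 KB payload. The function name and the final "minus one" term are assumptions that happen to reproduce that figure.

```python
def max_queue_length(memory_mb=4000, max_payload_size_kb=8,
                     memory_ratio=0.5, reserved_percent=10):
    """Estimate how many messages one queue can hold.

    memory_ratio is the share of usable memory given to this queue
    (0.5 when the input and sink queues split memory evenly).
    """
    # Memory left after the system reservation, in KB.
    usable_kb = memory_mb * 1024 * (100 - reserved_percent) // 100
    # Portion of that memory assigned to this queue.
    queue_kb = int(usable_kb * memory_ratio)
    # Assumption: one slot is subtracted to match the documented figure.
    return queue_kb // max_payload_size_kb - 1


print(max_queue_length())                          # default configuration
print(max_queue_length(max_payload_size_kb=1024))  # 1 MB payloads
```

With the defaults the estimate is 230,399 messages. Raising the payload size to 1024 KB at the same memory size reduces it to 1,799, which illustrates why the two limits cannot be raised independently.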

Configure memory allocation ratio

  • Use queue.sink.memory_ratio to adjust the memory allocation between the input and sink queues.

    {
        "queue": {
           "sink": {
              "memory_ratio": 0.9  # Configure the memory ratio for the sink queue. The default value is 0.5.
           }
        }
    }
    Note

    By default, the input queue and the sink queue evenly share the memory of the queue sub-service instance. If your output data (e.g., images) is larger than your input data (e.g., text), you can increase queue.sink.memory_ratio to allocate more memory to the sink queue. Conversely, if your service takes images as input and outputs text, you can decrease queue.sink.memory_ratio.

Horizontal auto scaling

How it works

The system dynamically scales inference instances based on queue length, including scaling to zero when the queue is empty to reduce costs. The following diagram illustrates auto scaling for asynchronous inference services.

image

Procedure

  1. In the service list, click the name of the target service to go to the service details page.

  2. Switch to the Auto Scaling tab. In the Auto Scaling section, click Enable Auto Scaling.

  3. In the Auto Scaling Settings dialog box, configure the parameters.

    • Basic settings:

      | Parameter | Description | Example |
      | --- | --- | --- |
      | Minimum Replicas | The minimum number of replicas that the service can scale in to. Minimum value: 0. | 0 |
      | Maximum Replicas | The maximum number of replicas that the service can scale out to. Maximum value: 1000. | 10 |
      | General Scaling Metrics | Built-in performance metrics used to trigger scaling. Asynchronous Queue Length represents the average number of queued tasks per instance. | Select Asynchronous Queue Length and set the threshold to 10. |

    • Advanced settings:

      | Parameter | Description | Example |
      | --- | --- | --- |
      | Scale-out Starts in | Observation window for scale-out decisions, in seconds. After a scale-out is triggered, the system observes the metric during this window and cancels the scale-out if the metric falls below the threshold. Default: 0, which means the scale-out is executed immediately. | 0 |
      | Scale-in Starts in | Observation window for scale-in decisions, in seconds. This is the key parameter for preventing service jitter: a scale-in occurs only after the metric stays below the threshold for the entire window. Default: 300. The window protects against frequent scale-in events caused by traffic fluctuations. To keep the service stable, do not set this value too low. | 300 |
      | Scale-in to 0 Instance Starts in | When Minimum Replicas is 0, the wait time before the replica count is reduced to 0. | 600 |
      | Scale-from-Zero Replica Count | The number of replicas to add when the service scales out from 0 replicas. | 1 |

    For more information about parameters and using the EASCMD client, see Horizontal auto scaling.
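To build intuition for the Asynchronous Queue Length threshold, the sketch below computes a target replica count from the total number of queued tasks. It is not the EAS scaling algorithm, which this topic does not specify. It assumes a simple target-tracking rule of the kind commonly used by horizontal autoscalers, ignores the observation windows and the Scale-from-Zero Replica Count, and uses a hypothetical function name.

```python
import math


def desired_replicas(queued_tasks, threshold, min_replicas, max_replicas):
    """Replica count that keeps the per-instance queue length at or below
    the threshold, clamped to the configured minimum and maximum."""
    if queued_tasks <= 0:
        wanted = 0
    else:
        wanted = math.ceil(queued_tasks / threshold)
    return max(min_replicas, min(max_replicas, wanted))


for tasks in (0, 35, 500):
    print(tasks, desired_replicas(tasks, threshold=10,
                                  min_replicas=0, max_replicas=10))
```

With a threshold of 10 and a range of 0 to 10 replicas, an empty queue maps to 0 replicas, 35 queued tasks map to 4, and 500 queued tasks are capped at the maximum of 10.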