Deploy asynchronous inference service - PAI EAS queue service - Platform For AI

For AIGC, video processing, and other long-running inference workloads, synchronous inference can cause connection timeouts and uneven replica loads. PAI asynchronous inference lets you submit requests and retrieve results through subscription or polling.

Background information

Features

Asynchronous inference

Low-latency online inference typically uses synchronous inference: the client sends a request and waits for the result on the same connection.

When inference times are long or unpredictable, synchronous waiting can cause dropped HTTP connections and client timeouts. With asynchronous inference, the client sends a request and retrieves the result later by polling or subscribing to notifications.

Queue service

Near-real-time scenarios such as short video processing, audio/video stream analysis, or intensive image processing must return results within a specific timeframe. These scenarios face the following challenges:
- The round-robin load balancing algorithm is unsuitable. Requests must be distributed based on the actual load of each replica.
- If a replica fails, its unfinished tasks must be reassigned to other healthy replicas for processing.
PAI provides a queue service framework to solve these request distribution problems.

How it works

An asynchronous inference service contains two sub-services: an inference sub-service and a queue sub-service. The queue sub-service has two built-in queues: an input queue and a sink queue. Requests go to the input queue first. Each inference sub-service replica subscribes to the input queue, processes requests, and writes responses to the sink queue.
When the sink queue is full, the service framework stops consuming from the input queue to prevent undeliverable results.

If you write inference results directly to OSS or your own message middleware, return an empty response from the HTTP inference interface. The sink queue is then ignored.
The queue sub-service receives client requests and distributes them to inference replicas based on concurrency capacity. Each replica subscribes to a window of requests, preventing overload and ensuring all data is eventually returned to the client.

Note
For example, if each replica can process five audio streams, set the window size to 5. When a replica finishes one stream and commits the result, the queue sub-service pushes a new stream. This limits each replica to five concurrent streams.
The queue sub-service monitors replica connections. If a replica fails, its unprocessed requests are redistributed to healthy replicas, ensuring no data is lost.
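
The window mechanism and failover behavior described above can be illustrated with a small in-memory model. This is only a sketch for intuition, not the EAS implementation: the `WindowedQueue` class, its method names, and the replica names are all hypothetical, and it assumes that requests from a failed replica return to the front of the queue.

```python
from collections import deque


class WindowedQueue:
    """Toy model: push at most `window` requests to each subscribed replica."""

    def __init__(self, window):
        self.window = window
        self.pending = deque()  # requests not yet pushed to any replica
        self.in_flight = {}     # replica name -> requests it is processing

    def subscribe(self, replica):
        self.in_flight[replica] = []
        self._fill(replica)

    def submit(self, request):
        self.pending.append(request)
        for replica in self.in_flight:
            self._fill(replica)

    def commit(self, replica, request):
        # A committed result frees one window slot, so one more request is pushed.
        self.in_flight[replica].remove(request)
        self._fill(replica)

    def fail(self, replica):
        # Assumption: unfinished requests go back to the front of the queue.
        unfinished = self.in_flight.pop(replica)
        self.pending.extendleft(reversed(unfinished))
        for other in self.in_flight:
            self._fill(other)

    def _fill(self, replica):
        slots = self.in_flight[replica]
        while self.pending and len(slots) < self.window:
            slots.append(self.pending.popleft())


queue = WindowedQueue(window=2)
queue.subscribe("replica-a")
queue.subscribe("replica-b")
for i in range(6):
    queue.submit(f"stream-{i}")
```

With a window of 2 and two replicas, four streams are in flight and two wait in the queue until a replica commits a result or fails.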

Create an asynchronous inference service

Creating an asynchronous inference service automatically creates a same-named service group with a queue sub-service. The queue sub-service defaults to one replica (1 core, 4 GB memory) and scales up to two replicas with the inference sub-service. To customize, adjust the Queue sub-service parameters.

EAS supports two deployment methods for asynchronous inference:

Deploy via console

Go to the Custom Deployment page and configure the following key parameters. Other parameters are described in Custom Deployment.
- Deployment Method: Select Image-based Deployment or Processor-based Deployment, and select the Asynchronous Queue checkbox.
After you configure the parameters, click Deploy.

Deploy via eascmd client

Prepare the service configuration file named service.json.
- Use a model and processor-based deployment.
```
{
  "processor": "pmml",
  "model_path": "http://example.oss-cn-shanghai.aliyuncs.com/models/lr.pmml",
  "metadata": {
    "name": "pmmlasync",
    "type": "Async",
    "cpu": 4,
    "instance": 1,
    "memory": 8000
  }
}
```
  Key parameters are described below. Other parameters are covered in JSON-based deployment.
  - type: Set this parameter to Async to create an asynchronous inference service.
  - model_path: Replace the value with the path to your model.
- Use an image-based deployment.
```
{
    "metadata": {
        "name": "image_async",
        "instance": 1,
        "rpc.worker_threads": 4,
        "type": "Async"
    },
    "cloud": {
        "computing": {
            "instance_type": "ecs.gn6i-c16g1.4xlarge"
        }
    },
    "queue": {
        "cpu": 1,
        "min_replica": 1,
        "memory": 4000,
        "resource": ""
    },
    "containers": [
        {
            "image": "eas-registry-vpc.cn-beijing.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1",
            "script": "python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat",
            "port": 8000
        }
    ]
}
```
  Key parameters are described below. Other parameters are covered in JSON-based deployment.
  - type: Set this parameter to Async to create an asynchronous inference service.
  - instance: The number of replicas for the inference sub-service. This does not include the replicas of the queue sub-service.
  - rpc.worker_threads: The number of threads for the EAS service framework, which equals the subscription window size. The queue sub-service pushes at most this many messages concurrently and waits for results before sending more.
    
    For example, for a video stream service where each replica handles two streams at a time, set this to 2. The queue sub-service pushes at most two video stream URLs and sends a new one only after receiving a result.
Create the service.
Log on to the eascmd client (see Download and authenticate the client), and then run the create command:
```
eascmd create service.json
```
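
If you generate service.json from a script, a helper along the following lines may be convenient. It is a minimal sketch that only assembles the fields shown in the image-based example above; the function name `build_async_service_config` and its parameters are hypothetical, and any fields your deployment needs beyond these are not covered.

```python
import json


def build_async_service_config(name, image, script, port, window,
                               instance_type, instances=1):
    """Return an image-based service config with asynchronous mode enabled."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return {
        "metadata": {
            "name": name,
            "instance": instances,         # inference sub-service replicas
            "rpc.worker_threads": window,  # subscription window size
            "type": "Async",               # marks the service as asynchronous
        },
        "cloud": {"computing": {"instance_type": instance_type}},
        "containers": [{"image": image, "script": script, "port": port}],
    }


config = build_async_service_config(
    name="image_async",
    image="eas-registry-vpc.cn-beijing.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.1",
    script="python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat",
    port=8000,
    window=4,
    instance_type="ecs.gn6i-c16g1.4xlarge",
)
service_json = json.dumps(config, indent=4)
print(service_json)
```

Save the printed text as service.json before you run the create command.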

Access an asynchronous inference service

The system creates a same-named service group. Because the queue sub-service handles incoming traffic, access it directly through the following endpoints. For details, see Access a queue service.

Endpoint type	Format	Example
Input queue endpoint	`{domain}/api/predict/{service_name}`	`xxx.cn-shanghai.pai-eas.aliyuncs.com/api/predict/{service_name}`
Sink queue endpoint	`{domain}/api/predict/{service_name}/sink`	`xxx.cn-shanghai.pai-eas.aliyuncs.com/api/predict/{service_name}/sink`
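
The two endpoint formats in the table differ only by the /sink suffix, so a client can derive both from the domain and the service name. The helper below is a sketch; `queue_endpoints` is a hypothetical name, and the domain and service name in the usage lines are placeholders.

```python
def queue_endpoints(domain, service_name):
    """Build the input and sink queue endpoints from the documented formats."""
    base = f"{domain.rstrip('/')}/api/predict/{service_name}"
    return {"input": base, "sink": f"{base}/sink"}


endpoints = queue_endpoints("xxx.cn-shanghai.pai-eas.aliyuncs.com", "pmmlasync")
print(endpoints["input"])
print(endpoints["sink"])
```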

Manage an asynchronous inference service

Manage an asynchronous inference service like a regular service. The system manages sub-services automatically: deleting the service removes both sub-services, and updating the inference sub-service leaves the queue sub-service unchanged.

Even with one configured replica, the instance list shows an additional queue sub-service instance.

The replica count refers to inference sub-service replicas. Queue sub-service replicas scale automatically. For example, scaling inference replicas to 3 increases queue replicas to 2.

Replica scaling rules:

- When the service is stopped, both sub-services scale to 0 replicas.
- With one inference replica, the queue sub-service also has one replica (unless configured otherwise).
- With two or more inference replicas, the queue sub-service maintains two replicas (unless configured otherwise).
- If auto scaling allows a minimum of 0 replicas, the queue sub-service retains one standby replica when inference replicas scale to 0.
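
The rules above can be condensed into one expression. The function below is a sketch of that reading, with `queue_replicas` as a hypothetical name. How a custom queue.min_replica interacts with the default cap of two replicas is an assumption here: the sketch treats it as a simple lower bound.

```python
def queue_replicas(inference_replicas, stopped=False, min_replica=1):
    """Estimate the queue sub-service replica count from the scaling rules."""
    if stopped:
        return 0  # a stopped service runs no replicas of either sub-service
    # Default: follow the inference replicas up to two, but keep at least
    # min_replica (assumed to act as a plain lower bound).
    return max(min_replica, min(2, inference_replicas))


for n in (0, 1, 3):
    print(n, "inference replicas ->", queue_replicas(n), "queue replicas")
```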

Queue sub-service parameters

The queue sub-service works with the default configuration in most cases. Customize it in the top-level queue field of the JSON file:

```
{
  "queue": {
    "sink": {
      "memory_ratio": 0.3
    },
    "source": {
      "auto_evict": true
    }
  }
}
```

The following sections describe the configuration options.

Queue sub-service resources

By default, queue sub-service resources inherit from metadata. Configure them separately if needed.

Declare the resource group for the queue sub-service using queue.resource.
```
{
  "queue": {
    "resource": "eas-r-slzkbq4tw0p6xd****"  // By default, it uses the resource group of the inference sub-service.
  }
}
```
- Defaults to the inference sub-service resource group.
- To deploy the queue sub-service in a public resource group, set resource to an empty string (""). This is useful when your dedicated resource group lacks CPU or memory.
  
  Note
  Deploy the queue sub-service in a public resource group when possible.
Declare the CPU (in cores) and memory (in MB) for each queue sub-service replica using queue.cpu and queue.memory.
```
{
  "queue": {
     "cpu": 2,  // Default: 1.
     "memory": 8000  // Default: 4000.
  }
}
```
The default (1 CPU core, 4 GB memory) is sufficient for most scenarios.
Important
- For more than 200 subscribers (inference sub-service replicas), configure 2 or more CPU cores.
- Do not reduce queue sub-service memory in production.
Configure the minimum number of replicas for the queue sub-service using queue.min_replica.
```
{
  "queue": {
     "min_replica": 3  // Default: 1.
  }
}
```
Queue sub-service replicas scale automatically with running inference replicas. The default range is [1, min{2, the number of inference sub-service replicas}]. If auto scaling allows scaling to 0, one queue replica is retained. Use queue.min_replica to adjust this minimum.

Note
More queue replicas improve availability, not performance.

Queue sub-service features

The queue sub-service supports the following feature configurations.

Configure automatic data eviction for the sink and input queues using queue.sink.auto_evict or queue.source.auto_evict, respectively.
```
{
  "queue": {
     "sink": {
        "auto_evict": true  // Enables automatic eviction for the sink queue. Default: false.
      },
      "source": {
         "auto_evict": true  // Enables automatic eviction for the input queue. Default: false.
      }
  }
}
```
Automatic eviction is disabled by default, so a full queue rejects new data. When eviction is enabled, a full queue drops its oldest data to make room for new entries.
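
The difference between the two modes can be seen in a small bounded queue. This is an illustrative sketch only: `BoundedQueue` and its `put` method are hypothetical names and do not correspond to any EAS client API.

```python
from collections import deque


class BoundedQueue:
    """Fixed-length queue that either rejects or evicts when it is full."""

    def __init__(self, max_length, auto_evict=False):
        self.max_length = max_length
        self.auto_evict = auto_evict
        self.items = deque()

    def put(self, item):
        """Return True if the item was stored, False if it was rejected."""
        if len(self.items) >= self.max_length:
            if not self.auto_evict:
                return False      # default behavior: a full queue rejects new data
            self.items.popleft()  # eviction: drop the oldest entry first
        self.items.append(item)
        return True


strict = BoundedQueue(max_length=2)
evicting = BoundedQueue(max_length=2, auto_evict=True)
for message in ("a", "b", "c"):
    strict.put(message)
    evicting.put(message)
print(list(strict.items), list(evicting.items))
```

Eviction keeps producers unblocked at the cost of losing the oldest unread data, so it suits workloads where fresh results matter more than complete ones.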

Configure the maximum number of delivery attempts using queue.max_delivery.

```
{
   "queue": {
      "max_delivery": 10  // The maximum number of delivery attempts is 10. Default: 5. If set to 0, this feature is disabled, and data can be delivered an unlimited number of times.
   }
}
```

When the number of delivery attempts exceeds this threshold, the message is marked as a dead letter and handled according to the dead-letter policy (queue.dead_message_policy).

Configure the maximum processing time for a message using queue.max_idle.
```
{
    "queue": {
      "max_idle": "1m"  // Configures the maximum processing time for a single message to 1 minute. If this time is exceeded, the message is delivered to another subscriber, and the delivery count is incremented. The default value is 0, which means no maximum processing time.
    }
}
```
Supported time units: h (hour), m (minute), and s (second). If processing exceeds the configured duration:
- If the queue.max_delivery threshold is not exceeded, the message is redelivered to other subscribers.
- If the queue.max_delivery threshold is exceeded, the dead-letter policy applies.

Configure the dead-letter policy using queue.dead_message_policy.

```
{
    "queue": {
      "dead_message_policy": "Rear"  // The value can be Rear (default) or Drop. Rear moves the message to the end of the queue. Drop deletes the message.
    }
}
```
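
The way queue.max_idle, queue.max_delivery, and queue.dead_message_policy combine can be sketched as a single decision function. The function name `on_processing_timeout` and the returned action labels are hypothetical, and the exact boundary (whether the dead-letter policy applies on the attempt that reaches the threshold or on the one after it) is an assumption.

```python
def on_processing_timeout(delivery_count, max_delivery=5, policy="Rear"):
    """Decide what happens to a message whose processing exceeded max_idle.

    delivery_count is the number of delivery attempts made so far,
    including the attempt that just timed out.
    """
    if policy not in ("Rear", "Drop"):
        raise ValueError(f"unknown dead-letter policy: {policy}")
    # max_delivery == 0 disables the limit, so the message is always redelivered.
    if max_delivery == 0 or delivery_count < max_delivery:
        return "redeliver"
    # The limit is used up (assumed boundary): apply the dead-letter policy.
    return "move_to_rear" if policy == "Rear" else "drop"


print(on_processing_timeout(delivery_count=2))
print(on_processing_timeout(delivery_count=5))
print(on_processing_timeout(delivery_count=5, policy="Drop"))
```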

Queue length or maximum payload size

Queue replica memory is fixed: increasing the maximum payload size per message reduces the maximum queue length.

Note

With default settings (4 GB memory, 8 KB max payload), each queue stores up to 230,399 messages. To store more, increase the memory. The system reserves 10% of total memory.
You cannot configure both maximum length and maximum payload size for the same queue.
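
The trade-off between payload size and queue length can be approximated with simple arithmetic. The sketch below assumes that the 10% reserve is subtracted first, that the remainder is split between the two queues by the memory ratio, and that each message occupies exactly the maximum payload size. The function name is hypothetical. Under these assumptions the default settings give 230,400 messages, one more than the documented 230,399, so the real accounting evidently differs slightly.

```python
def estimated_queue_length(memory_mb=4000, payload_kb=8,
                           share=0.5, reserved_percent=10):
    """Estimate how many messages one queue holds for a given memory share."""
    # Memory left after the system reserve, in KB.
    usable_kb = memory_mb * 1024 * (100 - reserved_percent) // 100
    # Portion of the usable memory assigned to this queue.
    queue_kb = round(usable_kb * share)
    return queue_kb // payload_kb


print(estimated_queue_length())                 # default settings
print(estimated_queue_length(payload_kb=1024))  # 1 MB payloads
```

Raising the payload limit from 8 KB to 1 MB shrinks the estimate from 230,400 to 1,800 messages, which is why larger payloads usually call for more queue memory.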

Configure the maximum length of the sink and input queues using queue.sink.max_length or queue.source.max_length, respectively.

```
{
    "queue": {
       "sink": {
          "max_length": 8000  // Configures the maximum length of the sink queue to 8,000 messages.
       },
       "source": {
          "max_length": 2000  // Configures the maximum length of the input queue to 2,000 messages.
       }
    }
}
```

Configure the maximum payload size per message for the sink and input queues using queue.sink.max_payload_size_kb or queue.source.max_payload_size_kb, respectively.

```
{
    "queue": {
       "sink": {
          "max_payload_size_kb": 10  // Configures the maximum payload size per message for the sink queue to 10 KB. Default: 8 KB.
       },
       "source": {
          "max_payload_size_kb": 1024  // Configures the maximum payload size per message for the input queue to 1024 KB (1 MB). Default: 8 KB.
       }
    }
}
```

Memory allocation ratio

Adjust the memory allocation between the input and sink queues using queue.sink.memory_ratio.
```
{
    "queue": {
       "sink": {
          "memory_ratio": 0.9  // Configures the memory ratio for the sink queue. Default: 0.5.
       }
    }
}
```
Note
By default, the input and sink queues share memory equally. Increase queue.sink.memory_ratio if the sink queue needs more space (for example, text input and image output), or decrease it for the reverse.

Horizontal auto scaling

How it works

The system dynamically scales inference replicas based on queue state, including scaling to zero when the queue is empty. The following diagram illustrates the mechanism.

Procedure

In the service list, click the target service name.
Go to the Auto Scaling tab. In the Auto Scaling section, click Enable Auto Scaling.

In the Auto Scaling Settings dialog box, configure the parameters.

Basic settings:

Parameter	Description	Example
Minimum Replicas	Minimum replicas for scale-in operations. Minimum value: 0.	0
Maximum Replicas	Maximum replicas for scale-out operations. Maximum value: 1000.	10
General Scaling Metrics	Built-in performance metrics used to trigger scaling. Asynchronous Queue Length represents the average number of queued tasks per replica.	Select Asynchronous Queue Length and set the threshold to 10.

Advanced settings:

Parameter	Description	Example
Scale-out Starts in	Observation window for scale-out decisions. After a scale-out is triggered, the system observes the metric during this period and cancels the scale-out if the metric falls below the threshold. Unit: seconds. Default: `0`, which means the scale-out is performed immediately.	0
Scale-in Starts in	Observation window for scale-in decisions, and the key parameter for preventing service jitter. Scale-in occurs only after the metric stays below the threshold for this entire duration, which protects against frequent scale-in events caused by traffic fluctuations. Unit: seconds. Default: `300`. To keep the service stable, do not set this value too low.	300
Scale-in to 0 Instance Starts in	When Minimum Replicas is `0`, this parameter defines the wait time before the replica count is reduced to `0`.	600
Scale-from-Zero Replica Count	Number of replicas to add when the service scales out from `0` replicas.	1

Full parameter details and eascmd usage are in Horizontal auto scaling.
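
To build intuition for how the basic settings interact, the following sketch computes a target replica count from the total number of queued tasks. It is not the documented EAS algorithm: it assumes the target is the queue length divided by the per-replica threshold, rounded up and clamped to the configured range, and it ignores the observation windows entirely. The function and parameter names are hypothetical.

```python
import math


def desired_replicas(total_queued, current_replicas, threshold=10,
                     min_replicas=0, max_replicas=10, scale_from_zero=1):
    """Estimate the replica count that keeps queued tasks per replica at the threshold."""
    if current_replicas == 0 and total_queued > 0:
        # Waking up from zero: start with the Scale-from-Zero Replica Count.
        target = scale_from_zero
    else:
        target = math.ceil(total_queued / threshold)
    # Keep the result inside [Minimum Replicas, Maximum Replicas].
    return max(min_replicas, min(max_replicas, target))


print(desired_replicas(total_queued=45, current_replicas=3))
print(desired_replicas(total_queued=250, current_replicas=5))
print(desired_replicas(total_queued=0, current_replicas=2))
```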