This topic describes quasi-real-time inference scenarios, explains how to use on-demand GPU-accelerated instances in these scenarios, and shows how to build a cost-effective quasi-real-time inference service.
Introduction
Workloads in a quasi-real-time inference scenario have one or more of the following characteristics:
Sparse invocations
The number of daily invocations ranges from several to tens of thousands, and the actual daily usage time of GPU resources is far less than a typical business day of 8 to 12 hours. As a result, GPU resources sit idle most of the time.
Long processing time
In most cases, the processing time of quasi-real-time inference ranges from seconds to minutes. For example, the processing of a typical computer vision (CV) task is completed within seconds, and the processing of a video task and an AI Generated Content (AIGC) task is completed within minutes.
Tolerable cold starts
Cold starts of GPUs are tolerable, or the probability of cold starts is low when business traffic fluctuates.
GPU-accelerated instances of Function Compute provide the following benefits for the workloads of quasi-real-time inference:
Native serverless
On-demand GPU-accelerated instances provided by Function Compute are fully managed. Resources are automatically scaled based on traffic changes: down to zero during off-peak hours, and up within seconds during peak hours. After you deploy your business in Function Compute, the infrastructure is fully managed by Function Compute, so you only need to focus on iterating your business logic.
Optimal specifications
Function Compute provides various specifications for GPU-accelerated instances to meet your business requirements. You can select different types of graphics cards and configure custom vCPU, vGPU, memory, and disk specifications. The minimum size of vGPU memory is 1 GB.
Optimal cost-efficiency
Function Compute supports the pay-as-you-go billing method and per-second billing. This helps reduce resource costs. For workloads that have low GPU utilization, costs can be reduced by more than 70%.
Support for traffic spikes
Function Compute provides abundant GPU resources. When traffic bursts occur in your business, Function Compute provisions a large number of GPU computing resources within seconds. This helps prevent negative business impacts caused by insufficient or delayed supply of GPU computing power.
Workflow
By default, Function Compute allocates on-demand GPU-accelerated instances to provide the infrastructure for quasi-real-time inference scenarios after a GPU function is deployed. For information about the differences between on-demand instances and provisioned instances, see Instance types and usage modes.
You send an inference request to a trigger of a GPU function. For example, you can send an HTTP request to an HTTP trigger to trigger function execution. After the function is triggered, it performs model inference in a GPU container and returns the inference result in the response. Function Compute automatically orchestrates and scales GPU resources to meet your business requirements, and you pay only for the GPU resources used during request processing.
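As a minimal sketch, invoking a GPU function through its HTTP trigger can look like the following Python snippet. The trigger URL and the payload shape are hypothetical placeholders; substitute your function's actual HTTP trigger endpoint and request format.

```python
import json
import urllib.request

def invoke_inference(endpoint, payload, timeout=120):
    """POST a JSON inference request to a GPU function's HTTP trigger
    and return the decoded JSON response."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (hypothetical endpoint and payload):
# result = invoke_inference(
#     "https://my-gpu-func.example-region.fcapp.run/invoke",
#     {"image_url": "https://example.com/input.jpg"},
# )
```

A generous timeout matters here: as described below, a cold start plus first-request processing can take tens of seconds.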
Container support
GPU-accelerated instances of Function Compute can be used only in Custom Container runtimes. For more information about Custom Container runtimes, see Introduction to Custom Container.
Custom Container functions require a web server inside the image to execute different code paths and handle function invocations through events or HTTP requests. The web server mode is suitable for multi-path request execution scenarios such as AI learning and inference.
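The following is a minimal sketch of such an in-image web server, using only the Python standard library. The listening port (9000), the route handling, and the predict() body are assumptions for illustration; a real server would load the model once at process startup and expose whatever routes your triggers expect.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(raw_payload: bytes) -> dict:
    # Placeholder for real model inference. Load the model once at process
    # startup (not per request) and run it here.
    return {"result": "ok", "payload_bytes": len(raw_payload)}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.dumps(predict(self.rfile.read(length))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=9000):
    # Port 9000 is an assumption; configure it to match your function settings.
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

Loading the model outside the request handler is what keeps per-request latency low once the instance is warm; only the first request after a cold start pays the initialization cost.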
Specifications for GPU-accelerated instances
In inference scenarios, you can select different GPU types and configure specifications of GPU-accelerated instances based on the computing power required by your business. The specifications of GPU-accelerated instances include CPU, GPU memory, memory, and disk capacity. For more information about specifications of GPU-accelerated instances, see Instance specifications.
Deployment methods
You can deploy your models in Function Compute by using one of the following methods:
Use the Function Compute console. For more information, see Create a function in the Function Compute console.
Call SDKs. For more information, see List of operations by function.
Use Serverless Devs. For more information, see Common commands of Serverless Devs.
For more deployment examples, see start-fc-gpu.
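For orientation, a Serverless Devs project file for a GPU function might look like the following sketch. All names, the image URL, and the exact field set are illustrative assumptions; consult the Serverless Devs and Function Compute references for the authoritative schema.

```yaml
edition: 3.0.0
name: gpu-inference-demo          # hypothetical project name
resources:
  gpu_func:
    component: fc3                # Serverless Devs component for Function Compute
    props:
      region: cn-hangzhou
      functionName: my-gpu-func   # hypothetical function name
      runtime: custom-container   # GPU functions require Custom Container runtimes
      instanceType: fc.gpu.tesla.1
      gpuMemorySize: 16384        # GPU memory, in MB
      memorySize: 32768           # memory, in MB
      cpu: 8
      diskSize: 512
      customContainerConfig:
        image: registry.cn-hangzhou.aliyuncs.com/your-ns/your-image:v1
        port: 9000
```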
Concurrent requests
The maximum number of concurrent requests that a GPU function can process in a region is determined by the concurrency of each GPU-accelerated instance and the maximum number of graphics cards that can be used.
Concurrency of a GPU-accelerated instance
By default, the concurrency of a GPU-accelerated instance is set to 1, which means the instance can process only one request at a time. You can change the concurrency of a GPU-accelerated instance as needed by using the Function Compute console or Serverless Devs. For more information, see Configure instance concurrency. The following items describe the recommended concurrency settings for various scenarios:
Compute-intensive inference applications: We recommend that you use the default value 1.
Inference applications that support batch aggregation of requests: We recommend that you configure the concurrency settings based on the number of inference requests that can be aggregated in a batch.
Maximum number of graphics cards that can be used
By default, the maximum number of GPUs that can be allocated to a region within an Alibaba Cloud account is 30. You can view the actual quota in the Quota Center console. If the current quota cannot meet your business requirements, you can apply for a quota adjustment in the Quota Center console.
Relationship between GPU instance specifications and instance concurrency
An Ada.1 GPU has 48 GB of GPU memory, and a Tesla series GPU has 16 GB of GPU memory. Function Compute allocates the full memory of a GPU card to a single GPU container. Because the default GPU card quota is 30 per region, a maximum of 30 GPU containers can run simultaneously in that region.
If the instance concurrency of a GPU function is 1, the function can process up to 30 inference requests concurrently in a region.
If the instance concurrency of a GPU function is 5, the function can process up to 150 inference requests concurrently in a region.
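The relationship above reduces to a simple product, sketched below with the default quota and the example concurrency values from this section:

```python
def max_concurrent_requests(gpu_quota: int, instance_concurrency: int) -> int:
    """Upper bound on concurrent inference requests per region: each GPU card
    backs one container/instance, and each instance processes
    `instance_concurrency` requests at a time."""
    return gpu_quota * instance_concurrency

# Default quota of 30 GPUs per region:
print(max_concurrent_requests(30, 1))  # 30
print(max_concurrent_requests(30, 5))  # 150
```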
Cold starts
If no requests are being processed for a certain period, Function Compute automatically releases all on-demand GPU-accelerated instances. In this case, a cold start occurs when the first new request arrives, because Function Compute requires additional time to set up a new instance to handle the request. The process involves preparing GPU resources, pulling container images, starting GPU containers, loading and initializing algorithm models, and launching inference applications. Each of these steps contributes to the overall cold start duration. For more information, see Best practice for reducing cold start latencies.
The duration of a cold start of an AI application depends on the image size, model size, and time consumed by initialization. You can observe the time consumed by a cold start and estimate the probability of cold starts by using monitoring metrics.
Durations of cold starts
The following figure shows the durations of end-to-end cold starts for common models on the GPU-accelerated instances of Function Compute.

The time required for an end-to-end cold start ranges from 10 to 30 seconds. The time is the total time consumed by the cold start and the processing of the first request.
Probability of cold starts
A cold start of a serverless GPU in Function Compute completes within seconds, whereas a cold start of a GPU on a Kubernetes-based platform takes minutes. The probability of cold starts decreases as the concurrency of the instance increases. Fewer cold starts have less impact on your business.
Cost evaluation
The unit prices provided in the following examples are for reference only. The actual prices provided by your business manager shall prevail.
The lower your daily GPU utilization is before you use Function Compute, the more costs you can save after you use Function Compute.
In the following examples, a GPU-accelerated instance of Elastic Compute Service (ECS) is compared with a GPU-accelerated instance of Function Compute. Both instances use Tesla T4 GPUs. The unit price of a GPU-accelerated ECS instance that has the same GPU specifications as a GPU-accelerated Function Compute instance is approximately USD 2/hour. For more information about billing, see Billing for Elastic GPU Service.
Example 1
Assume that 3,600 invocations, each of which lasts for 1 second, are initiated for your GPU function per day. The function uses a GPU-accelerated instance with 4 GB of GPU memory and a 3 GB model.
Your daily GPU utilization is approximately 4.1%. The GPU utilization is calculated by using the following formula: 3,600 seconds of processing time/86,400 seconds per day ≈ 0.041. GPU memory usage is excluded from this calculation.
The daily fee that you are charged for using GPU resources in ECS is USD 48. The fee is calculated by using the following formula: 2 × 24 = 48.
The daily fee that you are charged for using GPU resources in Function Compute is USD 0.259. The fee is calculated by using the following formula: 3,600 seconds × 4 GB × USD 0.000018/GB-second ≈ 0.259.
GPU-accelerated instances of Function Compute reduce the cost by more than 99%.
Example 2
Assume that 50,000 invocations, each of which lasts for 1 second, are initiated for your GPU function per day. The function uses a GPU-accelerated instance with 4 GB of GPU memory and a 3 GB model.
Your daily GPU utilization is approximately 58%. The GPU utilization is calculated by using the following formula: 50,000 seconds of processing time/86,400 seconds per day ≈ 0.58. GPU memory usage is excluded from this calculation.
The daily fee that you are charged for using GPU resources in ECS is USD 48. The fee is calculated by using the following formula: 2 × 24 = 48.
The daily fee that you are charged for using GPU resources in Function Compute is USD 3.6. The fee is calculated by using the following formula: 50,000 seconds × 4 GB × USD 0.000018/GB-second = 3.6.
GPU-accelerated instances of Function Compute reduce the cost by more than 90%.
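The figures in both examples follow directly from per-second, per-GB billing. The sketch below reproduces the arithmetic; the unit prices (USD 2/hour for a comparable ECS instance, USD 0.000018 per GB-second for Function Compute) are the reference prices quoted above and may differ from your actual prices.

```python
ECS_PRICE_PER_HOUR = 2.0           # USD, reference price for a comparable T4 instance
FC_PRICE_PER_GB_SECOND = 0.000018  # USD, reference pay-as-you-go price

def daily_costs(invocations: int, seconds_per_invocation: float, gpu_mem_gb: int):
    """Return (ECS daily cost, Function Compute daily cost, savings %)."""
    ecs = ECS_PRICE_PER_HOUR * 24
    fc = invocations * seconds_per_invocation * gpu_mem_gb * FC_PRICE_PER_GB_SECOND
    return ecs, fc, (1 - fc / ecs) * 100

# Example 1: 3,600 one-second invocations on a 4 GB instance
ecs, fc, saved = daily_costs(3_600, 1, 4)
print(f"ECS {ecs:.0f} USD, FC {fc:.3f} USD, saved {saved:.1f}%")  # saving > 99%

# Example 2: 50,000 one-second invocations on a 4 GB instance
ecs, fc, saved = daily_costs(50_000, 1, 4)
print(f"ECS {ecs:.0f} USD, FC {fc:.1f} USD, saved {saved:.1f}%")  # saving > 90%
```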