This topic describes quasi-real-time inference scenarios and how to use on-demand GPU-accelerated instances of Function Compute to build a cost-effective inference service in these scenarios.
Introduction
Workloads in a quasi-real-time inference scenario have one or more of the following characteristics:
Sparse invocations
The number of daily invocations ranges from several to tens of thousands, and the actual daily usage time of GPU resources is far shorter than a typical business window of 8 to 12 hours. As a result, GPU resources are idle most of the time.
Long processing time
In most cases, quasi-real-time inference takes seconds to minutes to process. For example, a typical computer vision (CV) task completes within seconds, whereas video processing and AI-generated content (AIGC) tasks complete within minutes.
Tolerable cold starts
Your business can tolerate GPU cold starts, or the probability of cold starts is low even when business traffic fluctuates.
GPU-accelerated instances of Function Compute provide the following benefits for the workloads of quasi-real-time inference:
Native serverless
On-demand GPU-accelerated instances provided by Function Compute automatically manage GPU resources. Resources are automatically scaled based on traffic changes: during off-peak hours, resources can be scaled down to zero, and during peak hours, resources can be scaled up within seconds. After you deploy your business in Function Compute, the infrastructure is fully managed by Function Compute, and you only need to focus on iterating your business.
Optimal specifications
Function Compute provides various specifications for GPU-accelerated instances to meet your business requirements. You can select different GPU card types and configure custom vCPU, vGPU memory, memory, and disk specifications. The minimum vGPU memory size is 1 GB.
Optimal cost-efficiency
Function Compute supports the pay-as-you-go billing method and per-second billing. This helps reduce resource costs. For workloads that have low GPU utilization, costs can be reduced by more than 70%.
Burst traffic support
Function Compute provides abundant GPU resources. When burst traffic occurs in your business, Function Compute can provision a large amount of GPU computing power within seconds. This helps prevent the negative business impact of an insufficient or delayed supply of GPU computing power.
How it works
By default, on-demand GPU-accelerated instances are used to process requests in quasi-real-time inference scenarios after you deploy a GPU function. Function Compute also provides provisioned GPU-accelerated instances. For more information, see Instance modes.
You can send an inference request to a trigger of a GPU function. For example, you can send an HTTP request to an HTTP trigger to invoke the function. After the GPU function is triggered, it runs model inference in a GPU container and returns the inference result in the response. Function Compute automatically orchestrates and elastically scales GPU resources to meet your business requirements. You pay only for the GPU resources that are used during request processing.
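The following is a minimal client-side sketch of such an invocation, assuming an HTTP trigger and a binary request body. The trigger URL, payload format, and response shape are placeholders for your own function, not values defined by Function Compute.

```python
# Minimal sketch of invoking a GPU function through its HTTP trigger.
# The endpoint URL and request payload are hypothetical placeholders; replace them
# with the trigger URL and input format of your own function.
import requests

TRIGGER_URL = "https://example-gpu-func.fcapp.run/invoke"  # hypothetical trigger URL

def run_inference(image_path: str) -> dict:
    """Send one inference request and return the parsed JSON result."""
    with open(image_path, "rb") as f:
        resp = requests.post(
            TRIGGER_URL,
            data=f.read(),
            headers={"Content-Type": "application/octet-stream"},
            timeout=120,  # quasi-real-time inference may take seconds to minutes
        )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(run_inference("sample.jpg"))
```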
Container support
GPU-accelerated instances of Function Compute can be used only in Custom Container runtimes. For more information about Custom Container, see Overview.
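A Custom Container image typically packages an HTTP inference server. The following is a minimal sketch of such a server, assuming a Python base image with Flask installed. The route, model path, and listening port are illustrative assumptions; the port must match the listening port configured for your function.

```python
# Minimal sketch of an inference HTTP server packaged in a Custom Container image.
from flask import Flask, jsonify, request

app = Flask(__name__)

def load_model():
    # Placeholder for loading real weights onto the GPU, for example
    # torch.load("/opt/models/model.pt").cuda() with a hypothetical path.
    return object()

# Loaded once per instance during the cold start; warm requests reuse it.
MODEL = load_model()

@app.route("/invoke", methods=["POST"])
def invoke():
    payload = request.get_data()
    # Replace this stub with a real forward pass through MODEL.
    result = {"input_bytes": len(payload), "label": "placeholder"}
    return jsonify(result)

if __name__ == "__main__":
    # Must match the listening port configured for the Custom Container function.
    app.run(host="0.0.0.0", port=9000)
```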
Specifications of GPU-accelerated instances
In inference scenarios, you can select different GPU card types and configure specifications of GPU-accelerated instances based on the compute power required by your business. The specifications of GPU-accelerated instances include the GPU memory, memory, and disk capacity. For more information about specifications of GPU-accelerated instances, see Instance specifications.
Deployment methods
You can deploy your models in Function Compute by using one of the following methods:
Use the Function Compute console. For more information, see Create a function in the Function Compute console.
Call SDKs. For more information, see List of operations by function.
Use Serverless Devs. For more information, see Serverless Devs commands.
For more deployment examples, see start-fc-gpu.
Concurrent requests
The maximum number of concurrent requests that a GPU function can process in a region depends on the concurrency of each GPU-accelerated instance and the maximum number of GPU cards that can be used.
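As a rough illustration only, assuming that each instance occupies one GPU card, the ceiling can be estimated as the product of the instance concurrency and the GPU card quota. The quota value below is a hypothetical number; see Quotas and limits for the actual quota of your account.

```python
# Rough upper-bound estimate only, assuming one GPU card per instance.
instance_concurrency = 1      # default concurrency of a GPU-accelerated instance
max_gpu_cards = 100           # hypothetical per-region GPU card quota
max_concurrent_requests = instance_concurrency * max_gpu_cards
print(max_concurrent_requests)  # -> 100
```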
Concurrency of a GPU-accelerated instance
By default, the concurrency of a GPU-accelerated instance is set to 1. That means a GPU-accelerated instance can process only one request at a time. You can change the concurrency of a GPU-accelerated instance by using the Function Compute console or Serverless Devs. For more information, see Configure instance concurrency. We recommend that you configure concurrency settings for a GPU-accelerated instance based on your business requirements.
Compute-intensive inference applications: We recommend that you use the default value 1.
Inference applications that support batch aggregation of requests: We recommend that you set the concurrency based on the number of inference requests that can be aggregated into one batch, as illustrated by the sketch after this list.
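The following is a minimal sketch of in-instance batch aggregation, assuming that the instance concurrency is set to the desired batch size and that the HTTP server calls handle_request concurrently for each incoming request. The function names, batching window, and model call are illustrative placeholders, not a Function Compute API.

```python
# Minimal sketch of aggregating concurrent requests into one inference batch.
import queue
import threading

BATCH_SIZE = 8                 # align with the configured instance concurrency
BATCH_WINDOW_SECONDS = 0.05    # how long to wait for additional requests

_pending = queue.Queue()

def handle_request(payload):
    """Called concurrently by the HTTP server; blocks until its result is ready."""
    done = threading.Event()
    slot = {"input": payload, "done": done, "output": None}
    _pending.put(slot)
    done.wait()
    return slot["output"]

def _run_model(inputs):
    # Placeholder for a real batched inference call, e.g. model(torch.stack(inputs)).
    return [{"input_bytes": len(x)} for x in inputs]

def _batch_worker():
    while True:
        batch = [_pending.get()]            # block until at least one request arrives
        try:
            while len(batch) < BATCH_SIZE:
                batch.append(_pending.get(timeout=BATCH_WINDOW_SECONDS))
        except queue.Empty:
            pass                            # window elapsed; run with what we have
        outputs = _run_model([item["input"] for item in batch])
        for item, output in zip(batch, outputs):
            item["output"] = output
            item["done"].set()

threading.Thread(target=_batch_worker, daemon=True).start()
```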
Maximum number of physical GPU cards
For more information about the maximum number of GPUs, see Quotas and limits.
Cold starts
If no requests are processed for a period of time, Function Compute releases all on-demand GPU-accelerated instances. In this case, a cold start occurs when the next request arrives, and Function Compute requires additional time to start an instance to process the request. This includes preparing GPU resources, pulling the container image, starting the GPU container, loading and initializing the algorithm model, and starting the inference application. For more information, see Best practice for reducing cold start latencies.
The duration of a cold start of an AI application depends on the image size, model size, and time consumed by initialization. You can observe the time consumed by a cold start and estimate the probability of cold starts by using monitoring metrics.
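In addition to the platform monitoring metrics, you can log cold starts from inside the function. The following is a minimal sketch, assuming a Python handler in a Custom Container image; the module-level timestamp, helper name, and log format are illustrative assumptions.

```python
# Minimal sketch for flagging cold starts from inside the function. The module-level
# timestamp is evaluated once per instance, so the first request on a new instance
# can be logged as a cold start together with its initialization time.
import time

INSTANCE_STARTED_AT = time.time()  # set once when the instance starts
_first_request_seen = False

def log_cold_start_if_any():
    """Call this at the start of each request handler."""
    global _first_request_seen
    if not _first_request_seen:
        _first_request_seen = True
        init_seconds = time.time() - INSTANCE_STARTED_AT
        # Emit a structured log line; querying the logs for "cold_start" then gives
        # an estimate of cold-start frequency and duration for this function.
        print(f'{{"event": "cold_start", "init_seconds": {init_seconds:.2f}}}')
```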
Duration of cold starts
The following figure shows the durations of end-to-end cold starts for common models on the GPU-accelerated instances of Function Compute.

The time required for an end-to-end cold start ranges from 10 to 30 seconds. The time is the total time consumed by the cold start and the processing of the first request.
Probability of cold starts
A cold start of a serverless GPU in Function Compute completes within seconds, whereas a cold start of a GPU on a Kubernetes-based platform takes minutes. The probability of cold starts for a Function Compute instance decreases as the concurrency of the instance increases. Fewer cold starts have less impact on your business.
Cost evaluation
The unit prices provided in the following examples are for reference only. The actual prices provided by your business manager shall prevail.
The lower your daily GPU utilization is before you use Function Compute, the more costs you can save after you use Function Compute.
In the following examples, a GPU-accelerated instance of Elastic Compute Service (ECS) is compared with a GPU-accelerated instance of Function Compute. Both instances use Tesla T4 GPUs. The unit price of a GPU-accelerated ECS instance with the same specification is approximately USD 2/hour. For more information about billing, see Billing for Elastic GPU Service.
Example 1
Assume that 3,600 invocations, each of which lasts for 1 second, are initiated for your GPU function per day. The function uses a GPU-accelerated instance with 4 GB of GPU memory, and the model size is 3 GB.
Your daily GPU utilization is approximately 4.2%, calculated by using the following formula: 3,600/86,400 ≈ 0.042. GPU memory utilization is not taken into account.
The daily fee that you are charged for using GPU resources in ECS is USD 48. The fee is calculated by using the following formula: 2 × 24 = 48.
Average daily GPU resource fee of Function Compute = 3,600 seconds × 4 GB × USD 0.000105/GB-second = USD 1.512
Compared with ECS, Function Compute reduces costs by more than 95%.
Example 2
Assume that 50,000 invocations, each of which lasts for 1 second, are initiated for your GPU function per day. The function uses a GPU-accelerated instance with 4 GB of GPU memory, and the model size is 3 GB.
Your daily GPU utilization is approximately 58%, calculated by using the following formula: 50,000/86,400 ≈ 0.58. GPU memory utilization is not taken into account.
The daily fee that you are charged for using GPU resources in ECS is USD 48. The fee is calculated by using the following formula: 2 × 24 = 48.
Average daily GPU resource fee of Function Compute = 50,000 seconds × 4 GB × USD 0.000105/GB-second = USD 21
Compared with ECS, Function Compute reduces costs by more than 55%.
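The following sketch reproduces the arithmetic of both examples, using the reference unit prices from this topic (approximately USD 2/hour for the GPU-accelerated ECS instance and USD 0.000105 per GB-second for Function Compute). The prices are for reference only; the actual prices provided by your business manager prevail.

```python
# Worked cost comparison for the two examples above.
ECS_HOURLY_USD = 2.0         # reference price of a comparable GPU-accelerated ECS instance
FC_GB_SECOND_USD = 0.000105  # reference Function Compute GPU price per GB-second
GPU_MEMORY_GB = 4

def daily_costs(invocations_per_day: int, seconds_per_invocation: float = 1.0):
    ecs = ECS_HOURLY_USD * 24  # the ECS instance is billed for the full day
    fc = invocations_per_day * seconds_per_invocation * GPU_MEMORY_GB * FC_GB_SECOND_USD
    saving = 1 - fc / ecs
    return ecs, fc, saving

for calls in (3_600, 50_000):
    ecs, fc, saving = daily_costs(calls)
    print(f"{calls} calls/day: ECS ${ecs:.2f}, Function Compute ${fc:.3f}, saving {saving:.0%}")
# -> 3600 calls/day: ECS $48.00, Function Compute $1.512, saving 97%
# -> 50000 calls/day: ECS $48.00, Function Compute $21.000, saving 56%
```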