This topic describes how to build a latency-sensitive real-time inference service by using provisioned GPU-accelerated instances in Function Compute.
Background information
Scenarios
Workloads of real-time inference scenarios feature one or more of the following characteristics:
Low latency
Real-time inference workloads have strict requirements on the response time of each request. The long-tail latency must be within hundreds of milliseconds for 90% of requests.
Core links
In most cases, real-time inference occurs on core business paths and requires a high inference success rate. Long retries must be avoided. The following items provide examples:
Launch page advertising and homepage product recommendation: User-specific advertisements and products can be displayed on launch pages and home pages based on user preferences.
Real-time production in streaming media: In scenarios such as interactive streaming, live streaming, and ultra-low latency playback, audio and video streams must be transmitted at extremely low end-to-end latency. Performance and user experience must also be guaranteed in scenarios such as real-time AI-based video super resolution and video recognition.
Peak and off-peak hours
Business traffic fluctuates between peak and off-peak hours, and the fluctuation pattern follows user habits.
Low resource utilization
In most cases, GPU resources are planned based on peak traffic. A large portion of these resources sits idle during off-peak hours, and overall utilization is generally lower than 30%.
Benefits
Function Compute provides the following benefits for real-time inference workloads.
Provisioned GPU-accelerated instances
Function Compute allows you to use GPU-accelerated instances in on-demand mode and provisioned mode. To eliminate the impact of cold starts and meet the low-latency response requirements of real-time inference, use provisioned GPU-accelerated instances. For more information about the provisioned mode, see Configure provisioned instances and auto scaling rules.
Auto scaling policies for provisioned GPU-accelerated instances (recommended)
Function Compute allows you to configure metric-based auto scaling policies and scheduled auto scaling policies for provisioned GPU-accelerated instances. The metrics that can be used in metric-based auto scaling policies include concurrency, GPU streaming multiprocessor (SM) utilization, GPU memory utilization, GPU encoder utilization, and GPU decoder utilization. You can apply different auto scaling policies to different traffic patterns to meet the computing power requirements of your workloads and reduce deployment costs.
Service quality guaranteed at comparatively low costs
The billing cycle of provisioned GPU-accelerated instances is different from that of on-demand GPU-accelerated instances. Provisioned instances are billed based on the instance lifetime: after you allocate provisioned GPU-accelerated instances, fees are incurred regardless of whether requests are being processed. Therefore, the cost of provisioned GPU-accelerated instances is higher than that of on-demand GPU-accelerated instances. However, compared with self-built GPU clusters, costs are reduced by more than 50%.
Optimal specifications
Function Compute allows you to select the GPU type and configure instance specifications, such as CPU, GPU memory, memory, and disk capacity, based on your business requirements. GPU memory can be configured in increments of 1 GB, which allows you to choose the optimal instance specifications for your workload.
Burst traffic support
Function Compute provides abundant GPU resources. When a traffic burst occurs in your business, Function Compute delivers a large amount of GPU computing power within seconds. This helps prevent business impact caused by an insufficient or delayed supply of GPU computing power.
How it works
After you deploy a GPU function, you can configure an auto scaling policy to allocate provisioned GPU-accelerated instances, which provide the infrastructure required for real-time inference scenarios. Function Compute horizontally scales (HPA) the provisioned GPU-accelerated instances based on the metrics that you configure, and requests are preferentially routed to provisioned GPU-accelerated instances for inference. This eliminates cold starts and allows the service to run at a low latency.
Basic information about real-time inference scenarios
Container support
GPU-accelerated instances of Function Compute can be used only in Custom Container runtimes. For more information about Custom Container runtimes, see Overview.
Specifications for GPU-accelerated instances
In inference scenarios, you can select different GPU card types and configure specifications of GPU-accelerated instances based on the computing power required by your business. The specifications of GPU-accelerated instances include the GPU memory, memory, and disk capacity. For more information about specifications of GPU-accelerated instances, see Instance specifications.
Model deployment methods
You can deploy your models in Function Compute by using one of the following methods:
Use the Function Compute console. For more information, see Create a function in the Function Compute console.
Call SDKs. For more information, see List of operations by function.
Use Serverless Devs. For more information, see Serverless Devs commands.
For more deployment examples, see start-fc-gpu.
Auto scaling for provisioned instances

Scheduled auto scaling policy
You can configure a scheduled auto scaling policy in Function Compute. For more information, see Scheduled Setting Modification. If traffic in a real-time inference scenario changes in a predictable pattern, you can configure a scheduled auto scaling policy to allocate and release provisioned GPU-accelerated instances at specified points in time. This way, provisioned GPU-accelerated instances are allocated several minutes before traffic spikes and released after traffic falls, which ensures optimal performance at a low cost.
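For example, a scheduled policy can be described in the same provision configuration format as the provision.json file used later in this topic. The following sketch is an assumption-based illustration: the top-level scheduledActions field appears in the s provision list output later in this topic, but the exact fields of each action (name, startTime, endTime, target, scheduleExpression) and the cron syntax are assumptions that you should verify against Scheduled Setting Modification.

```json
{
  "target": 2,
  "scheduledActions": [
    {
      "name": "scale-out-before-evening-peak",
      "startTime": "2023-01-01T00:00:00.000Z",
      "endTime": "2024-01-01T00:00:00.000Z",
      "target": 20,
      "scheduleExpression": "cron(0 50 19 * * *)"
    },
    {
      "name": "scale-in-after-evening-peak",
      "startTime": "2023-01-01T00:00:00.000Z",
      "endTime": "2024-01-01T00:00:00.000Z",
      "target": 2,
      "scheduleExpression": "cron(0 0 23 * * *)"
    }
  ]
}
```

In this sketch, the number of provisioned instances is raised to 20 shortly before a daily evening peak and lowered back to the baseline of 2 afterwards.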
Metric-based auto scaling policies
The following table describes the metrics that can be tracked for GPU functions in Function Compute. To configure dynamic scaling policies, you can select metrics based on your business requirements.
In real-time inference scenarios, we recommend that you use the ProvisionedConcurrencyUtilization metric as the tracking (HPA) metric. Concurrency and QPS are business-oriented metrics, whereas the GPU utilization metrics are resource-oriented metrics, and changes in business metrics drive the resource metrics. Tracking a business-oriented metric therefore triggers the scaling of provisioned GPU-accelerated instances earlier and helps ensure the quality of service. A simplified sketch of how target tracking acts on such a metric is provided after the table.
Metric | Description | Valid values |
--- | --- | --- |
ProvisionedConcurrencyUtilization | Concurrency utilization of provisioned instances. This metric collects the ratio of in-use instance concurrency to the allocated provisioned concurrency of the function. | [0, 1], which corresponds to the utilization rate from 0% to 100%. |
GPUSmUtilization | GPU streaming multiprocessor (SM) utilization. This metric collects statistics on the maximum GPU SM utilization of multiple instances. | |
GPUMemoryUtilization | GPU memory utilization. This metric collects the maximum GPU memory utilization of multiple instances. | |
GPUDecoderUtilization | GPU hardware decoder utilization. This metric collects the maximum utilization of GPU hardware decoders of multiple instances. | |
GPUEncoderUtilization | GPU hardware encoder utilization. This metric collects the maximum utilization of GPU hardware encoders of multiple instances. | |
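To make the recommendation above more concrete, the following Python sketch illustrates the general idea behind target tracking: the desired number of provisioned instances grows when the tracked metric rises above metricTarget and shrinks when it falls below, clamped to the configured capacity range. This is a simplified illustration, not Function Compute's exact scaling algorithm; the formula and the numbers are assumptions for demonstration only.

```python
import math

def desired_instances(current_instances: int,
                      metric_value: float,
                      metric_target: float,
                      min_capacity: int,
                      max_capacity: int) -> int:
    """Simplified target-tracking estimate; not Function Compute's exact algorithm."""
    if current_instances <= 0:
        current_instances = min_capacity  # start from the configured floor
    # Scale in proportion to how far the observed metric is from the target.
    estimate = math.ceil(current_instances * metric_value / metric_target)
    # Clamp the result to the configured capacity range.
    return max(min_capacity, min(max_capacity, estimate))

# Example: 2 provisioned instances, concurrency utilization 0.9, metric target 0.3
# -> roughly 6 instances are needed to bring utilization back toward the target.
print(desired_instances(2, 0.9, 0.3, min_capacity=2, max_capacity=30))
```

In this simplified model, a metricTarget of 0.3 with a concurrency utilization of 0.9 on two provisioned instances calls for roughly six instances.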
Model warmup
To resolve the issue that initial requests take a long time after a model is brought online, Function Compute provides the model warmup feature. This feature enables a model to enter the working state immediately after it is launched.
We recommend that you configure the initialize lifecycle hook in Function Compute to warm up models. Function Compute automatically executes the logic in the initialize hook to warm up the model. For more information, see Lifecycle hooks for function instances.
You can perform the following operations to warm up a model.
Add the model warmup logic to the initialize lifecycle hook of the instance.
Add the /initialize invocation path of the POST method to the HTTP server that you build, and place the model warmup logic under the /initialize path. You can have the model perform simple inferences to achieve the warmup effect. The following sample code provides an example in Python:

```python
def prewarm_inference():
    res = model.inference()

@app.route('/initialize', methods=['POST'])
def initialize():
    request_id = request.headers.get("x-fc-request-id", "")
    print("FC Initialize Start RequestId: " + request_id)
    # Prewarm model and perform naive inference task.
    prewarm_inference()
    print("FC Initialize End RequestId: " + request_id)
    return "Function is initialized, request_id: " + request_id + "\n"
```

On the function configuration page, configure the instance lifecycle hook.
On the Configurations tab of the Function Details page, click Modify in the Instance Lifecycle Hook section. In the instance lifecycle hook panel, configure the Initializer hook.

Configure and verify an auto scaling policy
You can use either of the following methods to configure an auto scaling policy for GPU-accelerated instances:
Use Serverless Devs to configure auto scaling policies for GPU-accelerated instances
Configure auto scaling policies for GPU-accelerated instances in the Function Compute console
After you configure an auto scaling policy, you can perform a stress test to view the effect of the auto scaling policy. For more information, see Perform a stress test.
Use Serverless Devs to configure an auto scaling policy for GPU-accelerated instances
Before you start
Perform the following operations in the region where the GPU-accelerated instances reside:
Create a Container Registry Enterprise Edition instance or Personal Edition instance. We recommend that you create an Enterprise Edition instance. For more information, see Step 1: Create a Container Registry Enterprise Edition instance.
Create a namespace and an image repository. For more information, see Step 2: Create a namespace and Step 3: Create an image repository.
Procedure
Run the following command to clone the project:
```bash
git clone https://github.com/devsapp/start-fc-gpu.git
```
Deploy the project.
Run the following command to go to the project directory:
```bash
cd fc-http-gpu-inference-paddlehub-nlp-porn-detection-lstm/src/
```
The following code snippet shows the structure of the project.
```
.
├── hook
│   └── index.js
└── src
    ├── code
    │   ├── Dockerfile
    │   ├── app.py
    │   ├── hub_home
    │   │   ├── conf
    │   │   ├── modules
    │   │   └── tmp
    │   └── test
    │       └── client.py
    └── s.yaml
```
Run the following command to use Docker to build an image and push the image to your image repository:
```bash
export IMAGE_NAME="registry.cn-shanghai.aliyuncs.com/fc-gpu-demo/paddle-porn-detection:v1"
# sudo docker build -f ./code/Dockerfile -t $IMAGE_NAME .
# sudo docker push $IMAGE_NAME
```
Important: The PaddlePaddle framework is large, and it takes a long period of time (about 1 hour) to build an image for the first time. Therefore, we provide a public image that resides in a virtual private cloud (VPC) for you to use. If you use the public image, you do not need to execute the preceding docker build and docker push commands.
Edit the s.yaml file.
```yaml
edition: 1.0.0
name: container-demo
access: {access}
vars:
  region: cn-shanghai
services:
  gpu-best-practive:
    component: devsapp/fc
    props:
      region: ${vars.region}
      service:
        name: gpu-best-practive-service
        internetAccess: true
        logConfig:
          enableRequestMetrics: true
          enableInstanceMetrics: true
          logBeginRule: DefaultRegex
          project: log-ca041e7c29f2a47eb8aec48f94b****  # Use the name of the Log Service project that you created.
          logstore: config*****                         # Use the name of the Logstore that you created.
        role: acs:ram::143199913651****:role/aliyunfcdefaultrole
      function:
        name: gpu-porn-detection
        description: This is the demo function deployment
        handler: not-used
        timeout: 1200
        caPort: 9000
        memorySize: 8192       # Set the memory size to 8 GB.
        cpu: 2
        gpuMemorySize: 8192    # Set the GPU memory to 8 GB.
        diskSize: 512
        instanceType: fc.gpu.tesla.1   # Deploy GPU-accelerated instances that use Tesla GPUs.
        instanceConcurrency: 1
        runtime: custom-container
        environmentVariables:
          FCGPU_RUNTIME_SHMSIZE: '8589934592'
        customContainerConfig:
          image: registry.cn-shanghai.aliyuncs.com/serverless_devs/gpu-console-supervising:paddle-porn-detection  # The public image is used as an example. Use the actual name of your image.
          accelerationType: Default
      triggers:
        - name: httpTrigger
          type: http
          config:
            authType: anonymous
            methods:
              - GET
              - POST
```
Run the following command to deploy the function:
```bash
sudo s deploy --skip-push true -t s.yaml
```
When the execution is complete, a URL is returned in the output. You can use the URL to test the function.
Test the function and log on to the Function Compute console to view the monitoring results.
Run the curl command to test the function. In the command, use the URL obtained in the previous step. A Python-based alternative is shown after this step.
```bash
curl https://gpu-poretection-gpu-bes-service-gexsgx****.cn-shanghai.fcapp.run/invoke -H "Content-Type: text/plain" --data "Nice to meet you"
```
If the following output is returned, the test is passed:
```
[{"text": "Nice to meet you", "porn_detection_label": 0, "porn_detection_key": "not_porn", "porn_probs": 0.0, "not_porn_probs": 1.0}]
```
In the Function Compute console, find the service and function that you deployed in the preceding steps, and then click the Metrics tab to view the changes of GPU-related metrics.

Configure an auto scaling policy for provisioned instances.
Create the provision.json template.
The following sample code shows an example. This template uses the concurrency utilization of provisioned instances (ProvisionedConcurrencyUtilization) as the tracking metric. The minimum number of instances is 2 and the maximum number of instances is 30.
{ "target": 2, "targetTrackingPolicies": [ { "name": "scaling-policy-demo", "startTime": "2023-01-01T16:00:00.000Z", "endTime": "2024-01-01T16:00:00.000Z", "metricType": "ProvisionedConcurrencyUtilization", "metricTarget": 0.3, "minCapacity": 2, "maxCapacity": 30 } ] }Run the following command to deploy the scaling policy:
```bash
sudo s provision put --config ./provision.json --qualifier LATEST -t s.yaml -a {access}
```
Run the sudo s provision list command for verification. You can see the following output. The values of target and current are equal, which indicates that the provisioned instances are correctly pulled up and the auto scaling policy is deployed as expected.
```
[2023-05-10 14:49:03] [INFO] [FC] - Getting list provision: gpu-best-practive-service
gpu-best-practive:
  - serviceName: gpu-best-practive-service
    qualifier: LATEST
    functionName: gpu-porn-detection
    resource: 143199913651****#gpu-best-practive-service#LATEST#gpu-porn-detection
    target: 2
    current: 2
    scheduledActions: null
    targetTrackingPolicies:
      - name: scaling-policy-demo
        startTime: 2023-01-01T16:00:00.000Z
        endTime: 2024-01-01T16:00:00.000Z
        metricType: ProvisionedConcurrencyUtilization
        metricTarget: 0.3
        minCapacity: 2
        maxCapacity: 30
    currentError:
    alwaysAllocateCPU: true
```
After the provisioned instances are allocated, your model is successfully deployed and ready for service.
Release provisioned instances for a function.
Run the following command to disable an auto scaling policy and set the number of provisioned instances to 0:
```bash
sudo s provision put --target 0 --qualifier LATEST -t s.yaml -a {access}
```
Run the following command to check whether the auto scaling policy is disabled:
```bash
s provision list -a {access}
```
If the following output is returned, the auto scaling policy is disabled:
```
[2023-05-10 14:54:46] [INFO] [FC] - Getting list provision: gpu-best-practive-service

End of method: provision
```
Configure an auto scaling policy for GPU-accelerated instances in the Function Compute console
Prerequisites
A service and GPU function are created in Function Compute. For more information, see Create a service and Create a Custom Container function.
Procedure
Log on to the Function Compute console. In the left-side navigation pane, click Services & Functions.
Enable instance-level metrics for the service. For more information, see Enable collection of instance-level metrics.
After you enable instance-level metrics, you can view the GPU-related resources that are consumed by function invocations on the function monitoring page in the Function Compute console.
Click the function that you want to manage. On the page that appears, click the Trigger Management (URL) tab to obtain the URL of the HTTP trigger for subsequent function tests.

Test the function and log on to the Function Compute console to view the monitoring results.
Run the curl command to test the function. In the command, use the URL obtained in the previous step.
```bash
curl https://gpu-poretection-gpu-bes-service-gexsgx****.cn-shanghai.fcapp.run/invoke -H "Content-Type: text/plain" --data "Nice to meet you"
```
If the following output is returned, the test is passed:
```
[{"text": "Nice to meet you", "porn_detection_label": 0, "porn_detection_key": "not_porn", "porn_probs": 0.0, "not_porn_probs": 1.0}]
```
In the Function Compute console, find the service and function that you created, and then click the Metrics tab to view the changes of GPU-related metrics.

On the function details page, click the Auto Scaling tab and click Create Rule.
On the page for creating an auto scaling rule, configure the following parameters based on your business requirements and click Create.
Specify the version and minimum number of instances, and retain the default values for other parameters.

In the Metric-based Setting Modification section, click + Add Configuration and configure the policy.
The following figure provides an example.
After the configuration is complete, you can view the change of the Function Provisioned Instances (count) metric on the function monitoring page.
If you no longer require provisioned GPU-accelerated instances, release them at your earliest opportunity to avoid unnecessary costs.
Perform a stress test
You can use a common stress test tool, such as Apache Bench, to perform stress tests on HTTP functions.
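If you prefer a scripted load test, the following minimal Python sketch sends concurrent POST requests and prints rough latency statistics. It is an illustrative alternative to Apache Bench, not part of the example project; the URL, payload, request count, and concurrency level are placeholders that you should adjust to your own test plan.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholders: replace with your own trigger URL and test parameters.
URL = "https://<your-http-trigger-domain>/invoke"
TOTAL_REQUESTS = 200
CONCURRENCY = 20

def call_once(_):
    """Send one request and return its latency in seconds."""
    start = time.time()
    resp = requests.post(URL, headers={"Content-Type": "text/plain"},
                         data="Nice to meet you", timeout=60)
    resp.raise_for_status()
    return time.time() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(call_once, range(TOTAL_REQUESTS)))
    print(f"requests: {len(latencies)}")
    print(f"mean latency: {statistics.mean(latencies) * 1000:.1f} ms")
    print(f"p90 latency: {latencies[int(len(latencies) * 0.9)] * 1000:.1f} ms")
```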
After a stress test is complete, log on to the Function Compute console, click the function that you want to manage, and view the function metrics on the function details page. The metric details show that provisioned instances of the function are automatically scaled out during the stress test and scaled in after the stress test. The following figure shows an example.

FAQ
How much does it cost to use a real-time inference service in Function Compute?
For information about the billing of Function Compute, see Billing overview. The billing method of provisioned instances is different from that of on-demand instances. Take note of your bill details.
Why do latencies continue to occur after I configure an auto scaling policy?
You can configure a more aggressive auto scaling policy, for example by lowering the metricTarget value of the target tracking policy, so that instances are allocated ahead of traffic spikes and latencies caused by bursts of requests are avoided.
Why is the number of instances not increased when the tracking metric reaches the threshold?
The metrics of Function Compute are collected at one-minute granularity. Scale-out is triggered only after the metric value stays above the threshold for a period of time.