
Function Compute: Real-time inference scenarios

Last Updated: Dec 15, 2025

This topic describes how to use provisioned GPU-accelerated instances to build a latency-sensitive real-time inference service.

Background information

Scenarios

Workloads of real-time inference scenarios feature one or more of the following characteristics:

  • Low latency

    Workloads of real-time inference scenarios have strict requirements on the response time of each request. For 90% of requests, the long-tail latency must be within hundreds of milliseconds.

  • Core links

    In most cases, real-time inference occurs in core business links and requires a high inference success rate. Long retries must be avoided. The following items provide examples:

    • Launch page commercials and homepage product recommendations: User-specific commercials and products can be displayed on launch pages and home pages based on user preferences.

    • Real-time production in streaming media: In scenarios such as interactive streaming, live streaming, and ultra-low latency playback, audio and video streams must be transmitted at extremely low end-to-end latency. Performance and user experience must also be guaranteed in scenarios such as real-time AI-based video super resolution and video recognition.

  • Peak and off-peak hours

    Business traffic has peak hours and off-peak hours. The traffic fluctuation trend changes with user habits.

  • Low resource utilization

    In most cases, GPU resources are planned based on traffic peaks. A large amount of resources is wasted during off-peak hours, and the resource utilization is generally lower than 30%.

Benefits

Function Compute provides the following benefits for real-time inference workloads.

  • Provisioned GPU-accelerated instances

    Function Compute allows you to use GPU-accelerated instances in the on-demand mode and provisioned mode. To eliminate the impact of cold starts and meet the low-latency response requirements of real-time inference, you can use provisioned GPU-accelerated instances. For more information about the provisioned mode, see Configure provisioned instances and auto scaling rules.

  • Auto scaling policies for provisioned GPU-accelerated instances (recommended)

    Function Compute allows you to configure metric-based auto scaling policies and scheduled auto scaling policies for provisioned GPU-accelerated instances. The metrics that can be used in metric-based auto scaling policies include concurrency, GPU streaming multiprocessor (SM) utilization, GPU memory utilization, GPU encoder utilization, and GPU decoder utilization. You can use different auto scaling policies for different traffic patterns to meet the computing power requirements of GPU-accelerated instances and reduce deployment costs.

  • Service quality guaranteed at comparatively low costs

    The billing cycle of provisioned GPU-accelerated instances is different from that of on-demand GPU-accelerated instances. Provisioned instances are billed based on the instance lifetime: after you allocate provisioned GPU-accelerated instances, fees are generated regardless of whether requests are being processed. Therefore, the cost of provisioned GPU-accelerated instances is higher than that of on-demand GPU-accelerated instances. However, compared with self-managed GPU clusters, the cost is reduced by more than 50%.

  • Optimal specifications

    Function Compute allows you to select the GPU card type and configure instance specifications, such as vCPU, GPU memory, memory, and disk capacity, based on your business requirements. GPU memory can be configured in increments of 1 GB. This allows you to configure the optimal instance specifications for your business.

  • Burst traffic support

    Function Compute provides abundant GPU resources. When traffic bursts occur in your business, Function Compute supplies a large amount of GPU computing power within seconds. This helps prevent negative business impacts caused by an insufficient or delayed supply of GPU computing power.

How it works

After you deploy a GPU function, you can configure an auto scaling policy to allocate provisioned GPU-accelerated instances. The instances provide the infrastructure that is required for real-time inference scenarios. Function Compute horizontally scales the provisioned GPU-accelerated instances, in a way similar to Horizontal Pod Autoscaling (HPA), based on the metrics that you configure. Requests are preferentially routed to the provisioned GPU-accelerated instances for inference. This eliminates cold starts and allows the service to run at low latency.


Basic information about real-time inference scenarios

Container support

GPU-accelerated instances of Function Compute can be used only in Custom Container runtimes. For more information about Custom Container runtimes, see Overview.
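
The image that you deploy must contain an HTTP server that listens on the port specified by caPort. The following code is a minimal sketch of such a server, assuming a Flask application. The /invoke path and port 9000 match the example that is deployed later in this topic; the model-loading and inference calls are hypothetical placeholders that you replace with your own code.

  # app.py: minimal HTTP server sketch for a Custom Container GPU function (Flask assumed).
  from flask import Flask, request

  app = Flask(__name__)

  # model = load_model(...)  # Hypothetical placeholder: load your model once at startup.

  @app.route('/initialize', methods=['POST'])
  def initialize():
      # Optional instance lifecycle hook. See the model warmup section later in this topic.
      return "Initialized\n"

  @app.route('/invoke', methods=['POST'])
  def invoke():
      text = request.get_data(as_text=True)
      # result = model.inference(text)  # Hypothetical placeholder: run inference here.
      result = text
      return result

  if __name__ == '__main__':
      # The port must match the caPort value configured for the function (9000 in this topic).
      app.run(host='0.0.0.0', port=9000)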

Specifications for GPU-accelerated instances

In inference scenarios, you can select different GPU card types and configure specifications of GPU-accelerated instances based on the computing power required by your business. The specifications of GPU-accelerated instances include the GPU memory, memory, and disk capacity. For more information about specifications of GPU-accelerated instances, see Instance specifications.
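
For reference, the following snippet shows how these specifications appear in the s.yaml example that is deployed later in this topic. The values are illustrative; memory sizes are specified in MB.

  function:
    instanceType: fc.gpu.tesla.1   # GPU card type
    cpu: 2                         # Number of vCPUs
    memorySize: 8192               # Memory size, in MB (8 GB)
    gpuMemorySize: 8192            # GPU memory size, in MB (8 GB)
    diskSize: 512                  # Disk size, in MB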

Model deployment methods

You can deploy your models in Function Compute by using multiple methods, for example, by using Serverless Devs or the Function Compute console.

For more deployment examples, see start-fc-gpu.

Auto scaling for provisioned instances


Scheduled auto scaling policy

If traffic changes at regular intervals in a real-time inference scenario, you can configure a scheduled auto scaling policy to allocate and release provisioned GPU-accelerated instances at specified points in time. This way, provisioned GPU-accelerated instances can be allocated several minutes before traffic spikes and released after traffic falls, which ensures optimal performance at a low cost. For more information, see Scheduled Setting Modification.
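
The following snippet is a minimal sketch of a scheduled policy, written in the same provision configuration format as the provision.json example that is used later in this topic. The scheduledActions field name appears in the s provision list output; the nested field names, the cron expression format, and all values are assumptions that you should adapt to your own traffic pattern.

  {
    "target": 2,
    "scheduledActions": [
      {
        "name": "scale-out-before-morning-peak",
        "startTime": "2023-01-01T00:00:00.000Z",
        "endTime": "2024-01-01T00:00:00.000Z",
        "target": 20,
        "scheduleExpression": "cron(0 0 8 * * *)"
      },
      {
        "name": "scale-in-after-evening-peak",
        "startTime": "2023-01-01T00:00:00.000Z",
        "endTime": "2024-01-01T00:00:00.000Z",
        "target": 2,
        "scheduleExpression": "cron(0 0 23 * * *)"
      }
    ]
  }

You can deploy such a configuration with the same s provision put --config command that is shown in the procedure later in this topic.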

Metric-based auto scaling policies

The following metrics can be tracked for GPU functions in Function Compute. You can select metrics based on your business requirements to configure dynamic scaling policies.

In real-time inference scenarios, we recommend that you use the ProvisionedConcurrencyUtilization metric as the HPA tracking metric. The concurrency and QPS metrics are business-oriented, whereas the GPU resource utilization metrics are resource-oriented, and changes in business metrics drive the changes in resource metrics. Tracking a business-oriented metric therefore triggers the scaling of provisioned GPU-accelerated instances earlier and helps ensure the quality of service. For example, if a function has 10 provisioned instances whose instanceConcurrency is 1 and six requests are being processed at the same time, the value of ProvisionedConcurrencyUtilization is 0.6.

  • ProvisionedConcurrencyUtilization

    Concurrency utilization of provisioned instances. This metric collects the ratio of in-use instance concurrency to the allocated provisioned concurrency of the function. Valid values: [0, 1], which corresponds to the utilization rate from 0% to 100%.

  • GPUSmUtilization

    GPU SM utilization. This metric collects statistics on the maximum GPU SM utilization of multiple instances.

  • GPUMemoryUtilization

    GPU memory utilization. This metric collects the maximum GPU memory utilization of multiple instances.

  • GPUDecoderUtilization

    GPU hardware decoder utilization. This metric collects the maximum utilization of GPU hardware decoders of multiple instances.

  • GPUEncoderUtilization

    GPU hardware encoder utilization. This metric collects the maximum utilization of GPU hardware encoders of multiple instances.

Model warmup

To resolve the issue that the first requests after a model goes online take a long time, Function Compute provides the model warmup feature. This feature enables a model to enter the working state immediately after it is launched.

We recommend that you configure the initialize lifecycle hook in Function Compute to warm up models. Function Compute automatically executes the business logic in initialize to warm up models. For more information, see Lifecycle hooks for function instances.

You can perform the following operations to warm up a model.

  1. Add the model warmup logic to the initialize lifecycle hook of the instance.

    Add an invocation path for POST /initialize to the HTTP server that you build, and place the model warmup logic under the /initialize path. You can have the model perform a few simple inferences to achieve the warmup effect. The following sample code provides an example in Python (a Flask-based HTTP server is assumed):

    from flask import Flask, request

    app = Flask(__name__)
    # model = ...  # Placeholder: load your model here when the HTTP server starts.

    def prewarm_inference():
        # Perform a simple inference task so that the model is fully loaded and warmed up.
        res = model.inference()

    @app.route('/initialize', methods=['POST'])
    def initialize():
        request_id = request.headers.get("x-fc-request-id", "")
        print("FC Initialize Start RequestId: " + request_id)

        # Prewarm the model by performing a naive inference task.
        prewarm_inference()

        print("FC Initialize End RequestId: " + request_id)
        return "Function is initialized, request_id: " + request_id + "\n"
  2. On the function configuration page, configure the instance lifecycle hook.

    On the Configurations tab of the Function Details page, click Modify in the Instance Lifecycle Hook section. In the instance lifecycle hook panel, configure the Initializer hook.


Configure and verify an auto scaling policy

This topic describes two methods to configure auto scaling policies for GPU-accelerated instances: by using Serverless Devs and by using the Function Compute console.

After you configure an auto scaling policy, you can perform a stress test to view the effect of the auto scaling policy. For more information, see Perform a stress test.

Use Serverless Devs to configure an auto scaling policy for GPU-accelerated instances

Before you start

Serverless Devs and Docker are installed and configured on the machine that you use to deploy the project.

Procedure

  1. Run the following command to clone the project:

    git clone https://github.com/devsapp/start-fc-gpu.git
  2. Deploy the project.

    1. Run the following command to go to the project directory:

      cd start-fc-gpu/fc-http-gpu-inference-paddlehub-nlp-porn-detection-lstm/src/

      The following snippet shows the directory structure of the project:

      .
      ├── hook
      │   └── index.js
      └── src
          ├── code
          │   ├── Dockerfile
          │   ├── app.py
          │   ├── hub_home
          │   │   ├── conf
          │   │   ├── modules
          │   │   └── tmp
          │   └── test
          │       └── client.py
          └── s.yaml
    2. Run the following command to use Docker to build an image and push the image to your image repository:

      export IMAGE_NAME="registry.cn-shanghai.aliyuncs.com/fc-gpu-demo/paddle-porn-detection:v1"
      # sudo docker build -f ./code/Dockerfile -t $IMAGE_NAME .
      # sudo docker push $IMAGE_NAME
      Important

      The PaddlePaddle framework is large and it takes a long period of time (about 1 hour) to build an image for the first time. Therefore, we provide a public image that resides in a virtual private cloud (VPC) for you to use. If you use the public image, you do not need to execute the preceding docker build and docker push commands.

    3. Edit the s.yaml file.

      edition: 1.0.0
      name: container-demo
      access: {access}
      vars:
        region: cn-shanghai
      services:
        gpu-best-practive:
          component: devsapp/fc
          props:
            region: ${vars.region}
            service:
              name: gpu-best-practive-service
              internetAccess: true
              logConfig:
                enableRequestMetrics: true
                enableInstanceMetrics: true
                logBeginRule: DefaultRegex
                project: log-ca041e7c29f2a47eb8aec48f94b****   # Use the name of the Log Service project that you created.
                logstore: config*****  # Use the name of the Logstore that you created.
              role: acs:ram::143199913651****:role/aliyunfcdefaultrole 
            function:
              name: gpu-porn-detection
              description: This is the demo function deployment
              handler: not-used
              timeout: 1200
              caPort: 9000
              memorySize: 8192    # Set the memory size to 8 GB.
              cpu: 2
              gpuMemorySize: 8192     # Set the GPU memory to 8 GB.
              diskSize: 512
              instanceType: fc.gpu.tesla.1    # Deploy GPU-accelerated instances that use Tesla GPUs.
              instanceConcurrency: 1
              runtime: custom-container
              environmentVariables:
                FCGPU_RUNTIME_SHMSIZE : '8589934592'
              customContainerConfig:
                image: registry.cn-shanghai.aliyuncs.com/serverless_devs/gpu-console-supervising:paddle-porn-detection  # The public image is used as an example. Use the actual name of your image.
                accelerationType: Default
            triggers:
              - name: httpTrigger
                type: http
                config:
                  authType: anonymous
                  methods:
                    - GET
                    - POST
    4. Run the following command to deploy the function:

      sudo s deploy --skip-push true -t s.yaml

      When the execution is complete, a URL is returned in the output. You can use the URL to test the function.

  3. Test the function and log on to the Function Compute console to view the monitoring results.

    1. Run the curl command to test the function. In the command, use the URL obtained in the previous step.

      curl https://gpu-poretection-gpu-bes-service-gexsgx****.cn-shanghai.fcapp.run/invoke -H "Content-Type: text/plain" --data "Nice to meet you"

      If the following output is returned, the test is passed.

      [{"text": "Nice to meet you", "porn_detection_label": 0, "porn_detection_key": "not_porn", "porn_probs": 0.0, "not_porn_probs": 1.0}]%
    2. In the Function Compute console, choose Advanced Features > Monitoring. Click the service and function that you deployed in Step 2. Then, click the Metrics tab to view the changes of GPU-related metrics.


  4. Configure an auto scaling policy for provisioned instances.

    1. Create the provision.json template.

      The following sample code shows an example. This template uses the concurrency utilization of provisioned instances as the tracking metric, with a minimum of 2 instances and a maximum of 30 instances.

      {
        "target": 2,
        "targetTrackingPolicies": [
          {
            "name": "scaling-policy-demo",
            "startTime": "2023-01-01T16:00:00.000Z",
            "endTime": "2024-01-01T16:00:00.000Z",
            "metricType": "ProvisionedConcurrencyUtilization",
            "metricTarget": 0.3,
            "minCapacity": 2,
            "maxCapacity": 30
          }
        ]
      }
    2. Run the following command to deploy the scaling policy:

      sudo s provision put --config ./provision.json --qualifier LATEST -t s.yaml -a {access}
    3. Run the sudo s provision list command to verify the deployment. In the following sample output, the values of target and current are equal, which indicates that the provisioned instances are started and the auto scaling policy is deployed as expected.

      [2023-05-10 14:49:03] [INFO] [FC] - Getting list provision: gpu-best-practive-service
      gpu-best-practive:
        -
          serviceName:            gpu-best-practive-service
          qualifier:              LATEST
          functionName:           gpu-porn-detection
          resource:               143199913651****#gpu-best-practive-service#LATEST#gpu-porn-detection
          target:                 2
          current:                2
          scheduledActions:       null
          targetTrackingPolicies:
            -
              name:         scaling-policy-demo
              startTime:    2023-01-01T16:00:00.000Z
              endTime:      2024-01-01T16:00:00.000Z
              metricType:   ProvisionedConcurrencyUtilization
              metricTarget: 0.3
              minCapacity:  2
              maxCapacity:  30
          currentError:
          alwaysAllocateCPU:      true

      After the provisioned instances are allocated, your model is successfully deployed and ready for service.

  5. Release provisioned instances for a function.

    1. Run the following command to disable an auto scaling policy and set the number of provisioned instances to 0:

      sudo s provision put --target 0 --qualifier LATEST -t s.yaml -a {access}
    2. Run the following command to check whether the auto scaling policy is disabled:

      s provision list -a {access}

      If the following output is returned, the auto scaling policy is disabled:

      [2023-05-10 14:54:46] [INFO] [FC] - Getting list provision: gpu-best-practive-service
      End of method: provision

Configure an auto scaling policy for GPU-accelerated instances in the Function Compute console

Prerequisites

A service and GPU function are created in Function Compute. For more information, see Create a service and Create a Custom Container function.

Procedure

  1. Log on to the Function Compute console. In the left-side navigation pane, click Services & Functions.

  2. Enable instance-level metrics for the service. For more information, see Enable collection of instance-level metrics.

    After you enable instance-level metrics, you can view the GPU-related resources that are consumed by function invocations on the function monitoring page in the Function Compute console.

  3. Click the function that you want to manage. On the page that appears, click the Trigger Management (URL) tab to obtain the URL of the HTTP trigger for subsequent function tests.


  4. Test the function and log on to the Function Compute console to view the monitoring results.

    1. Run the curl command to test the function. In the command, use the URL obtained in the previous step.

      curl https://gpu-poretection-gpu-bes-service-gexsgx****.cn-shanghai.fcapp.run/invoke -H "Content-Type: text/plain" --data "Nice to meet you"

      If the following output is returned, the test is passed.

      [{"text": "Nice to meet you", "porn_detection_label": 0, "porn_detection_key": "not_porn", "porn_probs": 0.0, "not_porn_probs": 1.0}]%
    2. In the Function Compute console, choose Advanced Features > Monitoring. Click the service and function that you deployed. Then, click the Metrics tab to view the changes of GPU-related metrics.


  5. On the function details page, click the Auto Scaling tab and click Create Rule.

  6. On the page for creating an auto scaling rule, configure the following parameters based on your business requirements and click Create.

    1. Specify the version and minimum number of instances, and retain the default values for other parameters.


    2. In the Metric-based Setting Modification section, click + Add Configuration and configure the policy.


    After the configuration is complete, you can choose Metrics > Function Metrics to view the change of the Function Provisioned Instances (count) metric.

Important

If you no longer require provisioned GPU-accelerated instances, delete the provisioned GPU-accelerated instances at your earliest opportunity.

Perform a stress test

You can use a common stress test tool, such as Apache Bench, to perform stress tests on HTTP functions.
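
For example, the following commands are a sketch of an Apache Bench test against the function used in this topic; replace the placeholder URL with your own HTTP trigger URL, and adjust the request count and concurrency to your needs.

  # Write the request body to a file, then run the Apache Bench (ab) stress test:
  # 10,000 POST requests with a concurrency of 20 against the /invoke path.
  echo -n "Nice to meet you" > body.txt
  ab -n 10000 -c 20 -p body.txt -T "text/plain" "https://<your-http-trigger-url>/invoke"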

After a stress test is complete, log on to the Function Compute console and click the function that you want to manage. On the function details page, choose Metrics > Function Metrics to view the test results. The metric details show that provisioned instances of the function are automatically scaled out during the stress test and scaled in after the stress test. The following figure shows an example.


FAQ

How much does it cost to use a real-time inference service in Function Compute?

For information about the billing of Function Compute, see Billing overview. The billing method of provisioned instances is different from that of on-demand instances. Take note of your bill details.

Why do high latencies still occur after I configure an auto scaling policy?

You can configure a more aggressive auto scaling policy that allocates instances before traffic spikes. This helps prevent latencies caused by bursts of requests.

Why is the number of instances not increased when the tracking metric reaches the threshold?

Metrics in Function Compute are collected at a minute-level granularity. The scale-out mechanism is triggered only after the metric value remains at or above the threshold for a period of time.