Basic CPU instances are usually sufficient for general-purpose computing scenarios on Function Compute, such as web services and data processing. However, for scenarios that require large-scale parallel computing or deep learning, such as audio and video processing, artificial intelligence (AI) inference, and image editing, GPU-accelerated instances can significantly improve computing efficiency.
Function Compute offers three instance modes for GPU workloads: Elastic Instance, Provisioned Instance, and Mixed Mode. Choose the mode and specifications that fit your needs to balance resource utilization, performance, and stability.
Instance type selection
For CPU functions, only Elastic Instances are supported. For GPU Functions, you can choose the most suitable instance type based on your business's resource utilization, latency sensitivity, and cost stability requirements. You can switch between the three instance types at any time without service interruption.
You can bind Provisioned Instances only to GPU Functions that belong to the Ada, Ada.2, Ada.3, Hopper, or Xpu.1 series.
Elastic Instance
If you set the minimum number of instances for a function to 0, instances automatically scale based on the request volume and are released when there are no requests. This enables a pay-as-you-go model in which you are charged only for what you use, maximizing cost savings. The higher the request frequency, the better the resource utilization and the greater the cost savings compared with virtual machines.
Cold start behavior
Yes, cold starts can occur. For latency-sensitive workloads, you can mitigate cold starts by setting the minimum number of instances to 1 or higher. This pre-allocates elastic resources, allowing instances to be activated quickly to handle incoming requests.
Billing (Pay-as-you-go)
Function costs include charges for active Elastic Instances and Elastic Instances in Shallow Hibernation. If you set the minimum number of instances to 1 or more, we recommend enabling the Shallow Hibernation mode. In this state, you are not charged for vCPU resources, and GPU resources are billed at only one-fifth of the active rate, significantly lowering costs compared to active elastic instances.
For more information about the use cases for active and Shallow Hibernation states, see Elastic Instance.
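To make the pricing relationship concrete, the sketch below compares hourly costs for an active instance and one in Shallow Hibernation. The per-second rates are placeholders invented for illustration, not real Function Compute prices; only the rules (vCPU free in hibernation, GPU billed at one-fifth of the active rate) come from this document.

```python
# Placeholder rates, NOT real Function Compute prices.
# Replace them with the rates from your own bill.
GPU_RATE_ACTIVE = 0.00011   # per GPU-second (hypothetical)
VCPU_RATE_ACTIVE = 0.00009  # per vCPU-second (hypothetical)

def hourly_cost(state: str, gpus: int = 1, vcpus: int = 4) -> float:
    """Hourly cost of one instance in the given billing state."""
    seconds = 3600
    if state == "active":
        return (gpus * GPU_RATE_ACTIVE + vcpus * VCPU_RATE_ACTIVE) * seconds
    if state == "shallow_hibernation":
        # vCPU resources are not charged; GPU resources are billed
        # at one-fifth of the active rate.
        return gpus * GPU_RATE_ACTIVE / 5 * seconds
    raise ValueError(f"unknown state: {state}")

print(round(hourly_cost("active"), 4))
print(round(hourly_cost("shallow_hibernation"), 4))
```

With these sample rates, an hour in Shallow Hibernation costs a small fraction of an active hour, which is why it is recommended when the minimum number of instances is 1 or more.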
Provisioned Instance
This instance type applies only to GPU Functions. You purchase a Provisioned Resource Pool in advance and then allocate a specific number and type of provisioned instances to your function. This approach provides predictable, fixed costs and is ideal for workloads with high resource utilization, strict latency requirements, or stable billing requirements.
After purchasing a monthly provisioned resource pool, the platform allocates a certain quota of boost instances in addition to your subscription-based provisioned instances. This boost instance quota is not billed.
Cold start behavior
No, there are no cold starts. When you use Provisioned Instances, requests within your allocated capacity receive a real-time response. The maximum number of concurrent requests a function can handle is calculated as: (Number of allocated Provisioned Instances) × (Instance concurrency) + (Boost instance quota). Any requests that exceed this limit are throttled.
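The throttling threshold from this formula can be sketched as follows; the numbers in the example are illustrative, not defaults.

```python
def max_concurrent_requests(provisioned_instances: int,
                            instance_concurrency: int,
                            boost_quota: int) -> int:
    """Maximum concurrent requests before throttling:
    provisioned instances x instance concurrency + boost instance quota."""
    return provisioned_instances * instance_concurrency + boost_quota

# Example: 4 provisioned instances, instance concurrency 5,
# and a boost instance quota of 2 (all illustrative values).
print(max_concurrent_requests(4, 5, 2))  # 22
```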
Billing (Subscription)
The function cost is the total subscription fee for all purchased provisioned resource pools. The boost instance quota is not billed.
Provisioned Instance and Elastic Instance (Mixed Mode)
This mode applies only to GPU Functions. It combines the benefits of Provisioned Instances and Elastic Instances, making it ideal for workloads with significant traffic fluctuations. The system first uses the provisioned resource pool to handle steady-state traffic. When requests exceed the capacity of the provisioned pool, the system automatically scales out by launching elastic instances. This approach guarantees stable baseline capacity while effectively managing sudden traffic bursts.
Cold start behavior
Partially. Requests handled by the provisioned resource pool (up to the minimum number of instances) are processed in real time with no cold starts. However, when traffic triggers auto-scaling and new elastic instances are launched, those new instances experience a cold start.
Billing
The cost in Mixed Mode consists of both subscription and pay-as-you-go components:
Provisioned portion: Billed against your purchased Provisioned Resource Pool quota.
Elastic portion: Instances launched beyond the provisioned quota are billed on a pay-as-you-go basis, with the same rates as active and shallow hibernation elastic instances.
Instance specifications
CPU instances
| vCPU (cores) | Memory size (MB) | Maximum code package size (GB) | Maximum function execution duration (s) | Maximum disk size (GB) | Maximum bandwidth (Gbps) |
| --- | --- | --- | --- | --- | --- |
| 0.05 to 16, in multiples of 0.05 | 128 to 32768, in multiples of 64 | 10 | 86400 | 10. Valid values: 512 MB (default) or 10 GB | 5 |

Note: The ratio of vCPUs to memory size (in GB) must be between 1:1 and 1:4.
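A quick way to validate a CPU configuration against these constraints is sketched below; the function is illustrative, not part of any Function Compute SDK.

```python
def is_valid_cpu_config(vcpu: float, memory_mb: int) -> bool:
    """Check a CPU instance configuration against the documented limits."""
    # vCPU: 0.05 to 16 cores, in multiples of 0.05.
    # Work in hundredths of a core to avoid float rounding issues.
    centi_cores = round(vcpu * 100)
    if not (5 <= centi_cores <= 1600) or centi_cores % 5 != 0:
        return False
    # Memory: 128 MB to 32768 MB, in multiples of 64 MB.
    if not (128 <= memory_mb <= 32768) or memory_mb % 64 != 0:
        return False
    # The vCPU-to-memory (GB) ratio must be between 1:1 and 1:4.
    memory_gb = memory_mb / 1024
    return vcpu <= memory_gb <= 4 * vcpu

print(is_valid_cpu_config(2, 4096))   # True: 2 vCPUs, 4 GB is a 1:2 ratio
print(is_valid_cpu_config(2, 16384))  # False: 1:8 exceeds the 1:4 limit
```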
GPU instance hardware specifications
| Instance type | GPU memory | FP16 computing power | FP32 computing power | Max cards per instance |
| --- | --- | --- | --- | --- |
| fc.gpu.tesla.1 | 16 GB | 65 TFLOPS | 8 TFLOPS | 4 cards |
| fc.gpu.ampere.1 | 24 GB | 125 TFLOPS | 31.2 TFLOPS | 8 cards |
| fc.gpu.ada.1 | 48 GB | 119 TFLOPS | 60 TFLOPS | |
| fc.gpu.ada.2 | 24 GB | 166 TFLOPS | 83 TFLOPS | |
| fc.gpu.ada.3 | 48 GB | 148 TFLOPS | 73.5 TFLOPS | |
| fc.gpu.hopper.1 | 96 GB | 148 TFLOPS | 44 TFLOPS | |
| fc.gpu.hopper.2 | 141 GB | 148 TFLOPS | 44 TFLOPS | |
| fc.gpu.blackwell.1 | 32 GB | 104.8 TFLOPS | 104.8 TFLOPS | |
| fc.gpu.xpu.1 | 96 GB | 123 TFLOPS | 61.5 TFLOPS | 16 cards |
vCPU and memory configuration for GPU instances
Note: Multi-card resource formula: Total vCPUs = vCPUs per card × Number of cards; Total memory = Memory per card × Number of cards.

| Instance type | vCPU (per card) | Memory range (per card) | Memory increment |
| --- | --- | --- | --- |
| fc.gpu.tesla.1 | 4 cores | 4 to 16 GB (4096 to 16384 MB) | 4 GB (4096 MB) |
| fc.gpu.tesla.1 | 8 cores | 8 to 32 GB (8192 to 32768 MB) | 4 GB (4096 MB) |
| fc.gpu.tesla.1 | 16 cores | 16 to 64 GB (16384 to 65536 MB) | 4 GB (4096 MB) |
| fc.gpu.ampere.1 | 8 cores | 8 to 32 GB (8192 to 32768 MB) | 4 GB (4096 MB) |
| fc.gpu.ampere.1 | 16 cores | 16 to 64 GB (16384 to 65536 MB) | 4 GB (4096 MB) |
| fc.gpu.ada.1 / fc.gpu.ada.2 / fc.gpu.ada.3 | 4 cores | 16 to 32 GB (16384 to 32768 MB) | 4 GB (4096 MB) |
| fc.gpu.ada.1 / fc.gpu.ada.2 / fc.gpu.ada.3 | 8 cores | 32 to 64 GB (32768 to 65536 MB) | 4 GB (4096 MB) |
| fc.gpu.ada.1 / fc.gpu.ada.2 / fc.gpu.ada.3 | 16 cores | 64 to 120 GB (65536 to 122880 MB) | 4 GB (4096 MB) |
| fc.gpu.hopper.1 | 4 cores | 16 to 32 GB (16384 to 32768 MB) | 4 GB (4096 MB) |
| fc.gpu.hopper.1 | 8 cores | 32 to 64 GB (32768 to 65536 MB) | 4 GB (4096 MB) |
| fc.gpu.hopper.1 | 16 cores | 64 to 96 GB (65536 to 98304 MB) | 4 GB (4096 MB) |
| fc.gpu.hopper.1 | 24 cores | 96 to 120 GB (98304 to 122880 MB) | 4 GB (4096 MB) |
| fc.gpu.hopper.2 | 4 cores | 16 to 32 GB (16384 to 32768 MB) | 4 GB (4096 MB) |
| fc.gpu.hopper.2 | 8 cores | 32 to 64 GB (32768 to 65536 MB) | 4 GB (4096 MB) |
| fc.gpu.hopper.2 | 16 cores | 64 to 128 GB (65536 to 131072 MB) | 4 GB (4096 MB) |
| fc.gpu.hopper.2 | 24 cores | 96 to 248 GB (98304 to 253952 MB) | 4 GB (4096 MB) |
| fc.gpu.blackwell.1 | 4 cores | 16 to 32 GB (16384 to 32768 MB) | 4 GB (4096 MB) |
| fc.gpu.blackwell.1 | 8 cores | 32 to 64 GB (32768 to 65536 MB) | 4 GB (4096 MB) |
| fc.gpu.blackwell.1 | 16 cores | 64 to 120 GB (65536 to 122880 MB) | 4 GB (4096 MB) |
| fc.gpu.blackwell.1 | 24 cores | 96 to 184 GB (98304 to 188416 MB) | 4 GB (4096 MB) |
| fc.gpu.xpu.1 | 4 cores | 16 to 48 GB (16384 to 49152 MB) | 4 GB (4096 MB) |
| fc.gpu.xpu.1 | 8 cores | 32 to 96 GB (32768 to 98304 MB) | 4 GB (4096 MB) |
| fc.gpu.xpu.1 | 12 cores | 48 to 120 GB (49152 to 122880 MB) | 4 GB (4096 MB) |
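The multi-card resource formula in the note above can be sketched directly:

```python
def multi_card_resources(vcpus_per_card: int,
                         mem_gb_per_card: int,
                         cards: int) -> tuple[int, int]:
    """Total vCPUs = vCPUs per card x number of cards;
    total memory = memory per card x number of cards."""
    return vcpus_per_card * cards, mem_gb_per_card * cards

# Example: an fc.gpu.tesla.1 configuration with 4 cards,
# 8 vCPUs and 16 GB of memory per card (illustrative values).
total_vcpus, total_mem_gb = multi_card_resources(8, 16, 4)
print(total_vcpus, total_mem_gb)  # 32 64
```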
GPU-accelerated instances also support the following resource specifications.
| Image size (GB) | Maximum function execution duration (s) | Disk size (GB) | Maximum bandwidth (Gbps) |
| --- | --- | --- | --- |
| ACR Enterprise Edition (Standard, Premium, or Basic): 15; ACR Personal Edition (Free): 15 | 86400 | 512 MB, or 10 GB to 200 GB in 10 GB increments | 5 |
Note: Setting the instance type to g1 is equivalent to setting it to fc.gpu.tesla.1.
Tesla series GPU-accelerated instances are supported in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), Japan (Tokyo), US (Virginia), and Singapore.
Ada series GPU-accelerated instances are supported in the following regions: China (Beijing), China (Hangzhou), China (Shanghai), China (Shenzhen), Singapore, and US (Virginia).
Relationship between GPU instance specifications and instance concurrency
An Ada.1 GPU has 48 GB of memory, and a Tesla series GPU has 16 GB of memory. Function Compute allocates the full memory of a GPU card to a single GPU container. Because the default GPU card quota is a maximum of 30 per region, a maximum of 30 GPU containers can run simultaneously in that region.
If the instance concurrency of a GPU function is 1, the function can process up to 30 inference requests concurrently in a region.
If the instance concurrency of a GPU function is 5, the function can process up to 150 inference requests concurrently in a region.
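Both examples follow the same calculation; a minimal sketch:

```python
def max_region_concurrency(gpu_card_quota: int,
                           instance_concurrency: int) -> int:
    """Maximum concurrent inference requests in a region.

    Each GPU container is allocated a full GPU card, so the regional
    card quota caps the number of simultaneously running containers.
    """
    return gpu_card_quota * instance_concurrency

# With the default quota of 30 cards per region:
print(max_region_concurrency(30, 1))  # 30
print(max_region_concurrency(30, 5))  # 150
```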
Single-instance concurrency
To improve resource utilization, you can configure single-instance concurrency based on your application's resource requirements. In this configuration, multiple tasks can run on a single instance and share CPU and memory resources, which improves overall resource utilization. For more information, see Configure instance concurrency.
Execution duration for single-instance, single-concurrency
When an instance executes a single request, the execution duration is measured from when the request arrives at the instance to when the request execution is complete.
Execution duration for single-instance, multiple-concurrency
When an instance executes multiple requests concurrently, the execution duration is measured from the time the first request arrives at the instance to the time the last request is completed. This resource reuse helps save costs.
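The two measurement rules can be illustrated with per-request arrival and completion times; this is a sketch of the measurement window, not an official billing formula.

```python
def execution_duration(requests: list[tuple[float, float]]) -> float:
    """requests: (arrival_s, completion_s) pairs handled by one instance.

    With multiple concurrent requests, the duration runs from the first
    arrival to the last completion; with a single request, this reduces
    to that request's own span.
    """
    first_arrival = min(arrival for arrival, _ in requests)
    last_completion = max(done for _, done in requests)
    return last_completion - first_arrival

# Single request: measured from its arrival to its completion.
print(execution_duration([(0.0, 2.0)]))                        # 2.0
# Three overlapping requests share one measurement window.
print(execution_duration([(0.0, 2.0), (0.5, 2.5), (1.0, 1.5)]))  # 2.5
```

Three overlapping requests are billed for one 2.5 s window rather than three separate spans, which is the cost saving from resource reuse.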
References
For more information about the billing methods and billable items of Function Compute, see Billing overview.
When you use an API to create a function, you can use the instanceType parameter to specify the instance type. For more information, see CreateFunction. To learn how to specify the instance type and specifications in the console, see Function creation.
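As an illustration, a CreateFunction request body might carry the instance type as follows. Only the instanceType parameter is taken from this document; every other field name and all values are placeholders, so verify them against the CreateFunction API reference before use.

```python
import json

# Minimal sketch of a CreateFunction request body.
# "instanceType" is the parameter described above; the remaining
# fields and all values are illustrative placeholders.
payload = {
    "functionName": "my-gpu-inference",  # placeholder name
    "runtime": "custom-container",       # placeholder runtime
    "instanceType": "fc.gpu.tesla.1",    # equivalent to the legacy g1 value
    "instanceConcurrency": 5,            # placeholder concurrency
}

print(json.dumps(payload, indent=2))
```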