This topic describes offline asynchronous task scenarios for GPU-accelerated instances and explains how to use GPU-accelerated instances in the asynchronous invocation and asynchronous task modes to process offline workloads such as AI inference, AI training, and GPU-accelerated computing. This topic also describes how to use the Custom Container runtime to meet the requirements of offline GPU-accelerated applications.
Introduction
Workloads in offline asynchronous scenarios have one or more of the following characteristics:
Long execution time
In most cases, tasks take minutes or hours to complete and are not sensitive to response time.
Immediate responses
A response is returned immediately after an invocation is triggered, so the main business logic is not blocked by time-consuming processing.
Real-time sensing of task status
The execution status of offline GPU tasks can be viewed in real time, and running tasks can be canceled.
Parallel processing
Offline GPU tasks process large amounts of data and require large amounts of GPU resources. Running tasks in parallel speeds up processing.
Data source integration
Offline GPU tasks consume data from various sources. During execution, the tasks frequently interact with Alibaba Cloud storage services such as Object Storage Service (OSS) and messaging services such as ApsaraMQ. For more information, see Trigger overview.
Function Compute provides the following benefits for offline asynchronous workloads:
Simplified business architecture
Logic that is time-consuming, resource-intensive, or error-prone can be separated from main processes to improve the system response speed, resource utilization, and service availability.
Shortest execution path
Enterprises can build a low-cost asynchronous processing platform for AI applications based on the asynchronous GPU processing capabilities provided by Function Compute.
Adequate GPU resource supply
Function Compute provides abundant GPU resources and can deliver massive GPU computing resources within seconds when large-scale offline tasks arrive. This prevents service interruptions caused by insufficient or delayed supply of GPU computing power. Function Compute is therefore suitable for bursty offline workloads whose traffic surges and declines unpredictably.
Data source integration
Function Compute supports various trigger sources, such as OSS and ApsaraMQ, to simplify data source interaction and processing.
How it works
After you deploy a GPU function, you can submit offline GPU tasks in either the asynchronous invocation mode or the asynchronous task mode. By default, Function Compute allocates on-demand GPU-accelerated instances to provide the underlying infrastructure required for offline asynchronous application scenarios. You can also use provisioned GPU-accelerated instances. For more information, see Instance types and specifications.
When Function Compute receives multiple asynchronously submitted offline GPU tasks, it automatically allocates multiple on-demand GPU-accelerated instances to process the tasks in parallel. Abundant GPU computing resources ensure that offline GPU tasks run in parallel with minimal queuing time. If the offline GPU tasks exceed the processing capabilities of the GPU resources of an Alibaba Cloud account in a region, the excess tasks are queued. You can view the numbers of queued, ongoing, and completed GPU tasks, and you can cancel tasks that are no longer needed. When an offline GPU task completes, a callback can be triggered to perform specific operations on downstream Alibaba Cloud services based on the execution status of the task.
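The following minimal sketch shows how such a task might be submitted asynchronously with the Function Compute Python SDK (fc2). The endpoint, credentials, payload, and the service and function names are placeholders for your own resources; the x-fc-invocation-type header selects asynchronous invocation.

```python
# A minimal sketch of submitting an offline GPU task asynchronously with the
# Function Compute Python SDK (fc2). All resource names and credentials below
# are placeholders.
import json
import fc2

client = fc2.Client(
    endpoint='https://<account-id>.<region>.fc.aliyuncs.com',
    accessKeyID='<access-key-id>',
    accessKeySecret='<access-key-secret>',
)

# 'Async' tells Function Compute to enqueue the request and return immediately,
# so the caller is not blocked while the GPU task runs.
resp = client.invoke_function(
    'my-gpu-service',                      # hypothetical service name
    'my-gpu-task',                         # hypothetical function name
    payload=json.dumps({'input': 'oss://bucket/path/to/data'}),
    headers={'x-fc-invocation-type': 'Async'},
)
print(resp.headers.get('x-fc-request-id'))  # request ID for tracking the task
```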
Container support
GPU-accelerated instances of Function Compute can be used only in Custom Container runtimes. For more information about Custom Container runtimes, see Introduction to Custom Container.
A Custom Container function requires a web server inside the image to execute different code paths and to be triggered by events or HTTP requests. The web server mode is suitable for scenarios in which requests are executed over multiple paths, such as AI training and inference.
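As an illustration, the following is a minimal sketch of such a web server built with Flask. The /invoke path and port 9000 reflect common defaults for event invocations in Custom Container runtimes and are assumptions here; verify both against your function's configuration.

```python
# A minimal sketch of the web server that a Custom Container image runs.
# The /invoke path and port 9000 are assumed defaults for event invocations;
# adapt them to your function's actual configuration.
from flask import Flask, request

app = Flask(__name__)

@app.route('/invoke', methods=['POST'])
def invoke():
    event = request.get_data()                          # raw event payload of the GPU task
    request_id = request.headers.get('x-fc-request-id', '')
    # ... run the model inference or training step on the GPU here ...
    return f'request {request_id} processed', 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9000)  # port must match the function's port setting
```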
GPU specifications
You can select a GPU card type and configure GPU specifications based on your business requirements, specifically the CPU power, GPU memory, memory, and disk space that your algorithm models require. For more information about the specifications of GPU-accelerated instances, see Instance specifications.
Deployment methods
You can deploy your models in Function Compute by using one of the following methods:
Use the Function Compute console. For more information, see Create a function in the Function Compute console.
Call SDKs. For more information, see List of operations by function. A deployment sketch that uses the Python SDK is shown after this list.
Use Serverless Devs. For more information, see Common commands of Serverless Devs.
For more deployment examples, see start-fc-gpu.
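For example, the following sketch deploys a model image as a GPU function through the Python SDK (fc2). The service name, function name, image URL, and specification values are placeholders, and parameter names such as customContainerConfig, instanceType, and gpuMemorySize are assumptions to verify against List of operations by function and your SDK version.

```python
# A hedged sketch of deploying a GPU model as a Custom Container function with
# the fc2 SDK. The customContainerConfig, instanceType, and gpuMemorySize
# parameter names are assumptions; check them against your SDK version.
import fc2

client = fc2.Client(
    endpoint='https://<account-id>.<region>.fc.aliyuncs.com',
    accessKeyID='<access-key-id>',
    accessKeySecret='<access-key-secret>',
)

client.create_service('my-gpu-service')   # hypothetical service name
client.create_function(
    'my-gpu-service',
    'my-gpu-task',                        # hypothetical function name
    handler='not-used',                   # custom containers ignore the handler
    runtime='custom-container',
    customContainerConfig={
        'image': 'registry.<region>.aliyuncs.com/<namespace>/gpu-infer:v1',
    },
    instanceType='fc.gpu.tesla.1',        # GPU instance type; see Instance specifications
    memorySize=16384,                     # memory in MB
    gpuMemorySize=16384,                  # GPU memory in MB
    timeout=7200,                         # seconds; up to 86,400 for asynchronous tasks
)
```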
Asynchronous mode
Offline applications run for long periods of time, so functions must be triggered in the asynchronous mode. After a function is triggered, a response is returned immediately. Because the asynchronous invocation mode does not carry execution status information, enable the asynchronous task mode if you need to track tasks: the execution status of each request can then be queried at any time, and running requests can be canceled. For more information about asynchronous invocations and asynchronous tasks, see Overview of asynchronous invocations and Overview of asynchronous tasks.
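The following sketch illustrates tracking and canceling an asynchronous task with the fc2 SDK. The method names mirror the GetStatefulAsyncInvocation and StopStatefulAsyncInvocation API operations and are assumptions to confirm against your SDK version; the invocation ID and resource names are placeholders.

```python
# A minimal sketch of tracking and canceling an asynchronous task. The method
# names below follow the stateful async invocation API operations and should
# be confirmed against your fc2 SDK version. The function's asynchronous
# invocation policy must have the asynchronous task mode enabled beforehand.
import fc2

client = fc2.Client(
    endpoint='https://<account-id>.<region>.fc.aliyuncs.com',
    accessKeyID='<access-key-id>',
    accessKeySecret='<access-key-secret>',
)

# Submit the task with a caller-chosen invocation ID so it can be queried later.
client.invoke_function(
    'my-gpu-service', 'my-gpu-task',
    payload=b'{}',
    headers={
        'x-fc-invocation-type': 'Async',
        'x-fc-stateful-async-invocation-id': 'task-0001',
    },
)

# Query the task status (for example: Enqueued, Running, Succeeded, Failed, Stopped).
status = client.get_stateful_async_invocation(
    'my-gpu-service', 'my-gpu-task', 'task-0001')
print(status.data)

# Cancel the task if it is no longer needed.
client.stop_stateful_async_invocation(
    'my-gpu-service', 'my-gpu-task', 'task-0001')
```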
Concurrent invocations
The maximum number of concurrent requests that a GPU function can process in a region is determined by the concurrency of each GPU-accelerated instance and the maximum number of physical GPUs that can be used in the region.
Concurrency of GPU-accelerated instances
By default, the concurrency of a GPU-accelerated instance is 1: each instance processes only one request or offline GPU task at a time. You can change the concurrency of a GPU-accelerated instance in the Function Compute console or by using Serverless Devs. For more information, see Create a web function. Configure the concurrency based on your actual scenario. For compute-intensive offline GPU tasks, we recommend that you retain the default value of 1.
Maximum number of physical GPUs used in a region
By default, a maximum of 30 physical GPUs can be allocated in a region within an Alibaba Cloud account. You can view your actual quota and apply for more physical GPUs in the Quota Center.
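For example, if each GPU-accelerated instance occupies one full physical GPU and retains the default concurrency of 1, the default quota of 30 physical GPUs allows at most 30 offline GPU tasks to run in parallel in a region; additional tasks are queued until an instance becomes available.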
Running duration
GPU-accelerated instances of Function Compute support a running duration of up to 86,400 seconds (24 hours). You can use GPU-accelerated instances together with the asynchronous task mode to run or terminate long-running requests with ease in time-consuming scenarios such as AI inference, AI training, audio and video processing, and 3D reconstruction.