This topic describes offline asynchronous task scenarios for GPU-accelerated instances and how to use GPU-accelerated instances in the asynchronous invocation and asynchronous task modes to process workloads such as offline AI inference, offline AI training, and offline GPU acceleration. This topic also describes how to use a Custom Container runtime in non-web server mode to meet the requirements of offline GPU applications.
Introduction
Workloads in offline asynchronous scenarios feature one or more of the following characteristics:
Long execution time
In most cases, a task takes minutes or even hours to process. Such tasks are not sensitive to response time.
Immediate responses
Responses are returned immediately after invocations are triggered, so the main business logic is not blocked by time-consuming processing.
Real-time sensing of task status
Users need to view the execution status of offline GPU tasks in real time and be able to cancel the tasks.
Parallel processing
Offline GPU tasks process large amounts of data and require large amounts of GPU resources. Parallel execution speeds up processing.
Data source integration
Offline GPU tasks require various data sources. During execution, tasks frequently interact with multiple Alibaba Cloud services, such as Object Storage Service (OSS) for storage and ApsaraMQ for messaging. For more information, see Overview.
Function Compute delivers the following benefits for offline asynchronous workloads:
Simplified business architecture
Time-consuming, resource-consuming, or error-prone logic can be separated from the main process to improve system response speed, resource utilization, and service availability.
Shortest execution path
Enterprises can build an asynchronous processing platform for AI applications based on the asynchronous GPU processing capabilities that are provided by Function Compute at low costs.
Adequate GPU resource supply
Function Compute provides abundant GPU resources and can deliver massive GPU computing resources within seconds when large-scale offline tasks arise. This prevents service interruptions caused by insufficient or delayed supply of GPU computing power. Function Compute is therefore suitable for offline workloads with unpredictable traffic spikes and troughs.
Data source integration
Function Compute supports various trigger sources, such as OSS and ApsaraMQ, to simplify data source interaction and processing.
Workflow
After you deploy a GPU function, you can choose to submit offline GPU tasks by using the asynchronous invocation mode or asynchronous task mode. By default, Function Compute uses on-demand GPU-accelerated instances to provide the infrastructure required for offline asynchronous applications. You can also use provisioned GPU-accelerated instances. For more information, see Provisioned mode.
When Function Compute receives multiple offline GPU tasks that are submitted asynchronously, it automatically starts multiple on-demand GPU-accelerated instances to process the tasks in parallel, drawing on abundant GPU computing resources to minimize queuing time. If the number of offline GPU tasks exceeds the GPU processing capacity of your Alibaba Cloud account in a region, the excess tasks are queued. You can view the numbers of queued, in-process, and completed GPU tasks, and cancel tasks that are no longer needed. After an offline GPU task is processed, specific operations can be triggered on other Alibaba Cloud services based on the execution status of the task.
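To illustrate the submission side of this workflow, the following sketch builds an asynchronous invocation request for a GPU function. The X-Fc-Invocation-Type header is the documented switch between synchronous and asynchronous invocations; the URL path layout shown here and the endpoint placeholder are assumptions (check the API reference for your API version), and request signing is omitted for brevity.

```python
# Sketch: build an asynchronous invocation request for a GPU function.
# The URL path follows a common FC API layout and is an assumption here;
# authentication (signature headers) is omitted.
import json


def build_async_invoke_request(endpoint, service_name, function_name, event):
    """Return the URL, headers, and body for an asynchronous invocation."""
    url = (f"{endpoint}/2021-04-06/services/{service_name}"
           f"/functions/{function_name}/invocations")
    headers = {
        "Content-Type": "application/json",
        # "Async" tells Function Compute to queue the request and return
        # immediately instead of waiting for the function to finish.
        "X-Fc-Invocation-Type": "Async",
    }
    return url, headers, json.dumps(event)


# Hypothetical endpoint and names for illustration only.
url, headers, body = build_async_invoke_request(
    "https://<account-id>.<region>.fc.aliyuncs.com",
    "gpu-service", "offline-inference",
    {"oss_object": "inputs/frame-001.png"},
)
```

Because the response returns immediately, the caller is free to continue its main logic while the GPU task runs in the background.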
Container support
GPU-accelerated instances of Function Compute can be used only in Custom Container runtimes. For more information about Custom Container runtimes, see Overview.
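As a concrete reference point, the following is a minimal sketch of an event handler for a Custom Container runtime, assuming the documented convention that Function Compute forwards each invocation as an HTTP POST to /invoke on the port the container listens on (9000 by default). The run_inference function is a hypothetical stand-in for your model-loading and GPU inference code.

```python
# Minimal Custom Container handler sketch. The /invoke path and port 9000
# follow the Custom Container convention; run_inference is hypothetical.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_inference(event):
    # Placeholder for model loading and GPU inference logic.
    return {"input": event, "result": "ok"}


class InvokeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/invoke":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(run_inference(event)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


# To serve invocations inside the container, start the server:
#   HTTPServer(("0.0.0.0", 9000), InvokeHandler).serve_forever()
```

The container image only needs to start this server process; Function Compute handles request routing and scaling.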
GPU specifications
You can select a GPU type and configure GPU specifications for GPU-accelerated instances based on your business requirements. For example, you can configure vCPU capacity, GPU computing power, GPU memory, memory, and disk size based on your algorithm models. For more information about the specifications of GPU-accelerated instances, see Instance specifications.
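If you deploy with Serverless Devs, these specifications are set in the function properties of s.yaml. The field names below follow common versions of the Serverless Devs fc component and should be treated as assumptions; verify them against the component version you use.

```yaml
# s.yaml fragment (Serverless Devs fc component); field names are
# assumptions -- verify against your component version.
function:
  name: offline-inference
  runtime: custom-container
  instanceType: fc.gpu.tesla.1   # GPU instance type
  gpuMemorySize: 8192            # GPU memory, in MB
  memorySize: 16384              # memory, in MB
  cpu: 4                         # vCPU cores
  diskSize: 512                  # disk size, in MB
  timeout: 86400                 # running duration, up to 24 hours
```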
Deployment methods
You can deploy your models in Function Compute by using one of the following methods:
Use the Function Compute console. For more information, see Create a function in the Function Compute console.
Call SDKs. For more information, see List of operations by function.
Use Serverless Devs. For more information, see Serverless Devs commands.
For more deployment examples, see start-fc-gpu.
Asynchronous mode
Offline applications run for long periods of time, so their functions must be triggered asynchronously, with responses returned immediately after invocation. The plain asynchronous invocation mode does not carry execution status, so use the asynchronous task mode instead: the execution status of each request can be queried at any time, and requests that are still running can be canceled. For more information about asynchronous invocations and asynchronous tasks, see Asynchronous invocation and Asynchronous task management.
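To make a request trackable as an asynchronous task, the caller can assign its own task ID when submitting the invocation and later use that ID to query or stop the task. The following sketch assumes the X-Fc-Stateful-Async-Invocation-Id header described in the asynchronous task documentation; the URL layout is an assumption, and request signing is omitted.

```python
# Sketch: submit a trackable asynchronous task by assigning a task ID.
# The header names follow the asynchronous task docs; the URL layout is
# an assumption here, and signing is omitted.
import json
import uuid


def build_task_request(endpoint, service_name, function_name, event, task_id=None):
    """Return a caller-chosen task ID plus the URL, headers, and body."""
    # The caller keeps this ID to query the task status or cancel it later.
    task_id = task_id or uuid.uuid4().hex
    url = (f"{endpoint}/2021-04-06/services/{service_name}"
           f"/functions/{function_name}/invocations")
    headers = {
        "Content-Type": "application/json",
        "X-Fc-Invocation-Type": "Async",
        "X-Fc-Stateful-Async-Invocation-Id": task_id,
    }
    return task_id, url, headers, json.dumps(event)
```

Storing the returned task ID alongside your job records lets your application poll task status or cancel tasks that are no longer needed.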
Concurrent invocations
The maximum number of concurrent requests that a GPU function can process in a region is determined by the concurrency of each GPU-accelerated instance and the maximum number of physical GPUs that you can use.
Concurrency of GPU-accelerated instances
By default, the concurrency of a GPU-accelerated instance is 1: each instance processes only one request or offline GPU task at a time. You can change the concurrency of a GPU-accelerated instance in the Function Compute console or by using Serverless Devs. For more information, see Configure instance concurrency. Configure concurrency based on your business scenario; for compute-intensive offline GPU tasks, we recommend that you keep the default value of 1.
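If you manage the function with Serverless Devs, per-instance concurrency can be set in s.yaml. The field name below follows common versions of the fc component and is an assumption; verify it against the component version you use.

```yaml
# s.yaml fragment; the field name is an assumption -- verify against
# your Serverless Devs fc component version.
function:
  name: offline-inference
  instanceConcurrency: 1   # default; one request per GPU instance at a time
```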
Maximum number of physical GPUs that can be used
For more information about the maximum number of GPUs, see Limits for GPUs.
Running duration
GPU-accelerated instances of Function Compute support a running duration of up to 86,400 seconds (24 hours). You can use GPU-accelerated instances alongside the asynchronous task mode to run or terminate requests with ease in time-consuming scenarios such as AI inference, AI training, audio and video processing, and 3D reconstruction.