This topic provides answers to some commonly asked questions about GPU-accelerated instances.
What are the driver and CUDA versions of GPU-accelerated instances in Function Compute?
What do I do if "CUFFT_INTERNAL_ERROR" is reported during function execution?
What do I do if a CUDA GPG error occurs when I build an image?
Why do my provisioned GPU-accelerated instances fail to be allocated?
What do I do if a GPU image fails to be converted to an accelerated image?
Should a model be integrated into or separated from an image?
What do I do if the end-to-end latency of my function is high and fluctuates greatly?
What are the driver and CUDA versions of GPU-accelerated instances in Function Compute?
The following items list the versions of the main components of GPU-accelerated instances:
Driver versions: Drivers include kernel-mode drivers (KMDs) such as nvidia.ko and CUDA user-mode drivers (UMDs) such as libcuda.so. NVIDIA provides the drivers that are used by GPU-accelerated instances in Function Compute. The driver versions used by GPU-accelerated instances may change as a result of feature iteration, new GPU releases, bug fixes, and driver lifecycle expiration. We recommend that you do not add driver-related components to your image. For more information, see What do I do if the system fails to find the NVIDIA driver?
CUDA Toolkit versions: CUDA Toolkit includes various components, such as CUDA Runtime, cuDNN, and cuFFT. The CUDA Toolkit version is determined by the container image you use.
The GPU drivers and CUDA Toolkit, both released by NVIDIA, are related to each other. For more information, see NVIDIA CUDA Toolkit Release Notes.
The current driver version of GPU-accelerated instances in Function Compute is 570.133.20, and the version of the corresponding CUDA UMD is 12.8. For optimal compatibility, we recommend that you use CUDA Toolkit 11.8 or later, but not exceeding the version of the CUDA UMD.
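If you are unsure which CUDA Toolkit version your image uses, you can check it at run time. The following is a minimal sketch that assumes your image ships PyTorch built against CUDA; similar information is also reported by tools such as nvidia-smi.
import torch
# CUDA Toolkit (runtime) version that this PyTorch build was compiled against.
print("CUDA runtime version:", torch.version.cuda)
# Confirm that the GPU and the injected user-mode driver are visible.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))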
What do I do if "CUFFT_INTERNAL_ERROR" is reported during function execution?
The cuFFT library in CUDA 11.7 has forward compatibility issues. If you encounter this error with newer GPU models, we recommend that you upgrade to at least CUDA 11.8. For more information about GPU models, see Instance Specifications.
Take PyTorch as an example. After the upgrade, you can use the following code snippet for verification. If no errors are reported, the upgrade is successful.
import torch
out = torch.fft.rfft(torch.randn(1000).cuda())
What do I do if a CUDA GPG error occurs when I build an image?
The following GPG error is reported during an image building process:
W: GPG error: https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease' is not signed.
In this case, append the following command to the RUN rm command line in your Dockerfile and rebuild the image.
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
Why is the instance type of my GPU-accelerated instance g1?
The g1 instance type is equivalent to the fc.gpu.tesla.1 instance type. For more information, see Instance Specifications.
Why do my provisioned GPU-accelerated instances fail to be allocated?
The allocation of provisioned instances may fail due to the following reasons:
The startup of the provisioned instances times out.
Error code: FunctionNotStarted.
Error message: Function instance health check failed on port XXX in 120 seconds.
Solution: Check whether the application startup logic downloads models from the Internet or loads large models (over 10 GB). We recommend that you start the web server before you run the model loading logic, as shown in the sketch after the solutions below.
The maximum number of instances for the function level or region is reached.
Error code: ResourceThrottled.
Error message: Reserve resource exceeded limit.
Solution: By default, an Alibaba Cloud account is limited to 30 physical GPUs allocated per region. You can view the actual quota in the Quota Center console. If you require more physical GPUs, you can apply for a quota adjustment in the Quota Center console.
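For the first scenario, the following is a minimal sketch of the recommended startup order, assuming a Flask-based web server; the port, route, and load_model logic are illustrative and should be replaced with your own application logic.
import threading
from flask import Flask, jsonify

app = Flask(__name__)
model = None  # populated asynchronously after the server starts

def load_model():
    # Placeholder for your actual model-loading logic, for example
    # reading weights from an image layer or a mounted NAS path.
    global model
    model = object()

@app.route("/invoke", methods=["POST"])
def invoke():
    if model is None:
        # The server is already healthy, but the model is still loading.
        return jsonify({"status": "model loading"}), 503
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    # Load the model in the background so that the HTTP server can pass
    # the port health check within the 120-second window.
    threading.Thread(target=load_model, daemon=True).start()
    app.run(host="0.0.0.0", port=9000)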
What is the limit on the size of a GPU image?
The image size limit applies only to compressed images. You can check the size of a compressed image in the Container Registry console. You can also run the docker images command to query the size of an image before compression.
In most cases, an uncompressed image smaller than 20 GB can be deployed to Function Compute and will function as expected.
What do I do if a GPU image fails to be converted to an accelerated image?
The time required to convert an image increases with the image size, which may lead to a conversion failure. You can trigger the conversion of the GPU image again by editing and resaving the function configurations in the Function Compute console. You do not need to change any parameter values; saving the configuration again is sufficient.
Should a model be integrated into or separated from an image?
If your model files are large, undergo frequent iterations, or would cause the image to exceed the image size limit when packaged together with it, we recommend that you separate the model from the image. In such cases, you can store the model in a File Storage NAS (NAS) file system or an Object Storage Service (OSS) bucket.
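For example, if you mount a NAS file system to the function, the application can read the model from the mount path instead of from the image. The following is a minimal sketch; the mount path and file name are hypothetical.
import os

# Hypothetical NAS mount path configured for the function.
MODEL_PATH = os.path.join("/mnt/nas/models", "model.bin")

def load_model():
    # Reading the model from the NAS mount keeps large, frequently
    # updated weights out of the container image.
    with open(MODEL_PATH, "rb") as f:
        return f.read()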
How do I perform a model warm-up?
We recommend that you warm up a model in the /initialize method. Production traffic is directed to the instance only after the warm-up logic in the /initialize method is complete. For more information about the /initialize lifecycle hook, see Function instance lifecycle.
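The following is a minimal sketch of a warm-up flow in a Flask-based application; the route paths, port, and warm-up logic are illustrative and assume that your function is configured to use the /initialize lifecycle hook.
from flask import Flask, jsonify

app = Flask(__name__)
model = None

def load_and_warm_up_model():
    # Placeholder: load the model and run a dummy inference so that
    # caches and kernels are warm before production traffic arrives.
    global model
    model = object()

@app.route("/initialize", methods=["POST"])
def initialize():
    # Function Compute calls this hook before routing production
    # requests to the instance.
    load_and_warm_up_model()
    return jsonify({"status": "initialized"})

@app.route("/invoke", methods=["POST"])
def invoke():
    return jsonify({"result": "ok"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9000)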
What do I do if the [FunctionNotStarted] Function Instance health check failed on port xxx in 120 seconds error is reported when I start a GPU image?
Cause: The AI/GPU application takes too long to start. As a result, the health check of Function Compute fails. In most cases, starting AI/GPU applications is time-consuming due to lengthy model loading times, which can cause the web server startup to time out.
Solution:
Avoid dynamically loading the model over the Internet during application startup. We recommend that you place the model in an image or a NAS file system and load it from the nearest path.
Place model initialization in the /initialize method and prioritize completing the application startup. In other words, load the model after the web server has started.
Note: For more information about the lifecycle of a function instance, see Function instance lifecycle.
What do I do if the end-to-end latency of my function is high and fluctuates greatly?
Make sure that the state of image acceleration is Available in the environment information.
Check the type of the NAS file system. If your function needs to read data, such as a model, from a NAS file system, we recommend that you use a Performance NAS file system instead of a Capacity one to ensure optimal performance. For more information, see General-purpose NAS file systems.
What do I do if the system fails to find the NVIDIA driver?
This issue arises when you use the docker run --gpus all command to start a container and then build an application image with the docker commit method. The resulting image contains local NVIDIA driver information, which prevents the driver from being properly mounted after the image is deployed to Function Compute. As a result, the system cannot find the NVIDIA driver.
To solve the issue, we recommend that you use Dockerfile to build an application image. For more information, see Dockerfile.
Additionally, do not include driver-related components in your image, and avoid making your application dependent on specific driver versions. For example, do not package libcuda.so, which provides the CUDA Driver API, in your image, as this dynamic library is closely tied to the device's driver version. Including such libraries in your image may result in compatibility issues and unexpected application behavior if there is a version mismatch with the underlying system.
When you create a function instance, Function Compute proactively injects user-mode driver components into the container. These components are aligned with the driver version provided by Function Compute. This approach is consistent with GPU container virtualization technologies such as NVIDIA Container Runtime, where driver-specific tasks are delegated to the infrastructure provider, thereby maximizing the compatibility of GPU container images across different environments. The drivers used for Function Compute GPU-accelerated instances are supplied by NVIDIA. As features iterate, new GPU models are released, bugs are fixed, and driver lifecycles end, the driver version used by GPU-accelerated instances may change in the future.
If you are already using NVIDIA Container Runtime or other GPU container virtualization technologies, avoid creating images with the docker commit command. Images created this way may contain injected driver components. When running these images in Function Compute, mismatches between component versions and the platform can result in undefined behavior, such as application errors.
What do I do if GPU-accelerated instances fail to provision during on-demand invocation, and a "ResourceExhausted" or "ResourceThrottled" error is reported?
GPU resources are relatively scarce, so on-demand invocations may be affected by fluctuations in the resource pool, which can prevent instances from being provisioned in time. For more predictable resource availability, we recommend that you configure auto-scaling rules for your functions to reserve GPU resources in advance. For more information, see Configure auto scaling rules. For details on the billing of provisioned instances, see Billing overview.