This topic provides answers to some commonly asked questions about GPU-accelerated instances.
What are the driver and CUDA versions of GPU-accelerated instances in Function Compute?
What do I do if "CUFFT_INTERNAL_ERROR" is reported during function execution?
What do I do if a CUDA GPG error occurs when I build an image?
What do I do if a GPU image fails to be converted to an accelerated image?
Should a model be integrated into or separated from an image?
What do I do if the end-to-end latency of my function is high and fluctuates greatly?
What are the usage notes for GPU functions with provisioned instances?
What are the driver and CUDA versions of GPU-accelerated instances in Function Compute?
The following items list the versions of the main components of GPU-accelerated instances:
Driver versions: Drivers include kernel-mode drivers (KMDs) such as nvidia.ko and CUDA user-mode drivers (UMDs) such as libcuda.so. NVIDIA provides the drivers used by GPU-accelerated instances in Function Compute. The driver versions may change as a result of feature iteration, new GPU releases, bug fixes, and driver lifecycle expiration. We recommend that you do not add driver-related components to your image. For more information, see What do I do if the system fails to find the NVIDIA driver?
CUDA Toolkit versions: CUDA Toolkit includes various components, such as CUDA Runtime, cuDNN, and cuFFT. The CUDA Toolkit version is determined by the container image that you use.
The NVIDIA GPU drivers and CUDA Toolkit require version compatibility to function correctly. For more information, see CUDA Toolkit Release Notes.
The current KMD version of GPU-accelerated instances in Function Compute is 570.133.20, and the corresponding CUDA UMD version is 12.8. For optimal compatibility, we recommend that you use CUDA Toolkit version 11.8 or later, but not exceeding the version of the CUDA UMD.
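If your image is PyTorch-based, the following snippet is a quick way to verify inside the instance that the CUDA runtime bundled with your framework does not exceed the CUDA UMD version and that the GPU is visible. This is a minimal sketch that assumes PyTorch is installed; other frameworks expose similar version queries.
import torch

# CUDA runtime version that this PyTorch build was compiled against;
# it should not exceed the CUDA UMD version of the platform (currently 12.8).
print("CUDA toolkit/runtime version:", torch.version.cuda)

# Confirm that the GPU and the injected driver are visible to the framework.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))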
What do I do if "CUFFT_INTERNAL_ERROR" is reported during function execution?
The cuFFT library in CUDA 11.7 has forward compatibility issues. If you encounter this error, we recommend that you upgrade to CUDA 11.8 or later. For more information about GPU models, see Instance specifications.
Take PyTorch as an example. After the upgrade, you can use the following code snippet for verification. If no errors are reported, the upgrade is successful.
import torch
out = torch.fft.rfft(torch.randn(1000).cuda())
What do I do if a CUDA GPG error occurs when I build an image?
The following GPG error is reported during the image building process:
W: GPG error: https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease' is not signed.
In this case, append the following command to the RUN rm line of your Dockerfile and rebuild the image.
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
Why is the type of my GPU-accelerated instance g1?
The g1 instance type is equivalent to the fc.gpu.tesla.1 instance type. For more information, see Instance specifications.
Why do provisioned instances fail to start?
Provisioned instances may fail to start due to the following reasons:
The startup of the provisioned instances times out
Error code: "FunctionNotStarted"
Error message: "Function instance health check failed on port XXX in 120 seconds"
Solution: Check whether the application startup logic downloads models over the Internet or loads large models (larger than 10 GB). We recommend that you start the web server before you run the model-loading logic.
The maximum number of instances for the function or region is reached
Error code: "ResourceThrottled"
Error message: "Reserve resource exceeded limit"
Solution: By default, an Alibaba Cloud account is limited to 30 physical GPUs allocated per region. You can view the actual quota in the Quota Center. If you require more physical GPUs, you can apply for a quota adjustment in the Quota Center.
What do I do if elastic GPU instances fail to provision, and a "ResourceExhausted" or "ResourceThrottled" error is reported?
Because GPU resources are relatively scarce, fluctuations in the resource pool may prevent elastic GPU instances from being provisioned in time to meet invocation requests. For more predictable resource delivery, we recommend that you configure provisioned instances for your functions to reserve resources in advance. For details on the billing of elastic instances and provisioned instances, see Billing overview.
What is the limit on the size of a GPU image?
The image size limit applies only to compressed images. You can check the size of a compressed image in the Container Registry console. You can also run the docker images command to query the size of an image before compression.
In most cases, an uncompressed image smaller than 20 GB can be deployed to Function Compute and will function as expected.
What do I do if a GPU image fails to be converted to an accelerated image?
The time required to convert an image increases as the size of your image grows. This may lead to a conversion failure due to timeout. You can re-trigger the conversion of the GPU image by editing and re-saving the function configurations in the Function Compute console (without actually adjusting any parameters).
Should a model be integrated into or separated from an image?
If your model files are large, undergo frequent iterations, or would exceed the image size limit when published together with the image, we recommend that you separate the model from the image. In such cases, you can store the model in a NAS file system or an OSS file system. For more information, see Best practices for model storage in GPU-accelerated instances.
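For illustration, the sketch below loads a PyTorch model from a file system mounted to the function instead of from the image. The mount path /mnt/model, the MODEL_DIR environment variable, and the file name model.pt are assumptions for this example only; replace them with your own configuration.
import os
import torch

# Assumed mount path of the NAS or OSS file system configured for the function.
MODEL_DIR = os.environ.get("MODEL_DIR", "/mnt/model")

def load_model():
    # Load the model from the mounted file system instead of baking it into the image,
    # so the image stays small and model iterations do not require an image rebuild.
    model_path = os.path.join(MODEL_DIR, "model.pt")
    model = torch.load(model_path, map_location="cuda")
    model.eval()
    return model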
How do I perform a model warm-up?
We recommend that you warm up your model in the /initialize method. Production traffic is directed to the instance only after the /initialize method is complete. For more information, see Configure instance lifecycles.
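As a rough illustration, a warm-up routine for a PyTorch model might run a few dummy forward passes from the /initialize hook. This is a sketch only; the 1x3x224x224 input shape and the PyTorch stack are assumptions that you should adapt to your own model.
import torch

def warm_up(model, device="cuda"):
    # Run a few dummy forward passes so the CUDA context, kernels, and cuDNN
    # autotuning are initialized before production traffic arrives.
    model.eval().to(device)
    dummy_input = torch.randn(1, 3, 224, 224, device=device)  # assumed input shape
    with torch.no_grad():
        for _ in range(3):
            model(dummy_input)
    torch.cuda.synchronize()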
What do I do if "FunctionNotStarted" Function Instance health check failed on port xxx in 120 seconds is reported when I start a GPU image?
Cause: The AI/GPU application takes too long to start. As a result, the health check of Function Compute fails. In most cases, starting AI/GPU applications is time-consuming due to lengthy model loading times, which can cause the web server startup to time out.
Solutions:
Avoid dynamically downloading the model over the Internet during application startup. We recommend that you place the model in the image or in a NAS file system and load it from a local path.
Place model initialization in the /initialize method and prioritize completing the application startup. In other words, load the model after the web server has started, as shown in the sketch below.
Note: For more information about the lifecycle of a function instance, see Configure instance lifecycles.
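The following sketch shows this ordering with Flask (an assumption; any HTTP framework works): the web server starts right away so the health check passes, and the model is loaded in the /initialize handler that Function Compute invokes before routing production traffic. The port, request paths, and model path are illustrative and must match your actual function configuration.
# Minimal sketch, assuming a Flask-based custom container.
from flask import Flask, jsonify
import torch

app = Flask(__name__)
model = None  # loaded lazily in /initialize instead of at import time

@app.route("/initialize", methods=["POST"])
def initialize():
    # Function Compute calls this hook before sending production traffic,
    # so the slow model load does not block the startup health check.
    global model
    model = torch.load("/mnt/model/model.pt", map_location="cuda")  # assumed mount path
    model.eval()
    return "OK"

@app.route("/invoke", methods=["POST"])
def invoke():
    # Placeholder inference; replace with real request parsing and model input.
    with torch.no_grad():
        output = model(torch.randn(1, 3, 224, 224, device="cuda"))
    return jsonify(shape=list(output.shape))

if __name__ == "__main__":
    # Start the web server immediately so the health check on the configured port succeeds.
    app.run(host="0.0.0.0", port=9000)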
What do I do if the end-to-end latency of my function is high and fluctuates greatly?
First, make sure that the state of image acceleration is Available in the environment information.
Check the type of the NAS file system. If your function needs to read data, such as a model, from a NAS file system, we strongly recommend that you use a Performance NAS file system instead of a Capacity one to ensure optimal performance. For more information, see General-purpose NAS file system.
What do I do if the system fails to find the NVIDIA driver?
This issue arises when you use the docker run --gpus all command to specify a container and then build an application image using the docker commit method. The built image contains local NVIDIA driver information, which prevents the driver from being properly mounted after the image is deployed to Function Compute. As a result, the system cannot find the NVIDIA driver.
To solve the issue, we recommend that you use Dockerfile to build an application image. For more information, see dockerfile.
Additionally, do not include driver-related components in your image, and avoid making your application dependent on specific driver versions. For example, do not package libcuda.so, which provides the CUDA Driver API, in your image, as this dynamic library is closely tied to the device's driver version. Including such libraries in your image may result in compatibility issues and unexpected application behavior if there is a version mismatch.
When you create a function instance, Function Compute proactively injects user-mode driver components into the container. These components align with the driver version provided by Function Compute. This approach is consistent with GPU container virtualization technologies such as NVIDIA Container Runtime, where driver-specific tasks are delegated to the infrastructure provider. This maximizes the compatibility of GPU container images across different environments. The drivers used for Function Compute GPU instances are supplied by NVIDIA. Due to ongoing feature iterations, new GPU models, bug fixes, and driver lifecycle changes, the driver version used by GPU instances may change in the future.
If you are already using NVIDIA Container Runtime or other GPU container virtualization technologies, avoid creating images with the docker commit command. Images created this way may contain injected driver components. When running these images in Function Compute, mismatches between component versions and the platform can result in undefined behavior, such as application errors.
What are the usage notes for GPU functions with provisioned instances?
CUDA version
We recommend that you use CUDA 12.2 or an earlier version.
Image permissions
We recommend that you run container images as the default root user.
Instance logon
You cannot log on to an idle GPU-accelerated instance because the GPUs are frozen.
Graceful instance rotation
Function Compute rotates idle GPU-accelerated instances based on the workload. To ensure service quality, we recommend that you add lifecycle hooks to function instances for model warm-up and pre-inference. This way, your inference service can be provided immediately after the launch of a new instance. For more information, see Model warm-up.
Model warm-up and pre-inference
To reduce the latency of the initial wake-up of an idle GPU-accelerated instance, we recommend that you use the initialize hook in your code to warm up or preload your model. For more information, see Model warm-up.
Provisioned instance configurations
When you turn on the Idle Mode switch, the existing provisioned GPU-accelerated instances of the function are gracefully shut down, and new provisioned instances are reallocated a short time after the old ones are released.
Built-in Metrics Server of inference frameworks
To improve the compatibility and performance of idle GPUs, we recommend that you disable the built-in Metrics Server of your inference frameworks, such as NVIDIA Triton Inference Server and TorchServe.