
Elastic GPU Service:AIACC-Training FAQ

Last Updated:Oct 07, 2023

This topic provides answers to frequently asked questions about Apsara AI Accelerator (AIACC)-Training.

What can I do if NCCL returns an unhandled error when I perform distributed training on multiple GPUs in a container?

Set the NCCL_DEBUG environment variable to INFO. If the following log information is generated, add --shm-size=1g --ulimit memlock=-1 to the nvidia-docker run command that starts the container:


hzh-perseus-5868d9dfdb-q664k:34486:37433 [3] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
hzh-perseus-5868d9dfdb-q664k:34486:37433 [3] NCCL INFO include/shm.h:41 -> 2

hzh-perseus-5868d9dfdb-q664k:34486:37433 [3] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-31b93889a892fca7-0-2-3 (size 4460544)
                        
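The two steps above can be sketched as follows, assuming the container is started with nvidia-docker (the image name is a placeholder):

```shell
# Turn on verbose NCCL logging so the shared-memory failure is visible
# in the training output.
export NCCL_DEBUG=INFO

# Then restart the container with a larger shared-memory segment and
# unlimited locked memory (replace the placeholder image name):
# nvidia-docker run --shm-size=1g --ulimit memlock=-1 <your-training-image>
```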

What can I do if Perseus fails to start and an Undefined symbols error related to the framework is returned?

Check whether the Perseus version is compatible with the framework version. The first part of the Perseus version number indicates the Perseus release, and the second part indicates the matching framework version, which must be the same as the framework version that you use.

What can I do if the libcuda.so.1: cannot open shared object file: No such file or directory link error is returned when I start Perseus?

Check whether the Compute Unified Device Architecture (CUDA) driver and CUDA SDK are installed.
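One quick way to check is to ask the dynamic linker whether it can find the driver library (a sketch; assumes a Linux environment with ldconfig available):

```shell
# libcuda.so.1 is installed by the NVIDIA driver, not by the CUDA toolkit.
# If the dynamic linker cannot find it, the driver is missing or not
# registered with the linker cache.
ldconfig -p 2>/dev/null | grep libcuda.so.1 || echo "libcuda.so.1 not found: install the NVIDIA driver"
```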

What can I do if an error similar to libcudart.so.X.Y: cannot open shared object file: No such file or... is returned when I start Perseus?

Check whether the installed CUDA version matches the CUDA version that your Perseus build requires.
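To compare the versions, list the CUDA runtime libraries on disk (the path below is the default toolkit location and may differ on your system):

```shell
# Show the installed CUDA runtime versions; the X.Y suffix must match
# the version named in the Perseus error message.
ls /usr/local/cuda*/lib64/libcudart.so.* 2>/dev/null || echo "no CUDA runtime found under /usr/local/cuda"
```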

What can I do if an error similar to unhandled system error is generated in the nccl_comm.cpp file in the container environment?

The error may be caused by an insufficient shared memory (SHM) size. Allocate more shared memory by increasing the value of the shm-size parameter when you start the container. For example, use --shm-size=1g --ulimit memlock=-1 in the startup command.
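You can confirm the current SHM allocation inside the container before restarting it with a larger value (a sketch; output varies by system):

```shell
# /dev/shm backs NCCL's shared-memory transport; the Docker default of
# 64 MB is often too small for multi-GPU training.
df -h /dev/shm 2>/dev/null || echo "/dev/shm not mounted"
```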

What can I do if Perseus returns the python: double free or corruption error?

Possible causes:

  • Different shapes or sizes are specified for the same input tensor name on different ranks. This usage is invalid: the AllReduce operation cannot be performed on tensors with different shapes.

  • Execution timing varies greatly across ranks. In this case, synchronize data once after each epoch is complete. For example, you can use kv._barrier(); mx.nd.waitall() in MXNet.

What can I do if Python fails to exit after the training is complete?

When training completes, Python exits and sends an exit signal to the Perseus backend. If Python fails to exit, the training process has not actually finished: data read processes started by the model code may unexpectedly stay alive, which prevents the main process from stopping. One solution is to call import sys; sys.exit(0) to explicitly exit the main process after all work is complete.
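A minimal demonstration of the explicit-exit call from the shell (python3 is assumed to be on the PATH; in practice the call goes at the end of your training script):

```shell
# sys.exit(0) raises SystemExit in the main process, giving the
# interpreter a clean, explicit exit point after training.
python3 -c 'import sys; sys.exit(0)' && echo "clean exit"
```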

The startup process is extremely slow, and a large number of threads are found in the system, including many OpenMP (OMP) threads. What can I do?

In most cases, we recommend that you allocate four or fewer OMP threads to a single GPU. You can set the OMP_NUM_THREADS environment variable to 4 or a lower value.
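For example, with one training process per GPU:

```shell
# Limit each training process to 4 OpenMP threads. Export this before
# the framework (and its OpenMP runtime) is initialized.
export OMP_NUM_THREADS=4
```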