
Elastic GPU Service:AIACC-Training FAQ

Last Updated:Oct 07, 2023

This topic provides answers to frequently asked questions about Apsara AI Accelerator (AIACC)-Training.

What can I do if NCCL returns an unhandled error when I perform distributed training on multiple GPUs in a container?

Set the NCCL_DEBUG environment variable to INFO. If the following log information is generated, add --shm-size=1g --ulimit memlock=-1 to the nvidia-docker run command that starts the container:


hzh-perseus-5868d9dfdb-q664k:34486:37433 [3] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
hzh-perseus-5868d9dfdb-q664k:34486:37433 [3] NCCL INFO include/shm.h:41 -> 2

hzh-perseus-5868d9dfdb-q664k:34486:37433 [3] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-31b93889a892fca7-0-2-3 (size 4460544)
                        
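The two steps above can be sketched as follows, assuming the container is started with nvidia-docker (the image name is a placeholder):

```shell
# Turn on verbose NCCL logging so the shared-memory failure is visible
# in the training output.
export NCCL_DEBUG=INFO

# Then restart the container with a larger shared-memory segment and
# unlimited locked memory (replace the placeholder image name):
# nvidia-docker run --shm-size=1g --ulimit memlock=-1 <your-training-image>
```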

What can I do if Perseus fails to start and an Undefined symbols error related to the framework is returned?

Check whether the Perseus version is compatible with the framework version. The first part of the Perseus version number indicates the Perseus release, and the second part indicates the matching framework version, which must be the same as the framework version that you use.

What can I do if the libcuda.so.1: cannot open shared object file: No such file or directory link error is returned when I start Perseus?

Check whether the Compute Unified Device Architecture (CUDA) driver and CUDA SDK are installed.
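One quick way to check is to ask the dynamic linker whether it can find the driver library (a sketch; assumes a Linux environment with ldconfig available):

```shell
# libcuda.so.1 is installed by the NVIDIA driver, not by the CUDA toolkit.
# If the dynamic linker cannot find it, the driver is missing or not
# registered with the linker cache.
ldconfig -p 2>/dev/null | grep libcuda.so.1 || echo "libcuda.so.1 not found: install the NVIDIA driver"
```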

What can I do if an error similar to libcudart.so.X.Y: cannot open shared object file: No such file or... is returned when I start Perseus?

Check whether the installed CUDA version matches the CUDA version that your Perseus build requires.
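To compare the versions, list the CUDA runtime libraries on disk (the path below is the default toolkit location and may differ on your system):

```shell
# Show the installed CUDA runtime versions; the X.Y suffix must match
# the version named in the Perseus error message.
ls /usr/local/cuda*/lib64/libcudart.so.* 2>/dev/null || echo "no CUDA runtime found under /usr/local/cuda"
```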

What can I do if an error similar to unhandled system error is generated in the nccl_comm.cpp file in the container environment?

The error may be caused by an insufficient shared memory (SHM) size. Allocate more shared memory by increasing the value of the shm-size parameter when you start the container. For example, use --shm-size=1g --ulimit memlock=-1 in the startup command.
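You can confirm the current SHM allocation inside the container before restarting it with a larger value (a sketch; output varies by system):

```shell
# /dev/shm backs NCCL's shared-memory transport; the Docker default of
# 64 MB is often too small for multi-GPU training.
df -h /dev/shm 2>/dev/null || echo "/dev/shm not mounted"
```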

What can I do if Perseus returns the python: double free or corruption error?

Possible causes:

  • Different shapes or sizes are specified for the same input tensor name on different ranks. This usage is invalid: the AllReduce operation cannot be performed on tensors with different shapes.

  • Execution timing varies greatly across ranks. In this case, synchronize data once after each epoch is complete. For example, you can use kv._barrier(); mx.nd.waitall() in MXNet.

What can I do if Python fails to exit after the training is complete?

When training completes, Python exits and sends an exit signal to the Perseus backend. If Python fails to exit, the training process has not actually finished: data read processes started by the model code may unexpectedly stay alive, which prevents the main process from stopping. One solution is to call import sys; sys.exit(0) to explicitly exit the main process after all work is complete.
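A minimal demonstration of the explicit-exit call from the shell (python3 is assumed to be on the PATH; in practice the call goes at the end of your training script):

```shell
# sys.exit(0) raises SystemExit in the main process, giving the
# interpreter a clean, explicit exit point after training.
python3 -c 'import sys; sys.exit(0)' && echo "clean exit"
```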

The startup process is extremely slow, and a large number of threads are found in the system, including many OpenMP (OMP) threads. What can I do?

In most cases, we recommend that you allocate four or fewer OMP threads to a single GPU. You can set the OMP_NUM_THREADS environment variable to 4 or a lower value.
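For example, with one training process per GPU:

```shell
# Limit each training process to 4 OpenMP threads. Export this before
# the framework (and its OpenMP runtime) is initialized.
export OMP_NUM_THREADS=4
```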