Learning about AIACC-AGSpeed | Install and Use AGSpeed

This article describes how to install and use AGSpeed.

AIACC-AGSpeed (AGSpeed) is designed to optimize the computing performance of PyTorch models on Alibaba Cloud GPU-accelerated compute-optimized instances. Compared with AIACC, AGSpeed delivers computing optimization that is transparent to your training code.

Prerequisites

An Alibaba Cloud GPU-accelerated instance that meets the following requirements is created:

The operating system is Alibaba Cloud Linux, CentOS 7.x or later, or Ubuntu 16.04 or later.

An NVIDIA driver and CUDA 10.0 or later are installed.
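To confirm that the driver and CUDA toolkit are in place before you continue, you can run the following checks. This is a minimal sketch: nvidia-smi reports the installed driver version, and nvcc is available only if the CUDA toolkit's bin directory is on your PATH.

# Check the NVIDIA driver version and the GPUs that are visible to the system.
nvidia-smi

# Check the installed CUDA toolkit version (requires the CUDA bin directory on PATH).
nvcc --version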

Supported Versions

AGSpeed supports specific combinations of Python, PyTorch, and CUDA versions. The following table lists the supported combinations and the corresponding wheel packages.

| Python | PyTorch | CUDA | Download link                           |
|--------|---------|------|-----------------------------------------|
| 3.7    | 1.12.0  | 11.3 | wheel package (torch1.12.0_cu113-cp37)  |
| 3.7    | 1.12.0  | 11.6 | wheel package (torch1.12.0_cu116-cp37)  |
| 3.7    | 1.12.1  | 11.3 | wheel package (torch1.12.1_cu113-cp37)  |
| 3.7    | 1.12.1  | 11.6 | wheel package (torch1.12.1_cu116-cp37)  |
| 3.8    | 1.12.0  | 11.3 | wheel package (torch1.12.0_cu113-cp38)  |
| 3.8    | 1.12.0  | 11.6 | wheel package (torch1.12.0_cu116-cp38)  |
| 3.8    | 1.12.1  | 11.3 | wheel package (torch1.12.1_cu113-cp38)  |
| 3.8    | 1.12.1  | 11.6 | wheel package (torch1.12.1_cu116-cp38)  |
| 3.9    | 1.12.0  | 11.3 | wheel package (torch1.12.0_cu113-cp39)  |
| 3.9    | 1.12.0  | 11.6 | wheel package (torch1.12.0_cu116-cp39)  |
| 3.9    | 1.12.1  | 11.3 | wheel package (torch1.12.1_cu113-cp39)  |
| 3.9    | 1.12.1  | 11.6 | wheel package (torch1.12.1_cu116-cp39)  |

Install AGSpeed

1.  Download the wheel package.

Select the wheel package that matches the Python, PyTorch, and CUDA versions installed on your machine. For more information, see the preceding table.

2.  Install AGSpeed.

Run the following pip install command to install the downloaded wheel package in your environment.

pip install ${WHEEL_NAME}  # Replace ${WHEEL_NAME} with the name of the wheel package that you downloaded.
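If you are not sure which wheel package matches your environment, the following commands print the installed Python, PyTorch, and CUDA versions. This sketch assumes that PyTorch is already installed in the target environment; torch.version.cuda reports the CUDA version that PyTorch was built with.

# Print the Python version.
python -c "import sys; print(sys.version.split()[0])"

# Print the PyTorch version and the CUDA version that PyTorch was built with.
python -c "import torch; print(torch.__version__, torch.version.cuda)"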

Use AGSpeed

We recommend that you wrap the model with agspeed.optimize() after all other preparations are complete and you are ready to run the training loop, for example, after the model has been placed on the device and wrapped with DistributedDataParallel (DDP).

1.  Import the AGSpeed module and wrap the model, as shown in the following code.

import agspeed  # Import AGSpeed to register the IR optimization passes and the optimized NvFuser in the PyTorch backend.
model = agspeed.optimize(model)  # Wrap the model. AGSpeed automatically captures the computational graph and optimizes it with the AGSpeed Backend Autotuner.

2.  If your model uses PyTorch automatic mixed precision (AMP), add the cache_enabled=False parameter to the autocast() context. The following code provides an example.

Note
This step applies only to models that use AMP. Skip this step for models that use FP32.

After TorchDynamo captures the computational graph, AGSpeed uses torch.jit.trace to convert the graph to TorchScript IR for backend optimization. Calling torch.jit.trace directly inside an autocast() context conflicts with autocast caching. Therefore, you must disable the cache by adding cache_enabled=False to the autocast() context. For more information, see the related PyTorch commit.

from torch.cuda.amp import autocast

# ...

# Add cache_enabled=False to the autocast() context.
with autocast(cache_enabled=False):
    loss = model(inputs)

scaler.scale(loss).backward()
scaler.step(optimizer)

# ...
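For reference, the following is a minimal sketch of a complete AMP training step with a gradient scaler. It assumes that model, optimizer, and dataloader are already defined; the GradScaler usage shown here is standard PyTorch AMP and is not specific to AGSpeed.

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, target in dataloader:
    optimizer.zero_grad()
    # Disable autocast caching so that the forward pass can be traced without conflicts.
    with autocast(cache_enabled=False):
        loss = model(inputs)
    scaler.scale(loss).backward()  # Scale the loss to avoid FP16 gradient underflow.
    scaler.step(optimizer)         # Unscale the gradients and update the parameters.
    scaler.update()                # Adjust the scale factor for the next step.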

3.  If you use PyTorch 1.12.x and the model to be trained contains the SiLU function, use the LD_PRELOAD environment variable to load the symbolic derivative of the SiLU function.

Note
This step applies only when you are using PyTorch 1.12.x and the model you want to train contains the SiLU function. Skip this step for other scenarios.

In PyTorch 1.12.x, the TorchScript backend does not contain a symbolic derivative for aten::silu. As a result, the aten::silu operation is not included in the differentiable computational graph and cannot be fused by the NvFuser backend. PyTorch does not allow symbolic derivatives to be added dynamically, so AGSpeed ships the symbolic derivative of aten::silu in a separate dynamic link library and registers it with the TorchScript backend when the library is preloaded. Before you start the training task, we recommend that you use the LD_PRELOAD environment variable to load this library.

a) Run the following command to view the installation path of AGSpeed.

python -c "import agspeed; print(agspeed.__path__[0])"

The command output is the installation path of AGSpeed.

b) Run the following command to check whether the libsymbolic_expand.so file is included in the preceding path.

ls -l ${your_agspeed_install_path} # Replace ${your_agspeed_install_path} with the AGSpeed installation path on your server. 

If the output lists the libsymbolic_expand.so file, the file is present in the installation path.

c) Run the following commands to set the LD_PRELOAD environment variable.

# Replace ${your_agspeed_install_path} with the AGSpeed installation path on your server. 
export LD_PRELOAD=${your_agspeed_install_path}/libsymbolic_expand.so
# Start Training...

During training, a log entry indicates that the symbolic derivative of aten::silu has been registered with the TorchScript backend.
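If you prefer not to export the variable for the whole shell session, you can also set it only for the training command. This is a standard shell pattern and is not specific to AGSpeed.

# Replace ${your_agspeed_install_path} with the AGSpeed installation path on your server.
LD_PRELOAD=${your_agspeed_install_path}/libsymbolic_expand.so python train.py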

Sample Code

The following section provides an example of how to integrate AGSpeed into your training code. In the example, a plus sign (+) at the beginning of a line indicates a newly added line.

+ import agspeed

  # Define dataloader
  dataloader = ...

  # Define model object
  model = ResNet()

  # Move the model to the device
  model.to(device)

  # Define the optimizer
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

  # Wrap the model with DDP for distributed training
  if distributed:
      model = DDP(model)

+ model = agspeed.optimize(model)

  ############################## The following sections provide samples of the training loop when the model uses FP32 and AMP precision ##############################

    ############### FP32 ###############
    # If the model to be trained uses FP32 precision, you do not need to modify the training loop.
  for data, target in dataloader:
      loss = model(data)
      loss.backward()
      optimizer.step()
      optimizer.zero_grad()
    ############### FP32 ###############

    ############### AMP ###############
    # If the model to be trained uses AMP precision, add cache_enabled=False to the autocast context.
+ with autocast(cache_enabled=False):
      for data, target in dataloader:
          loss = model(data)
          scaler.scale(loss).backward()
          scaler.step(optimizer)
          scaler.update()
          optimizer.zero_grad()
    ############### AMP ###############

  ############################## Add the symbolic derivative of the SiLU function by using LD_PRELOAD ##############################

  # Query the AGSpeed installation path on your server.
  python -c "import agspeed; print(agspeed.__path__[0])"

  # Replace ${your_agspeed_install_path} with the AGSpeed installation path on your server.
+ export LD_PRELOAD=${your_agspeed_install_path}/libsymbolic_expand.so

  # Run the training command.
  python train.py
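If you enable the DDP branch in the sample above, the training script is usually launched with a distributed launcher rather than plain python. The following is a minimal sketch that assumes the sample is saved as train.py, the instance has 8 GPUs, and the script initializes the default process group; adjust --nproc_per_node to match your GPU count.

# Preload the SiLU symbolic derivative, then launch one process per GPU with torchrun.
export LD_PRELOAD=${your_agspeed_install_path}/libsymbolic_expand.so
torchrun --nproc_per_node=8 train.py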

Log Examples

The log examples help you check whether AGSpeed is enabled.

  • The log that indicates successful import of AGSpeed

When you import AGSpeed, the IR optimization passes and the optimized NvFuser are automatically registered. A log entry that confirms this registration indicates that AGSpeed is successfully imported, and you can proceed with the next step.

  • AGSpeed Autotuning Log

AGSpeed performs autotuning during the first few steps of the training process to automatically select the optimal optimization scheme for your training task. Autotuning entries in the training log indicate that AGSpeed is enabled.
