This topic describes how to develop an AI algorithm by using the cloud-native AI component set and the open source Fashion-MNIST dataset. The process includes model development, model training and optimization, model management, model evaluation, and model deployment.
Background information
The cloud-native AI component set includes components that can be independently deployed by using Helm charts. You can use these components to accelerate AI projects.
- Administrators manage users and permissions, allocate cluster resources, configure external storage, manage datasets, and monitor resource utilization by using dashboards.
- Developers use cluster resources and submit jobs. Developers are created by administrators and must be granted permissions before they can develop models by using tools such as the CLI, the web UI, or Jupyter Notebook.
Prerequisites
The following operations are completed by an administrator:
- A Container Service for Kubernetes (ACK) cluster is created. For more information, see Create an ACK managed cluster.
- The disk size of each node in the cluster is at least 300 GB.
- If you require optimal data acceleration, use four Elastic Compute Service (ECS) instances, each of which provides eight V100 GPUs.
- If you require optimal topology awareness, use two ECS instances, each of which provides two V100 GPUs.
- All components in the cloud-native AI component set are installed in the cluster. For more information, see Deploy the cloud-native AI component set.
- AI Dashboard is ready for use. For more information about how to configure AI Dashboard, see Access AI Dashboard.
- AI Developer Console is ready for use. For more information about how to configure AI Developer Console, see Access the AI development console.
- The Fashion-MNIST dataset is downloaded and uploaded to an Object Storage Service (OSS) bucket. For more information about how to upload objects to an OSS bucket, see Upload objects.
- The address, username, and password of the Git repository that stores the test code are obtained.
- A kubectl client is connected to the cluster. For more information, see Connect to ACK clusters by using kubectl.
- Arena is installed. For more information, see Install Arena.
Test environment
- Step 1: Create a user and allocate resources and Step 2: Prepare a dataset must be performed by the administrator.
- The remaining steps can be performed by developers.
You must create a terminal in Jupyter Notebook or use a jump server in the cluster to submit Arena commands. We recommend that you create a terminal in Jupyter Notebook.
Host name | IP | Role | Number of GPUs | Number of vCPUs | Memory |
---|---|---|---|---|---|
cn-beijing.192.168.0.13 | 192.168.0.13 | Jump server | 1 | 8 | 30580004 KiB |
cn-beijing.192.168.0.16 | 192.168.0.16 | Worker | 1 | 8 | 30580004 KiB |
cn-beijing.192.168.0.17 | 192.168.0.17 | Worker | 1 | 8 | 30580004 KiB |
cn-beijing.192.168.0.240 | 192.168.0.240 | Worker | 1 | 8 | 30580004 KiB |
cn-beijing.192.168.0.239 | 192.168.0.239 | Worker | 1 | 8 | 30580004 KiB |
Experiment objectives
- Manage datasets.
- Use Jupyter Notebook to set up the development environment.
- Submit standalone training jobs.
- Submit distributed training jobs.
- Use Fluid to accelerate training jobs.
- Use the cybernetes scheduler to accelerate training jobs.
- Manage models.
- Evaluate models.
- Deploy an inference service.
Step 1: Create a user and allocate resources
The administrator must create a user, allocate resources to the user, and then provide developers with the following information:
- The username and password of a user. For more information about how to create a user, see Manage users.
- Resource quotas. For more information about how to allocate resource quotas, see Manage elastic quota groups.
- The endpoint of AI Developer Console if developers want to submit jobs by using AI Developer Console. For more information about how to access AI Developer Console, see Access the AI development console.
- The kubeconfig file that is used to log on to the cluster if developers want to submit jobs by using Arena. For more information about how to obtain the kubeconfig file that is used to log on to a cluster, see Step 2: Select a type of cluster credentials.
Step 2: Prepare a dataset
The administrator must prepare a dataset. In this example, the Fashion-MNIST dataset is used.
a: Add the Fashion-MNIST dataset
b: Accelerate the dataset
The administrator must accelerate the dataset by using AI Dashboard.
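Behind the AI Dashboard operation, dataset acceleration is backed by Fluid. The following sketch shows the kind of objects that are provisioned; the dataset name, bucket path, endpoint, and Secret name are placeholders and must match your environment:

```yaml
# A Fluid Dataset that mounts the OSS path holding Fashion-MNIST.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: fashion-mnist
spec:
  mounts:
    - mountPoint: oss://<your-bucket>/fashion-mnist/
      name: fashion-mnist
      options:
        fs.oss.endpoint: oss-cn-beijing-internal.aliyuncs.com
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: fs.oss.accessKeySecret
---
# The cache runtime that serves the dataset from memory on two workers.
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: fashion-mnist
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi
```

Jobs that mount the resulting PersistentVolumeClaim read the dataset from the cache instead of from OSS directly.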
Step 3: Develop a model
- Use a custom image to create a Jupyter notebook (optional).
- Use the Jupyter notebook to develop and test a model.
- Use the Jupyter notebook to submit code to a Git repository.
- Use the Arena SDK to submit a training job.
a (optional): Use a custom image to create a Jupyter notebook
AI Developer Console provides various versions of images that support TensorFlow and PyTorch for you to create Jupyter notebooks. You can also use a custom image to meet your requirements.
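A custom image is typically built on top of a framework image that already bundles Jupyter. The following Dockerfile is a minimal sketch; the base image tag and the extra packages are examples, not requirements:

```dockerfile
# Example only: pick a base image that matches your framework version.
FROM tensorflow/tensorflow:2.2.2-gpu-jupyter

# Add whatever extra Python packages your notebooks need.
RUN pip install --no-cache-dir pandas matplotlib

# Jupyter listens on port 8888 by default.
EXPOSE 8888
```

Push the built image to a registry that the cluster can pull from, then reference it when you create the notebook.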
b: Use the Jupyter notebook to develop and test a model
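Fashion-MNIST is distributed in the same IDX binary format as MNIST, so the raw files on the mounted dataset volume can be read with the standard library alone. The loader below is an illustrative sketch (the function name is ours, not part of any SDK):

```python
import struct

def load_idx_images(buf):
    """Parse an IDX3-ubyte image file (the raw Fashion-MNIST format).

    Returns (num_images, rows, cols, pixels), where pixels is one flat
    bytes object of length num_images * rows * cols.
    """
    # Big-endian header: magic number, image count, rows, columns.
    magic, num, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != 0x00000803:  # 2051, the IDX magic for unsigned-byte images
        raise ValueError("not an IDX3 image file")
    pixels = buf[16:16 + num * rows * cols]
    return num, rows, cols, pixels
```

In a notebook you would pass the contents of, for example, `train-images-idx3-ubyte` from the mounted dataset path, then reshape the pixel buffer with your framework of choice.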
c: Use the Jupyter notebook to submit code to a Git repository
After the notebook is created, you can use the notebook to submit code to a Git repository.
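The flow from the notebook terminal is the usual clone, commit, and push. The sketch below is self-contained: a local bare repository stands in for the remote Git server so no credentials are needed; in practice you clone your real repository URL and authenticate with the username and password from the prerequisites:

```shell
set -e

# Stand-in for the remote Git server (in the notebook, use your repo URL).
remote="$(mktemp -d)/origin.git"
git init --bare -q "$remote"

# Clone, add the training script, and commit.
work="$(mktemp -d)/tensorflow-fashion-mnist-sample"
git clone -q "$remote" "$work"
cd "$work"
git config user.email "dev@example.com"
git config user.name "dev"
echo 'print("train")' > train.py
git add train.py
git commit -q -m "Add training script"

# Push the current branch, whatever its default name is.
git push -q origin HEAD
```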
d: Use the Arena SDK to submit a training job
Step 4: Train a model
Perform the following steps to submit a standalone TensorFlow training job, a distributed TensorFlow training job, a Fluid-accelerated training job, and a cybernetes-accelerated training job.
Submit a standalone TensorFlow training job
After you develop a model by using the notebook and save the model, you can use Arena or AI Developer Console to submit a training job.
Method 1: Use Arena to submit a standalone TensorFlow training job
arena \
submit \
tfjob \
-n ns1 \
--name=fashion-mnist-arena \
--data=fashion-mnist-jackwg-pvc:/root/data/ \
--env=DATASET_PATH=/root/data/ \
--env=MODEL_PATH=/root/saved_model \
--env=MODEL_VERSION=1 \
--env=GIT_SYNC_USERNAME=<GIT_USERNAME> \
--env=GIT_SYNC_PASSWORD=<GIT_PASSWORD> \
--sync-mode=git \
--sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git \
--image="tensorflow/tensorflow:2.2.2-gpu" \
"python /root/code/tensorflow-fashion-mnist-sample/train.py --log_dir=/training_logs"
Method 2: Use AI Developer Console to submit a standalone TensorFlow training job
Submit a distributed TensorFlow training job
Method 1: Use Arena to submit a distributed TensorFlow training job
Method 2: Use AI Developer Console to submit a distributed TensorFlow training job
Submit a Fluid-accelerated training job
- The administrator accelerates the dataset on AI Dashboard.
- A developer uses Arena to submit a training job that uses the accelerated dataset.
- The developer uses Arena to query the time that is required to complete the training job.
Use cybernetes to accelerate a training job
ACK provides the cybernetes scheduler that is optimized for AI and big data computing. cybernetes supports gang scheduling, capacity scheduling, and topology-aware scheduling. In this example, a training job that has topology-aware GPU scheduling enabled is used.
To ensure high performance for AI workloads, cybernetes uses an optimal scheduling solution based on the topological information about heterogeneous resources on nodes. The information includes how GPUs communicate with each other by using NVLink and PCIe switches, and the non-uniform memory access (NUMA) topology of CPUs. For more information about topology-aware GPU scheduling, see Overview of topology-aware GPU scheduling. For more information about topology-aware CPU scheduling, see Topology-aware CPU scheduling.
Perform the following steps to submit a training job that has topology-aware GPU scheduling enabled and a training job that has topology-aware GPU scheduling disabled. Then, compare the time that is required to complete the jobs.
Training job | Processing time per GPU (ns) | Total GPU processing time (ns) | Duration (s) |
---|---|---|---|
Topology-aware GPU scheduling enabled | 56.4 | 225.50 | 44 |
Topology-aware GPU scheduling disabled | 251.7 | 1006.44 | 120 |
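For reference, topology-aware GPU scheduling is enabled by labeling the GPU nodes and passing a topology flag when the job is submitted. The commands below are a sketch based on the node label shown later in this step; verify the flag against your Arena and cluster versions, and note that the node name is elided as in the rest of this topic:

```shell
# Label the GPU nodes so the scheduler considers GPU topology.
kubectl label node cn-beijing.192.168.XX.XX0 ack.node.gpu.schedule=topology --overwrite

# Submit the training job with GPU topology awareness enabled.
arena submit tfjob \
  -n ns1 \
  --name=fashion-mnist-topo \
  --gpus=4 \
  --gputopology=true \
  --image="tensorflow/tensorflow:2.2.2-gpu" \
  "python /root/code/tensorflow-fashion-mnist-sample/train.py"
```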
After the test, run the following command to restore the default GPU scheduling policy on the test nodes:
kubectl label node cn-beijing.192.168.XX.XX0 ack.node.gpu.schedule=default --overwrite
Step 5: Manage the model
Step 6: Evaluate the model
- Use Arena to submit a training job that exports a checkpoint.
- Use Arena to submit an evaluation job.
- Use AI Developer Console to compare the evaluation results of different models.
Step 7: Deploy the model as a service
After a model is developed and evaluated, you can deploy the model as a service for your business. The following steps describe how to deploy the preceding model as an inference service named tf-serving. Arena supports various serving frameworks, such as Triton and Seldon. For more information, see Arena serve guide.
In this example, the model that is trained in Step 4: Train a model is used. The model is stored in the fashion-minist-demo PVC that is used in Step 2: Prepare a dataset. If you want to store the model in another type of storage, you must first create a PVC for that type of storage.
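The service can be deployed with `arena serve tensorflow`. The command below is a sketch: the model path inside the PVC is an assumption and must point at a TensorFlow SavedModel directory produced by the training job:

```shell
arena serve tensorflow \
  -n ns1 \
  --name=tf-serving \
  --model-name=fashion-mnist \
  --gpus=1 \
  --image=tensorflow/serving:latest \
  --data=fashion-minist-demo:/models \
  --model-path=/models/saved_model
```

After the service starts, query its status with `arena serve list -n ns1` before sending inference requests.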
FAQ
- How do I install commonly used software in the notebook console?
To install commonly used software in the notebook console, run the following command:
apt-get install <software-name>
- How do I resolve character set encoding errors?
Modify the /etc/locale file based on the following content and then reopen the terminal.
LC_CTYPE="da_DK.UTF-8"
LC_NUMERIC="da_DK.UTF-8"
LC_TIME="da_DK.UTF-8"
LC_COLLATE="da_DK.UTF-8"
LC_MONETARY="da_DK.UTF-8"
LC_MESSAGES="da_DK.UTF-8"
LC_PAPER="da_DK.UTF-8"
LC_NAME="da_DK.UTF-8"
LC_ADDRESS="da_DK.UTF-8"
LC_TELEPHONE="da_DK.UTF-8"
LC_MEASUREMENT="da_DK.UTF-8"
LC_IDENTIFICATION="da_DK.UTF-8"
LC_ALL=