JindoRuntime is the execution engine of JindoFS developed by the Alibaba Cloud E-MapReduce (EMR) team. JindoRuntime is based on C++ and provides dataset management and caching. JindoRuntime also supports Object Storage Service (OSS). Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime. This topic describes how to use Fluid to deploy ImageNet datasets from OSS to a Kubernetes cluster. It also describes how to use Arena to train ResNet-50 models based on the deployed datasets.
Prerequisites
- A Container Service for Kubernetes (ACK) Pro cluster is created and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.
- The cloud-native AI suite is installed and the ack-fluid component is deployed.
- If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
- If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the Container Service for Kubernetes (ACK) console and deploy the ack-fluid component.
- A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.
Arena 0.6.0 or later is installed. For more information, see Arena.
The code of Horovod 0.18.1 or later is obtained. For more information, see Horovod.
The code of TensorFlow benchmarks is obtained. For more information, see Benchmark.
Use Fluid to deploy ImageNet datasets from OSS to a Kubernetes cluster
If you do not want to use the datasets provided by Alibaba Cloud, you can download datasets from the official website of ImageNet. For more information, see images.
If you want to use the datasets provided by Alibaba Cloud, submit an issue to Fluid to request the datasets. For more information, see Fluid.
In this example, a training job that runs on four nodes provided by Alibaba Cloud with eight NVIDIA Tesla V100 GPUs is used to show how to use Fluid to deploy ImageNet datasets from OSS to a Kubernetes cluster.
Create a file named dataset.yaml and copy the following content to the file:
apiVersion: v1 kind: Secret metadata: name: mysecret stringData: fs.oss.accessKeyId: xxx fs.oss.accessKeySecret: xxx --- apiVersion: data.fluid.io/v1alpha1 kind: Dataset metadata: name: imagenet spec: mounts: - mountPoint: oss://<oss_bucket>/<bucket_dir> options: fs.oss.endpoint: <oss_endpoint> name: imagenet encryptOptions: - name: fs.oss.accessKeyId valueFrom: secretKeyRef: name: mysecret key: fs.oss.accessKeyId - name: fs.oss.accessKeySecret valueFrom: secretKeyRef: name: mysecret key: fs.oss.accessKeySecret The nodeAffinity settings ensure that the data is cached to nodes that are equipped with NVIDIA Tesla V100 GPUs. required: nodeSelectorTerms: - matchExpressions: - key: aliyun.accelerator/nvidia_name operator: In values: - Tesla-V100-SXM2-16GB
To ensure that the datasets are mounted to JindoFS from OSS, you must set mountPoint, fs.oss.accessKeyId, fs.oss.accessKeySecret, and fs.oss.endpoint to proper values. For more information about the parameters, see Parameter Description.
Create a file named runtime and copy the following content to the file:
apiVersion: data.fluid.io/v1alpha1 kind: JindoRuntime metadata: name: imagenet spec: replicas: 4 tieredstore: levels: - mediumtype: SSD path: /var/lib/docker/jindo quota: 180G high: "0.99" low: "0.8"
In this example, the training job runs on four nodes with eight NVIDIA Tesla V100 GPUs. Therefore,
spec.replicas
is set to 4.Run the following commands to create a dataset and a JindoRuntime:
kubectl create -f dataset.yaml kubectl create -f runtime.yaml
Run the following command to check whether the Dataset is deployed:
kubectl get dataset
Expected output:
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE imagenet 143.67GiB 0.00B 648.00GiB 0.0% Bound 5
Run the following command to check whether the JindoRuntime is deployed:
kubectl get jindoruntime
Expected output:
NAME MASTER PHASE WORKER PHASE FUSE PHASE AGE imagenet Ready Ready Ready 4m45s
Run the following command to check whether the persistent volume (PV) and persistent volume claim (PVC) are created:
kubectl get pv,pvc
Expected output:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE persistentvolume/imagenet 100Gi RWX Retain Bound default/imagenet 52m NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE persistentvolumeclaim/imagenet Bound imagenet 100Gi RWX 52m
The preceding outputs indicate that the ImageNet datasets are deployed to the Kubernetes cluster.
Use Arena to submit a deep learning job
Run the following command to submit a ResNet-50 model training job that runs on four nodes with eight NVIDIA Tesla V100 GPUs:
arena submit mpi \
--name horovod-resnet50 \
--gpus=8 \
--workers=4 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data imagenet:/data \
-e DATA_DIR=/data/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 4 8
Description of Arena parameters:
Parameter | Description |
--name | Specifies the name of the training job. |
--workers | Specifies the number of worker nodes on which the job runs. |
--gpus | Specifies the number of GPUs that are used by each worker node on which the job runs. |
--working-dir | Specifies the working directory. |
--data | Mounts the |
-e | Specifies the path to the dataset. |
launch-example.sh 4 8 | Executes the script to start the job. |
Check whether the job is started
Run the following command to check whether the job is started:
arena get horovod-resnet50
Expected output:
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 56m
NAME STATUS TRAINER AGE INSTANCE NODE
horovod-resnet50 RUNNING MPIJOB 56m horovod-resnet50-launcher-xs7h4 192.168.0.15
horovod-resnet50 RUNNING MPIJOB 56m horovod-resnet50-worker-0 192.168.0.15
horovod-resnet50 RUNNING MPIJOB 56m horovod-resnet50-worker-1 192.168.0.14
horovod-resnet50 RUNNING MPIJOB 56m horovod-resnet50-worker-2 192.168.0.12
horovod-resnet50 RUNNING MPIJOB 56m horovod-resnet50-worker-3 192.168.0.13
The output shows that the ResNet-50 model training job is started.
Check the log of Arena
Run the following command to check the log of Arena:
arena logs --tail 100 -f horovod-resnet50
Performance comparison
In this topic, Fluid is used to deploy ImageNet datasets from OSS to a Kubernetes cluster, Arena is used to run a ResNet-50 model training job based on the deployed datasets, and on-premises caching is enabled based on JindoRuntime of JindoFS. This method provides higher performance than using OSSFS and reduces the time consumption by 76%.
Clear the environment
If you no longer use data acceleration, clear the environment.
Run the following command to delete the dataset and JindoRuntime:
kubectl delete -f dataset imagenet
kubectl delete -f jindoruntime imagenet