Container Service for Kubernetes: Accelerate access to hostPath volumes

Last Updated: Aug 17, 2023

JindoRuntime is a Fluid runtime engine built on JindoFS, a C++-based file system developed by the Alibaba Cloud E-MapReduce (EMR) team. JindoFS provides dataset management and data caching for Fluid. JindoRuntime can cache data stored in Kubernetes hostPath volumes to accelerate data access. In hybrid cloud environments, you can use hostPath volumes to mount self-managed storage systems, and JindoRuntime can then accelerate access to these systems. This topic describes how to use JindoRuntime to accelerate access to hostPath volumes.

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster is created and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.
  • The cloud-native AI suite is installed and the ack-fluid component is deployed.
    • If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
    • If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.
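
After the ack-fluid component is deployed, you can optionally confirm that the Fluid components are running before you continue. The following check is a minimal sketch; it assumes that the components are deployed in the default fluid-system namespace:

    # Assumes ack-fluid components run in the default fluid-system namespace.
    kubectl get pods -n fluid-system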

Step 1: Prepare a mount point for the hostPath volume

JindoRuntime uses a distributed cache system to accelerate access to hostPath volumes. The host path to be mounted must exist on the nodes where the JindoFS master and workers run. To prepare the mount point, perform the following operations.

  1. Choose a subdirectory in the /mnt directory, such as /mnt/demo-remote-fs, as the mount point of the hostPath volume.
  2. Run the following commands to create the /mnt/demo-remote-fs directory on the cn-beijing.192.168.1.45 and cn-beijing.192.168.2.234 nodes, which host the JindoFS master and workers:

    # The preceding nodes are used as an example. Replace them with the actual node names. 
    ssh cn-beijing.192.168.1.45 "mkdir -p /mnt/demo-remote-fs"
    ssh cn-beijing.192.168.2.234 "mkdir -p /mnt/demo-remote-fs"
  3. Run the following commands to add the demo-remote-fs=true label to the cn-beijing.192.168.1.45 and cn-beijing.192.168.2.234 nodes. This label limits the nodes to which the JindoRuntime master and workers can be scheduled:

    kubectl label node cn-beijing.192.168.1.45 demo-remote-fs=true
    kubectl label node cn-beijing.192.168.2.234 demo-remote-fs=true
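
After the labels are added, you can verify that both nodes carry the label. The following command is a quick check that uses a standard label selector:

    kubectl get nodes -l demo-remote-fs=true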

Step 2: Create a Dataset object and a JindoRuntime object

  1. Create a file named dataset.yaml and add the following content to the file.

    The following dataset.yaml file contains two Fluid objects to be created: Dataset and JindoRuntime.

    • Dataset: the Dataset object, which is configured with the mount point prepared in Step 1.

    • JindoRuntime: the configuration of the JindoFS distributed cache system, including the number of worker pods and the maximum cache size that each worker can use.

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: hostpath-demo-dataset
    spec:
      mounts: 
        - mountPoint: local:///mnt/demo-remote-fs
          name: data
          path: /
      accessModes:
        - ReadOnlyMany
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: hostpath-demo-dataset
    spec:
      master:
         nodeSelector:
           demo-remote-fs: "true"
      worker:
        nodeSelector:
          demo-remote-fs: "true"
      fuse:
        nodeSelector:
          demo-remote-fs: "true"
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            quota: 10Gi
            high: "0.99"
            low: "0.99"

    The following list describes the parameters in the configuration file.

    • Dataset.spec.mounts[*].mountPoint: the data source to be mounted. To mount a hostPath volume as a data source, use the local://<path> format, where <path> is the host path to be mounted. The path must be an absolute path.

    • Dataset.spec.nodeAffinity: the constraint that limits the nodes to which the JindoRuntime master and workers can be scheduled. This parameter is equivalent to the Pod.Spec.Affinity.NodeAffinity parameter.

    • JindoRuntime.spec.replicas: the number of worker pods to be deployed for JindoFS. Modify the number based on your requirements.

    • JindoRuntime.spec.tieredstore.levels[*].mediumtype: the cache type. Valid values: HDD, SSD, and MEM. In AI training scenarios, we recommend that you use MEM. If you use MEM, set path to a memory file system, for example, a mount point to which a tmpfs file system is mounted.

    • JindoRuntime.spec.tieredstore.levels[*].path: the directory used by JindoFS workers to cache data. To accelerate data access, we recommend that you use /dev/shm or another path to which a memory file system is mounted.

    • JindoRuntime.spec.tieredstore.levels[*].quota: the maximum cache size that each worker can use. Modify the value based on your requirements.
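
    If node memory is limited, you can cache data on local disks instead of in memory. The following snippet is a minimal sketch of an alternative tiered store configuration. The cache directory /var/lib/jindofs-cache is a hypothetical path and must exist on an SSD disk of each node:

    tieredstore:
      levels:
        - mediumtype: SSD
          path: /var/lib/jindofs-cache # Hypothetical cache directory on a local SSD disk.
          quota: 50Gi
          high: "0.95"
          low: "0.7"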

  2. Run the following command to create the Dataset and JindoRuntime objects:

    kubectl create -f dataset.yaml
  3. Run the following command to check whether the dataset is deployed:

    kubectl get dataset hostpath-demo-dataset

    Expected output:

    Note

    JindoFS must pull its container image when it is launched for the first time. This process may take 2 to 3 minutes, depending on network conditions.

    NAME                    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hostpath-demo-dataset   1.98GiB          0.00B    20.00GiB         0.0%                Bound   3m54s

    If the Dataset object is in the Bound state, JindoFS is running in the cluster and application pods can access the data defined in the Dataset object.
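
    You can also check the status of the cache runtime itself. The following command queries the JindoRuntime object created in this step; the exact output columns depend on the Fluid version:

    kubectl get jindoruntime hostpath-demo-dataset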

(Optional) Step 3: Create a DataLoad object to prefetch data

First-time queries cannot hit the cache, so pods cannot access data efficiently until the cache is populated. Fluid allows you to create DataLoad objects that prefetch data to accelerate first-time queries.

  1. Create a file named dataload.yaml and copy the following content to the file:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: dataset-warmup
    spec:
      dataset:
        name: hostpath-demo-dataset
        namespace: default
      loadMetadata: true
      target:
        - path: /
          replicas: 1

    The following list describes the parameters of the DataLoad object.

    • spec.dataset.name: the name of the Dataset object to be prefetched.

    • spec.dataset.namespace: the namespace to which the Dataset object belongs. The namespace must be the same as the namespace of the DataLoad object.

    • spec.loadMetadata: specifies whether to synchronize metadata before prefetching. Set the value to true for JindoRuntime.

    • spec.target[*].path: the path or file to be prefetched. The path must be relative to the mount point specified in the Dataset object. For example, if the data source in the Dataset object is local:///mnt/demo-remote-fs and you set path to /test, the /mnt/demo-remote-fs/test directory on the host is prefetched.

    • spec.target[*].replicas: the number of worker pods that cache the prefetched path or file.
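
    If you need only part of the dataset, you can narrow the prefetch target. The following snippet is a hypothetical variant of the target parameter that prefetches only a /train subdirectory and caches it on two workers:

    target:
      - path: /train # Hypothetical subdirectory relative to the mount point.
        replicas: 2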

  2. Run the following command to create the DataLoad object:

    kubectl create -f dataload.yaml
  3. Run the following command to query the status of the DataLoad object:

    kubectl get dataload dataset-warmup

    Expected output:

    NAME             DATASET                 PHASE      AGE   DURATION
    dataset-warmup   hostpath-demo-dataset   Complete   62s   9s
  4. Run the following command to query the status of the Dataset object:

    kubectl get dataset

    Expected output:

    NAME                    UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hostpath-demo-dataset   1.98GiB          1.98GiB   20.00GiB         100.0%              Bound   7m24s

    After the prefetching process is complete, the size of the cached data (CACHED) equals the total size of the dataset, and the percentage of cached data (CACHED PERCENTAGE) is 100%. This indicates that the entire dataset is cached.

Step 4: Create containers to access the hostPath volume

  1. Create a file named pod.yaml and add the following content to the file. Set the claimName parameter in the file to the name of the Dataset object created in Step 2.

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          command:
          - "bash"
          - "-c"
          - "sleep inf"
          volumeMounts:
            - mountPath: /data
              name: data-vol
      volumes:
        - name: data-vol
          persistentVolumeClaim:
            claimName: hostpath-demo-dataset # Specify the name of the Dataset object.
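
    Fluid automatically creates a PersistentVolumeClaim (PVC) that has the same name as the Dataset object, which is why the claimName parameter references the dataset name. Before you create the pod, you can optionally confirm that the PVC exists. The following check assumes that the Dataset object resides in the default namespace:

    # The PVC name matches the Dataset name; assumes the default namespace.
    kubectl get pvc hostpath-demo-dataset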

  2. Run the following command to create a pod:

    kubectl create -f pod.yaml
  3. Run the following command to log on to the pod and access data from the pod:

    kubectl exec -it nginx -- bash

    Expected output:

    # A file named demo-file exists in the /data directory of the NGINX pod. The file is 2 GB in size. 
    $ ls -lh /data
    total 2.0G
    -rwxrwxr-x 1 root root 2.0G Jun  9 04:02 demo-file
    
    # Read the demo-file file and write it to /dev/null. The read takes 2.061 seconds. 
    $ time cat /data/demo-file > /dev/null
    real    0m2.061s
    user    0m0.015s
    sys     0m0.581s

    The entire dataset is cached in the JindoFS cache system. When queries hit the cache, data is served directly from the cache instead of being fetched from the remote storage system. This shortens the data transmission path and accelerates data access.
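
If you no longer need the resources created in this topic, you can clean them up. The following commands are an optional sketch that deletes the pod, the DataLoad object, and the Dataset and JindoRuntime objects:

    kubectl delete -f pod.yaml
    kubectl delete -f dataload.yaml
    kubectl delete -f dataset.yaml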