JindoRuntime is a Fluid runtime engine developed by Alibaba Cloud E-MapReduce (EMR) based on JindoFS. JindoRuntime can cache data stored in hostPath volumes of Kubernetes clusters to accelerate data access. JindoFS is developed based on C++ and provides dataset management and caching for Fluid. In hybrid cloud environments, you can use hostPath volumes to mount self-managed storage systems. This helps accelerate access to the self-managed storage systems. This topic describes how to use JindoRuntime to accelerate access to hostPath volumes.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster whose nodes do not run ContainerOS is created, and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.
Important: The ack-fluid component is currently not supported on ContainerOS.
The cloud-native AI suite is installed and the ack-fluid component is deployed. The version of the ack-fluid component must be later than 1.0.6.
Important: If you have installed open source Fluid, you must uninstall it before you can install the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have installed the cloud-native AI suite, log on to the ACK console and deploy ack-fluid from the Cloud-native AI Suite page.
Step 1: Prepare a mount point for the hostPath volume
JindoRuntime uses a distributed cache system to accelerate access to hostPath volumes. Therefore, you must create the host path on every node where the JindoFS master and workers run. To do this, perform the following operations.
Run the following command to create a subdirectory as the mount point of the hostPath volume in the /mnt directory:
    $ mkdir /mnt/demo-remote-fs
Run the following command to create the /mnt/demo-remote-fs directory on the cn-beijing.192.168.1.45 and cn-beijing.192.168.2.234 nodes:
    # The preceding nodes are used as an example. Replace them with the actual node names.
    ssh cn-beijing.192.168.1.45 "mkdir -p /mnt/demo-remote-fs"
    ssh cn-beijing.192.168.2.234 "mkdir -p /mnt/demo-remote-fs"
Run the following command to add labels to the cn-beijing.192.168.1.45 and cn-beijing.192.168.2.234 nodes. The demo-remote-fs=true label limits the nodes to which the master and workers of JindoRuntime can be scheduled.
    kubectl label node cn-beijing.192.168.1.45 demo-remote-fs=true
    kubectl label node cn-beijing.192.168.2.234 demo-remote-fs=true
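The node preparation in this step repeats the same two commands for every node. As a minimal sketch, the following shell function prints those commands for one node so that they can be reviewed before they are run. The function name node_prep_commands and the MOUNT_DIR and LABEL variables are illustrative and are not part of Fluid or kubectl.

```shell
#!/usr/bin/env bash
# Illustrative helper: print the Step 1 commands for a single node.
# MOUNT_DIR and LABEL mirror the example values used in this topic.
MOUNT_DIR="/mnt/demo-remote-fs"
LABEL="demo-remote-fs=true"

# node_prep_commands NODE
# Prints the commands instead of running them, so that the output can be
# inspected first and then executed manually or piped to a shell.
node_prep_commands() {
  local node="$1"
  printf 'ssh %s "mkdir -p %s"\n' "$node" "$MOUNT_DIR"
  printf 'kubectl label node %s %s\n' "$node" "$LABEL"
}
```

For example, node_prep_commands cn-beijing.192.168.1.45 prints the ssh and kubectl label commands for that node. After the labels are applied, kubectl get nodes -l demo-remote-fs=true lists the nodes that carry the label.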
Step 2: Create a Dataset object and a JindoRuntime object
Create a file named dataset.yaml and add the following content to the file.
The following dataset.yaml file contains two Fluid objects: a Dataset and a JindoRuntime.
Dataset: a Dataset object that is configured with the preceding mount point.
JindoRuntime: the configuration of the JindoFS distributed cache system, including the number of worker pods and the maximum cache size that each worker can use.
    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: hostpath-demo-dataset
    spec:
      mounts:
        - mountPoint: local:///mnt/demo-remote-fs
          name: data
          path: /
      accessModes:
        - ReadOnlyMany
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: hostpath-demo-dataset
    spec:
      master:
        nodeSelector:
          demo-remote-fs: "true"
      worker:
        nodeSelector:
          demo-remote-fs: "true"
      fuse:
        nodeSelector:
          demo-remote-fs: "true"
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: MEM
            volumeType: emptyDir
            path: /dev/shm
            quota: 10Gi
            high: "0.99"
            low: "0.99"
The following table describes the object parameters in the configuration file.
| Parameter | Description |
| --- | --- |
| mountPoint | The data source to be mounted. You can mount a hostPath volume as a data source in the local://<path> format. path indicates the host path to be mounted, which must be an absolute path. |
| nodeSelector | The constraint that limits the nodes to which the master component, worker component, and FUSE component of JindoRuntime can be scheduled. This parameter ensures that the pods of each component run only on nodes with prepared host directories. |
| replicas | The number of worker pods to be deployed for JindoFS. You can modify the number based on your requirements. |
| mediumtype | The cache type. Supported cache types are HDD, SSD, and MEM. For more information about the recommended configurations of mediumtype, see Policy 2: Select proper cache media. |
| volumeType | The volume type of the cache medium. Valid values: emptyDir and hostPath. Default value: hostPath. If you use memory or local system disks as the cache medium, we recommend that you use the emptyDir type to avoid residual cache data on the node and ensure node availability. If you use local data disks as the cache medium, you can use the hostPath type and configure path to specify the mount path of the data disk on the host. For more information about the recommended configurations of volumeType, see Policy 2: Select proper cache media. |
| path | The directory used by JindoFS workers to cache data. To accelerate data access, we recommend that you use /dev/shm or a path to which a memory file system is mounted. |
| quota | The maximum cache size that each worker can use. You can modify the value based on your requirements. |
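Two of the parameters above lend themselves to a quick sanity check before the file is applied. The following sketch assumes that the total cache capacity is the number of workers multiplied by the per-worker quota, which is consistent with the CACHE CAPACITY value of 20.00GiB that this topic reports for replicas: 2 and quota: 10Gi. It also checks that a mountPoint value uses the local:// scheme with an absolute path. Both function names are hypothetical helpers, not Fluid commands, and the quota parsing only handles whole numbers with a Gi suffix.

```shell
#!/usr/bin/env bash
# total_cache REPLICAS QUOTA
# Assumption: total capacity = replicas x per-worker quota.
# Only whole-number quotas with a Gi suffix (for example 10Gi) are handled.
total_cache() {
  local replicas="$1" quota="${2%Gi}"   # strip the Gi suffix: 10Gi -> 10
  echo "$(( replicas * quota ))Gi"
}

# is_valid_local_mount MOUNTPOINT
# Succeeds only for local://<path> where <path> starts with a slash,
# that is, where the host path is absolute.
is_valid_local_mount() {
  case "$1" in
    local:///*) return 0 ;;
    *)          return 1 ;;
  esac
}
```

With the values from dataset.yaml, total_cache 2 10Gi prints 20Gi, and is_valid_local_mount local:///mnt/demo-remote-fs succeeds. A value such as local://mnt/demo-remote-fs is rejected because the host path is not absolute.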
Run the following command to create a Dataset object and a JindoRuntime object:
    kubectl create -f dataset.yaml
Run the following command to check whether the dataset is deployed:
    kubectl get dataset hostpath-demo-dataset
Note: JindoFS needs to pull an image during the first launch. This may take 2 to 3 minutes depending on network conditions.
Expected output:
    NAME                    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hostpath-demo-dataset   1.98GiB          0.00B    20.00GiB         0.0%                Bound   3m54s
If the Dataset object is in the Bound state, JindoFS is running in the cluster and application pods can access the data defined in the Dataset object.
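Because the image pull can delay startup, a script may need to wait until the dataset is bound before it continues. The following polling loop is a minimal sketch. It assumes that the PHASE column shown above is read from the .status.phase field of the Dataset object. The function name wait_for_bound and its default timeout and interval are illustrative.

```shell
#!/usr/bin/env bash
# wait_for_bound NAME [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls the Dataset until its phase is Bound or the timeout elapses.
# Assumption: the phase is exposed at .status.phase.
wait_for_bound() {
  local name="$1" timeout="${2:-300}" interval="${3:-5}"
  local waited=0 phase=""
  while [ "$waited" -le "$timeout" ]; do
    # Ignore transient errors, for example when the object does not exist yet.
    phase="$(kubectl get dataset "$name" -o jsonpath='{.status.phase}' 2>/dev/null || true)"
    if [ "$phase" = "Bound" ]; then
      echo "Dataset $name is Bound"
      return 0
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "Timed out waiting for dataset $name (last phase: ${phase:-unknown})" >&2
  return 1
}
```

For example, wait_for_bound hostpath-demo-dataset 600 10 checks every 10 seconds for up to 10 minutes and returns a nonzero exit code if the dataset is still not bound.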
(Optional) Step 3: Create a DataLoad object to prefetch data
First-time queries may fail to hit the cache. Fluid allows you to create DataLoad objects to prefetch data to accelerate first-time queries.
Create a file named dataload.yaml and copy the following content to the file:
    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: dataset-warmup
    spec:
      dataset:
        name: hostpath-demo-dataset
        namespace: default
      loadMetadata: true
      target:
        - path: /
          replicas: 1
The following table describes the object parameters.
| Parameter | Description |
| --- | --- |
| dataset.name | The name of the Dataset object to be prefetched. |
| dataset.namespace | The namespace to which the Dataset object belongs. The namespace must be the same as the namespace of the DataLoad object. |
| loadMetadata | Specifies whether to synchronize the metadata before prefetching. Set the value to true for JindoRuntime. |
| target[*].path | The path or file to be prefetched. The path must be relative to the mount point specified in the Dataset object. For example, if the data source in the Dataset object is pvc://my-pvc/mydata and you set path to /test, the /mydata/test path in the file system used by my-pvc is prefetched. |
| target[*].replicas | The number of worker pods created to cache the prefetched path or file. |
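The relationship between the mount point and target[*].path can be sketched as a small path-joining function. The pvc:// case reproduces the example in the table. The local:// case is an assumption made by analogy for the hostPath mount point used in this topic. The function name resolve_target is hypothetical and is not part of Fluid.

```shell
#!/usr/bin/env bash
# resolve_target MOUNTPOINT TARGET_PATH
# Prints the path in the underlying file system that a DataLoad target
# refers to, treating TARGET_PATH as relative to the mount point.
resolve_target() {
  local mount_point="$1" target="$2" base
  case "$mount_point" in
    local://*)
      # local://<absolute host path>
      base="${mount_point#local://}"
      ;;
    pvc://*)
      # pvc://<pvc name>/<subpath>: drop the PVC name, keep the subpath.
      base="${mount_point#pvc://}"
      case "$base" in
        */*) base="/${base#*/}" ;;
        *)   base="/" ;;
      esac
      ;;
    *)
      echo "unsupported mount point: $mount_point" >&2
      return 1
      ;;
  esac
  base="${base%/}"       # drop a trailing slash
  target="${target#/}"   # interpret the target relative to the mount point
  if [ -z "$target" ]; then
    echo "${base:-/}"
  else
    echo "$base/$target"
  fi
}
```

For example, resolve_target pvc://my-pvc/mydata /test prints /mydata/test, and resolve_target local:///mnt/demo-remote-fs / prints /mnt/demo-remote-fs, which corresponds to prefetching the whole mount point as in dataload.yaml.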
Run the following command to create the DataLoad object:
    kubectl create -f dataload.yaml
Run the following command to query the status of the DataLoad object:
    kubectl get dataload dataset-warmup
Expected output:
    NAME             DATASET                 PHASE      AGE   DURATION
    dataset-warmup   hostpath-demo-dataset   Complete   62s   9s
Run the following command to query the status of the Dataset object:
    kubectl get dataset
Expected output:
    NAME                    UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    hostpath-demo-dataset   1.98GiB          1.98GiB   20.00GiB         100.0%              Bound   7m24s
After prefetching is complete, the size of the cached data (CACHED) equals the size of the dataset. This indicates that the entire dataset is cached and the percentage of data that is cached (CACHED PERCENTAGE) is 100%.
Step 4: Create containers to access the hostPath volume
Create a file named pod.yaml and add the following content to the file. Set the claimName parameter in the file to the name of the Dataset object created in Step 2.
    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      containers:
        - name: nginx
          image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
          command:
            - "bash"
            - "-c"
            - "sleep inf"
          volumeMounts:
            - mountPath: /data
              name: data-vol
      volumes:
        - name: data-vol
          persistentVolumeClaim:
            claimName: hostpath-demo-dataset # Specify the name of the Dataset object.
Run the following command to create a pod:
    kubectl create -f pod.yaml
Run the following command to log on to the pod and access data from the pod:
    $ kubectl exec -it nginx -- bash
Expected output:
    # A file named demo-file exists in the /data directory of the NGINX pod. The file is 2 GB in size.
    $ ls -lh /data
    total 2.0G
    -rwxrwxr-x 1 root root 2.0G Jun  9 04:02 demo-file
    # Read demo-file and discard its content by redirecting the output to /dev/null. The read takes 2.061 seconds.
    $ time cat /data/demo-file > /dev/null
    real    0m2.061s
    user    0m0.015s
    sys     0m0.581s
The entire dataset is cached in the JindoFS cache system. When queries hit the cache, data is retrieved directly from the cache instead of being fetched remotely from the file system. This shortens the data transmission path and accelerates data access.
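The timing above can be turned into an approximate read throughput by dividing the file size by the elapsed time. The following sketch uses awk for the floating-point division. The function name read_throughput is illustrative, and the result is only as precise as the rounded 2.0G size reported by ls. This topic does not report an uncached read time, so no speedup is calculated here.

```shell
#!/usr/bin/env bash
# read_throughput SIZE SECONDS
# Prints SIZE divided by SECONDS with two decimal places.
# The unit of the result is the unit of SIZE per second.
read_throughput() {
  awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", size / secs }'
}

# Values from the expected output: a 2.0 GB file read in 2.061 seconds.
read_throughput 2.0 2.061   # prints 0.97
```

With these numbers, the cached read reaches roughly 0.97 GB per second.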