Container Service for Kubernetes: Use JindoRuntime to accelerate access to PVs

Last Updated: Jan 30, 2024

JindoRuntime is a Fluid runtime engine developed by the Alibaba Cloud E-MapReduce (EMR) team based on JindoFS. JindoFS is implemented in C++ and provides dataset management and caching for Fluid. JindoRuntime can cache data stored in persistent volumes (PVs) of Kubernetes clusters to accelerate data access. The PVs can be backed by any self-managed file system, such as CephFS. This topic describes how to use JindoRuntime to accelerate access to PVs.

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster is created and the Kubernetes version of the cluster is 1.18 or later. For more information, see Create an ACK Pro cluster.
  • The cloud-native AI suite is installed and the ack-fluid component is deployed. The version of the ack-fluid component must be later than 1.0.6.

    Important

    If you have installed open source Fluid, you must uninstall Fluid before you can install the ack-fluid component.

    • If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.

    • If you have installed the cloud-native AI suite, log on to the ACK console and deploy ack-fluid from the Cloud-native AI Suite page.

  • A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.
  • A PV and a persistent volume claim (PVC) that use the specified file system are created.

    In Kubernetes clusters, different methods are used to create volumes for different file systems. To ensure the stability of the connection between a file system and a Kubernetes cluster, refer to the official documentation of the file system and complete the prerequisites.
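The ack-fluid version requirement above (later than 1.0.6) can be checked with a small version comparison. The following is a minimal sketch, not an official tool: the helper name version_gt and the value of installed are hypothetical, and the installed version must be read from the ACK console or your deployment tooling. The sketch assumes that sort supports the -V (version sort) option, as GNU coreutils does.

```shell
#!/usr/bin/env bash
# version_gt A B: succeeds when version A is strictly later than version B.
# Assumption: `sort -V` (version sort) is available, as in GNU coreutils.
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Hypothetical value: replace it with the ack-fluid version of your cluster.
installed="1.0.7"
required="1.0.6"

if version_gt "$installed" "$required"; then
  echo "ack-fluid $installed is later than $required"
else
  echo "ack-fluid $installed does not meet the requirement (later than $required)"
fi
```

Version sort is used instead of plain string comparison so that a version such as 1.0.10 is treated as later than 1.0.6.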

Step 1: Query the PV and PVC

Run the following command to query the PV and PVC:

kubectl get pvc,pv

Expected output:

NAME                                          STATUS   VOLUME                          CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/demo-pvc                Bound    demo-pv                         5Gi        RWX                           19h

NAME                                             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                           STORAGECLASS   REASON   AGE
persistentvolume/demo-pv                         30Gi       RWX            Retain           Bound    default/demo-pvc                                        19h

The PV named demo-pv has a capacity of 30 GiB and supports the ReadWriteMany (RWX) access mode. The PV is bound to the PVC named demo-pvc in the default namespace. Both the PV and the PVC are in the Bound state and can be used as expected.
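The Bound check described above can also be scripted. The following is a minimal sketch that parses the default table output of kubectl get pvc,pv; the helper name all_bound is hypothetical, and the sketch assumes the column layout shown in the expected output (STATUS is the second field of a PVC row and the fifth field of a PV row). The sample text is the output from this step, so the block runs without a cluster.

```shell
#!/usr/bin/env bash
# all_bound: reads `kubectl get pvc,pv` output on stdin and succeeds only if
# at least one volume row is present and every PVC and PV row is Bound.
all_bound() {
  awk '
    /^persistentvolumeclaim\// { seen = 1; if ($2 != "Bound") bad = 1 }
    /^persistentvolume\//      { seen = 1; if ($5 != "Bound") bad = 1 }
    END { if (bad || !seen) exit 1 }
  '
}

# Sample input copied from the expected output of this step.
sample_output=$(cat <<'EOF'
NAME                             STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/demo-pvc   Bound    demo-pv   5Gi        RWX                           19h

NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM              STORAGECLASS   REASON   AGE
persistentvolume/demo-pv   30Gi       RWX            Retain           Bound    default/demo-pvc                           19h
EOF
)

if printf '%s\n' "$sample_output" | all_bound; then
  echo "PV and PVC are Bound"
else
  echo "at least one volume is not Bound"
fi
```

Against a live cluster, the same helper can be used as kubectl get pvc,pv | all_bound. Parsing table output is fragile when columns are empty, so for automation it may be more robust to query a single field, for example kubectl get pvc demo-pvc -o jsonpath='{.status.phase}'.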

Step 2: Create a Fluid Dataset object and a JindoRuntime object

  1. Create a file named dataset.yaml and copy the following content to the file.

    The following configuration defines two Fluid resource objects: Dataset and JindoRuntime.

    • Dataset: specifies information about the PVC.

    • JindoRuntime: specifies the configuration of the JindoFS distributed cache system, including the number of workers and the maximum size of data that can be cached on each worker.

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: pv-demo-dataset
    spec:
      mounts:
        - mountPoint: pvc://demo-pvc
          name: data
          path: /
      accessModes:
        - ReadOnlyMany
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: pv-demo-dataset
    spec:
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            quota: 10Gi
            high: "0.9"
            low: "0.8"

    The following list describes the parameters.

    • mountPoint: the information about the data source to be mounted. When a PVC is specified as the data source, you can specify a path in the pvc://<pvc_name>/<path> format:

      • pvc_name: the name of the PVC. The PVC and the Dataset object must belong to the same namespace.

      • path: the subpath of the volume to be mounted. Make sure that the subpath exists. Otherwise, the volume fails to be mounted.

    • replicas: the number of workers for the JindoFS cache system. You can modify the number based on your requirements.

    • mediumtype: the cache type. Valid values: HDD, SSD, and MEM. In AI training scenarios, we recommend that you set the cache type to MEM. When MEM is specified, you need to mount the path specified in the path parameter as a memory file system. For example, you can specify a temporary path and mount the path as a tmpfs file system.

    • path: the path where the workers store the cached data. To ensure the optimal data access experience, we recommend that you use /dev/shm or a path that is mounted as a memory file system.

    • quota: the maximum size of data that can be cached on each worker. You can modify the size based on your business requirements.

  2. Run the following command to create a Dataset object and a JindoRuntime object:

    kubectl create -f dataset.yaml
  3. Run the following command to check whether the Dataset object is deployed:

    kubectl get dataset pv-demo-dataset

    Note

    The system needs to pull an image the first time you start the JindoFS cache system. Pulling the image may take 2 to 3 minutes, depending on network conditions.

    Expected output:

    NAME              UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    pv-demo-dataset   10.96GiB         0.00B    20.00GiB         0.0%                Bound   2m13s

    If the Dataset object is in the Bound state, the JindoFS cache system has been launched. Application pods can access the data defined in the Dataset object as expected.
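The CACHE CAPACITY value of 20.00GiB in the output above is consistent with the JindoRuntime configuration: replicas (2) multiplied by quota (10Gi). The following sketch reproduces that arithmetic so that you can size the cache before you change either parameter. It assumes that each worker contributes its full quota. The helper names to_mib and cache_capacity_mib are hypothetical, and only the Gi and Mi suffixes with integer values are handled.

```shell
#!/usr/bin/env bash
# to_mib QUANTITY: converts an integer quantity such as 10Gi or 512Mi to MiB.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} )) ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# cache_capacity_mib REPLICAS QUOTA: total cache capacity of all workers in MiB.
cache_capacity_mib() {
  local quota_mib
  quota_mib=$(to_mib "$2") || return 1
  echo $(( $1 * quota_mib ))
}

replicas=2          # spec.replicas in dataset.yaml
quota=10Gi          # spec.tieredstore.levels[0].quota in dataset.yaml
dataset_mib=11224   # 10.96 GiB from the UFS TOTAL SIZE column, rounded up

capacity_mib=$(cache_capacity_mib "$replicas" "$quota")
echo "total cache capacity: $(( capacity_mib / 1024 )) GiB"

if [ "$capacity_mib" -ge "$dataset_mib" ]; then
  echo "the dataset fits in the cache"
else
  echo "the dataset is larger than the cache; increase replicas or quota"
fi
```

If the dataset is larger than the total capacity, only part of the data can be cached at a time, so increasing replicas or quota is the usual adjustment.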

(Optional) Step 3: Create a DataLoad object to prefetch data

First-time queries cannot hit the cache. Fluid allows you to create DataLoad objects to prefetch data to accelerate first-time queries.

  1. Create a file named dataload.yaml and add the following content to the file:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: dataset-warmup
    spec:
      dataset:
        name: pv-demo-dataset
        namespace: default
      loadMetadata: true
      target:
        - path: /
          replicas: 1

    The following list describes the parameters.

    • spec.dataset.name: the name of the Dataset object to be prefetched.

    • spec.dataset.namespace: the namespace to which the Dataset object belongs. The namespace must be the same as the namespace of the DataLoad object.

    • spec.loadMetadata: specifies whether to synchronize the metadata before prefetching. Set the value to true for JindoRuntime.

    • spec.target[*].path: the path or file to be prefetched. The path must be a relative path of the mount point specified in the Dataset object. For example, if the data source in the Dataset object is pvc://my-pvc/mydata and you set path to /test, the /mydata/test path in the file system used by PVC my-pvc is prefetched.

    • spec.target[*].replicas: the number of workers created to cache the prefetched path or file.

  2. Run the following command to create the DataLoad object:

    kubectl create -f dataload.yaml
  3. Run the following command to query the status of the DataLoad object:

    kubectl get dataload dataset-warmup

    Expected output:

    NAME             DATASET           PHASE      AGE   DURATION
    dataset-warmup   pv-demo-dataset   Complete   62s   12s
  4. Run the following command to query the status of the Dataset object:

    kubectl get dataset

    Expected output:

    NAME              UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    pv-demo-dataset   10.96GiB         10.96GiB   20.00GiB         100.0%              Bound   3m13s

    After the prefetching process is complete, the size of the cached data (CACHED) equals the size of the dataset. This indicates that the entire dataset is cached and the percentage of data that is cached (CACHED PERCENTAGE) is 100%.
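The comparison of the Dataset status before and after prefetching can be scripted in the same way. The following is a minimal sketch that extracts the CACHED PERCENTAGE value from the default table output of kubectl get dataset. The helper name cached_percentage is hypothetical, and the sketch assumes that every column of the row is populated, as in the outputs shown in this topic. The two sample inputs are copied from Step 2 and Step 3.

```shell
#!/usr/bin/env bash
# cached_percentage NAME: reads `kubectl get dataset` output on stdin and
# prints the CACHED PERCENTAGE value (fifth field) of the named Dataset.
cached_percentage() {
  awk -v name="$1" '$1 == name { print $5 }'
}

# Sample inputs copied from the expected outputs in Step 2 and Step 3.
before_prefetch=$(cat <<'EOF'
NAME              UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
pv-demo-dataset   10.96GiB         0.00B    20.00GiB         0.0%                Bound   2m13s
EOF
)
after_prefetch=$(cat <<'EOF'
NAME              UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
pv-demo-dataset   10.96GiB         10.96GiB   20.00GiB         100.0%              Bound   3m13s
EOF
)

echo "before prefetch: $(printf '%s\n' "$before_prefetch" | cached_percentage pv-demo-dataset)"
echo "after prefetch:  $(printf '%s\n' "$after_prefetch" | cached_percentage pv-demo-dataset)"
```

Against a live cluster, the helper can be used as kubectl get dataset | cached_percentage pv-demo-dataset, for example to wait until the value reaches 100.0% before you start a training job.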

Step 4: Create application pods to access the data stored in the PV

  1. Create a file named pod.yaml and add the following content to the file. Set claimName to the name of the Dataset object created in Step 2. Fluid creates a PVC that has the same name as the Dataset object.

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          command:
          - "bash"
          - "-c"
          - "sleep inf"
          volumeMounts:
            - mountPath: /data
              name: data-vol
      volumes:
        - name: data-vol
          persistentVolumeClaim:
            claimName: pv-demo-dataset # Specify the name of the Dataset object.
  2. Run the following command to create the application pod:

    kubectl create -f pod.yaml
  3. Run the following command to open a shell in the pod:

    kubectl exec -it nginx -- bash

    Run the following commands in the pod to access the data. Expected output:

    # A file named demofile is stored in the /data path of the nginx pod. The file is about 11 GB in size.
    ls -lh /data
    total 11G
    -rw-r----- 1 root root 11G Jul 22  2022 demofile

    # Read demofile and discard the output by redirecting it to /dev/null. The read takes 11.004 seconds.
    time cat /data/demofile > /dev/null
    real    0m11.004s
    user    0m0.065s
    sys     0m3.089s

    If you prefetched the data in Step 3, the entire dataset is already stored in the JindoFS cache system. When a read hits the cache, the data is served from the cache instead of being fetched remotely from the file system. This shortens the data path and accelerates data access.
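The timing above can be converted into an average read throughput. The following is a rough sketch of that arithmetic, assuming that the file size equals the 10.96 GiB reported in the UFS TOTAL SIZE column and that the whole file was read within the 11.004 seconds reported by time. The helper name throughput_gib_s is hypothetical.

```shell
#!/usr/bin/env bash
# throughput_gib_s SIZE_GIB SECONDS: prints the average throughput in GiB/s,
# rounded to two decimal places.
throughput_gib_s() {
  awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", size / secs }'
}

size_gib=10.96    # UFS TOTAL SIZE of pv-demo-dataset
elapsed=11.004    # "real" time reported for the cat command

echo "average read throughput: $(throughput_gib_s "$size_gib" "$elapsed") GiB/s"
```

To estimate the benefit of the cache for your own workload, run the same read on a pod that mounts the original PVC directly and compare the two results. This topic does not include such a baseline measurement.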