When data-intensive workloads such as AI/ML training or big data analytics run on a registered Kubernetes cluster, they repeatedly fetch large datasets from Object Storage Service (OSS) over the network, which causes high latency and leaves compute idle. Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator that solves this by layering a distributed cache (powered by JindoRuntime) between your pods and OSS, so each file is fetched from OSS only once and subsequent reads are served from the local node cache. By managing and scheduling JindoRuntime, Fluid provides dataset observability, auto scaling, and portability. JindoRuntime is the execution engine of JindoFS, developed by the Alibaba Cloud E-MapReduce (EMR) team. It is implemented in C++ and provides dataset management and caching with support for OSS.
This topic shows you how to install Fluid in a registered cluster, mount an OSS bucket as a persistent volume claim (PVC), and verify that JindoFS caching reduces data access time from 62 seconds to 3 seconds.
How it works
You upload a source dataset to an OSS bucket. Fluid creates a Dataset custom resource (CR) that describes the OSS mount point, and a JindoRuntime CR that launches a JindoFS cluster on your registered cluster nodes. Applications access data through a PVC backed by JindoFS. After the first read, all subsequent reads are served from the local node cache instead of OSS.
Key components:
| Component | Role |
|---|---|
| Dataset CR | Describes the dataset location (OSS mount point) and credentials |
| JindoRuntime CR | Launches the JindoFS cluster (master, worker, fuse pods) that provides the cache layer |
| PVC | The volume your application mounts to read cached data |
Prerequisites
Before you begin, make sure you have:
- An external cluster registered with Container Service for Kubernetes (ACK). See Create a registered cluster.
- A kubectl client connected to the registered cluster. See Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
- OSS activated and a bucket created. See Activate OSS and Create a bucket.
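As an optional sanity check before you start, confirm that kubectl can reach the registered cluster. The command below is not part of the official procedure; any node listing confirms connectivity:

```bash
# List the nodes of the registered cluster; a successful response confirms the kubeconfig works.
kubectl get nodes -o wide
```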
Step 1: Install the ack-fluid add-on
Install ack-fluid using onectl (recommended for CLI users) or the ACK console (recommended if you prefer a guided UI).
Use onectl
- Install onectl on your on-premises machine. See Use onectl to manage registered clusters.
- Run the following command to install the ack-fluid add-on:

  ```bash
  onectl addon install ack-fluid --set pullImageByVPCNetwork=false
  ```

  Set `pullImageByVPCNetwork` to `true` to pull the component image through a virtual private cloud (VPC). The parameter is optional and defaults to `false`.

  Expected output:

  ```
  Addon ack-fluid, version **** installed.
  ```
Use the console
- Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.
- On the App Catalog tab, find and click ack-fluid.
- In the upper-right corner, click Deploy.
- In the Deploy panel, select your Cluster, keep the default values for Namespace and Release Name, and then click Next.
- Set Chart Version to the latest version, configure the component parameters, and then click OK.
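Whichever installation method you choose, you can optionally verify that the Fluid control-plane components are running before you continue. This sketch assumes the add-on is deployed into the default fluid-system namespace:

```bash
# The Fluid controller pods should be in the Running state before you create Dataset CRs.
kubectl get pods -n fluid-system
```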
Step 2: Prepare data
- Download a test dataset:

  ```bash
  wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
  ```

- Upload the dataset to your OSS bucket using ossutil (a sample command is shown after this list). See Install ossutil.
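The following is a minimal sketch of the upload, assuming ossutil is already configured with your AccessKey pair and endpoint. Replace <oss-bucket> and <bucket-dir> with your own bucket name and directory; they must match the mountPoint used in Step 4:

```bash
# Copy the downloaded archive into the OSS path that the Dataset CR will mount later.
ossutil cp spark-3.0.1-bin-hadoop2.7.tgz oss://<oss-bucket>/<bucket-dir>/
```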
Step 3: Add labels to nodes
Run the following command to add the demo-oss=true label to the nodes of the external Kubernetes cluster (repeat it for each node). The label constrains which nodes the JindoRuntime components (master, worker, and fuse) are scheduled on, matching the nodeSelector settings used in Step 4.

```bash
kubectl label node <node-name> demo-oss=true
```
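If you want to label every node in one pass and then confirm the result, the following sketch does that. It assumes that all nodes in the cluster should host cache components; adjust the selection if only some nodes have suitable disks:

```bash
# Label every node in the cluster, then list the nodes that carry the label.
kubectl get nodes --no-headers -o custom-columns=NAME:.metadata.name \
  | xargs -I {} kubectl label node {} demo-oss=true --overwrite
kubectl get nodes -l demo-oss=true
```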
Step 4: Create a Dataset CR and a JindoRuntime CR
- Create a file named `mySecret.yaml` with the following content:

  ```yaml
  apiVersion: v1
  kind: Secret
  metadata:
    name: mysecret
  stringData:
    fs.oss.accessKeyId: <your-access-key-id>
    fs.oss.accessKeySecret: <your-access-key-secret>
  ```

  The Secret stores the AccessKey ID and AccessKey secret used to access OSS. Create it before you create the Dataset CR, so that the credentials are referenced from the Secret instead of appearing in plaintext in the Dataset CR.
- Deploy the Secret:

  ```bash
  kubectl create -f mySecret.yaml
  ```
- Create a file named `resource.yaml` with the following content. The file defines two resources:

  - Dataset: describes the dataset stored in the OSS bucket and its underlying file system (UFS).
  - JindoRuntime: launches a JindoFS cluster to provide caching services.

  ```yaml
  apiVersion: data.fluid.io/v1alpha1
  kind: Dataset
  metadata:
    name: hadoop
  spec:
    mounts:
      - mountPoint: oss://<oss-bucket>/<bucket-dir>   # UFS path to mount; omit the endpoint here
        options:
          fs.oss.endpoint: <oss-endpoint>             # Public or private endpoint of the OSS bucket
        name: hadoop
        path: "/"
        encryptOptions:
          - name: fs.oss.accessKeyId
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: fs.oss.accessKeyId
          - name: fs.oss.accessKeySecret
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: fs.oss.accessKeySecret
  ---
  apiVersion: data.fluid.io/v1alpha1
  kind: JindoRuntime
  metadata:
    name: hadoop
  spec:
    # Pin all JindoRuntime components to the labeled nodes in the external cluster
    master:
      nodeSelector:
        demo-oss: "true"
    worker:
      nodeSelector:
        demo-oss: "true"
    fuse:
      nodeSelector:
        demo-oss: "true"
    replicas: 2                  # Number of workers in the JindoFS cluster
    tieredstore:
      levels:
        - mediumtype: HDD        # Cache type: HDD, SSD, or MEM
          path: /mnt/disk1       # Cache path; only one path is allowed (if MEM, also used for logs)
          quota: 100G            # Maximum cache size
          high: "0.99"           # Upper storage watermark: start eviction above this ratio
          low: "0.8"             # Lower storage watermark: stop eviction below this ratio
  ```
- Create the Dataset CR and the JindoRuntime CR:

  ```bash
  kubectl create -f resource.yaml
  ```

- Verify the Dataset CR:

  ```bash
  kubectl get dataset hadoop
  ```

  Expected output:

  ```
  NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
  hadoop   210MiB           0.00B    100.00GiB        0.0%                Bound   1h
  ```

- Verify the JindoRuntime CR:

  ```bash
  kubectl get jindoruntime hadoop
  ```

  Expected output:

  ```
  NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
  hadoop   Ready          Ready          Ready        4m45s
  ```

- Verify the PV and PVC:

  ```bash
  kubectl get pv,pvc
  ```

  Expected output:

  ```
  NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
  persistentvolume/hadoop   100Gi      RWX            Retain           Bound    default/hadoop                           52m

  NAME                            STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
  persistentvolumeclaim/hadoop    Bound    hadoop   100Gi      RWX                           52m
  ```

  The Dataset and JindoRuntime CRs are created and ready.
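As an optional check, you can confirm that the JindoFS components were scheduled onto the nodes labeled in Step 3. The grep pattern below assumes the default naming, in which the component pods include the dataset name hadoop:

```bash
# Show the JindoFS pods and the nodes they run on; they should all be on nodes labeled demo-oss=true.
kubectl get pods -o wide | grep hadoop
kubectl get nodes -l demo-oss=true
```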
Step 5: Verify the acceleration service
Create a containerized application that reads the same dataset twice and compare the read times to confirm that JindoFS caching is working.
- Create a file named `app.yaml`:

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: demo-app
  spec:
    containers:
      - name: demo
        image: fluidcloudnative/serving
        volumeMounts:
          - mountPath: /data
            name: hadoop
    volumes:
      - name: hadoop
        persistentVolumeClaim:
          claimName: hadoop
  ```
- Deploy the pod:

  ```bash
  kubectl create -f app.yaml
  ```
- Open a shell in the pod and check the size of the test file:

  ```bash
  kubectl exec -it demo-app -- bash
  du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
  ```

  Expected output:

  ```
  209.7M  /data/spark-3.0.1-bin-hadoop2.7.tgz
  ```
- Time the first read (still inside the pod shell):

  ```bash
  time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /test
  ```

  Expected output:

  ```
  real    1m2.374s
  user    0m0.000s
  sys     0m0.256s
  ```

  The first read takes about 62 seconds. This is expected: the data is not yet cached, so JindoFS fetches it from OSS over the network.
- Confirm that the data is now fully cached (run this from your client machine, not inside the pod):

  ```bash
  kubectl get dataset hadoop
  ```

  Expected output:

  ```
  NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
  hadoop   209.74MiB        209.74MiB   100.00GiB        100.0%              Bound   1h
  ```
- Delete and recreate the pod to eliminate the OS page cache, so that the second read reflects only JindoFS cache performance:

  ```bash
  kubectl delete -f app.yaml && kubectl create -f app.yaml
  ```
- Open a shell in the new pod and time the second read:

  ```bash
  kubectl exec -it demo-app -- bash
  time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /test
  ```

  Expected output:

  ```
  real    0m3.454s
  user    0m0.000s
  sys     0m0.268s
  ```

  The second read takes about 3 seconds, roughly one-eighteenth of the first read, because the data is served from the local JindoFS cache instead of OSS.
(Optional) Step 6: Clean up
Run the following commands to remove the JindoRuntime, application, and dataset when you no longer need the acceleration service.
- Delete the application and the JindoRuntime:

  ```bash
  kubectl delete -f app.yaml
  kubectl delete jindoruntime hadoop
  ```
- Delete the dataset:

  ```bash
  kubectl delete dataset hadoop
  ```
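If you no longer plan to run cache components on these nodes, you can also remove the label added in Step 3. This cleanup step is optional and not part of the original procedure:

```bash
# A trailing hyphen after the label key removes the label from the node.
kubectl label node <node-name> demo-oss-
```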