
Container Service for Kubernetes:Periodically update a dataset by running a DataLoad job

Last Updated:Dec 21, 2023

Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios. Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling cache runtimes. This topic uses JindoFS as an example to describe how to periodically update a dataset by running a DataLoad job.

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster that runs Kubernetes 1.18 or later is created. For more information, see Create an ACK managed cluster.

  • The cloud-native AI suite is installed and the ack-fluid component is deployed.

    Important
    • If you have already installed open source Fluid, uninstall Fluid and deploy the ack-fluid component.

    • Make sure that the ack-fluid version is 1.0.3.

    • If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.

    • If you have installed the cloud-native AI suite, log on to the ACK console and deploy ack-fluid from the Cloud-native AI Suite page.

  • A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.

Step 1: Upload data to OSS

  1. Run the following command to download the test data:

    wget https://archive.apache.org/dist/hbase/2.5.2/RELEASENOTES.md
  2. Install ossutil and create an Object Storage Service (OSS) bucket. For more information, see Install ossutil.

  3. Run the following command to upload the test data to the OSS bucket:

    ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md

Step 2: Create a dataset and a JindoRuntime

  1. Create a file named mySecret.yaml to store the AccessKey ID and AccessKey secret that are used to access OSS. The following YAML template provides an example:

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
    stringData:
      fs.oss.accessKeyId: ****** # Enter the AccessKey ID. 
      fs.oss.accessKeySecret: ****** # Enter the AccessKey secret. 

  2. Run the following command to create a Secret:

    kubectl create -f mySecret.yaml

    Expected output:

    secret/mysecret created
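
    As a side note, values under stringData are written in plain text, and the API server base64-encodes them into the Secret's data field. The following minimal sketch shows the equivalent encoding; the credential values are hypothetical placeholders, not real keys:

    ```python
    import base64

    # Plain-text values as they appear under stringData (placeholders, not real credentials).
    plain = {
        "fs.oss.accessKeyId": "exampleAccessKeyId",
        "fs.oss.accessKeySecret": "exampleAccessKeySecret",
    }

    # The API server stores each value base64-encoded under the data field.
    data = {key: base64.b64encode(value.encode()).decode() for key, value in plain.items()}

    # Decoding recovers the original plain-text value.
    assert base64.b64decode(data["fs.oss.accessKeyId"]).decode() == "exampleAccessKeyId"
    ```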
  3. Create a file named dataset.yaml. The file is used to create a dataset.


    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: demo
    spec:
      mounts:
        - mountPoint: oss://<bucket-name>/<path>
          options:
            fs.oss.endpoint: <oss-endpoint>
          name: hbase
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
      accessModes:
        - ReadOnlyMany
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: demo
    spec:
      replicas: 1
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            quota: 2Gi
            high: "0.99"
            low: "0.8"
      fuse:
        args:
          - -okernel_cache
          - -oro
          - -oattr_timeout=60
          - -oentry_timeout=60
          - -onegative_timeout=60

    The following list describes the parameters in the YAML template.

    Dataset

    • mountPoint: oss://<bucket-name>/<path> specifies the path to which the underlying file system (UFS) is mounted. You do not need to include the endpoint in the path.

    • fs.oss.endpoint: The public or private endpoint of the OSS bucket.

    • accessModes: The access mode of the dataset.

    JindoRuntime

    • replicas: The number of workers in the JindoFS cluster.

    • mediumtype: The cache type. You can select HDD, SSD, or MEM for JindoFS when you create the JindoRuntime template.

    • path: The storage path. You can specify only one path. If you set mediumtype to MEM, you must specify a local path to store data, such as logs.

    • quota: The maximum size of the cache, for example, 2Gi. You can configure the cache size based on the data size of the UFS.

    • high: The upper watermark of the storage usage.

    • low: The lower watermark of the storage usage.

    • fuse.args: The optional mount parameters of the FUSE client. You can specify the mount parameters based on the access mode of the dataset.

    • If you set accessModes to ReadOnlyMany, you can enable kernel_cache to optimize read performance by using the kernel cache. In this case, you can configure attr_timeout (the timeout period for which file attributes are cached), entry_timeout (the timeout period for which a name lookup is cached), and negative_timeout (the timeout period for which a negative lookup is cached). The default value for each parameter is 7200. Unit: seconds.

    • If you set accessModes to ReadWriteMany, we recommend that you use the default settings. In this case, the following settings are used:

      - -oauto_cache

      - -oattr_timeout=0

      - -oentry_timeout=0

      - -onegative_timeout=0

      Note

      Enable auto_cache to ensure that the cache expires when the file size or modification time changes. attr_timeout, entry_timeout, and negative_timeout are set to zero.
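
    Spelled out explicitly, the fuse section of a JindoRuntime for a ReadWriteMany dataset would look like the following sketch. These are the default settings listed above, so you normally do not need to set them yourself:

    ```yaml
    fuse:
      args:
        - -oauto_cache          # Expire the kernel cache when the file size or modification time changes.
        - -oattr_timeout=0      # Do not cache file attributes.
        - -oentry_timeout=0     # Do not cache name lookups.
        - -onegative_timeout=0  # Do not cache negative lookups.
    ```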

  4. Run the following command to deploy dataset.yaml to create a dataset and a JindoRuntime:

    kubectl create -f dataset.yaml

    Expected output:

    dataset.data.fluid.io/demo created
    jindoruntime.data.fluid.io/demo created
  5. Run the following command to check whether the dataset is deployed:

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        0.00B       10.00GiB         0.0%                Bound   2m7s

Step 3: Create a DataLoad job that runs periodically

  1. Create a file named dataload.yaml.


    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: cron-dataload
    spec:
      dataset:
        name: demo
        namespace: default
      policy: Cron
      schedule: "*/2 * * * *" # Run every 2 min

    The following list describes the parameters in the YAML template.

    • dataset: The name and namespace of the dataset on which the DataLoad job runs.

    • policy: The policy that defines how the job runs. Valid values: Once and Cron. In this example, policy is set to Cron.

      - Once: The job runs once.

      - Cron: The job runs periodically.

    • schedule: The schedule of the DataLoad job. The value of the .spec.schedule field follows the same syntax as standard cron expressions. For more information, see Schedule syntax.
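
    To illustrate the cron semantics, the schedule "*/2 * * * *" used above matches every minute that is divisible by 2. The following minimal matcher is a simplified sketch for illustration only; it handles only the "*", "*/n", and plain-number forms of a cron field:

    ```python
    def field_matches(expr: str, value: int) -> bool:
        """Check one cron field ('*', '*/n', or a plain number) against a value."""
        if expr == "*":
            return True
        if expr.startswith("*/"):
            return value % int(expr[2:]) == 0
        return int(expr) == value

    def cron_matches(schedule: str, minute: int, hour: int) -> bool:
        """Check the minute and hour fields of a five-field cron schedule."""
        minute_field, hour_field = schedule.split()[:2]
        return field_matches(minute_field, minute) and field_matches(hour_field, hour)

    # "*/2 * * * *" fires on every even minute of every hour.
    assert cron_matches("*/2 * * * *", minute=14, hour=9)
    assert not cron_matches("*/2 * * * *", minute=13, hour=9)
    ```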

    For more information about the advanced settings of DataLoad jobs, see the following configuration file:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: cron-dataload
    spec:
      dataset:
        name: demo
        namespace: default
      policy: Cron # The policy that defines how the job runs. Valid values: Once and Cron. 
      schedule: "* * * * *" # This field takes effect only if you set policy to Cron. 
      loadMetadata: true # Synchronize metadata before the DataLoad job runs. 
      target: # The targets on which the DataLoad job runs. You can specify multiple targets. 
        - path: <path1> # The path where the DataLoad job runs. 
          replicas: 1 # The number of cached replicas. 
        - path: <path2>
          replicas: 2
  2. Run the following command to deploy dataload.yaml to create a DataLoad job:

    kubectl apply -f dataload.yaml

    Expected output:

    dataload.data.fluid.io/cron-dataload created
  3. Run the following command to query the status of the DataLoad job:

    kubectl get dataload

    In the following output, if the value in the PHASE column is Complete, the data is loaded and you can proceed to the next step.

    NAME            DATASET   PHASE      AGE   DURATION
    cron-dataload   demo      Complete   68s   8s
  4. Run the following command to query the status of the dataset:

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        588.90KiB   10.00GiB         100.0%              Bound   5m50s

    The output indicates that all data in OSS is loaded to the cache.

Step 4: Create an application pod to access data in OSS

  1. Create a file named app.yaml and use an application pod to access the RELEASENOTES.md file.


    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - mountPath: /data
              name: demo-vol
      volumes:
        - name: demo-vol
          persistentVolumeClaim:
            claimName: demo
  2. Run the following command to create an application pod:

    kubectl create -f app.yaml

    Expected output:

    pod/nginx created
  3. After the application pod is ready, run the following command to access the data in OSS:

    kubectl exec -it nginx -- ls -lh /data

    Expected output:

    total 589K
    -rwxrwxr-x 1 root root 589K Jul 31 04:20 RELEASENOTES.md
  4. Run the following command to append the string "hello, crondataload." to the local copy of the RELEASENOTES.md file:

    echo "hello, crondataload." >> RELEASENOTES.md
  5. Run the following command to upload the RELEASENOTES.md file to OSS:

    ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md

    Enter y at the overwrite prompt. Expected output:

    cp: overwrite "oss://<bucket-name>/<path>/RELEASENOTES.md"(y or N)? y
    Succeed: Total num: 1, size: 21. OK num: 1(upload 1 files).                          
    
    average speed 0(byte/s)
    
    81.827978(s) elapsed
  6. Run the following command to query the status of the DataLoad job:

    kubectl describe dataload cron-dataload

    Expected output:

    ...
    Status:
      Conditions:
        Last Probe Time:       2023-08-24T06:44:08Z
        Last Transition Time:  2023-08-24T06:44:08Z
        Status:                True
        Type:                  Complete
      Duration:                8s
      Last Schedule Time:      2023-08-24T06:44:00Z # The most recent time when the DataLoad job started. 
      Last Successful Time:    2023-08-24T06:44:08Z # The most recent time when the DataLoad job was completed. 
      Phase:                   Complete
      ...
  7. Run the following command to query the status of the dataset:

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        1.15MiB     10.00GiB         100.0%              Bound   10m

    The output indicates that the updated file is loaded to the cache.
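
    Note that CACHED (1.15MiB) now exceeds UFS TOTAL SIZE (588.90KiB). A plausible reading, which is an assumption rather than something stated in the output, is that the cache briefly holds both the original and the updated copy of the file, so the cached bytes are roughly twice the UFS size:

    ```python
    KIB = 1024.0
    MIB = 1024.0 * KIB

    ufs_total_bytes = 588.90 * KIB  # UFS TOTAL SIZE from the output above
    cached_bytes = 1.15 * MIB       # CACHED from the output above

    # The cached data amounts to roughly two copies of the file.
    ratio = cached_bytes / ufs_total_bytes
    assert 1.9 < ratio < 2.1
    ```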

  8. Run the following command to view the updated file in the application pod:

    kubectl exec -it nginx -- tail /data/RELEASENOTES.md

    Expected output:

    hello, crondataload.

    The output indicates that the application pod can access the updated file.

(Optional) Step 5: Clear data

If you do not need to use the data acceleration feature, clear the related data.

Run the following commands to delete the application pod, the dataset, and the JindoRuntime:

kubectl delete -f app.yaml
kubectl delete -f dataset.yaml

Expected output:

pod "nginx" deleted
dataset.data.fluid.io "demo" deleted
jindoruntime.data.fluid.io "demo" deleted