Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator engine for data-intensive applications in cloud-native scenarios. Fluid enables dataset visibility, elastic scaling, and data migration by managing and scheduling underlying cache runtimes. This topic uses JindoFS as an example to demonstrate how to schedule data loading.
Prerequisites
You have created a Container Service for Kubernetes (ACK) managed cluster Pro Edition of version 1.18 or later. For more information, see Create an ACK managed cluster Pro Edition.
The cloud-native AI suite is installed, and the ack-fluid component is deployed.
Important: If you have installed open source Fluid, you must uninstall it before you deploy the ack-fluid component.
Make sure that the ack-fluid version is 1.0.3.
If you have not installed the cloud-native AI suite, you can enable Fluid during the installation. For more information, see Install the cloud-native AI suite.
If the cloud-native AI suite is installed, log on to the ACK console and deploy ack-fluid from the Cloud-native AI Suite page.
You have connected to a Kubernetes cluster using kubectl. For more information, see Connect to a cluster using the kubectl tool.
Step 1: Prepare data in an OSS bucket
You can run the following command to download the test data.
wget https://archive.apache.org/dist/hbase/2.5.2/RELEASENOTES.md
You can install ossutil and create a bucket. For more information, see Install ossutil.
You can run the following command to upload the test data to the OSS bucket:
ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md
Step 2: Create a Dataset and a JindoRuntime
You can create a mySecret.yaml file to store the accessKeyId and accessKeySecret for OSS access. The following YAML provides an example.
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: ****** # Enter your AccessKey ID.
  fs.oss.accessKeySecret: ****** # Enter your AccessKey secret.
You can run the following command to create the Secret.
kubectl create -f mySecret.yaml
Expected output:
secret/mysecret created
You can create a dataset.yaml file to create the Dataset.
You can run the following command to deploy dataset.yaml to create the JindoRuntime and Dataset.
kubectl create -f dataset.yaml
Expected output:
dataset.data.fluid.io/demo created
jindoruntime.data.fluid.io/demo created
You can run the following command to check the status of the Dataset.
kubectl get dataset
Expected output:
NAME   UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo   588.90KiB        0.00B    10.00GiB         0.0%                Bound   2m7s
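The contents of dataset.yaml are not shown in this step. The following is a minimal sketch of what such a file might contain, assuming the Fluid data.fluid.io/v1alpha1 API, the mysecret Secret created in this step, and a memory-backed cache sized to match the 10.00GiB cache capacity in the sample output. The OSS endpoint, the mount name, the mount path, and the tier settings are placeholders or assumptions, not values taken from this topic.

```shell
# Write a hypothetical dataset.yaml. Replace every <...> placeholder before use.
cat > dataset.yaml <<'EOF'
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    - mountPoint: oss://<bucket-name>/<path>   # the OSS location used in Step 1
      name: demo                               # assumption: mount name
      path: "/"                                # assumption: expose files at the dataset root
      options:
        fs.oss.endpoint: <oss-endpoint>        # placeholder: endpoint of your bucket's region
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: demo
spec:
  replicas: 1                 # assumption: a single cache worker
  tieredstore:
    levels:
      - mediumtype: MEM       # assumption: cache in memory
        path: /dev/shm
        quota: 10Gi           # assumption: matches the 10.00GiB sample capacity
        high: "0.99"
        low: "0.95"
EOF
```

The Dataset and the JindoRuntime share the name demo so that Fluid can bind them to each other; that matches the two resources reported in the expected output of kubectl create.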
Step 3: Create a scheduled DataLoad job
You can create a dataload.yaml file.
You can run the following command to deploy dataload.yaml to create the DataLoad job.
kubectl apply -f dataload.yaml
Expected output:
dataload.data.fluid.io/cron-dataload created
You can run the following command to check the status of the DataLoad job.
kubectl get dataload
When the PHASE is Complete, the data has been loaded. You can then proceed to the next step.
NAME            DATASET   PHASE      AGE   DURATION
cron-dataload   demo      Complete   68s   8s
You can run the following command to check the current status of the Dataset.
kubectl get dataset
Expected output:
NAME   UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo   588.90KiB        588.90KiB   10.00GiB         100.0%              Bound   5m50s
The output shows that all files from OSS are loaded into the cache.
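The contents of dataload.yaml are not shown in this step. The following is a minimal sketch of a scheduled DataLoad, assuming the Fluid data.fluid.io/v1alpha1 API with its policy and schedule fields. The default namespace and the two-minute cron expression are assumptions chosen for illustration; they are not values stated in this topic.

```shell
# Write a hypothetical dataload.yaml for a cron-style DataLoad.
cat > dataload.yaml <<'EOF'
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: cron-dataload
spec:
  dataset:
    name: demo
    namespace: default        # assumption: the Dataset lives in the default namespace
  policy: Cron                # run repeatedly on a schedule instead of once
  schedule: "*/2 * * * *"     # assumption: standard cron syntax, every 2 minutes
EOF
```

With a Cron policy, the same DataLoad object is re-run on each schedule tick, so changes uploaded to OSS after the first load can be picked up by a later run.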
Step 4: Create an application pod to access data in OSS
You can create an app.yaml file to use an application pod to access the RELEASENOTES.md file.
You can run the following command to create the application pod.
kubectl create -f app.yaml
Expected output:
pod/nginx created
After the application pod is ready, you can run the following command to view the data in OSS.
kubectl exec -it nginx -- ls -lh /data
Expected output:
total 589K
-rwxrwxr-x 1 root root 589K Jul 31 04:20 RELEASENOTES.md
You can run the following command to append the string "hello, crondataload." to the local copy of the RELEASENOTES.md file.
echo "hello, crondataload." >> RELEASENOTES.md
You can run the following command to re-upload the RELEASENOTES.md file to OSS.
ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md
Press y to confirm. The expected output is shown below:
cp: overwrite "oss://<bucket-name>/<path>/RELEASENOTES.md"(y or N)? y
Succeed: Total num: 1, size: 21. OK num: 1(upload 1 files).
average speed 0(byte/s)
81.827978(s) elapsed
You can run the following command to check the status of the DataLoad job.
kubectl describe dataload cron-dataload
Expected output:
...
Status:
  Conditions:
    Last Probe Time:       2023-08-24T06:44:08Z
    Last Transition Time:  2023-08-24T06:44:08Z
    Status:                True
    Type:                  Complete
  Duration:                8s
  Last Schedule Time:      2023-08-24T06:44:00Z # The time when the last DataLoad job was scheduled.
  Last Successful Time:    2023-08-24T06:44:08Z # The time when the last DataLoad job was completed.
  Phase:                   Complete
...
You can run the following command to check the current status of the Dataset.
kubectl get dataset
Expected output:
NAME   UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo   588.90KiB        1.15MiB   10.00GiB         100.0%              Bound   10m
The output shows that the updated file is loaded into the cache.
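When you check the Dataset repeatedly, it can help to pull out only the cached percentage. The following is a small sketch that parses the table printed by kubectl get dataset. The function name cached_percentage is hypothetical, and the sketch assumes the default column layout shown in this topic, in which each data row has seven whitespace-separated fields and CACHED PERCENTAGE is the fifth.

```shell
# Print "<name> <cached percentage>" for each Dataset row read from stdin.
# The header row is skipped; rows with fewer than seven fields are ignored,
# because the column positions would no longer be reliable.
cached_percentage() {
  awk 'NR > 1 && NF >= 7 { print $1, $5 }'
}

# Sample input copied from the expected output of "kubectl get dataset".
sample='NAME   UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo   588.90KiB        1.15MiB   10.00GiB         100.0%              Bound   10m'

printf '%s\n' "$sample" | cached_percentage   # prints: demo 100.0%
```

Against a live cluster, the same function can be fed directly: kubectl get dataset | cached_percentage.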
You can run the following command to view the updated file in the application pod.
kubectl exec -it nginx -- tail /data/RELEASENOTES.md
Expected output:
hello, crondataload.
The output shows that the application pod can access the updated file.
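Step 4 refers to an app.yaml file whose contents are not shown. The following is a minimal sketch of such a pod, assuming that Fluid exposes the bound Dataset as a PersistentVolumeClaim with the same name as the Dataset (demo) and that the data should appear under /data, as in the ls and tail commands of this step. The nginx image and the volume name demo-vol are assumptions.

```shell
# Write a hypothetical app.yaml that mounts the Dataset's PVC at /data.
cat > app.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx              # assumption: any image with ls and tail would do
      volumeMounts:
        - name: demo-vol        # assumption: arbitrary volume name
          mountPath: /data
  volumes:
    - name: demo-vol
      persistentVolumeClaim:
        claimName: demo         # assumption: PVC named after the Dataset
EOF
```

Because the pod reads through the PVC rather than calling OSS directly, the file it sees is the cached copy, which is why the appended line becomes visible only after a later DataLoad run has refreshed the cache.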
(Optional) Step 5: Clean up the environment
If you no longer need the data acceleration feature, you can clean up the environment.
You can run the following commands to delete the application pod, the Dataset, and the JindoRuntime.
kubectl delete -f app.yaml
kubectl delete -f dataset.yaml
Expected output:
pod "nginx" deleted
dataset.data.fluid.io "demo" deleted
jindoruntime.data.fluid.io "demo" deleted
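The commands above remove only the pod, the Dataset, and the JindoRuntime. If you also want to remove the DataLoad and the Secret created earlier, the cleanup could be scripted as in the following sketch. The cleanup function and the KUBECTL variable are hypothetical names introduced here, and whether the DataLoad and Secret should be deleted is an assumption about your environment, not a requirement stated in this topic.

```shell
# Command used to talk to the cluster; override it (for example with "echo")
# to preview the calls without touching any resources.
KUBECTL="${KUBECTL:-kubectl}"

# Delete the resources in roughly the reverse order of their creation.
# --ignore-not-found avoids an error when a resource is already gone;
# the manifest files themselves must still exist locally.
cleanup() {
  for manifest in app.yaml dataload.yaml dataset.yaml mySecret.yaml; do
    $KUBECTL delete -f "$manifest" --ignore-not-found
  done
}
```

To preview the commands, run KUBECTL=echo cleanup. To perform the deletion, run cleanup.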