When source data in OSS changes regularly, application pods that rely on JindoRuntime cache can serve stale data between cache refreshes. Use a scheduled DataLoad job to pull the latest data from OSS into the JindoRuntime cache automatically—without restarting application pods. This topic uses JindoFS as an example.
Prerequisites
Before you begin, ensure that you have:
An ACK managed cluster Pro Edition, version 1.18 or later. For more information, see Create an ACK managed cluster Pro Edition.
The cloud-native AI suite installed, with the ack-fluid component (version 1.0.3) deployed.
Important: If you have installed open-source Fluid, uninstall it before deploying ack-fluid. To install ack-fluid, see Install the cloud-native AI suite, or deploy it from the Cloud-native AI Suite page in the ACK console.
kubectl configured to connect to your cluster. For more information, see Connect to a cluster using the kubectl tool.
ossutil installed and an OSS bucket created. For more information, see Install ossutil.
Step 1: Prepare data in an OSS bucket
Download the test file.
wget https://archive.apache.org/dist/hbase/2.5.2/RELEASENOTES.md
Upload the file to your OSS bucket.
ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md
Step 2: Create a Dataset and a JindoRuntime
Create a file named mySecret.yaml with the following content to store your OSS credentials.
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: ****** # Enter your AccessKey ID.
  fs.oss.accessKeySecret: ****** # Enter your AccessKey secret.
Create the Secret.
kubectl create -f mySecret.yaml
Expected output:
secret/mysecret created
Create a file named dataset.yaml with the following content.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    - mountPoint: oss://<bucket-name>/<path>
      options:
        fs.oss.endpoint: <oss-endpoint>
      name: hbase
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: demo
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.99"
        low: "0.8"
  fuse:
    args:
      - -okernel_cache
      - -oro
      - -oattr_timeout=60
      - -oentry_timeout=60
      - -onegative_timeout=60
The following table describes the key parameters.
| Resource | Parameter | Description |
| --- | --- | --- |
| Dataset | mountPoint | The path to the underlying file system (UFS), in the format oss://<bucket>/<path>. The path does not include endpoint information. |
| Dataset | fs.oss.endpoint | The endpoint of the OSS bucket. Both public and private endpoints are supported. |
| Dataset | accessModes | The access mode of the Dataset. |
| JindoRuntime | replicas | The number of worker nodes in the JindoFS cluster. |
| JindoRuntime | mediumtype | The cache storage medium. Valid values: HDD, SSD, MEM. |
| JindoRuntime | path | The local storage path. If mediumtype is MEM, specify a local path for files such as logs. |
| JindoRuntime | quota | The maximum cache capacity. Set this based on the size of the data in the UFS. |
| JindoRuntime | high | The high watermark for cache storage capacity. |
| JindoRuntime | low | The low watermark for cache storage capacity. |
| JindoRuntime | fuse.args | Optional FUSE client mount parameters. The configuration depends on the Dataset access mode, as listed after this table. |
fuse.args configuration by access mode:
- ReadOnlyMany: Enable kernel_cache to use the kernel cache for read performance. Set attr_timeout (file attribute cache), entry_timeout (file name lookup cache), and negative_timeout (failed file name lookup cache). Default for all: 7200s.
- ReadWriteMany: Use the default configuration: -oauto_cache, -oattr_timeout=0, -oentry_timeout=0, -onegative_timeout=0. The auto_cache option invalidates the cache when the file size or modification time changes.
Deploy dataset.yaml to create the Dataset and JindoRuntime.
kubectl create -f dataset.yaml
Expected output:
dataset.data.fluid.io/demo created
jindoruntime.data.fluid.io/demo created
Verify that the Dataset is ready.
kubectl get dataset
Expected output:
NAME   UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo   588.90KiB        0.00B    10.00GiB         0.0%                Bound   2m7s
The PHASE field shows Bound, which means the Dataset is ready.
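In a script, the readiness check above can be automated by polling instead of rerunning the command by hand. The following is a minimal sketch, not part of the product tooling: the function name wait_for_dataset_bound is hypothetical, and it assumes that the PHASE column is backed by the .status.phase field of the Dataset resource.

```shell
# Poll a Dataset until it reports the Bound phase, or give up.
# Assumption: the PHASE column is read from .status.phase.
# Arguments: <dataset-name> [attempts, default 30] [delay in seconds, default 5]
wait_for_dataset_bound() {
  ds_name="$1"
  ds_attempts="${2:-30}"
  ds_delay="${3:-5}"
  ds_try=1
  while [ "$ds_try" -le "$ds_attempts" ]; do
    ds_phase="$(kubectl get dataset "$ds_name" -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$ds_phase" = "Bound" ]; then
      echo "Dataset $ds_name is Bound"
      return 0
    fi
    ds_try=$((ds_try + 1))
    sleep "$ds_delay"
  done
  echo "Dataset $ds_name is not Bound after $ds_attempts attempts" >&2
  return 1
}

# Usage: wait_for_dataset_bound demo 30 5
```

The function returns a nonzero status on timeout, so a calling script can stop before it creates the DataLoad job.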
Step 3: Create a scheduled DataLoad job
By default, a DataLoad job loads all data in the target Dataset. For fine-grained control, such as loading only a specific path or syncing metadata before loading, see Advanced DataLoad configurations.
DataLoad supports two execution policies:
Once: The job runs only once.
Cron: The job runs on a recurring schedule.
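The Cron policy is driven by a cron expression. Assuming the standard five-field format (minute, hour, day of month, month, day of week), the sketch below splits an expression into labeled fields so that a schedule such as the one used in this topic is easier to read. The describe_cron helper is hypothetical and only checks the field count, not the validity of each field.

```shell
# Split a five-field cron expression into labeled fields.
# Hypothetical helper: it validates the number of fields only.
describe_cron() {
  # read splits on whitespace without glob-expanding "*".
  read -r cron_min cron_hour cron_dom cron_mon cron_dow cron_extra <<EOF
$1
EOF
  if [ -z "$cron_dow" ] || [ -n "$cron_extra" ]; then
    echo "expected exactly 5 fields: $1" >&2
    return 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$cron_min" "$cron_hour" "$cron_dom" "$cron_mon" "$cron_dow"
}

describe_cron "*/2 * * * *"
# prints: minute=*/2 hour=* day-of-month=* month=* day-of-week=*
```

Here */2 in the minute field means every second minute, and the remaining * fields place no further restriction.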
Create a file named dataload.yaml with the following content.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: cron-dataload
spec:
  dataset:
    name: demo
    namespace: default
  policy: Cron
  schedule: "*/2 * * * *" # Run every 2 minutes.
The following table describes the parameters.
| Parameter | Description |
| --- | --- |
| dataset | The name and namespace of the Dataset to load. |
| policy | The execution policy. Set to Cron for a scheduled job. |
| schedule | The cron expression for the job schedule. For more information, see Cron schedule syntax. |
Deploy the DataLoad job.
kubectl apply -f dataload.yaml
Expected output:
dataload.data.fluid.io/cron-dataload created
Check the DataLoad job status.
kubectl get dataload
When PHASE shows Complete, the data has been loaded into the cache.
NAME            DATASET   PHASE      AGE   DURATION
cron-dataload   demo      Complete   68s   8s
Confirm the data is cached.
kubectl get dataset
Expected output:
NAME   UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo   588.90KiB        588.90KiB   10.00GiB         100.0%              Bound   5m50s
CACHED PERCENTAGE at 100% confirms that all data from OSS is loaded into the cache.
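To use the cache check in a script, the percentage can be extracted from the table output shown above. This is a sketch under one assumption: every column in the row is populated, so that with --no-headers the CACHED PERCENTAGE value is the fifth whitespace-separated field. The helper names cached_percentage and is_fully_cached are hypothetical.

```shell
# Print the CACHED PERCENTAGE value (without the % sign) for one Dataset.
# Reads `kubectl get dataset --no-headers` output on stdin.
# Assumption: all columns are populated, so the percentage is field 5.
cached_percentage() {
  awk -v name="$1" '$1 == name { sub(/%$/, "", $5); print $5 }'
}

# Succeed only when the Dataset row reports 100.0%.
is_fully_cached() {
  cache_pct="$(cached_percentage "$1")"
  [ "$cache_pct" = "100.0" ]
}

# Usage:
#   kubectl get dataset --no-headers | cached_percentage demo
#   kubectl get dataset --no-headers | is_fully_cached demo && echo "cache is warm"
```

If a column is empty, for example before the UFS size has been calculated, the field positions shift and the helper returns the wrong value, so treat the result as a convenience check only.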
Step 4: Access data from an application pod
Create a file named app.yaml with the following content.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: demo-vol
  volumes:
    - name: demo-vol
      persistentVolumeClaim:
        claimName: demo
Create the application pod.
kubectl create -f app.yaml
Expected output:
pod/nginx created
After the pod is ready, list the data in OSS.
kubectl exec -it nginx -- ls -lh /data
Expected output:
total 589K
-rwxrwxr-x 1 root root 589K Jul 31 04:20 RELEASENOTES.md
Append a line to your local copy of RELEASENOTES.md to simulate an update.
echo "hello, crondataload." >> RELEASENOTES.md
Re-upload the updated file to OSS.
ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md
When prompted, enter y to confirm. Expected output:
cp: overwrite "oss://<bucket-name>/<path>/RELEASENOTES.md"(y or N)? y
Succeed: Total num: 1, size: 21. OK num: 1(upload 1 files).
Wait for the next scheduled DataLoad run, then check the job status.
kubectl describe dataload cron-dataload
Expected output (relevant fields):
Status:
  Conditions:
    Last Probe Time:       2023-08-24T06:44:08Z
    Last Transition Time:  2023-08-24T06:44:08Z
    Status:                True
    Type:                  Complete
  Duration:                8s
  Last Schedule Time:      2023-08-24T06:44:00Z # The time when the last DataLoad job was scheduled.
  Last Successful Time:    2023-08-24T06:44:08Z # The time when the last DataLoad job was completed.
  Phase:                   Complete
Confirm that the updated file is cached.
kubectl get dataset
Expected output:
NAME   UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
demo   588.90KiB        1.15MiB   10.00GiB         100.0%              Bound   10m
The CACHED value increased, reflecting the updated file loaded into the cache.
Verify that the application pod can read the updated content.
kubectl exec -it nginx -- tail -n 1 /data/RELEASENOTES.md
Expected output:
hello, crondataload.
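Because the refresh happens on a schedule, the pod may keep returning the old content for a while after the upload. The sketch below polls the mounted file until its last line matches the expected text. It reuses the nginx pod and /data/RELEASENOTES.md path from this topic. The function name wait_for_cached_line and its default retry values are assumptions, not part of Fluid.

```shell
# Poll the file mounted in the nginx pod until its last line matches.
# Arguments: <expected-line> [attempts, default 10] [delay in seconds, default 30]
wait_for_cached_line() {
  expected_line="$1"
  line_attempts="${2:-10}"
  line_delay="${3:-30}"
  line_try=1
  while [ "$line_try" -le "$line_attempts" ]; do
    last_line="$(kubectl exec nginx -- tail -n 1 /data/RELEASENOTES.md 2>/dev/null)"
    if [ "$last_line" = "$expected_line" ]; then
      echo "update visible after $line_try check(s)"
      return 0
    fi
    line_try=$((line_try + 1))
    sleep "$line_delay"
  done
  echo "update not visible after $line_attempts checks" >&2
  return 1
}

# Usage: wait_for_cached_line "hello, crondataload." 10 30
```

With the two-minute schedule used in this topic, a delay of about 30 seconds between checks is a reasonable starting point. Adjust it to match your own schedule.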
Advanced DataLoad configurations
The following configurations let you control DataLoad behavior beyond the defaults.
Sync metadata before loading
When files in OSS have changed, JindoFS may serve stale data because its metadata view is out of sync. Set loadMetadata: true to sync metadata before the DataLoad job runs.
spec:
  ...
  loadMetadata: true
Load only a specific path
By default, DataLoad loads all data in the Dataset. To load only a subset, specify one or more target paths.
spec:
  ...
  target:
    - path: <path1>
      replicas: 1
    - path: <path2>
      replicas: 2
The replicas field under each target sets the number of cached replicas for that path.
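The same fields can also be used for a one-off warm-up of a single path. The following sketch writes a manifest that uses the Once policy described in Step 3. It assumes that target and loadMetadata behave the same way under Once as under Cron. The manifest name once-dataload, the file name dataload-once.yaml, and the path /logs are hypothetical placeholders.

```shell
# Write a one-off DataLoad manifest that warms up a single path.
# Hypothetical values: once-dataload, dataload-once.yaml, and /logs.
cat > dataload-once.yaml <<'EOF'
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: once-dataload
spec:
  dataset:
    name: demo
    namespace: default
  policy: Once
  loadMetadata: true
  target:
    - path: /logs
      replicas: 1
EOF

# Apply it with: kubectl apply -f dataload-once.yaml
```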
Combined advanced configuration
The following example shows all advanced fields together for reference.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: cron-dataload
spec:
  dataset:
    name: demo
    namespace: default
  policy: Cron
  schedule: "* * * * *"
  loadMetadata: true
  target:
    - path: <path1>
      replicas: 1
    - path: <path2>
      replicas: 2
(Optional) Clean up
If you no longer need the data acceleration setup, delete the application pod and the Dataset.
kubectl delete -f app.yaml
kubectl delete -f dataset.yaml
Expected output:
pod "nginx" deleted
dataset.data.fluid.io "demo" deleted
jindoruntime.data.fluid.io "demo" deleted