Container Service for Kubernetes: Schedule periodic dataset updates with DataLoad

Last Updated: Mar 26, 2026

When source data in OSS changes regularly, application pods that rely on the JindoRuntime cache can serve stale data between cache refreshes. A scheduled DataLoad job pulls the latest data from OSS into the JindoRuntime cache automatically, without restarting the application pods. This topic uses JindoFS as an example.

Prerequisites

Before you begin, ensure that you have:

Step 1: Prepare data in an OSS bucket

  1. Download the test file.

    wget https://archive.apache.org/dist/hbase/2.5.2/RELEASENOTES.md
  2. Upload the file to your OSS bucket.

    ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md

Step 2: Create a Dataset and a JindoRuntime

  1. Create a file named mySecret.yaml with the following content to store your OSS credentials.

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
    stringData:
      fs.oss.accessKeyId: ****** # Enter your AccessKey ID.
      fs.oss.accessKeySecret: ****** # Enter your AccessKey secret.
  2. Create the Secret.

    kubectl create -f mySecret.yaml

    Expected output:

    secret/mysecret created
  3. Create a file named dataset.yaml with the following content.

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: demo
    spec:
      mounts:
        - mountPoint: oss://<bucket-name>/<path>
          options:
            fs.oss.endpoint: <oss-endpoint>
          name: hbase
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
      accessModes:
        - ReadOnlyMany
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: demo
    spec:
      replicas: 1
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            quota: 2Gi
            high: "0.99"
            low: "0.8"
      fuse:
        args:
          - -okernel_cache
          - -oro
          - -oattr_timeout=60
          - -oentry_timeout=60
          - -onegative_timeout=60

    The following table describes the key parameters.

    | Resource | Parameter | Description |
    | --- | --- | --- |
    | Dataset | mountPoint | The path to the underlying file system (UFS), in the format oss://<bucket-name>/<path>. The path does not include endpoint information. |
    | Dataset | fs.oss.endpoint | The endpoint of the OSS bucket. Both public and private endpoints are supported. |
    | Dataset | accessModes | The access mode of the Dataset. |
    | JindoRuntime | replicas | The number of worker nodes in the JindoFS cluster. |
    | JindoRuntime | mediumtype | The cache storage medium. Valid values: HDD, SSD, MEM. |
    | JindoRuntime | path | The local storage path. If mediumtype is MEM, specify a local path for files such as logs. |
    | JindoRuntime | quota | The maximum cache capacity. Set this based on the size of the data in the UFS. |
    | JindoRuntime | high | The high watermark for cache storage capacity. |
    | JindoRuntime | low | The low watermark for cache storage capacity. |
    | JindoRuntime | fuse.args | Optional FUSE client mount parameters. The configuration depends on the Dataset access mode, as described after this table. |

    Recommended fuse.args settings by access mode:

    - ReadOnlyMany: Enable kernel_cache to use the kernel cache for read performance. Set attr_timeout (file attribute cache), entry_timeout (file name lookup cache), and negative_timeout (failed file name lookup cache). Default for all: 7200s.
    - ReadWriteMany: Use the default configuration: -oauto_cache, -oattr_timeout=0, -oentry_timeout=0, -onegative_timeout=0. The auto_cache option invalidates the cache when the file size or modification time changes.

  4. Deploy dataset.yaml to create the Dataset and JindoRuntime.

    kubectl create -f dataset.yaml

    Expected output:

    dataset.data.fluid.io/demo created
    jindoruntime.data.fluid.io/demo created
  5. Verify that the Dataset is ready.

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        0.00B       10.00GiB         0.0%                Bound   2m7s

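    The PHASE field shows Bound, which means the Dataset is ready. CACHE CAPACITY depends on the quota and replicas set in the JindoRuntime, so the value in your cluster can differ from this sample output.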
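
If you script this setup, the manual check in the last step can be replaced by a polling loop. The following is a minimal sketch, not part of the product tooling: the function name wait_for_phase, the timeout, and the polling interval are illustrative, and the script assumes that the PHASE column is backed by the .status.phase field of the resource.

```shell
#!/usr/bin/env bash
# Poll a resource until its phase matches the expected value.
# Usage: wait_for_phase <kind> <name> <phase> [timeout_seconds] [interval_seconds]
wait_for_phase() {
  local kind="$1"
  local name="$2"
  local want="$3"
  local timeout="${4:-300}"
  local interval="${5:-5}"
  local elapsed=0
  local phase=""
  while [ "$elapsed" -lt "$timeout" ]; do
    # Assumption: the PHASE column is read from .status.phase.
    phase=$(kubectl get "$kind" "$name" -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
    if [ "$phase" = "$want" ]; then
      echo "$kind/$name is $want"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out waiting for $kind/$name (last phase: ${phase:-unknown})" >&2
  return 1
}

# Example: wait_for_phase dataset demo Bound
```

The same function could be reused for the DataLoad job by passing dataload, the job name, and Complete.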

Step 3: Create a scheduled DataLoad job

By default, a DataLoad job loads all data in the target Dataset. For fine-grained control, such as loading only a specific path or syncing metadata before loading, see Advanced DataLoad configurations.

DataLoad supports two execution policies:

  • Once: The job runs only once.

  • Cron: The job runs on a recurring schedule.

  1. Create a file named dataload.yaml with the following content.

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: cron-dataload
    spec:
      dataset:
        name: demo
        namespace: default
      policy: Cron
      schedule: "*/2 * * * *" # Run every 2 min

    The following table describes the parameters.

    ParameterDescription
    datasetThe name and namespace of the Dataset to load.
    policyThe execution policy. Set to Cron for a scheduled job.
    scheduleThe cron expression for the job schedule. For more information, see Cron schedule syntax.
  2. Deploy the DataLoad job.

    kubectl apply -f dataload.yaml

    Expected output:

    dataload.data.fluid.io/cron-dataload created
  3. Check the DataLoad job status.

    kubectl get dataload

    When PHASE shows Complete, the data has been loaded into the cache.

    NAME            DATASET   PHASE      AGE   DURATION
    cron-dataload   demo      Complete   68s   8s
  4. Confirm the data is cached.

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        588.90KiB   10.00GiB         100.0%              Bound   5m50s

    CACHED PERCENTAGE at 100% confirms that all data from OSS is loaded into the cache.
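
To check the cache ratio from a script instead of reading the table by eye, the output of kubectl get dataset can be filtered. This is a minimal sketch under the assumption that every column in the data row is populated, as in the sample output above; the function name cached_percentage is illustrative. If a column is empty (for example, before the Dataset is bound), the field positions shift and the result is not reliable.

```shell
#!/usr/bin/env bash
# Print the CACHED PERCENTAGE value of one Dataset.
# Reads the table printed by `kubectl get dataset` from stdin.
cached_percentage() {
  # Data rows have seven whitespace-separated fields:
  # NAME, UFS TOTAL SIZE, CACHED, CACHE CAPACITY, CACHED PERCENTAGE, PHASE, AGE.
  awk -v name="$1" 'NR > 1 && $1 == name { print $5 }'
}

# Example: kubectl get dataset | cached_percentage demo
```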

Step 4: Access data from an application pod

  1. Create a file named app.yaml with the following content.

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - mountPath: /data
              name: demo-vol
      volumes:
        - name: demo-vol
          persistentVolumeClaim:
            claimName: demo
  2. Create the application pod.

    kubectl create -f app.yaml

    Expected output:

    pod/nginx created
  3. After the pod is ready, list the files that the pod sees under /data, where the Dataset is mounted.

    kubectl exec -it nginx -- ls -lh /data

    Expected output:

    total 589K
    -rwxrwxr-x 1 root root 589K Jul 31 04:20 RELEASENOTES.md
  4. On your local machine, append a line to RELEASENOTES.md to simulate an update.

    echo "hello, crondataload." >> RELEASENOTES.md
  5. Re-upload the updated file to OSS.

    ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md

    When prompted, enter y to confirm. Expected output:

    cp: overwrite "oss://<bucket-name>/<path>/RELEASENOTES.md"(y or N)? y
    Succeed: Total num: 1, size: 21. OK num: 1(upload 1 files).
  6. Wait for the next scheduled DataLoad run, then check the job status.

    kubectl describe dataload cron-dataload

    Expected output (relevant fields):

    Status:
      Conditions:
        Last Probe Time:       2023-08-24T06:44:08Z
        Last Transition Time:  2023-08-24T06:44:08Z
        Status:                True
        Type:                  Complete
      Duration:                8s
      Last Schedule Time:      2023-08-24T06:44:00Z # The time when the last DataLoad job was scheduled.
      Last Successful Time:    2023-08-24T06:44:08Z # The time when the last DataLoad job was completed.
      Phase:                   Complete
  7. Confirm that the updated file is cached.

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        1.15MiB     10.00GiB         100.0%              Bound   10m

    The CACHED value increased from 588.90KiB to 1.15MiB, which shows that the updated file was loaded into the cache.

  8. Verify that the application pod can read the updated content.

    kubectl exec -it nginx -- tail -n 1 /data/RELEASENOTES.md

    Expected output:

    hello, crondataload.
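
Reading the last line confirms one change. To compare the whole file, a checksum comparison between the local copy and the copy seen by the pod can be sketched as follows. The function name same_as_pod_copy is illustrative, and the sketch assumes that the cksum utility is available in the container image. If the check fails shortly after a DataLoad run, the FUSE timeouts configured in Step 2 may still be serving cached attributes, so retrying after a short wait is reasonable.

```shell
#!/usr/bin/env bash
# Succeed when the local file and the file seen by the pod
# have the same checksum and byte count.
# Usage: same_as_pod_copy <local_file> <pod> <path_in_pod>
same_as_pod_copy() {
  local local_file="$1"
  local pod="$2"
  local remote_path="$3"
  local local_sum
  local remote_sum
  # cksum prints "<checksum> <byte count> <name>"; compare only the first two fields.
  local_sum=$(cksum "$local_file" | awk '{ print $1, $2 }')
  # Assumption: cksum exists inside the container.
  remote_sum=$(kubectl exec "$pod" -- cksum "$remote_path" | awk '{ print $1, $2 }')
  [ -n "$remote_sum" ] && [ "$local_sum" = "$remote_sum" ]
}

# Example: same_as_pod_copy RELEASENOTES.md nginx /data/RELEASENOTES.md && echo "in sync"
```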

Advanced DataLoad configurations

The following configurations let you control DataLoad behavior beyond the defaults.

Sync metadata before loading

When files in OSS have changed, JindoFS may serve stale data because its metadata view is out of sync. Set loadMetadata: true to sync metadata before the DataLoad job runs.

spec:
  ...
  loadMetadata: true

Load only a specific path

By default, DataLoad loads all data in the Dataset. To load only a subset, specify one or more target paths.

spec:
  ...
  target:
    - path: <path1>
      replicas: 1
    - path: <path2>
      replicas: 2

The replicas field under each target sets the number of cached replicas for that path.

Combined advanced configuration

The following example shows all advanced fields together for reference.

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: cron-dataload
spec:
  dataset:
    name: demo
    namespace: default
  policy: Cron
  schedule: "* * * * *"
  loadMetadata: true
  target:
    - path: <path1>
      replicas: 1
    - path: <path2>
      replicas: 2
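
When the same job is deployed with different schedules across environments, the manifest can be generated from a small shell function instead of being edited by hand. The following is a sketch only: render_dataload is a hypothetical helper, the namespace is fixed to default as in the examples above, and the target list is omitted for brevity.

```shell
#!/usr/bin/env bash
# Print a Cron DataLoad manifest for the given job name, Dataset, and schedule.
# Cron fields, in order: minute, hour, day of month, month, day of week.
# Usage: render_dataload <name> <dataset> <schedule>
render_dataload() {
  local name="$1"
  local dataset="$2"
  local schedule="$3"
  cat <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: ${name}
spec:
  dataset:
    name: ${dataset}
    namespace: default
  policy: Cron
  schedule: "${schedule}"
  loadMetadata: true
EOF
}

# Example (every 30 minutes):
#   render_dataload cron-dataload demo "*/30 * * * *" | kubectl apply -f -
```

Piping the output to kubectl apply -f - avoids keeping one manifest file per schedule.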

(Optional) Clean up

If you no longer need the data acceleration setup, delete the application pod, the Dataset, and the JindoRuntime.

kubectl delete -f app.yaml
kubectl delete -f dataset.yaml

Expected output:

pod "nginx" deleted
dataset.data.fluid.io "demo" deleted
jindoruntime.data.fluid.io "demo" deleted