Container Service for Kubernetes: Periodically update a Dataset by using a DataLoad job

Last Updated: Sep 17, 2025

Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator engine for data-intensive applications in cloud-native scenarios. Fluid enables dataset visibility, elastic scaling, and data migration by managing and scheduling the underlying cache runtimes. This topic uses JindoFS as an example to describe how to periodically update a Dataset by using a scheduled DataLoad job.

Prerequisites

  • You have created a Container Service for Kubernetes (ACK) managed cluster Pro Edition of version 1.18 or later. For more information, see Create an ACK managed cluster Pro Edition.

  • The cloud-native AI suite is installed, and the ack-fluid component is deployed.

    Important
    • If you have installed open source Fluid, you must uninstall it before you deploy the ack-fluid component.

    • Make sure that the ack-fluid version is 1.0.3.

    • If you have not installed the cloud-native AI suite, you can enable Fluid during the installation. For more information, see Install the cloud-native AI suite.

    • If the cloud-native AI suite is installed, log on to the ACK console and deploy ack-fluid from the Cloud-native AI Suite page.

  • You have connected to a Kubernetes cluster using kubectl. For more information, see Connect to a cluster using the kubectl tool.
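
To verify that the Fluid components are running before you proceed, you can list the pods of the ack-fluid component. This is a quick sanity check, assuming the default fluid-system namespace that Fluid deploys into:

# List the Fluid controller pods. All pods are expected to be in the Running state.
kubectl get pods -n fluid-system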

Step 1: Prepare data in an OSS bucket

  1. Run the following command to download the test data:

    wget https://archive.apache.org/dist/hbase/2.5.2/RELEASENOTES.md
  2. Install ossutil and create a bucket. For more information, see Install ossutil.

  3. Run the following command to upload the test data to the OSS bucket:

    ossutil64 cp RELEASENOTES.md oss://<bucket>/<path>/RELEASENOTES.md
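
    To confirm that the upload succeeded, you can list the destination path. The placeholders are the same bucket and path that you used in the previous command:

    ossutil64 ls oss://<bucket>/<path>/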

Step 2: Create a Dataset and a JindoRuntime

  1. Create a mySecret.yaml file to store the AccessKey ID and AccessKey secret that are used to access OSS. The following YAML provides an example:

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
    stringData:
      fs.oss.accessKeyId: ****** # Enter your AccessKey ID.
      fs.oss.accessKeySecret: ****** # Enter your AccessKey secret.
  2. Run the following command to create the Secret:

    kubectl create -f mySecret.yaml

    Expected output:

    secret/mysecret created
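
    Optionally, you can inspect the Secret to confirm that both keys exist. The values that you provided in stringData are stored Base64-encoded in the data field:

    kubectl get secret mysecret -o yaml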
  3. Create a dataset.yaml file that defines the Dataset and the JindoRuntime. The following YAML provides an example:

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: demo
    spec:
      mounts:
        - mountPoint: oss://<bucket-name>/<path>
          options:
            fs.oss.endpoint: <oss-endpoint>
          name: hbase
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
      accessModes:
        - ReadOnlyMany
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: demo
    spec:
      replicas: 1
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            quota: 2Gi
            high: "0.99"
            low: "0.8"
      fuse:
        args:
          - -okernel_cache
          - -oro
          - -oattr_timeout=60
          - -oentry_timeout=60
          - -onegative_timeout=60

    The following table describes the parameters.

    Dataset

    • mountPoint: oss://<oss_bucket>/<path> specifies the path to which the underlying file system (UFS) is mounted. The path does not need to include endpoint information.

    • fs.oss.endpoint: The endpoint of the OSS bucket. You can use a public or private endpoint.

    • accessModes: The access mode of the Dataset.

    JindoRuntime

    • replicas: The number of worker nodes in the JindoFS cluster.

    • mediumtype: The type of the cache medium. Valid values: HDD, SSD, and MEM. You can specify any of these types when you create a JindoRuntime.

    • path: The cache storage path. Only a single path is supported. If you set mediumtype to MEM, you must specify a local path to store files such as logs.

    • quota: The maximum cache capacity. Configure the cache capacity based on the size of the data in the UFS. In this example, the quota is 2Gi.

    • high: The high watermark of the storage capacity.

    • low: The low watermark of the storage capacity.

    • fuse.args: Optional mount parameters for the FUSE client. Use these parameters together with the access mode of the Dataset.

      - If the access mode is ReadOnlyMany, enable kernel_cache to use the kernel cache to optimize read performance. In this case, you can set attr_timeout (the validity period of the file attribute cache), entry_timeout (the validity period of the file name lookup cache), and negative_timeout (the validity period of the failed file name lookup cache). The default value of each timeout is 7200s.

      - If the access mode is ReadWriteMany, use the default configurations:

        -oauto_cache
        -oattr_timeout=0
        -oentry_timeout=0
        -onegative_timeout=0

        Note: auto_cache invalidates the cache when the file size or the modification time changes. All timeout values are set to 0 so that updates in the UFS become visible immediately.

  4. Run the following command to deploy dataset.yaml to create the Dataset and the JindoRuntime:

    kubectl create -f dataset.yaml

    Expected output:

    dataset.data.fluid.io/demo created
    jindoruntime.data.fluid.io/demo created
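
    You can also check the status of the JindoRuntime that was created together with the Dataset. The output columns can vary with the Fluid version, but the master, worker, and FUSE components are expected to eventually reach the Ready phase:

    kubectl get jindoruntime demo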
  5. Run the following command to check the status of the Dataset:

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        0.00B       10.00GiB         0.0%                Bound   2m7s
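
    When the Dataset enters the Bound phase, Fluid automatically creates a persistent volume (PV) and a persistent volume claim (PVC) that are named after the Dataset. The application pod in Step 4 mounts this PVC. You can verify that they exist by using the following command (the exact PV name can vary with the Fluid version):

    kubectl get pv,pvc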

Step 3: Create a scheduled DataLoad job

  1. Create a dataload.yaml file. The following YAML provides an example:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: cron-dataload
    spec:
      dataset:
        name: demo
        namespace: default
      policy: Cron
      schedule: "*/2 * * * *" # Run every 2 min

    The following table describes the parameters.

    • dataset: The name and namespace of the Dataset on which the DataLoad job runs.

    • policy: The execution policy. Valid values: Once and Cron. This example uses Cron to create a scheduled DataLoad job.

      - Once: The job runs only once.

      - Cron: The job runs on a schedule.

    • schedule: The schedule of the DataLoad job. The value of the .spec.schedule field follows the Cron syntax. For more information, see Cron schedule syntax.
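
    The schedule field uses the standard five-field Cron format. The following sketch shows the meaning of each field together with a few illustrative values:

    # ┌───────────── minute (0-59)
    # │ ┌───────────── hour (0-23)
    # │ │ ┌───────────── day of the month (1-31)
    # │ │ │ ┌───────────── month (1-12)
    # │ │ │ │ ┌───────────── day of the week (0-6, Sunday to Saturday)
    # │ │ │ │ │
    # * * * * *
    #
    # "*/2 * * * *"   Every 2 minutes, as used in this example.
    # "0 2 * * *"     Every day at 02:00.
    # "0 */6 * * *"   Every 6 hours, on the hour.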

    For more information about the advanced configurations for a DataLoad job, see the following configuration file:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: cron-dataload
    spec:
      dataset:
        name: demo
        namespace: default
      policy: Cron # The execution policy of the DataLoad job. Valid values: Once and Cron.
      schedule: "* * * * *" # The Cron schedule. This field takes effect only when policy is set to Cron.
      loadMetadata: true # Specifies whether to synchronize metadata before the DataLoad job runs.
      target: # The targets of the DataLoad job. You can specify multiple targets.
        - path: <path1> # The path on which the DataLoad job runs.
          replicas: 1 # The number of cached replicas.
        - path: <path2>
          replicas: 2
  2. Run the following command to deploy dataload.yaml to create the DataLoad job:

    kubectl apply -f dataload.yaml

    Expected output:

    dataload.data.fluid.io/cron-dataload created
  3. Run the following command to check the status of the DataLoad job:

    kubectl get dataload

    When the value in the PHASE column is Complete, the data is loaded. You can then proceed to the next step. Expected output:

    NAME            DATASET   PHASE      AGE   DURATION
    cron-dataload   demo      Complete   68s   8s
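
    Because the policy of this DataLoad job is Cron, the job runs again every 2 minutes. You can watch the subsequent runs by using the following command:

    kubectl get dataload -w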
  4. Run the following command to check the current status of the Dataset:

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        588.90KiB   10.00GiB         100.0%              Bound   5m50s

    The output shows that all files from OSS are loaded into the cache.

Step 4: Create an application pod to access data in OSS

  1. Create an app.yaml file that defines an application pod that accesses the RELEASENOTES.md file. The following YAML provides an example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - mountPath: /data
              name: demo-vol
      volumes:
        - name: demo-vol
          persistentVolumeClaim:
            claimName: demo
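
    In this pod specification, claimName: demo references the PVC that Fluid automatically created for the Dataset in Step 2, so the pod needs no OSS credentials of its own. You can confirm that the PVC exists before you create the pod:

    kubectl get pvc demo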
  2. Run the following command to create the application pod:

    kubectl create -f app.yaml

    Expected output:

    pod/nginx created
  3. After the application pod is ready, run the following command to view the data in OSS:

    kubectl exec -it nginx -- ls -lh /data

    Expected output:

    total 589K
    -rwxrwxr-x 1 root root 589K Jul 31 04:20 RELEASENOTES.md
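
    Reads from /data are served from the JindoFS cache instead of OSS because the DataLoad job has already loaded the file. As a rough check, you can time a full read of the file; a read that is served from the cache is typically much faster than a read that falls back to OSS. This sketch assumes that the container image provides bash:

    kubectl exec -it nginx -- bash -c "time cat /data/RELEASENOTES.md > /dev/null"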
  4. Run the following command to append the string "hello, crondataload." to the local copy of the RELEASENOTES.md file:

    echo "hello, crondataload." >> RELEASENOTES.md
  5. Run the following command to upload the modified RELEASENOTES.md file to OSS again:

    ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md

    Enter y at the prompt to confirm the overwrite. Expected output:

    cp: overwrite "oss://<bucket-name>/<path>/RELEASENOTES.md"(y or N)? y
    Succeed: Total num: 1, size: 21. OK num: 1(upload 1 files).                          
    
    average speed 0(byte/s)
    
    81.827978(s) elapsed
  6. Wait for the next scheduled run of the DataLoad job, and then run the following command to check its status:

    kubectl describe dataload cron-dataload

    Expected output:

    ...
    Status:
      Conditions:
        Last Probe Time:       2023-08-24T06:44:08Z
        Last Transition Time:  2023-08-24T06:44:08Z
        Status:                True
        Type:                  Complete
      Duration:                8s
      Last Schedule Time:      2023-08-24T06:44:00Z # The time when the last DataLoad job was scheduled.
      Last Successful Time:    2023-08-24T06:44:08Z # The time when the last DataLoad job was completed.
      Phase:                   Complete
      ...
  7. Run the following command to check the current status of the Dataset:

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        1.15MiB     10.00GiB         100.0%              Bound   10m

    The output shows that the updated file is loaded into the cache.

  8. Run the following command to view the updated file in the application pod:

    kubectl exec -it nginx -- tail /data/RELEASENOTES.md

    Expected output:

    hello, crondataload.

    The output shows that the application pod can access the updated file.

(Optional) Step 5: Clean up the environment

If you no longer need the data acceleration feature, you can clean up the environment.

Run the following commands to delete the application pod, the Dataset, and the JindoRuntime:

kubectl delete -f app.yaml
kubectl delete -f dataset.yaml

Expected output:

pod "nginx" deleted
dataset.data.fluid.io "demo" deleted
jindoruntime.data.fluid.io "demo" deleted
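
If you also want to remove the scheduled DataLoad job and the Secret that you created in this topic, run the following commands:

kubectl delete -f dataload.yaml
kubectl delete -f mySecret.yaml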