
Container Service for Kubernetes:Periodically update a dataset by running a DataLoad job

Last Updated:Dec 21, 2023

Fluid is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data-intensive applications in cloud-native scenarios. Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling cache runtimes. This topic uses JindoFS as an example to describe how to periodically update a dataset by running a DataLoad job.

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster that runs Kubernetes 1.18 or later is created. For more information, see Create an ACK managed cluster.

  • The cloud-native AI suite is installed and the ack-fluid component is deployed.

    Important
    • If you have already installed open source Fluid, uninstall Fluid and deploy the ack-fluid component.

    • Make sure that the ack-fluid version is 1.0.3.

    • If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.

    • If you have installed the cloud-native AI suite, log on to the ACK console and deploy ack-fluid from the Cloud-native AI Suite page.

  • A kubectl client is connected to the ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.

Step 1: Upload data to OSS

  1. Run the following command to download the test data:

    wget https://archive.apache.org/dist/hbase/2.5.2/RELEASENOTES.md
  2. Install ossutil and create an Object Storage Service (OSS) bucket. For more information, see Install ossutil.

  3. Run the following command to upload the test data to the OSS bucket:

    ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md

Step 2: Create a dataset and a JindoRuntime

  1. Create a file named mySecret.yaml to store the AccessKey ID and AccessKey secret that are used to access OSS. The following YAML template provides an example:

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
    stringData:
      fs.oss.accessKeyId: ****** # Enter the AccessKey ID. 
      fs.oss.accessKeySecret: ****** # Enter the AccessKey secret. 

  2. Run the following command to create a Secret:

    kubectl create -f mySecret.yaml

    Expected output:

    secret/mysecret created
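
    As a side note, values under stringData are written in plain text, and the API server base64-encodes them into the Secret's data field. The following minimal sketch shows the equivalent encoding; the credential values are hypothetical placeholders, not real keys:

    ```python
    import base64

    # Plain-text values as they appear under stringData (placeholders, not real credentials).
    plain = {
        "fs.oss.accessKeyId": "exampleAccessKeyId",
        "fs.oss.accessKeySecret": "exampleAccessKeySecret",
    }

    # The API server stores each value base64-encoded under the data field.
    data = {key: base64.b64encode(value.encode()).decode() for key, value in plain.items()}

    # Decoding recovers the original plain-text value.
    assert base64.b64decode(data["fs.oss.accessKeyId"]).decode() == "exampleAccessKeyId"
    ```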
  3. Create a file named dataset.yaml. The file is used to create a dataset.


    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: demo
    spec:
      mounts:
        - mountPoint: oss://<bucket-name>/<path>
          options:
            fs.oss.endpoint: <oss-endpoint>
          name: hbase
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
      accessModes:
        - ReadOnlyMany
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: demo
    spec:
      replicas: 1
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            quota: 2Gi
            high: "0.99"
            low: "0.8"
      fuse:
        args:
          - -okernel_cache
          - -oro
          - -oattr_timeout=60
          - -oentry_timeout=60
          - -onegative_timeout=60

    The following list describes the parameters in the YAML template.

    Dataset

    • mountPoint: oss://<bucket-name>/<path> specifies the path to which the underlying file system (UFS) is mounted. You do not need to include the endpoint in the path.

    • fs.oss.endpoint: The public or private endpoint of the OSS bucket.

    • accessModes: The access mode of the dataset.

    JindoRuntime

    • replicas: The number of workers in the JindoFS cluster.

    • mediumtype: The cache type. You can select HDD, SSD, or MEM for JindoFS when you create the JindoRuntime template.

    • path: The storage path. You can specify only one path. If you set mediumtype to MEM, you must specify a local path to store data, such as logs.

    • quota: The maximum size of the cache, for example, 2Gi. You can configure the cache size based on the data size of the UFS.

    • high: The upper watermark of the storage usage.

    • low: The lower watermark of the storage usage.

    • fuse.args: The optional mount parameters of the FUSE client. You can specify the mount parameters based on the access mode of the dataset.

    • If you set accessModes to ReadOnlyMany, you can enable kernel_cache to optimize read performance by using the kernel cache. In this case, you can configure attr_timeout (the timeout period for which file attributes are cached), entry_timeout (the timeout period for which a name lookup is cached), and negative_timeout (the timeout period for which a negative lookup is cached). The default value for each parameter is 7200. Unit: seconds.

    • If you set accessModes to ReadWriteMany, we recommend that you use the default settings. In this case, the following settings are used:

      - -oauto_cache

      - -oattr_timeout=0

      - -oentry_timeout=0

      - -onegative_timeout=0

      Note

      Enable auto_cache to ensure that the cache expires when the file size or modification time changes. attr_timeout, entry_timeout, and negative_timeout are set to zero.
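
    Spelled out explicitly, the fuse section of a JindoRuntime for a ReadWriteMany dataset would look like the following sketch. These are the default settings listed above, so you normally do not need to set them yourself:

    ```yaml
    fuse:
      args:
        - -oauto_cache          # Expire the kernel cache when the file size or modification time changes.
        - -oattr_timeout=0      # Do not cache file attributes.
        - -oentry_timeout=0     # Do not cache name lookups.
        - -onegative_timeout=0  # Do not cache negative lookups.
    ```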

  4. Run the following command to deploy dataset.yaml to create a dataset and a JindoRuntime:

    kubectl create -f dataset.yaml

    Expected output:

    dataset.data.fluid.io/demo created
    jindoruntime.data.fluid.io/demo created
  5. Run the following command to check whether the dataset is deployed:

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        0.00B       10.00GiB         0.0%                Bound   2m7s

Step 3: Create a DataLoad job that runs periodically

  1. Create a file named dataload.yaml.


    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: cron-dataload
    spec:
      dataset:
        name: demo
        namespace: default
      policy: Cron
      schedule: "*/2 * * * *" # Run every 2 min

    The following list describes the parameters in the YAML template.

    • dataset: The name and namespace of the dataset on which the DataLoad job runs.

    • policy: The policy that defines how the job runs. Valid values: Once and Cron. In this example, policy is set to Cron.

      - Once: The job runs once.

      - Cron: The job runs periodically.

    • schedule: The schedule of the DataLoad job. The value of the .spec.schedule field follows the same syntax as standard cron expressions. For more information, see Schedule syntax.
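
    To illustrate the cron semantics, the schedule "*/2 * * * *" used above matches every minute that is divisible by 2. The following minimal matcher is a simplified sketch for illustration only; it handles only the "*", "*/n", and plain-number forms of a cron field:

    ```python
    def field_matches(expr: str, value: int) -> bool:
        """Check one cron field ('*', '*/n', or a plain number) against a value."""
        if expr == "*":
            return True
        if expr.startswith("*/"):
            return value % int(expr[2:]) == 0
        return int(expr) == value

    def cron_matches(schedule: str, minute: int, hour: int) -> bool:
        """Check the minute and hour fields of a five-field cron schedule."""
        minute_field, hour_field = schedule.split()[:2]
        return field_matches(minute_field, minute) and field_matches(hour_field, hour)

    # "*/2 * * * *" fires on every even minute of every hour.
    assert cron_matches("*/2 * * * *", minute=14, hour=9)
    assert not cron_matches("*/2 * * * *", minute=13, hour=9)
    ```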

    For more information about the advanced settings of DataLoad jobs, see the following configuration file:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: cron-dataload
    spec:
      dataset:
        name: demo
        namespace: default
      policy: Cron # The policy that defines how the job runs. Valid values: Once and Cron. 
      schedule: "* * * * *" # This field takes effect only if you set policy to Cron. 
      loadMetadata: true # Synchronize metadata before the DataLoad job runs. 
      target: # The targets on which the DataLoad job runs. You can specify multiple targets. 
        - path: <path1> # The path where the DataLoad job runs. 
          replicas: 1 # The number of cached replicas. 
        - path: <path2>
          replicas: 2
  2. Run the following command to deploy dataload.yaml to create a DataLoad job:

    kubectl apply -f dataload.yaml

    Expected output:

    dataload.data.fluid.io/cron-dataload created
  3. Run the following command to query the status of the DataLoad job:

    kubectl get dataload

    In the following output, if the value in the PHASE column is Complete, the data is loaded and you can proceed to the next step.

    NAME            DATASET   PHASE      AGE   DURATION
    cron-dataload   demo      Complete   68s   8s
  4. Run the following command to query the status of the dataset:

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        588.90KiB   10.00GiB         100.0%              Bound   5m50s

    The output indicates that all data in OSS is loaded to the cache.

Step 4: Create an application pod to access data in OSS

  1. Create a file named app.yaml and use an application pod to access the RELEASENOTES.md file.


    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - mountPath: /data
              name: demo-vol
      volumes:
        - name: demo-vol
          persistentVolumeClaim:
            claimName: demo
  2. Run the following command to create an application pod:

    kubectl create -f app.yaml

    Expected output:

    pod/nginx created
  3. After the application pod is ready, run the following command to access the data in OSS:

    kubectl exec -it nginx -- ls -lh /data

    Expected output:

    total 589K
    -rwxrwxr-x 1 root root 589K Jul 31 04:20 RELEASENOTES.md
  4. Run the following command to append the string "hello, crondataload." to the local copy of the RELEASENOTES.md file:

    echo "hello, crondataload." >> RELEASENOTES.md
  5. Run the following command to upload the RELEASENOTES.md file to OSS:

    ossutil64 cp RELEASENOTES.md oss://<bucket-name>/<path>/RELEASENOTES.md

    Enter y at the overwrite prompt. Expected output:

    cp: overwrite "oss://<bucket-name>/<path>/RELEASENOTES.md"(y or N)? y
    Succeed: Total num: 1, size: 21. OK num: 1(upload 1 files).                          
    
    average speed 0(byte/s)
    
    81.827978(s) elapsed
  6. Run the following command to query the status of the DataLoad job:

    kubectl describe dataload cron-dataload

    Expected output:

    ...
    Status:
      Conditions:
        Last Probe Time:       2023-08-24T06:44:08Z
        Last Transition Time:  2023-08-24T06:44:08Z
        Status:                True
        Type:                  Complete
      Duration:                8s
      Last Schedule Time:      2023-08-24T06:44:00Z # The most recent time when the DataLoad job started. 
      Last Successful Time:    2023-08-24T06:44:08Z # The most recent time when the DataLoad job was completed. 
      Phase:                   Complete
      ...
  7. Run the following command to query the status of the dataset:

    kubectl get dataset

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    demo    588.90KiB        1.15MiB     10.00GiB         100.0%              Bound   10m

    The output indicates that the updated file is loaded to the cache.
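
    Note that CACHED (1.15MiB) now exceeds UFS TOTAL SIZE (588.90KiB). A plausible reading, which is an assumption rather than something stated in the output, is that the cache briefly holds both the original and the updated copy of the file, so the cached bytes are roughly twice the UFS size:

    ```python
    KIB = 1024.0
    MIB = 1024.0 * KIB

    ufs_total_bytes = 588.90 * KIB  # UFS TOTAL SIZE from the output above
    cached_bytes = 1.15 * MIB       # CACHED from the output above

    # The cached data amounts to roughly two copies of the file.
    ratio = cached_bytes / ufs_total_bytes
    assert 1.9 < ratio < 2.1
    ```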

  8. Run the following command to view the updated file in the application pod:

    kubectl exec -it nginx -- tail /data/RELEASENOTES.md

    Expected output:

    hello, crondataload.

    The output indicates that the application pod can access the updated file.

(Optional) Step 5: Clear data

If you do not need to use the data acceleration feature, clear the related data.

Run the following commands to delete the application pod, the dataset, and the JindoRuntime:

kubectl delete -f app.yaml
kubectl delete -f dataset.yaml

Expected output:

pod "nginx" deleted
dataset.data.fluid.io "demo" deleted
jindoruntime.data.fluid.io "demo" deleted