JindoRuntime is based on C++ and supports dataset management, data caching, and data storage in OSS. Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime. This topic describes how to use Fluid to accelerate data access in scenarios in which ACS compute power is used.
When Pods read OSS data repeatedly, each read fetches the data over the network — even if the same file was just accessed moments ago. JindoFS eliminates those repeat round trips by caching data in local memory. Once a file is cached, subsequent reads serve it at near-local speed. The example in this topic demonstrates a 9x speedup on a 210 MiB file after the first cached read.
Prerequisites
Before you begin, ensure that you have:
- An activated OSS account. See Activate OSS.
- ack-fluid 1.0.11-* or later installed in your cluster. See Use Helm to manage applications in ACS.
- Privileged mode enabled for ACS Pods.
  Note: Privileged mode is required to use Fluid for data acceleration. To enable it, submit a ticket.
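As a quick sanity check on the ack-fluid prerequisite, chart versions can be compared with `sort -V`. This is a minimal sketch; the `installed` value below is a placeholder for whatever `helm list -n fluid-system` reports in your cluster.

```shell
# Minimum ack-fluid version required by this topic.
required="1.0.11"
installed="1.0.12"   # placeholder; replace with the version Helm reports

# sort -V orders version strings numerically; if the required version sorts
# first (or equal), the installed version is new enough.
lowest="$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)"
if [ "$lowest" = "$required" ]; then
  echo "ack-fluid version OK"
else
  echo "ack-fluid too old: $installed < $required"
fi
```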
Step 1: Upload data to OSS
- Download the test dataset.

  wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

- Upload the dataset to an OSS bucket.

  Important: The following sub-steps use an ECS instance running Alibaba Cloud Linux 3.2104 LTS 64-bit. For other operating systems, see the ossutil command reference and ossutil 1.0.

  - Create a bucket named examplebucket.

    ossutil64 mb oss://examplebucket

    Expected output:

    0.668238(s) elapsed

    Note: If the command returns ErrorCode=BucketAlreadyExists, the bucket already exists. OSS bucket names must be globally unique, so change the name as needed.

  - Upload the dataset to the bucket.

    ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket

  - (Optional) Configure bucket and data access permissions. See Permission control.

- Create a file named mySecret.yaml with the following content.

  apiVersion: v1
  kind: Secret
  metadata:
    name: mysecret
  stringData:
    fs.oss.accessKeyId: <your-access-key-id>         # Replace with your AccessKey ID
    fs.oss.accessKeySecret: <your-access-key-secret> # Replace with your AccessKey Secret

  Storing the credentials in a Secret keeps them out of your workload manifests. Note that Kubernetes base64-encodes Secret values rather than encrypting them; encryption at rest depends on your cluster configuration.

- Apply the Secret.

  kubectl create -f mySecret.yaml
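For context on how the Secret stores these values, here is a minimal sketch of the base64 round trip Kubernetes applies to stringData fields. The AccessKey ID shown is a made-up placeholder, not a real credential.

```shell
# Kubernetes converts stringData values to base64 and stores them in the
# Secret's data field. This shows the encoding round trip for a placeholder.
access_key_id="LTAI4Example"   # hypothetical placeholder, not a real key

encoded="$(printf '%s' "$access_key_id" | base64)"
decoded="$(printf '%s' "$encoded" | base64 -d)"

# base64 is an encoding, not encryption: anyone who can read the Secret
# object can decode it. Restrict access with RBAC.
echo "stored as:  $encoded"
echo "decodes to: $decoded"
```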
Step 2: Create a Dataset and a JindoRuntime
Before proceeding, verify that the dataset-controller and jindoruntime-controller of the ack-fluid component are running:
kubectl get pods --field-selector=status.phase=Running -n fluid-system
This example uses CPU compute power. To accelerate the loading of large language models (LLMs) instead, make sure that the zone of your cluster provides GPU resources. See Introduction to GPU compute classes.
- Create a file named resource.yaml with the following content. The file defines a Dataset that points to your OSS data, and a JindoRuntime that launches a JindoFS cluster to cache it.

  apiVersion: data.fluid.io/v1alpha1
  kind: Dataset
  metadata:
    name: hadoop
  spec:
    placement: Shared # Required for ACS virtual nodes
    mounts:
      # To mount a subdirectory, use oss://<oss_bucket>/<oss_path>
      - mountPoint: oss://<oss_bucket> # Replace with your OSS bucket name, e.g. oss://examplebucket
        options:
          fs.oss.endpoint: <oss_endpoint> # Replace with your OSS endpoint, e.g. oss-cn-beijing-internal.aliyuncs.com
        name: hadoop
        path: "/"
        encryptOptions:
          - name: fs.oss.accessKeyId
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: fs.oss.accessKeyId
          - name: fs.oss.accessKeySecret
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: fs.oss.accessKeySecret
  ---
  apiVersion: data.fluid.io/v1alpha1
  kind: JindoRuntime
  metadata:
    name: hadoop # Must match the Dataset name
  spec:
    networkmode: ContainerNetwork
    replicas: 4 # Number of JindoFS worker nodes; adjust as needed
    master:
      podMetadata:
        labels:
          alibabacloud.com/compute-class: performance
          alibabacloud.com/compute-qos: default
    worker:
      podMetadata:
        labels:
          alibabacloud.com/compute-class: performance
          alibabacloud.com/compute-qos: default
      resources:
        requests:
          cpu: 24
          memory: 48Gi
        limits:
          cpu: 24
          memory: 48Gi
    tieredstore:
      levels:
        - mediumtype: MEM # Cache medium: MEM, HDD, or SSD
          path: /dev/shm # Storage path for the cache medium
          volumeType: emptyDir
          quota: 48Gi # Maximum cache size per worker; adjust as needed
          high: "0.99" # Eviction starts when usage reaches this threshold
          low: "0.95" # Eviction stops when usage drops to this threshold

  Key parameters:

  Parameter        Description
  mountPoint       The OSS path to mount as the underlying file system (UFS). Use the format oss://<bucket>, or oss://<bucket>/<path> for a subdirectory.
  fs.oss.endpoint  The endpoint of the OSS bucket, public or private. Example: oss-cn-beijing-internal.aliyuncs.com. See OSS regions and endpoints.
  replicas         The number of worker nodes in the JindoFS cluster.
  mediumtype       The cache storage medium. Supported values: MEM, HDD, SSD.
  quota            The maximum cache size per worker.
  high/low         The upper and lower thresholds for cache eviction.
- Apply the configuration.

  kubectl create -f resource.yaml

- Verify that the Dataset and JindoRuntime are ready.

  Check the Dataset:

  kubectl get dataset hadoop

  Expected output:

  NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
  hadoop   209.74MiB        0.00B    4.00GiB          0.0%                Bound   56s

  Check the JindoRuntime:

  kubectl get jindoruntime hadoop

  Expected output:

  NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
  hadoop   Ready          Ready          Ready        2m11s

- Confirm that the persistent volume (PV) and persistent volume claim (PVC) are created. The PV uses the Dataset name.

  kubectl get pv,pvc

  Expected output:

  NAME                              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
  persistentvolume/default-hadoop   100Pi      ROX            Retain           Bound    default/hadoop   fluid          <unset>                          2m5s

  NAME                           STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
  persistentvolumeclaim/hadoop   Bound    default-hadoop   100Pi      ROX            fluid          <unset>                 2m5s
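If you want to script the readiness check, the cached percentage can be pulled out of the `kubectl get dataset` output with awk. A minimal sketch using the sample output from this topic; in a live cluster you would pipe the real command output instead of the canned text below.

```shell
# Sample `kubectl get dataset hadoop` output as shown in this topic.
status='NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   209.74MiB        0.00B    4.00GiB          0.0%                Bound   56s'

# On the data row (NR==2), field 5 is CACHED PERCENTAGE: the multi-word
# headers (UFS TOTAL SIZE, CACHE CAPACITY) each map to a single data field.
cached_pct="$(echo "$status" | awk 'NR==2 {print $5}')"
phase="$(echo "$status" | awk 'NR==2 {print $6}')"

echo "cached: $cached_pct, phase: $phase"
```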
Step 3: Create a DataLoad resource
Preload the dataset into the JindoFS cache before your workload runs so that even the first access is served at cache speed, and so you can confirm the data is reachable before the workload starts.
- If the data in your OSS bucket is static, create a file named dataload.yaml with the following content.

  apiVersion: data.fluid.io/v1alpha1
  kind: DataLoad
  metadata:
    name: hadoop
  spec:
    dataset:
      name: hadoop
      namespace: default
    loadMetadata: true

  If the data changes periodically, set up a recurring preload instead. See Scenario 2: Data in the backend storage is read-only but periodically changes.

- Apply the DataLoad resource to start preloading.

  kubectl create -f dataload.yaml

- Monitor preloading progress.

  kubectl get dataload

  Expected output when complete:

  NAME     DATASET   PHASE      AGE   DURATION
  hadoop   hadoop    Complete   92m   51s
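Waiting for the preload to finish can be scripted by polling the DataLoad phase. This is a hedged sketch in which `check_phase` is a simulated stand-in returning canned values; in a real cluster you would replace it with something like `kubectl get dataload hadoop -o jsonpath='{.status.phase}'`.

```shell
# Simulated phase lookup: returns Pending, Loading, then Complete.
# Stand-in for querying the live DataLoad status with kubectl.
check_phase() {
  case "$1" in
    0) echo "Pending" ;;
    1) echo "Loading" ;;
    *) echo "Complete" ;;
  esac
}

attempt=0
phase="$(check_phase "$attempt")"

# Poll until Complete, with a cap on attempts so the loop cannot hang.
while [ "$phase" != "Complete" ] && [ "$attempt" -lt 10 ]; do
  attempt=$((attempt + 1))
  phase="$(check_phase "$attempt")"
done

echo "DataLoad phase: $phase after $attempt checks"
```

In a real script you would also `sleep` between polls to avoid hammering the API server.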
Step 4: Verify data acceleration
Deploy a test Pod that mounts the Dataset and measure file copy time before and after JindoFS caching takes effect.
- Create a file named app.yaml with the following content.

  apiVersion: v1
  kind: Pod
  metadata:
    name: demo-app
    labels:
      # Required: instructs the Fluid webhook to inject JindoFS sidecar containers into ACS Pods
      alibabacloud.com/fluid-sidecar-target: acs
  spec:
    containers:
      - name: demo
        image: mirrors-ssl.aliyuncs.com/nginx:latest
        volumeMounts:
          - mountPath: /data
            name: hadoop
        resources:
          requests:
            cpu: 14
            memory: 56Gi
    volumes:
      - name: hadoop
        persistentVolumeClaim:
          claimName: hadoop # Matches the Fluid Dataset name
    nodeSelector:
      type: virtual-kubelet
    tolerations:
      - key: virtual-kubelet.io/provider
        operator: Equal
        value: alibabacloud
        effect: NoSchedule

- Deploy the Pod.

  kubectl create -f app.yaml

- Measure the file copy time without JindoFS caching.

  Check the file size:

  kubectl exec -it demo-app -c demo -- du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz

  Expected output:

  210M    /data/spark-3.0.1-bin-hadoop2.7.tgz

  Open a shell in the container, then time the copy:

  kubectl exec -it demo-app -c demo -- bash
  time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null

  Expected output:

  real    0m1.883s
  user    0m0.001s
  sys     0m0.041s

- Confirm that the data is fully cached.

  kubectl get dataset hadoop

  Expected output:

  NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
  hadoop   209.74MiB        209.74MiB   4.00GiB          100.0%              Bound   64m

- Delete and recreate the Pod, then rerun the copy test against the JindoFS cache.

  Note: Recreating the Pod clears the OS page cache, so the second measurement reflects only the JindoFS cache speed, not any in-memory residue from the first run.

  Delete the existing Pod:

  kubectl delete pod demo-app

  Recreate it:

  kubectl create -f app.yaml

  Open a shell in the container and run the copy test again:

  kubectl exec -it demo-app -c demo -- bash
  time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null

  Expected output:

  real    0m0.203s
  user    0m0.000s
  sys     0m0.047s

  The copy now takes 0.203 seconds, about 9x faster than the 1.883 seconds without caching. The speedup comes from JindoFS serving the file from its in-memory cache on /dev/shm rather than fetching it from OSS over the network. Once data is cached locally, subsequent reads skip the network entirely.

  Important: The copy times shown here are for reference only and may vary based on your cluster configuration and network conditions.
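The quoted speedup follows directly from the two measured times:

```shell
# Ratio of the uncached copy time (1.883 s) to the cached copy time (0.203 s),
# as measured in this topic.
speedup="$(awk 'BEGIN {printf "%.1f", 1.883 / 0.203}')"
echo "speedup: ${speedup}x"
```

With your own measurements, substitute the two `real` times from the `time cp` runs.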
Use ACS compute power in ACK Pro clusters
The steps above apply to ACS clusters. To use ACS compute power in ACK managed clusters instead, see Use the computing power of ACS in ACK Pro clusters.
For ACK managed clusters, make the following adjustments:
- Install the ack-fluid component in the ACK managed cluster. See Use Helm to simplify application deployment.

- Create the Dataset and JindoRuntime using the following configuration. Compared with the ACS configuration, it omits placement: Shared, networkmode, and the compute-class labels; standard ACK nodes do not require these settings.

  apiVersion: data.fluid.io/v1alpha1
  kind: Dataset
  metadata:
    name: hadoop
  spec:
    mounts:
      # To mount a subdirectory, use oss://<oss_bucket>/<oss_path>
      - mountPoint: oss://<oss_bucket> # Replace with your OSS bucket name
        options:
          fs.oss.endpoint: <oss_endpoint> # Replace with your OSS endpoint
        name: hadoop
        path: "/"
        encryptOptions:
          - name: fs.oss.accessKeyId
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: fs.oss.accessKeyId
          - name: fs.oss.accessKeySecret
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: fs.oss.accessKeySecret
  ---
  apiVersion: data.fluid.io/v1alpha1
  kind: JindoRuntime
  metadata:
    name: hadoop
  spec:
    replicas: 4 # Adjust as needed
    tieredstore:
      levels:
        - mediumtype: MEM
          path: /dev/shm
          volumeType: emptyDir
          quota: 48Gi
          high: "0.99"
          low: "0.95"

  Differences from the ACS configuration:

  - ACS clusters use virtual nodes that do not support standard node scaling. To enable shared dataset access and inter-pod communication, set placement: Shared and networkmode: ContainerNetwork in the ACS configuration. These fields are not needed for ACK managed clusters.
  - Fluid workers on ACS require high bandwidth. Set the compute-class: performance label and configure sufficient CPU and memory resources in the ACS configuration to ensure adequate bandwidth. ACK managed clusters allocate resources differently and do not need these labels.
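To double-check that a manifest intended for an ACK managed cluster does not carry the ACS-only settings, a simple grep scan works. A minimal sketch over an inline fragment of the ACS-style manifest; the field names are the ones used in this topic.

```shell
# Fragment of an ACS-style manifest containing all three ACS-only settings.
manifest='spec:
  placement: Shared
  networkmode: ContainerNetwork
  labels:
    alibabacloud.com/compute-class: performance'

# Count how many ACS-only settings appear; each should be removed (or never
# added) when targeting a standard ACK managed cluster.
acs_only=0
for field in 'placement: Shared' 'networkmode:' 'compute-class:'; do
  if echo "$manifest" | grep -q "$field"; then
    acs_only=$((acs_only + 1))
    echo "ACS-only setting found: $field"
  fi
done
echo "$acs_only ACS-only settings to remove for ACK"
```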