Spark jobs that read large datasets from Object Storage Service (OSS) repeatedly fetch the same data over the network, adding latency on every run. Fluid caches OSS data on local disks so that Spark executors read from the cache instead of fetching data from OSS on each access. This guide shows you how to set up Fluid with JindoRuntime on Container Service for Kubernetes (ACK) and run a Spark PageRank job against cached data.
Prerequisites
Before you begin, ensure that you have:
- The ack-spark-operator component installed. For more information, see Step 1: Install the ack-spark-operator component.
- The cloud-native AI suite deployed with the ack-fluid component enabled. For more information, see Install the cloud-native AI suite.
- Test data uploaded to an OSS bucket. For more information, see Prepare and upload test data to an OSS bucket.
The examples in this guide assume that the ack-spark-operator component is configured with spark.jobNamespaces=["spark"], so every resource is created in the spark namespace. To use a different namespace, update the namespace field in each manifest accordingly.
How it works
Fluid intercepts data access from Spark jobs and routes reads through JindoRuntime worker pods. On the first access, JindoRuntime fetches data from OSS and writes it to local SATA HDDs. Subsequent reads hit the local cache, eliminating repeated network round-trips to OSS.
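The read path described above follows the general read-through caching pattern. The sketch below is a toy illustration of that pattern only, not Fluid's or JindoRuntime's implementation; the class `ReadThroughCache` and the helper `fake_oss_fetch` are hypothetical names, and an in-memory dictionary stands in for the local disks.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy read-through cache: a miss fetches from the backing store and keeps a local copy."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote   # stands in for a read from OSS
        self._local: Dict[str, bytes] = {}  # stands in for the local disk cache
        self.remote_reads = 0               # number of times the backing store was hit

    def read(self, path: str) -> bytes:
        if path not in self._local:
            # First access: fetch from the remote store, then keep a local copy.
            self._local[path] = self._fetch_remote(path)
            self.remote_reads += 1
        # Every later access is served from the local copy.
        return self._local[path]


def fake_oss_fetch(path: str) -> bytes:
    # Hypothetical stand-in for an OSS GET request.
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_oss_fetch)
first = cache.read("data/pagerank_dataset.txt")
second = cache.read("data/pagerank_dataset.txt")
print(cache.remote_reads)  # 1: only the first read went to the backing store
```

Reading the same path twice triggers a single remote fetch, which is the effect the cache has on repeated Spark runs.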
Two access methods are available:
| Method | How Spark accesses data | When to use |
|---|---|---|
| POSIX | Fluid creates a persistent volume claim (PVC) backed by the dataset. Spark mounts the PVC and reads files using standard file:// paths. | Your Spark job reads local files and you want cache acceleration with no code changes. |
| HCFS | Spark uses JindoSDK to access data via oss://, s3://, or s3a:// URLs, routed through JindoRuntime. | Your job already uses OSS paths, or you want to switch storage URIs without remounting volumes. Requires a custom Spark image with JindoSDK. |
Step 1: Create a dedicated node pool for Fluid
Create a node pool named fluid for the JindoRuntime worker pods. This example uses three ecs.d1ne.4xlarge instances (network-enhanced big data family). Each node has eight 5,905 GB high-throughput local SATA HDDs, formatted and mounted at /mnt/disk1 through /mnt/disk8.
Add the following label and taint to each node:
- Label: fluid-cloudnative.github.io/node="true"
- Taint: fluid-cloudnative.github.io/node="true":NoSchedule
For guidance on creating a node pool, see Create and manage a node pool. For instance type recommendations, see Best practices for the cache optimization policies of Fluid.
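The taint keeps ordinary pods off the cache nodes, and only pods that carry a matching toleration can be scheduled there. The sketch below models the standard Kubernetes matching rule for `Equal` and `Exists` tolerations in simplified form. The `Taint` and `Toleration` classes and the `tolerates` and `can_schedule` helpers are hypothetical names for illustration, not a Kubernetes client API, and the model ignores the scheduler's other checks such as node affinity and resources.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"
    value: str = ""
    effect: str = ""  # an empty effect tolerates every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration matches a single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # With Exists, the value is ignored; an empty key matches every taint.
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def can_schedule(tolerations: List[Toleration], taints: List[Taint]) -> bool:
    """A pod fits only if every taint on the node is tolerated."""
    return all(any(tolerates(t, taint) for t in tolerations) for taint in taints)


fluid_taint = Taint("fluid-cloudnative.github.io/node", "true", "NoSchedule")
cache_toleration = Toleration(
    key="fluid-cloudnative.github.io/node",
    operator="Equal",
    value="true",
    effect="NoSchedule",
)

print(can_schedule([cache_toleration], [fluid_taint]))  # True
print(can_schedule([], [fluid_taint]))                  # False
```

A pod without tolerations is rejected by the tainted node, which is why the cache components need a toleration with the same key, value, and effect.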
Step 2: Create a dataset
A Dataset is a Fluid custom resource that maps an OSS path to a cache-backed volume. Spark pods access data through this volume instead of going directly to OSS.
1. Create fluid-oss-secret.yaml to store your OSS credentials. Replace <ACCESS_KEY_ID> and <ACCESS_KEY_SECRET> with your Alibaba Cloud AccessKey pair.

       apiVersion: v1
       kind: Secret
       metadata:
         name: fluid-oss-secret
         namespace: spark
       stringData:
         OSS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
         OSS_ACCESS_KEY_SECRET: <ACCESS_KEY_SECRET>

2. Create the Secret:

       kubectl create -f fluid-oss-secret.yaml

   Expected output:

       secret/fluid-oss-secret created

3. Create spark-fluid-dataset.yaml:

       apiVersion: data.fluid.io/v1alpha1
       kind: Dataset
       metadata:
         name: spark
         namespace: spark
       spec:
         mounts:
           - name: spark
             # Required. The OSS path to cache. Replace <OSS_BUCKET> with your bucket name.
             mountPoint: oss://<OSS_BUCKET>/
             path: /
             options:
               # Required. The OSS endpoint. Example for China (Beijing) internal: oss-cn-beijing-internal.aliyuncs.com
               fs.oss.endpoint: <OSS_ENDPOINT>
             encryptOptions:
               - name: fs.oss.accessKeyId
                 valueFrom:
                   secretKeyRef:
                     name: fluid-oss-secret
                     key: OSS_ACCESS_KEY_ID
               - name: fs.oss.accessKeySecret
                 valueFrom:
                   secretKeyRef:
                     name: fluid-oss-secret
                     key: OSS_ACCESS_KEY_SECRET
         # Data is cached only on nodes matching this affinity rule.
         nodeAffinity:
           required:
             nodeSelectorTerms:
               - matchExpressions:
                   - key: fluid-cloudnative.github.io/node
                     operator: In
                     values:
                       - "true"
         # Required for nodes with the fluid taint.
         tolerations:
           - key: fluid-cloudnative.github.io/node
             operator: Equal
             value: "true"
             effect: NoSchedule

   Key parameters:

   | Parameter | Description |
   |---|---|
   | mountPoint | The OSS path to cache. |
   | fs.oss.endpoint | The OSS bucket endpoint. For the China (Beijing) region internal endpoint, use oss-cn-beijing-internal.aliyuncs.com. |
   | encryptOptions | Reads OSS credentials from the fluid-oss-secret Secret. |

4. Create the Dataset:

       kubectl create -f spark-fluid-dataset.yaml

   Expected output:

       dataset.data.fluid.io/spark created

5. Verify the Dataset status:

       kubectl get -n spark dataset spark -o wide

   Expected output:

       NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE HCFS URL TOTAL FILES CACHE HIT RATIO AGE
       spark NotBound 58m

   The NotBound phase is expected. The Dataset binds to a runtime in the next step.
Step 3: Create a JindoRuntime
A JindoRuntime deploys the JindoFS master and worker pods that back the Dataset with local disk cache.
1. Create spark-fluid-jindoruntime.yaml:

       apiVersion: data.fluid.io/v1alpha1
       kind: JindoRuntime
       metadata:
         # Required. Must match the Dataset name.
         name: spark
         namespace: spark
       spec:
         # Required. Number of worker pods, one per cache node.
         replicas: 3
         tieredstore:
           levels:
             - mediumtype: HDD
               volumeType: hostPath
               # Required. Set based on the number of disks on each node.
               path: /mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/disk5,/mnt/disk6,/mnt/disk7,/mnt/disk8
               # Required. Cache quota per disk. Total cache = 8 x 5500Gi = 42.97 TiB per worker.
               quotaList: 5500Gi,5500Gi,5500Gi,5500Gi,5500Gi,5500Gi,5500Gi,5500Gi
               # Optional. Start evicting cache when disk usage reaches 99%.
               high: "0.99"
               # Optional. Stop evicting when disk usage drops to 95%.
               low: "0.95"
         worker:
           resources:
             requests:
               cpu: 14
               memory: 56Gi
             limits:
               cpu: 14
               memory: 56Gi

   Key parameters:

   | Parameter | Description |
   |---|---|
   | replicas | Number of JindoRuntime worker pods. |
   | mediumtype | Cache storage type. HDD uses local hard disks. |
   | path | Mount paths of the local disks on each worker node. |
   | quotaList | Maximum cache size per disk. Each value maps to the corresponding path. |
   | high | Disk usage ratio that triggers cache eviction. For example, "0.99" means eviction starts at 99% disk usage. |
   | low | Disk usage ratio at which cache eviction stops. For example, "0.95" means eviction stops when disk usage drops to 95%. |

2. Create the JindoRuntime:

       kubectl create -f spark-fluid-jindoruntime.yaml

   Expected output:

       jindoruntime.data.fluid.io/spark created

3. Wait for the JindoRuntime to be ready:

       kubectl get -n spark jindoruntime spark

   Expected output:

       NAME    MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
       spark   Ready          Ready          Ready        2m28s

   All three phases must show Ready before proceeding.

4. Verify the Dataset is now bound:

       kubectl get -n spark dataset spark -o wide

   Expected output:

       NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE HCFS URL TOTAL FILES CACHE HIT RATIO AGE
       spark [Calculating] 0.00B 128.91TiB Bound spark-jindofs-master-0.spark:19434 [Calculating] 2m5s

   The Bound phase confirms the Dataset is ready to serve cached data. The cache capacity of 128.91TiB is the combined quota of the three workers.
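The cache capacity reported for the bound Dataset follows directly from the quota settings. As a worked check, the sketch below derives the path and quotaList strings and the resulting capacity from the example's figures (eight disks per node, 5500Gi per disk, three workers). The constant names are illustrative only and are not Fluid fields.

```python
DISKS_PER_WORKER = 8       # /mnt/disk1 through /mnt/disk8
QUOTA_PER_DISK_GIB = 5500  # each quotaList entry in the example
WORKERS = 3                # spec.replicas in the example
GIB_PER_TIB = 1024

# Comma-separated values in the form used by the path and quotaList fields.
paths = ",".join(f"/mnt/disk{i}" for i in range(1, DISKS_PER_WORKER + 1))
quota_list = ",".join(f"{QUOTA_PER_DISK_GIB}Gi" for _ in range(DISKS_PER_WORKER))

per_worker_tib = DISKS_PER_WORKER * QUOTA_PER_DISK_GIB / GIB_PER_TIB
total_tib = per_worker_tib * WORKERS

print(paths)
print(quota_list)
print(f"per worker: {per_worker_tib:.2f} TiB")  # per worker: 42.97 TiB
print(f"all workers: {total_tib:.2f} TiB")      # all workers: 128.91 TiB
```

The 128.91 TiB total matches the CACHE CAPACITY column in the sample output. If your nodes have a different number of disks, adjust both lists together so that each quota still lines up with a path.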
Step 4: (Optional) Prefetch data
The cache is empty until Spark jobs run and access data, so the first job run reads everything from OSS over the network. To warm the cache before any job runs, so that even the first job benefits from local reads, use a DataLoad to preload OSS data into the cache.
1. Create spark-fluid-dataload.yaml:

       apiVersion: data.fluid.io/v1alpha1
       kind: DataLoad
       metadata:
         name: spark
         namespace: spark
       spec:
         dataset:
           name: spark
           namespace: spark
         loadMetadata: true

2. Create the DataLoad:

       kubectl create -f spark-fluid-dataload.yaml

   Expected output:

       dataload.data.fluid.io/spark created

3. Monitor prefetching progress:

       kubectl get -n spark dataload spark -w

   Expected output:

       NAME    DATASET   PHASE       AGE     DURATION
       spark   spark     Executing   20s     Unfinished
       spark   spark     Complete    9m31s   8m37s

   Prefetching is complete when the phase shows Complete.

4. Verify cached data volume:

       kubectl get -n spark dataset spark -o wide

   Expected output:

       NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE HCFS URL TOTAL FILES CACHE HIT RATIO AGE
       spark 0.00B 326.85GiB 128.91TiB 0.0% Bound spark-jindofs-master-0.spark:19434 [Calculating] 19m

   The CACHED column shows 326.85GiB, confirming that data is preloaded.
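The sample outputs above are enough for a rough estimate of prefetch throughput. The sketch below parses the DataLoad duration and divides the cached volume by it. It assumes that the whole 326.85 GiB was loaded during the 8m37s DataLoad run, and the helper `parse_duration_seconds` is a hypothetical name written for this example rather than part of any kubectl or Fluid tooling.

```python
import re


def parse_duration_seconds(text: str) -> int:
    """Parse a compact duration such as '8m37s' or '1h2m3s' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject input that contains anything other than number-unit pairs.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(number) * units[unit] for number, unit in parts)


cached_gib = 326.85                        # CACHED column in the sample output
seconds = parse_duration_seconds("8m37s")  # DURATION column in the sample output
throughput_mib_s = cached_gib * 1024 / seconds

print(f"{seconds} s")                      # 517 s
print(f"{throughput_mib_s:.0f} MiB/s")     # 647 MiB/s
```

The result is an aggregate figure across all three workers for this sample run only; your own numbers depend on bucket size, object layout, and network bandwidth.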
Step 5: Run a Spark job
Choose a method based on your setup:
- Use POSIX if your Spark job reads local files and you want no code changes. Fluid mounts a PVC that Spark accesses via file:// paths.
- Use HCFS if your job already uses oss://, s3://, or s3a:// paths, or you want to route existing OSS access through the cache. This method requires a custom Spark image with JindoSDK.
Method 1: Use POSIX APIs
Fluid creates a PVC for the Dataset. Mount the PVC in the Spark driver and executor pods to access cached data using standard file:// paths. No code changes are required.
1. Create spark-pagerank-fluid-posix.yaml:

   Note: This example uses the spark:3.5.4 image from the Spark community. If the image fails to pull due to network issues, sync it to your own image repository or build a custom image.

       apiVersion: sparkoperator.k8s.io/v1beta2
       kind: SparkApplication
       metadata:
         name: spark-pagerank-fluid-posix
         namespace: spark
       spec:
         type: Scala
         mode: cluster
         image: spark:3.5.4
         mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
         mainClass: org.apache.spark.examples.SparkPageRank
         arguments:
           # Access cached data using the file:// format.
           - file:///mnt/fluid/data/pagerank_dataset.txt
           - "10"
         sparkVersion: 3.5.4
         driver:
           cores: 1
           coreLimit: 1200m
           memory: 512m
           volumeMounts:
             # Mount the PVC created by Fluid for the dataset.
             - name: spark
               mountPath: /mnt/fluid
           serviceAccount: spark-operator-spark
         executor:
           instances: 2
           cores: 1
           coreLimit: "1"
           memory: 4g
           volumeMounts:
             # Mount the PVC created by Fluid for the dataset.
             - name: spark
               mountPath: /mnt/fluid
         volumes:
           # The PVC name matches the Dataset name.
           - name: spark
             persistentVolumeClaim:
               claimName: spark
         restartPolicy:
           type: Never

2. Submit the job:

       kubectl create -f spark-pagerank-fluid-posix.yaml

   Expected output:

       sparkapplication.sparkoperator.k8s.io/spark-pagerank-fluid-posix created

3. Monitor the job status:

       kubectl get -n spark sparkapplication spark-pagerank-fluid-posix -w

   Expected output:

       NAME                         STATUS       ATTEMPTS   START                  FINISH                 AGE
       spark-pagerank-fluid-posix   RUNNING      1          2025-01-16T11:06:15Z   <no value>             87s
       spark-pagerank-fluid-posix   RUNNING      1          2025-01-16T11:06:15Z   <no value>             102s
       spark-pagerank-fluid-posix   RUNNING      1          2025-01-16T11:06:15Z   <no value>             102s
       spark-pagerank-fluid-posix   SUCCEEDING   1          2025-01-16T11:06:15Z   2025-01-16T11:07:59Z   104s
       spark-pagerank-fluid-posix   COMPLETED    1          2025-01-16T11:06:15Z   2025-01-16T11:07:59Z   104s
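With POSIX access, the object path inside the bucket becomes a file path under the PVC mount point. The sketch below shows that translation for the configuration used in this guide, where the Dataset mounts oss://<OSS_BUCKET>/ at path / and the pods mount the PVC at /mnt/fluid. The function `oss_to_file_uri` and the bucket name `examplebucket` are hypothetical, and the mapping is an assumption that holds only for this mount layout.

```python
import posixpath


def oss_to_file_uri(oss_uri: str, mount_point: str, mount_path: str = "/mnt/fluid") -> str:
    """Translate an OSS URI under the Dataset mountPoint into a file:// URI.

    Assumes the Dataset exposes mount_point at the root of the volume (path: /)
    and that the pod mounts the PVC at mount_path.
    """
    prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
    if not oss_uri.startswith(prefix):
        raise ValueError(f"{oss_uri} is not under {mount_point}")
    relative = oss_uri[len(prefix):]
    return "file://" + posixpath.join(mount_path, relative)


# "examplebucket" is a placeholder bucket name used only for illustration.
uri = oss_to_file_uri(
    "oss://examplebucket/data/pagerank_dataset.txt",
    mount_point="oss://examplebucket/",
)
print(uri)  # file:///mnt/fluid/data/pagerank_dataset.txt
```

The printed URI is the same argument that the SparkApplication manifest passes to SparkPageRank.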
Method 2: Use HCFS APIs
With HCFS access, Spark reads data using oss://, s3://, or s3a:// URIs. JindoSDK intercepts these requests and routes reads through JindoRuntime, hitting the local cache. This method requires a custom Spark image that includes JindoSDK.
1. Get the HCFS URL of the Dataset:

       kubectl get -n spark dataset spark -o wide

   Expected output:

       NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE HCFS URL TOTAL FILES CACHE HIT RATIO AGE
       spark 0.00B 326.85GiB 128.91TiB 0.0% Bound spark-jindofs-master-0.spark:19434 [Calculating] 30m

   Note the HCFS URL (spark-jindofs-master-0.spark:19434). Set fs.jindofsx.namespace.rpc.address to this value when you create the SparkApplication manifest.

2. If you don't already have a Spark image with JindoSDK, build one using this Dockerfile:

       ARG SPARK_IMAGE=spark:3.5.4
       FROM ${SPARK_IMAGE}

       # Add JindoSDK dependencies for OSS acceleration
       ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-core/6.4.0/jindo-core-6.4.0.jar ${SPARK_HOME}/jars
       ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-sdk/6.4.0/jindo-sdk-6.4.0.jar ${SPARK_HOME}/jars

3. Create spark-pagerank-fluid-hcfs.yaml. Replace <SPARK_IMAGE> with your image, <OSS_BUCKET> with your bucket name, and <OSS_ENDPOINT> with your bucket endpoint.

       apiVersion: sparkoperator.k8s.io/v1beta2
       kind: SparkApplication
       metadata:
         name: spark-pagerank-fluid-hcfs
         namespace: spark
       spec:
         type: Scala
         mode: cluster
         # Required. Must be a Spark image that includes JindoSDK.
         image: <SPARK_IMAGE>
         mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
         mainClass: org.apache.spark.examples.SparkPageRank
         arguments:
           # Choose one URI format. All three are routed through JindoRuntime.
           # oss:// format:
           - oss://<OSS_BUCKET>/data/pagerank_dataset.txt
           # s3:// format (commented out):
           # - s3://<OSS_BUCKET>/data/pagerank_dataset.txt
           # s3a:// format (commented out):
           # - s3a://<OSS_BUCKET>/data/pagerank_dataset.txt
           - "10"
         sparkVersion: 3.5.4
         hadoopConf:
           #===================
           # OSS access via oss://
           #===================
           fs.oss.impl: com.aliyun.jindodata.oss.JindoOssFileSystem
           fs.oss.endpoint: <OSS_ENDPOINT>
           fs.oss.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
           # OSS access via s3://
           fs.s3.impl: com.aliyun.jindodata.s3.JindoS3FileSystem
           fs.s3.endpoint: <OSS_ENDPOINT>
           fs.s3.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
           # OSS access via s3a://
           fs.s3a.impl: com.aliyun.jindodata.s3.JindoS3FileSystem
           fs.s3a.endpoint: <OSS_ENDPOINT>
           fs.s3a.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
           #===================
           # JindoFS cache routing
           #===================
           fs.xengine: jindofsx
           # Required. Set to the HCFS URL from step 1.
           fs.jindofsx.namespace.rpc.address: spark-jindofs-master-0.spark:19434
           fs.jindofsx.data.cache.enable: "true"
         driver:
           cores: 1
           coreLimit: 1200m
           memory: 512m
           envFrom:
             - secretRef:
                 name: spark-oss-secret
           serviceAccount: spark-operator-spark
         executor:
           instances: 2
           cores: 2
           coreLimit: "2"
           memory: 8g
           envFrom:
             - secretRef:
                 name: spark-oss-secret
         restartPolicy:
           type: Never

4. Submit the job:

       kubectl create -f spark-pagerank-fluid-hcfs.yaml

   Expected output:

       sparkapplication.sparkoperator.k8s.io/spark-pagerank-fluid-hcfs created

5. Monitor the job status:

       kubectl get -n spark sparkapplication spark-pagerank-fluid-hcfs -w

   Expected output:

       NAME                        STATUS       ATTEMPTS   START                  FINISH                 AGE
       spark-pagerank-fluid-hcfs   RUNNING      1          2025-01-16T11:21:16Z   <no value>             9s
       spark-pagerank-fluid-hcfs   RUNNING      1          2025-01-16T11:21:16Z   <no value>             15s
       spark-pagerank-fluid-hcfs   RUNNING      1          2025-01-16T11:21:16Z   <no value>             77s
       spark-pagerank-fluid-hcfs   RUNNING      1          2025-01-16T11:21:16Z   <no value>             77s
       spark-pagerank-fluid-hcfs   SUCCEEDING   1          2025-01-16T11:21:16Z   2025-01-16T11:22:34Z   78s
       spark-pagerank-fluid-hcfs   COMPLETED    1          2025-01-16T11:21:16Z   2025-01-16T11:22:34Z   78s
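The hadoopConf block repeats the same three settings for each URI scheme and then adds the cache routing keys. If you generate manifests from a script, a small helper can build that mapping from the two values that change between clusters. The sketch below uses only the property names and class names that appear in the manifest above; the function `build_hadoop_conf` itself is a hypothetical helper, not part of JindoSDK or the Spark operator, and the endpoint shown is the Beijing internal example from this guide.

```python
from typing import Dict


def build_hadoop_conf(endpoint: str, rpc_address: str) -> Dict[str, str]:
    """Build the hadoopConf mapping for oss://, s3://, and s3a:// access via JindoRuntime."""
    provider = "com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider"
    implementations = {
        "oss": "com.aliyun.jindodata.oss.JindoOssFileSystem",
        "s3": "com.aliyun.jindodata.s3.JindoS3FileSystem",
        "s3a": "com.aliyun.jindodata.s3.JindoS3FileSystem",
    }
    conf: Dict[str, str] = {}
    for scheme, implementation in implementations.items():
        conf[f"fs.{scheme}.impl"] = implementation
        conf[f"fs.{scheme}.endpoint"] = endpoint
        conf[f"fs.{scheme}.credentials.provider"] = provider
    # Cache routing: point JindoSDK at the JindoRuntime master (the HCFS URL).
    conf["fs.xengine"] = "jindofsx"
    conf["fs.jindofsx.namespace.rpc.address"] = rpc_address
    conf["fs.jindofsx.data.cache.enable"] = "true"
    return conf


hadoop_conf = build_hadoop_conf(
    endpoint="oss-cn-beijing-internal.aliyuncs.com",
    rpc_address="spark-jindofs-master-0.spark:19434",
)
for key, value in hadoop_conf.items():
    print(f"{key}: {value}")
```

Keeping the RPC address as a parameter makes it harder to leave a stale HCFS URL in the manifest after the Dataset is recreated.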
Step 6: (Optional) Clean up
Delete resources in the following order to avoid dependency conflicts:
kubectl delete -f spark-pagerank-fluid-posix.yaml
kubectl delete -f spark-pagerank-fluid-hcfs.yaml
kubectl delete -f spark-fluid-dataload.yaml
kubectl delete -f spark-fluid-jindoruntime.yaml
kubectl delete -f spark-fluid-dataset.yaml
kubectl delete -f fluid-oss-secret.yaml
What's next
- Elastic datasets: Learn more about Fluid's dataset abstraction and supported runtimes.
- Best practices for the cache optimization policies of Fluid: Optimize cache performance for your workloads.