
Container Service for Kubernetes:Use Fluid to accelerate data access for Spark applications

Last Updated: Mar 12, 2025

This topic describes how to use Fluid to accelerate data access and how to use JindoRuntime to accelerate access to data stored in Object Storage Service (OSS). You can refer to this topic to improve the performance of data-intensive applications.

Introduction to Fluid

Fluid is an open source, Kubernetes-native orchestrator and accelerator for distributed datasets. Fluid is developed for data-intensive applications in cloud-native scenarios, such as big data applications and AI applications. Key features of Fluid:

  • Fluid provides native support for dataset abstraction. This feature provides fundamental support for data-intensive applications, enables efficient data access, and improves the cost-effectiveness of data management in multiple aspects.

  • Fluid provides an extensible data engine plug-in with a unified interface for integration with third-party storage services. A variety of runtimes are supported.

  • Fluid automates data operations and supports multiple modes to integrate with automated O&M systems.

  • Fluid accelerates data access by combining the data caching technology with elastic scaling and data affinity-scheduling.

  • Fluid is independent of runtime platforms and supports Kubernetes clusters, Container Service for Kubernetes (ACK) Edge clusters, and ACK Serverless clusters. Fluid is also suitable for multi-cluster scenarios and hybrid cloud scenarios.

For more information about Fluid, see Elastic datasets.

Step 1: Create a dedicated node pool for Fluid

Create a dedicated node pool named fluid for Fluid in your ACK cluster. The node pool is used to deploy the JindoRuntime worker pods in Fluid. The node pool created in this example contains three nodes that are deployed on ecs.d1ne.4xlarge instances, which belong to the network-enhanced big data instance family. The fluid-cloudnative.github.io/node="true" label and the fluid-cloudnative.github.io/node="true":NoSchedule taint are added to each node. Each node is equipped with eight 5,905 GB high-throughput local SATA HDDs, which are formatted and attached to the following directories: /mnt/disk1, /mnt/disk2, ..., and /mnt/disk8. For more information about how to create a node pool, see Create and manage a node pool. For more information about how to select instance types for a node pool, see Best practices for the cache optimization policies of Fluid.
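The eight data directories reappear in Step 3 as comma-separated `path` and `quotaList` values in the JindoRuntime spec. As a small sketch (the constants below are assumptions that mirror this example's node pool), those values can be generated programmatically to avoid typos when the disk count changes:

```python
# Illustrative helper, not part of Fluid: build the comma-separated
# `path` and `quotaList` values used by the JindoRuntime in Step 3.
DISKS = 8
QUOTA = "5500Gi"  # cache quota per disk, chosen below the 5,905 GB raw disk size

path = ",".join(f"/mnt/disk{i}" for i in range(1, DISKS + 1))
quota_list = ",".join([QUOTA] * DISKS)

print(path)        # /mnt/disk1,/mnt/disk2,...,/mnt/disk8
print(quota_list)  # 5500Gi,5500Gi,...,5500Gi
```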

Step 2: Create a dataset

  1. Create a Secret YAML file named fluid-oss-secret.yaml to store the credentials used to access your OSS bucket.

    Replace <ACCESS_KEY_ID> and <ACCESS_KEY_SECRET> with an AccessKey pair of your Alibaba Cloud account.

    apiVersion: v1
    kind: Secret
    metadata:
      name: fluid-oss-secret
      namespace: spark
    stringData:
      OSS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
      OSS_ACCESS_KEY_SECRET: <ACCESS_KEY_SECRET>
  2. Run the following command to create a Secret:

    kubectl create -f fluid-oss-secret.yaml

    Expected output:

    secret/fluid-oss-secret created
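For reference, the `stringData` field in the manifest above accepts plaintext values; the Kubernetes API server base64-encodes them into the Secret's `data` field on creation. A minimal sketch of that encoding (the placeholder value is not a real credential):

```python
import base64

# Kubernetes stores Secret values base64-encoded under `data`; using
# `stringData` lets you supply the plaintext and have the API server
# encode it for you.
plaintext = "<ACCESS_KEY_ID>"  # placeholder, as in fluid-oss-secret.yaml
encoded = base64.b64encode(plaintext.encode()).decode()

# Decoding recovers the original value, e.g. when reading the Secret back.
print(encoded)
```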
  3. Create a YAML file named spark-fluid-dataset.yaml and copy the following content to the file:

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: spark
      namespace: spark
    spec:
      mounts:
      - name: spark
        # The OSS path for which you want to accelerate data access. Replace <OSS_BUCKET> with the name of your OSS bucket. 
        mountPoint: oss://<OSS_BUCKET>/
        path: /
        options:
          # The endpoint of the OSS bucket. Replace <OSS_ENDPOINT> with the endpoint of your OSS bucket. 
          # For example, the internal endpoint for OSS buckets in the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com. 
          fs.oss.endpoint: <OSS_ENDPOINT>
        encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: fluid-oss-secret
              key: OSS_ACCESS_KEY_ID
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: fluid-oss-secret
              key: OSS_ACCESS_KEY_SECRET
      # Data will be cached to nodes that match the following affinity rules. 
      nodeAffinity:
        required:
          nodeSelectorTerms:
          - matchExpressions:
            - key: fluid-cloudnative.github.io/node
              operator: In
              values:
              - "true"
      # Tolerations for specific node taints. 
      tolerations:
      - key: fluid-cloudnative.github.io/node
        operator: Equal
        value: "true"
        effect: NoSchedule

    The following list describes the parameters in the preceding code block:

    • mountPoint: the OSS path for which you want to accelerate data access.

    • fs.oss.endpoint: the endpoint of the OSS bucket. For example, the internal endpoint for OSS buckets in the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com.

    • encryptOptions: retrieves the credentials used to access the OSS bucket from the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET parameters of the fluid-oss-secret Secret.
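The `encryptOptions` resolution can be sketched roughly as follows: each option takes its value from a named key of a named Secret in the same namespace. This is an illustration only (the dicts mirror spark-fluid-dataset.yaml, and the credential values are placeholders), not Fluid's actual implementation:

```python
# Placeholder Secret store keyed by (namespace, secret name).
secrets = {
    ("spark", "fluid-oss-secret"): {
        "OSS_ACCESS_KEY_ID": "<ACCESS_KEY_ID>",
        "OSS_ACCESS_KEY_SECRET": "<ACCESS_KEY_SECRET>",
    }
}

# Mirrors the encryptOptions list in spark-fluid-dataset.yaml.
encrypt_options = [
    {"name": "fs.oss.accessKeyId",
     "valueFrom": {"secretKeyRef": {"name": "fluid-oss-secret",
                                    "key": "OSS_ACCESS_KEY_ID"}}},
    {"name": "fs.oss.accessKeySecret",
     "valueFrom": {"secretKeyRef": {"name": "fluid-oss-secret",
                                    "key": "OSS_ACCESS_KEY_SECRET"}}},
]

def resolve(namespace, options):
    """Resolve each option to the value of the referenced Secret key."""
    resolved = {}
    for opt in options:
        ref = opt["valueFrom"]["secretKeyRef"]
        resolved[opt["name"]] = secrets[(namespace, ref["name"])][ref["key"]]
    return resolved

print(resolve("spark", encrypt_options))
```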

  4. Run the following command to create a dataset:

    kubectl create -f spark-fluid-dataset.yaml

    Expected output:

    dataset.data.fluid.io/spark created
  5. Run the following command to query the status of the dataset:

    kubectl get -n spark dataset spark -o wide

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE      HCFS URL   TOTAL FILES   CACHE HIT RATIO   AGE
    spark                                                                  NotBound                                              58m

    The output shows that the dataset is in the NotBound state.

Step 3: Create a JindoRuntime

  1. Create a file named spark-fluid-jindoruntime.yaml and copy the following content to the file. The file is used to create a JindoRuntime.

    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      # The name must be the same as the name of the dataset you created. 
      name: spark
      namespace: spark
    spec:
      # The number of worker pods. 
      replicas: 3
      tieredstore:
        levels:
        # The cache type is HDD. 
        - mediumtype: HDD
          # The volume type used for the cache directories. hostPath mounts local directories on the node. 
          volumeType: hostPath
          # Set the value based on the number of disks provided by the node. 
          path: /mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/disk5,/mnt/disk6,/mnt/disk7,/mnt/disk8
          # The cache capacity of each worker pod. 
          quotaList: 5500Gi,5500Gi,5500Gi,5500Gi,5500Gi,5500Gi,5500Gi,5500Gi
          high: "0.99"
          low: "0.95"
      worker:
        resources:
          requests:
            cpu: 14
            memory: 56Gi
          limits:
            cpu: 14
            memory: 56Gi

    The following list describes the parameters in the preceding code block:

    • replicas: the number of worker pods in the JindoFS cluster.

    • mediumtype: the type of the cache medium, such as HDD.

    • path: the directories on the node in which the cached data is stored.

    • quotaList: the cache capacity of each directory specified in path, in the same order.

    • high: the upper watermark of the cache usage. When the usage exceeds this ratio, cached data is evicted.

    • low: the lower watermark of the cache usage. Eviction stops when the usage drops below this ratio.
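A quick back-of-the-envelope check of the tiered store configured above (3 worker replicas, 8 disks each, 5500Gi per disk) shows where the 128.91TiB CACHE CAPACITY reported later comes from, and what the high/low watermarks translate to in absolute terms. This is a sketch under the stated interpretation of high/low as eviction watermark ratios:

```python
# Total cache capacity across the JindoFS cluster.
replicas = 3
disks = 8
quota_gib = 5500

total_gib = replicas * disks * quota_gib  # 132,000 GiB
total_tib = total_gib / 1024              # ~128.91 TiB, matching the
                                          # CACHE CAPACITY column below

# high/low are watermark ratios: eviction starts once usage exceeds
# `high` and frees space until usage drops below `low`.
high, low = 0.99, 0.95
evict_start_gib = high * total_gib        # ~130,680 GiB
evict_stop_gib = low * total_gib          # ~125,400 GiB

print(round(total_tib, 2), evict_start_gib, evict_stop_gib)
```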

  2. Run the following command to create a JindoRuntime:

    kubectl create -f spark-fluid-jindoruntime.yaml

    Expected output:

    jindoruntime.data.fluid.io/spark created
  3. Run the following command to query the status of the JindoRuntime:

    kubectl get -n spark jindoruntime spark

    Expected output:

    NAME    MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
    spark   Ready          Ready          Ready        2m28s

    In the output, Ready is displayed in the FUSE PHASE column. This indicates that the JindoRuntime is deployed.

  4. Run the following command to query the status of the dataset:

    kubectl get -n spark dataset spark -o wide

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   HCFS URL                             TOTAL FILES     CACHE HIT RATIO   AGE
    spark   [Calculating]    0.00B    128.91TiB                            Bound   spark-jindofs-master-0.spark:19434   [Calculating]                     2m5

    The output shows that the dataset is in the Bound state. This indicates that the dataset is deployed.

Step 4: (Optional) Prefetch data

First-time queries cannot hit the cache. To accelerate them, Fluid provides the DataLoad resource, which prefetches data into the cache before jobs run. This improves data access speed, data processing efficiency, and overall system performance.

  1. Create a file named spark-fluid-dataload.yaml and copy the following content to the file. The file is used to create a DataLoad.

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: spark
      namespace: spark
    spec:
      dataset:
        name: spark
        namespace: spark
      loadMetadata: true
  2. Run the following command to create a DataLoad:

    kubectl create -f spark-fluid-dataload.yaml

    Expected output:

    dataload.data.fluid.io/spark created
  3. Run the following command to query the data prefetching progress:

    kubectl get -n spark dataload spark -w

    Expected output:

    NAME    DATASET   PHASE      AGE     DURATION
    spark   spark     Executing   20s   Unfinished
    spark   spark     Complete   9m31s   8m37s

    The DURATION column shows 8m37s, which indicates that data prefetching was completed in 8 minutes and 37 seconds.
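The durations above use kubectl's compact format. A small illustrative helper (not a kubectl feature) for converting such values into seconds, e.g. when scripting around prefetch jobs:

```python
import re

def parse_duration(s: str) -> int:
    """Convert a kubectl-style duration such as '8m37s' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([hms])", s))

print(parse_duration("8m37s"))  # → 517 seconds
```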

  4. Run the following command to query the status of the dataset:

    kubectl get -n spark dataset spark -o wide

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   HCFS URL                             TOTAL FILES     CACHE HIT RATIO   AGE
    spark   0.00B            326.85GiB   128.91TiB        0.0%                Bound   spark-jindofs-master-0.spark:19434   [Calculating]                     19m

    The CACHED column now shows 326.85GiB, compared with 0.00B in the previous query. This indicates that the data is preloaded to the cache.

Step 5: Run a Spark job

Method 1: Use Portable Operating System Interface (POSIX) APIs

  1. Create a file named spark-pagerank-fluid-posix.yaml and copy the following content to the file. The file is used to create a SparkApplication.

    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: spark-pagerank-fluid-posix
      namespace: spark
    spec:
      type: Scala
      mode: cluster
      image: spark:3.5.4
      mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
      mainClass: org.apache.spark.examples.SparkPageRank
      arguments:
      # Access local files by using the file:// format. 
      - file:///mnt/fluid/data/pagerank_dataset.txt
      - "10"
      sparkVersion: 3.5.4
      driver:
        cores: 1
        coreLimit: 1200m
        memory: 512m
        volumeMounts:
        # Mount the persistent volume claim (PVC) used by the dataset to the /mnt/fluid path. 
        - name: spark
          mountPath: /mnt/fluid
        serviceAccount: spark-operator-spark
      executor:
        instances: 2
        cores: 1
        coreLimit: "1"
        memory: 4g
        volumeMounts:
        # Mount the PVC used by the dataset to the /mnt/fluid path. 
        - name: spark
          mountPath: /mnt/fluid
      volumes:
      # Specify the PVC created by Fluid for the dataset. The PVC name is the same as the dataset name. 
      - name: spark
        persistentVolumeClaim:
          claimName: spark
      restartPolicy:
        type: Never
    Note

    The preceding sample code block uses an image provided by the Spark community. If you fail to pull the image due to network issues, we recommend that you synchronize the image to your image repository. You can also build a custom image and push the image to your image repository.

  2. Run the following command to submit a job:

    kubectl create -f spark-pagerank-fluid-posix.yaml

    Expected output:

    sparkapplication.sparkoperator.k8s.io/spark-pagerank-fluid-posix created
  3. Run the following command to view the status of the Spark job:

    kubectl get -n spark sparkapplication spark-pagerank-fluid-posix -w

    Expected output:

    NAME                         STATUS    ATTEMPTS   START                  FINISH       AGE
    spark-pagerank-fluid-posix   RUNNING   1          2025-01-16T11:06:15Z   <no value>   87s
    spark-pagerank-fluid-posix   RUNNING   1          2025-01-16T11:06:15Z   <no value>   102s
    spark-pagerank-fluid-posix   RUNNING   1          2025-01-16T11:06:15Z   <no value>   102s
    spark-pagerank-fluid-posix   SUCCEEDING   1          2025-01-16T11:06:15Z   2025-01-16T11:07:59Z   104s
    spark-pagerank-fluid-posix   COMPLETED    1          2025-01-16T11:06:15Z   2025-01-16T11:07:59Z   104s

    The output shows that the job is completed.
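The POSIX method works because the dataset's mountPoint oss://&lt;OSS_BUCKET&gt;/ is mapped to path / and exposed through FUSE under the PVC mount path /mnt/fluid, so an object key becomes an ordinary local file path. A sketch of that translation (the bucket name is a placeholder, as in the manifests above):

```python
# Sketch of the OSS-to-FUSE path translation the POSIX method relies on.
MOUNT_PATH = "/mnt/fluid"  # PVC mount path from the SparkApplication spec

def oss_to_fuse(uri, bucket="<OSS_BUCKET>", mount=MOUNT_PATH):
    """Map an oss:// URI inside the mounted bucket to its local file path."""
    prefix = f"oss://{bucket}/"
    assert uri.startswith(prefix), "URI must be inside the mounted bucket"
    return f"{mount}/{uri[len(prefix):]}"

print(oss_to_fuse("oss://<OSS_BUCKET>/data/pagerank_dataset.txt"))
# → /mnt/fluid/data/pagerank_dataset.txt
```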

Method 2: Use Hadoop Compatible File System (HCFS) APIs

  1. Run the following command to query the URL of the HCFS used by the dataset:

    kubectl get -n spark dataset spark -o wide

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   HCFS URL                             TOTAL FILES     CACHE HIT RATIO   AGE
    spark   0.00B            326.85GiB   128.91TiB        0.0%                Bound   spark-jindofs-master-0.spark:19434   [Calculating]                     30m

    The output shows that the HCFS URL of the dataset is spark-jindofs-master-0.spark:19434. When you configure the Spark job, you must set the fs.jindofsx.namespace.rpc.address parameter to the HCFS URL.
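The JindoFS-related hadoopConf entries can be derived from this HCFS URL. A hypothetical helper (an illustration, not part of any SDK) that produces the three properties used in the manifest of the next step:

```python
def jindofs_conf(hcfs_url: str) -> dict:
    """Build the JindoFS hadoopConf entries from the dataset's HCFS URL."""
    return {
        "fs.xengine": "jindofsx",
        "fs.jindofsx.namespace.rpc.address": hcfs_url,
        "fs.jindofsx.data.cache.enable": "true",
    }

conf = jindofs_conf("spark-jindofs-master-0.spark:19434")
print(conf["fs.jindofsx.namespace.rpc.address"])
```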

  2. Create a file named spark-pagerank-fluid-hcfs.yaml and copy the following content to the file. The file is used to create a SparkApplication.

    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: spark-pagerank-fluid-hcfs
      namespace: spark
    spec:
      type: Scala
      mode: cluster
      # Replace <SPARK_IMAGE> with the Spark image that you want to use. The image must include JindoSDK dependencies. 
      image: <SPARK_IMAGE>
      mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
      mainClass: org.apache.spark.examples.SparkPageRank
      arguments:
      # Select one of the following methods. Replace <OSS_BUCKET> with the name of your OSS bucket. 
      # Method 1: Access the OSS bucket by using the oss:// format. 
      - oss://<OSS_BUCKET>/data/pagerank_dataset.txt
      # Method 2: Access the OSS bucket by using the s3:// format. 
      # - s3://<OSS_BUCKET>/data/pagerank_dataset.txt
      # Method 3: Access the OSS bucket by using the s3a:// format. 
      # - s3a://<OSS_BUCKET>/data/pagerank_dataset.txt
      # The number of iterations.
      - "10"
      sparkVersion: 3.5.4
      hadoopConf:
        #===================
        # OSS access configurations.
        #===================
        # You can access the OSS bucket by using the oss:// format. 
        fs.oss.impl: com.aliyun.jindodata.oss.JindoOssFileSystem
        # The endpoint of the OSS bucket. Replace <OSS_ENDPOINT> with the endpoint of your OSS bucket. 
        # For example, the internal endpoint for OSS buckets in the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com. 
        fs.oss.endpoint: <OSS_ENDPOINT>
        # Retrieve the credentials of your OSS bucket from environment variables. 
        fs.oss.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
    
        # You can access the OSS bucket by using the s3:// format. 
        fs.s3.impl: com.aliyun.jindodata.s3.JindoS3FileSystem
        # The endpoint of the OSS bucket. Replace <OSS_ENDPOINT> with the endpoint of your OSS bucket. 
        # For example, the internal endpoint for OSS buckets in the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com. 
        fs.s3.endpoint: <OSS_ENDPOINT>
        # Retrieve the credentials of your OSS bucket from environment variables. 
        fs.s3.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
    
        # You can access the OSS bucket by using the s3a:// format. 
        fs.s3a.impl: com.aliyun.jindodata.s3.JindoS3FileSystem
        # The endpoint of the OSS bucket. Replace <OSS_ENDPOINT> with the endpoint of your OSS bucket. 
        # For example, the internal endpoint for OSS buckets in the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com. 
        fs.s3a.endpoint: <OSS_ENDPOINT>
        # Retrieve the credentials of your OSS bucket from environment variables. 
        fs.s3a.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
    
        #===================
        # JindoFS configurations.
        #===================
        fs.xengine: jindofsx
        # The HCFS URL of the dataset. 
        fs.jindofsx.namespace.rpc.address: spark-jindofs-master-0.spark:19434
        fs.jindofsx.data.cache.enable: "true"
      driver:
        cores: 1
        coreLimit: 1200m
        memory: 512m
        envFrom:
        - secretRef:
            name: spark-oss-secret
        serviceAccount: spark-operator-spark
      executor:
        instances: 2
        cores: 2
        coreLimit: "2"
        memory: 8g
        envFrom:
        - secretRef:
            name: spark-oss-secret
      restartPolicy:
        type: Never
    Note

    The Spark image used by the preceding code block must include JindoSDK dependencies. You can use the following Dockerfile template to build a custom image and push the image to your image repository:

    ARG SPARK_IMAGE=spark:3.5.4
    
    FROM ${SPARK_IMAGE}
    
    # Add dependency for JindoSDK support
    ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-core/6.4.0/jindo-core-6.4.0.jar ${SPARK_HOME}/jars
    ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-sdk/6.4.0/jindo-sdk-6.4.0.jar ${SPARK_HOME}/jars
  3. Run the following command to submit a Spark job:

    kubectl create -f spark-pagerank-fluid-hcfs.yaml

    Expected output:

    sparkapplication.sparkoperator.k8s.io/spark-pagerank-fluid-hcfs created
  4. Run the following command to query the status of the Spark job:

    kubectl get -n spark sparkapplication spark-pagerank-fluid-hcfs -w

    Expected output:

    NAME                        STATUS    ATTEMPTS   START                  FINISH       AGE
    spark-pagerank-fluid-hcfs   RUNNING   1          2025-01-16T11:21:16Z   <no value>   9s
    spark-pagerank-fluid-hcfs   RUNNING   1          2025-01-16T11:21:16Z   <no value>   15s
    spark-pagerank-fluid-hcfs   RUNNING   1          2025-01-16T11:21:16Z   <no value>   77s
    spark-pagerank-fluid-hcfs   RUNNING   1          2025-01-16T11:21:16Z   <no value>   77s
    spark-pagerank-fluid-hcfs   SUCCEEDING   1          2025-01-16T11:21:16Z   2025-01-16T11:22:34Z   78s
    spark-pagerank-fluid-hcfs   COMPLETED    1          2025-01-16T11:21:16Z   2025-01-16T11:22:34Z   78s

    The output shows that the job is completed.

Step 6: (Optional) Clear the environment

After you complete the preceding steps, you can run the following commands to delete the resources that are no longer required:

kubectl delete -f spark-pagerank-fluid-posix.yaml
kubectl delete -f spark-pagerank-fluid-hcfs.yaml
kubectl delete -f spark-fluid-dataload.yaml
kubectl delete -f spark-fluid-jindoruntime.yaml
kubectl delete -f spark-fluid-dataset.yaml
kubectl delete -f fluid-oss-secret.yaml