Container Service for Kubernetes: Accelerate Spark Jobs with Fluid

Last Updated: Mar 26, 2026

Spark jobs that read large datasets from Object Storage Service (OSS) repeatedly fetch the same data over the network, adding latency on every run. Fluid caches OSS data on local disks so that Spark executors read from the cache instead of fetching data from OSS on each access. This guide shows you how to set up Fluid with JindoRuntime on Container Service for Kubernetes (ACK) and run a Spark PageRank job against cached data.

Prerequisites

Before you begin, ensure that you have:

Note

The examples in this guide use spark.jobNamespaces=["spark"]. To use a different namespace, update the namespace field in your Spark job configurations accordingly.

How it works

Fluid intercepts data access from Spark jobs and routes reads through JindoRuntime worker pods. On the first access, JindoRuntime fetches data from OSS and writes it to local SATA HDDs. Subsequent reads hit the local cache, eliminating repeated network round-trips to OSS.

Two access methods are available:

| Method | How Spark accesses data | When to use |
| --- | --- | --- |
| POSIX | Fluid creates a persistent volume claim (PVC) backed by the dataset. Spark mounts the PVC and reads files using standard file:// paths. | Your Spark job reads local files and you want cache acceleration with no code changes. |
| HCFS | Spark uses JindoSDK to access data via oss://, s3://, or s3a:// URLs, routed through JindoRuntime. | Your job already uses OSS paths, or you want to switch storage URIs without remounting volumes. Requires a custom Spark image with JindoSDK. |

Step 1: Create a dedicated node pool for Fluid

Create a node pool named fluid for the JindoRuntime worker pods. This example uses three ecs.d1ne.4xlarge instances (network-enhanced big data family). Each node has eight 5,905 GB high-throughput local SATA HDDs, formatted and mounted at /mnt/disk1 through /mnt/disk8.

Add the following label and taint to each node:

  • Label: fluid-cloudnative.github.io/node="true"

  • Taint: fluid-cloudnative.github.io/node="true":NoSchedule

For guidance on creating a node pool, see Create and manage a node pool. For instance type recommendations, see Best practices for the cache optimization policies of Fluid.
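
If you apply the label and taint from the command line instead of in the node pool settings, a sketch along these lines may work. The helper name prepare_fluid_node and the <NODE_NAME> placeholder are illustrative and not part of the original guide. The commands affect only the nodes you name, so nodes added to the pool later need the same treatment.

```shell
# Hypothetical helper (not from the original guide): label and taint one node.
# The taint keeps pods without a matching toleration off the cache nodes.
prepare_fluid_node() {
  node="$1"
  # --overwrite keeps the commands safe to re-run on an already prepared node.
  kubectl label node "$node" fluid-cloudnative.github.io/node=true --overwrite
  kubectl taint node "$node" fluid-cloudnative.github.io/node=true:NoSchedule --overwrite
}

# Example call; replace <NODE_NAME> with a node from the fluid node pool:
# prepare_fluid_node <NODE_NAME>
```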

Step 2: Create a dataset

A Dataset is a Fluid custom resource that maps an OSS path to a cache-backed volume. Spark pods access data through this volume instead of going directly to OSS.

  1. Create fluid-oss-secret.yaml to store your OSS credentials. Replace <ACCESS_KEY_ID> and <ACCESS_KEY_SECRET> with your Alibaba Cloud AccessKey pair.

    apiVersion: v1
    kind: Secret
    metadata:
      name: fluid-oss-secret
      namespace: spark
    stringData:
      OSS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
      OSS_ACCESS_KEY_SECRET: <ACCESS_KEY_SECRET>
  2. Create the Secret:

    kubectl create -f fluid-oss-secret.yaml

    Expected output:

    secret/fluid-oss-secret created
  3. Create spark-fluid-dataset.yaml:

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: spark
      namespace: spark
    spec:
      mounts:
      - name: spark
        # Required. The OSS path to cache. Replace <OSS_BUCKET> with your bucket name.
        mountPoint: oss://<OSS_BUCKET>/
        path: /
        options:
          # Required. The OSS endpoint. Example for China (Beijing) internal: oss-cn-beijing-internal.aliyuncs.com
          fs.oss.endpoint: <OSS_ENDPOINT>
        encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: fluid-oss-secret
              key: OSS_ACCESS_KEY_ID
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: fluid-oss-secret
              key: OSS_ACCESS_KEY_SECRET
      # Data is cached only on nodes matching this affinity rule.
      nodeAffinity:
        required:
          nodeSelectorTerms:
          - matchExpressions:
            - key: fluid-cloudnative.github.io/node
              operator: In
              values:
              - "true"
      # Required for nodes with the fluid taint.
      tolerations:
      - key: fluid-cloudnative.github.io/node
        operator: Equal
        value: "true"
        effect: NoSchedule

    Key parameters:

    | Parameter | Description |
    | --- | --- |
    | mountPoint | The OSS path to cache. |
    | fs.oss.endpoint | The OSS bucket endpoint. For the China (Beijing) region internal endpoint, use oss-cn-beijing-internal.aliyuncs.com. |
    | encryptOptions | Reads OSS credentials from the fluid-oss-secret Secret. |
  4. Create the Dataset:

    kubectl create -f spark-fluid-dataset.yaml

    Expected output:

    dataset.data.fluid.io/spark created
  5. Verify the Dataset status:

    kubectl get -n spark dataset spark -o wide

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE      HCFS URL   TOTAL FILES   CACHE HIT RATIO   AGE
    spark                                                                  NotBound                                              58m

    The NotBound phase is expected. The Dataset binds to a runtime in the next step.

Step 3: Create a JindoRuntime

A JindoRuntime deploys the JindoFS master and worker pods that back the Dataset with local disk cache.

  1. Create spark-fluid-jindoruntime.yaml:

    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      # Required. Must match the Dataset name.
      name: spark
      namespace: spark
    spec:
      # Required. Number of worker pods — one per cache node.
      replicas: 3
      tieredstore:
        levels:
        - mediumtype: HDD
          volumeType: hostPath
          # Required. Set based on the number of disks on each node.
          path: /mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/disk5,/mnt/disk6,/mnt/disk7,/mnt/disk8
          # Required. Cache quota per disk. Total cache = 8 x 5500Gi = 42.97 TiB per worker.
          quotaList: 5500Gi,5500Gi,5500Gi,5500Gi,5500Gi,5500Gi,5500Gi,5500Gi
          # Optional. Start evicting cache when disk usage reaches 99%.
          high: "0.99"
          # Optional. Stop evicting when disk usage drops to 95%.
          low: "0.95"
      worker:
        resources:
          requests:
            cpu: 14
            memory: 56Gi
          limits:
            cpu: 14
            memory: 56Gi

    Key parameters:

    | Parameter | Description |
    | --- | --- |
    | replicas | Number of JindoRuntime worker pods. |
    | mediumtype | Cache storage type. HDD uses local hard disks. |
    | path | Mount paths of the local disks on each worker node. |
    | quotaList | Maximum cache size per disk. Each value maps to the corresponding path. |
    | high | Disk usage ratio that triggers cache eviction. For example, "0.99" means eviction starts at 99% disk usage. |
    | low | Disk usage ratio at which cache eviction stops. For example, "0.95" means eviction stops when disk usage drops to 95%. |
  2. Create the JindoRuntime:

    kubectl create -f spark-fluid-jindoruntime.yaml

    Expected output:

    jindoruntime.data.fluid.io/spark created
  3. Wait for the JindoRuntime to be ready:

    kubectl get -n spark jindoruntime spark

    Expected output:

    NAME    MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
    spark   Ready          Ready          Ready        2m28s

    All three phases must show Ready before proceeding.

  4. Verify the Dataset is now bound:

    kubectl get -n spark dataset spark -o wide

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   HCFS URL                             TOTAL FILES     CACHE HIT RATIO   AGE
    spark   [Calculating]    0.00B    128.91TiB                            Bound   spark-jindofs-master-0.spark:19434   [Calculating]                     2m5s

    The Bound phase confirms the Dataset is ready to serve cached data.
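
The CACHE CAPACITY value of 128.91TiB can be cross-checked against the JindoRuntime settings. The sketch below is plain arithmetic on the values from spark-fluid-jindoruntime.yaml (eight disks per worker, 5500Gi quota per disk, three workers). The variable names are illustrative.

```shell
# Values taken from spark-fluid-jindoruntime.yaml.
disks_per_worker=8
quota_gib_per_disk=5500
workers=3

# awk performs the floating-point division; 1 TiB = 1024 GiB.
per_worker_tib=$(awk -v d="$disks_per_worker" -v q="$quota_gib_per_disk" \
  'BEGIN { printf "%.2f", d * q / 1024 }')
total_tib=$(awk -v d="$disks_per_worker" -v q="$quota_gib_per_disk" -v w="$workers" \
  'BEGIN { printf "%.2f", d * q * w / 1024 }')

echo "Per worker: ${per_worker_tib} TiB"   # Per worker: 42.97 TiB
echo "Total: ${total_tib} TiB"             # Total: 128.91 TiB
```

If the reported capacity is lower than the computed total, one possible cause is that fewer than three workers are Ready.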

Step 4: (Optional) Prefetch data

The cache starts empty, so the first Spark job still reads its data from OSS at network speed. To let even the first run read from local disks, preload the OSS data into the cache with a DataLoad before you submit any job.

  1. Create spark-fluid-dataload.yaml:

    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: spark
      namespace: spark
    spec:
      dataset:
        name: spark
        namespace: spark
      loadMetadata: true
  2. Create the DataLoad:

    kubectl create -f spark-fluid-dataload.yaml

    Expected output:

    dataload.data.fluid.io/spark created
  3. Monitor prefetching progress:

    kubectl get -n spark dataload spark -w

    Expected output:

    NAME    DATASET   PHASE       AGE     DURATION
    spark   spark     Executing   20s     Unfinished
    spark   spark     Complete    9m31s   8m37s

    Prefetching is complete when the phase shows Complete.

  4. Verify cached data volume:

    kubectl get -n spark dataset spark -o wide

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   HCFS URL                             TOTAL FILES     CACHE HIT RATIO   AGE
    spark   0.00B            326.85GiB   128.91TiB        0.0%                Bound   spark-jindofs-master-0.spark:19434   [Calculating]                     19m

    The CACHED column shows 326.85GiB, confirming that data is preloaded.
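
In a script, watching with -w can be replaced by a blocking wait. The following is a minimal sketch, assuming that the PHASE column shown by kubectl get corresponds to the .status.phase field of the resource and that your kubectl version supports jsonpath conditions in kubectl wait (v1.23 or later). The helper name wait_for_phase is hypothetical.

```shell
# Hypothetical helper: block until a Fluid resource in the spark namespace
# reports the given phase. Assumes the PHASE column maps to .status.phase.
wait_for_phase() {
  kind="$1"; name="$2"; phase="$3"; timeout="${4:-10m}"
  kubectl wait -n spark "$kind/$name" \
    --for=jsonpath='{.status.phase}'="$phase" \
    --timeout="$timeout"
}

# Example calls:
# wait_for_phase dataset spark Bound
# wait_for_phase dataload spark Complete 30m
```

The command exits with a non-zero status if the timeout expires first, so a script can stop before submitting a job against a cold cache.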

Step 5: Run a Spark job

Choose a method based on your setup:

  • Use POSIX if your Spark job reads local files and you want no code changes. Fluid mounts a PVC that Spark accesses via file:// paths.

  • Use HCFS if your job already uses oss://, s3://, or s3a:// paths, or you want to route existing OSS access through the cache. This method requires a custom Spark image with JindoSDK.

Method 1: Use POSIX APIs

Fluid creates a PVC for the Dataset. Mount the PVC in the Spark driver and executor pods to access cached data using standard file:// paths — no code changes required.

  1. Create spark-pagerank-fluid-posix.yaml:

    Note

    This example uses the spark:3.5.4 image from the Spark community. If the image fails to pull due to network issues, sync it to your own image repository or build a custom image.

    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: spark-pagerank-fluid-posix
      namespace: spark
    spec:
      type: Scala
      mode: cluster
      image: spark:3.5.4
      mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
      mainClass: org.apache.spark.examples.SparkPageRank
      arguments:
      # Access cached data using the file:// format.
      - file:///mnt/fluid/data/pagerank_dataset.txt
      - "10"
      sparkVersion: 3.5.4
      driver:
        cores: 1
        coreLimit: 1200m
        memory: 512m
        volumeMounts:
        # Mount the PVC created by Fluid for the dataset.
        - name: spark
          mountPath: /mnt/fluid
        serviceAccount: spark-operator-spark
      executor:
        instances: 2
        cores: 1
        coreLimit: "1"
        memory: 4g
        volumeMounts:
        # Mount the PVC created by Fluid for the dataset.
        - name: spark
          mountPath: /mnt/fluid
      volumes:
      # The PVC name matches the Dataset name.
      - name: spark
        persistentVolumeClaim:
          claimName: spark
      restartPolicy:
        type: Never
  2. Submit the job:

    kubectl create -f spark-pagerank-fluid-posix.yaml

    Expected output:

    sparkapplication.sparkoperator.k8s.io/spark-pagerank-fluid-posix created
  3. Monitor the job status:

    kubectl get -n spark sparkapplication spark-pagerank-fluid-posix -w

    Expected output:

    NAME                         STATUS       ATTEMPTS   START                  FINISH                 AGE
    spark-pagerank-fluid-posix   RUNNING      1          2025-01-16T11:06:15Z   <no value>             87s
    spark-pagerank-fluid-posix   RUNNING      1          2025-01-16T11:06:15Z   <no value>             102s
    spark-pagerank-fluid-posix   RUNNING      1          2025-01-16T11:06:15Z   <no value>             102s
    spark-pagerank-fluid-posix   SUCCEEDING   1          2025-01-16T11:06:15Z   2025-01-16T11:07:59Z   104s
    spark-pagerank-fluid-posix   COMPLETED    1          2025-01-16T11:06:15Z   2025-01-16T11:07:59Z   104s
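
Once the job reports COMPLETED, the computed ranks are in the driver log. The sketch below assumes the Spark operator's default driver pod name (<application-name>-driver) and that the SparkPageRank example prints result lines containing "has rank". The helper name show_pagerank_ranks is hypothetical.

```shell
# Hypothetical helper: print the PageRank result lines from the driver log.
# Assumes the driver pod is named <application-name>-driver.
show_pagerank_ranks() {
  app="$1"
  kubectl logs -n spark "${app}-driver" | grep "has rank"
}

# Example call:
# show_pagerank_ranks spark-pagerank-fluid-posix
```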

Method 2: Use HCFS APIs

With HCFS access, Spark reads data using oss://, s3://, or s3a:// URIs. JindoSDK intercepts these requests and routes reads through JindoRuntime, hitting the local cache. This method requires a custom Spark image that includes JindoSDK.

  1. Get the HCFS URL of the Dataset:

    kubectl get -n spark dataset spark -o wide

    Expected output:

    NAME    UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   HCFS URL                             TOTAL FILES     CACHE HIT RATIO   AGE
    spark   0.00B            326.85GiB   128.91TiB        0.0%                Bound   spark-jindofs-master-0.spark:19434   [Calculating]                     30m

    Note the HCFS URL (spark-jindofs-master-0.spark:19434). Set fs.jindofsx.namespace.rpc.address to this value in step 3.

  2. If you don't already have a Spark image with JindoSDK, build one using this Dockerfile:

    ARG SPARK_IMAGE=spark:3.5.4
    
    FROM ${SPARK_IMAGE}
    
    # Add JindoSDK dependencies for OSS acceleration
    # The trailing slash marks the destination as a directory, so each jar keeps its file name.
    ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-core/6.4.0/jindo-core-6.4.0.jar ${SPARK_HOME}/jars/
    ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-sdk/6.4.0/jindo-sdk-6.4.0.jar ${SPARK_HOME}/jars/
  3. Create spark-pagerank-fluid-hcfs.yaml. Replace <SPARK_IMAGE> with your image, <OSS_BUCKET> with your bucket name, and <OSS_ENDPOINT> with your bucket endpoint.

    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: spark-pagerank-fluid-hcfs
      namespace: spark
    spec:
      type: Scala
      mode: cluster
      # Required. Must be a Spark image that includes JindoSDK.
      image: <SPARK_IMAGE>
      mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
      mainClass: org.apache.spark.examples.SparkPageRank
      arguments:
      # Choose one URI format. All three are routed through JindoRuntime.
      # oss:// format:
      - oss://<OSS_BUCKET>/data/pagerank_dataset.txt
      # s3:// format (commented out):
      # - s3://<OSS_BUCKET>/data/pagerank_dataset.txt
      # s3a:// format (commented out):
      # - s3a://<OSS_BUCKET>/data/pagerank_dataset.txt
      - "10"
      sparkVersion: 3.5.4
      hadoopConf:
        #===================
        # OSS access via oss://
        #===================
        fs.oss.impl: com.aliyun.jindodata.oss.JindoOssFileSystem
        fs.oss.endpoint: <OSS_ENDPOINT>
        fs.oss.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
    
        # OSS access via s3://
        fs.s3.impl: com.aliyun.jindodata.s3.JindoS3FileSystem
        fs.s3.endpoint: <OSS_ENDPOINT>
        fs.s3.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
    
        # OSS access via s3a://
        fs.s3a.impl: com.aliyun.jindodata.s3.JindoS3FileSystem
        fs.s3a.endpoint: <OSS_ENDPOINT>
        fs.s3a.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
    
        #===================
        # JindoFS cache routing
        #===================
        fs.xengine: jindofsx
        # Required. Set to the HCFS URL from step 1.
        fs.jindofsx.namespace.rpc.address: spark-jindofs-master-0.spark:19434
        fs.jindofsx.data.cache.enable: "true"
      driver:
        cores: 1
        coreLimit: 1200m
        memory: 512m
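        # Injects OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET from the Secret created in Step 2.
        envFrom:
        - secretRef:
            name: fluid-oss-secret
        serviceAccount: spark-operator-spark
      executor:
        instances: 2
        cores: 2
        coreLimit: "2"
        memory: 8g
        # Injects OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET from the Secret created in Step 2.
        envFrom:
        - secretRef:
            name: fluid-oss-secret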
      restartPolicy:
        type: Never
  4. Submit the job:

    kubectl create -f spark-pagerank-fluid-hcfs.yaml

    Expected output:

    sparkapplication.sparkoperator.k8s.io/spark-pagerank-fluid-hcfs created
  5. Monitor the job status:

    kubectl get -n spark sparkapplication spark-pagerank-fluid-hcfs -w

    Expected output:

    NAME                        STATUS       ATTEMPTS   START                  FINISH                 AGE
    spark-pagerank-fluid-hcfs   RUNNING      1          2025-01-16T11:21:16Z   <no value>             9s
    spark-pagerank-fluid-hcfs   RUNNING      1          2025-01-16T11:21:16Z   <no value>             15s
    spark-pagerank-fluid-hcfs   RUNNING      1          2025-01-16T11:21:16Z   <no value>             77s
    spark-pagerank-fluid-hcfs   RUNNING      1          2025-01-16T11:21:16Z   <no value>             77s
    spark-pagerank-fluid-hcfs   SUCCEEDING   1          2025-01-16T11:21:16Z   2025-01-16T11:22:34Z   78s
    spark-pagerank-fluid-hcfs   COMPLETED    1          2025-01-16T11:21:16Z   2025-01-16T11:22:34Z   78s

Step 6: (Optional) Clean up

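Delete resources in the following order, so that the Spark jobs and the DataLoad are removed before the JindoRuntime and Dataset they depend on. Skip the command for any file you did not apply: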

kubectl delete -f spark-pagerank-fluid-posix.yaml
kubectl delete -f spark-pagerank-fluid-hcfs.yaml
kubectl delete -f spark-fluid-dataload.yaml
kubectl delete -f spark-fluid-jindoruntime.yaml
kubectl delete -f spark-fluid-dataset.yaml
kubectl delete -f fluid-oss-secret.yaml
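
To confirm that nothing is left behind, you can list the resource kinds this guide creates. This is a sketch only; the helper name list_leftovers is hypothetical, and the resource names assume the Fluid and Spark operator custom resource definitions are still installed.

```shell
# Hypothetical check: list any remaining Fluid and Spark resources
# in the given namespace (defaults to spark).
list_leftovers() {
  kubectl get -n "${1:-spark}" dataset,jindoruntime,dataload,sparkapplication
}

# Example call:
# list_leftovers spark
```

After a complete cleanup, the command should report that no resources were found.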

What's next