
Container Service for Kubernetes: Read and write OSS data in Spark jobs

Last Updated: Apr 11, 2025

This topic demonstrates how to run Spark jobs in a Container Service for Kubernetes (ACK) cluster and configure them to read and write data in Object Storage Service (OSS) buckets, using the built-in PageRank job as an example.

Procedure overview

This topic guides you through the following steps to run Spark jobs in an ACK cluster and configure read and write operations for OSS data:

  1. Prepare and upload test data to an OSS bucket: Generate a test dataset for the PageRank job and upload it to an OSS bucket.

  2. Build a Spark container image: Include the necessary JAR dependencies for OSS access in the Spark container image.

  3. Create a Secret to store OSS access credentials: To protect data from unauthorized access, create a Secret YAML file to store credentials used by the Spark job to access OSS.

  4. Submit a Spark job: Create a Spark job configuration file and submit it to run an OSS data processing task.

  5. (Optional) Clean up the environment: Delete the completed Spark job and any resources you no longer need to reduce costs.

Step 1: Prepare and upload test data to an OSS bucket

Generate a test dataset for PageRank and upload it to the designated OSS bucket.
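
Each record in the dataset is a `src dst` pair representing a link from one page to another. As a rough illustration of what the SparkPageRank example computes, here is one damped iteration (damping factor 0.85, as in the Spark example) over a tiny hand-made edge list in awk:

```shell
# Tiny edge list: page 1 links to 2 and 3, page 2 links to 3, page 3 links to 1.
cat > /tmp/edges.txt <<'EOF'
1 2
1 3
2 3
3 1
EOF

# One damped iteration: every page starts at rank 1.0, then each page
# receives rank/outdegree from every page that links to it.
ranks=$(awk '
  { out[$1]++; edge[NR] = $0; pages[$1]; pages[$2] }
  END {
    for (p in pages) rank[p] = 1.0
    for (i = 1; i <= NR; i++) {
      split(edge[i], e, " ")
      contrib[e[2]] += rank[e[1]] / out[e[1]]
    }
    for (p in pages) printf "%s has rank: %.4f\n", p, 0.15 + 0.85 * contrib[p]
  }' /tmp/edges.txt | sort)
echo "$ranks"
```

Page 3 ends up highest because both other pages link to it; the Spark job repeats the same update for the number of iterations you pass as the second argument.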

  1. Use the following template to create a shell script named generate_pagerank_dataset.sh. This script generates the test dataset:

    #!/bin/bash
    
    # Check the number of arguments
    if [ "$#" -ne 2 ]; then
        echo "Usage: $0 M N"
        echo "M: Number of web pages"
        echo "N: Number of records to generate"
        exit 1
    fi
    
    M=$1
    N=$2
    
    # Verify that M and N are positive integers
    if ! [[ "$M" =~ ^[0-9]+$ ]] || ! [[ "$N" =~ ^[0-9]+$ ]]; then
        echo "Both M and N must be positive integers."
        exit 1
    fi
    
    # Generate dataset
    for ((i=1; i<=N; i++)); do
        # Ensure the source and target pages are different.
        # RANDOM only yields 0-32767, so combine two draws to cover all M pages.
        while true; do
            src=$(( (RANDOM * 32768 + RANDOM) % M + 1 ))
            dst=$(( (RANDOM * 32768 + RANDOM) % M + 1 ))
            if [ "$src" -ne "$dst" ]; then
                echo "$src $dst"
                break
            fi
        done
    done
  2. Run the following commands to create the test dataset:

    M=100000    # The number of web pages
    
    N=10000000  # The number of records
    
    # Generate dataset randomly and save as pagerank_dataset.txt
    bash generate_pagerank_dataset.sh $M $N > pagerank_dataset.txt
  3. Run the following command to upload the generated dataset to the data/ directory in your OSS bucket:

    ossutil cp pagerank_dataset.txt oss://<BUCKET_NAME>/data/
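
Before uploading, you can sanity-check the dataset format with a quick awk pass. This is a sketch run against a tiny inline sample; point it at pagerank_dataset.txt for the real file:

```shell
# Hypothetical sample standing in for pagerank_dataset.txt.
printf '12 34\n56 7\n' > /tmp/sample.txt

# Every record must have two positive integers with distinct source and target.
check=$(awk 'NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $1 == $2 { bad++ }
             END { print (bad ? bad " malformed records" : "OK: " NR " records") }' /tmp/sample.txt)
echo "$check"
```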

Step 2: Build a Spark container image

Build a container image with the necessary JAR dependencies for OSS access. You can choose Hadoop OSS SDK, Hadoop S3 SDK, or JindoSDK. For more information about how to build images with the Container Registry, see Use a Container Registry Enterprise Edition instance to build an image.

Note
  • The Spark image used in the sample Dockerfile comes from the open source community. You can replace it with your own Spark image as needed.

  • Choose the appropriate Hadoop OSS SDK, Hadoop S3 SDK, or JindoSDK version based on your Spark version.

Use Hadoop OSS SDK

In this example, Spark 3.5.5 and Hadoop OSS SDK 3.3.4 are used. Create a Dockerfile with the following sample code:

ARG SPARK_IMAGE=spark:3.5.5

FROM ${SPARK_IMAGE}

# Add dependencies for Hadoop Aliyun OSS support
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aliyun/3.3.4/hadoop-aliyun-3.3.4.jar ${SPARK_HOME}/jars/
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.17.4/aliyun-sdk-oss-3.17.4.jar ${SPARK_HOME}/jars/
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/jdom/jdom2/2.0.6.1/jdom2-2.0.6.1.jar ${SPARK_HOME}/jars/

Use Hadoop S3 SDK

In this example, Spark 3.5.5 and Hadoop S3 SDK 3.3.4 are used. Create a Dockerfile with the following sample code:

ARG SPARK_IMAGE=spark:3.5.5

FROM ${SPARK_IMAGE}

# Add dependencies for Hadoop AWS S3 support
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar ${SPARK_HOME}/jars/
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar ${SPARK_HOME}/jars/

Use JindoSDK

In this example, Spark 3.5.5 and JindoSDK 6.8.0 are used. Create a Dockerfile with the following sample code:

ARG SPARK_IMAGE=spark:3.5.5

FROM ${SPARK_IMAGE}

# Add dependencies for JindoSDK support
ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-core/6.8.0/jindo-core-6.8.0.jar ${SPARK_HOME}/jars/
ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-sdk/6.8.0/jindo-sdk-6.8.0.jar ${SPARK_HOME}/jars/
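
Whichever SDK you choose, you can make the image build fail fast if the jars did not land in `${SPARK_HOME}/jars/`, for example by appending a check like the following to the Dockerfile (a sketch, not part of the official images):

```dockerfile
# Fail the build if no OSS SDK jar ended up in the jars directory.
RUN ls "${SPARK_HOME}/jars/" | grep -Eq 'hadoop-aliyun|hadoop-aws|jindo-sdk' \
    || { echo "OSS SDK jars missing from ${SPARK_HOME}/jars/"; exit 1; }
```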

Step 3: Create a Secret to store OSS access credentials

To access OSS data securely from Spark jobs, store your OSS access credentials in a Kubernetes Secret rather than hard-coding them into your jobs. The Secret is then injected into the driver and executor containers as environment variables.
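
The manifests below use `stringData`, which accepts plain-text values; the API server base64-encodes them into the Secret's `data` field on creation. The equivalent manual encoding (shown with a made-up AccessKey ID) looks like this:

```shell
# Hypothetical AccessKey ID; stringData lets you skip this encoding step.
access_key_id='LTAI5tExampleKeyId'
encoded=$(printf '%s' "$access_key_id" | base64)
# $encoded is what would appear under .data in the stored Secret.
echo "$encoded"
```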

Use Hadoop OSS SDK

  1. Create a Secret YAML file named spark-oss-secret.yaml to store credentials used to access OSS:

    apiVersion: v1
    kind: Secret
    metadata:
      name: spark-oss-secret
      namespace: default
    stringData:
      # Replace <ACCESS_KEY_ID> with the AccessKey ID of your Alibaba Cloud account.
      OSS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
      # Replace <ACCESS_KEY_SECRET> with the AccessKey Secret of your Alibaba Cloud account.
      OSS_ACCESS_KEY_SECRET: <ACCESS_KEY_SECRET>
  2. Run the following command to create a Secret:

    kubectl apply -f spark-oss-secret.yaml

    Expected output:

    secret/spark-oss-secret created

Use Hadoop S3 SDK

  1. Create a Secret YAML file named spark-s3-secret.yaml to store credentials used to access OSS:

    apiVersion: v1
    kind: Secret
    metadata:
      name: spark-s3-secret
      namespace: default
    stringData:
      # Replace <ACCESS_KEY_ID> with the AccessKey ID of your Alibaba Cloud account.
      AWS_ACCESS_KEY_ID: <ACCESS_KEY_ID> 
      # Replace <ACCESS_KEY_SECRET> with the AccessKey Secret of your Alibaba Cloud account.
      AWS_SECRET_ACCESS_KEY: <ACCESS_KEY_SECRET>
  2. Run the following command to create a Secret:

    kubectl apply -f spark-s3-secret.yaml

    Expected output:

    secret/spark-s3-secret created

Use JindoSDK

  1. Create a Secret YAML file named spark-oss-secret.yaml to store credentials used to access OSS:

    apiVersion: v1
    kind: Secret
    metadata:
      name: spark-oss-secret
      namespace: default
    stringData:
      # Replace <ACCESS_KEY_ID> with the AccessKey ID of your Alibaba Cloud account.
      OSS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
      # Replace <ACCESS_KEY_SECRET> with the AccessKey Secret of your Alibaba Cloud account.
      OSS_ACCESS_KEY_SECRET: <ACCESS_KEY_SECRET>
  2. Run the following command to create a Secret:

    kubectl apply -f spark-oss-secret.yaml

    Expected output:

    secret/spark-oss-secret created

Step 4: Submit a Spark job

Submit a Spark job in the ACK cluster to read data from and write data to an OSS bucket.

Use Hadoop OSS SDK

Create a Spark application YAML file named spark-pagerank.yaml. For a full list of OSS configuration parameters, see the Hadoop-Aliyun module.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: default
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark container image built in Step 2.
  image: <SPARK_IMAGE>
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
  mainClass: org.apache.spark.examples.SparkPageRank
  arguments:
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt           # Specify the input test dataset. Replace <OSS_BUCKET> with your OSS bucket name.
  - "10"                                                   # The number of iterations.
  sparkVersion: 3.5.5
  hadoopConf:
    fs.AbstractFileSystem.oss.impl: org.apache.hadoop.fs.aliyun.oss.OSS
    fs.oss.impl: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
    # OSS endpoint. Replace <OSS_ENDPOINT> with your OSS endpoint.
    # For example, the internal endpoint for the Beijing region is oss-cn-beijing-internal.aliyuncs.com. 
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.oss.common.auth.EnvironmentVariableCredentialsProvider
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    envFrom:
    - secretRef:
        name: spark-oss-secret               # Specify the Secret used to access OSS.
    serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    coreLimit: "2"
    memory: 8g
    envFrom:
    - secretRef: 
        name: spark-oss-secret               # Specify the Secret used to access OSS.
  restartPolicy:
    type: Never

Use Hadoop S3 SDK

Create a Spark application YAML file named spark-pagerank.yaml. For a full list of S3 configuration parameters, see the Hadoop-AWS module.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: default
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark container image built in Step 2.
  image: <SPARK_IMAGE>
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
  mainClass: org.apache.spark.examples.SparkPageRank
  arguments:
  - s3a://<OSS_BUCKET>/data/pagerank_dataset.txt           # Specify the input test dataset. Replace <OSS_BUCKET> with your OSS bucket name.
  - "10"                                                   # The number of iterations.
  sparkVersion: 3.5.5
  hadoopConf:
    fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    # OSS endpoint. Replace <OSS_ENDPOINT> with your OSS endpoint.
    # For example, the internal endpoint for the Beijing region is oss-cn-beijing-internal.aliyuncs.com.
    fs.s3a.endpoint: <OSS_ENDPOINT>
    # The region where the OSS endpoint is located. For example, cn-beijing for the Beijing region.
    fs.s3a.endpoint.region: <OSS_REGION>
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    envFrom:
    - secretRef:
        name: spark-s3-secret               # Specify the Secret used to access OSS.
    serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    coreLimit: "2"
    memory: 8g
    envFrom:
    - secretRef: 
        name: spark-s3-secret               # Specify the Secret used to access OSS.
  restartPolicy:
    type: Never

Use JindoSDK

Create a Spark application YAML file named spark-pagerank.yaml.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: default
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark container image built in Step 2.
  image: <SPARK_IMAGE>
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
  mainClass: org.apache.spark.examples.SparkPageRank
  arguments:
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt    # Specify the input test dataset. Replace <OSS_BUCKET> with your OSS bucket name.
  - "10"                                            # The number of iterations.
  sparkVersion: 3.5.5
  hadoopConf:
    fs.AbstractFileSystem.oss.impl: com.aliyun.jindodata.oss.JindoOSS
    fs.oss.impl: com.aliyun.jindodata.oss.JindoOssFileSystem
    fs.oss.endpoint: <OSS_ENDPOINT>                 # The internal endpoint for the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com.
    fs.oss.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    serviceAccount: spark-operator-spark
    envFrom:
    - secretRef:
        name: spark-oss-secret                    # Specify the Secret used to access OSS.
  executor: 
    instances: 2
    cores: 1
    coreLimit: "2"
    memory: 8g
    envFrom:
    - secretRef:
        name: spark-oss-secret                    # Specify the Secret used to access OSS.
  restartPolicy:
    type: Never
  1. Run the following command to submit the Spark job:

    kubectl apply -f spark-pagerank.yaml
  2. Run the following command to monitor the status of the Spark job:

    kubectl get sparkapplications spark-pagerank

    Expected output:

    NAME             STATUS      ATTEMPTS   START                  FINISH                 AGE
    spark-pagerank   COMPLETED   1          2024-10-09T12:54:25Z   2024-10-09T12:55:46Z   90s
  3. Run the following command to view the last 20 log entries of the driver pod:

    kubectl logs spark-pagerank-driver --tail=20

    Expected output:

    Use Hadoop OSS SDK

    The log indicates that the Spark job has been successfully executed.

    30024 has rank:  1.0709659078941967 .
    21390 has rank:  0.9933356174074005 .
    28500 has rank:  1.0404018494028928 .
    2137 has rank:  0.9931000490520374 .
    3406 has rank:  0.9562543137167121 .
    20904 has rank:  0.8827028621652337 .
    25604 has rank:  1.0270134041934191 .
    24/10/09 12:48:36 INFO SparkUI: Stopped Spark web UI at http://spark-pagerank-dd0d4d927151c9d0-driver-svc.default.svc:4040
    24/10/09 12:48:36 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
    24/10/09 12:48:36 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
    24/10/09 12:48:36 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
    24/10/09 12:48:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    24/10/09 12:48:36 INFO MemoryStore: MemoryStore cleared
    24/10/09 12:48:36 INFO BlockManager: BlockManager stopped
    24/10/09 12:48:36 INFO BlockManagerMaster: BlockManagerMaster stopped
    24/10/09 12:48:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    24/10/09 12:48:36 INFO SparkContext: Successfully stopped SparkContext
    24/10/09 12:48:36 INFO ShutdownHookManager: Shutdown hook called
    24/10/09 12:48:36 INFO ShutdownHookManager: Deleting directory /tmp/spark-e8b8c2ab-c916-4f84-b60f-f54c0de3a7f0
    24/10/09 12:48:36 INFO ShutdownHookManager: Deleting directory /var/data/spark-c5917d98-06fb-46fe-85bc-199b839cb885/spark-23e2c2ae-4754-43ae-854d-2752eb83b2c5

    Use Hadoop S3 SDK

    The log indicates that the Spark job has been successfully executed.

    3406 has rank:  0.9562543137167121 .
    20904 has rank:  0.8827028621652337 .
    25604 has rank:  1.0270134041934191 .
    25/04/07 03:54:11 INFO SparkContext: SparkContext is stopping with exitCode 0.
    25/04/07 03:54:11 INFO SparkUI: Stopped Spark web UI at http://spark-pagerank-0f7dec960e615617-driver-svc.spark.svc:4040
    25/04/07 03:54:11 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
    25/04/07 03:54:11 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
    25/04/07 03:54:11 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
    25/04/07 03:54:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    25/04/07 03:54:11 INFO MemoryStore: MemoryStore cleared
    25/04/07 03:54:11 INFO BlockManager: BlockManager stopped
    25/04/07 03:54:11 INFO BlockManagerMaster: BlockManagerMaster stopped
    25/04/07 03:54:11 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    25/04/07 03:54:11 INFO SparkContext: Successfully stopped SparkContext
    25/04/07 03:54:11 INFO ShutdownHookManager: Shutdown hook called
    25/04/07 03:54:11 INFO ShutdownHookManager: Deleting directory /var/data/spark-20d425bb-f442-4b0a-83e2-5a0202959a54/spark-ff5bbf08-4343-4a7a-9ce0-3f7c127cf4a9
    25/04/07 03:54:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-a421839a-07af-49c0-b637-f15f76c3e752
    25/04/07 03:54:11 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
    25/04/07 03:54:11 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
    25/04/07 03:54:11 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

    Use JindoSDK

    The log indicates that the Spark job has been successfully executed.

    21390 has rank:  0.9933356174074005 .
    28500 has rank:  1.0404018494028928 .
    2137 has rank:  0.9931000490520374 .
    3406 has rank:  0.9562543137167121 .
    20904 has rank:  0.8827028621652337 .
    25604 has rank:  1.0270134041934191 .
    24/10/09 12:55:44 INFO SparkContext: SparkContext is stopping with exitCode 0.
    24/10/09 12:55:44 INFO SparkUI: Stopped Spark web UI at http://spark-pagerank-6a5e3d9271584856-driver-svc.default.svc:4040
    24/10/09 12:55:44 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
    24/10/09 12:55:44 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
    24/10/09 12:55:44 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
    24/10/09 12:55:45 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    24/10/09 12:55:45 INFO MemoryStore: MemoryStore cleared
    24/10/09 12:55:45 INFO BlockManager: BlockManager stopped
    24/10/09 12:55:45 INFO BlockManagerMaster: BlockManagerMaster stopped
    24/10/09 12:55:45 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    24/10/09 12:55:45 INFO SparkContext: Successfully stopped SparkContext
    24/10/09 12:55:45 INFO ShutdownHookManager: Shutdown hook called
    24/10/09 12:55:45 INFO ShutdownHookManager: Deleting directory /var/data/spark-87e8406e-06a7-4b4a-b18f-2193da299d35/spark-093a1b71-121a-4367-9d22-ad4e397c9815
    24/10/09 12:55:45 INFO ShutdownHookManager: Deleting directory /tmp/spark-723e2039-a493-49e8-b86d-fff5fd1bb168
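
The `has rank` lines in the driver log are unsorted. To pull out the highest-ranked pages, you can sort them numerically; this sketch runs on a captured sample, and in practice you would pipe in `kubectl logs spark-pagerank-driver` instead:

```shell
# Sample lines as they appear in the driver log.
printf '%s\n' \
  '30024 has rank:  1.0709659078941967 .' \
  '21390 has rank:  0.9933356174074005 .' \
  '28500 has rank:  1.0404018494028928 .' > /tmp/driver.log

# Sort by the numeric rank after the colon, highest first.
top=$(grep 'has rank' /tmp/driver.log | sort -t: -k2 -gr | head -2)
echo "$top"
```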

Step 5: (Optional) Clean up the environment

After you complete this tutorial, you can release the resources that are no longer required.

Run the following command to delete the Spark job:

kubectl delete -f spark-pagerank.yaml

Run the following command to remove the Secret:

Use Hadoop OSS SDK

kubectl delete -f spark-oss-secret.yaml

Use Hadoop S3 SDK

kubectl delete -f spark-s3-secret.yaml

Use JindoSDK

kubectl delete -f spark-oss-secret.yaml
