Run Spark jobs that read and write OSS data in ACK clusters, using the built-in PageRank job as a worked example. The tutorial covers three OSS integration options—Hadoop OSS SDK, Hadoop S3 SDK, and JindoSDK—and walks through every step from dataset generation to job verification.
Prerequisites
Before you begin, ensure that you have:
- An ACK Pro cluster or ACK Serverless Pro cluster running Kubernetes 1.24 or later. See Create an ACK managed cluster, Create an ACK Serverless cluster, and Manually upgrade ACK clusters.
- The ack-spark-operator component installed. See Step 1: Install the ack-spark-operator component.
- A kubectl client connected to the cluster. See Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
- An OSS bucket. See Create a bucket.
- ossutil installed and configured. See ossutil command reference.
Choose an OSS integration
Three SDKs are available for OSS access. Choose one before proceeding.
| SDK | Use when | URI scheme |
|---|---|---|
| Hadoop OSS SDK | Native OSS support via Aliyun filesystem; straightforward for most OSS workloads | oss:// |
| Hadoop S3 SDK | Your cluster or tooling already uses S3-compatible APIs; no Aliyun-specific dependency needed | s3a:// |
| JindoSDK | Optimized OSS performance with Alibaba Cloud's Jindo acceleration layer | oss:// |
Apply the same SDK choice consistently through all steps.
Step 1: Prepare and upload test data
Generate a PageRank dataset and upload it to your OSS bucket.
- Create a file named `generate_pagerank_dataset.sh` with the following content:

  ```bash
  #!/bin/bash

  # Check the number of arguments
  if [ "$#" -ne 2 ]; then
      echo "Usage: $0 M N"
      echo "M: Number of web pages"
      echo "N: Number of records to generate"
      exit 1
  fi

  M=$1
  N=$2

  # Verify that M and N are positive integers
  if ! [[ "$M" =~ ^[0-9]+$ ]] || ! [[ "$N" =~ ^[0-9]+$ ]]; then
      echo "Both M and N must be positive integers."
      exit 1
  fi

  # Generate the dataset
  for ((i=1; i<=$N; i++)); do
      # Ensure the source and target pages are different
      while true; do
          src=$((RANDOM % M + 1))
          dst=$((RANDOM % M + 1))
          if [ "$src" -ne "$dst" ]; then
              echo "$src $dst"
              break
          fi
      done
  done
  ```

- Generate the dataset:

  ```bash
  M=100000     # The number of web pages
  N=10000000   # The number of records

  # Generate the dataset randomly and save it as pagerank_dataset.txt
  bash generate_pagerank_dataset.sh $M $N > pagerank_dataset.txt
  ```

- Upload the dataset to the `data/` directory in your OSS bucket:

  ```bash
  ossutil cp pagerank_dataset.txt oss://<BUCKET_NAME>/data/
  ```
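Before uploading the full 10-million-record file, it can help to sanity-check the dataset shape on a small sample. The sketch below mirrors the generator's logic in awk (the file name and sizes are illustrative) and confirms that every record holds exactly two distinct page IDs:

```shell
# Generate a 100-record sample in the same "src dst" format as
# generate_pagerank_dataset.sh, avoiding self-links.
awk 'BEGIN {
    srand(42); M = 50; N = 100
    for (i = 1; i <= N; i++) {
        do {
            src = int(rand() * M) + 1
            dst = int(rand() * M) + 1
        } while (src == dst)
        print src, dst
    }
}' > sample_dataset.txt

# A valid dataset has exactly N lines, two fields per line, no self-links.
echo "records:    $(wc -l < sample_dataset.txt)"
echo "self-links: $(awk '$1 == $2 { n++ } END { print n+0 }' sample_dataset.txt)"
```

The same two checks can be run against `pagerank_dataset.txt` itself before the `ossutil cp` upload.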
Step 2: Build a Spark container image
Build a container image that includes the JAR dependencies required for OSS access. For instructions on building images with Alibaba Cloud Container Registry, see Use a Container Registry Enterprise Edition instance to build an image.
The base Spark image in the sample Dockerfiles comes from the open source community. Replace it with your own image as needed, and match the SDK version to your Spark version.
Use Hadoop OSS SDK
This example uses Spark 3.5.5 and Hadoop OSS SDK 3.3.4:
```dockerfile
ARG SPARK_IMAGE=spark:3.5.5
FROM ${SPARK_IMAGE}

# Add dependencies for Hadoop Aliyun OSS support
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aliyun/3.3.4/hadoop-aliyun-3.3.4.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.17.4/aliyun-sdk-oss-3.17.4.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/jdom/jdom2/2.0.6.1/jdom2-2.0.6.1.jar ${SPARK_HOME}/jars
```
Use Hadoop S3 SDK
This example uses Spark 3.5.5 and Hadoop S3 SDK 3.3.4:
```dockerfile
ARG SPARK_IMAGE=spark:3.5.5
FROM ${SPARK_IMAGE}

# Add dependencies for Hadoop AWS S3 support
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar ${SPARK_HOME}/jars
```
Use JindoSDK
This example uses Spark 3.5.5 and JindoSDK 6.8.0:
```dockerfile
ARG SPARK_IMAGE=spark:3.5.5
FROM ${SPARK_IMAGE}

# Add dependencies for JindoSDK support
ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-core/6.8.0/jindo-core-6.8.0.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-sdk/6.8.0/jindo-sdk-6.8.0.jar ${SPARK_HOME}/jars
```
Step 3: Store OSS credentials in a Kubernetes Secret
Store your OSS AccessKey ID and AccessKey Secret in a Kubernetes Secret rather than embedding them in the SparkApplication manifest. The Spark operator injects Secret keys as environment variables into the driver and executor pods at runtime.
The `hadoopConf` fields in Step 4 use `EnvironmentVariableCredentialsProvider`, which reads credentials from these environment variables automatically. Never embed credentials directly in YAML files or commit them to source control.
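The mechanism is simple: the provider looks the keys up in the process environment of the driver and executor JVMs. The sketch below simulates that lookup locally with placeholder values (not real credentials):

```shell
# Simulate the lookup that EnvironmentVariableCredentialsProvider performs
# inside the driver and executor pods, where these variables are injected
# from the Secret via envFrom. The values here are placeholders.
export OSS_ACCESS_KEY_ID='example-key-id'
export OSS_ACCESS_KEY_SECRET='example-key-secret'

if [ -n "$OSS_ACCESS_KEY_ID" ] && [ -n "$OSS_ACCESS_KEY_SECRET" ]; then
    echo "credentials present"
else
    echo "credentials missing"
fi
```

Inside a running driver pod, an equivalent check is `kubectl exec <driver-pod> -- env | grep OSS_`.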
Use Hadoop OSS SDK
- Create `spark-oss-secret.yaml`:

  ```yaml
  apiVersion: v1
  kind: Secret
  metadata:
    name: spark-oss-secret
    namespace: default
  stringData:
    # Replace <ACCESS_KEY_ID> with the AccessKey ID of your Alibaba Cloud account.
    OSS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
    # Replace <ACCESS_KEY_SECRET> with the AccessKey Secret of your Alibaba Cloud account.
    OSS_ACCESS_KEY_SECRET: <ACCESS_KEY_SECRET>
  ```

- Apply the Secret:

  ```bash
  kubectl apply -f spark-oss-secret.yaml
  ```

  Expected output:

  ```
  secret/spark-oss-secret created
  ```
Use Hadoop S3 SDK
The Hadoop S3 SDK reads credentials from AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, which is why the key names differ from the OSS SDK variant.
- Create `spark-s3-secret.yaml`:

  ```yaml
  apiVersion: v1
  kind: Secret
  metadata:
    name: spark-s3-secret
    namespace: default
  stringData:
    # Replace <ACCESS_KEY_ID> with the AccessKey ID of your Alibaba Cloud account.
    AWS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
    # Replace <ACCESS_KEY_SECRET> with the AccessKey Secret of your Alibaba Cloud account.
    AWS_SECRET_ACCESS_KEY: <ACCESS_KEY_SECRET>
  ```

- Apply the Secret:

  ```bash
  kubectl apply -f spark-s3-secret.yaml
  ```

  Expected output:

  ```
  secret/spark-s3-secret created
  ```
Use JindoSDK
- Create `spark-oss-secret.yaml`:

  ```yaml
  apiVersion: v1
  kind: Secret
  metadata:
    name: spark-oss-secret
    namespace: default
  stringData:
    # Replace <ACCESS_KEY_ID> with the AccessKey ID of your Alibaba Cloud account.
    OSS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
    # Replace <ACCESS_KEY_SECRET> with the AccessKey Secret of your Alibaba Cloud account.
    OSS_ACCESS_KEY_SECRET: <ACCESS_KEY_SECRET>
  ```

- Apply the Secret:

  ```bash
  kubectl apply -f spark-oss-secret.yaml
  ```

  Expected output:

  ```
  secret/spark-oss-secret created
  ```
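Whichever variant you chose, a quick local check that the manifest declares the key names the SDK expects can catch a typo before submission. A sketch against a sample manifest (placeholder values; the file name is illustrative):

```shell
# Write a sample Secret manifest with placeholder values, then verify
# it declares the env-var key names the Hadoop OSS SDK reads.
cat <<'EOF' > spark-oss-secret-sample.yaml
apiVersion: v1
kind: Secret
metadata:
  name: spark-oss-secret
  namespace: default
stringData:
  OSS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
  OSS_ACCESS_KEY_SECRET: <ACCESS_KEY_SECRET>
EOF

for key in OSS_ACCESS_KEY_ID OSS_ACCESS_KEY_SECRET; do
    if grep -q "^  $key:" spark-oss-secret-sample.yaml; then
        echo "$key: present"
    else
        echo "$key: MISSING"
    fi
done
```

For the Hadoop S3 SDK variant, check for `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` instead.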
Step 4: Submit the Spark job
Create and submit a SparkApplication manifest to run the PageRank job against your OSS dataset.
Use Hadoop OSS SDK
Create spark-pagerank.yaml. For a full list of OSS configuration parameters, see Hadoop-Aliyun module.
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: default
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark container image built in Step 2.
  image: <SPARK_IMAGE>
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
  mainClass: org.apache.spark.examples.SparkPageRank
  arguments:
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt  # The input test dataset. Replace <OSS_BUCKET> with your OSS bucket name.
  - "10"                                          # The number of iterations.
  sparkVersion: 3.5.5
  hadoopConf:
    fs.AbstractFileSystem.oss.impl: org.apache.hadoop.fs.aliyun.oss.OSS
    fs.oss.impl: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
    # The OSS endpoint. Replace <OSS_ENDPOINT> with your OSS endpoint.
    # For example, the internal endpoint for the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com.
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.oss.common.auth.EnvironmentVariableCredentialsProvider
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    envFrom:
    - secretRef:
        name: spark-oss-secret  # The Secret used to access OSS.
    serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    coreLimit: "2"
    memory: 8g
    envFrom:
    - secretRef:
        name: spark-oss-secret  # The Secret used to access OSS.
  restartPolicy:
    type: Never
```
Use Hadoop S3 SDK
Create spark-pagerank.yaml. For a full list of S3 configuration parameters, see Hadoop-AWS module.
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: default
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark container image built in Step 2.
  image: <SPARK_IMAGE>
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
  mainClass: org.apache.spark.examples.SparkPageRank
  arguments:
  - s3a://<OSS_BUCKET>/data/pagerank_dataset.txt  # The input test dataset. Replace <OSS_BUCKET> with your OSS bucket name.
  - "10"                                          # The number of iterations.
  sparkVersion: 3.5.5
  hadoopConf:
    fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    # The OSS endpoint. Replace <OSS_ENDPOINT> with your OSS endpoint.
    # For example, the internal endpoint for the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com.
    fs.s3a.endpoint: <OSS_ENDPOINT>
    # The region where the OSS endpoint is located. For example, cn-beijing for the China (Beijing) region.
    fs.s3a.endpoint.region: <OSS_REGION>
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    envFrom:
    - secretRef:
        name: spark-s3-secret  # The Secret used to access OSS.
    serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    coreLimit: "2"
    memory: 8g
    envFrom:
    - secretRef:
        name: spark-s3-secret  # The Secret used to access OSS.
  restartPolicy:
    type: Never
```
Use JindoSDK
Create spark-pagerank.yaml:
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: default
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark container image built in Step 2.
  image: <SPARK_IMAGE>
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
  mainClass: org.apache.spark.examples.SparkPageRank
  arguments:
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt  # The input test dataset. Replace <OSS_BUCKET> with your OSS bucket name.
  - "10"                                          # The number of iterations.
  sparkVersion: 3.5.5
  hadoopConf:
    fs.AbstractFileSystem.oss.impl: com.aliyun.jindodata.oss.JindoOSS
    fs.oss.impl: com.aliyun.jindodata.oss.JindoOssFileSystem
    # The OSS endpoint. Replace <OSS_ENDPOINT> with your OSS endpoint.
    # For example, the internal endpoint for the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com.
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    serviceAccount: spark-operator-spark
    envFrom:
    - secretRef:
        name: spark-oss-secret  # The Secret used to access OSS.
  executor:
    instances: 2
    cores: 1
    coreLimit: "2"
    memory: 8g
    envFrom:
    - secretRef:
        name: spark-oss-secret  # The Secret used to access OSS.
  restartPolicy:
    type: Never
```
Submit and verify the job
The following commands apply to all three SDK options.
- Submit the job:

  ```bash
  kubectl apply -f spark-pagerank.yaml
  ```

- Monitor the job status:

  ```bash
  kubectl get sparkapplications spark-pagerank
  ```

  Expected output when the job completes:

  ```
  NAME             STATUS      ATTEMPTS   START                  FINISH                 AGE
  spark-pagerank   COMPLETED   1          2024-10-09T12:54:25Z   2024-10-09T12:55:46Z   90s
  ```

- View the last 20 log lines from the driver pod:

  ```bash
  kubectl logs spark-pagerank-driver --tail=20
  ```

  Expected output (Hadoop OSS SDK):

  ```
  30024 has rank: 1.0709659078941967 .
  21390 has rank: 0.9933356174074005 .
  28500 has rank: 1.0404018494028928 .
  2137 has rank: 0.9931000490520374 .
  3406 has rank: 0.9562543137167121 .
  20904 has rank: 0.8827028621652337 .
  25604 has rank: 1.0270134041934191 .
  24/10/09 12:48:36 INFO SparkUI: Stopped Spark web UI at http://spark-pagerank-dd0d4d927151c9d0-driver-svc.default.svc:4040
  24/10/09 12:48:36 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
  24/10/09 12:48:36 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
  24/10/09 12:48:36 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
  24/10/09 12:48:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
  24/10/09 12:48:36 INFO MemoryStore: MemoryStore cleared
  24/10/09 12:48:36 INFO BlockManager: BlockManager stopped
  24/10/09 12:48:36 INFO BlockManagerMaster: BlockManagerMaster stopped
  24/10/09 12:48:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
  24/10/09 12:48:36 INFO SparkContext: Successfully stopped SparkContext
  24/10/09 12:48:36 INFO ShutdownHookManager: Shutdown hook called
  24/10/09 12:48:36 INFO ShutdownHookManager: Deleting directory /tmp/spark-e8b8c2ab-c916-4f84-b60f-f54c0de3a7f0
  24/10/09 12:48:36 INFO ShutdownHookManager: Deleting directory /var/data/spark-c5917d98-06fb-46fe-85bc-199b839cb885/spark-23e2c2ae-4754-43ae-854d-2752eb83b2c5
  ```

  Expected output (Hadoop S3 SDK):

  ```
  3406 has rank: 0.9562543137167121 .
  20904 has rank: 0.8827028621652337 .
  25604 has rank: 1.0270134041934191 .
  25/04/07 03:54:11 INFO SparkContext: SparkContext is stopping with exitCode 0.
  25/04/07 03:54:11 INFO SparkUI: Stopped Spark web UI at http://spark-pagerank-0f7dec960e615617-driver-svc.spark.svc:4040
  25/04/07 03:54:11 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
  25/04/07 03:54:11 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
  25/04/07 03:54:11 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
  25/04/07 03:54:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
  25/04/07 03:54:11 INFO MemoryStore: MemoryStore cleared
  25/04/07 03:54:11 INFO BlockManager: BlockManager stopped
  25/04/07 03:54:11 INFO BlockManagerMaster: BlockManagerMaster stopped
  25/04/07 03:54:11 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
  25/04/07 03:54:11 INFO SparkContext: Successfully stopped SparkContext
  25/04/07 03:54:11 INFO ShutdownHookManager: Shutdown hook called
  25/04/07 03:54:11 INFO ShutdownHookManager: Deleting directory /var/data/spark-20d425bb-f442-4b0a-83e2-5a0202959a54/spark-ff5bbf08-4343-4a7a-9ce0-3f7c127cf4a9
  25/04/07 03:54:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-a421839a-07af-49c0-b637-f15f76c3e752
  25/04/07 03:54:11 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
  25/04/07 03:54:11 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
  25/04/07 03:54:11 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
  ```

  Expected output (JindoSDK):

  ```
  21390 has rank: 0.9933356174074005 .
  28500 has rank: 1.0404018494028928 .
  2137 has rank: 0.9931000490520374 .
  3406 has rank: 0.9562543137167121 .
  20904 has rank: 0.8827028621652337 .
  25604 has rank: 1.0270134041934191 .
  24/10/09 12:55:44 INFO SparkContext: SparkContext is stopping with exitCode 0.
  24/10/09 12:55:44 INFO SparkUI: Stopped Spark web UI at http://spark-pagerank-6a5e3d9271584856-driver-svc.default.svc:4040
  24/10/09 12:55:44 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
  24/10/09 12:55:44 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
  24/10/09 12:55:44 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
  24/10/09 12:55:45 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
  24/10/09 12:55:45 INFO MemoryStore: MemoryStore cleared
  24/10/09 12:55:45 INFO BlockManager: BlockManager stopped
  24/10/09 12:55:45 INFO BlockManagerMaster: BlockManagerMaster stopped
  24/10/09 12:55:45 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
  24/10/09 12:55:45 INFO SparkContext: Successfully stopped SparkContext
  24/10/09 12:55:45 INFO ShutdownHookManager: Shutdown hook called
  24/10/09 12:55:45 INFO ShutdownHookManager: Deleting directory /var/data/spark-87e8406e-06a7-4b4a-b18f-2193da299d35/spark-093a1b71-121a-4367-9d22-ad4e397c9815
  24/10/09 12:55:45 INFO ShutdownHookManager: Deleting directory /tmp/spark-723e2039-a493-49e8-b86d-fff5fd1bb168
  ```
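The rank lines in the driver log can be post-processed locally, for example to list the highest-ranked pages. The sketch below runs on a few sample lines taken from the output above (the file name is illustrative); field 1 is the page ID and field 4 is the rank value:

```shell
# Sample "page has rank: value ." lines as printed by SparkPageRank.
cat <<'EOF' > ranks.txt
30024 has rank: 1.0709659078941967 .
21390 has rank: 0.9933356174074005 .
28500 has rank: 1.0404018494028928 .
20904 has rank: 0.8827028621652337 .
EOF

# Sort descending by rank (field 4, general-numeric) and print the top two.
sort -k4,4 -gr ranks.txt | head -n 2 | awk '{ print $1, $4 }'
```

Against a real run, pipe the full driver log through `grep 'has rank:'` first to drop the framework log lines.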
(Optional) Step 5: Clean up
Delete the Spark job and the Secret when you no longer need them.
Delete the Spark job:

```bash
kubectl delete -f spark-pagerank.yaml
```

Delete the Secret:

- Hadoop OSS SDK or JindoSDK:

  ```bash
  kubectl delete -f spark-oss-secret.yaml
  ```

- Hadoop S3 SDK:

  ```bash
  kubectl delete -f spark-s3-secret.yaml
  ```