
Container Service for Kubernetes: Read and write OSS data in Spark jobs

Last Updated: Mar 26, 2026

This tutorial shows how to run Spark jobs that read and write OSS data in ACK clusters, using the built-in PageRank job as a worked example. It covers three OSS integration options (Hadoop OSS SDK, Hadoop S3 SDK, and JindoSDK) and walks through every step, from dataset generation to job verification.

Prerequisites

Before you begin, ensure that you have:

  * An ACK cluster that you can access with kubectl.
  * The Spark operator installed in the cluster. The manifests in this topic use the spark-operator-spark service account provided by the operator.
  * An OSS bucket for the test data, and the ossutil tool installed and configured to upload to it.

Choose an OSS integration

Three SDKs are available for OSS access. Choose one before proceeding.

  * Hadoop OSS SDK (oss:// URIs): native OSS support through the hadoop-aliyun filesystem; straightforward for most OSS workloads.
  * Hadoop S3 SDK (s3a:// URIs): your cluster or tooling already uses S3-compatible APIs, and you need no Aliyun-specific dependency.
  * JindoSDK (oss:// URIs): optimized OSS performance with Alibaba Cloud's Jindo acceleration layer.

Apply the same SDK choice consistently through all steps.

Step 1: Prepare and upload test data

Generate a PageRank dataset and upload it to your OSS bucket.

  1. Create a file named generate_pagerank_dataset.sh with the following content:

    #!/bin/bash
    
    # Check the number of arguments
    if [ "$#" -ne 2 ]; then
        echo "Usage: $0 M N"
        echo "M: Number of web pages"
        echo "N: Number of records to generate"
        exit 1
    fi
    
    M=$1
    N=$2
    
    # Verify that M and N are positive integers
    if ! [[ "$M" =~ ^[1-9][0-9]*$ ]] || ! [[ "$N" =~ ^[1-9][0-9]*$ ]]; then
        echo "Both M and N must be positive integers."
        exit 1
    fi
    
    # Bash RANDOM only yields values in 0..32767, so two draws are
    # combined to cover page IDs above 32768 (needed when M > 32768).
    rand_page() {
        echo $(( (RANDOM * 32768 + RANDOM) % M + 1 ))
    }
    
    # Generate dataset
    for ((i = 1; i <= N; i++)); do
        # Ensure the source and target pages are different
        while true; do
            src=$(rand_page)
            dst=$(rand_page)
            if [ "$src" -ne "$dst" ]; then
                echo "$src $dst"
                break
            fi
        done
    done
  2. Generate the dataset. A pure-Bash loop is slow when N reaches the millions; a faster awk alternative is shown after this list.

    M=100000    # The number of web pages
    N=10000000  # The number of records
    
    # Generate the dataset randomly and save it as pagerank_dataset.txt
    bash generate_pagerank_dataset.sh $M $N > pagerank_dataset.txt
  3. Upload the dataset to the data/ directory in your OSS bucket:

    ossutil cp pagerank_dataset.txt oss://<BUCKET_NAME>/data/
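
A faster generator, as a sketch: assuming M and N are set as in substep 2, this awk program emits the same src dst pairs without spawning a subshell per draw, and awk's rand() covers the full page range directly:

# Same output format as generate_pagerank_dataset.sh, but much faster.
awk -v M="$M" -v N="$N" 'BEGIN {
    srand()
    for (i = 1; i <= N; i++) {
        do {
            src = int(rand() * M) + 1
            dst = int(rand() * M) + 1
        } while (src == dst)
        print src, dst
    }
}' > pagerank_dataset.txt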

Step 2: Build a Spark container image

Build a container image that includes the JAR dependencies required for OSS access. For instructions on building images with Alibaba Cloud Container Registry, see Use a Container Registry Enterprise Edition instance to build an image.

The base Spark image in the sample Dockerfiles comes from the open source community. Replace it with your own image as needed, and match the hadoop-aliyun or hadoop-aws version to the Hadoop version bundled with your Spark distribution (Spark 3.5.x bundles Hadoop 3.3.4, which is why the examples below use 3.3.4).
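
Whichever SDK you choose below, the build-and-push pattern is the same. A minimal sketch, with the registry address and repository name as placeholders that you must replace with your own values:

# Build the image from the chosen Dockerfile, then push it to your registry.
docker build -t registry.<REGION>.aliyuncs.com/<NAMESPACE>/spark-oss:3.5.5 .
docker push registry.<REGION>.aliyuncs.com/<NAMESPACE>/spark-oss:3.5.5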

Use Hadoop OSS SDK

This example uses Spark 3.5.5 and Hadoop OSS SDK 3.3.4:

ARG SPARK_IMAGE=spark:3.5.5

FROM ${SPARK_IMAGE}

# Add dependencies for Hadoop Aliyun OSS support
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aliyun/3.3.4/hadoop-aliyun-3.3.4.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.17.4/aliyun-sdk-oss-3.17.4.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/jdom/jdom2/2.0.6.1/jdom2-2.0.6.1.jar ${SPARK_HOME}/jars

Use Hadoop S3 SDK

This example uses Spark 3.5.5 and Hadoop S3 SDK 3.3.4:

ARG SPARK_IMAGE=spark:3.5.5

FROM ${SPARK_IMAGE}

# Add dependencies for Hadoop AWS S3 support
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar ${SPARK_HOME}/jars

Use JindoSDK

This example uses Spark 3.5.5 and JindoSDK 6.8.0:

ARG SPARK_IMAGE=spark:3.5.5

FROM ${SPARK_IMAGE}

# Add dependencies for JindoSDK support
ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-core/6.8.0/jindo-core-6.8.0.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://jindodata-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/com/aliyun/jindodata/jindo-sdk/6.8.0/jindo-sdk-6.8.0.jar ${SPARK_HOME}/jars
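
Once an image is built, a quick sanity check confirms that the jars actually landed on the Spark classpath (the image tag is a placeholder):

# Override the image entrypoint to list the jar directory.
docker run --rm --entrypoint ls <SPARK_IMAGE> /opt/spark/jars | grep -Ei 'aliyun|aws|jindo'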

Step 3: Store OSS credentials in a Kubernetes Secret

Store your OSS AccessKey ID and AccessKey Secret in a Kubernetes Secret rather than embedding them in the SparkApplication manifest. The Spark operator injects Secret keys as environment variables into the driver and executor pods at runtime.

The hadoopConf fields in Step 4 use EnvironmentVariableCredentialsProvider, which reads credentials from these environment variables automatically. Never embed credentials in the SparkApplication manifest itself, and never commit files that contain credentials to source control.
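
If you prefer not to write credentials into a manifest file at all, an equivalent Secret can be created directly from the command line. A sketch for the OSS SDK key names (the S3 variant differs only in the Secret name and key names):

kubectl create secret generic spark-oss-secret \
  --namespace default \
  --from-literal=OSS_ACCESS_KEY_ID=<ACCESS_KEY_ID> \
  --from-literal=OSS_ACCESS_KEY_SECRET=<ACCESS_KEY_SECRET>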

Use Hadoop OSS SDK

  1. Create spark-oss-secret.yaml:

    apiVersion: v1
    kind: Secret
    metadata:
      name: spark-oss-secret
      namespace: default
    stringData:
      # Replace <ACCESS_KEY_ID> with the AccessKey ID of your Alibaba Cloud account.
      OSS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
      # Replace <ACCESS_KEY_SECRET> with the AccessKey Secret of your Alibaba Cloud account.
      OSS_ACCESS_KEY_SECRET: <ACCESS_KEY_SECRET>
  2. Apply the Secret:

    kubectl apply -f spark-oss-secret.yaml

    Expected output:

    secret/spark-oss-secret created
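
To confirm what the Secret stores, you can decode one of its keys:

kubectl get secret spark-oss-secret -o jsonpath='{.data.OSS_ACCESS_KEY_ID}' | base64 --decode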

Use Hadoop S3 SDK

The Hadoop S3 SDK reads credentials from AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, which is why the key names differ from the OSS SDK variant.

  1. Create spark-s3-secret.yaml:

    apiVersion: v1
    kind: Secret
    metadata:
      name: spark-s3-secret
      namespace: default
    stringData:
      # Replace <ACCESS_KEY_ID> with the AccessKey ID of your Alibaba Cloud account.
      AWS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
      # Replace <ACCESS_KEY_SECRET> with the AccessKey Secret of your Alibaba Cloud account.
      AWS_SECRET_ACCESS_KEY: <ACCESS_KEY_SECRET>
  2. Apply the Secret:

    kubectl apply -f spark-s3-secret.yaml

    Expected output:

    secret/spark-s3-secret created

Use JindoSDK

  1. Create spark-oss-secret.yaml:

    apiVersion: v1
    kind: Secret
    metadata:
      name: spark-oss-secret
      namespace: default
    stringData:
      # Replace <ACCESS_KEY_ID> with the AccessKey ID of your Alibaba Cloud account.
      OSS_ACCESS_KEY_ID: <ACCESS_KEY_ID>
      # Replace <ACCESS_KEY_SECRET> with the AccessKey Secret of your Alibaba Cloud account.
      OSS_ACCESS_KEY_SECRET: <ACCESS_KEY_SECRET>
  2. Apply the Secret:

    kubectl apply -f spark-oss-secret.yaml

    Expected output:

    secret/spark-oss-secret created

Step 4: Submit the Spark job

Create and submit a SparkApplication manifest to run the PageRank job against your OSS dataset.

Use Hadoop OSS SDK

Create spark-pagerank.yaml. For a full list of OSS configuration parameters, see Hadoop-Aliyun module.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: default
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark container image built in Step 2.
  image: <SPARK_IMAGE>
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
  mainClass: org.apache.spark.examples.SparkPageRank
  arguments:
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt           # Specify the input test dataset. Replace <OSS_BUCKET> with your OSS bucket name.
  - "10"                                                   # The number of iterations.
  sparkVersion: 3.5.5
  hadoopConf:
    fs.AbstractFileSystem.oss.impl: org.apache.hadoop.fs.aliyun.oss.OSS
    fs.oss.impl: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
    # OSS endpoint. Replace <OSS_ENDPOINT> with your OSS endpoint.
    # For example, the internal endpoint for the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com.
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.oss.common.auth.EnvironmentVariableCredentialsProvider
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    envFrom:
    - secretRef:
        name: spark-oss-secret               # Specify the Secret used to access OSS.
    serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    coreLimit: "2"
    memory: 8g
    envFrom:
    - secretRef:
        name: spark-oss-secret               # Specify the Secret used to access OSS.
  restartPolicy:
    type: Never

Use Hadoop S3 SDK

Create spark-pagerank.yaml. For a full list of S3 configuration parameters, see Hadoop-AWS module.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: default
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark container image built in Step 2.
  image: <SPARK_IMAGE>
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
  mainClass: org.apache.spark.examples.SparkPageRank
  arguments:
  - s3a://<OSS_BUCKET>/data/pagerank_dataset.txt           # Specify the input test dataset. Replace <OSS_BUCKET> with your OSS bucket name.
  - "10"                                                   # The number of iterations.
  sparkVersion: 3.5.5
  hadoopConf:
    fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    # OSS endpoint. Replace <OSS_ENDPOINT> with your OSS endpoint.
    # For example, the internal endpoint for the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com.
    fs.s3a.endpoint: <OSS_ENDPOINT>
    # The region where the OSS endpoint is located. For example, cn-beijing for the China (Beijing) region.
    fs.s3a.endpoint.region: <OSS_REGION>
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    envFrom:
    - secretRef:
        name: spark-s3-secret               # Specify the Secret used to access OSS.
    serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    coreLimit: "2"
    memory: 8g
    envFrom:
    - secretRef:
        name: spark-s3-secret               # Specify the Secret used to access OSS.
  restartPolicy:
    type: Never

Use JindoSDK

Create spark-pagerank.yaml:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: default
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark container image built in Step 2.
  image: <SPARK_IMAGE>
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
  mainClass: org.apache.spark.examples.SparkPageRank
  arguments:
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt    # Specify the input test dataset. Replace <OSS_BUCKET> with your OSS bucket name.
  - "10"                                            # The number of iterations.
  sparkVersion: 3.5.5
  hadoopConf:
    fs.AbstractFileSystem.oss.impl: com.aliyun.jindodata.oss.JindoOSS
    fs.oss.impl: com.aliyun.jindodata.oss.JindoOssFileSystem
    # OSS endpoint. Replace <OSS_ENDPOINT> with your OSS endpoint.
    # For example, the internal endpoint for the China (Beijing) region is oss-cn-beijing-internal.aliyuncs.com.
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.jindodata.oss.auth.EnvironmentVariableCredentialsProvider
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    serviceAccount: spark-operator-spark
    envFrom:
    - secretRef:
        name: spark-oss-secret                    # Specify the Secret used to access OSS.
  executor:
    instances: 2
    cores: 1
    coreLimit: "2"
    memory: 8g
    envFrom:
    - secretRef:
        name: spark-oss-secret                    # Specify the Secret used to access OSS.
  restartPolicy:
    type: Never
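
Before submitting any of the manifests, you can validate it against the SparkApplication CRD with a server-side dry run; nothing is created:

kubectl apply --dry-run=server -f spark-pagerank.yaml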

Submit and verify the job

The following commands apply to all three SDK options. If the job does not reach COMPLETED, see the troubleshooting commands after the expected outputs.

  1. Submit the job:

    kubectl apply -f spark-pagerank.yaml
  2. Monitor job status:

    kubectl get sparkapplications spark-pagerank

    Expected output when the job completes:

    NAME             STATUS      ATTEMPTS   START                  FINISH                 AGE
    spark-pagerank   COMPLETED   1          2024-10-09T12:54:25Z   2024-10-09T12:55:46Z   90s
  3. View the last 20 log lines from the driver pod:

    kubectl logs spark-pagerank-driver --tail=20

    Hadoop OSS SDK — expected output:

    30024 has rank:  1.0709659078941967 .
    21390 has rank:  0.9933356174074005 .
    28500 has rank:  1.0404018494028928 .
    2137 has rank:  0.9931000490520374 .
    3406 has rank:  0.9562543137167121 .
    20904 has rank:  0.8827028621652337 .
    25604 has rank:  1.0270134041934191 .
    24/10/09 12:48:36 INFO SparkUI: Stopped Spark web UI at http://spark-pagerank-dd0d4d927151c9d0-driver-svc.default.svc:4040
    24/10/09 12:48:36 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
    24/10/09 12:48:36 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
    24/10/09 12:48:36 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
    24/10/09 12:48:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    24/10/09 12:48:36 INFO MemoryStore: MemoryStore cleared
    24/10/09 12:48:36 INFO BlockManager: BlockManager stopped
    24/10/09 12:48:36 INFO BlockManagerMaster: BlockManagerMaster stopped
    24/10/09 12:48:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    24/10/09 12:48:36 INFO SparkContext: Successfully stopped SparkContext
    24/10/09 12:48:36 INFO ShutdownHookManager: Shutdown hook called
    24/10/09 12:48:36 INFO ShutdownHookManager: Deleting directory /tmp/spark-e8b8c2ab-c916-4f84-b60f-f54c0de3a7f0
    24/10/09 12:48:36 INFO ShutdownHookManager: Deleting directory /var/data/spark-c5917d98-06fb-46fe-85bc-199b839cb885/spark-23e2c2ae-4754-43ae-854d-2752eb83b2c5

    Hadoop S3 SDK — expected output:

    3406 has rank:  0.9562543137167121 .
    20904 has rank:  0.8827028621652337 .
    25604 has rank:  1.0270134041934191 .
    25/04/07 03:54:11 INFO SparkContext: SparkContext is stopping with exitCode 0.
    25/04/07 03:54:11 INFO SparkUI: Stopped Spark web UI at http://spark-pagerank-0f7dec960e615617-driver-svc.spark.svc:4040
    25/04/07 03:54:11 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
    25/04/07 03:54:11 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
    25/04/07 03:54:11 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
    25/04/07 03:54:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    25/04/07 03:54:11 INFO MemoryStore: MemoryStore cleared
    25/04/07 03:54:11 INFO BlockManager: BlockManager stopped
    25/04/07 03:54:11 INFO BlockManagerMaster: BlockManagerMaster stopped
    25/04/07 03:54:11 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    25/04/07 03:54:11 INFO SparkContext: Successfully stopped SparkContext
    25/04/07 03:54:11 INFO ShutdownHookManager: Shutdown hook called
    25/04/07 03:54:11 INFO ShutdownHookManager: Deleting directory /var/data/spark-20d425bb-f442-4b0a-83e2-5a0202959a54/spark-ff5bbf08-4343-4a7a-9ce0-3f7c127cf4a9
    25/04/07 03:54:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-a421839a-07af-49c0-b637-f15f76c3e752
    25/04/07 03:54:11 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
    25/04/07 03:54:11 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
    25/04/07 03:54:11 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

    JindoSDK — expected output:

    21390 has rank:  0.9933356174074005 .
    28500 has rank:  1.0404018494028928 .
    2137 has rank:  0.9931000490520374 .
    3406 has rank:  0.9562543137167121 .
    20904 has rank:  0.8827028621652337 .
    25604 has rank:  1.0270134041934191 .
    24/10/09 12:55:44 INFO SparkContext: SparkContext is stopping with exitCode 0.
    24/10/09 12:55:44 INFO SparkUI: Stopped Spark web UI at http://spark-pagerank-6a5e3d9271584856-driver-svc.default.svc:4040
    24/10/09 12:55:44 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
    24/10/09 12:55:44 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
    24/10/09 12:55:44 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
    24/10/09 12:55:45 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    24/10/09 12:55:45 INFO MemoryStore: MemoryStore cleared
    24/10/09 12:55:45 INFO BlockManager: BlockManager stopped
    24/10/09 12:55:45 INFO BlockManagerMaster: BlockManagerMaster stopped
    24/10/09 12:55:45 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    24/10/09 12:55:45 INFO SparkContext: Successfully stopped SparkContext
    24/10/09 12:55:45 INFO ShutdownHookManager: Shutdown hook called
    24/10/09 12:55:45 INFO ShutdownHookManager: Deleting directory /var/data/spark-87e8406e-06a7-4b4a-b18f-2193da299d35/spark-093a1b71-121a-4367-9d22-ad4e397c9815
    24/10/09 12:55:45 INFO ShutdownHookManager: Deleting directory /tmp/spark-723e2039-a493-49e8-b86d-fff5fd1bb168
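
If the application reports FAILED instead, the driver log and the operator's record of the application usually point at the cause; OSS access problems typically surface in the driver log as 403 or InvalidAccessKeyId errors.

# Full driver log, including the stack trace of any failure
kubectl logs spark-pagerank-driver

# State transitions and error messages recorded by the operator
kubectl describe sparkapplication spark-pagerank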

(Optional) Step 5: Clean up

Delete the Spark job and the Secret when you no longer need them.

Delete the Spark job:

kubectl delete -f spark-pagerank.yaml

Delete the Secret:

Hadoop OSS SDK or JindoSDK:

kubectl delete -f spark-oss-secret.yaml

Hadoop S3 SDK:

kubectl delete -f spark-s3-secret.yaml
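
To also remove the test dataset from your bucket:

ossutil rm oss://<BUCKET_NAME>/data/pagerank_dataset.txt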

What's next