Spark on Container Service for Kubernetes (ACK) lets you build an efficient, flexible, and scalable big data processing platform on Kubernetes without managing the underlying infrastructure. It extends the open-source Spark Operator with ACK-native capabilities: deep integration with Alibaba Cloud storage, observability, and elastic computing resources.
Billing
Installing Spark-related ACK components (ack-spark-operator, ack-spark-history-server, and others) is free. Standard ACK cluster fees — cluster management fees and associated cloud resource fees — apply. For details, see Billing overview.
Additional fees from other cloud products may apply. For example, Simple Log Service charges for log collection, and OSS or NAS charges apply for data read and write operations by Spark jobs.
Getting started
Running Spark jobs on ACK follows a layered setup: start with the basics, add observability, then tune for performance.

Prerequisites
Before you begin, make sure you have:
A running ACK cluster with kubectl access configured
Sufficient permissions to create and manage pods in your cluster. Run the following command to verify:
kubectl auth can-i create pods
A dedicated namespace for Spark jobs (this guide uses spark):
kubectl create namespace spark
A service account for Spark driver pods. The driver pod requires permissions to create, list, and delete executor pods and services. Ensure the service account (for example, spark-operator-spark) has the appropriate RBAC permissions in your job namespace before submitting jobs.
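If the service account and its permissions do not exist yet, a minimal RBAC setup might look like the following sketch. The Role's resource list and the names are illustrative; the ack-spark-operator Helm chart can also create an equivalent account for you.

```yaml
# Illustrative sketch: a service account with the RBAC the Spark driver needs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-operator-spark
  namespace: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-driver-role
  namespace: spark
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["create", "get", "list", "watch", "delete", "deletecollection"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-driver-rolebinding
  namespace: spark
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-driver-role
subjects:
- kind: ServiceAccount
  name: spark-operator-spark
  namespace: spark
```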
Basic usage
Step 1: Build a Spark container image
Use the open-source Spark image directly, or customize it to add dependencies such as OSS support or Celeborn RSS. The following Dockerfile adds the common dependencies used in this guide.
ARG SPARK_IMAGE=spark:3.5.4
FROM ${SPARK_IMAGE}
# Add dependency for Hadoop Aliyun OSS support
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aliyun/3.3.4/hadoop-aliyun-3.3.4.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.17.4/aliyun-sdk-oss-3.17.4.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/jdom/jdom2/2.0.6.1/jdom2-2.0.6.1.jar ${SPARK_HOME}/jars
# Add dependency for log4j-layout-template-json
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-layout-template-json/2.24.1/log4j-layout-template-json-2.24.1.jar ${SPARK_HOME}/jars
# Add dependency for Celeborn
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/celeborn/celeborn-client-spark-3-shaded_2.12/0.5.3/celeborn-client-spark-3-shaded_2.12-0.5.3.jar ${SPARK_HOME}/jars
Build the image and push it to your image repository, then reference it in your SparkApplication resources.
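With the Dockerfile above, the build-and-push flow might look like this; the registry address, namespace, and tag are placeholders for your own image repository:

```shell
# Build the custom Spark image (registry and tag are illustrative)
docker build -t registry.cn-beijing.aliyuncs.com/<NAMESPACE>/spark:3.5.4-ack .

# Push it to your image repository
docker push registry.cn-beijing.aliyuncs.com/<NAMESPACE>/spark:3.5.4-ack
```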
Step 2: Deploy Spark Operator and run your first job
Deploy the ack-spark-operator component and set spark.jobNamespaces=["spark"] so it only watches jobs in the spark namespace.
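If you deploy the component with Helm rather than through the console, passing the namespace scope might look like the following sketch; the chart source and release namespace are placeholders:

```shell
# Illustrative: install ack-spark-operator, watching only the spark namespace
helm install ack-spark-operator <CHART_SOURCE> \
  --namespace spark-operator \
  --create-namespace \
  --set 'spark.jobNamespaces={spark}'
```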
The following is a minimal SparkApplication that runs the SparkPi example — enough to verify that Spark Operator is working:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark  # Must be in the namespace list specified by spark.jobNamespaces
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with your own Spark container image
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  - "5000"
  sparkVersion: 3.5.4
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
    serviceAccount: spark-operator-spark
  executor:
    instances: 1
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
  restartPolicy:
    type: Never
A restartPolicy of type Never is appropriate for batch jobs that should not retry on failure. For production pipelines that require automatic retries, set the type to OnFailure and configure onFailureRetries and onFailureRetryInterval.
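Assuming the manifest above is saved as spark-pi.yaml, submitting and checking the job looks like the following; the driver pod name follows the operator's <job-name>-driver convention:

```shell
# Submit the job
kubectl apply -f spark-pi.yaml

# Watch the application state until it reaches COMPLETED
kubectl get sparkapplication spark-pi -n spark -w

# Inspect driver logs once the driver pod is running
kubectl logs spark-pi-driver -n spark
```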
For more information, see Use Spark Operator to run Spark jobs.
Step 3: Read and write OSS data
Spark jobs can access OSS using the Hadoop Aliyun SDK, the Hadoop AWS SDK, or JindoSDK. Include the corresponding dependencies in your container image and configure the Hadoop parameters in the job.
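The examples below read OSS credentials from a Secret named spark-oss-secret that is injected into the pods as environment variables. One way to create it is shown in the following sketch; the key names must match what EnvironmentVariableCredentialsProvider reads, namely OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET:

```shell
# Store OSS credentials in a Secret; the env var names match what
# EnvironmentVariableCredentialsProvider expects.
kubectl create secret generic spark-oss-secret \
  --namespace spark \
  --from-literal=OSS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID> \
  --from-literal=OSS_ACCESS_KEY_SECRET=<YOUR_ACCESS_KEY_SECRET>
```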
This example runs SparkPageRank and reads input data from OSS. Upload your test dataset to OSS first — see Read and write OSS data in Spark jobs for instructions.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with your own Spark image
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPageRank
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  # Replace <OSS_BUCKET> with your OSS bucket name
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt
  # Number of iterations
  - "10"
  sparkVersion: 3.5.4
  hadoopConf:
    fs.oss.impl: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
    # Replace <OSS_ENDPOINT> with the OSS endpoint, for example oss-cn-beijing-internal.aliyuncs.com
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.oss.common.auth.EnvironmentVariableCredentialsProvider
  driver:
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
          envFrom:
          # Read OSS credentials from a Kubernetes Secret
          - secretRef:
              name: spark-oss-secret
    serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
          envFrom:
          - secretRef:
              name: spark-oss-secret
  restartPolicy:
    type: Never
For more information, see Read and write OSS data in Spark jobs.
Observability
Deploy Spark History Server
Deploy ack-spark-history-server in the spark namespace. It reads Spark event logs from a configured storage backend (PVC, OSS/OSS-HDFS, or HDFS) and exposes them through a web UI.
The following example configures Spark History Server to read event logs from a NAS file system at /spark/event-logs:
# Spark configuration
sparkConf:
  spark.history.fs.logDirectory: file:///mnt/nas/spark/event-logs
# Environment variables
env:
- name: SPARK_DAEMON_MEMORY
  value: 7g
# Data volume
volumes:
- name: nas
  persistentVolumeClaim:
    claimName: nas-pvc
# Data volume mount
volumeMounts:
- name: nas
  subPath: spark/event-logs
  mountPath: /mnt/nas/spark/event-logs
# Adjust resource size based on the number and scale of Spark jobs
resources:
  requests:
    cpu: 2
    memory: 8Gi
  limits:
    cpu: 2
    memory: 8Gi
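Both the History Server and the jobs below mount a PersistentVolumeClaim named nas-pvc. The following is a hedged sketch of the backing PV and PVC, assuming a statically provisioned NAS file system exposed through the ACK CSI plugin; the file system ID, region, and capacity are placeholders:

```yaml
# Illustrative: statically provisioned NAS volume for Spark event logs
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: nas-pv
    volumeAttributes:
      server: <NAS_FILE_SYSTEM_ID>.cn-beijing.nas.aliyuncs.com
      path: /
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nas-pvc
  namespace: spark
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 100Gi
  volumeName: nas-pv
```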
Mount the same NAS file system in your Spark jobs and configure spark.eventLog.dir to write event logs to the same path. The following example shows a complete job with event logging enabled:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with your Spark image
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  - "5000"
  sparkVersion: 3.5.4
  sparkConf:
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: file:///mnt/nas/spark/event-logs
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
          volumeMounts:
          - name: nas
            subPath: spark/event-logs
            mountPath: /mnt/nas/spark/event-logs
        volumes:
        - name: nas
          persistentVolumeClaim:
            claimName: nas-pvc
    serviceAccount: spark-operator-spark
  executor:
    instances: 1
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
  restartPolicy:
    type: Never
For more information, see Use Spark History Server to view information about Spark jobs.
Collect Spark logs with Simple Log Service
When running many jobs in a cluster, use Simple Log Service to centrally collect stdout and stderr logs from all Spark containers for querying and analysis.
This example configures Simple Log Service to collect logs from /opt/spark/logs/*.log in Spark containers.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark image built in step one
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  - "5000"
  sparkVersion: 3.5.4
  # Read log4j2.properties from the specified ConfigMap
  sparkConfigMap: spark-log-conf
  sparkConf:
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: file:///mnt/nas/spark/event-logs
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
          volumeMounts:
          - name: nas
            subPath: spark/event-logs
            mountPath: /mnt/nas/spark/event-logs
        volumes:
        - name: nas
          persistentVolumeClaim:
            claimName: nas-pvc
    serviceAccount: spark-operator-spark
  executor:
    instances: 1
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
  restartPolicy:
    type: Never
For more information, see Use Simple Log Service to collect the logs of Spark jobs.
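The job above references sparkConfigMap: spark-log-conf. A minimal sketch of such a ConfigMap is shown below, assuming you want file-based JSON logs under /opt/spark/logs for the collector to pick up; the exact log4j2 settings are illustrative, and JsonTemplateLayout relies on the log4j-layout-template-json jar added in the Dockerfile earlier:

```yaml
# Illustrative: log4j2 configuration that writes JSON logs to a file
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-log-conf
  namespace: spark
data:
  log4j2.properties: |
    rootLogger.level = info
    rootLogger.appenderRef.file.ref = FileAppender
    appender.file.type = File
    appender.file.name = FileAppender
    appender.file.fileName = /opt/spark/logs/spark.log
    appender.file.layout.type = JsonTemplateLayout
```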
Performance optimization
Improve shuffle performance with RSS
Shuffle operations involve significant disk I/O, data serialization, and network I/O — common sources of OOM errors and fetch failures in large-scale jobs. Configure Apache Celeborn as the Remote Shuffle Service to achieve storage-compute separation and improve shuffle stability.
Deploy the ack-celeborn component first, then reference it in your job configuration. All examples use spark.shuffle.manager: org.apache.spark.shuffle.celeborn.SparkShuffleManager and spark.celeborn.master.endpoints pointing to the Celeborn master pods.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with your Spark image
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPageRank
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt
  - "10"
  sparkVersion: 3.5.4
  hadoopConf:
    fs.oss.impl: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.oss.common.auth.EnvironmentVariableCredentialsProvider
  sparkConfigMap: spark-log-conf
  sparkConf:
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: file:///mnt/nas/spark/event-logs
    # Celeborn RSS configuration
    spark.shuffle.manager: org.apache.spark.shuffle.celeborn.SparkShuffleManager
    # KryoSerializer is required because Java serializer does not support relocation
    spark.serializer: org.apache.spark.serializer.KryoSerializer
    # Configure based on the number of Celeborn master replicas
    spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-1.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-2.celeborn-master-svc.celeborn.svc.cluster.local
    spark.celeborn.client.spark.shuffle.writer: hash
    spark.celeborn.client.push.replicate.enabled: "false"
    spark.sql.adaptive.localShuffleReader.enabled: "false"
    spark.sql.adaptive.enabled: "true"
    spark.sql.adaptive.skewJoin.enabled: "true"
    spark.shuffle.sort.io.plugin.class: org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO
    spark.dynamicAllocation.shuffleTracking.enabled: "false"
    spark.executor.userClassPathFirst: "false"
  driver:
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
          envFrom:
          - secretRef:
              name: spark-oss-secret
          volumeMounts:
          - name: nas
            subPath: spark/event-logs
            mountPath: /mnt/nas/spark/event-logs
        volumes:
        - name: nas
          persistentVolumeClaim:
            claimName: nas-pvc
    serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
          envFrom:
          - secretRef:
              name: spark-oss-secret
  restartPolicy:
    type: Never
For more information, see Use Celeborn as RSS in Spark jobs.
Define elastic resource scheduling priority
Use ECI-based pods with a ResourcePolicy to run Spark jobs on demand and pay only for actual resource usage. The ACK scheduler automatically assigns pods to ECS or ECI resources based on the configured strategy — no changes to the SparkApplication spec are required.
This example prioritizes ECS resources (up to 10 pods) and falls back to elastic container instances (up to 10 pods) when ECS capacity is insufficient:
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: spark
  namespace: spark
spec:
  # Apply this strategy to pods launched by Spark Operator
  selector:
    sparkoperator.k8s.io/launched-by-spark-operator: "true"
  strategy: prefer
  units:
  # First: use ECS resources, up to 10 pods
  - resource: ecs
    max: 10
    podLabels:
      k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"
    nodeSelector:
      node.alibabacloud.com/instance-charge-type: PostPaid
  # Second: use ECI resources, up to 10 pods
  - resource: eci
    max: 10
  ignorePreviousPod: false
  ignoreTerminatingPod: true
  preemptPolicy: AfterAllUnits
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    # Wait up to 30 seconds for ECS autoscaling before falling back to ECI
    timeout: 30s
For more information, see Use elastic container instances to run Spark jobs.
Configure Dynamic Resource Allocation
Dynamic Resource Allocation (DRA) adjusts executor count based on workload size, preventing both resource starvation and waste. The following example configures DRA together with Celeborn RSS:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with your Spark image
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPageRank
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt
  - "10"
  sparkVersion: 3.5.4
  hadoopConf:
    fs.oss.impl: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.oss.common.auth.EnvironmentVariableCredentialsProvider
  sparkConfigMap: spark-log-conf
  sparkConf:
    # ====================
    # Event log
    # ====================
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: file:///mnt/nas/spark/event-logs
    # ====================
    # Celeborn
    # Ref: https://github.com/apache/celeborn/blob/main/README.md#spark-configuration
    # ====================
    # Shuffle manager class name changed in 0.3.0:
    # before 0.3.0: `org.apache.spark.shuffle.celeborn.RssShuffleManager`
    # since 0.3.0: `org.apache.spark.shuffle.celeborn.SparkShuffleManager`
    spark.shuffle.manager: org.apache.spark.shuffle.celeborn.SparkShuffleManager
    # Must use KryoSerializer because Java serializer does not support relocation
    spark.serializer: org.apache.spark.serializer.KryoSerializer
    # Configure based on the number of Celeborn master replicas
    spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-1.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-2.celeborn-master-svc.celeborn.svc.cluster.local
    # options: hash, sort
    # Hash shuffle writer uses (partition count) * (celeborn.push.buffer.max.size) * (spark.executor.cores) memory.
    # Sort shuffle writer uses less memory — use it when partition count is large.
    spark.celeborn.client.spark.shuffle.writer: hash
    # Enable server-side data replication if you have more than one worker
    # If your Celeborn is using HDFS, set this to false
    spark.celeborn.client.push.replicate.enabled: "false"
    spark.sql.adaptive.localShuffleReader.enabled: "false"
    spark.sql.adaptive.enabled: "true"
    spark.sql.adaptive.skewJoin.enabled: "true"
    # Required for Spark >= 3.5.0 to support dynamic resource allocation with Celeborn
    spark.shuffle.sort.io.plugin.class: org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO
    spark.executor.userClassPathFirst: "false"
    # ====================
    # Dynamic resource allocation
    # Ref: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
    # ====================
    spark.dynamicAllocation.enabled: "true"
    # Disable shuffle tracking when using Celeborn as RSS (Spark >= 3.4.0)
    spark.dynamicAllocation.shuffleTracking.enabled: "false"
    spark.dynamicAllocation.initialExecutors: "3"
    spark.dynamicAllocation.minExecutors: "0"
    spark.dynamicAllocation.maxExecutors: "10"
    # Release idle executors after 60 seconds
    spark.dynamicAllocation.executorIdleTimeout: 60s
    # Release executors that have cached data blocks after the specified timeout (default: infinity)
    # spark.dynamicAllocation.cachedExecutorIdleTimeout:
    # Request additional executors when scheduling backlog exceeds 1 second
    spark.dynamicAllocation.schedulerBacklogTimeout: 1s
    spark.dynamicAllocation.sustainedSchedulerBacklogTimeout: 1s
  driver:
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
          envFrom:
          - secretRef:
              name: spark-oss-secret
          volumeMounts:
          - name: nas
            subPath: spark/event-logs
            mountPath: /mnt/nas/spark/event-logs
        volumes:
        - name: nas
          persistentVolumeClaim:
            claimName: nas-pvc
    serviceAccount: spark-operator-spark
  executor:
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
          envFrom:
          - secretRef:
              name: spark-oss-secret
  restartPolicy:
    type: Never
For more information, see Configure dynamic resource allocation for Spark jobs.
Use Fluid to accelerate data access
If your data resides in a remote data center or your jobs are hitting data access bottlenecks, use Fluid's distributed cache to accelerate reads for Spark jobs.
For more information, see Use Fluid to accelerate data access for Spark applications.
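As a rough orientation, a Fluid dataset that caches an OSS bucket might be declared as in the following sketch, assuming the Fluid component is already installed in the cluster; the dataset name, runtime sizing, and credential Secret are illustrative:

```yaml
# Illustrative: cache an OSS path with Fluid so Spark reads hit the cache
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark-data
  namespace: spark
spec:
  mounts:
  - mountPoint: oss://<OSS_BUCKET>/data
    name: data
    options:
      fs.oss.endpoint: <OSS_ENDPOINT>
    encryptOptions:
    - name: fs.oss.accessKeyId
      valueFrom:
        secretKeyRef:
          name: spark-oss-secret
          key: OSS_ACCESS_KEY_ID
    - name: fs.oss.accessKeySecret
      valueFrom:
        secretKeyRef:
          name: spark-oss-secret
          key: OSS_ACCESS_KEY_SECRET
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: spark-data
  namespace: spark
spec:
  replicas: 2
  tieredstore:
    levels:
    - mediumtype: MEM
      quota: 4Gi
```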