
Container Service for Kubernetes: Spark on ACK overview

Last Updated: Apr 08, 2025

Spark on Container Service for Kubernetes (ACK) is a solution that combines Spark on Kubernetes with the enterprise-grade container application management capabilities of ACK, enabling you to quickly build an efficient, flexible, and scalable Spark big data processing platform.

Introduction to Spark on ACK

Apache Spark is a compute engine tailored for large-scale data processing and is widely used in scenarios such as data analysis and machine learning. Beginning with version 2.3, Spark supports submitting jobs to Kubernetes clusters (see Running Spark on Kubernetes).

Spark Operator is a Kubernetes Operator designed for running Spark workloads on Kubernetes clusters. It automates the management of the Spark job lifecycle, including configuration, submission, and retries, in a Kubernetes-native way.

The Spark on ACK solution customizes and enhances components such as Spark Operator while staying compatible with their open-source versions and extending their capabilities. It integrates with the Alibaba Cloud ecosystem to provide features such as log retention, Object Storage Service (OSS) integration, and observability, allowing you to quickly build a flexible, efficient, and scalable big data processing platform.

Features and advantages

  • Simplified development and operations

    • Portability: Enables packaging of Spark applications and their dependencies into container images for easy migration between Kubernetes clusters.

    • Observability: Allows job status monitoring through the Spark History Server and integrates with Simple Log Service and Managed Service for Prometheus for enhanced job observability.

    • Workflow orchestration: Use workflow orchestration engines such as Apache Airflow and Argo Workflows to manage Spark jobs, automate data pipeline scheduling, and ensure consistent deployments across environments. This enhances operational efficiency and reduces migration costs.

    • Multi-version support: Allows multiple versions of Spark jobs to run concurrently in a single ACK cluster.

  • Job scheduling and resource management

    • Job queue management: Seamlessly integrated with ack-kube-queue, this feature offers flexible management of job queues and resource quotas, automatically optimizing the allocation of resources for workloads and enhancing the utilization of cluster resources.

    • Multiple scheduling strategies: Leverage the existing scheduling capabilities of the ACK scheduler to support a variety of batch scheduling strategies, such as Gang Scheduling and Capacity Scheduling.

    • Multi-architecture scheduling: Supports hybrid use of x86 and Arm architecture Elastic Compute Service (ECS) resources to improve efficiency and reduce costs.

    • Multi-cluster scheduling: Utilize the ACK One multi-cluster fleet to distribute Spark jobs across various clusters, enhancing resource utilization across multiple clusters.

    • Elastic computing power supply: Offers customizable resource scheduling priorities and integrates with various elastic solutions, such as node auto scaling and instant elasticity. It also allows you to use Elastic Container Instance and Alibaba Cloud Container Compute Service (ACS) computing resources without maintaining ECS instances, enabling on-demand, flexible scaling.

    • Colocation of multiple types of workloads: Seamlessly integrated with ack-koordinator, this feature supports the colocation of various types of workloads, thereby enhancing the utilization of cluster resources.

  • Performance and stability optimization

    • Shuffle performance optimization: Configures Spark jobs to use Celeborn as the Remote Shuffle Service, achieving storage-compute separation and enhancing Shuffle performance and stability.

    • Data access acceleration: Utilizes the data orchestration and access acceleration capabilities provided by Fluid to speed up data access for Spark jobs, thereby improving performance.

Overall architecture

The architecture of Spark on ACK allows for quick job submission through Spark Operator and leverages the observability, scheduling, and resource elasticity features of ACK and Alibaba Cloud products.

Figure: Spark on ACK architecture

Billing overview

The installation of components for running Spark jobs in an ACK cluster is free. However, the costs for the ACK cluster itself, including cluster management fees and associated cloud product fees, are charged as usual. For more information, see Billing overview.

Additional cloud product fees, such as fees for collecting logs with Simple Log Service or fees incurred when Spark jobs read and write data in OSS or NAS, are charged by the respective cloud products. Refer to the operation documents below for further details.

Getting started

Running Spark jobs in an ACK cluster typically involves a series of steps that cover basic usage, observability, and performance optimization. You can select and configure these steps according to your needs.

Figure: Workflow for running Spark jobs in an ACK cluster

Basic usage


Build a Spark container image

You can either use the Spark container image provided by the open-source community directly, or build a custom image based on it and push the result to your own image repository. You can modify the example Dockerfile below as needed, for example by replacing the Spark base image or adding dependent JAR packages, then build the image and push it to your image repository.

Example Dockerfile:

ARG SPARK_IMAGE=spark:3.5.4

FROM ${SPARK_IMAGE}

# Add dependency for Hadoop Aliyun OSS support
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aliyun/3.3.4/hadoop-aliyun-3.3.4.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.17.4/aliyun-sdk-oss-3.17.4.jar ${SPARK_HOME}/jars
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/jdom/jdom2/2.0.6.1/jdom2-2.0.6.1.jar ${SPARK_HOME}/jars

# Add dependency for log4j-layout-template-json
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-layout-template-json/2.24.1/log4j-layout-template-json-2.24.1.jar ${SPARK_HOME}/jars

# Add dependency for Celeborn
ADD --chown=spark:spark --chmod=644 https://repo1.maven.org/maven2/org/apache/celeborn/celeborn-client-spark-3-shaded_2.12/0.5.3/celeborn-client-spark-3-shaded_2.12-0.5.3.jar ${SPARK_HOME}/jars

Create a dedicated namespace

Create one or more dedicated namespaces for Spark jobs (this tutorial uses spark) to isolate resources and apply resource quotas. Subsequent Spark jobs will run in this namespace. The creation command is as follows.

kubectl create namespace spark
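
To cap the total amount of resources that Spark jobs in this namespace can consume, you can also create a ResourceQuota. The following is a minimal sketch; the object name spark-quota and the quota values are illustrative assumptions that you should adjust to your workloads.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota # Hypothetical name used for illustration
  namespace: spark
spec:
  hard:
    # Upper limits for all Pods in the spark namespace; adjust to your workloads
    requests.cpu: "64"
    requests.memory: 256Gi
    limits.cpu: "64"
    limits.memory: 256Gi
    pods: "100"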

Use Spark Operator to run Spark jobs

Deploy the ack-spark-operator component and set spark.jobNamespaces=["spark"] so that the operator only watches Spark jobs submitted in the spark namespace. After the component is deployed, you can run the example Spark job shown below.
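
The ack-spark-operator component is typically deployed from the ACK console, where you can set this value in the component configuration. If you manage the configuration as a Helm-style values file, a minimal sketch could look as follows; only the spark.jobNamespaces key comes from the text above, and the exact chart layout is an assumption.

# values.yaml (sketch): restrict the operator to watching the spark namespace
spark:
  jobNamespaces:
  - spark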

Example Spark job:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark # Ensure this namespace is in the namespace list specified by spark.jobNamespaces.
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with your own Spark container image.
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  - "5000"
  sparkVersion: 3.5.4
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
        serviceAccount: spark-operator-spark
  executor:
    instances: 1
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
  restartPolicy:
    type: Never
For more information, see Use Spark Operator to run Spark jobs.

Read and write OSS data

There are multiple ways for Spark jobs to access Alibaba Cloud OSS data, including Hadoop Aliyun SDK, Hadoop AWS SDK, and JindoSDK. Depending on the SDK you choose, you need to include the corresponding dependencies in the Spark container image and configure Hadoop-related parameters in the Spark job.
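
The example job below obtains OSS credentials from a Kubernetes Secret named spark-oss-secret through environment variables, which the configured EnvironmentVariableCredentialsProvider then reads (it conventionally reads OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET). The following is a minimal sketch of such a Secret; the placeholder values are assumptions that you must replace with your own AccessKey pair.

apiVersion: v1
kind: Secret
metadata:
  name: spark-oss-secret
  namespace: spark
type: Opaque
stringData:
  # Replace the placeholders with your own AccessKey pair
  OSS_ACCESS_KEY_ID: <YOUR_ACCESS_KEY_ID>
  OSS_ACCESS_KEY_SECRET: <YOUR_ACCESS_KEY_SECRET>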

Example code:

Refer to Read and write OSS data in Spark jobs to upload the test dataset to OSS, and then you can run the following example Spark job.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with your own Spark image
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPageRank
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  # Specify the input test dataset, replace <OSS_BUCKET> with the OSS Bucket name
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt
  # Number of iterations
  - "10"
  sparkVersion: 3.5.4
  hadoopConf:
    # Configure Spark job access to OSS
    fs.oss.impl: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
    # Replace <OSS_ENDPOINT> with the OSS access endpoint, for example, the internal network access endpoint of OSS in Beijing is oss-cn-beijing-internal.aliyuncs.com
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.oss.common.auth.EnvironmentVariableCredentialsProvider
  driver:
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
          envFrom:
          # Read environment variables from the specified Secret
          - secretRef:
              name: spark-oss-secret
        serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
          envFrom:
          # Read environment variables from the specified Secret
          - secretRef:
              name: spark-oss-secret
  restartPolicy:
    type: Never
For more information, see Read and write OSS data in Spark jobs.

Observability


Deploy Spark History Server

Deploy ack-spark-history-server in the spark namespace and configure the log storage backend (PVC, OSS/OSS-HDFS, and HDFS are supported) so that Spark event logs can be read from the specified storage system and rendered as a web UI for viewing. The following example configuration shows how to configure Spark History Server to read event logs from the /spark/event-logs path of a NAS file system mounted through a PVC.

Example configuration:

# Spark configuration
sparkConf:
  spark.history.fs.logDirectory: file:///mnt/nas/spark/event-logs

# Environment variables
env:
- name: SPARK_DAEMON_MEMORY
  value: 7g

# Data volume
volumes:
- name: nas
  persistentVolumeClaim:
    claimName: nas-pvc

# Data volume mount
volumeMounts:
- name: nas
  subPath: spark/event-logs
  mountPath: /mnt/nas/spark/event-logs

# Adjust resource size based on the number and scale of Spark jobs
resources:
  requests:
    cpu: 2
    memory: 8Gi
  limits:
    cpu: 2
    memory: 8Gi

Next, mount the same NAS file system when submitting Spark jobs and configure Spark to write event logs to the same path. You will then be able to view the job from the Spark History Server. Below is an example job.

Example Spark job:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with your Spark image
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  - "5000"
  sparkVersion: 3.5.4
  sparkConf:
    # Event log
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: file:///mnt/nas/spark/event-logs
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
          volumeMounts:
          - name: nas
            subPath: spark/event-logs
            mountPath: /mnt/nas/spark/event-logs
        volumes:
        - name: nas
          persistentVolumeClaim:
            claimName: nas-pvc
        serviceAccount: spark-operator-spark
  executor:
    instances: 1
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
  restartPolicy:
    type: Never
For more information, see Use Spark History Server to view information about Spark jobs.
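
Both the History Server configuration and the example job above mount a PVC named nas-pvc. The following is a minimal sketch of a statically provisioned NAS volume, assuming the ACK NAS CSI plugin (nasplugin.csi.alibabacloud.com) is used; the PV name nas-pv, the declared capacity, and the mount target placeholder are assumptions to replace with your own values.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-pv # Illustrative name
spec:
  capacity:
    storage: 100Gi # Declared capacity; NAS itself does not enforce it
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: nas-pv # For static provisioning, keep this the same as the PV name
    volumeAttributes:
      # Replace with the mount target of your NAS file system
      server: <NAS_MOUNT_TARGET>
      path: "/"
  mountOptions:
  - nolock,tcp,noresvport
  - vers=3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nas-pvc
  namespace: spark
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  volumeName: nas-pv
  resources:
    requests:
      storage: 100Gi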

Configure Simple Log Service to collect Spark logs

When you run a large number of Spark jobs in the cluster, we recommend that you use Simple Log Service to centrally collect Spark job logs so that you can query and analyze the stdout and stderr output of Spark containers.

Example job:

This example job works with Simple Log Service to collect the logs written to /opt/spark/logs/*.log in the Spark containers.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with the Spark image built in step one
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  - "5000"
  sparkVersion: 3.5.4
  # Read the log configuration file log4j2.properties from the specified ConfigMap
  sparkConfigMap: spark-log-conf
  sparkConf:
    # Event log
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: file:///mnt/nas/spark/event-logs
  driver:
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
          volumeMounts:
          - name: nas
            subPath: spark/event-logs
            mountPath: /mnt/nas/spark/event-logs
        serviceAccount: spark-operator-spark
        volumes:
        - name: nas
          persistentVolumeClaim:
            claimName: nas-pvc
  executor:
    instances: 1
    cores: 1
    coreLimit: 1200m
    memory: 512m
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
  restartPolicy:
    type: Never
For more information, see Use Simple Log Service to collect the logs of Spark jobs.
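
The example job above reads its Log4j configuration from a ConfigMap named spark-log-conf through the sparkConfigMap field, so that driver and executor logs are written to files under /opt/spark/logs where they can be collected. The following is a minimal sketch of such a ConfigMap; the appender names and the rolling policy are illustrative assumptions.

apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-log-conf
  namespace: spark
data:
  log4j2.properties: |
    # Keep console output and additionally write logs to rolling files under /opt/spark/logs
    rootLogger.level = info
    rootLogger.appenderRef.stdout.ref = console
    rootLogger.appenderRef.rolling.ref = rolling

    appender.console.type = Console
    appender.console.name = console
    appender.console.target = SYSTEM_ERR
    appender.console.layout.type = PatternLayout
    appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

    appender.rolling.type = RollingFile
    appender.rolling.name = rolling
    appender.rolling.fileName = /opt/spark/logs/spark.log
    appender.rolling.filePattern = /opt/spark/logs/spark-%i.log
    appender.rolling.layout.type = PatternLayout
    appender.rolling.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    appender.rolling.policies.type = Policies
    appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
    appender.rolling.policies.size.size = 128MB
    appender.rolling.strategy.type = DefaultRolloverStrategy
    appender.rolling.strategy.max = 3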

Performance optimization


Improve Shuffle performance through RSS

Shuffle is an important operation in distributed computing. It is often accompanied by a large amount of disk I/O, data serialization, and network I/O, which can easily lead to out-of-memory (OOM) errors and fetch failures. To improve Shuffle performance and stability and the quality of computing services, you can configure Spark jobs to use Apache Celeborn as the Remote Shuffle Service (RSS).

Example code:

Refer to Use Celeborn to enable RSS for Spark jobs to deploy the ack-celeborn component in the cluster, and then you can submit Spark jobs based on the code below, using Celeborn as RSS.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with your Spark image
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPageRank
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  # Specify the input test dataset, replace <OSS_BUCKET> with the OSS Bucket name
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt
  # Number of iterations
  - "10"
  sparkVersion: 3.5.4
  hadoopConf:
    # Configure Spark job access to OSS
    fs.oss.impl: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
    # Replace <OSS_ENDPOINT> with the OSS access endpoint, for example, the internal network access endpoint of OSS in Beijing is oss-cn-beijing-internal.aliyuncs.com
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.oss.common.auth.EnvironmentVariableCredentialsProvider
  # Read the log configuration file log4j2.properties from the specified ConfigMap
  sparkConfigMap: spark-log-conf
  sparkConf:
    # Event log
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: file:///mnt/nas/spark/event-logs
    
    # Celeborn related configuration
    spark.shuffle.manager: org.apache.spark.shuffle.celeborn.SparkShuffleManager
    spark.serializer: org.apache.spark.serializer.KryoSerializer
    # Configure based on the number of Celeborn master replicas
    spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-1.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-2.celeborn-master-svc.celeborn.svc.cluster.local
    spark.celeborn.client.spark.shuffle.writer: hash
    spark.celeborn.client.push.replicate.enabled: "false"
    spark.sql.adaptive.localShuffleReader.enabled: "false"
    spark.sql.adaptive.enabled: "true"
    spark.sql.adaptive.skewJoin.enabled: "true"
    spark.shuffle.sort.io.plugin.class: org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO
    spark.dynamicAllocation.shuffleTracking.enabled: "false"
    spark.executor.userClassPathFirst: "false"
  driver:
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
          envFrom:
          # Read environment variables from the specified Secret
          - secretRef:
              name: spark-oss-secret
          volumeMounts:
          - name: nas
            subPath: spark/event-logs
            mountPath: /mnt/nas/spark/event-logs
        volumes:
        - name: nas
          persistentVolumeClaim:
            claimName: nas-pvc
        serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
          envFrom:
          # Read environment variables from the specified Secret
          - secretRef:
              name: spark-oss-secret
  restartPolicy:
    type: Never
For more information, see Use Celeborn as RSS in Spark jobs.

Define elastic resource scheduling priority

By using pods based on Elastic Container Instance and configuring appropriate scheduling strategies, you can create resources on demand and pay only for what is actually used, effectively reducing the cost of idle cluster resources. In scenarios where ECS instances and elastic container instances are used together, you can also specify scheduling priorities.

You do not need to modify scheduling-related configurations in the SparkApplication. The ACK scheduler automatically schedules Pods according to the configured elastic strategy, and you can flexibly customize how various elastic resources (such as ECS instances and elastic container instances) are combined.

Example elastic strategy:

The following example customizes an elastic strategy: for Pods launched by Spark Operator in the spark namespace, ECS resources are preferred and up to 10 Pods can be scheduled to ECS instances. When ECS resources are insufficient, elastic container instances are used and up to 10 Pods can be scheduled to elastic container instances.

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: spark
  namespace: spark
spec:
  # Specify the scheduling strategy applied to Pods through label selectors
  selector:
    # For example, specify the scheduling strategy applied to Pods submitted through Spark Operator
    sparkoperator.k8s.io/launched-by-spark-operator: "true"
  strategy: prefer
  # Scheduling unit configuration
  # During scale-out, scale-out will be performed in the order of scheduling units; during scale-in, scale-in will be performed in the reverse order of scheduling units.
  units:
  # The first scheduling unit uses ECS resources, and up to 10 Pods can be scheduled to ECS instances
  - resource: ecs
    max: 10
    # The scheduler will update the label information to the Pod
    podLabels:
      # This is a special label that will not be updated to the Pod
      k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"
    # When using ECS resources, you can specify schedulable nodes through node selectors
    nodeSelector:
      # For example, select ECS nodes of the pay-as-you-go type
      node.alibabacloud.com/instance-charge-type: PostPaid
  # The second scheduling unit uses ECI resources, and up to 10 Pods can be scheduled to ECI instances
  - resource: eci
    max: 10
  # Ignore Pods that were already scheduled before the ResourcePolicy was created when counting the number of Pods
  ignorePreviousPod: false
  # Ignore Pods in the Terminating state when counting the number of Pods
  ignoreTerminatingPod: true
  # Preemption strategy
  # BeforeNextUnit indicates that the scheduler will attempt preemption when each Unit scheduling fails
  # AfterAllUnits indicates that ResourcePolicy will only attempt preemption when the last Unit scheduling fails
  preemptPolicy: AfterAllUnits
  # Under what circumstances Pods are allowed to use resources in subsequent Units
  whenTryNextUnits:
    # When one of the following two conditions is met, resources in subsequent Units are allowed to be used
    # 1. Max is set for the current Unit, and the number of Pods in that Unit is greater than or equal to Max.
    # 2. Max is not set for the current Unit, the PodLabels of that Unit contain the label k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true", and the wait has exceeded the timeout.
    policy: TimeoutOrExceedMax
    # When the policy is TimeoutOrExceedMax, if the current Unit resources are insufficient to schedule Pods, wait in the current Unit, and the maximum waiting time is timeout.
    # This strategy can be used with auto-scaling and ECI to achieve the effect of prioritizing node pool auto-scaling and automatically using ECI after a timeout.
    timeout: 30s
For more information, see Use elastic container instances to run Spark jobs.

Configure dynamic resource allocation

Dynamic Resource Allocation (DRA) can dynamically adjust the computing resources used by jobs based on the size of the workload. You can enable dynamic resource allocation for Spark jobs to avoid long job execution times due to insufficient resources or resource waste due to excess resources.

Example job:

This example job further configures dynamic resource allocation in combination with Celeborn RSS.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pagerank
  namespace: spark
spec:
  type: Scala
  mode: cluster
  # Replace <SPARK_IMAGE> with your Spark image
  image: <SPARK_IMAGE>
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPageRank
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
  arguments:
  # Specify the input test dataset, replace <OSS_BUCKET> with the OSS Bucket name
  - oss://<OSS_BUCKET>/data/pagerank_dataset.txt
  # Number of iterations
  - "10"
  sparkVersion: 3.5.4
  hadoopConf:
    # Configure Spark job access to OSS
    fs.oss.impl: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
    # Replace <OSS_ENDPOINT> with the OSS access endpoint, for example, the internal network access endpoint of OSS in Beijing is oss-cn-beijing-internal.aliyuncs.com
    fs.oss.endpoint: <OSS_ENDPOINT>
    fs.oss.credentials.provider: com.aliyun.oss.common.auth.EnvironmentVariableCredentialsProvider
  # Read the log configuration file log4j2.properties from the specified ConfigMap
  sparkConfigMap: spark-log-conf
  sparkConf:
    # ====================
    # Event log
    # ====================
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: file:///mnt/nas/spark/event-logs

    # ====================
    # Celeborn
    # Ref: https://github.com/apache/celeborn/blob/main/README.md#spark-configuration
    # ====================
    # Shuffle manager class name changed in 0.3.0:
    #    before 0.3.0: `org.apache.spark.shuffle.celeborn.RssShuffleManager`
    #    since 0.3.0: `org.apache.spark.shuffle.celeborn.SparkShuffleManager`
    spark.shuffle.manager: org.apache.spark.shuffle.celeborn.SparkShuffleManager
    # Must use the Kryo serializer because the Java serializer does not support relocation
    spark.serializer: org.apache.spark.serializer.KryoSerializer
    # Configure based on the number of Celeborn master replicas.
    spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-1.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-2.celeborn-master-svc.celeborn.svc.cluster.local
    # Options: hash, sort
    # The hash shuffle writer uses (partition count) * (celeborn.push.buffer.max.size) * (spark.executor.cores) memory.
    # The sort shuffle writer uses less memory than the hash shuffle writer. If your shuffle partition count is large, try the sort shuffle writer.
    spark.celeborn.client.spark.shuffle.writer: hash
    # We recommend setting `spark.celeborn.client.push.replicate.enabled` to true to enable server-side data replication
    # If you have only one worker, this setting must be false 
    # If your Celeborn is using HDFS, it's recommended to set this setting to false
    spark.celeborn.client.push.replicate.enabled: "false"
    # Support for Spark AQE has only been tested under Spark 3
    spark.sql.adaptive.localShuffleReader.enabled: "false"
    # We recommend enabling AQE support to gain better performance
    spark.sql.adaptive.enabled: "true"
    spark.sql.adaptive.skewJoin.enabled: "true"
    # If the Spark version is 3.5.0 or later, configure this parameter to support dynamic resource allocation
    spark.shuffle.sort.io.plugin.class: org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO
    spark.executor.userClassPathFirst: "false"

    # ====================
    # Dynamic resource allocation
    # Ref: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
    # ====================
    # Enable dynamic resource allocation
    spark.dynamicAllocation.enabled: "true"
    # Enable shuffle file tracking to implement dynamic resource allocation without relying on an external shuffle service (ESS).
    # If the Spark version is 3.4.0 or later, we recommend that you disable this feature when you use Celeborn as RSS.
    spark.dynamicAllocation.shuffleTracking.enabled: "false"
    # The initial value of the number of executors.
    spark.dynamicAllocation.initialExecutors: "3"
    # The minimum number of executors.
    spark.dynamicAllocation.minExecutors: "0"
    # The maximum number of executors.
    spark.dynamicAllocation.maxExecutors: "10"
    # The idle timeout period of the executor. If the timeout period is exceeded, the executor is released.
    spark.dynamicAllocation.executorIdleTimeout: 60s
    # The idle timeout period of executors that cache data blocks. If the timeout period is exceeded, the executor is released. The default value is infinity, which means that executors with cached data blocks are never released.
    # spark.dynamicAllocation.cachedExecutorIdleTimeout:
    # If tasks have been pending in the backlog for longer than this duration, additional executors are requested.
    spark.dynamicAllocation.schedulerBacklogTimeout: 1s
    # If the backlog persists, subsequent executor requests are triggered each time this duration elapses.
    spark.dynamicAllocation.sustainedSchedulerBacklogTimeout: 1s
  driver:
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-driver
          envFrom:
          # Read environment variables from the specified Secret
          - secretRef:
              name: spark-oss-secret
          volumeMounts:
          - name: nas
            subPath: spark/event-logs
            mountPath: /mnt/nas/spark/event-logs
        volumes:
        - name: nas
          persistentVolumeClaim:
            claimName: nas-pvc
        serviceAccount: spark-operator-spark
  executor:
    cores: 1
    coreLimit: "1"
    memory: 4g
    template:
      spec:
        containers:
        - name: spark-kubernetes-executor
          envFrom:
          # Read environment variables from the specified Secret
          - secretRef:
              name: spark-oss-secret
  restartPolicy:
    type: Never
For more information, see Configure dynamic resource allocation for Spark jobs.

Use Fluid to accelerate data access

If your data resides in an on-premises data center, or if data access becomes a performance bottleneck, you can use the data access and distributed cache orchestration capabilities provided by Fluid to accelerate data access for Spark jobs.
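
As a hedged illustration, the following sketch creates a Fluid Dataset that mounts an OSS prefix and a JindoRuntime that caches it in memory. The object names, bucket, endpoint, replica count, and cache sizes are assumptions to replace with your own, and the referenced Secret is expected to hold an OSS AccessKey pair (for example, the spark-oss-secret created earlier). After the Dataset is bound, Fluid exposes it as a PVC with the same name, which Spark pods can mount like any other volume.

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark-data # Illustrative name
  namespace: spark
spec:
  mounts:
  - mountPoint: oss://<OSS_BUCKET>/data # OSS prefix to cache
    name: data
    options:
      fs.oss.endpoint: <OSS_ENDPOINT>
    encryptOptions:
    - name: fs.oss.accessKeyId
      valueFrom:
        secretKeyRef:
          name: spark-oss-secret
          key: OSS_ACCESS_KEY_ID
    - name: fs.oss.accessKeySecret
      valueFrom:
        secretKeyRef:
          name: spark-oss-secret
          key: OSS_ACCESS_KEY_SECRET
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: spark-data # Must match the Dataset name
  namespace: spark
spec:
  replicas: 2 # Number of cache workers; adjust to your data size
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /dev/shm
      quota: 4Gi # Cache capacity per worker
      high: "0.99"
      low: "0.95"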

For more information, see Use Fluid to accelerate data access for Spark applications.

References