Apache Spark is a high-speed computing engine for large-scale data processing, widely used in data analytics and machine learning. Spark Operator automates the deployment and lifecycle management of Spark jobs on Kubernetes. This topic guides you through using Spark Operator to run and manage Spark jobs on an ACK cluster to efficiently handle big data workloads.
Prerequisites
- An ACK Pro cluster or ACK Serverless Pro cluster running Kubernetes 1.24 or later. For more information, see Create an ACK managed cluster, Create an ACK Serverless cluster, and Manually upgrade ACK clusters.
- A kubectl client connected to the ACK cluster. For more information, see Connect to an ACK cluster using kubectl.
How it works
Spark Operator automates the lifecycle of Spark jobs on Kubernetes. It uses CustomResourceDefinition (CRD) resources such as SparkApplication and ScheduledSparkApplication to declaratively manage Spark jobs. Spark Operator leverages native Kubernetes features like auto-scaling, health checks, and resource management to efficiently run and monitor your workloads. ACK provides the ack-spark-operator component, which is based on the community's kubeflow/spark-operator. For more information, see Spark Operator | Kubeflow.
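To make the declarative model concrete, the following minimal ScheduledSparkApplication manifest is a sketch of how a recurring job can be described. The schedule, name, and job template below are illustrative assumptions, not part of this walkthrough; verify field support against the Spark Operator version installed in your cluster.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled      # hypothetical name for illustration
  namespace: default
spec:
  schedule: "@every 1h"         # cron-style schedule; assumption for illustration
  concurrencyPolicy: Forbid     # do not start a new run while the previous one is still active
  template:                     # an ordinary SparkApplication spec
    type: Scala
    mode: cluster
    image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.4
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
    sparkVersion: 3.5.4
    driver:
      cores: 1
      memory: 512m
      serviceAccount: spark-operator-spark
    executor:
      instances: 1
      cores: 1
      memory: 512m
    restartPolicy:
      type: Never
```

The operator creates a SparkApplication from `spec.template` on each tick of `spec.schedule`, so the one-off and scheduled cases share the same job definition.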
Benefits:
- Simplified management: Automate the deployment and lifecycle management of Spark jobs by using declarative configurations in Kubernetes.
- Multi-tenancy support: Use Kubernetes namespaces and resource quotas for user-level resource isolation and allocation. Use node selection to ensure that Spark workloads run on dedicated resources.
- Elastic resource provisioning: Leverage elastic resources such as Elastic Container Instance (ECI) or elastic node pools to quickly obtain large amounts of elastic resources during business peaks, balancing performance and cost.
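As a sketch of how elastic provisioning can be wired into a job: on ACK, pods carrying the label `alibabacloud.com/eci: "true"` can be scheduled to ECI. Assuming that label convention, a SparkApplication could keep its driver on regular nodes and send only the executors to ECI. Treat the label value and sizing below as assumptions to verify against your cluster configuration.

```yaml
# Fragment of a SparkApplication spec; sketch only.
spec:
  driver:
    cores: 1
    memory: 512m                     # driver stays on regular cluster nodes
  executor:
    instances: 4
    cores: 1
    memory: 512m
    labels:
      alibabacloud.com/eci: "true"   # assumption: label that routes executor pods to ECI on ACK
```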
Use cases:
- Data analytics: Data scientists can use Spark for interactive data analytics and data cleansing.
- Batch computing: Run scheduled batch jobs to process large-scale datasets.
- Real-time processing: The Spark Streaming library provides the capability for real-time data stream processing.
Procedure overview
This topic describes how to use Spark Operator to run and manage Spark jobs on an ACK cluster.
- Deploy the ack-spark-operator component: Install Spark Operator in your ACK cluster to enable the management and execution of Spark jobs.
- Submit a Spark job: Create and submit a Spark job manifest to run a specific data processing task.
- Monitor the Spark job: Monitor the running status of the job and obtain detailed execution information and logs.
- Access the Spark web UI: Access the web UI for a more intuitive view of the Spark job's execution.
- Update the Spark job: Adjust the job configuration based on your requirements and apply the updates.
- Delete the Spark job: Clean up completed or unnecessary Spark jobs to avoid unexpected costs.
Step 1: Deploy the ack-spark-operator component
- Log on to the ACK console. In the left navigation pane, click Marketplace.
- On the Marketplace page, click the App Catalog tab, then search for and select ack-spark-operator.
- On the ack-spark-operator page, click Deploy.
- In the Create panel, select a cluster and namespace, and then click Next.
- On the Parameters page, configure the parameters and then click OK.
The following table describes key configuration parameters. For a complete list, see the ConfigMaps tab on the ack-spark-operator page.
| Parameter | Description | Default |
| --- | --- | --- |
| `controller.replicas` | The number of controller replicas. | `1` |
| `webhook.replicas` | The number of webhook replicas. | `1` |
| `spark.jobNamespaces` | The list of namespaces where Spark jobs can run. An empty string `""` allows all namespaces. Separate multiple namespaces with commas (`,`). Examples: `["default"]` (default namespace), `[""]` (all namespaces), `["ns1","ns2","ns3"]` (multiple namespaces). | `["default"]` |
| `spark.serviceAccount.name` | Spark Operator automatically creates a ServiceAccount named `spark-operator-spark` and the required RBAC resources in each namespace specified by `spark.jobNamespaces`. If you customize this name, you must specify the new name when submitting Spark jobs. | `spark-operator-spark` |
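For example, to restrict Spark jobs to two dedicated namespaces while keeping the default ServiceAccount name, the parameters above could be overridden with a Helm-style values fragment like the following. The namespace names are hypothetical, and the exact input form depends on the Parameters page; this is a sketch, not the authoritative configuration schema.

```yaml
controller:
  replicas: 1
webhook:
  replicas: 1
spark:
  jobNamespaces:
  - spark-team-a        # hypothetical namespace for illustration
  - spark-team-b        # hypothetical namespace for illustration
  serviceAccount:
    name: spark-operator-spark
```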
Step 2: Submit a Spark job
Create a SparkApplication manifest to submit a Spark job for data processing.
- Create the following SparkApplication manifest and save it as spark-pi.yaml.

  ```yaml
  apiVersion: sparkoperator.k8s.io/v1beta2
  kind: SparkApplication
  metadata:
    name: spark-pi
    namespace: default  # The namespace must be in the list of namespaces specified by spark.jobNamespaces.
  spec:
    type: Scala
    mode: cluster
    image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.4
    imagePullPolicy: IfNotPresent
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
    arguments:
    - "1000"
    sparkVersion: 3.5.4
    driver:
      cores: 1
      coreLimit: 1200m
      memory: 512m
      serviceAccount: spark-operator-spark  # If you customized the ServiceAccount name, change the value accordingly.
    executor:
      instances: 1
      cores: 1
      coreLimit: 1200m
      memory: 512m
    restartPolicy:
      type: Never
  ```

- Run the following command to submit the Spark job.

  ```shell
  kubectl apply -f spark-pi.yaml
  ```

  Expected output:

  ```
  sparkapplication.sparkoperator.k8s.io/spark-pi created
  ```
Step 3: Monitor the Spark job
Run the following commands to check the status, associated pods, and logs of the Spark job.
- Run the following command to check the status of the Spark job.

  ```shell
  kubectl get sparkapplication spark-pi
  ```

  Expected output:

  ```
  NAME       STATUS      ATTEMPTS   START                  FINISH       AGE
  spark-pi   SUBMITTED   1          2024-06-04T03:17:11Z   <no value>   15s
  ```

- Run the following command to check the status of the pods for the Spark job. The command filters pods by the label sparkoperator.k8s.io/app-name with the value spark-pi.

  ```shell
  kubectl get pod -l sparkoperator.k8s.io/app-name=spark-pi
  ```

  Expected output:

  ```
  NAME                               READY   STATUS    RESTARTS   AGE
  spark-pi-driver                    1/1     Running   0          49s
  spark-pi-7272428fc8f5f392-exec-1   1/1     Running   0          13s
  ```

  After the Spark job completes, the driver automatically deletes all executor pods.

- Run the following command to view the details of the Spark job.

  ```shell
  kubectl describe sparkapplication spark-pi
  ```

- Run the following command to view the last 20 lines of logs from the driver pod.

  ```shell
  kubectl logs --tail=20 spark-pi-driver
  ```

  Expected output:

  ```
  24/05/30 10:05:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
  24/05/30 10:05:30 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 7.942 s
  24/05/30 10:05:30 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
  24/05/30 10:05:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
  24/05/30 10:05:30 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 8.043996 s
  Pi is roughly 3.1419522314195225
  24/05/30 10:05:30 INFO SparkContext: SparkContext is stopping with exitCode 0.
  24/05/30 10:05:30 INFO SparkUI: Stopped Spark web UI at http://spark-pi-1e18858fc8f56b14-driver-svc.default.svc:4040
  24/05/30 10:05:30 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
  24/05/30 10:05:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
  24/05/30 10:05:30 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
  24/05/30 10:05:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
  24/05/30 10:05:30 INFO MemoryStore: MemoryStore cleared
  24/05/30 10:05:30 INFO BlockManager: BlockManager stopped
  24/05/30 10:05:30 INFO BlockManagerMaster: BlockManagerMaster stopped
  24/05/30 10:05:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
  24/05/30 10:05:30 INFO SparkContext: Successfully stopped SparkContext
  24/05/30 10:05:30 INFO ShutdownHookManager: Shutdown hook called
  24/05/30 10:05:30 INFO ShutdownHookManager: Deleting directory /var/data/spark-14ed60f1-82cd-4a33-b1b3-9e5d975c5b1e/spark-01120c89-5296-4c83-8a20-0799eef4e0ee
  24/05/30 10:05:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-5f98ed73-576a-41be-855d-dabdcf7de189
  ```
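For scripted monitoring, you can also block until the job reaches a terminal state instead of polling manually. The following sketch assumes kubectl 1.23 or later (for `--for=jsonpath`) and the spark-pi job from this walkthrough; it is an illustration, not part of the official procedure.

```shell
# Wait up to 10 minutes for the SparkApplication to report COMPLETED.
kubectl wait sparkapplication/spark-pi \
  --for=jsonpath='{.status.applicationState.state}'=COMPLETED \
  --timeout=600s

# Print the final state for confirmation (e.g. COMPLETED or FAILED).
kubectl get sparkapplication spark-pi \
  -o jsonpath='{.status.applicationState.state}{"\n"}'
```

If the job fails instead, `kubectl wait` times out; `kubectl describe sparkapplication spark-pi` then shows the failure reason in the status and events.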
Step 4: Access the Spark web UI
The web UI is accessible only while the Spark job's driver pod is in the Running state. The UI becomes unavailable after the job completes.
By default, when you deploy the ack-spark-operator component, the controller.uiService.enable parameter is set to true. This automatically creates a Service to expose the web UI, which you can then access using port forwarding. If you set this parameter to false during deployment, no Service is created. In this case, you must forward the port from the driver pod directly.
Using kubectl port-forward is suitable for quick verification in test environments but is not recommended for production due to security risks.
- Forward the web UI port to your local machine by using one of the following commands based on your scenario:

  - Port-forward via the Service:

    ```shell
    kubectl port-forward services/spark-pi-ui-svc 4040
    ```

  - Port-forward via the driver pod:

    ```shell
    kubectl port-forward pods/spark-pi-driver 4040
    ```

  Expected output:

  ```
  Forwarding from 127.0.0.1:4040 -> 4040
  Forwarding from [::1]:4040 -> 4040
  ```

- Open http://127.0.0.1:4040 in your web browser to access the web UI.
(Optional) Step 5: Update the Spark job
If you need to modify the parameters of a Spark job, you can update its manifest.
- Edit the spark-pi.yaml manifest. For example, change the value of arguments to 10000 and the number of executor instances to 2.

  ```yaml
  apiVersion: sparkoperator.k8s.io/v1beta2
  kind: SparkApplication
  metadata:
    name: spark-pi
  spec:
    type: Scala
    mode: cluster
    image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.4
    imagePullPolicy: IfNotPresent
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
    arguments:
    - "10000"
    sparkVersion: 3.5.4
    driver:
      cores: 1
      coreLimit: 1200m
      memory: 512m
      serviceAccount: spark-operator-spark  # If you customized the ServiceAccount name, change the value accordingly.
    executor:
      instances: 2
      cores: 1
      coreLimit: 1200m
      memory: 512m
    restartPolicy:
      type: Never
  ```

- Run the following command to apply the changes.

  ```shell
  kubectl apply -f spark-pi.yaml
  ```

- Run the following command to check the job status.

  ```shell
  kubectl get sparkapplication spark-pi
  ```

  The Spark job runs again. The expected output shows a RUNNING status:

  ```
  NAME       STATUS    ATTEMPTS   START                  FINISH       AGE
  spark-pi   RUNNING   1          2024-06-04T03:37:34Z   <no value>   20m
  ```
(Optional) Step 6: Delete the Spark job
When you no longer need the Spark job, delete it to release its associated resources.
Delete the Spark job that you created.

```shell
kubectl delete -f spark-pi.yaml
```

Alternatively, you can run the following command:

```shell
kubectl delete sparkapplication spark-pi
```
References
- For information about how to use Spark History Server to view Spark job information, see View Spark job information with Spark History Server.
- For information about how to use Log Service to collect Spark job logs, see Collect Spark job logs with Log Service.
- For information about how to configure Spark jobs to read and write Object Storage Service (OSS) data, see Read and write OSS data in Spark jobs.
- For information about how to use elastic resources to run Spark jobs, see Run Spark jobs with ECI elastic resources.
- For information about how to use Celeborn as a remote shuffle service (RSS) in Spark jobs, see Use Celeborn as RSS in Spark jobs.