Apache Spark is a high-speed computing engine for large-scale data processing, widely used in data analytics and machine learning. Spark Operator automates the deployment and lifecycle management of Spark jobs on Kubernetes. This topic guides you through using Spark Operator to run and manage Spark jobs on an ACK cluster to efficiently handle big data workloads.
Prerequisites
- An ACK Pro cluster or ACK Serverless Pro cluster running Kubernetes 1.24 or later. For more information, see Create an ACK managed cluster, Create an ACK Serverless cluster, and Manually upgrade ACK clusters.
- A kubectl client connected to the ACK cluster. For more information, see Connect to an ACK cluster using kubectl.
How it works
Spark Operator automates the lifecycle of Spark jobs on Kubernetes. It uses CustomResourceDefinition (CRD) resources such as SparkApplication and ScheduledSparkApplication to declaratively manage Spark jobs. Spark Operator leverages native Kubernetes features like auto-scaling, health checks, and resource management to efficiently run and monitor your workloads. ACK provides the ack-spark-operator component, which is based on the community's kubeflow/spark-operator. For more information, see Spark Operator | Kubeflow.
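To make the declarative model concrete, the following minimal ScheduledSparkApplication manifest is a sketch of how a recurring job can be described. The schedule, name, and job template below are illustrative assumptions, not part of this walkthrough; verify field support against the Spark Operator version installed in your cluster.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled      # hypothetical name for illustration
  namespace: default
spec:
  schedule: "@every 1h"         # cron-style schedule; assumption for illustration
  concurrencyPolicy: Forbid     # do not start a new run while the previous one is still active
  template:                     # an ordinary SparkApplication spec
    type: Scala
    mode: cluster
    image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.4
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
    sparkVersion: 3.5.4
    driver:
      cores: 1
      memory: 512m
      serviceAccount: spark-operator-spark
    executor:
      instances: 1
      cores: 1
      memory: 512m
    restartPolicy:
      type: Never
```

The operator creates a SparkApplication from `spec.template` on each tick of `spec.schedule`, so the one-off and scheduled cases share the same job definition.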
Benefits:
- Simplified management: Automate the deployment and lifecycle management of Spark jobs by using declarative configurations in Kubernetes.
- Multi-tenancy support: Use Kubernetes namespaces and resource quotas for user-level resource isolation and allocation. Use node selection to ensure that Spark workloads run on dedicated resources.
- Elastic resource provisioning: Leverage elastic resources such as Elastic Container Instance (ECI) or elastic node pools to quickly obtain large amounts of elastic resources during business peaks, balancing performance and cost.
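As a sketch of how elastic provisioning can be wired into a job: on ACK, pods carrying the label `alibabacloud.com/eci: "true"` can be scheduled to ECI. Assuming that label convention, a SparkApplication could keep its driver on regular nodes and send only the executors to ECI. Treat the label value and sizing below as assumptions to verify against your cluster configuration.

```yaml
# Fragment of a SparkApplication spec; sketch only.
spec:
  driver:
    cores: 1
    memory: 512m                     # driver stays on regular cluster nodes
  executor:
    instances: 4
    cores: 1
    memory: 512m
    labels:
      alibabacloud.com/eci: "true"   # assumption: label that routes executor pods to ECI on ACK
```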
Use cases:
- Data analytics: Data scientists can use Spark for interactive data analytics and data cleansing.
- Batch computing: Run scheduled batch jobs to process large-scale datasets.
- Real-time processing: The Spark Streaming library provides the capability for real-time data stream processing.
Procedure overview
This topic describes how to use Spark Operator to run and manage Spark jobs on an ACK cluster.
- Deploy the ack-spark-operator component: Install Spark Operator in your ACK cluster to enable the management and execution of Spark jobs.
- Submit a Spark job: Create and submit a Spark job manifest to run a specific data processing task.
- Monitor the Spark job: Monitor the running status of the job and obtain detailed execution information and logs.
- Access the Spark web UI: Access the web UI for a more intuitive view of the Spark job's execution.
- Update the Spark job: Adjust the job configuration based on your requirements and apply the updates.
- Delete the Spark job: Clean up completed or unnecessary Spark jobs to avoid unexpected costs.
Step 1: Deploy the ack-spark-operator component
- Log on to the ACK console. In the left navigation pane, click Marketplace.
- On the Marketplace page, click the App Catalog tab, then search for and select ack-spark-operator.
- On the ack-spark-operator page, click Deploy.
- In the Create panel, select a cluster and namespace, and then click Next.
- On the Parameters page, configure the parameters and then click OK.
The following table describes key configuration parameters. For a complete list, see the ConfigMaps tab on the ack-spark-operator page.
| Parameter | Description | Default |
| --- | --- | --- |
| `controller.replicas` | The number of controller replicas. | `1` |
| `webhook.replicas` | The number of webhook replicas. | `1` |
| `spark.jobNamespaces` | The list of namespaces where Spark jobs can run. An empty string `""` allows all namespaces. Separate multiple namespaces with commas (`,`). Examples: `["default"]` (default namespace), `[""]` (all namespaces), `["ns1","ns2","ns3"]` (multiple namespaces). | `["default"]` |
| `spark.serviceAccount.name` | Spark Operator automatically creates a ServiceAccount named `spark-operator-spark` and the required RBAC resources in each namespace specified by `spark.jobNamespaces`. If you customize this name, you must specify the new name when submitting Spark jobs. | `spark-operator-spark` |
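For example, to restrict Spark jobs to two dedicated namespaces while keeping the default ServiceAccount name, the parameters above could be overridden with a Helm-style values fragment like the following. The namespace names are hypothetical, and the exact input form depends on the Parameters page; this is a sketch, not the authoritative configuration schema.

```yaml
controller:
  replicas: 1
webhook:
  replicas: 1
spark:
  jobNamespaces:
  - spark-team-a        # hypothetical namespace for illustration
  - spark-team-b        # hypothetical namespace for illustration
  serviceAccount:
    name: spark-operator-spark
```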
Step 2: Submit a Spark job
Create a SparkApplication manifest to submit a Spark job for data processing.
- Create the following SparkApplication manifest and save it as spark-pi.yaml.

  ```yaml
  apiVersion: sparkoperator.k8s.io/v1beta2
  kind: SparkApplication
  metadata:
    name: spark-pi
    namespace: default  # The namespace must be in the list of namespaces specified by spark.jobNamespaces.
  spec:
    type: Scala
    mode: cluster
    image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.4
    imagePullPolicy: IfNotPresent
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
    arguments:
    - "1000"
    sparkVersion: 3.5.4
    driver:
      cores: 1
      coreLimit: 1200m
      memory: 512m
      serviceAccount: spark-operator-spark  # If you customized the ServiceAccount name, change the value accordingly.
    executor:
      instances: 1
      cores: 1
      coreLimit: 1200m
      memory: 512m
    restartPolicy:
      type: Never
  ```

- Run the following command to submit the Spark job.

  ```shell
  kubectl apply -f spark-pi.yaml
  ```

  Expected output:

  ```
  sparkapplication.sparkoperator.k8s.io/spark-pi created
  ```
Step 3: Monitor the Spark job
Run the following commands to check the status, associated pods, and logs of the Spark job.
- Run the following command to check the status of the Spark job.

  ```shell
  kubectl get sparkapplication spark-pi
  ```

  Expected output:

  ```
  NAME       STATUS      ATTEMPTS   START                  FINISH       AGE
  spark-pi   SUBMITTED   1          2024-06-04T03:17:11Z   <no value>   15s
  ```

- Run the following command to check the status of the pods for the Spark job. The command filters pods by the label sparkoperator.k8s.io/app-name with the value spark-pi.

  ```shell
  kubectl get pod -l sparkoperator.k8s.io/app-name=spark-pi
  ```

  Expected output:

  ```
  NAME                               READY   STATUS    RESTARTS   AGE
  spark-pi-driver                    1/1     Running   0          49s
  spark-pi-7272428fc8f5f392-exec-1   1/1     Running   0          13s
  ```

  After the Spark job completes, the driver automatically deletes all executor pods.

- Run the following command to view the details of the Spark job.

  ```shell
  kubectl describe sparkapplication spark-pi
  ```

- Run the following command to view the last 20 lines of logs from the driver pod.

  ```shell
  kubectl logs --tail=20 spark-pi-driver
  ```

  Expected output:

  ```
  24/05/30 10:05:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
  24/05/30 10:05:30 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 7.942 s
  24/05/30 10:05:30 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
  24/05/30 10:05:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
  24/05/30 10:05:30 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 8.043996 s
  Pi is roughly 3.1419522314195225
  24/05/30 10:05:30 INFO SparkContext: SparkContext is stopping with exitCode 0.
  24/05/30 10:05:30 INFO SparkUI: Stopped Spark web UI at http://spark-pi-1e18858fc8f56b14-driver-svc.default.svc:4040
  24/05/30 10:05:30 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
  24/05/30 10:05:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
  24/05/30 10:05:30 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
  24/05/30 10:05:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
  24/05/30 10:05:30 INFO MemoryStore: MemoryStore cleared
  24/05/30 10:05:30 INFO BlockManager: BlockManager stopped
  24/05/30 10:05:30 INFO BlockManagerMaster: BlockManagerMaster stopped
  24/05/30 10:05:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
  24/05/30 10:05:30 INFO SparkContext: Successfully stopped SparkContext
  24/05/30 10:05:30 INFO ShutdownHookManager: Shutdown hook called
  24/05/30 10:05:30 INFO ShutdownHookManager: Deleting directory /var/data/spark-14ed60f1-82cd-4a33-b1b3-9e5d975c5b1e/spark-01120c89-5296-4c83-8a20-0799eef4e0ee
  24/05/30 10:05:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-5f98ed73-576a-41be-855d-dabdcf7de189
  ```
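For scripted monitoring, you can also block until the job reaches a terminal state instead of polling manually. The following sketch assumes kubectl 1.23 or later (for `--for=jsonpath`) and the spark-pi job from this walkthrough; it is an illustration, not part of the official procedure.

```shell
# Wait up to 10 minutes for the SparkApplication to report COMPLETED.
kubectl wait sparkapplication/spark-pi \
  --for=jsonpath='{.status.applicationState.state}'=COMPLETED \
  --timeout=600s

# Print the final state for confirmation (e.g. COMPLETED or FAILED).
kubectl get sparkapplication spark-pi \
  -o jsonpath='{.status.applicationState.state}{"\n"}'
```

If the job fails instead, `kubectl wait` times out; `kubectl describe sparkapplication spark-pi` then shows the failure reason in the status and events.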
Step 4: Access the Spark web UI
The web UI is accessible only while the Spark job's driver pod is in the Running state. The UI becomes unavailable after the job completes.
By default, when you deploy the ack-spark-operator component, the controller.uiService.enable parameter is set to true. This automatically creates a Service to expose the web UI, which you can then access using port forwarding. If you set this parameter to false during deployment, no Service is created. In this case, you must forward the port from the driver pod directly.
Using kubectl port-forward is suitable for quick verification in test environments but is not recommended for production due to security risks.
- Forward the web UI port to your local machine by using one of the following commands based on your scenario:

  - Port-forward via the Service:

    ```shell
    kubectl port-forward services/spark-pi-ui-svc 4040
    ```

  - Port-forward via the driver pod:

    ```shell
    kubectl port-forward pods/spark-pi-driver 4040
    ```

  Expected output:

  ```
  Forwarding from 127.0.0.1:4040 -> 4040
  Forwarding from [::1]:4040 -> 4040
  ```

- Open http://127.0.0.1:4040 in your web browser to access the web UI.
(Optional) Step 5: Update the Spark job
If you need to modify the parameters of a Spark job, you can update its manifest.
- Edit the spark-pi.yaml manifest. For example, change the value of arguments to 10000 and the number of executor instances to 2.

  ```yaml
  apiVersion: sparkoperator.k8s.io/v1beta2
  kind: SparkApplication
  metadata:
    name: spark-pi
  spec:
    type: Scala
    mode: cluster
    image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/spark:3.5.4
    imagePullPolicy: IfNotPresent
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar
    arguments:
    - "10000"
    sparkVersion: 3.5.4
    driver:
      cores: 1
      coreLimit: 1200m
      memory: 512m
      serviceAccount: spark-operator-spark  # If you customized the ServiceAccount name, change the value accordingly.
    executor:
      instances: 2
      cores: 1
      coreLimit: 1200m
      memory: 512m
    restartPolicy:
      type: Never
  ```

- Run the following command to apply the changes.

  ```shell
  kubectl apply -f spark-pi.yaml
  ```

- Run the following command to check the job status.

  ```shell
  kubectl get sparkapplication spark-pi
  ```

  The Spark job runs again. The expected output shows a RUNNING status:

  ```
  NAME       STATUS    ATTEMPTS   START                  FINISH       AGE
  spark-pi   RUNNING   1          2024-06-04T03:37:34Z   <no value>   20m
  ```
(Optional) Step 6: Delete the Spark job
When you no longer need the Spark job, delete it to release its associated resources.
Delete the Spark job that you created.

```shell
kubectl delete -f spark-pi.yaml
```

Alternatively, you can run the following command:

```shell
kubectl delete sparkapplication spark-pi
```
References
- For information about how to use Spark History Server to view Spark job information, see View Spark job information with Spark History Server.
- For information about how to use Log Service to collect Spark job logs, see Collect Spark job logs with Log Service.
- For information about how to configure Spark jobs to read and write Object Storage Service (OSS) data, see Read and write OSS data in Spark jobs.
- For information about how to use elastic resources to run Spark jobs, see Run Spark jobs with ECI elastic resources.
- For information about how to use Celeborn as a remote shuffle service (RSS) in Spark jobs, see Use Celeborn as RSS in Spark jobs.