This topic describes how to use Alibaba Cloud Serverless Kubernetes (ASK) and Elastic Container Instance to build and run a Spark application.
Background information
Apache Spark is an open source engine that is widely used for analytics workloads in scenarios such as big data and machine learning. Apache Spark 2.3.0 and later can run on Kubernetes and use it to schedule and manage resources.
Kubernetes Operator for Apache Spark is designed for running Spark jobs in Kubernetes clusters. It allows you to submit Spark tasks that are defined in custom resource definition (CRD) files to Kubernetes clusters. Kubernetes Operator for Apache Spark provides the following benefits:
Compared with submitting jobs directly through open source Apache Spark, Kubernetes Operator for Apache Spark provides more features.
Kubernetes Operator for Apache Spark can be integrated with the storage, monitoring, and logging components in a Kubernetes cluster.
Kubernetes Operator for Apache Spark supports advanced Kubernetes features such as disaster recovery and auto scaling, and can also optimize resource scheduling.
Preparations
Create an ASK cluster.
Create an ASK cluster in the Container Service for Kubernetes console. For more information, see Create an ASK cluster.
Note: If you want to pull images over the Internet or if your Spark jobs need to access the Internet, you must configure a Network Address Translation (NAT) gateway for the cluster.
You can use kubectl to manage and access the ASK cluster in either of the following ways:
If you want to manage the cluster from your on-premises computer, install and configure the kubectl client. For more information, see Connect to ACK clusters by using kubectl.
You can also use kubectl to manage the ASK cluster on Cloud Shell. For more information, see Use kubectl to manage ACK clusters on Cloud Shell.
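After kubectl is configured, you can run the following commands to confirm that the connection works. The node listed for an ASK cluster is a virtual node; the exact node name depends on your cluster.
# Confirm that kubectl can reach the cluster and list its (virtual) nodes.
kubectl cluster-info
kubectl get nodes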
Create an OSS bucket.
You must create an Object Storage Service (OSS) bucket to store data, including the test data, test results, and test logs. For more information, see Create buckets.
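If you prefer the command line over the console, you can also create the bucket with the ossutil tool after configuring your credentials with ossutil config. The bucket name bigdatastore below is only an example and matches the sample paths used later in this topic.
# Create an OSS bucket named bigdatastore (example name).
ossutil mb oss://bigdatastore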
Install Kubernetes Operator for Apache Spark
Install Kubernetes Operator for Apache Spark.
In the left-side navigation pane of the Container Service for Kubernetes console, choose Marketplace > Marketplace.
On the App Catalog tab, search for and click ack-spark-operator.
Click Deploy. Configure the parameters in the panel.
In the Parameters step, set the sparkJobNamespace parameter to the namespace where you want to deploy Spark jobs. The default value of this parameter is default. An empty string indicates that the operator monitors Spark jobs in all namespaces.
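If you prefer the command line, the chart can typically also be installed with Helm. The following is a minimal sketch; the repository alias my-repo and the spark-operator namespace are placeholders, and sparkJobNamespace is the parameter described above.
# Sketch: install the ack-spark-operator chart with Helm from a repository that already contains it.
helm install ack-spark-operator my-repo/ack-spark-operator \
  --namespace spark-operator --create-namespace \
  --set sparkJobNamespace=default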
Create a ServiceAccount, Role, and RoleBinding.
A Spark job needs a ServiceAccount to obtain the permissions to create pods. Therefore, you must create a ServiceAccount, Role, and RoleBinding. The following YAML example shows how to create a ServiceAccount, Role, and RoleBinding. Replace the namespaces with the actual values.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
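Save the preceding content to a file, for example spark-rbac.yaml, and apply it to the cluster:
kubectl apply -f spark-rbac.yaml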
Build an image of the Spark job
You must compile the Java Archive (JAR) package of the Spark job and write a Dockerfile to package the job into an image.
The following example shows a Dockerfile that uses the Spark base image provided by ACK.
FROM registry.aliyuncs.com/acs/spark:ack-2.4.5-latest
RUN mkdir -p /opt/spark/jars
# If you want to read data from OSS or sink Spark events to OSS, add the following JAR packages to the image.
ADD https://repo1.maven.org/maven2/com/aliyun/odps/hadoop-fs-oss/3.3.8-public/hadoop-fs-oss-3.3.8-public.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.8.1/aliyun-sdk-oss-3.8.1.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/aspectj/aspectjweaver/1.9.5/aspectjweaver-1.9.5.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/jdom/jdom/1.1.3/jdom-1.1.3.jar $SPARK_HOME/jars
COPY SparkExampleScala-assembly-0.1.jar /opt/spark/jars
We recommend that you use a Spark base image provided by Alibaba Cloud. Alibaba Cloud provides the Spark 2.4.5 base image, which is optimized for resource scheduling and auto scaling in Kubernetes clusters and improves the scheduling and startup speeds. You can enable the scheduling optimization feature by setting the enableAlibabaCloudFeatureGates variable in the Helm chart to true. If you require a faster startup speed, you can set enableWebhook to false.
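After the Dockerfile and the JAR package are ready, build the image and push it to your image repository. The repository address below is a placeholder; replace it with your own Container Registry address.
# Build the Spark job image and push it to your own registry (placeholder address).
docker build -t registry.cn-hangzhou.aliyuncs.com/your-namespace/spark-example:latest .
docker push registry.cn-hangzhou.aliyuncs.com/your-namespace/spark-example:latest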
Build ImageCache
It takes a long time to pull a large Spark image. You can use ImageCache to accelerate image pulling. For more information, see Manage ImageCache and Use ImageCache to accelerate the creation of pods.
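The following is a minimal sketch of an ImageCache resource, assuming that the ECI ImageCache CRD (apiVersion eci.alibabacloud.com/v1) is available in your ASK cluster; the resource name spark-image-cache is a placeholder. When the k8s.aliyun.com/eci-image-cache annotation in the job template below is set to "true", ECI can automatically match the created image cache.
# Sketch of an ImageCache resource; see the documents referenced above for the exact fields supported in your environment.
apiVersion: eci.alibabacloud.com/v1
kind: ImageCache
metadata:
  name: spark-image-cache
spec:
  images:
  - registry.aliyuncs.com/acs/spark-pi:ack-2.4.5-latest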
Write a Spark job template and submit a job
Create a YAML configuration file for a Spark job and deploy the Spark job.
Create a spark-pi.yaml file.
The following code provides an example of a typical Spark job template. For more information, see spark-on-k8s-operator.
apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: spark-pi namespace: default spec: type: Scala mode: cluster image: "registry.aliyuncs.com/acs/spark-pi:ack-2.4.5-latest" imagePullPolicy: Always mainClass: org.apache.spark.examples.SparkPi mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar" sparkVersion: "2.4.5" restartPolicy: type: Never driver: cores: 2 coreLimit: "2" memory: "3g" memoryOverhead: "1g" labels: version: 2.4.5 serviceAccount: spark annotations: k8s.aliyun.com/eci-kube-proxy-enabled: 'true' k8s.aliyun.com/eci-image-cache: "true" tolerations: - key: "virtual-kubelet.io/provider" operator: "Exists" executor: cores: 2 instances: 1 memory: "3g" memoryOverhead: "1g" labels: version: 2.4.5 annotations: k8s.aliyun.com/eci-kube-proxy-enabled: 'true' k8s.aliyun.com/eci-image-cache: "true" tolerations: - key: "virtual-kubelet.io/provider" operator: "Exists"
Deploy a Spark job.
kubectl apply -f spark-pi.yaml
Configure log collection
If you want to collect the standard output logs of a Spark job, you can configure the environment variables in the envVars field of the Spark driver and Spark executor. Then, logs are automatically collected. For more information, see Customize log collection for an elastic container instance.
envVars:
aliyun_logs_test-stdout_project: test-k8s-spark
aliyun_logs_test-stdout_machinegroup: k8s-group-app-spark
aliyun_logs_test-stdout: stdout
Before you submit a Spark job, you can configure the environment variables of the Spark driver and Spark executor as shown in the preceding code to implement automatic log collection.
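In the SparkApplication CRD, envVars is a field of the driver and executor specifications. The following snippet sketches where the preceding variables are placed; the values are the same sample values as above.
spec:
  driver:
    envVars:
      aliyun_logs_test-stdout_project: test-k8s-spark
      aliyun_logs_test-stdout_machinegroup: k8s-group-app-spark
      aliyun_logs_test-stdout: stdout
  executor:
    envVars:
      aliyun_logs_test-stdout_project: test-k8s-spark
      aliyun_logs_test-stdout_machinegroup: k8s-group-app-spark
      aliyun_logs_test-stdout: stdout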
Configure a history server
A Spark history server allows you to review Spark jobs. You can set the SparkConf field in the CRD of the Spark application to allow the application to sink events to OSS. Then, you can use the history server to retrieve the data from OSS. The following code provides sample configurations:
sparkConf:
"spark.eventLog.enabled": "true"
"spark.eventLog.dir": "oss://bigdatastore/spark-events"
"spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
# oss bucket endpoint such as oss-cn-beijing.aliyuncs.com
"spark.hadoop.fs.oss.endpoint": "oss-cn-beijing.aliyuncs.com"
"spark.hadoop.fs.oss.accessKeySecret": ""
"spark.hadoop.fs.oss.accessKeyId": ""
Alibaba Cloud provides a Helm chart that you can use to deploy a Spark history server. To deploy a Spark history server, log on to the Container Service for Kubernetes console and choose Marketplace > Marketplace in the left-side navigation pane. On the App Catalog tab, search for ack-spark-history-server and deploy it. When you deploy the Spark history server, you must configure the OSS information in the Parameters section. The following code provides sample configurations:
oss:
enableOSS: true
# Please input your accessKeyId
alibabaCloudAccessKeyId: ""
# Please input your accessKeySecret
alibabaCloudAccessKeySecret: ""
# oss bucket endpoint such as oss-cn-beijing.aliyuncs.com
alibabaCloudOSSEndpoint: "oss-cn-beijing.aliyuncs.com"
# oss file path such as oss://bucket-name/path
eventsDir: "oss://bigdatastore/spark-events"
After you deploy the Spark history server, you can view its external endpoint on the Services page. Then, you can access the external endpoint to view the history of Spark jobs.
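You can also query the endpoint with kubectl. The exact Service name depends on the Helm release name, so look for the Service that exposes the history server and note its EXTERNAL-IP.
# List Services and find the external endpoint of the Spark history server.
kubectl get services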
View the result of the Spark job
View the execution status of the pod.
kubectl get pods
The following expected output is returned:
NAME                            READY   STATUS    RESTARTS   AGE
spark-pi-1547981232122-driver   1/1     Running   0          12s
spark-pi-1547981232122-exec-1   1/1     Running   0          3s
View the real-time Spark user interface.
kubectl port-forward spark-pi-1547981232122-driver 4040:4040
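While the driver pod is running, you can then open http://localhost:4040 in a browser to view the Spark web UI.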
View the status of the Spark application.
kubectl describe sparkapplication spark-pi
The following expected output is returned:
Name:         spark-pi
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sparkoperator.k8s.io/v1alpha1","kind":"SparkApplication","metadata":{"annotations":{},"name":"spark-pi","namespace":"defaul...
API Version:  sparkoperator.k8s.io/v1alpha1
Kind:         SparkApplication
Metadata:
  Creation Timestamp:  2019-01-20T10:47:08Z
  Generation:          1
  Resource Version:    4923532
  Self Link:           /apis/sparkoperator.k8s.io/v1alpha1/namespaces/default/sparkapplications/spark-pi
  UID:                 bbe7445c-1ca0-11e9-9ad4-062fd7c19a7b
Spec:
  Deps:
  Driver:
    Core Limit:  200m
    Cores:       0.1
    Labels:
      Version:        2.4.0
    Memory:           512m
    Service Account:  spark
    Volume Mounts:
      Mount Path:  /tmp
      Name:        test-volume
  Executor:
    Cores:      1
    Instances:  1
    Labels:
      Version:  2.4.0
    Memory:     512m
    Volume Mounts:
      Mount Path:  /tmp
      Name:        test-volume
  Image:                  gcr.io/spark-operator/spark:v2.4.0
  Image Pull Policy:      Always
  Main Application File:  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
  Main Class:             org.apache.spark.examples.SparkPi
  Mode:                   cluster
  Restart Policy:
    Type:  Never
  Type:    Scala
  Volumes:
    Host Path:
      Path:  /tmp
      Type:  Directory
    Name:    test-volume
Status:
  Application State:
    Error Message:
    State:          COMPLETED
  Driver Info:
    Pod Name:             spark-pi-driver
    Web UI Port:          31182
    Web UI Service Name:  spark-pi-ui-svc
  Execution Attempts:     1
  Executor State:
    Spark - Pi - 1547981232122 - Exec - 1:  COMPLETED
  Last Submission Attempt Time:             2019-01-20T10:47:14Z
  Spark Application Id:                     spark-application-1547981285779
  Submission Attempts:                      1
  Termination Time:                         2019-01-20T10:48:56Z
Events:
  Type    Reason                     Age                 From            Message
  ----    ------                     ----                ----            -------
  Normal  SparkApplicationAdded      55m                 spark-operator  SparkApplication spark-pi was added, Enqueuing it for submission
  Normal  SparkApplicationSubmitted  55m                 spark-operator  SparkApplication spark-pi was submitted successfully
  Normal  SparkDriverPending         55m (x2 over 55m)   spark-operator  Driver spark-pi-driver is pending
  Normal  SparkExecutorPending       54m (x3 over 54m)   spark-operator  Executor spark-pi-1547981232122-exec-1 is pending
  Normal  SparkExecutorRunning       53m (x4 over 54m)   spark-operator  Executor spark-pi-1547981232122-exec-1 is running
  Normal  SparkDriverRunning         53m (x12 over 55m)  spark-operator  Driver spark-pi-driver is running
  Normal  SparkExecutorCompleted     53m (x2 over 53m)   spark-operator  Executor spark-pi-1547981232122-exec-1 completed
View the log.
NAME                            READY   STATUS      RESTARTS   AGE
spark-pi-1547981232122-driver   0/1     Completed   0          1m
If the Spark application is in the Succeed state, or the Spark driver pod is in the Completed state, the result of the Spark job is available. You can check the result of the Spark job in the log.
kubectl logs spark-pi-1547981232122-driver
Pi is roughly 3.152155760778804