AnalyticDB for MySQL:Use spark-submit to develop Spark applications

Last Updated: Mar 18, 2024

This topic describes how to use AnalyticDB for MySQL spark-submit to develop Spark applications. In the example, an Elastic Compute Service (ECS) instance is used to connect to AnalyticDB for MySQL.

Background information

When you connect to an AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster from a client and develop Spark applications, you must use AnalyticDB for MySQL spark-submit to submit Spark applications.

AnalyticDB for MySQL spark-submit can be used to submit Spark batch applications, not Spark SQL applications.

Prerequisites

Java Development Kit (JDK) V1.8 or later is installed.

Download and install AnalyticDB for MySQL spark-submit

  1. Run the following command to download the installation package of AnalyticDB for MySQL spark-submit. The file name is adb-spark-toolkit-submit-0.0.1.tar.gz.

    wget https://dla003.oss-cn-hangzhou.aliyuncs.com/adb-spark-toolkit-submit-0.0.1.tar.gz
  2. Run the following command to extract the package and install AnalyticDB for MySQL spark-submit.

    tar zxvf adb-spark-toolkit-submit-0.0.1.tar.gz
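    After extraction, the toolkit files are placed in the adb-spark-toolkit-submit directory, which contains the bin/spark-submit script and the conf/spark-defaults.conf file used in the following steps. As an optional sanity check (not part of the official procedure), you can list the extracted files:

    # Optional: confirm that the toolkit was extracted as expected.
    ls adb-spark-toolkit-submit/bin/spark-submit adb-spark-toolkit-submit/conf/spark-defaults.conf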

Configure parameters

You can modify the configuration parameters of an AnalyticDB for MySQL Spark batch application in the conf/spark-defaults.conf file or in the spark-submit command. We recommend that you modify the parameters in the conf/spark-defaults.conf file, because AnalyticDB for MySQL spark-submit automatically reads this file. If you specify parameters in the command, the parameters in the configuration file are not overwritten, but the values in the command take precedence.

  1. After you install AnalyticDB for MySQL spark-submit, run the following command to open the adb-spark-toolkit-submit/conf/spark-defaults.conf file:

    vim adb-spark-toolkit-submit/conf/spark-defaults.conf
  2. Configure parameters in the key=value format. Example:

    keyId = yourAkId
    secretId = yourAkSec
    regionId = cn-hangzhou
    clusterId = amv-bp15f9q95p****
    rgName = sg-default
    ossUploadPath = oss://<bucket_name>/sparkjars/
    spark.driver.resourceSpec = medium
    spark.executor.instances = 2
    spark.executor.resourceSpec = medium
    spark.adb.roleArn = arn:1234567/adbsparkprocessrole
    spark.adb.eni.vswitchId = vsw-defaultswitch
    spark.adb.eni.securityGroupId = sg-defaultgroup
    spark.app.log.rootPath = oss://<bucket_name>/sparklogs/
    Important
    • You must replace the sample values with the actual values.

    • The keyId, secretId, regionId, clusterId, rgName, ossKeyId, and ossUploadPath parameters are supported only by AnalyticDB for MySQL spark-submit, not by Apache Spark. For more information about the parameters, see the "Parameters" section of this topic.

      You can configure the parameters in the command for submitting Spark jobs in the --key1 value1 --key2 value2 format. You can also use AnalyticDB for MySQL SparkConf to configure the parameters. For more information, see the "Configuration compatibility" section of this topic.

    • For information about the other parameters in the sample code or about all configuration parameters that are supported by AnalyticDB for MySQL spark-submit, see Spark application configuration parameters.

    Table 1. Parameters

    keyId (required)
    The AccessKey ID of the Alibaba Cloud account or Resource Access Management (RAM) user that is used to run Spark jobs. For information about how to view the AccessKey ID, see Accounts and permissions.

    secretId (required)
    The AccessKey secret of the Alibaba Cloud account or RAM user that is used to run Spark jobs.

    regionId (required)
    The ID of the region where the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster resides.

    clusterId (required)
    The ID of the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.

    rgName (required)
    The resource group of the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.

    ossKeyId (optional)
    The AccessKey ID of the Alibaba Cloud account or RAM user that is used to create the Object Storage Service (OSS) bucket. If you specify an OSS directory in the configuration, JAR packages in your on-premises storage are automatically uploaded to that directory.

    ossSecretId (optional)
    The AccessKey secret of the Alibaba Cloud account or RAM user that is used to create the OSS bucket.

    ossEndpoint (optional)
    The internal endpoint of the OSS bucket. For information about the mappings between OSS regions and endpoints, see Regions and endpoints.

    ossUploadPath (optional)
    The OSS directory to which JAR packages are uploaded.

Configuration compatibility

To ensure compatibility with open source spark-submit, the keyId, secretId, regionId, clusterId, rgName, ossKeyId, ossSecretId, ossEndpoint, and ossUploadPath parameters can also be configured by using AnalyticDB for MySQL SparkConf in the following format:

--conf spark.adb.access.key.id=<value>    
--conf spark.adb.access.secret.id=<value>
--conf spark.adb.regionId=<value>
--conf spark.adb.clusterId=<value>
--conf spark.adb.rgName=<value>
--conf spark.adb.oss.akId=<value>
--conf spark.adb.oss.akSec=<value>
--conf spark.adb.oss.endpoint=<value>
--conf spark.adb.oss.uploadPath=<value>
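
For example, the following two invocations pass the same cluster ID and resource group, first as AnalyticDB for MySQL spark-submit command-line parameters and then as SparkConf options. This is a sketch with placeholder values; replace the cluster ID, resource group name, and JAR path with your own.

# Cluster ID and resource group as command-line parameters:
./bin/spark-submit --clusterId amv-bp15f9q95p**** --rgName <ResourceGroup_name> \
  oss://<bucket-name>/jars/xxx.jar

# The same settings as SparkConf options:
./bin/spark-submit \
  --conf spark.adb.clusterId=amv-bp15f9q95p**** \
  --conf spark.adb.rgName=<ResourceGroup_name> \
  oss://<bucket-name>/jars/xxx.jar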

Submit a Spark job

  1. Run the following command to go to the AnalyticDB for MySQL spark-submit directory:

    cd adb-spark-toolkit-submit
  2. Submit a job in the following format:

    ./bin/spark-submit  \
    --class com.aliyun.spark.oss.SparkReadOss \
    --verbose \
    --name <your_job_name> \
    --jars oss://<bucket-name>/jars/xxx.jar,oss://<bucket-name>/jars/xxx.jar \
    --conf spark.driver.resourceSpec=medium \
    --conf spark.executor.instances=1 \
    --conf spark.executor.resourceSpec=medium \
    oss://<bucket-name>/jars/xxx.jar args0 args1
    Note

    One of the following response codes is returned after you submit the Spark job:

    • 255: The job failed.

    • 0: The job is successfully executed.

    • 143: The job was terminated.
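
    Because the response codes are fixed, you can script the submission and react to the result. The following sketch reuses the sample job from this step and only adds a check of the exit code; the paths and names remain placeholders.

    ./bin/spark-submit \
    --class com.aliyun.spark.oss.SparkReadOss \
    --name <your_job_name> \
    --conf spark.driver.resourceSpec=medium \
    --conf spark.executor.instances=1 \
    --conf spark.executor.resourceSpec=medium \
    oss://<bucket-name>/jars/xxx.jar args0 args1
    rc=$?
    # React to the documented response codes.
    case $rc in
      0)   echo "The job is successfully executed." ;;
      143) echo "The job was terminated." ;;
      255) echo "The job failed." ;;
      *)   echo "Unexpected exit code: $rc" ;;
    esac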

The following parameters are used in the command.

--class (example: <class_name>)
The entry point of the Java or Scala application. The entry point is not required for a Python application.

--verbose (no value)
Displays the logs that are generated during the submission of the Spark job.

--name (example: <spark_name>)
The name of the Spark application.

--jars (example: <jar_name>)
The absolute paths of the JAR packages that are required for the Spark application. Separate multiple JAR packages with commas (,).

  • You can specify the OSS directory to which JAR packages are uploaded in the --oss-upload-path parameter of the command or in the ossUploadPath parameter of the conf/spark-defaults.conf file.

  • When an on-premises file is uploaded, the system verifies the file based on its MD5 value. If a file that has the same name and MD5 value already exists in the specified OSS directory, the upload is canceled.

  • If you have manually updated a JAR package in the OSS directory, you must delete the corresponding MD5 file of the package.

Note
Make sure that you have the permissions to access the OSS directory. You can log on to the RAM console and grant the AliyunOSSFullAccess permission to the RAM user on the Users page.

--conf (example: <key=value>)
The configuration parameters of the Spark application. The configuration of AnalyticDB for MySQL spark-submit is highly similar to that of open source spark-submit. For information about the parameters that are different and the parameters that are specific to AnalyticDB for MySQL spark-submit, see the "Parameters specific to AnalyticDB for MySQL spark-submit" section of this topic and Spark application configuration parameters.

Note
  • Specify multiple parameters in the following format: --conf key1=value1 --conf key2=value2.

  • The configurations in the command take precedence over those in the configuration file.

oss://<bucket-name>/jars/xxx.jar (example: <oss_path>)
The absolute path of the main file of the Spark application. The main file can be a JAR package that contains the entry point or an executable file that serves as the entry point for the Python program. A Python example follows this list.

Note
You must store the main files of Spark applications in OSS.

args (example: <args0 args1>)
The parameters that are required for the JAR packages. Separate multiple parameters with spaces.
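
For a Python application, omit --class and pass the entry script in OSS as the main file. The following is a sketch with placeholder paths; the script name and bucket are assumptions, not values from this topic.

./bin/spark-submit \
--name <your_job_name> \
--conf spark.driver.resourceSpec=medium \
--conf spark.executor.instances=1 \
--conf spark.executor.resourceSpec=medium \
oss://<bucket-name>/python/main.py args0 args1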

Query a list of Spark jobs

./bin/spark-submit --list --clusterId <cluster_Id>  --rgName <ResourceGroup_name> --pagenumber 1 --pagesize 3

Query the status of a Spark job

./bin/spark-submit --status <appId>

You can obtain the appId of a job from the list of Spark jobs. For more information, see the "Query a list of Spark jobs" section of this topic.

For more information about the statuses of Spark jobs, see SparkAppInfo.
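
If you want to watch a job from the command line, you can poll its status in a loop, as in the following sketch. The appId and polling interval are placeholders, and the raw output of --status is printed without being parsed.

# Poll the status of a job every 30 seconds. Press Ctrl+C to stop.
APP_ID=<appId>
while true; do
  ./bin/spark-submit --status "$APP_ID"
  sleep 30
done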

Query the parameters and Spark UI address of a submitted Spark job

./bin/spark-submit --detail <appId>

You can obtain the appId of a job from the list of Spark jobs. For more information, see the "Query a list of Spark jobs" section of this topic.

The Spark WEB UI field in the returned results indicates the Spark UI address.

Query the logs of a Spark job

./bin/spark-submit --get-log <appId>

You can obtain the appId of a job from the list of Spark jobs. For more information, see the "Query a list of Spark jobs" section of this topic.

Terminate a Spark job

./bin/spark-submit --kill <appId>

You can obtain the appId of a job from the list of Spark jobs. For more information, see the "Query a list of Spark jobs" section of this topic.

Differences between AnalyticDB for MySQL spark-submit and open source spark-submit

Parameters specific to AnalyticDB for MySQL spark-submit

Table 2. Parameters specific to AnalyticDB for MySQL spark-submit

--api-retry-times
The maximum number of retries allowed when AnalyticDB for MySQL spark-submit fails to run a command. Commands for submitting Spark jobs are not retried, because job submission is not an idempotent operation: a submission that fails due to a reason such as a network timeout may have actually succeeded in the background, so retrying it may result in duplicate submissions. To check whether a job has been submitted, use --list to obtain the list of submitted jobs, or go to the AnalyticDB for MySQL Spark console and check the information in the job list.

--time-out-seconds
The timeout period in AnalyticDB for MySQL spark-submit after which a command is retried. Unit: seconds. Default value: 10.

--enable-inner-endpoint
When you submit a Spark job from an ECS instance, you can specify this parameter so that AnalyticDB for MySQL spark-submit can access services within the virtual private cloud (VPC). For an example, see the sketch after this table.

Important
Only the following regions support service access within VPCs: China (Hangzhou), China (Shanghai), and China (Shenzhen).

--list
Obtains a list of jobs. In most cases, this parameter is used together with --pagenumber and --pagesize. For example, if you want to view five jobs on the first page, you can configure the parameters as follows:

--list 
--pagenumber 1 
--pagesize 5

--pagenumber
The page number. Default value: 1.

--pagesize
The maximum number of jobs to return on each page. Default value: 10.

--kill
Terminates a job.

--get-log
Queries the logs of an application.

--status
Queries the details about an application.
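
For example, when you run AnalyticDB for MySQL spark-submit on an ECS instance in one of the supported regions, you might combine --enable-inner-endpoint with a query command such as --list. This sketch assumes that --enable-inner-endpoint takes no value; the cluster ID and resource group name are placeholders.

./bin/spark-submit --enable-inner-endpoint --list \
--clusterId <cluster_Id> --rgName <ResourceGroup_name> \
--pagenumber 1 --pagesize 5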

Parameters specific to open source spark-submit

AnalyticDB for MySQL spark-submit does not support the configuration parameters that are specific to open source spark-submit. For more information, see the "Configuration parameters that are not supported for AnalyticDB for MySQL" section of the Spark application configuration parameters topic.