
AnalyticDB for MySQL:Spark application configuration parameters

Last Updated: Dec 15, 2023

The configuration parameters of AnalyticDB for MySQL Spark are similar to those of Apache Spark. This topic describes the configuration parameters of AnalyticDB for MySQL Spark that are different from those of Apache Spark.

Usage notes

Spark application configuration parameters are used to configure and adjust the behavior and performance of Spark applications. The format of these parameters varies based on the development tool that you use.

| Development tool | Configuration parameter format | Example |
| --- | --- | --- |
| SQL editor | set key=value; | set spark.sql.hive.metastore.version=adb; |
| Spark JAR editor | "key": "value" | "spark.sql.hive.metastore.version": "adb" |
| Notebook editor | "key": "value" | "spark.sql.hive.metastore.version": "adb" |
| spark-submit | key=value | spark.sql.hive.metastore.version=adb |
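
For example, in the SQL editor, multiple configuration parameters can be specified as separate set statements before the SQL statements of the job. The following lines are a minimal sketch that combines parameters described later in this topic; the values are placeholders that you would adjust for your own job:

    set spark.driver.resourceSpec=medium;
    set spark.executor.resourceSpec=medium;
    set spark.adb.priority=HIGH;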

Specify driver and executor resources

spark.driver.resourceSpec (Required: Yes)

The resource specifications of the Spark driver. Each type corresponds to distinct specifications. For more information, see the Type column in the "Spark resource specifications" table of this topic.

Important: When you submit a Spark application, you can also use the corresponding Apache Spark parameters. In this case, configure the cores and memory values based on the "Spark resource specifications" table of this topic.

Example: CONF spark.driver.resourceSpec = c.small;. In this example, the Spark driver provides 1 core and 2 GB of memory.

Corresponding parameters in Apache Spark: spark.driver.cores and spark.driver.memory.

spark.executor.resourceSpec (Required: Yes)

The resource specifications of a Spark executor. Each type corresponds to distinct specifications. For more information, see the Type column in the "Spark resource specifications" table of this topic.

Important: When you submit a Spark application, you can also use the corresponding Apache Spark parameters. In this case, configure the cores and memory values based on the "Spark resource specifications" table of this topic.

Example: CONF spark.executor.resourceSpec = c.small;. In this example, each Spark executor provides 1 core and 2 GB of memory.

Corresponding parameters in Apache Spark: spark.executor.cores and spark.executor.memory.

spark.adb.driverDiskSize (Required: No)

The size of the additional disk storage that is mounted on the Spark driver to meet large disk storage requirements. By default, the additional disk storage is mounted on the /user_data_dir directory.

Unit: GiB. Valid values: (0,100]. Example: spark.adb.driverDiskSize=50Gi. In this example, the additional disk storage that is mounted on the Spark driver is set to 50 GiB.

Corresponding parameter in Apache Spark: N/A.

spark.adb.executorDiskSize (Required: No)

The size of the additional disk storage that is mounted on a Spark executor to meet the requirements of shuffle operations. By default, the additional disk storage is mounted on the /shuffle_volume directory.

Unit: GiB. Valid values: (0,100]. Example: spark.adb.executorDiskSize=50Gi. In this example, the additional disk storage that is mounted on a Spark executor is set to 50 GiB.

Corresponding parameter in Apache Spark: N/A.
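
For example, in the key-value format of the Spark JAR editor, the driver and executor resources and the additional executor disk storage might be declared together. The following lines are a minimal sketch; the specification and size values are placeholders:

    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "large",
    "spark.adb.executorDiskSize": "50Gi"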

Spark resource specifications

Important

You can use reserved resources or elastic resources to execute Spark jobs. If you use the on-demand elastic resources of a job resource group to execute Spark jobs, the system calculates the number of used AnalyticDB compute units (ACUs) based on the Spark resource specifications and the CPU-to-memory ratio by using the following formulas:

  • 1:2 CPU-to-memory ratio: Number of used ACUs = Number of CPU cores × 0.8.

  • 1:4 CPU-to-memory ratio: Number of used ACUs = Number of CPU cores × 1.

  • 1:8 CPU-to-memory ratio: Number of used ACUs = Number of CPU cores × 1.25.

For information about the prices of on-demand elastic resources, see Pricing for Data Lakehouse Edition (V3.0).
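
For example, based on the preceding formulas, a job that runs on medium specifications (2 CPU cores with a 1:4 CPU-to-memory ratio) uses 2 × 1 = 2 ACUs, and a job that runs on c.small specifications (1 CPU core with a 1:2 ratio) uses 1 × 0.8 = 0.8 ACUs. These values match the Used ACUs column in the following table.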

Table 1. Spark resource specifications

| Type | CPU cores | Memory (GB) | Disk storage¹ (GB) | Used ACUs |
| --- | --- | --- | --- | --- |
| c.small | 1 | 2 | 20 | 0.8 |
| small | 1 | 4 | 20 | 1 |
| m.small | 1 | 8 | 20 | 1.5 |
| c.medium | 2 | 4 | 20 | 1.6 |
| medium | 2 | 8 | 20 | 2 |
| m.medium | 2 | 16 | 20 | 3 |
| c.large | 4 | 8 | 20 | 3.2 |
| large | 4 | 16 | 20 | 4 |
| m.large | 4 | 32 | 20 | 6 |
| c.xlarge | 8 | 16 | 20 | 6.4 |
| xlarge | 8 | 32 | 20 | 8 |
| m.xlarge | 8 | 64 | 20 | 12 |
| c.2xlarge | 16 | 32 | 20 | 12.8 |
| 2xlarge | 16 | 64 | 20 | 16 |
| m.2xlarge | 16 | 128 | 20 | 24 |

Note

¹ Disk storage: The system is expected to occupy 1% of the disk storage. The actual available disk storage may be less than 20 GB.

Specify priorities for Spark jobs

spark.adb.priority (Required: No; Default value: NORMAL)

The priority of a Spark job. If resources are insufficient to execute all submitted Spark jobs, queued jobs that have higher priorities are executed first. Valid values:

  • HIGH

  • NORMAL

  • LOW

  • LOWEST

Important: We recommend that you set this parameter to HIGH for all Spark streaming jobs.
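
For example, a streaming job submitted from the Spark JAR editor might raise its own priority by using the key-value format described in the Usage notes section of this topic. This is a minimal sketch:

    "spark.adb.priority": "HIGH"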

Access metadata

spark.sql.catalogImplementation (Required: No; Default value: hive for Spark SQL jobs, in-memory for non-Spark SQL jobs)

The type of the metadata to be accessed. Valid values:

  • hive: the metadata in the built-in Hive Metastore of Apache Spark.

  • in-memory: the metadata in the temporary directory.

spark.sql.hive.metastore.version (Required: No; Default value: adb for Spark SQL jobs, <hive_version> for non-Spark SQL jobs)

The version of the metadata to be accessed. Valid values:

  • adb: AnalyticDB for MySQL.

  • <hive_version>: the version of the Hive Metastore.

Note:

  • For information about the Hive versions that are supported by Apache Spark, see Spark Configuration.

  • To access a self-managed Hive Metastore, you can replace the default configuration with the standard Apache Spark configuration. For more information, see Spark Configuration.

Examples

  • Access the metadata in AnalyticDB for MySQL.

    spark.sql.hive.metastore.version=adb;
  • Access the metadata in the built-in Hive Metastore of Apache Spark.

    spark.sql.catalogImplementation=hive;
    spark.sql.hive.metastore.version=2.1.3;
  • Access the metadata in the temporary directory.

    spark.sql.catalogImplementation=in-memory;

Configure the Spark UI

spark.app.log.rootPath (Required: No; Default value: oss://<aliyun-oa-adb-spark-Alibaba Cloud account ID-oss-Zone ID>/<Cluster ID>/<Spark application ID>)

The path that stores AnalyticDB for MySQL Spark job logs and the output data of the Linux OS.

By default, the folder that is named after the Spark application ID stores the following data:

  • The file named <Spark application ID>-000X, which stores the Spark event logs used for Spark UI rendering.

  • The folder named driver and the folders named with numbers, which store the logs of the corresponding nodes.

  • The stdout and stderr folders, which store the output data of the Linux OS.

spark.adb.event.logUploadDuration (Required: No; Default value: false)

Specifies whether to record the amount of time that is consumed to upload event logs.

spark.adb.buffer.maxNumEvents (Required: No; Default value: 1000)

The maximum number of events that are cached by the driver.

spark.adb.payload.maxNumEvents (Required: No; Default value: 10000)

The maximum number of events that are uploaded to Object Storage Service (OSS) at a time.

spark.adb.event.pollingIntervalSecs (Required: No; Default value: 0.5)

The interval between two batches of events that are uploaded to OSS. Unit: seconds. For example, a value of 0.5 indicates that events are uploaded every 0.5 seconds.

spark.adb.event.maxPollingIntervalSecs (Required: No; Default value: 60)

The maximum retry interval that is allowed when an upload of events to OSS fails. Unit: seconds. The interval between a failed upload and the retry ranges from the spark.adb.event.pollingIntervalSecs value to the spark.adb.event.maxPollingIntervalSecs value.

spark.adb.event.maxWaitOnEndSecs (Required: No; Default value: 10)

The maximum wait time for events to be uploaded to OSS. Unit: seconds. The maximum wait time is the interval between the start and completion of an upload. If an upload is not complete within the maximum wait time, the upload is retried.

spark.adb.event.waitForPendingPayloadsSleepIntervalSecs (Required: No; Default value: 1)

The amount of time to wait before retrying an upload that fails to complete within the spark.adb.event.maxWaitOnEndSecs value. Unit: seconds.

spark.adb.eventLog.rolling.maxFileSize (Required: No; Default value: 209715200)

The maximum size of each event log file in OSS. Unit: bytes. Event logs are stored in OSS as multiple files, such as Eventlog.0 and Eventlog.1. You can specify the file size.
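
For example, the log path and the rolling file size might be overridden together in the key-value format. This is a minimal sketch; the OSS bucket and path are placeholders for your own log location, and 104857600 bytes corresponds to 100 MB:

    "spark.app.log.rootPath": "oss://<your-bucket>/<your-log-path>/",
    "spark.adb.eventLog.rolling.maxFileSize": "104857600"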

Grant permissions to RAM users

spark.adb.roleArn (Required: No; Default value: N/A)

The Alibaba Cloud Resource Name (ARN) of the RAM role that is specified in the Resource Access Management (RAM) console and used to submit Spark applications. For more information, see RAM role overview. This parameter is required if you submit Spark applications as a RAM user. This parameter is not required if you submit Spark applications with an Alibaba Cloud account.

Note: If you have granted permissions to the RAM user in the RAM console, you do not need to specify this parameter. For more information, see Perform authorization within an Alibaba Cloud account.
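
For example, a RAM user might pass the role ARN in the key-value format as follows. This is a minimal sketch; the account ID and role name are placeholders, and the ARN shown follows the common acs:ram style:

    "spark.adb.roleArn": "acs:ram::<Alibaba Cloud account ID>:role/<RAM role name>"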

Enable the built-in data source connectors

spark.adb.connectors (Required: No; Default value: N/A)

The names of the built-in connectors of AnalyticDB for MySQL Spark. Separate multiple names with commas (,). Valid values: oss, hudi, delta, adb, odps, and external_hive.

spark.hadoop.io.compression.codec.snappy.native (Required: No; Default value: false)

Specifies whether a Snappy file is in the standard Snappy format. By default, Hadoop recognizes the Snappy files that are edited in Hadoop. If you set this parameter to true, the standard Snappy library is used for decompression. If you set this parameter to false, the default Snappy library of Hadoop is used for decompression.
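
For example, a job that reads data from OSS and Hudi tables might enable both connectors in the key-value format. This is a minimal sketch that uses values from the list above:

    "spark.adb.connectors": "oss,hudi"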

Enable VPC access and data source access

spark.adb.eni.enabled (Required: No; Default value: false)

Specifies whether to enable an elastic network interface (ENI). If you use external tables to access other external data sources, you must enable an ENI. Valid values:

  • true

  • false

spark.adb.eni.vswitchId (Required: No; Default value: N/A)

The ID of the vSwitch that is associated with the ENI. If you connect to AnalyticDB for MySQL from an Elastic Compute Service (ECS) instance over a virtual private cloud (VPC), you must specify the vSwitch ID of the VPC.

Note: If you have enabled VPC access, you must set the spark.adb.eni.enabled parameter to true.

spark.adb.eni.securityGroupId (Required: No; Default value: N/A)

The ID of the security group that is associated with the ENI. If you connect to AnalyticDB for MySQL from an ECS instance over a VPC, you must specify a security group ID.

Note: If you have enabled VPC access, you must set the spark.adb.eni.enabled parameter to true.

spark.adb.eni.extraHosts (Required: No; Default value: N/A)

The mappings between IP addresses and hostnames. This parameter allows Spark to resolve the hostnames of data sources. If you want to access a self-managed Hive data source, you must specify this parameter.

Note:

  • Separate IP addresses and hostnames with spaces. Separate multiple groups of IP addresses and hostnames with commas (,). Example: "ip0 master0, ip1 master1".

  • If you have enabled data source access, you must set the spark.adb.eni.enabled parameter to true.

spark.adb.eni.adbHostAlias.enabled (Required: No; Default value: false)

Specifies whether to automatically write the domain name resolution information that AnalyticDB for MySQL requires to the mapping table of domain names and IP addresses. Valid values:

  • true

  • false

If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to true.
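
For example, a job that accesses a self-managed Hive data source over a VPC might combine these parameters in the key-value format. This is a minimal sketch; the vSwitch ID, security group ID, and host mappings are placeholders:

    "spark.adb.eni.enabled": "true",
    "spark.adb.eni.vswitchId": "vsw-xxxxxxxx",
    "spark.adb.eni.securityGroupId": "sg-xxxxxxxx",
    "spark.adb.eni.extraHosts": "ip0 master0, ip1 master1"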

Configure application retries

spark.adb.maxAttempts (Required: No; Default value: 1)

The maximum number of attempts that are allowed to run an application. The default value of 1 specifies that no retry attempts are allowed.

If you set this parameter to 3 for a Spark application, the application is run up to three times within a sliding window.

spark.adb.attemptFailuresValidityInterval (Required: No; Default value: Integer.MAX)

The duration of the sliding window within which the system attempts to run an application. Unit: seconds.

For example, if you set this parameter to 6000 for a Spark application, the system counts the number of attempts within the last 6,000 seconds after the application fails. If the number of attempts is less than the value of the spark.adb.maxAttempts parameter, the system retries the application.
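
For example, to allow an application to be retried up to three times within a 3,600-second sliding window, the two parameters might be set together in the key-value format. This is a minimal sketch with placeholder values:

    "spark.adb.maxAttempts": "3",
    "spark.adb.attemptFailuresValidityInterval": "3600"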

Specify a running environment for Spark jobs

The following table describes the configuration parameters that are required when you use the virtual environments technology to package a Python environment and submit Spark jobs.

spark.pyspark.python (Required: No; Default value: N/A)

The path of the Python interpreter on your on-premises device.
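
For example, the parameter might point to the interpreter of the packaged virtual environment in the key-value format. This is a minimal sketch; the path is a placeholder that depends on how your Python environment is packaged and uploaded:

    "spark.pyspark.python": "/path/to/your/python3"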

Configuration parameters that are not supported for AnalyticDB for MySQL

AnalyticDB for MySQL Spark does not support the following configuration parameters of Apache Spark. These configuration parameters do not take effect for AnalyticDB for MySQL Spark.

Useless options(these options will be ignored):
  --deploy-mode
  --master
  --packages, please use `--jars` instead
  --exclude-packages
  --proxy-user
  --repositories
  --keytab
  --principal
  --queue
  --total-executor-cores
  --driver-library-path
  --driver-class-path
  --supervise
  -S,--silent
  -i <filename>