AnalyticDB for MySQL Spark uses configuration parameters that extend or replace those of Apache Spark. This topic covers only the parameters that differ from standard Apache Spark.
Parameter format by development tool
The format for specifying parameters depends on which tool you use to submit Spark jobs.
| Development tool | Format | Example |
|---|---|---|
| SQL editor | set key=value; | set spark.sql.hive.metastore.version=adb; |
| Spark Jar editor | "key": "value" | "spark.sql.hive.metastore.version":"adb" |
| Notebook editor | "key": "value" | "spark.sql.hive.metastore.version":"adb" |
| spark-submit CLI | key=value | spark.sql.hive.metastore.version=adb |
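
The formats in the table can be sketched as a small helper that renders one key-value pair per tool. This is an illustrative sketch only: the function name `format_param` and the tool labels `sql`, `jar`, `notebook`, and `spark-submit` are assumptions made for this example and are not part of AnalyticDB for MySQL.

```python
import json


def format_param(key: str, value: str, tool: str) -> str:
    """Render one Spark parameter in the format the given tool expects.

    The tool labels are hypothetical names chosen for this sketch.
    """
    if tool == "sql":
        # SQL editor: set key=value;
        return f"set {key}={value};"
    if tool in ("jar", "notebook"):
        # Spark Jar editor and Notebook editor: "key": "value"
        return f"{json.dumps(key)}: {json.dumps(value)}"
    if tool == "spark-submit":
        # spark-submit CLI: key=value
        return f"{key}={value}"
    raise ValueError(f"unknown tool: {tool}")
```

For example, `format_param("spark.sql.hive.metastore.version", "adb", "sql")` yields the SQL editor form shown in the table.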
Specify driver and executor resources
| Parameter | Required | Default | Description | Corresponding Apache Spark parameter |
|---|---|---|---|---|
| spark.adb.acuPerApp | No | None | The number of AnalyticDB compute units (ACUs) for a single Spark job. Valid values: [2, maximum computing resources of the job resource group]. When set, the system automatically calculates driver specifications, executor specifications, and the number of executor nodes. See the priority rules below. | N/A |
| spark.driver.resourceSpec | Yes | medium | The resource specification for the Spark driver. Each type maps to specific CPU and memory allocations. See the resource specifications table below. Example: CONF spark.driver.resourceSpec = c.small; sets the driver to 1 core and 2 GB memory. | spark.driver.cores and spark.driver.memory |
| spark.executor.resourceSpec | Yes | medium | The resource specification for each Spark executor. Each type maps to specific CPU and memory allocations. See the resource specifications table below. Example: CONF spark.executor.resourceSpec = c.small; sets each executor to 1 core and 2 GB memory. | spark.executor.cores and spark.executor.memory |
| spark.executor.instances | No | Maximum computing resources of the job resource group / 5 | The number of Spark executors to start. | spark.executor.instances |
| spark.adb.executor.cpu-vcores-ratio | No | 1 | The ratio of virtual cores to physical CPU cores for each executor. Increase this value to improve CPU utilization when individual tasks use little CPU. For example, if the executor uses the medium specification (2 cores, 8 GB) and this parameter is set to 2, the executor controls concurrency based on 4 cores and schedules 4 concurrent tasks, which is equivalent to spark.executor.cores=4. | N/A |
| spark.adb.driver.cpu-vcores-ratio | No | 1 | The ratio of virtual cores to physical CPU cores for the driver. For example, if the driver uses the medium specification (2 cores, 8 GB) and this parameter is set to 2, the driver controls concurrency based on 4 cores, which is equivalent to spark.driver.cores=4. | N/A |
| spark.adb.driverDiskSize | No | None | The size of the additional disk attached to the Spark driver. The disk is mounted at /user_data_dir. Unit: GiB. Valid values: (0, 100]. Example: spark.adb.driverDiskSize=50Gi. | N/A |
| spark.adb.executorDiskSize | No | None | The size of the additional disk attached to each Spark executor. The disk is mounted at /shuffle_volume and is used for shuffle operations. Unit: GiB. Valid values: (0, 100]. Example: spark.adb.executorDiskSize=50Gi. | N/A |
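
The effect of the two cpu-vcores-ratio parameters can be sketched as a simple multiplication. This is a minimal illustration of the arithmetic described in the table, not the scheduler's actual implementation; the function name `effective_cores` is an assumption made for this example.

```python
def effective_cores(physical_cores: int, vcores_ratio: int = 1) -> int:
    """Return the core count used for concurrency control.

    physical_cores comes from the resource specification (for example,
    2 for medium); vcores_ratio is the cpu-vcores-ratio setting.
    """
    if physical_cores < 1 or vcores_ratio < 1:
        raise ValueError("cores and ratio must be positive")
    return physical_cores * vcores_ratio
```

With the medium specification (2 cores) and a ratio of 2, `effective_cores(2, 2)` returns 4, matching the example in the table.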
spark.adb.acuPerApp priority rules
When spark.adb.acuPerApp is combined with other resource parameters, the following rules apply:
- If spark.adb.acuPerApp and all other resource parameters (spark.driver.resourceSpec, spark.executor.resourceSpec, spark.executor.instances) are all set, spark.adb.acuPerApp is invalid and the explicitly set values take effect.
- If only spark.adb.acuPerApp is set, it is valid and all other resource parameters are auto-calculated.
- In any other combination, spark.adb.acuPerApp is valid and auto-calculates only the resource parameters that are not explicitly set.
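
The three rules above can be sketched as a function that reports which resource parameters would be auto-calculated for a given configuration. This is only an illustration of the rules as stated; the function name `auto_calculated_params` is an assumption, and the sketch does not reproduce how the system derives the actual values.

```python
RESOURCE_PARAMS = (
    "spark.driver.resourceSpec",
    "spark.executor.resourceSpec",
    "spark.executor.instances",
)


def auto_calculated_params(conf: dict) -> list:
    """List the resource parameters that spark.adb.acuPerApp would fill in.

    An empty list means spark.adb.acuPerApp is absent or has no effect
    because every resource parameter is explicitly set.
    """
    if "spark.adb.acuPerApp" not in conf:
        return []
    # Only parameters that are not explicitly set are auto-calculated.
    return [name for name in RESOURCE_PARAMS if name not in conf]
```

For a configuration that sets only spark.adb.acuPerApp and spark.executor.resourceSpec, the function returns the driver specification and the executor count, which corresponds to the third rule.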
Spark resource specifications
The following ACU calculations apply when using on-demand elastic resources in a job resource group:
- 1:2 CPU-to-memory ratio: ACUs = CPU cores × 0.8
- 1:4 CPU-to-memory ratio: ACUs = CPU cores × 1
- 1:8 CPU-to-memory ratio: ACUs = CPU cores × 1.5

For pricing details, see Pricing for Data Lakehouse Edition.
| Type | CPU cores | Memory (GB) | Disk storage (GB) | Used ACUs |
|---|---|---|---|---|
| c.small | 1 | 2 | 20 | 0.8 |
| small | 1 | 4 | 20 | 1 |
| m.small | 1 | 8 | 20 | 1.5 |
| c.medium | 2 | 4 | 20 | 1.6 |
| medium | 2 | 8 | 20 | 2 |
| m.medium | 2 | 16 | 20 | 3 |
| c.large | 4 | 8 | 20 | 3.2 |
| large | 4 | 16 | 20 | 4 |
| m.large | 4 | 32 | 20 | 6 |
| c.xlarge | 8 | 16 | 20 | 6.4 |
| xlarge | 8 | 32 | 20 | 8 |
| m.xlarge | 8 | 64 | 20 | 12 |
| c.2xlarge | 16 | 32 | 20 | 12.8 |
| 2xlarge | 16 | 64 | 20 | 16 |
| m.2xlarge | 16 | 128 | 20 | 24 |
| m.4xlarge | 32 | 256 | 20 | 48 |
| m.8xlarge | 64 | 512 | 20 | 96 |
The system reserves approximately 1% of disk storage. The actual available disk space may be less than 20 GB.
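
The ACU formulas for on-demand elastic resources can be sketched as follows. The function names `acus` and `total_acus` are assumptions made for this example; the multipliers are taken from the three ratios listed above.

```python
# ACUs per CPU core, keyed by GB of memory per core (the 1:2, 1:4, 1:8 ratios).
ACU_PER_CORE = {2: 0.8, 4: 1.0, 8: 1.5}


def acus(cpu_cores: int, memory_gb: int) -> float:
    """Return the ACUs used by one node with the given cores and memory."""
    if memory_gb % cpu_cores != 0 or memory_gb // cpu_cores not in ACU_PER_CORE:
        raise ValueError("unsupported CPU-to-memory ratio")
    return cpu_cores * ACU_PER_CORE[memory_gb // cpu_cores]


def total_acus(driver: tuple, executor: tuple, instances: int) -> float:
    """Return the ACUs for one driver plus `instances` identical executors.

    driver and executor are (cpu_cores, memory_gb) pairs.
    """
    return acus(*driver) + instances * acus(*executor)
```

For instance, `acus(2, 8)` returns 2.0 for the medium type, and a small driver with 32 medium executors gives `total_acus((1, 4), (2, 8), 32)`, which is 65.0.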
Example
The following configuration allocates 32 executors with medium specification (2 cores, 8 GB each) and a driver with small specification (1 core, 4 GB), totaling 65 ACUs.
{
  "spark.driver.resourceSpec": "small",
  "spark.executor.resourceSpec": "medium",
  "spark.executor.instances": "32",
  "spark.adb.executorDiskSize": "100Gi"
}
Set job priority
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.adb.priority | No | NORMAL | The priority of a Spark job. When resources are insufficient, higher-priority jobs in the queue run first. Valid values: HIGH, NORMAL, LOW, LOWEST. |
For long-running streaming Spark jobs, set this parameter to HIGH.
Access metadata
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.sql.catalogImplementation | No | hive (Spark SQL jobs); in-memory (non-Spark SQL jobs) | The metadata source. hive: uses the built-in Hive Metastore of Apache Spark. in-memory: uses the temporary directory. |
| spark.sql.hive.metastore.version | No | adb (Spark SQL jobs); <hive_version> (non-Spark SQL jobs) | The metastore version. adb: connects to AnalyticDB for MySQL metadata. <hive_version>: specifies a Hive Metastore version. For supported Hive versions and self-managed Hive Metastore configuration, see Spark Configuration. |
Examples
Access AnalyticDB for MySQL metadata:
spark.sql.hive.metastore.version=adb;
Access the built-in Hive Metastore of Apache Spark:
spark.sql.catalogImplementation=hive;
spark.sql.hive.metastore.version=2.1.3;
Access metadata in the temporary directory:
spark.sql.catalogImplementation=in-memory;
Configure the Spark UI
All the following parameters are optional.
| Parameter | Default | Description |
|---|---|---|
| spark.app.log.rootPath | oss://<aliyun-oa-adb-spark-{Account ID}-oss-{Zone ID}>/<Cluster ID>/<Spark app ID> | The OSS directory for Spark job logs and Linux OS output. The folder named after the Spark application ID contains: the event log file (Spark app ID-000X) for Spark UI rendering, driver and numbered node log folders, and stdout/stderr folders for OS output. |
| spark.adb.event.logUploadDuration | false | Specifies whether to record the duration of each event log upload. |
| spark.adb.buffer.maxNumEvents | 1000 | Maximum number of events cached by the driver. |
| spark.adb.payload.maxNumEvents | 10000 | Maximum number of events uploaded to Object Storage Service (OSS) per batch. |
| spark.adb.event.pollingIntervalSecs | 0.5 | Interval between event uploads to OSS, in seconds. |
| spark.adb.event.maxPollingIntervalSecs | 60 | Maximum retry interval after a failed upload to OSS, in seconds. The retry interval stays within the range of spark.adb.event.pollingIntervalSecs to spark.adb.event.maxPollingIntervalSecs. |
| spark.adb.event.maxWaitOnEndSecs | 10 | Maximum wait time for an upload to complete, in seconds. If the upload does not complete within this time, it is retried. |
| spark.adb.event.waitForPendingPayloadsSleepIntervalSecs | 1 | Wait time before retrying an upload that exceeded spark.adb.event.maxWaitOnEndSecs, in seconds. |
| spark.adb.eventLog.rolling.maxFileSize | 209715200 | Maximum size of each event log file in OSS, in bytes. Event logs are split into multiple files (for example, Eventlog.0, Eventlog.1). |
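
The table states only that the retry interval stays between spark.adb.event.pollingIntervalSecs and spark.adb.event.maxPollingIntervalSecs. The sketch below shows one way such a bounded interval could behave. The doubling step is an assumption made purely for illustration and is not documented behavior; the function name `next_retry_interval` is also hypothetical.

```python
def next_retry_interval(
    current_secs: float,
    polling_interval_secs: float = 0.5,
    max_polling_interval_secs: float = 60,
) -> float:
    """Return the next retry interval after a failed upload.

    Assumption: the interval doubles after each failure. Only the clamping
    to [polling_interval_secs, max_polling_interval_secs] reflects the
    documented range.
    """
    doubled = current_secs * 2
    return min(max(doubled, polling_interval_secs), max_polling_interval_secs)
```

Whatever the growth rule actually is, the clamp in the last line is the part that corresponds to the documented bounds.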
Grant permissions to RAM users
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.adb.roleArn | No | N/A | The Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role to attach to the RAM user, granting permission to submit Spark applications. Required only when submitting Spark applications as a RAM user. Not required when submitting with an Alibaba Cloud account or when permissions are already granted in the RAM console. For more information, see RAM role overview and Account authorization. |
Enable built-in data source connectors
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.adb.connectors | No | N/A | The built-in AnalyticDB for MySQL Spark connectors to enable. Separate multiple values with commas. Valid values: oss, hudi, delta, adb, odps, external_hive, jindo, default. |
| spark.hadoop.io.compression.codec.snappy.native | No | false | Specifies whether to treat Snappy files as standard Snappy format. false: uses the Hadoop Snappy library. true: uses the standard Snappy library for decompression. |
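
A value for spark.adb.connectors can be assembled and checked against the valid values listed in the table before a job is submitted. The helper name `connectors_value` is an assumption made for this sketch; the check runs on the client and is not something the service provides.

```python
VALID_CONNECTORS = {
    "oss", "hudi", "delta", "adb", "odps", "external_hive", "jindo", "default",
}


def connectors_value(names: list) -> str:
    """Join connector names with commas after validating each one."""
    unknown = [name for name in names if name not in VALID_CONNECTORS]
    if unknown:
        raise ValueError(f"unsupported connectors: {unknown}")
    return ",".join(names)
```

For example, `connectors_value(["oss", "hudi"])` produces `oss,hudi`, which can be used as the parameter value.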
Enable VPC and data source access
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.adb.eni.enabled | No | false | Specifies whether to enable Elastic Network Interface (ENI). Set to true when using external tables to access external data sources. |
| spark.adb.eni.vswitchId | No | N/A | The vSwitch ID associated with the ENI. Required when connecting to AnalyticDB for MySQL from an Elastic Compute Service (ECS) instance over a virtual private cloud (VPC). Requires spark.adb.eni.enabled=true. |
| spark.adb.eni.securityGroupId | No | N/A | The security group ID associated with the ENI. Required when connecting to AnalyticDB for MySQL from an ECS instance over a VPC. Requires spark.adb.eni.enabled=true. |
| spark.adb.eni.extraHosts | No | N/A | IP-to-hostname mappings that allow Spark to resolve data source hostnames. Required for accessing a self-managed Hive data source. Format: ip0 master0,ip1 master1. Requires spark.adb.eni.enabled=true. |
| spark.adb.eni.adbHostAlias.enabled | No | false | Specifies whether to automatically write AnalyticDB for MySQL domain name resolution entries to the hostname-to-IP mapping table. Set to true when reading from or writing to E-MapReduce (EMR) Hive via ENI. |
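
The spark.adb.eni.extraHosts format (ip0 master0,ip1 master1) can be produced from a mapping and parsed back as sketched below. The function names and the sample addresses in the usage note are assumptions made for illustration.

```python
def build_extra_hosts(ip_to_host: dict) -> str:
    """Render {ip: hostname} as 'ip0 host0,ip1 host1'."""
    return ",".join(f"{ip} {host}" for ip, host in ip_to_host.items())


def parse_extra_hosts(value: str) -> dict:
    """Parse 'ip0 host0,ip1 host1' back into {ip: hostname}."""
    mapping = {}
    for entry in value.split(","):
        # Each entry is an IP address and a hostname separated by whitespace.
        ip, host = entry.strip().split()
        mapping[ip] = host
    return mapping
```

For example, a mapping of the hypothetical addresses 192.168.0.1 and 192.168.0.2 to master0 and master1 renders as `192.168.0.1 master0,192.168.0.2 master1`.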
Configure application retries
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.adb.maxAttempts | No | 1 | Maximum number of attempts to run an application. The default value of 1 means no retries. For example, setting this to 3 allows the system to attempt the application up to three times within the sliding window. |
| spark.adb.attemptFailuresValidityInterval | No | Integer.MAX | Duration of the sliding window for retry counting, in seconds. For example, setting this to 6000 causes the system to count failed attempts within the last 6,000 seconds. If the count is below spark.adb.maxAttempts, the system retries. |
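
The interaction between the two retry parameters can be sketched as a sliding-window check. This is a simplified model of the behavior described in the table, not the service's implementation; the function name `should_retry` and the use of plain second timestamps are assumptions.

```python
def should_retry(
    failure_times: list,
    now: float,
    max_attempts: int = 1,
    validity_interval_secs: float = float("inf"),
) -> bool:
    """Decide whether another attempt is allowed after a failure.

    failure_times holds the time (in seconds) of every failed attempt,
    including the one that just failed. Only failures inside the sliding
    window count toward max_attempts.
    """
    recent = [t for t in failure_times if now - t <= validity_interval_secs]
    return len(recent) < max_attempts
```

With the default `max_attempts=1`, the first failure already reaches the limit, so no retry happens. With a window of 6,000 seconds, a failure older than the window no longer counts, which allows a long-running application to be retried again later.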
Specify a Python runtime environment
Use spark.pyspark.python with virtual environment packaging to submit PySpark jobs.
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.pyspark.python | No | N/A | Path to the Python interpreter on the local device. |
Specify the Spark version
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.adb.version | No | 3.2 | The Spark version. Valid values: 2.4, 3.2, 3.3, 3.5, 4.0. |
Enable the vectorized execution engine
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.adb.native.enabled | No | false | Specifies whether to enable the high-performance vectorized execution engine built into AnalyticDB for MySQL Spark. The engine is fully compatible with open source Spark and requires no code changes. |
Enable lake storage acceleration
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.adb.lakecache.enabled | No | false | Specifies whether to enable LakeCache for lake storage acceleration. |
Read and write C-Store data with Spark SQL
When you read and write C-Store tables only through Spark SQL, the following configuration parameters are supported:
| Parameter | Required | Default | Description |
|---|---|---|---|
| spark.adb.write.batchSize | No | 600 | The number of records to write in a single batch. Valid values: positive integers. |
| spark.adb.write.arrow.maxMemoryBufferSize | No | 1024 | The maximum memory buffer size for writing. Valid values: positive integers. Unit: MB. |
| spark.adb.write.arrow.maxRecordSizePerBatch | No | 500 | The maximum number of records to write in a single batch. Valid values: positive integers. |
| spark.adb.createSnapshot | No | false | Specifies whether to create a snapshot after data is written using the INSERT OVERWRITE statement. |
| spark.adb.readDataVersion | No | LATEST_BUILD | The version of data to read. |
Unsupported parameters
The following Apache Spark parameters are not supported by AnalyticDB for MySQL Spark and are ignored if specified. AnalyticDB for MySQL manages these settings automatically in its hosted environment. Where an alternative exists, it is noted.
--deploy-mode
--master
--packages # Use --jars instead
--exclude-packages
--proxy-user
--repositories
--keytab
--principal
--queue
--total-executor-cores
--driver-library-path
--driver-class-path
--supervise
-S, --silent
-i <filename>
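
Because these options are ignored, it can help to check a spark-submit argument list for them before submitting. The sketch below is a client-side convenience only; the function name `unsupported_flags` is an assumption, and the set simply mirrors the list above.

```python
UNSUPPORTED_FLAGS = {
    "--deploy-mode", "--master", "--packages", "--exclude-packages",
    "--proxy-user", "--repositories", "--keytab", "--principal", "--queue",
    "--total-executor-cores", "--driver-library-path", "--driver-class-path",
    "--supervise", "-S", "--silent", "-i",
}


def unsupported_flags(argv: list) -> list:
    """Return the unsupported options found in a spark-submit argument list.

    Handles both '--flag value' and '--flag=value' forms.
    """
    found = []
    for arg in argv:
        flag = arg.split("=", 1)[0]
        if flag in UNSUPPORTED_FLAGS:
            found.append(flag)
    return found
```

For example, an argument list containing `--master` and `--packages=...` reports both options, and `--packages` can then be replaced with `--jars` as noted above.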