
MaxCompute: General configurations

Last Updated: Mar 16, 2026

This topic describes the general parameter configurations for Spark clients across different versions.

MaxCompute account parameter configurations

Parameter

Description

spark.hadoop.odps.project.name

The MaxCompute project name.

If you submit jobs through DataWorks, use the default value. No configuration is required.

spark.hadoop.odps.access.id

The AccessKey ID that has access permissions to the target MaxCompute project. You can obtain the AccessKey ID on the AccessKey Management page.

If you submit jobs through DataWorks, use the default value. No configuration is required.

spark.hadoop.odps.access.key

The AccessKey secret corresponding to the AccessKey ID.

If you submit jobs through DataWorks, use the default value. No configuration is required.

spark.hadoop.odps.access.security.token

The STS token for the MaxCompute project.

If you submit jobs through DataWorks, use the default value. No configuration is required.

spark.hadoop.odps.end.point

  • The endpoint of the region where your MaxCompute project resides. Whether to use the public endpoint or the VPC endpoint depends on the network environment of your Spark client. If you submit jobs through DataWorks, use the default value. No configuration is required.

  • For example, the VPC endpoint for the China (Hangzhou) region is https://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api.

spark.hadoop.odps.runtime.end.point

The Cloud Product Interconnection Endpoint for the region where MaxCompute resides.

For example, the service interconnection endpoint for the China (Hangzhou) region is https://service.cn-hangzhou-intranet.maxcompute.aliyun-inc.com/api.
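
The account and endpoint parameters above are typically set together in spark-defaults.conf. A minimal sketch, assuming a project named my_project in the China (Hangzhou) region; the project name and AccessKey values are placeholders to substitute with your own:

```properties
# Account and endpoint settings for a MaxCompute Spark client.
# my_project and the masked credentials are placeholders.
spark.hadoop.odps.project.name = my_project
spark.hadoop.odps.access.id = LTAI****************
spark.hadoop.odps.access.key = ****************************
# VPC endpoint for China (Hangzhou); use the public endpoint instead
# if your Spark client runs outside the VPC.
spark.hadoop.odps.end.point = https://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api
spark.hadoop.odps.runtime.end.point = https://service.cn-hangzhou-intranet.maxcompute.aliyun-inc.com/api
```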

MaxCompute Spark job submission, version, and log configurations

Parameter

Description

spark.hadoop.odps.kube.mode

  • Default value: false.

  • Specifies whether to submit jobs in Kubernetes (k8s) mode. MaxCompute Spark has fully upgraded to kube mode. The legacy cupid mode will be retired.

  • New users must enable this configuration.

spark.hadoop.odps.cupid.data.proxy.enable

  • Default value: false.

  • Specifies whether to use MaxStorage for data read and write operations.

  • If disabled, kube mode may not work. Enable this option when using kube mode.

spark.hadoop.odps.cupid.fuxi.shuffle.enable

  • Default value: false.

  • Specifies whether to use the Fuxi Shuffle Service during shuffle to prevent local disk overflow.

  • Enable this option for large-scale jobs or when disk space is full.

spark.hadoop.odps.spark.version

  • The Spark version used to submit Spark jobs. For supported versions, see the Spark release list. Example: spark-3.1.1-odps0.35.0.

  • When submitting jobs through the Spark client, also set spark.hadoop.odps.spark.libs.public.enable to true.

spark.hadoop.odps.spark.libs.public.enable

  • Default value: false. When set to true, Spark libraries are pulled from the server instead of uploaded, speeding up startup.

  • This setting takes effect only when spark.hadoop.odps.spark.version is also configured.

spark.eventLog.enabled

  • Default value: false. Enables event logging to view Spark UI history.

  • Enable this option to troubleshoot issues. In kube mode, also configure spark.eventLog.dir.

spark.eventLog.dir

  • Default value: /tmp/spark-events/ or /workdir/eventlog, depending on the version.

  • Specifies where event logs are stored. An incorrect value prevents event log uploads and history viewing.

  • In kube mode, manually set this to /workdir/eventlog/.
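
The parameters in this section are usually enabled together when submitting in kube mode. A sketch, reusing the spark-3.1.1-odps0.35.0 release named in the example above:

```properties
# Submit in kube mode with MaxStorage reads/writes enabled.
spark.hadoop.odps.kube.mode = true
spark.hadoop.odps.cupid.data.proxy.enable = true
# Pull Spark libraries from the server instead of uploading them.
spark.hadoop.odps.spark.version = spark-3.1.1-odps0.35.0
spark.hadoop.odps.spark.libs.public.enable = true
# Event logging: in kube mode the directory must be set manually.
spark.eventLog.enabled = true
spark.eventLog.dir = /workdir/eventlog/
```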

Spark resource allocation configurations

Parameter

Description

spark.executor.instances

Default value: 1. The total number of Executor processes launched by the Spark application in the cluster.

spark.executor.cores

Default value: 1. The number of CPU cores available to each Executor process.

spark.executor.memory

Default value: 2g. The total memory per Executor process, including heap and off-heap memory.

spark.driver.cores

Default value: 1. The number of CPU cores used by the Driver process.

spark.driver.memory

Default value: 2g. The total memory for the Driver process.

spark.executor.memoryOverhead

  • Default value: see community documentation. Increase this value if off-heap memory usage is high to avoid being killed due to exceeding memory limits.

  • The total memory per Executor is spark.executor.memory + spark.executor.memoryOverhead.

spark.driver.memoryOverhead

  • Default value: see community documentation. Increase this value if off-heap memory usage is high to avoid being killed due to exceeding memory limits.

  • The total memory for the Driver is spark.driver.memory + spark.driver.memoryOverhead.

spark.hadoop.odps.cupid.disk.driver.device_size

  • Default value: 20g. The size of the local disk. Increase this value if you encounter the "No space left on device" error. The maximum supported size is 100g.

  • You must configure this in the spark-defaults.conf file or in DataWorks configuration items. Do not configure it in code.
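
To illustrate how the memory parameters combine, a sketch requesting four 2-core Executors; the per-Executor total is spark.executor.memory plus spark.executor.memoryOverhead, and the values here are illustrative, not recommendations:

```properties
# 4 executors x 2 cores; each executor totals 4g heap + 1g overhead = 5g.
spark.executor.instances = 4
spark.executor.cores = 2
spark.executor.memory = 4g
spark.executor.memoryOverhead = 1g
spark.driver.cores = 2
spark.driver.memory = 4g
# Larger local disk; set in spark-defaults.conf or DataWorks, never in code.
spark.hadoop.odps.cupid.disk.driver.device_size = 50g
```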

MaxCompute read and write configurations

Important

The following configurations, which start with spark.sql.catalog.odps, apply only to Spark 3.x.

Parameter

Description

spark.sql.catalog.odps.tableReadProvider

  • Default value: v1.

  • Set to tunnel when using Local mode.

spark.sql.catalog.odps.tableWriteProvider

  • Default value: v1.

  • Set to tunnel when using Local mode.
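
For example, when running in Local mode, the two provider entries above are switched to tunnel together:

```properties
# Local mode: route reads and writes through the Tunnel service.
spark.sql.catalog.odps.tableReadProvider = tunnel
spark.sql.catalog.odps.tableWriteProvider = tunnel
```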

spark.sql.catalog.odps.metaCacheSize

  • Default value: 100.

  • The maximum number of metadata cache entries, including project, schema, and table metadata. This accelerates read and write operations. Do not change this value.

spark.sql.catalog.odps.metaCacheExpireSeconds

  • Default value: 30.

  • Unit: seconds. The time-to-live (TTL) for metadata caching. This accelerates read and write operations. Do not increase this value unless tables are frequently accessed and rarely changed. Otherwise, dirty data may be read.

spark.sql.catalog.odps.viewCacheExpireSeconds

  • Default value: 3600.

  • Unit: seconds. The TTL for view metadata caching. Do not modify this value.

spark.sql.catalog.odps.enableVectorizedReader

  • Default value: true.

  • Enables vectorized reads. Do not modify this value.

spark.sql.catalog.odps.enableVectorizedWriter

  • Default value: true.

  • Enables vectorized writes. Do not modify this value.

spark.sql.catalog.odps.columnarReaderBatchSize

  • Default value: 4096.

  • The number of rows per batch when reading data.

spark.sql.catalog.odps.columnarWriterBatchSize

  • Default value: 4096.

  • The number of rows per batch when writing data.

spark.sql.catalog.odps.splitParallelism

  • Default value: -1.

  • Specifies the parallelism for splitting underlying storage. Takes effect only when greater than zero.

  • Do not modify this value unless necessary. By default, splitByRowOffset or splitByByteSize is used, which calculates a more suitable split size.

spark.sql.catalog.odps.splitSizeInMB

  • Default value: 256.

  • Unit: MB. The size of each split. Decrease this value to increase read concurrency. Increase it to reduce concurrency.

spark.sql.catalog.odps.enableExternalProject

  • Default value: false.

  • Enables support for external projects.

spark.sql.catalog.odps.enableExternalTable

  • Default value: false.

  • Enables support for external tables.

spark.sql.catalog.odps.tableCompressionCodec

  • Default value: none.

  • The compression algorithm for tables. Default is no compression. Supported algorithms are lz4_frame and zstd.

spark.sql.catalog.odps.enableNamespaceSchema

  • Default value: false.

  • Specifies whether to enable the MaxCompute schema-level syntax.

spark.sql.catalog.odps.defaultSchema

  • Default value: default.

  • The default schema name. Do not change this value.

spark.sql.catalog.odps.writerChunkSize

  • Default value: 4194304.

  • The chunk size for writes, in bytes. Default is 4 MB.

spark.sql.catalog.odps.writerMaxRetires

  • Default value: 10.

  • The number of retries after a write failure.

spark.sql.catalog.odps.writerRetrySleepIntervalMs

  • Default value: 10000.

  • The interval between retries after a write failure, in milliseconds.

spark.sql.catalog.odps.writerBlocks

  • Default value: 20000.

  • The maximum number of blocks for writes.

spark.sql.catalog.odps.splitSessionParallelismEnable

  • Default value: true.

  • Enables parallel data reads at the partition level.

spark.sql.catalog.odps.splitSessionParallelism

  • Default value: 1.

  • The number of parallel threads for reading data at the partition level.

  • This parameter rarely becomes a performance bottleneck. Adjust it only if the number of partitions after pruning remains very high.

spark.sql.catalog.odps.splitMaxFileNum

  • Default value: 0.

  • The maximum number of files per split. A value of 0 means no limit.

  • If too many small files cause long split read times, increase this value to create more splits and improve read speed with more readers.

spark.sql.catalog.odps.splitMaxWaitTime

  • Default value: 15.

  • Unit: minutes. The maximum wait time for a split. Increase this value if split processing takes too long.

spark.sql.catalog.odps.enableFilterPushDown

  • Default value: false.

  • Enables predicate pushdown at the Spark layer.

spark.sql.catalog.odps.enableDeltaInsertDeduplicate

  • Default value: false.

  • Enables deduplication within partitions. Deduplication occurs by default for transactional tables with overwrite writes.

spark.sql.catalog.odps.maxFieldSizeInMB

  • Default value: -1.

  • Unit: MB. The maximum size for VARCHAR/CHAR/STRING/BINARY fields. A value of -1 means 8 MB.

  • The effective upper limit is the project-level parameter odps.sql.cfile2.field.maxsize.

spark.sql.catalog.odps.splitReaderNum

  • This parameter applies only to Spark versions 3.4 and later. It has no effect on earlier versions.

  • Default value: 1.

  • The number of concurrent readers. When CPU is not a bottleneck, increase this value to speed up reads, up to a maximum of 4. Use the concurrency level just before CPU becomes the bottleneck as the upper limit; higher values may degrade performance.

  • This setting takes effect regardless of whether asyncRead or batchReused is enabled.

spark.sql.catalog.odps.enableBatchReused

  • This parameter applies only to Spark versions 3.4 and later. It has no effect on earlier versions.

  • Default value: true.

  • Enables batch reuse to reduce memory usage during reads. If set to false, enable asynchronous reads to trade memory for speed. Monitor memory usage carefully.

  • If asyncRead is enabled and readerNum is 1, this setting is ignored and batches are not reused, in order to emulate bufferRead.

spark.sql.catalog.odps.enableAsyncRead

  • This parameter applies only to Spark versions 3.4 and later. It has no effect on earlier versions.

  • Default value: false.

  • Enables asynchronous reads to trade memory for speed. Monitor memory usage carefully.

  • Use this with readerNum and batchReused to maximize read efficiency. For details, see the asyncQueueSize configuration.

spark.sql.catalog.odps.asyncQueueSize

  • This parameter applies only to Spark versions 3.4 and later. It has no effect on earlier versions.

  • Default value: 8.

  • The buffer size for asynchronous reads. Monitor memory usage carefully.

  • If asynchronous reads are enabled and batchReused=false, this value equals the actual queue size. Ensure queueSize >= readerNum, or peak efficiency may not be achieved.

  • If you enable asynchronous reads and set batchReused=true, the actual queue size equals readerNum. In this case, to reuse batches, do not set readerNum=1.
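
The reader parameters above interact, so they are best tuned together. A hedged sketch for a read-heavy Spark 3.4+ job, assuming CPU headroom is available and memory usage is being monitored; the values are illustrative:

```properties
# Up to 4 concurrent readers per split (Spark 3.4+ only).
spark.sql.catalog.odps.splitReaderNum = 4
# Trade memory for speed with asynchronous reads.
spark.sql.catalog.odps.enableAsyncRead = true
# Keep queueSize >= readerNum for peak efficiency.
spark.sql.catalog.odps.asyncQueueSize = 8
# Disable batch reuse so asyncQueueSize is the actual queue size.
spark.sql.catalog.odps.enableBatchReused = false
```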

spark.sql.catalog.odps.enhanceWriteCheck

  • This parameter applies only to Spark versions 3.4 and later. It has no effect on earlier versions.

  • Default value: false.

  • Enables correctness checks during writes.

spark.sql.catalog.odps.dynamicPartitionLimit

  • This parameter applies only to Spark versions 3.4 and later. It has no effect on earlier versions.

  • Default value: 512.

  • The maximum number of dynamic partitions allowed during writes.

spark.sql.catalog.odps.streamingWriteLimit

  • This parameter applies only to Spark versions 3.4 and later. It has no effect on earlier versions.

  • Default value: 60.

  • Unit: seconds. The minimum interval between creating writeSession objects during streaming writes. The minimum allowed value is 30.

MaxCompute data interoperability configurations

spark.hadoop.odps.cupid.resources

You must configure this parameter in the spark-defaults.conf file or as a DataWorks configuration item. Do not configure this parameter in your code.

  • Description:

    Specifies the MaxCompute resources required for a job to run. The format is <projectname>.<resourcename>. To specify multiple resources, separate them with commas.

    The specified resources are downloaded to the current working directory (/workdir) of the Driver and Executors. After the download is complete, the default filename is <projectname>.<resourcename>. Compressed resources are automatically extracted. The name of the top-level directory matches the name of the original archive. For example, if a resource is named examples.tar.gz and is not renamed, its contents are extracted to the /workdir/examples.tar.gz/sub/... path. If you rename the resource to examples, its contents are extracted to the /workdir/examples/sub/... path. The exact path depends on the name of the archive and its internal directory structure.

  • Example: spark.hadoop.odps.cupid.resources = public.python-python-2.7-ucs4.zip,public.myjar.jar.

  • Rename resources: To rename a resource during configuration, use the format <projectname>.<resourcename>:<newresourcename>.

  • Example of renaming: spark.hadoop.odps.cupid.resources = public.myjar.jar:myjar.jar.
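
The two examples above can be combined into a single spark-defaults.conf entry. A sketch, assuming the public.python-python-2.7-ucs4.zip and public.myjar.jar resources exist in your environment:

```properties
# Download a Python runtime archive and a JAR; the JAR is renamed to
# myjar.jar, so it lands at /workdir/myjar.jar on the Driver and Executors.
spark.hadoop.odps.cupid.resources = public.python-python-2.7-ucs4.zip,public.myjar.jar:myjar.jar
```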

Other MaxCompute configurations

Parameter

Description

spark.hadoop.odps.cupid.eni.enable & spark.hadoop.odps.cupid.eni.info

Configure VPC settings. For details, see Accessing Alibaba Cloud VPC.

spark.hadoop.odps.cupid.trusted.services.access.list

No default value. If your Spark cluster cannot access Alibaba Cloud service interconnection sites over the network, configure this parameter. See Accessing Alibaba Cloud OSS.

spark.hadoop.odps.cupid.smartnat.enable

  • Default value: false.

  • Specifies whether to allow jobs to access the Internet.

spark.hadoop.odps.cupid.internet.access.list

  • Default value: none.

  • After enabling Internet access, configure a whitelist to allow access. See Accessing the Internet.

spark.hadoop.odps.spark.alinux3.enabled

  • Default value: false.

  • In cluster mode, enables the alinux3 base image and Python 3.11.

spark.hadoop.odps.native.engine.enable

  • Default value: false.

  • In cluster mode, uses the Native Engine (Gluten) to accelerate computation. The Native Engine uses the alinux3 base image by default.

spark.hadoop.odps.spark.metrics.enable

  • Default value: false.

  • Specifies whether to collect metrics inside Spark. When enabled, the reported metrics are more accurate.
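
For instance, Internet access combines two of the parameters above. A sketch with a hypothetical whitelist entry; aliyundoc.com:443 is a placeholder, and the exact whitelist format is described in Accessing the Internet:

```properties
# Allow outbound Internet access to a whitelisted domain:port.
spark.hadoop.odps.cupid.smartnat.enable = true
# Placeholder entry; replace with the sites your job must reach.
spark.hadoop.odps.cupid.internet.access.list = aliyundoc.com:443
```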