This page answers common questions about developing and running Spark jobs on MaxCompute, organized by topic.
Develop with Spark
How do I perform a self-check on my project?
Check the following items before submitting a job:
`pom.xml` dependencies — The scope of all `spark-xxxx_${scala.binary.version}` dependencies must be `provided`:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
```
`spark.master` in the main class — Do not hard-code `local[N]` when submitting in yarn-cluster mode:

```scala
val spark = SparkSession
  .builder()
  .appName("SparkPi")
  // Remove .config("spark.master", "local[4]") before submitting in yarn-cluster mode.
  .getOrCreate()
```
Scala main class definition — The entry point must be an object, not a class:

```scala
object SparkPi { // Must be an object; a class cannot load the main function.
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkPi")
      .getOrCreate()
    // ...
  }
}
```
Hard-coded configurations — Configurations set directly in code may not take effect in yarn-cluster mode. When submitting jobs in yarn-cluster mode, add all configuration items to `spark-defaults.conf` instead:

```scala
// Avoid hard-coding MaxCompute configurations like this in cluster mode.
val spark = SparkSession
  .builder()
  .appName("SparkPi")
  .config("key1", "value1")
  .config("key2", "value2")
  .getOrCreate()
```
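Moved into `spark-defaults.conf`, the same placeholder settings from the snippet above would be listed one per line:

```
## spark-defaults.conf — placeholder keys taken from the code example above.
key1=value1
key2=value2
```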
What are the steps to run an ODPS Spark node on DataWorks?
1. Write and package your Spark code locally. The Python environment must be Python 2.7.
2. Upload the resource package to DataWorks. See Create and use MaxCompute resources.
3. Create an ODPS Spark node on DataWorks. See Create an ODPS Spark node.
4. Write the node code and run it. View the results in the DataWorks console.
How do I debug Spark on MaxCompute locally?
Use IntelliJ IDEA to debug locally. See Set up a Linux development environment.
How do I use Spark to access services in a VPC?
Spark on MaxCompute reaches services in a VPC over an elastic network interface (ENI) connection. See How do I access a VPC? later on this page for the setup steps.
How do I reference a JAR package as a resource?
Use the spark.hadoop.odps.cupid.resources parameter. Resources can be shared across multiple projects — set the appropriate permissions to ensure data security.
```
## Add to spark-defaults.conf or DataWorks configuration items.
spark.hadoop.odps.cupid.resources = projectname.xx0.jar,projectname.xx1.jar
```
How do I pass parameters using Spark?
See the Spark on DataWorks guide on GitHub.
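The GitHub guide covers the DataWorks specifics. As a general illustration, arguments placed after the job file in a `spark-submit` command arrive in `sys.argv` of a PySpark job; the argument names below are hypothetical:

```python
import sys

def parse_args(argv):
    # Hypothetical convention: positional arguments after the job file, e.g.
    #   spark-submit main.py 20240101 result_table
    if len(argv) < 3:
        raise ValueError("usage: main.py <bizdate> <output_table>")
    return {"bizdate": argv[1], "output_table": argv[2]}

if __name__ == "__main__":
    args = parse_args(sys.argv)
    print(args)  # use args["bizdate"] etc. when building the Spark job
```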
How do I stream DataHub data into MaxCompute using Spark?
For sample code, see the DataHub streaming example on GitHub.
How do I migrate open source Spark code to Spark on MaxCompute?
Choose based on your job's data access requirements:
- No MaxCompute tables or OSS access needed — Run your existing JAR package directly. Set Spark and Hadoop dependencies to `provided`. See Set up a Linux development environment.
- Needs access to MaxCompute tables — Add the required dependencies and repackage. See Set up a Linux development environment.
- Needs access to OSS — Get the required OSS packages, establish a network connection, then recompile and repackage. See Set up a Linux development environment.
How do I use Spark to process MaxCompute table data?
Spark on MaxCompute supports three running modes: Local, Cluster, and DataWorks. The configurations differ by mode. See Running modes.
How do I set the resource parallelism for Spark?
Parallelism is determined by the number of executors multiplied by the number of CPU cores per executor:

Maximum parallel tasks = `spark.executor.instances` x `spark.executor.cores`

| Parameter | Description |
|---|---|
| `spark.executor.instances` | Number of executors the job requests |
| `spark.executor.cores` | CPU cores per executor process. Each core runs one task at a time. Set to a value between 2 and 4. |
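For example, with illustrative values, the configuration below allows at most 10 x 4 = 40 tasks to run concurrently:

```
## Illustrative sizing: 10 executors x 4 cores each = up to 40 concurrent tasks.
spark.executor.instances=10
spark.executor.cores=4
```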
How do I resolve out-of-memory (OOM) issues?
OOM errors appear as:
- `java.lang.OutOfMemoryError: Java heap space`
- `java.lang.OutOfMemoryError: GC overhead limit exceeded`
- `Cannot allocate memory`
- `The job has been killed by "OOM Killer", please check your job's memory usage`
Tune the following parameters to resolve OOM:
| Parameter | Description | Guidance |
|---|---|---|
| `spark.executor.memory` | Heap memory per executor | Maintain a 1:4 ratio with `spark.executor.cores`. For example, if `spark.executor.cores=1`, set `spark.executor.memory=4g`. Increase if executors throw `OutOfMemoryError`. |
| `spark.executor.memoryOverhead` | Off-heap memory per executor (JVM overhead, strings, NIO buffers) | Default: `spark.executor.memory` x 0.1, minimum 384 MB. Increase if you see `Cannot allocate memory` or OOM Killer errors in executor logs. |
| `spark.driver.memory` | Heap memory for the driver | Maintain a 1:4 ratio with `spark.driver.cores`. Increase if the driver collects large amounts of data or throws `OutOfMemoryError`. |
| `spark.driver.memoryOverhead` | Off-heap memory for the driver | Default: `spark.driver.memory` x 0.1, minimum 384 MB. Increase if you see `Cannot allocate memory` in driver logs. |
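As a sketch, a configuration following the 1:4 core-to-memory ratios above might look like this (the values are illustrative, not tuned recommendations):

```
## 2 cores : 8 GB heap keeps the 1:4 ratio per executor.
spark.executor.cores=2
spark.executor.memory=8g
## Raise off-heap memory above the 10% default if "Cannot allocate memory" appears.
spark.executor.memoryOverhead=2g
## Driver sized the same way: 1 core : 4 GB heap.
spark.driver.cores=1
spark.driver.memory=4g
```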
How do I resolve insufficient disk space issues?
The No space left on device error means local disk space is exhausted — usually on an executor, causing it to exit.
Two ways to fix this:

- Increase the disk size — Set `spark.hadoop.odps.cupid.disk.driver.device_size` (default: 20 GB, maximum: 100 GB). This parameter applies to the driver and each executor, and only takes effect in `spark-defaults.conf` or DataWorks configuration items.
- Increase the number of executors — If the error persists after increasing the disk size to 100 GB, a single executor's shuffle data has exceeded the limit. This may be caused by data skew; in that case, repartition the data. If the data volume is simply too large, increase `spark.executor.instances` to distribute the load.
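For example, to raise the local disk above the default 20 GB (the value below is illustrative):

```
## Add to spark-defaults.conf or DataWorks configuration items.
## Applies to the driver and every executor.
spark.hadoop.odps.cupid.disk.driver.device_size=50g
```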
How do I reference resources in a MaxCompute project?
Spark on MaxCompute supports two methods:
Method 1: Reference by parameter (recommended for large resources)
Set spark.hadoop.odps.cupid.resources in spark-defaults.conf or DataWorks configuration items. Format: <projectname>.<resourcename>[:<newresourcename>]. Separate multiple resources with commas.
Resources are downloaded to the working directory of the driver and each executor. The default filename is <projectname>.<resourcename>. To rename a resource, use <projectname>.<resourcename>:<newresourcename>.
```
## Add to DataWorks configuration items or spark-defaults.conf.
## Reference multiple resources.
spark.hadoop.odps.cupid.resources=public.python-python-2.7-ucs4.zip,public.myjar.jar
## Rename a resource on download.
spark.hadoop.odps.cupid.resources=public.myjar.jar:myjar.jar
```
See Resource operations for more information.
Method 2: Reference from DataWorks
Add resources from MaxCompute to a business flow on the Data Development pane in DataWorks. See Manage MaxCompute resources. Then select JAR, file, and archive resources in the ODPS Spark node.
Method 2 uploads resources at task runtime. For large resources, use Method 1.
How do I access a VPC?
Spark on MaxCompute connects to a VPC using an elastic network interface (ENI). ENI connections are limited to VPCs in the same region. To access multiple VPCs, connect the target VPCs to the ENI-connected VPC.
To set up ENI-based VPC access:
1. Enable the ENI-based leased line connection. See Access VPC-connected instances from Spark.
2. In the target service, add a whitelist rule to allow access from the MaxCompute security group created in step 1. For example, to access ApsaraDB RDS, add a rule in RDS that allows the security group from step 1. If the service only accepts IP addresses (not security groups), add the vSwitch CIDR block used in step 1.
3. Configure the ENI parameters for the job in `spark-defaults.conf` or DataWorks configuration items. Replace `[regionid]` and `[vpcid]` with your actual region ID and VPC ID:

```
spark.hadoop.odps.cupid.eni.enable = true
spark.hadoop.odps.cupid.eni.info = [regionid]:[vpcid]
```
How do I access the Internet?
Two methods are available:
Method 1: ENI connection
1. Enable the ENI-based leased line connection. See Access VPC-connected instances from Spark.
2. Make sure the connected VPC has Internet access. See Use the SNAT feature of an Internet NAT gateway to access the Internet.
3. Configure the job. Replace `[region]` and `[vpcid]` with your actual region ID and VPC ID:

```
## Add to DataWorks configuration items or spark-defaults.conf.
spark.hadoop.odps.cupid.internet.access.list=aliyundoc.com:443
spark.hadoop.odps.cupid.eni.enable=true
spark.hadoop.odps.cupid.eni.info=[region]:[vpcid]
```
Method 2: SmartNAT
SmartNAT is not supported for Spark 3.4 and later.
For example, to access https://aliyundoc.com:443:
1. Join the MaxCompute developer community (DingTalk group ID: 11782920) and ask the support team to add `https://aliyundoc.com:443` to the `odps.security.outbound.internet` list.
2. Enable SmartNAT in `spark-defaults.conf` or DataWorks configuration items:

```
spark.hadoop.odps.cupid.internet.access.list=aliyundoc.com:443
spark.hadoop.odps.cupid.smartnat.enable=true
```
How do I access OSS?
Spark on MaxCompute accesses Alibaba Cloud Object Storage Service (OSS) through the Jindo SDK.
Step 1: Configure the Jindo SDK and OSS endpoint
Add the following to spark-defaults.conf or DataWorks configuration items:
```
## Reference the Jindo SDK JAR package.
spark.hadoop.odps.cupid.resources=public.jindofs-sdk-3.7.2.jar
## Set the OSS implementation class.
spark.hadoop.fs.AbstractFileSystem.oss.impl=com.aliyun.emr.fs.oss.OSS
spark.hadoop.fs.oss.impl=com.aliyun.emr.fs.oss.JindoOssFileSystem
## Set the OSS endpoint (internal endpoint only).
spark.hadoop.fs.oss.endpoint=oss-[YourRegionId]-internal.aliyuncs.com
## Optional: add to the trusted services whitelist if the connection fails at runtime.
spark.hadoop.odps.cupid.trusted.services.access.list=[YourBucketName].oss-[YourRegionId]-internal.aliyuncs.com
```
In cluster mode, only internal OSS endpoints are supported. Public endpoints are not supported. See OSS regions and endpoints.
Step 2: Configure authentication
Two authentication methods are supported:
- AccessKey pair — Set the credentials directly in `SparkConf`:

  ```scala
  val conf = new SparkConf()
    .setAppName("jindo-sdk-demo")
    .set("spark.hadoop.fs.oss.accessKeyId", "<YourAccessKeyId>")
    .set("spark.hadoop.fs.oss.accessKeySecret", "<YourAccessKeySecret>")
  ```

- Security Token Service (STS) token — Use this method when the MaxCompute project and OSS bucket belong to the same Alibaba Cloud account.

  1. Click One-click Authorization to authorize MaxCompute to access OSS resources using an STS token.
  2. Enable the local HTTP service in `spark-defaults.conf` or DataWorks configuration items:

     ```
     spark.hadoop.odps.cupid.http.server.enable = true
     ```

  3. Configure the STS credentials in `SparkConf`. Replace `${aliyun-uid}` with the UID of the Alibaba Cloud account and `${role-name}` with the role name:

     ```scala
     val conf = new SparkConf()
       .setAppName("jindo-sdk-demo")
       .set("spark.hadoop.fs.jfs.cache.oss.credentials.provider", "com.aliyun.emr.fs.auth.CustomCredentialsProvider")
       .set("spark.hadoop.aliyun.oss.provider.url", "http://localhost:10011/sts-token-info?user_id=${aliyun-uid}&role=${role-name}")
     ```
How do I reference a third-party Python library?
If a PySpark job throws No module named 'xxx', the required library is not in the default Python environment. Three methods are available:
Method 1: Use the MaxCompute public Python environment
Add the following to spark-defaults.conf or DataWorks configuration items:
- Python 2.7:

  ```
  spark.hadoop.odps.cupid.resources = public.python-2.7.13-ucs4.tar.gz
  spark.pyspark.python = ./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python
  ```

  Third-party libraries included: https://odps-repo.oss-cn-hangzhou.aliyuncs.com/pyspark/py27/py27-default_req.txt

- Python 3.7:

  ```
  spark.hadoop.odps.cupid.resources = public.python-3.7.9-ucs4.tar.gz
  spark.pyspark.python = ./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3
  ```

  Third-party libraries included: https://odps-repo.oss-cn-hangzhou.aliyuncs.com/pyspark/py37/py37-default_req.txt
Method 2: Upload a single wheel package
Use this method when you have few and simple Python dependencies.
1. Rename the wheel package to a .zip file (for example, pymysql.zip).
2. Upload the .zip file as an archive resource.
3. Reference it on the ODPS Spark node (archive type).
4. Add the following to `spark-defaults.conf` or DataWorks configuration items:

```
spark.executorEnv.PYTHONPATH=pymysql
spark.yarn.appMasterEnv.PYTHONPATH=pymysql
```
Then import the package in your code:

```python
import pymysql
```
Method 3: Upload a complete custom Python environment
Use this method for complex dependencies or when you need a custom Python version. Package the Python environment in a Docker container and upload it. See Package dependencies.
How do I resolve JAR dependency conflicts?
A NoClassDefFoundError or NoSuchMethodError at runtime usually means a third-party dependency version in your JAR conflicts with Spark's bundled dependencies.
1. In `pom.xml`, set all Spark community edition, Hadoop community edition, and ODPS/Cupid dependencies to `provided`.
2. Identify and exclude the conflicting dependency.
3. If excluding isn't enough, use maven-shade-plugin relocation to isolate the conflicting package.
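A maven-shade-plugin relocation sketch. The relocated package (`com.google.common`, i.e. Guava) is only an example of a commonly conflicting dependency, not taken from this page:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- Move Guava classes into a private namespace so they cannot
             clash with the cluster's bundled version. -->
        <pattern>com.google.common</pattern>
        <shadedPattern>com.shaded.google.common</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```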
How do I debug in local mode?
The configuration method depends on your Spark version.
Spark 2.3.0
1. Add the following to `spark-defaults.conf`:

   ```
   spark.hadoop.odps.project.name=<Yourprojectname>
   spark.hadoop.odps.access.id=<YourAccessKeyID>
   spark.hadoop.odps.access.key=<YourAccessKeySecret>
   spark.hadoop.odps.end.point=<endpoint>
   ```

2. Run in local mode:

   ```
   ./bin/spark-submit --master local spark_sql.py
   ```
Spark 2.4.5 / Spark 3.1.1
1. Create `odps.conf` with the following content:

   ```
   odps.access.id=<YourAccessKeyID>
   odps.access.key=<YourAccessKeySecret>
   odps.end.point=<endpoint>
   odps.project.name=<Yourprojectname>
   ```

2. Set the environment variable to point to the file:

   ```
   export ODPS_CONF_FILE=/path/to/odps.conf
   ```

3. Run in local mode:

   ```
   ./bin/spark-submit --master local spark_sql.py
   ```
Common local mode errors
| Error | Cause | Solution |
|---|---|---|
| `Incomplete config, no accessId or accessKey` / `Incomplete config, no odps.service.endpoint` | EventLog is enabled in local mode | Remove `spark.eventLog.enabled=true` from `spark-defaults.conf` |
| `Cannot create CupidSession with empty CupidConf` | Spark 2.4.5 or 3.1.1 cannot read `odps.access.id` from `spark-defaults.conf` | Create `odps.conf`, set the `ODPS_CONF_FILE` environment variable, and run again |
| `java.util.NoSuchElementException: odps.access.id` | Spark 2.3.0 cannot find the access ID | Add `spark.hadoop.odps.access.id` and related parameters to `spark-defaults.conf` |
Job errors
What do I do if "User signature does not match" occurs?
```
com.aliyun.odps.OdpsException: ODPS-0410042:
Invalid signature value - User signature does not match
```
The AccessKey ID or AccessKey Secret in spark-defaults.conf is incorrect. Check them against the AccessKey ID and AccessKey Secret in User Information Management in the Alibaba Cloud console and correct any mismatches.
What do I do if "You have NO privilege" occurs?
```
com.aliyun.odps.OdpsException: ODPS-0420095:
Access Denied - Authorization Failed [4019], You have NO privilege 'odps:CreateResource' on {acs:odps:*:projects/*}
```
Your account lacks the required permission. Ask the project owner to grant you Read and Create permissions on the resource. See MaxCompute permissions.
What do I do if "Access Denied" occurs?
```
com.aliyun.odps.OdpsException: ODPS-0420095: Access Denied - The task is not in release range: CUPID
```
Diagnose the cause:
- Is the AccessKey ID or AccessKey Secret in `spark-defaults.conf` correct? Verify them against User Information Management in the Alibaba Cloud console. See Set up a Linux development environment for the correct configuration format.
- Is Spark on MaxCompute available in your region? The service may not be enabled in your region. Check region availability or contact support by joining DingTalk group 21969532 (Spark on MaxCompute support).
What do I do if "No space left on device" occurs?
Spark uses local disk for shuffle data and BlockManager overflow. The disk size is controlled by spark.hadoop.odps.cupid.disk.driver.device_size (default: 20 GB, maximum: 100 GB).
If the error persists after you increase the disk size to 100 GB, a common cause is data skew, where data is concentrated in a few blocks during the shuffle or cache process. Decrease spark.executor.cores and increase spark.executor.instances to distribute the data more evenly.
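As a sketch, swapping cores for instances keeps total parallelism constant while spreading shuffle data across more executors, and therefore more local disks (values illustrative):

```
## Before: 10 executors x 2 cores. After: 20 executors x 1 core —
## same 20-task parallelism, twice as many local disks for shuffle data.
spark.executor.cores=1
spark.executor.instances=20
```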
What do I do if "Table or view not found" occurs?
```
Table or view not found: xxx
```
Diagnose the cause:
- Does the table or view exist? Create the table before running the job.
- Is Hive catalog support enabled? If the table exists but is not found, check whether `enableHiveSupport()` is present in your session builder. Remove it:

  ```python
  # Before
  spark = SparkSession.builder.appName(app_name).enableHiveSupport().getOrCreate()
  # After
  spark = SparkSession.builder.appName(app_name).getOrCreate()
  ```
What do I do if "Shutdown hook called before final status was reported" occurs?
```
App Status: SUCCEEDED, diagnostics: Shutdown hook called before final status was reported.
```
The main application didn't request cluster resources through the ApplicationMaster (AM). The most common reason is that spark.master is set to local in the code, or a SparkContext was never created. Remove the spark.master=local setting from your code when submitting to the cluster.
What do I do if a JAR package version conflict error occurs?
```
User class threw exception: java.lang.NoSuchMethodError
```
A JAR package version conflict or incorrect class is loaded. To identify the conflicting dependency:
1. Find the JAR containing the problematic class in `$SPARK_HOME/jars`:

   ```
   grep <AbnormalClassName> $SPARK_HOME/jars/*.jar
   ```

2. View all project dependencies:

   ```
   mvn dependency:tree
   ```

3. Exclude the conflicting dependency with a Maven dependency exclusion.
4. Recompile and resubmit.
What do I do if a "ClassNotFound" error occurs?
```
java.lang.ClassNotFoundException: xxxx.xxx.xxxxx
```
The class is missing from the submitted JAR, or the dependency is misconfigured.
1. Verify the class exists in your JAR:

   ```
   jar -tf <JobJARFile> | grep <ClassName>
   ```

2. Check the dependencies in `pom.xml`.
3. If needed, use the Shade method to repackage and submit.
What do I do if "The task is not in release range" occurs?
```
The task is not in release range: CUPID
```
Spark on MaxCompute is not enabled in your region. Select a region where the service is available.
What do I do if a "java.io.UTFDataFormatException" error occurs?
```
java.io.UTFDataFormatException: encoded string too long: 2818545 bytes
```
Increase the value of spark.hadoop.odps.cupid.disk.driver.device_size in spark-defaults.conf. The default is 20 GB and the maximum is 100 GB.
What do I do if Chinese characters appear garbled in Spark output?
Add the following to spark-defaults.conf or DataWorks configuration items:
"--conf" "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
"--conf" "spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
What do I do if an error occurs when Spark calls a third-party service over the Internet?
Spark on MaxCompute doesn't have a direct Internet connection, so outbound Internet calls fail. Build an Nginx reverse proxy in a VPC to route the traffic, then use Spark's ENI-based VPC access to reach the proxy. See Access VPC-connected instances from Spark.