Use the MaxCompute Spark node to schedule and run Spark on MaxCompute batch jobs in DataWorks and integrate them with other node types. DataWorks always runs these jobs in cluster mode, so you must specify a main class as the entry point.
Spark on MaxCompute is compatible with open-source Spark and supports Java, Scala, and Python development. For a full description of run modes, see Run modes.
Limitations
If you select Spark 3.x for a MaxCompute Spark node and the job submission fails, purchase and use a Serverless resource group. For more information, see Purchase and use a Serverless resource group.
Prerequisites
Before you begin, ensure that you have:
(Optional, for RAM users) A DataWorks workspace with your RAM (Resource Access Management) user assigned the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so assign it with caution. If you are using a main account, skip this requirement. For details, see Add members to a workspace and assign roles to them.
A Spark on MaxCompute job resource uploaded to DataWorks — a JAR file for Java/Scala, or a Python resource for Python. See the language-specific setup below.
Java/Scala: prepare a JAR resource
Before configuring a MaxCompute Spark node for Java or Scala, develop and package your code locally, then upload it to DataWorks as a MaxCompute resource.
Set up your local development environment. See Prepare a Java development environment or Prepare a Scala development environment.
Develop your Spark on MaxCompute code. Use the sample project template as a starting point.
Package the code as a JAR and upload it to DataWorks as a MaxCompute resource. See Create and use a MaxCompute resource.
Python (default environment): prepare a Python resource
Write your PySpark job directly in a Python resource in DataWorks, then reference it from the MaxCompute Spark node. For PySpark examples, see PySpark development examples.
The default Python environment has a limited set of pre-installed third-party packages. If your job requires additional dependencies, use a custom Python environment (see below) or switch to a PyODPS 2 or PyODPS 3 node.
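A job that silently assumes a package exists can fail deep inside Spark with an opaque stack trace. A minimal sketch of a fail-fast guard you can place at the top of a PySpark script (the package-checking helper and its error message are illustrative, not part of DataWorks):

```python
import importlib

def require_package(name):
    """Return the imported module, or raise a descriptive error if it is
    missing from the current Python environment."""
    try:
        return importlib.import_module(name)
    except ImportError:
        raise ImportError(
            "Package '{}' is not available in this Python environment. "
            "Use a custom Python environment or a PyODPS node instead.".format(name)
        )

# json ships with the standard library, so this import succeeds:
json_mod = require_package("json")
```

Calling `require_package("numpy")` in an environment without NumPy would raise immediately with a message pointing at the environment, rather than failing later inside an executor.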
Python (custom environment): prepare a ZIP archive
If the default Python environment does not meet your dependency requirements, package a custom environment and upload it as a MaxCompute resource.
Configure a Python environment locally that meets your job's requirements. See Python versions and dependencies supported by PySpark.
Compress the Python environment into a ZIP package and upload it to DataWorks as a MaxCompute resource. This resource serves as the execution environment for your Spark on MaxCompute task.
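The archive must unpack into a single top-level directory, so the ZIP entries should keep paths relative to the environment's parent directory. A sketch of the packaging step using only the Python standard library (the directory and archive names are illustrative):

```python
import os
import zipfile

def zip_environment(env_dir, zip_path):
    """Recursively compress env_dir into zip_path, storing entries relative
    to env_dir's parent so the archive unpacks into one top-level directory.
    env_dir should be passed without a trailing path separator."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(env_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, os.path.dirname(env_dir)))

# Example: zip_environment("python-env", "python-env.zip")
```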
Node parameters
DataWorks runs Spark on MaxCompute batch jobs in cluster mode. In cluster mode, the driver runs on the cluster, so you must specify a main class as the entry point. The job finishes when the main method completes with a status of Success or Fail.
Do not upload the spark-defaults.conf file. Instead, add each configuration entry from spark-defaults.conf to the Configuration Properties field of the node.
You do not need to configure spark.hadoop.odps.access.id, spark.hadoop.odps.access.key, or spark.hadoop.odps.end.point. These default to the values of the MaxCompute project. Configure them explicitly only to override the defaults.
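As a point of reference, Configuration Properties entries typically look like the following. The values are illustrative; size the executor count and memory to your workload, and replace the endpoint placeholder with your region's value:

```properties
spark.executor.instances=4
spark.executor.cores=2
spark.executor.memory=4g
spark.driver.memory=4g
spark.hadoop.odps.runtime.end.point=<your-maxcompute-runtime-endpoint>
```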
Java/Scala job

The Spark-submit command column shows the corresponding spark-submit flag for each parameter, helping you map existing CLI configurations to the DataWorks UI.
| Parameter | Description | Spark-submit command |
|---|---|---|
| Spark version | Select Spark 1.x, Spark 2.x, or Spark 3.x. If you select Spark 3.x and the job submission fails, see Limitations. | — |
| Language | Select Java/Scala. | — |
| Main JAR | The main JAR resource file for the task. Upload the resource to DataWorks first. See Create and use a MaxCompute resource. | app jar or Python file |
| Main class | The name of the main class. This parameter is required for Java/Scala. | --class CLASS_NAME |
| Configuration Properties | Configuration entries for job submission. Add all entries from your spark-defaults.conf here, including executor count, memory size, and spark.hadoop.odps.runtime.end.point. | --conf PROP=VALUE |
| Arguments | Arguments to pass to the application, separated by spaces. Supports scheduling parameters in ${variable_name} format. Assign values in the Scheduling Parameters field on the Schedule tab. See Configure scheduling parameters. | [app arguments] |
| JAR resource | Additional JAR dependencies. This parameter applies only to Java/Scala. Upload resources to DataWorks first. See Create and use a MaxCompute resource. | --jars JARS |
| File resource | Additional file resources that the job depends on. Upload resources to DataWorks first. See Create and use a MaxCompute resource. | --files FILES |
| Archives resource | Specifies archive resources. Only ZIP format is supported. | --archives ARCHIVES |
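Putting the rows together, the node configuration corresponds roughly to a spark-submit invocation like the following. The file names, class name, and arguments are illustrative:

```shell
spark-submit --class com.example.SparkApp \
  --jars dep-1.0.jar \
  --files lookup.txt \
  --archives data.zip \
  --conf spark.executor.instances=4 \
  app-1.0.jar arg1 arg2
```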
Python job

The Spark-submit command column shows the corresponding spark-submit flag for each parameter, helping you map existing CLI configurations to the DataWorks UI.
| Parameter | Description | Spark-submit command |
|---|---|---|
| Spark version | Select Spark 1.x, Spark 2.x, or Spark 3.x. If you select Spark 3.x and the job submission fails, see Limitations. | — |
| Language | Select Python. | — |
| Main Python resource | The main Python resource file for the task. Upload the resource to DataWorks first. See Create and use a MaxCompute resource. | app jar or Python file |
| Configuration Properties | Configuration entries for job submission. Add all entries from your spark-defaults.conf here, including executor count, memory size, and spark.hadoop.odps.runtime.end.point. | --conf PROP=VALUE |
| Arguments | Arguments to pass to the application, separated by spaces. Supports scheduling parameters in ${variable_name} format. Assign values in the Scheduling Parameters field on the Schedule tab. See Configure scheduling parameters. | [app arguments] |
| Python resource | Additional Python dependencies. This parameter applies only to Python. Upload resources to DataWorks first. See Create and use a MaxCompute resource. | --py-files PY_FILES |
| File resource | Additional file resources that the job depends on. Upload resources to DataWorks first. See Create and use a MaxCompute resource. | --files FILES |
| Archives resource | Specifies archive resources. Only ZIP format is supported. | --archives ARCHIVES |
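The values in the Arguments field reach the job as ordinary command-line arguments. For example, if the field contains `${bizdate} us`, DataWorks resolves the scheduling parameter before submission and the script receives the resolved values in `sys.argv`. A minimal sketch (the argument meanings are illustrative):

```python
import sys

def parse_args(argv):
    """Return (bizdate, region) from the resolved node arguments.
    argv[0] is the script name, as in any Python program."""
    if len(argv) < 3:
        raise ValueError("expected arguments: <bizdate> <region>")
    return argv[1], argv[2]

if __name__ == "__main__":
    bizdate, region = parse_args(sys.argv)
    print(bizdate, region)
```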
Configure and run a MaxCompute Spark node
The following example walks through a Python job to verify your node configuration. If you use Java/Scala, skip step 1 and reference your JAR resource in the node configuration.
Create a Python resource.
On the Data Development page, find Resource Management in the left-side navigation pane. Click Create and select MaxCompute Spark Python to create a resource. Name the resource spark_is_number.py. For more information, see Create and use a MaxCompute resource. The following is a minimal working example to verify your configuration:

```python
# -*- coding: utf-8 -*-
import sys
from pyspark.sql import SparkSession

try:
    # For Python 2
    reload(sys)
    sys.setdefaultencoding('utf8')
except:
    # Not needed for Python 3
    pass

if __name__ == '__main__':
    spark = SparkSession.builder\
        .appName("spark sql")\
        .config("spark.sql.broadcastTimeout", 20 * 60)\
        .config("spark.sql.crossJoin.enabled", True)\
        .config("odps.exec.dynamic.partition.mode", "nonstrict")\
        .config("spark.sql.catalogImplementation", "odps")\
        .getOrCreate()

    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            pass
        try:
            import unicodedata
            unicodedata.numeric(s)
            return True
        except (TypeError, ValueError):
            pass
        return False

    print(is_number('foo'))
    print(is_number('1'))
    print(is_number('1.3'))
    print(is_number('-1.37'))
    print(is_number('1e3'))
```

Save the resource.
Configure the MaxCompute Spark node. In the node editor, set the node and scheduling parameters as described in Node parameters.
If the node runs on a schedule, configure its scheduling properties. See Configure scheduling properties.
Deploy the node. See Deploy tasks.
Monitor the run status in Operation and Maintenance Center. See View and manage auto-triggered tasks.
MaxCompute Spark nodes cannot be run directly from the node editor on the Data Development page. Run the task from Operation and Maintenance Center.
After a Backfill Instance runs successfully, open the tracking URL from the Run Log to view results.
What's next
Development examples: Java/Scala example: Spark 1.x, Java/Scala example: Spark 2.x, and Python example: PySpark development
Integration use cases: Access a VPC instance from Spark and Access OSS from Spark
Troubleshooting: FAQ about Spark and Diagnose Spark jobs