
DataWorks: MaxCompute Spark node

Last Updated: Mar 26, 2026

Use the MaxCompute Spark node to schedule and run Spark on MaxCompute batch jobs in DataWorks and integrate them with other node types. DataWorks always runs these jobs in cluster mode, so you must specify a main class as the entry point.

Spark on MaxCompute is compatible with open-source Spark and supports Java, Scala, and Python development. For a full description of run modes, see Run modes.

Limitations

If you select Spark 3.x for a MaxCompute Spark node and the job submission fails, purchase and use a Serverless resource group. For more information, see Purchase and use a Serverless resource group.

Prerequisites

Before you begin, ensure that you have:

  • (Optional, for RAM users) A DataWorks workspace with your RAM (Resource Access Management) user assigned the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so assign it with caution. If you are using a main account, skip this requirement. For details, see Add members to a workspace and assign roles to them.

  • A Spark on MaxCompute job resource uploaded to DataWorks — a JAR file for Java/Scala, or a Python resource for Python. See the language-specific setup below.

Java/Scala: prepare a JAR resource

Before configuring a MaxCompute Spark node for Java or Scala, develop and package your code locally, then upload it to DataWorks as a MaxCompute resource.

  1. Set up your local development environment. See Prepare a Java development environment or Prepare a Scala development environment.

  2. Develop your Spark on MaxCompute code. Use the sample project template as a starting point.

  3. Package the code as a JAR and upload it to DataWorks as a MaxCompute resource. See Create and use a MaxCompute resource.

Python (default environment): prepare a Python resource

Write your PySpark job directly in a Python resource in DataWorks, then reference it from the MaxCompute Spark node. For PySpark examples, see PySpark development examples.

The default Python environment has a limited set of pre-installed third-party packages. If your job requires additional dependencies, use a custom Python environment (see below) or switch to a PyODPS 2 or PyODPS 3 node.
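Before switching to a custom environment, you can check whether the packages your job needs are importable in the default environment. The following is a minimal sketch using only the standard library; the package names passed in are placeholders for whatever your job actually requires:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that is not importable in the
    current Python environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Example call; replace the names with your job's real dependencies.
missing = missing_packages(["json", "definitely_not_installed_pkg"])
if missing:
    print("Not available in this environment:", ", ".join(missing))
```

If the check reports missing packages, prepare a custom Python environment as described in the next section.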

Python (custom environment): prepare a ZIP archive

If the default Python environment does not meet your dependency requirements, package a custom environment and upload it as a MaxCompute resource.

  1. Configure a Python environment locally that meets your job's requirements. See Python versions and dependencies supported by PySpark.

  2. Compress the Python environment into a ZIP package and upload it to DataWorks as a MaxCompute resource. This resource serves as the execution environment for your Spark on MaxCompute task.
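The compression step can be scripted. The following is a minimal sketch using the standard library, assuming the environment was prepared in a local directory; the directory and archive names are illustrative, not prescribed by DataWorks:

```python
import shutil
from pathlib import Path

def package_env(env_dir, out_name="pyspark_env"):
    """Compress a locally prepared Python environment directory into a
    ZIP archive for upload as a MaxCompute Archive resource.
    Returns the path of the created archive."""
    env = Path(env_dir)
    if not env.is_dir():
        raise FileNotFoundError("environment directory not found: %s" % env)
    # shutil.make_archive appends ".zip" to out_name automatically.
    return shutil.make_archive(out_name, "zip", root_dir=env)
```

This is equivalent to running `zip -r` over the environment directory; upload the resulting ZIP file to DataWorks as described in Create and use a MaxCompute resource.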

Node parameters

DataWorks runs Spark on MaxCompute batch jobs in cluster mode. In cluster mode, the driver runs on the cluster, so you must specify a main class as the entry point. The job ends when the main method exits, with a final status of Success or Fail.

Do not upload the spark-defaults.conf file. Instead, add each configuration entry from spark-defaults.conf to the Configuration Properties field of the node.

You do not need to configure spark.hadoop.odps.access.id, spark.hadoop.odps.access.key, or spark.hadoop.odps.end.point. These default to the values of the MaxCompute project. Configure them explicitly only to override the defaults.
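Each entry in the Configuration Properties field corresponds to one `--conf PROP=VALUE` flag of spark-submit. The following sketch renders a property map as spark-submit flags, which can help when migrating an existing spark-defaults.conf; the property values shown are placeholders:

```python
def to_conf_flags(props):
    """Render a dict of configuration properties as spark-submit
    --conf flags, sorted for stable output."""
    return " ".join("--conf %s=%s" % (k, v) for k, v in sorted(props.items()))

# Placeholder values; take the real entries from your spark-defaults.conf.
props = {
    "spark.executor.instances": "4",
    "spark.executor.memory": "4g",
    "spark.hadoop.odps.runtime.end.point": "<runtime-endpoint>",
}
print(to_conf_flags(props))
```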

Java/Scala job


The Spark-submit command column shows the corresponding spark-submit flag for each parameter, helping you map existing CLI configurations to the DataWorks UI.

| Parameter | Description | Spark-submit command |
| --- | --- | --- |
| Spark version | Select Spark 1.x, Spark 2.x, or Spark 3.x. If you select Spark 3.x and the job submission fails, see Limitations. | None |
| Language | Select Java/Scala. | None |
| Main JAR | The main JAR resource file for the task. Upload the resource to DataWorks first. See Create and use a MaxCompute resource. | app jar or Python file |
| Main class | The name of the main class. This parameter is required for Java/Scala. | --class CLASS_NAME |
| Configuration Properties | Configuration entries for job submission. Add all entries from your spark-defaults.conf here, including executor count, memory size, and spark.hadoop.odps.runtime.end.point. | --conf PROP=VALUE |
| Arguments | Arguments to pass to the application, separated by spaces. Supports scheduling parameters in ${variable_name} format. Assign values in the Scheduling Parameters field on the Schedule tab. See Configure scheduling parameters. | [app arguments] |
| JAR resource | Additional JAR dependencies. This parameter applies only to Java/Scala. Upload resources to DataWorks first. See Create and use a MaxCompute resource. | --jars JARS |
| File resource | Specifies file resources to distribute with the job. | --files FILES |
| Archives resource | Specifies archive resources. Only the ZIP format is supported. | --archives ARCHIVES |

Python job


The Spark-submit command column shows the corresponding spark-submit flag for each parameter, helping you map existing CLI configurations to the DataWorks UI.

| Parameter | Description | Spark-submit command |
| --- | --- | --- |
| Spark version | Select Spark 1.x, Spark 2.x, or Spark 3.x. If you select Spark 3.x and the job submission fails, see Limitations. | None |
| Language | Select Python. | None |
| Main Python resource | The main Python resource file for the task. Upload the resource to DataWorks first. See Create and use a MaxCompute resource. | app jar or Python file |
| Configuration Properties | Configuration entries for job submission. Add all entries from your spark-defaults.conf here, including executor count, memory size, and spark.hadoop.odps.runtime.end.point. | --conf PROP=VALUE |
| Arguments | Arguments to pass to the application, separated by spaces. Supports scheduling parameters in ${variable_name} format. Assign values in the Scheduling Parameters field on the Schedule tab. See Configure scheduling parameters. | [app arguments] |
| Python resource | Additional Python dependencies. This parameter applies only to Python. Upload resources to DataWorks first. See Create and use a MaxCompute resource. | --py-files PY_FILES |
| File resource | Specifies file resources to distribute with the job. | --files FILES |
| Archives resource | Specifies archive resources. Only the ZIP format is supported. | --archives ARCHIVES |
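Values entered in the Arguments field reach the job as ordinary command-line arguments, so PySpark code reads them from `sys.argv`. The following is a minimal sketch assuming a single business-date argument filled by a scheduling parameter; the argument name and format are illustrative:

```python
import sys

def parse_args(argv):
    """Arguments from the node's Arguments field arrive space-separated
    in sys.argv. This sketch expects one business-date argument,
    e.g. '20240101'."""
    if len(argv) < 2:
        raise SystemExit("usage: spark_job.py <bizdate>")
    return argv[1]

# In the real job you would call parse_args(sys.argv); a fixed list is
# used here so the sketch runs standalone.
bizdate = parse_args(["spark_job.py", "20240101"])
print("running for partition ds=%s" % bizdate)
```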

Configure and run a MaxCompute Spark node

The following example walks through a Python job to verify your node configuration. If you use Java/Scala, skip step 1 and reference your JAR resource in the node configuration.

  1. Create a Python resource.

    1. On the Data Development page, find Resource Management in the left-side navigation pane. Click Create and select MaxCompute Spark Python to create a resource. Name the resource spark_is_number.py. For more information, see Create and use a MaxCompute resource. The following is a minimal working example to verify your configuration:

      # -*- coding: utf-8 -*-
      import sys
      from pyspark.sql import SparkSession
      
      try:
          # Python 2 only: reset the default encoding to UTF-8.
          reload(sys)
          sys.setdefaultencoding('utf8')
      except NameError:
          # Python 3: reload and setdefaultencoding no longer exist; nothing to do.
          pass
      
      
      def is_number(s):
          """Return True if the string s parses as a number."""
          try:
              float(s)
              return True
          except ValueError:
              pass
      
          try:
              import unicodedata
              unicodedata.numeric(s)
              return True
          except (TypeError, ValueError):
              pass
      
          return False
      
      
      if __name__ == '__main__':
          spark = SparkSession.builder\
              .appName("spark sql")\
              .config("spark.sql.broadcastTimeout", 20 * 60)\
              .config("spark.sql.crossJoin.enabled", True)\
              .config("odps.exec.dynamic.partition.mode", "nonstrict")\
              .config("spark.sql.catalogImplementation", "odps")\
              .getOrCreate()
      
          print(is_number('foo'))      # False
          print(is_number('1'))        # True
          print(is_number('1.3'))      # True
          print(is_number('-1.37'))    # True
          print(is_number('1e3'))      # True
    2. Save the resource.

  2. Configure the MaxCompute Spark node. In the node editor, set the node and scheduling parameters as described in Node parameters.

  3. If the node runs on a schedule, configure its scheduling properties. See Configure scheduling properties.

  4. Deploy the node. See Deploy tasks.

  5. Monitor the run status in Operation and Maintenance Center. See View and manage auto-triggered tasks.

MaxCompute Spark nodes cannot be run directly from the node editor on the Data Development page. Run the task from Operation and Maintenance Center.
After a Backfill Instance runs successfully, open the tracking URL from the Run Log to view results.

What's next