
MaxCompute: Develop ODPS Spark tasks

Last Updated: Mar 26, 2026

Use an ODPS Spark node to schedule and run Spark on MaxCompute tasks in DataWorks. Spark on MaxCompute tasks can run in local or cluster mode. This topic explains how to prepare your code, configure the node parameters, and run the task in cluster mode in DataWorks.

Prerequisites

Before you begin, ensure that you have:

Limitations

If you commit an ODPS Spark node that uses Spark 3.x and an error is reported, submit a ticket to contact technical support. The support team will update the version of the exclusive resource group for scheduling used to run the node.

Prepare your code

Choose the language for your Spark on MaxCompute task. The preparation steps differ depending on whether you use Java/Scala or Python.

Java or Scala

Complete the following steps on your local machine before configuring the node:

  1. Set up a development environment based on your operating system.

  2. Develop your code. Write your Spark on MaxCompute application code. Start from the sample project template to get a pre-configured project structure with the correct Spark and MaxCompute dependencies.

  3. Package and upload the JAR. Package your code as a JAR and upload it to DataWorks as a MaxCompute resource. For details, see Create and use MaxCompute resources.

Python (default environment)

Write your PySpark code directly in DataWorks as a Python resource, then commit it. No local setup is needed. For examples and instructions, see Create and use MaxCompute resources and Develop a Spark on MaxCompute application by using PySpark.

If the default Python environment does not include the third-party packages your task needs, either prepare a custom Python environment (see below), or use PyODPS 2 nodes or PyODPS 3 nodes, which support a broader set of Python libraries.

Python (custom environment)

If the default Python environment does not meet your requirements:

  1. Prepare a custom Python environment on your local machine. Refer to PySpark Python versions and supported dependencies to configure the environment based on your dependency requirements.

  2. Package and upload the environment. Package the Python environment as a ZIP file and upload it to DataWorks as a MaxCompute resource. For details, see Create and use MaxCompute resources.
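The packaging step can be sketched with Python's standard library. The directory layout below is a placeholder, not a DataWorks requirement; follow the linked guide for the actual environment contents:

```python
import os
import shutil
import tempfile
import zipfile

# Hypothetical layout: a locally prepared Python environment directory.
# In practice this would hold the interpreter and installed packages.
env_dir = os.path.join(tempfile.mkdtemp(), "python-env")
os.makedirs(os.path.join(env_dir, "bin"))
with open(os.path.join(env_dir, "bin", "python"), "w") as f:
    f.write("")  # stand-in for the real interpreter and libraries

# Package the environment directory as a ZIP archive for upload.
archive = shutil.make_archive("python-env", "zip", root_dir=env_dir)
print(zipfile.is_zipfile(archive))  # prints: True
```

The resulting python-env.zip is what you upload to DataWorks as an Archive-type MaxCompute resource.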

Configure the node

In cluster mode, the node runs your application by calling its Main method as the entry point. The task is considered complete when the Main method exits in either the Success or Fail state.

Do not upload the spark-defaults.conf file. Instead, add each configuration item from spark-defaults.conf individually in the Configuration Items field of the node.
(Figure: Spark task configuration)

The following list describes each parameter and, where applicable, the equivalent spark-submit option.

  • Spark version (required): The Spark version to use. Options: Spark1.x, Spark2.x, Spark3.x.

  • Language (required): The programming language. Options: Java/Scala, Python.

  • Main JAR resource (required): The main JAR file (Java/Scala) or Python script uploaded as a MaxCompute resource. Upload and commit the resource before configuring this field. See Create and use MaxCompute resources. Equivalent to the app jar or Python file in spark-submit.

  • Configuration items (conditional): Spark configuration properties, added one per line. Add items such as the number of executors, memory size, and spark.hadoop.odps.runtime.end.point as needed. Equivalent to --conf PROP=VALUE.

  • Main class (Java/Scala only): The fully qualified name of the main class. Not required for Python tasks. Equivalent to --class CLASS_NAME.

  • Parameters (optional): Arguments passed to your application, separated by spaces. Use the ${Variable name} format for scheduling parameters, then assign values in the Scheduling parameter section of the Properties tab. For supported formats, see Supported formats of scheduling parameters. Equivalent to [app arguments].

  • Other resources (optional): Additional resource files required by the task. Supported types: Jar resource (Java/Scala only), Python resource (Python only), File resource (all languages), Archive resource (all languages, compressed files only). Upload and commit resources before configuring this field. Equivalent to --jars, --py-files, --files, and --archives.
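For example, the Configuration Items field might contain entries such as the following, one per line. The values are illustrative; size the executors for your workload and substitute the endpoint for your region:

```
spark.executor.instances=2
spark.executor.cores=2
spark.executor.memory=4g
spark.driver.memory=2g
spark.hadoop.odps.runtime.end.point=<your MaxCompute endpoint>
```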

Auto-configured items: the following configuration properties are set automatically to match your MaxCompute project values. Override them in Configuration items only if your task requires different values:

  • spark.hadoop.odps.access.id

  • spark.hadoop.odps.access.key

  • spark.hadoop.odps.end.point

Example: run a string-to-number check

This example creates a PySpark task that checks whether strings can be converted to numbers.

Step 1: Create and commit the Python resource

  1. In the DataWorks console, go to DataStudio and create a Python resource named spark_is_number.py. For details on creating resources, see Create and use MaxCompute resources. Paste the following code into the resource:

    # -*- coding: utf-8 -*-
    import sys
    from pyspark.sql import SparkSession
    
    try:
        # reload() and sys.setdefaultencoding() exist only in Python 2
        reload(sys)
        sys.setdefaultencoding('utf8')
    except NameError:
        # Python 3 does not define them and does not need this step
        pass
    
    if __name__ == '__main__':
        spark = SparkSession.builder\
            .appName("spark sql")\
            .config("spark.sql.broadcastTimeout", 20 * 60)\
            .config("spark.sql.crossJoin.enabled", True)\
            .config("odps.exec.dynamic.partition.mode", "nonstrict")\
            .config("spark.sql.catalogImplementation", "odps")\
            .getOrCreate()
    
    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            pass
    
        try:
            import unicodedata
            unicodedata.numeric(s)
            return True
        except (TypeError, ValueError):
            pass
    
        return False
    
    print(is_number('foo'))
    print(is_number('1'))
    print(is_number('1.3'))
    print(is_number('-1.37'))
    print(is_number('1e3'))
  2. Save and commit the resource.
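Because is_number uses only the standard library, you can sanity-check the conversion logic locally before running the task on MaxCompute. This is the same function as in spark_is_number.py, run without Spark:

```python
import unicodedata

def is_number(s):
    # Same logic as in spark_is_number.py: try float() first,
    # then fall back to unicodedata.numeric() for numeric characters.
    try:
        float(s)
        return True
    except ValueError:
        pass
    try:
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False

for value in ['foo', '1', '1.3', '-1.37', '1e3']:
    print(is_number(value))
# prints: False, True, True, True, True
```

The local output should match what the tracking URL shows after the backfill run.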

Step 2: Configure the ODPS Spark node

In the ODPS Spark node, set the following parameters:

  • Spark version: Spark2.x

  • Language: Python

  • Main Python resource: spark_is_number.py (the resource you created)

Save and commit the node.

Step 3: Run the node in Operation Center

ODPS Spark nodes cannot be run from DataStudio. Go to Operation Center in the development environment to run the node.

In Operation Center, trigger a backfill for the ODPS Spark node. For details, see Backfill data and view data backfill instances (new version).

Step 4: View the results

After the backfill instance completes successfully, click the tracking URL in the run logs to view the output:

False
True
True
True
True

More examples

For additional Spark on MaxCompute development scenarios:

What's next

After developing and running your Spark on MaxCompute task, you can:

  • Configure scheduling properties: Set up periodic scheduling for the node, including rerun settings and scheduling dependencies, so the system runs the task automatically. See Overview.

  • Debug the node: Test the node code to verify the logic works as expected before going to production. See Debugging procedure.

  • Deploy the node: Deploy the node to make it active for scheduling. After deployment, the system schedules and runs the node automatically based on the scheduling properties you configured. See Deploy nodes.

  • Diagnose task issues: Use the Logview tool and Spark Web UI to inspect logs and verify that tasks are submitted and running as expected. See Enable the system to diagnose Spark tasks.