Use an ODPS Spark node to schedule and run Spark on MaxCompute tasks in DataWorks. Spark on MaxCompute tasks can run in local or cluster mode. This topic explains how to prepare your code, configure the node parameters, and run the task in cluster mode in DataWorks.
Prerequisites
Before you begin, ensure that you have:
- An ODPS Spark node created in DataWorks. For details, see Create and manage ODPS nodes.
Limitations
If you commit an ODPS Spark node that uses Spark 3.x and an error is reported, submit a ticket to contact technical support. The support team will update the version of the exclusive resource group for scheduling used to run the node.
Prepare your code
Choose the language for your Spark on MaxCompute task. The preparation steps differ depending on whether you use Java/Scala or Python.
Java or Scala
Complete the following steps on your local machine before configuring the node:
- Set up a development environment. Prepare a development environment based on your operating system.
- Develop your code. Write your Spark on MaxCompute application code. Start from the sample project template to get a pre-configured project structure with the correct Spark and MaxCompute dependencies.
- Package and upload the JAR. Package your code as a JAR and upload it to DataWorks as a MaxCompute resource. For details, see Create and use MaxCompute resources.
Python (default environment)
Write your PySpark code directly in DataWorks as a Python resource, then commit it. No local setup is needed. For examples and instructions, see Create and use MaxCompute resources and Develop a Spark on MaxCompute application by using PySpark.
If the default Python environment does not include the third-party packages your task needs, either prepare a custom Python environment (see below), or use PyODPS 2 nodes or PyODPS 3 nodes, which support a broader set of Python libraries.
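If you are unsure whether the default environment provides a dependency, one lightweight option (an illustrative sketch, not a DataWorks API) is to probe the runtime from within your script before relying on the package:

```python
import importlib.util

def has_package(name):
    """Return True if the named package is importable in the current runtime."""
    return importlib.util.find_spec(name) is not None

# Probe the packages this task would need; the names here are examples only.
required = ("numpy", "pandas")
missing = [pkg for pkg in required if not has_package(pkg)]
print("Missing packages:", missing)
```

If the list is non-empty, fall back to a custom Python environment or a PyODPS node as described above.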
Python (custom environment)
If the default Python environment does not meet your requirements:
- Prepare a custom Python environment on your local machine. Refer to PySpark Python versions and supported dependencies to configure the environment based on your dependency requirements.
- Package and upload the environment. Package the Python environment as a ZIP file and upload it to DataWorks as a MaxCompute resource. For details, see Create and use MaxCompute resources.
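As a rough illustration of the packaging step, the ZIP can be produced with any archiver; the sketch below uses Python's standard library, and the paths in the usage comment are placeholders for your prepared environment directory:

```python
import os
import zipfile

def zip_python_env(env_dir, zip_path):
    """Package a prepared Python environment directory into a ZIP archive,
    preserving the directory layout so paths resolve after extraction."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(env_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store entries relative to the environment's parent directory
                # so the archive extracts to a single top-level folder.
                arcname = os.path.relpath(full, os.path.dirname(env_dir))
                zf.write(full, arcname)

# Example with placeholder paths:
# zip_python_env("py37env", "py37env.zip")
```

Upload the resulting ZIP as an Archive resource, then reference it from the node's Other resources field.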
Configure the node
In cluster mode, the node runs your application by calling its Main method as the entry point. The task is considered complete when Main reaches either Success or Fail state.
Do not upload the spark-defaults.conf file. Instead, add each configuration item from spark-defaults.conf individually in the Configuration Items field of the node.
The following table describes each parameter. Parameters marked as auto-configured are pre-populated from your MaxCompute project settings — override them in Configuration Items only if your task requires different values.
| Parameter | Required | Description | Equivalent spark-submit option |
|---|---|---|---|
| Spark version | Yes | The Spark version to use. Options: Spark1.x, Spark2.x, Spark3.x. | — |
| Language | Yes | The programming language. Options: Java/Scala, Python. | — |
| Main JAR resource | Yes | The main JAR file (Java/Scala) or Python script uploaded as a MaxCompute resource. Upload and commit the resource before configuring this field. See Create and use MaxCompute resources. | app jar or Python file |
| Configuration items | Conditional | Spark configuration properties added one per line, equivalent to --conf in spark-submit. Add items such as the number of executors, memory size, and spark.hadoop.odps.runtime.end.point as needed. | --conf PROP=VALUE |
| Main class | Java/Scala only | The fully qualified name of the main class. Not required for Python tasks. | --class CLASS_NAME |
| Parameters | No | Arguments passed to your application, separated by spaces. Use ${Variable name} format for scheduling parameters, then assign values in the Scheduling parameter section of the Properties tab. For supported formats, see Supported formats of scheduling parameters. | [app arguments] |
| Other resources | No | Additional resource files required by the task. Supported types and their applicable languages: Jar resource (Java/Scala only), Python resource (Python only), File resource (all), Archive resource (all, compressed files only). Upload and commit resources first. | --jars, --py-files, --files, --archives |
Auto-configured items — the following configuration properties are set automatically to match your MaxCompute project values. Override them in Configuration items only if your task requires different values:
- spark.hadoop.odps.access.id
- spark.hadoop.odps.access.key
- spark.hadoop.odps.end.point
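For reference, a typical Configuration Items list might look like the following. The property names are standard Spark settings; the values shown are illustrative placeholders to adapt to your task, not recommended defaults:

```
spark.executor.instances=2
spark.executor.memory=4g
spark.executor.cores=2
spark.driver.memory=2g
spark.hadoop.odps.runtime.end.point=<your MaxCompute runtime endpoint>
```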
Example: run a string-to-number check
This example creates a PySpark task that checks whether strings can be converted to numbers.
Step 1: Create and commit the Python resource
- In the DataWorks console, go to DataStudio and create a Python resource named spark_is_number.py. For details on creating resources, see Create and use MaxCompute resources. Paste the following code into the resource:

  ```python
  # -*- coding: utf-8 -*-
  import sys
  from pyspark.sql import SparkSession

  try:
      # Python 2 only
      reload(sys)
      sys.setdefaultencoding('utf8')
  except:
      # Python 3 does not need this
      pass

  if __name__ == '__main__':
      spark = SparkSession.builder\
          .appName("spark sql")\
          .config("spark.sql.broadcastTimeout", 20 * 60)\
          .config("spark.sql.crossJoin.enabled", True)\
          .config("odps.exec.dynamic.partition.mode", "nonstrict")\
          .config("spark.sql.catalogImplementation", "odps")\
          .getOrCreate()

      def is_number(s):
          try:
              float(s)
              return True
          except ValueError:
              pass

          try:
              import unicodedata
              unicodedata.numeric(s)
              return True
          except (TypeError, ValueError):
              pass

          return False

      print(is_number('foo'))
      print(is_number('1'))
      print(is_number('1.3'))
      print(is_number('-1.37'))
      print(is_number('1e3'))
  ```

- Save and commit the resource.
Step 2: Configure the ODPS Spark node
In the ODPS Spark node, set the following parameters:
| Parameter | Value |
|---|---|
| Spark version | Spark2.x |
| Language | Python |
| Main Python resource | spark_is_number.py (the resource you created) |
Save and commit the node.
Step 3: Run the node in Operation Center
ODPS Spark nodes cannot be run from DataStudio. Go to Operation Center in the development environment to run the node.
In Operation Center, trigger a backfill for the ODPS Spark node. For details, see Backfill data and view data backfill instances (new version).
Step 4: View the results
After the backfill instance completes successfully, click the tracking URL in the run logs to view the output:
False
True
True
True
True
More examples
For additional Spark on MaxCompute development scenarios:
What's next
After developing and running your Spark on MaxCompute task, you can:
- Configure scheduling properties: Set up periodic scheduling for the node, including rerun settings and scheduling dependencies, so the system runs the task automatically. See Overview.
- Debug the node: Test the node code to verify the logic works as expected before going to production. See Debugging procedure.
- Deploy the node: Deploy the node to make it active for scheduling. After deployment, the system schedules and runs the node automatically based on the scheduling properties you configured. See Deploy nodes.
- Diagnose task issues: Use the Logview tool and Spark Web UI to inspect logs and verify that tasks are submitted and running as expected. See Enable the system to diagnose Spark tasks.