You can run Spark on MaxCompute tasks in local or cluster mode. You can also run offline Spark on MaxCompute tasks in cluster mode in DataWorks to integrate the tasks with other types of nodes for scheduling. This topic describes how to configure and schedule a Spark on MaxCompute task in DataWorks.
Prerequisites
(Required if you use a RAM user to develop tasks) The desired RAM user is added to your DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has extensive permissions. We recommend that you assign the Workspace Administrator role to a user only when necessary. For more information about how to add a member, see Add workspace members and assign roles to them.
Note: If you use an Alibaba Cloud account, ignore this prerequisite.
A MaxCompute Spark node is created. For more information, see Create a node.
Limits
If an error is reported when you commit a MaxCompute Spark node that uses the Spark 3.X version, purchase a serverless resource group. For more information, see Create and use a serverless resource group.
Background information
Spark on MaxCompute is a computing service that is provided by MaxCompute and is compatible with open source Spark. Spark on MaxCompute provides a Spark computing framework based on unified computing resource and dataset permission systems. Spark on MaxCompute allows you to use your preferred development method to submit and run Spark tasks. Spark on MaxCompute can meet diverse data processing and analytics requirements. In DataWorks, you can use MaxCompute Spark nodes to schedule and run Spark on MaxCompute tasks and integrate Spark on MaxCompute tasks with other types of tasks.
Spark on MaxCompute allows you to use Java, Scala, or Python to develop tasks and run the tasks in local or cluster mode. Spark on MaxCompute also allows you to run offline Spark on MaxCompute tasks in cluster mode in DataWorks. For more information about the running modes of Spark on MaxCompute tasks, see Running modes.
Preparations
MaxCompute Spark nodes allow you to use Java, Scala, or Python to develop and run offline Spark on MaxCompute tasks. The operations and parameters that are required for developing the offline Spark on MaxCompute tasks vary based on the programming language that you use. You can select a programming language based on your business requirements.
Java/Scala
Before you run Java or Scala code in a MaxCompute Spark node, you must complete the development of code for a Spark on MaxCompute task on your on-premises machine and upload the code to DataWorks as a MaxCompute resource. You must perform the following steps:
Prepare a development environment.
You must prepare the development environment in which you want to run a Spark on MaxCompute task based on the operating system that you use. For more information, see Set up a Linux development environment or Set up a Windows development environment.
Develop Java or Scala code.
Develop the code for your Spark on MaxCompute task on your on-premises machine or in the prepared development environment. We recommend that you use the sample project template provided by Spark on MaxCompute.
Package the developed code and upload the code to DataWorks.
After the code is developed, you must package the code and upload the package to DataWorks as a MaxCompute resource.
Programming language: Python (Use the default Python environment)
In DataWorks, you can develop a PySpark task by writing code in a Python resource online, and then commit and run the code by using a MaxCompute Spark node. To do this, you must create a Python resource in DataWorks. For examples of how to develop Spark on MaxCompute applications by using PySpark, see Develop a Spark on MaxCompute application by using PySpark.
You can use the default Python environment provided by DataWorks to develop code. If the third-party packages that the default Python environment supports cannot meet the requirements of your PySpark task, you can refer to Programming language: Python (Use a custom Python environment) to prepare a custom Python environment. You can also use PyODPS 2 nodes or PyODPS 3 nodes, which support more Python resources for development.
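The following minimal sketch illustrates the kind of PySpark code that you can write in a Python resource when you use the default Python environment. The table name test_table is a hypothetical placeholder; replace it with a table in your own MaxCompute project.

```python
# -*- coding: utf-8 -*-
# Minimal PySpark sketch for a MaxCompute Spark node (default Python environment).
# The table name "test_table" is a hypothetical placeholder.
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("pyspark demo") \
        .config("spark.sql.catalogImplementation", "odps") \
        .getOrCreate()

    # Read a few rows from an existing MaxCompute table and print them.
    df = spark.sql("SELECT * FROM test_table LIMIT 10")
    df.show()
```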
Programming language: Python (Use a custom Python environment)
If the default Python environment cannot meet your business requirements, you can perform the following steps to prepare a custom Python environment to run your Spark on MaxCompute task.
Prepare a Python environment on your on-premises machine.
You can refer to PySpark Python versions and supported dependencies to configure a Python environment based on your business requirements.
Package the code for the Python environment and upload the package to DataWorks.
You must package the Python environment into a ZIP file and upload the package to DataWorks as a MaxCompute resource. This way, the Spark on MaxCompute task can run in this environment.
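The following configuration sketch shows one way to point a Spark on MaxCompute task at an uploaded Python environment. The archive name py3env.zip and the interpreter path inside it are hypothetical; the exact configuration keys, archive layout, and paths depend on how you build the environment, so verify them against PySpark Python versions and supported dependencies.

```
# Hypothetical example: use the interpreter inside the uploaded py3env.zip archive.
spark.pyspark.python=./py3env.zip/py3env/bin/python3
```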
Parameter descriptions
You can run offline Spark on MaxCompute tasks in cluster mode in DataWorks. In this mode, you must specify the Main method as the entry point of your custom application. The Spark job ends when the Main method succeeds or fails. You must add the configuration items from the spark-defaults.conf file, such as the number of executors, the memory size, and spark.hadoop.odps.runtime.end.point, to the configuration of the MaxCompute Spark node.
You do not need to upload the spark-defaults.conf file. Instead, you must add the configuration items from the file to the configuration of the MaxCompute Spark node one by one.
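The following sketch shows typical configuration items that you might add to the Configuration Items field of a MaxCompute Spark node. The values are illustrative placeholders; choose them based on your job, and replace the endpoint placeholder with the endpoint of your MaxCompute project.

```
# Example configuration items for a MaxCompute Spark node (illustrative values only).
spark.executor.instances=2
spark.executor.cores=2
spark.executor.memory=4g
spark.driver.memory=4g
spark.hadoop.odps.runtime.end.point=<endpoint of your MaxCompute project>
```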
Parameters for Java/Scala

Parameter | Description | spark-submit command
Spark Version | The version of Spark. Valid values: Spark1.x, Spark2.x, and Spark3.x. Note: If an error is reported when you commit a MaxCompute Spark node that uses the Spark 3.x version, purchase a serverless resource group. For more information, see Create and use a serverless resource group. | None
Language | The programming language. Valid values: Java/Scala and Python. You can select a programming language based on your business requirements. | None
Main JAR Resource | The main JAR resource file. You must upload the required resource file to DataWorks and commit the resource file in advance. | app jar
Configuration Items | The configuration items that are required to submit the Spark on MaxCompute task. | --conf PROP=VALUE
Main Class | The name of the main class. This parameter is required only if you set the Language parameter to Java/Scala. | --class CLASS_NAME
Parameters | The parameters that you want to add based on your business requirements. Separate multiple parameters with spaces. DataWorks allows you to add scheduling parameters in the ${Variable} format. | Application arguments
JAR Resources | You can select this type of resource only if you set the Language parameter to Java/Scala. You must upload the required resource file to DataWorks and commit the resource file in advance. | --jars JARS
File Resources | File resources. | --files FILES
Archive Resources | Only compressed resources are displayed. | --archives ARCHIVES
Parameters for Python

Parameter | Description | spark-submit command
Spark Version | The version of Spark. Valid values: Spark1.x, Spark2.x, and Spark3.x. Note: If an error is reported when you commit a MaxCompute Spark node that uses the Spark 3.x version, purchase a serverless resource group. For more information, see Create and use a serverless resource group. | None
Language | The programming language used by the node. Set this parameter to Python. | None
Main Python Resource | The main Python resource file. You must upload the required resource file to DataWorks and commit the resource file in advance. | app python file
Configuration Items | The configuration items that are required to submit the Spark on MaxCompute task. | --conf PROP=VALUE
Parameters | The parameters that you want to add based on your business requirements. Separate multiple parameters with spaces. DataWorks allows you to add scheduling parameters in the ${Variable} format. | Application arguments
Python Resources | You can select this type of resource only if you set the Language parameter to Python. You must upload the required resource file to DataWorks and commit the resource file in advance. | --py-files PY_FILES
File Resources | File resources. | --files FILES
Archive Resources | Only compressed resources are displayed. | --archives ARCHIVES
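The values that you specify in the Parameters field are passed to your main resource as application arguments, similar to the arguments of a spark-submit command. The following minimal PySpark sketch assumes that a hypothetical scheduling parameter such as ${bizdate} is passed as the first argument and shows one way to read it in a main Python resource.

```python
# -*- coding: utf-8 -*-
# Minimal sketch: read the arguments passed through the node's Parameters field.
# Assumes a hypothetical scheduling parameter such as ${bizdate} is passed as the first argument.
import sys

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName("read node parameters").getOrCreate()

    # sys.argv[0] is the script path; arguments configured on the node start at index 1.
    bizdate = sys.argv[1] if len(sys.argv) > 1 else None
    print("bizdate argument: {}".format(bizdate))
```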
Procedure
Create a resource.
In the left-side navigation pane of the Data Studio page, click Resource Management. In the RESOURCE MANAGEMENT: ALL pane, click the plus icon and choose Create Resource > MaxCompute Python. In the popover that appears, enter spark_is_number.py as the name of the Python resource. Sample code:

```python
# -*- coding: utf-8 -*-
import sys

from pyspark.sql import SparkSession

try:
    # for python 2
    reload(sys)
    sys.setdefaultencoding('utf8')
except:
    # python 3 not needed
    pass

if __name__ == '__main__':
    spark = SparkSession.builder\
        .appName("spark sql")\
        .config("spark.sql.broadcastTimeout", 20 * 60)\
        .config("spark.sql.crossJoin.enabled", True)\
        .config("odps.exec.dynamic.partition.mode", "nonstrict")\
        .config("spark.sql.catalogImplementation", "odps")\
        .getOrCreate()

    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            pass

        try:
            import unicodedata
            unicodedata.numeric(s)
            return True
        except (TypeError, ValueError):
            pass

        return False

    print(is_number('foo'))
    print(is_number('1'))
    print(is_number('1.3'))
    print(is_number('-1.37'))
    print(is_number('1e3'))
```

Save the resource.
Configure parameters and scheduling properties for the created MaxCompute Spark node. For more information, see the Parameter descriptions section of this topic.
If you want to run the node on a regular basis, configure the scheduling information based on your business requirements.
After the node is configured, deploy the node. For more information, see Deploy nodes.
After you deploy the node, view the running status of the node in Operation Center. For more information, see Getting started with Operation Center.
Note: DataWorks does not provide entry points for you to run MaxCompute Spark nodes in Data Studio. You must run MaxCompute Spark nodes in Operation Center in the development environment.
After the data backfill instance of the MaxCompute Spark node is successfully run, click tracking URL in the generated run logs to view the result.
References
For more information about the development of Spark on MaxCompute tasks in other scenarios, see the following topics:
Learn the FAQ about Spark: Familiarize yourself with common Spark issues so that you can identify and troubleshoot problems efficiently when exceptions occur. For more information, see FAQ about Spark on MaxCompute.
Enable the system to diagnose Spark tasks: MaxCompute provides the Logview tool and Spark Web UI. You can view the logs of Spark tasks to check whether the tasks are submitted and run as expected. For more information, see Perform job diagnostics.