
DataWorks: MaxCompute Spark node

Last Updated: Feb 28, 2025

Spark on MaxCompute tasks can run in local or cluster mode. In DataWorks, you can run offline Spark on MaxCompute tasks in cluster mode and integrate them with other types of nodes for scheduling. This topic describes how to configure and schedule a Spark on MaxCompute task in DataWorks.

Prerequisites

  • (Required if you use a RAM user to develop tasks) The desired RAM user is added to your DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has extensive permissions. We recommend that you assign the Workspace Administrator role to a user only when necessary. For more information about how to add a member, see Add workspace members and assign roles to them.

    Note

    If you use an Alibaba Cloud account, ignore this prerequisite.

  • A MaxCompute Spark node is created. For more information, see Create a node.

Limits

If an error is reported when you commit a MaxCompute Spark node that uses Spark 3.x, you must purchase a serverless resource group. For more information, see Create and use a serverless resource group.

Background information

Spark on MaxCompute is a computing service provided by MaxCompute that is compatible with open source Spark. It provides a Spark computing framework based on unified computing resource and dataset permission systems, allows you to submit and run Spark tasks by using your preferred development method, and can meet diverse data processing and analytics requirements. In DataWorks, you can use MaxCompute Spark nodes to schedule and run Spark on MaxCompute tasks and integrate them with other types of tasks.

Spark on MaxCompute allows you to use Java, Scala, or Python to develop tasks and run the tasks in local or cluster mode. Spark on MaxCompute also allows you to run offline Spark on MaxCompute tasks in cluster mode in DataWorks. For more information about the running modes of Spark on MaxCompute tasks, see Running modes.

Preparations

MaxCompute Spark nodes allow you to use Java, Scala, or Python to develop and run offline Spark on MaxCompute tasks. The operations and parameters that are required for developing the offline Spark on MaxCompute tasks vary based on the programming language that you use. You can select a programming language based on your business requirements.

Programming language: Java/Scala

Before you run Java or Scala code in a MaxCompute Spark node, you must complete the development of code for a Spark on MaxCompute task on your on-premises machine and upload the code to DataWorks as a MaxCompute resource. You must perform the following steps:

  1. Prepare a development environment.

    You must prepare the development environment in which you want to run a Spark on MaxCompute task based on the operating system that you use. For more information, see Set up a Linux development environment or Set up a Windows development environment.

  2. Develop Java or Scala code.

    Before you run Java or Scala code in a MaxCompute Spark node, you must complete the development of code for a Spark on MaxCompute task on your on-premises machine or in the prepared development environment. We recommend that you use the sample project template provided by Spark on MaxCompute.

  3. Package the developed code and upload the code to DataWorks.

    After the code is developed, you must package the code and upload the package to DataWorks as a MaxCompute resource.

Programming language: Python (Use the default Python environment)

DataWorks allows you to develop a PySpark task by writing code in an online Python resource, and then commit and run the code by using a MaxCompute Spark node. You must first create a Python resource in DataWorks. For examples of developing Spark on MaxCompute applications with PySpark, see Develop a Spark on MaxCompute application by using PySpark.

Note

You can develop code in the default Python environment provided by DataWorks. If the third-party packages supported by the default Python environment cannot meet the requirements of your PySpark task, you can prepare a custom Python environment as described in Programming language: Python (Use a custom Python environment). You can also use PyODPS 2 or PyODPS 3 nodes, which support more third-party Python packages.
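Before relying on a third-party package in the default environment, it can help to probe whether the package is importable at all. A minimal sketch (the package names checked are only examples):

```python
import importlib.util

def package_available(name):
    # True if the named top-level package can be imported in this interpreter.
    return importlib.util.find_spec(name) is not None

# Probe a few packages before the PySpark task relies on them.
for pkg in ["numpy", "pandas"]:
    print(pkg, "available:", package_available(pkg))
```

If a required package is missing, prepare a custom Python environment instead.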

Programming language: Python (Use a custom Python environment)

If the default Python environment cannot meet your business requirements, you can perform the following steps to prepare a custom Python environment to run your Spark on MaxCompute task.

  1. Prepare a Python environment on your on-premises machine.

    You can refer to PySpark Python versions and supported dependencies to configure a Python environment based on your business requirements.

  2. Package the Python environment and upload the package to DataWorks.

    You must package the Python environment as a ZIP file and upload the package to DataWorks as a MaxCompute resource. This way, the Spark on MaxCompute task can run in this environment.
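As an illustration, if the environment is uploaded as an archive resource named python3.zip (a hypothetical name) that contains the interpreter at bin/python3, the task can be pointed at the bundled interpreter through a configuration item such as the following (the exact path depends on how you package the environment):

```
spark.pyspark.python=./python3.zip/bin/python3
```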

Parameter descriptions

You can run offline Spark on MaxCompute tasks in cluster mode in DataWorks. In this mode, you must specify the Main method as the entry point of a custom application. The Spark job ends when the Main method exits, regardless of success or failure. You must add the configuration items from the spark-defaults.conf file to the configurations of the MaxCompute Spark node, such as the number of executors, the memory size, and spark.hadoop.odps.runtime.end.point.

Note

You do not need to upload the spark-defaults.conf file. Instead, you must add the configuration items in the spark-defaults.conf file to the configurations of the MaxCompute Spark node one by one.
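For reference, the configuration items for a small task might look like the following (the values are illustrative, and the endpoint depends on your region):

```
spark.executor.instances=4
spark.executor.cores=2
spark.executor.memory=4g
spark.driver.memory=2g
spark.hadoop.odps.runtime.end.point=<endpoint of your MaxCompute region>
```

Each line corresponds to one --conf PROP=VALUE option of spark-submit.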

Parameters for Java/Scala

Spark Version (spark-submit equivalent: none)

The version of Spark. Valid values: Spark1.x, Spark2.x, and Spark3.x.

Note: If an error is reported when you commit a MaxCompute Spark node that uses Spark 3.x, purchase a serverless resource group. For more information, see Create and use a serverless resource group.

Language (spark-submit equivalent: none)

The programming language. Valid values: Java/Scala and Python. Select a programming language based on your business requirements.

Main JAR Resource (spark-submit equivalent: app jar)

The main JAR resource file. You must upload the resource file to DataWorks and commit it in advance.

Configuration Items (spark-submit equivalent: --conf PROP=VALUE)

The configuration items that are required to submit the Spark on MaxCompute task.

  • You do not need to configure spark.hadoop.odps.access.id, spark.hadoop.odps.access.key, or spark.hadoop.odps.end.point. By default, the values of these configuration items are the same as those of the MaxCompute project. You can explicitly configure these items to override the default values if necessary.

  • You do not need to upload the spark-defaults.conf file. Instead, add the configuration items from the spark-defaults.conf file to the configurations of the node one by one, such as the number of executors, the memory size, and spark.hadoop.odps.runtime.end.point.

Main Class (spark-submit equivalent: --class CLASS_NAME)

The name of the main class. This parameter is required only if you set the Language parameter to Java/Scala.

Parameters (spark-submit equivalent: [app arguments])

The arguments passed to the application. Add arguments based on your business requirements and separate multiple arguments with spaces. DataWorks supports scheduling parameters in the ${Variable name} format. After you add such parameters, click the Properties tab in the right-side navigation pane and assign values to the variables in the Scheduling Parameters section.

JAR Resources (spark-submit equivalent: --jars JARS)

Additional JAR resources. You can select this type of resource only if you set the Language parameter to Java/Scala. You must upload the resource files to DataWorks and commit them in advance.

File Resources (spark-submit equivalent: --files FILES)

File resources.

Archive Resources (spark-submit equivalent: --archives ARCHIVES)

Only compressed resources are displayed.

Parameters for Python

Spark Version (spark-submit equivalent: none)

The version of Spark. Valid values: Spark1.x, Spark2.x, and Spark3.x.

Note: If an error is reported when you commit a MaxCompute Spark node that uses Spark 3.x, purchase a serverless resource group. For more information, see Create and use a serverless resource group.

Language (spark-submit equivalent: none)

The programming language used by the node. Set this parameter to Python.

Main Python Resource (spark-submit equivalent: Python file)

The main Python resource file. You must upload the resource file to DataWorks and commit it in advance.

Configuration Items (spark-submit equivalent: --conf PROP=VALUE)

The configuration items that are required to submit the Spark on MaxCompute task.

  • You do not need to configure spark.hadoop.odps.access.id, spark.hadoop.odps.access.key, or spark.hadoop.odps.end.point. By default, the values of these configuration items are the same as those of the MaxCompute project. You can explicitly configure these items to override the default values if necessary.

  • You do not need to upload the spark-defaults.conf file. Instead, add the configuration items from the spark-defaults.conf file to the configurations of the node one by one, such as the number of executors, the memory size, and spark.hadoop.odps.runtime.end.point.

Parameters (spark-submit equivalent: [app arguments])

The arguments passed to the application. Add arguments based on your business requirements and separate multiple arguments with spaces. DataWorks supports scheduling parameters in the ${Variable name} format. After you add such parameters, click the Properties tab in the right-side navigation pane and assign values to the variables in the Scheduling Parameters section.

Python Resources (spark-submit equivalent: --py-files PY_FILES)

Additional Python resources. You can select this type of resource only if you set the Language parameter to Python. You must upload the resource files to DataWorks and commit them in advance.

File Resources (spark-submit equivalent: --files FILES)

File resources.

Archive Resources (spark-submit equivalent: --archives ARCHIVES)

Only compressed resources are displayed.
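The values in the Parameters field reach the main Python resource as ordinary command-line arguments. A minimal sketch of how a main Python resource might read them (the argument names are illustrative, not part of the DataWorks API):

```python
# -*- coding: utf-8 -*-
import sys

def parse_args(argv):
    """Expect two positional arguments, for example "20250101 result_table".

    A scheduling parameter such as ${bizdate} in the Parameters field is
    resolved by DataWorks before the value reaches sys.argv here.
    """
    if len(argv) < 3:
        raise SystemExit("usage: main.py <bizdate> <output_table>")
    return argv[1], argv[2]

if __name__ == "__main__":
    bizdate, output_table = parse_args(sys.argv)
    print("bizdate:", bizdate)
    print("output table:", output_table)
```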

Procedure

  1. Create a resource.

    1. In the left-side navigation pane of the Data Studio page, click Resource Management. In the RESOURCE MANAGEMENT: ALL pane, click the plus icon and choose Create Resource > MaxCompute Python. In the popover that appears, enter spark_is_number.py as the name of the Python resource. Sample code:

      # -*- coding: utf-8 -*-
      import sys
      from pyspark.sql import SparkSession
      
      try:
          # Python 2: make sure stdout can handle UTF-8.
          reload(sys)
          sys.setdefaultencoding('utf8')
      except NameError:
          # Python 3: reload() is not a built-in, and UTF-8 is the default.
          pass
      
      def is_number(s):
          """Return True if the string can be parsed as a number."""
          try:
              float(s)
              return True
          except ValueError:
              pass
      
          try:
              import unicodedata
              unicodedata.numeric(s)
              return True
          except (TypeError, ValueError):
              pass
      
          return False
      
      if __name__ == '__main__':
          spark = SparkSession.builder\
              .appName("spark sql")\
              .config("spark.sql.broadcastTimeout", 20 * 60)\
              .config("spark.sql.crossJoin.enabled", True)\
              .config("odps.exec.dynamic.partition.mode", "nonstrict")\
              .config("spark.sql.catalogImplementation", "odps")\
              .getOrCreate()
      
          print(is_number('foo'))
          print(is_number('1'))
          print(is_number('1.3'))
          print(is_number('-1.37'))
          print(is_number('1e3'))
    2. Save the resource.

  2. Configure parameters and scheduling properties for the created MaxCompute Spark node. For more information, see the Parameter descriptions section of this topic.

  3. If you want to run the node on a regular basis, configure the scheduling information based on your business requirements.

  4. After the node is configured, deploy the node. For more information, see Deploy nodes.

  5. After you deploy the node, view the running status of the node in Operation Center. For more information, see Getting started with Operation Center.

    Note
    • DataWorks does not provide entry points for you to run MaxCompute Spark nodes in Data Studio. You must run MaxCompute Spark nodes in Operation Center in the development environment.

    • After the data backfill instance of the MaxCompute Spark node is successfully run, click tracking URL in the generated run logs to view the result.

References