MaxCompute: Create an ODPS Spark node

Last Updated: Jul 27, 2023

Spark on MaxCompute allows you to run MaxCompute Spark jobs in local or cluster mode. You can also run offline MaxCompute Spark jobs in cluster mode in DataWorks to integrate the jobs with other types of nodes for scheduling. This topic describes how to configure and schedule an ODPS Spark node in DataWorks.

Prerequisites

  • A MaxCompute compute engine is associated with a workspace as a compute engine instance. The MaxCompute folder is displayed on the DataStudio page only after you associate a MaxCompute compute engine with a workspace on the Workspace page. For more information, see Associate a MaxCompute compute engine with a workspace.
  • A workflow is created. The workflow is used to store the node that you want to create, and you must create the workflow before you create the node. For more information, see Create a workflow.

Background information

Spark on MaxCompute is a computing service that is provided by MaxCompute and is compatible with open source Spark. Spark on MaxCompute provides a Spark computing framework by integrating computing resources, datasets, and permission systems. Spark on MaxCompute allows you to use your preferred development method to submit and run Spark jobs. Spark on MaxCompute can meet diverse data processing and analytics requirements. In DataWorks, you can use ODPS Spark nodes to schedule and run MaxCompute Spark jobs and integrate MaxCompute Spark jobs with other types of jobs.

Spark on MaxCompute allows you to use Java, Scala, or Python to develop MaxCompute Spark jobs and run the jobs in local or cluster mode. Spark on MaxCompute also allows you to run offline MaxCompute Spark jobs in cluster mode in DataWorks. For more information about the running modes of MaxCompute Spark jobs, see Running modes.

Develop an ODPS Spark node

ODPS Spark nodes allow you to use Java, Scala, or Python to run offline MaxCompute Spark jobs. The operations and parameters that are required for development vary based on the programming language that you use. You can select a programming language based on your business requirements.

Programming language: Java or Scala

Before you run Java or Scala code in an ODPS Spark node, you must develop the code for the MaxCompute Spark job on your on-premises machine and upload the code to DataWorks as a MaxCompute resource. Perform the following steps:
  1. Prepare a development environment.

    You must prepare the development environment in which you want to run an ODPS Spark node based on the operating system that you use. For more information, see Set up a Linux development environment or Set up a Windows development environment.

  2. Develop Java or Scala code.

    Develop the code for the MaxCompute Spark job on your on-premises machine or in the prepared development environment. We recommend that you use the sample project template provided by Spark on MaxCompute.

  3. Package the developed code and upload the code to DataWorks.

    After the code is developed, you must package the code and upload the package to DataWorks as a MaxCompute resource. For more information, see Create and use MaxCompute resources.

Subsequent operation: Create an ODPS Spark node and run the node. For more information, see Create an ODPS Spark node and run the node.

Programming language: Python (Use the default Python environment)

DataWorks allows you to develop a PySpark job by writing code in a Python resource online in DataWorks and then committing and running the code by using an ODPS Spark node. For information about how to create a Python resource in DataWorks and for examples of developing Spark on MaxCompute applications by using PySpark, see Create and use MaxCompute resources and Develop a Spark on MaxCompute application by using PySpark.
Note You can use the default Python environment provided by DataWorks to develop code. If the third-party packages that are supported by the default Python environment cannot meet the requirements of your PySpark job, you can refer to Programming language: Python (Use a custom Python environment) to prepare a custom Python environment. You can also use PyODPS 2 or PyODPS 3 nodes, which support more Python resources for development.
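For reference, the following minimal sketch shows what a PySpark job written in a Python resource can look like. The table name and data are hypothetical and are used only for illustration.

    from pyspark.sql import SparkSession

    # Create the SparkSession. On Spark on MaxCompute, the session picks up the
    # configuration items that you specify for the ODPS Spark node.
    spark = SparkSession.builder.appName("spark_sql_example").getOrCreate()

    # The table name is hypothetical and is used only for illustration.
    spark.sql("DROP TABLE IF EXISTS spark_sql_test_table")
    spark.sql("CREATE TABLE spark_sql_test_table (name STRING, num BIGINT)")
    spark.sql("INSERT INTO spark_sql_test_table SELECT 'abc', 100000")
    spark.sql("SELECT name, num FROM spark_sql_test_table").show()

After you commit the Python resource, you can select it as the main resource file when you configure the ODPS Spark node.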
Subsequent operation: Create an ODPS Spark node and run the node. For more information, see Create an ODPS Spark node and run the node.

Programming language: Python (Use a custom Python environment)

If the default Python environment cannot meet your business requirements, you can perform the following steps to prepare a custom Python environment to run your MaxCompute Spark job. A brief example of a job that depends on such an environment is shown after the steps.
  1. Prepare a Python environment on your on-premises machine.

    You can refer to PySpark Python versions and supported dependencies to configure a Python environment based on your business requirements.

  2. Package the Python environment and upload the package to DataWorks.

    You must package the Python environment files in the ZIP format and upload the package to DataWorks as a MaxCompute resource. This way, you can run the MaxCompute Spark job in this environment. For more information, see Create and use MaxCompute resources.
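For example, the following minimal sketch assumes that the third-party package numpy is included in the uploaded custom environment. A job like this can run only after the custom environment is prepared and referenced by the node.

    from pyspark.sql import SparkSession
    import numpy as np  # third-party package assumed to be shipped in the custom environment

    spark = SparkSession.builder.appName("custom_env_example").getOrCreate()

    # Apply a NumPy function on the executors. This works only if the Python
    # interpreter that the executors use provides numpy, which is the purpose
    # of uploading a custom Python environment.
    rdd = spark.sparkContext.parallelize(range(100))
    total = rdd.map(lambda x: float(np.sqrt(x))).sum()
    print(total)

For the configuration items that are used to reference the uploaded environment, see PySpark Python versions and supported dependencies.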

Subsequent operation: Create an ODPS Spark node and run the node. For more information, see Create an ODPS Spark node and run the node.

Create an ODPS Spark node and run the node

Step 1: Go to the entry point for creating an ODPS Spark node

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides. Find your workspace and click DataStudio in the Actions column.
  2. Go to the entry point for creating an ODPS Spark node.
    In the Scheduled Workflow pane of the DataStudio page, find the required workflow and create an ODPS Spark node in the workflow. Then, configure basic information for the node, such as the name and storage path.
    Figure: Entry points for creating an ODPS Spark node and the creation process.

Step 2: Configure parameters for the ODPS Spark node

You can run offline MaxCompute Spark jobs in cluster mode. In this mode, you must specify the Main method as the entry point of the custom application. The Spark job ends when the Main method succeeds or fails.
Note You do not need to upload the spark-defaults.conf file. Instead, you must add the configuration items in the spark-defaults.conf file, such as the number of executors, the memory size, and spark.hadoop.odps.runtime.end.point, to the configurations of the ODPS Spark node one at a time.
The following table describes the parameters of the ODPS Spark node and the spark-submit commands that they correspond to.

Parameter: Spark Version
Description: The version of Spark. Valid values: Spark1.x, Spark2.x, and Spark3.x.
spark-submit command: None

Parameter: Language
Description: The programming language. Valid values: Java/Scala and Python. You can select a programming language based on your business requirements.
spark-submit command: None

Parameter: Main JAR Resource
Description: The main JAR or Python resource file. You must upload the required resource file to DataWorks and commit the resource file in advance. For more information, see Create and use MaxCompute resources.
spark-submit command: app jar or Python file

Parameter: Configuration Items
Description: The configuration items that are required to submit the MaxCompute Spark job.
  • You do not need to configure spark.hadoop.odps.access.id, spark.hadoop.odps.access.key, and spark.hadoop.odps.end.point. By default, the values of these configuration items are the same as those of the MaxCompute project. You can also explicitly configure these items to overwrite their default values if necessary.
  • You do not need to upload the spark-defaults.conf file. Instead, you must add the configuration items in the spark-defaults.conf file, such as the number of executors, the memory size, and spark.hadoop.odps.runtime.end.point, to the configurations of the ODPS Spark node one at a time.
spark-submit command: --conf PROP=VALUE

Parameter: Main Class
Description: The name of the main class. This parameter is required only if you set the Language parameter to Java/Scala.
spark-submit command: --class CLASS_NAME

Parameter: Parameters
Description: You can add parameters based on your business requirements. Separate multiple parameters with spaces. DataWorks allows you to add scheduling parameters in the ${Variable name} format. After the parameters are added, you must click the Properties tab in the right-side navigation pane and assign values to the related variables in the Parameters section.
Note For information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters.
spark-submit command: [app arguments]

Parameter: Other resources
Description: The following types of resources are also supported. You can select resources based on your business requirements.
  • Jar resource: You can select this type of resource only if you set the Language parameter to Java/Scala.
  • Python resource: You can select this type of resource only if you set the Language parameter to Python.
  • File resource
  • Archive resource: Only compressed resources are displayed.
You must upload the required resource files to DataWorks and commit the resource files in advance. For more information, see Create and use MaxCompute resources.
spark-submit commands for the different types of resources:
  • --jars JARS
  • --py-files PY_FILES
  • --files FILES
  • --archives ARCHIVES
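The Configuration Items and Parameters fields correspond to the --conf options and application arguments of spark-submit. The following minimal sketch, provided for illustration only, sets a few common configuration items programmatically and reads one application argument; the property values and the ${bizdate} argument are assumptions, not required settings. In DataWorks, you enter the key=value pairs in the Configuration Items field and the arguments in the Parameters field instead of hard-coding them.

    import sys

    from pyspark.sql import SparkSession

    # Example configuration items with placeholder values. In DataWorks, add each
    # item as key=value in the Configuration Items field of the ODPS Spark node.
    conf_items = {
        "spark.executor.instances": "4",
        "spark.executor.memory": "4g",
        "spark.hadoop.odps.runtime.end.point": "<runtime endpoint of your region>",
    }

    builder = SparkSession.builder.appName("conf_and_args_example")
    for key, value in conf_items.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # [app arguments]: values entered in the Parameters field, separated by spaces,
    # arrive as command-line arguments. A scheduling parameter such as ${bizdate}
    # is replaced with its assigned value before the job runs.
    bizdate = sys.argv[1] if len(sys.argv) > 1 else None
    print("bizdate:", bizdate)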

Step 3: Commit and deploy the ODPS Spark node

After you deploy the ODPS Spark node to the production environment, DataWorks automatically schedules the node. To commit and deploy the node, perform the following steps:

  1. On the configuration tab of the ODPS Spark node, click the Properties tab in the right-side navigation pane to configure scheduling properties for the node. For more information, see Overview.
  2. Save and commit the ODPS Spark node.
    Important Before you commit the node, you must configure the Rerun and Parent Nodes parameters on the Properties tab.
    1. Click the Save icon in the top toolbar to save the node.
    2. Click the Submit icon in the top toolbar. In the Commit Node dialog box, enter information in the Change description field and click OK to commit the ODPS Spark node.
      After the ODPS Spark node is committed, DataWorks automatically schedules the node based on the configurations.
    If the workspace that you use is in standard mode, you must click Deploy in the upper part to deploy the ODPS Spark node after you commit the node. For more information, see Deploy nodes.

Best practices

For information about the development of MaxCompute Spark jobs in other scenarios, see the best practices topics in the MaxCompute documentation.

What to do next

  • Perform O&M on the ODPS Spark node: After the ODPS Spark node is deployed to the production environment, you can perform O&M operations on the node in Operation Center.
  • Diagnose the Spark job: MaxCompute provides the Logview tool and the Spark web UI. You can view the logs of the Spark job to check whether the job is submitted and run as expected.