Develop a MaxCompute Spark task

You can run Spark on MaxCompute tasks in local or cluster mode. You can also run offline Spark on MaxCompute tasks in cluster mode in DataWorks to integrate the tasks with other types of nodes for scheduling. This topic describes how to configure and schedule a Spark on MaxCompute task in DataWorks.

Overview

Spark on MaxCompute is a MaxCompute computing service compatible with open-source Spark. It provides a Spark computing framework on top of unified compute resources and a dataset permission system. This lets you submit and run Spark tasks by using familiar development methods to meet a wider range of data processing and analysis needs. In DataWorks, you can use a MaxCompute Spark node to schedule and run Spark on MaxCompute tasks and integrate them with other tasks.

Spark on MaxCompute supports development in Java, Scala, and Python, and runs tasks in either local mode or cluster mode. In DataWorks, Spark on MaxCompute offline tasks run in cluster mode. For more information about the run modes, see Run modes.

Limits

If an error is reported when you commit an ODPS Spark node that uses Spark 3.x, purchase a serverless resource group. For more information, see Create and use a serverless resource group.

Preparations

You can use a MaxCompute Spark node to run a Spark on MaxCompute offline task in Java/Scala or Python. The development steps and configuration process differ for each language. Choose a language based on your business requirements.

Java/Scala

Before you run Java or Scala code in an ODPS Spark node, you must complete the development of code for a Spark on MaxCompute task on your on-premises machine and upload the code to DataWorks as a MaxCompute resource. You must perform the following steps:

Prepare a development environment.

Set up the development environment for your Spark on MaxCompute task based on the operating system that you use. For more information, see Set up a Linux development environment or Set up a Windows development environment.
Develop Java or Scala code.

Develop the code for the Spark on MaxCompute task on your on-premises machine or in the prepared development environment. We recommend that you use the sample project template provided by Spark on MaxCompute.
Package the developed code and upload the code to DataWorks.

After the code is developed, you must package the code and upload the package to DataWorks as a MaxCompute resource. For more information, see Create and use MaxCompute resources.

Programming language: Python (Use the default Python environment)

In DataWorks, you can develop a PySpark task by writing code in a Python resource online, and then commit and run the code by using an ODPS Spark node. For information about how to create a Python resource in DataWorks, see Create and use MaxCompute resources. For examples of developing Spark on MaxCompute applications by using PySpark, see Develop a Spark on MaxCompute application by using PySpark.

Note

You can use the default Python environment provided by DataWorks to develop code. If third-party packages that are supported by the default Python environment cannot meet the requirements of the PySpark task, you can refer to Programming language: Python (Use a custom Python environment) to prepare a custom Python environment. You can also use PyODPS 2 nodes or PyODPS 3 nodes, which support more Python resources for the development.
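
One way to decide whether the default environment is sufficient is to check which modules your task needs but cannot import. The following is a minimal sketch, not a DataWorks feature: the function name `missing_modules` and the module list are hypothetical, and the check only handles top-level module names.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list: replace it with the top-level modules your task imports.
    required = ["json", "numpy", "pandas"]
    print("Missing modules:", missing_modules(required))
```

If the list that is printed is not empty, a custom Python environment or a PyODPS node may be the better fit.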

Programming language: Python (Use a custom Python environment)

If the default Python environment cannot meet your business requirements, you can perform the following steps to prepare a custom Python environment to run your Spark on MaxCompute task.

Prepare a Python environment on your on-premises machine.

You can refer to PySpark Python versions and supported dependencies to configure a Python environment based on your business requirements.
Package the Python environment and upload the package to DataWorks.

You must package the Python environment in the ZIP format and upload the package to DataWorks as a MaxCompute resource. This way, you can run the Spark on MaxCompute task in the environment. For more information, see Create and use MaxCompute resources.
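
The packaging step can be scripted. The sketch below uses only the Python standard library and assumes that the archive should keep the environment's top-level folder. That layout, the function name `package_environment`, and the folder name `python-env` are illustrative assumptions, so check the layout that your task configuration expects.

```python
import os
import shutil
import tempfile
import zipfile


def package_environment(env_dir, archive_base):
    """Zip env_dir, keeping its top-level folder, and return the archive path."""
    env_dir = os.path.abspath(env_dir)
    parent = os.path.dirname(env_dir)
    folder = os.path.basename(env_dir)
    # make_archive appends ".zip" to archive_base and returns the full path.
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=folder)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        # Build a tiny stand-in for a real Python environment (hypothetical layout).
        env_dir = os.path.join(work, "python-env")
        os.makedirs(os.path.join(env_dir, "bin"))
        with open(os.path.join(env_dir, "bin", "python"), "w") as f:
            f.write("placeholder")

        archive = package_environment(env_dir, os.path.join(work, "python-env"))
        with zipfile.ZipFile(archive) as zf:
            print(zf.namelist())
```

The resulting ZIP file is what you would upload to DataWorks as a MaxCompute resource.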

Descriptions of parameters

DataWorks runs Spark on MaxCompute offline tasks in cluster mode. In this mode, you must specify the main method as the entry point of your custom application. The Spark task ends when the main method finishes execution, and its final status is Success or Fail. In addition, you must add each configuration from the spark-defaults.conf file individually to the configuration items of the MaxCompute Spark node. Examples include the number of executor instances, the memory size, and the spark.hadoop.odps.runtime.end.point configuration.

Note

You do not need to upload the spark-defaults.conf file. Instead, add each configuration from the spark-defaults.conf file individually as a configuration item for the MaxCompute Spark node.
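
If you already maintain a spark-defaults.conf file, a short script can list its entries so that you can copy them into the node one by one. This is a sketch under stated assumptions: the parser accepts either whitespace or `=` between key and value, and the sample values, including the endpoint URL, are placeholders rather than recommended settings.

```python
import re

# A key, then "=" or whitespace as the separator, then the value.
_LINE = re.compile(r"^([^\s=]+)(?:\s*=\s*|\s+)(.*)$")


def parse_spark_defaults(text):
    """Return (key, value) pairs from spark-defaults.conf-style text."""
    items = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and comments.
        match = _LINE.match(line)
        if match:
            items.append((match.group(1), match.group(2).strip()))
    return items


# Placeholder content for illustration only.
SAMPLE = """
# Resources
spark.executor.instances 2
spark.executor.memory 4g
spark.hadoop.odps.runtime.end.point http://example.invalid/api
"""

if __name__ == "__main__":
    for key, value in parse_spark_defaults(SAMPLE):
        print("{} = {}".format(key, value))
```

Each printed pair corresponds to one configuration item to add to the MaxCompute Spark node.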

Parameter	Description	spark-submit command
Spark Version	The version of Spark. Valid values: Spark1.x, Spark2.x, and Spark3.x. Note: If an error is reported when you commit an ODPS Spark node that uses Spark 3.x, purchase a serverless resource group. For more information, see Create and use a serverless resource group.	None
Language	Select Java/Scala or Python based on the development language of your Spark on MaxCompute task.	None
Main JAR Resource	The main JAR or Python resource file. You must upload the required resource file to DataWorks and commit the resource file in advance. For more information, see Create and use MaxCompute resources.	`app jar or Python file`
Configuration Item	The configuration items that are required to submit the Spark on MaxCompute task. You do not need to configure `spark.hadoop.odps.access.id` (AccessKey ID), `spark.hadoop.odps.access.key` (AccessKey Secret), or `spark.hadoop.odps.end.point` (endpoint). The platform injects the values from the current MaxCompute project by default. You can explicitly configure them to override the default values. You do not need to upload a `spark-defaults.conf` file. Instead, add each configuration from the `spark-defaults.conf` file individually as a configuration item for the MaxCompute Spark node, such as the number of executors, memory size, and the `spark.hadoop.odps.runtime.end.point` configuration.	`--conf PROP=VALUE`
Main Class	The name of the main class. This parameter is required when the Language is set to `Java/Scala`.	`--class CLASS_NAME`
Parameter	The arguments for your application, separated by spaces. Add them as needed. DataWorks supports scheduling parameters in the ${variable_name} format. After you configure variables in the Parameter field, assign values to them under Scheduling Settings > Parameter in the right-side navigation pane. Note: For information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters.	`[app arguments]`
Other resources	Additional resources that the task needs. Select resources of the following types based on your business requirements: JAR resource (available only when Language is set to `Java/Scala`), Python resource (available only when Language is set to `Python`), File resource, and Archive resource (only compressed resources in ZIP format are displayed). You must upload the required resource files to DataWorks and commit them in advance. For more information, see Create and use MaxCompute resources.	The option depends on the resource type: `--jars JARS`, `--py-files PY_FILES`, `--files FILES`, or `--archives ARCHIVES`
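
To make the third column of the table concrete, the following sketch assembles the spark-submit argument list that a set of node parameters corresponds to. It is only a mental model: DataWorks builds and submits the command itself, and the function name `build_spark_submit_args` and all sample values are hypothetical.

```python
def build_spark_submit_args(main_resource, main_class=None, conf=None,
                            app_args=None, jars=None, py_files=None,
                            files=None, archives=None):
    """Return a spark-submit argument list that mirrors the node parameters."""
    args = ["spark-submit"]
    if main_class:
        # Main Class: needed for Java/Scala applications.
        args += ["--class", main_class]
    for key, value in (conf or {}).items():
        # Configuration Item: one --conf option per entry.
        args += ["--conf", "{}={}".format(key, value)]
    resource_options = (("--jars", jars), ("--py-files", py_files),
                        ("--files", files), ("--archives", archives))
    for option, resources in resource_options:
        if resources:
            # Other resources: spark-submit takes a comma-separated list.
            args += [option, ",".join(resources)]
    args.append(main_resource)       # Main JAR or Python resource.
    args += list(app_args or [])     # Parameter: application arguments.
    return args


if __name__ == "__main__":
    # Sample values for illustration only.
    print(" ".join(build_spark_submit_args(
        "spark_is_number.py",
        conf={"spark.executor.instances": "2"},
        app_args=["20240101"],
    )))
```

Options come before the main resource and application arguments come after it, which matches the order that spark-submit expects.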

Simple code editing example

This section provides a simple example to show how to use an ODPS Spark node to develop a Spark on MaxCompute task. In this example, a Spark on MaxCompute task is developed to determine whether a string can be converted into digits.

Create a resource.

On the DataStudio page, create a new Python resource and name it spark_is_number.py. For more information, see Create and use MaxCompute resources. Use the following code:

# -*- coding: utf-8 -*-
import sys
import unicodedata

from pyspark.sql import SparkSession

try:
    # Python 2 only: make UTF-8 the default string encoding.
    reload(sys)
    sys.setdefaultencoding('utf8')
except NameError:
    # Python 3: reload() is not a builtin, and no change is needed.
    pass


def is_number(s):
    """Return True if the string s can be interpreted as a number."""
    try:
        # Covers integers, decimals, signs, and scientific notation.
        float(s)
        return True
    except ValueError:
        pass
    try:
        # Covers a single Unicode numeric character.
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False


if __name__ == '__main__':
    spark = SparkSession.builder\
        .appName("spark sql")\
        .config("spark.sql.broadcastTimeout", 20 * 60)\
        .config("spark.sql.crossJoin.enabled", True)\
        .config("odps.exec.dynamic.partition.mode", "nonstrict")\
        .config("spark.sql.catalogImplementation", "odps")\
        .getOrCreate()

    print(is_number('foo'))
    print(is_number('1'))
    print(is_number('1.3'))
    print(is_number('-1.37'))
    print(is_number('1e3'))

Save and commit the resource.

In the ODPS Spark node that you created, configure the parameters and scheduling properties for the MaxCompute Spark task as described in the Descriptions of parameters section in this topic, and then save and commit the node. This example uses the following settings:

Parameter	Description
Spark Version	Select Spark2.x.
Language	Select Python.
Main Python Resource	Select the Python resource spark_is_number.py that you created.

Go to Operation Center in the development environment to backfill data for the ODPS Spark node. For more information, see Backfill data and view data backfill instances (new version).

Note

DataWorks does not provide entry points for you to run ODPS Spark nodes in DataStudio. You must run ODPS Spark nodes in Operation Center in the development environment.

View the result.

After the data backfill instance is successfully run, click tracking URL in the run logs that are generated to view the result. The following information is returned:
```
False
True
True
True
True
```
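
The returned values follow from how `float()` and `unicodedata.numeric()` parse text. The sketch below repeats the `is_number` function from the resource without the Spark session, so that it runs on its own, and applies it to a few extra inputs chosen for illustration. It assumes Python 3, and the extra inputs are not part of the original example.

```python
import unicodedata


def is_number(s):
    """Same logic as in spark_is_number.py, repeated so this block runs alone."""
    try:
        float(s)
        return True
    except ValueError:
        pass
    try:
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False


# Illustrative inputs (Python 3 str values).
cases = [
    " 42 ",     # True: float() ignores surrounding whitespace
    "nan",      # True: float() accepts "nan"
    "\u00bd",   # True: the single character for one half has a numeric value
    "\u56db",   # True: the CJK character for four has a numeric value
    "1,000",    # False: float() rejects the comma, and numeric() needs one character
    "",         # False: an empty string is rejected by both checks
]

if __name__ == "__main__":
    for text in cases:
        print(repr(text), is_number(text))
```

If your task must reject values such as `nan`, add an explicit check, because this function treats anything that `float()` accepts as a number.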

Advanced code editing examples

For more information about the development of Spark on MaxCompute tasks in other scenarios, see the following topics:

What to do next

After you complete the development of the Spark on MaxCompute task, you can perform the following operations:

Scheduling configuration: Configure periodic scheduling properties such as rerun settings and dependencies for tasks that run regularly. For more information, see Overview of task scheduling configuration.
Task debugging: Test and run the node code to verify its logic. For more information, see Task debugging process.
Task deployment: Deploy nodes to run them periodically based on their scheduling configurations. For more information, see Deploy tasks.

Spark task diagnosis: MaxCompute provides the Logview tool and Spark Web UI. You can use them to view the logs of Spark tasks and check whether the tasks are submitted and run as expected.