An ODPS Spark node runs Spark applications that are written in Java, Scala, or Python to process data. This topic describes how to create and configure an ODPS Spark node.

Background information

Python resources can be referenced in Python user-defined functions (UDFs). However, the use of Python resources is limited because only a small number of dependent third-party packages are available in that environment.
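For example, after a Python resource is uploaded, it can be imported as a module in a Python UDF. The following is a minimal sketch; the resource name string_utils.py and its normalize() function are hypothetical and must be replaced with your own resource and logic:

    # Minimal sketch of a Python UDF that imports an uploaded Python resource.
    # The resource string_utils.py and its normalize() function are hypothetical.
    from odps.udf import annotate

    import string_utils  # module provided by the Python resource string_utils.py

    @annotate("string->string")
    class Normalize(object):
        def evaluate(self, s):
            # Delegate to the helper module that ships with the resource.
            return string_utils.normalize(s)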

PyODPS 2 and PyODPS 3 nodes support Python resources. For more information, see Create a PyODPS 2 node and Create a PyODPS 3 node.

This topic describes how to create JAR and Python resources, upload them based on your business requirements, and then reference them when you create and configure an ODPS Spark node.

Create a JAR resource

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region in which the workspace that you want to manage resides. Find the workspace and click DataStudio in the Actions column.
  2. On the DataStudio page, move the pointer over the Create icon and choose MaxCompute > Resource > JAR.
    Alternatively, you can find your desired workflow, right-click the workflow name, and then choose Create > MaxCompute > Resource > JAR.
  3. In the Create Resource dialog box, specify Resource Name and Location.
    Note
    • If multiple MaxCompute compute engine instances are bound to the current workspace, you must select one from the Engine Instance MaxCompute drop-down list.
    • If the selected JAR package has been uploaded from the MaxCompute client, clear Upload to MaxCompute. Otherwise, an error occurs during the upload process. For an example of how a package is uploaded from the client, see the command after these steps.
    • The resource name can be different from the name of the uploaded file.
    • The resource name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.). The name is not case-sensitive. A JAR resource name must end with .jar, and a Python resource name must end with .py.
  4. Click Upload and select the file that you want to upload.
  5. Click Create.
  6. Click the Submit icon in the top toolbar to commit the resource.
  7. In the Commit Node dialog box, enter your comments in the Change description field and click OK.
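If the JAR package is uploaded from the MaxCompute client instead of from the DataWorks console, a single client command registers it as a resource. The following command is a sketch that assumes a local package named spark-examples.jar; the name is hypothetical, and the -f option overwrites an existing resource that has the same name:

    add jar spark-examples.jar -f;

In that case, remember to clear Upload to MaxCompute when you create the resource in DataWorks, as described in the preceding note.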

Create a Python resource

  1. On the DataStudio page, move the pointer over the Create icon and choose MaxCompute > Resource > Python.
    Alternatively, you can find your desired workflow, right-click the workflow name, and then choose Create > MaxCompute > Resource > Python.
  2. In the Create Resource dialog box, specify Resource Name and Location.
    Note
    • If multiple MaxCompute compute engines are associated with the current workspace, you must select one from the Engine Instance MaxCompute drop-down list.
    • The resource name can contain only letters, digits, periods (.), underscores (_), and hyphens (-), and must end with .py.
    • The Python resource that you create can be run only by using Python 2.x or Python 3.x code.
  3. Click Create.
  4. On the node configuration tab, enter the Python code.
    In the following example, the Python code defines the logic for checking whether parameter values are valid numbers, not the logic for processing data.
    # -*- coding: utf-8 -*-
    import sys
    from pyspark.sql import SparkSession

    try:
        # Python 2 only: reset the default encoding to UTF-8.
        reload(sys)
        sys.setdefaultencoding('utf8')
    except NameError:
        # Python 3: reload() is not a built-in, and the default encoding
        # is already UTF-8, so no action is needed.
        pass

    def is_number(s):
        """Return True if the string s can be parsed as a number."""
        try:
            float(s)
            return True
        except ValueError:
            pass

        try:
            # Handle numeric Unicode characters such as full-width digits.
            import unicodedata
            unicodedata.numeric(s)
            return True
        except (TypeError, ValueError):
            pass

        return False

    if __name__ == '__main__':
        # Create a Spark session that uses MaxCompute (ODPS) as the catalog.
        spark = SparkSession.builder\
            .appName("spark sql")\
            .config("spark.sql.broadcastTimeout", 20 * 60)\
            .config("spark.sql.crossJoin.enabled", True)\
            .config("odps.exec.dynamic.partition.mode", "nonstrict")\
            .config("spark.sql.catalogImplementation", "odps")\
            .getOrCreate()

        print(is_number('foo'))
        print(is_number('1'))
        print(is_number('1.3'))
        print(is_number('-1.37'))
        print(is_number('1e3'))
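    After the session is created, Spark SQL statements resolve table names against the MaxCompute project because spark.sql.catalogImplementation is set to odps. The following lines are a usage sketch that could be appended inside the main block; the table name sales_data is hypothetical and must be replaced with a table in your project:

        # Usage sketch: query a MaxCompute table through the ODPS catalog.
        # The table name sales_data is hypothetical.
        df = spark.sql("SELECT * FROM sales_data LIMIT 10")
        df.show()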
  5. Click the Submit icon in the top toolbar to commit the resource.
  6. In the Commit Node dialog box, enter your comments in the Change description field and click OK.

Create a MaxCompute Spark node

  1. On the DataStudio page, move the pointer over the Create icon and choose MaxCompute > ODPS Spark.
    Alternatively, you can find your desired workflow, right-click the workflow name, and then choose Create > MaxCompute > ODPS Spark.
  2. In the Create Node dialog box, configure the Name and Path parameters.
    Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  3. Click Commit.
  4. On the configuration tab of the ODPS Spark node, configure the parameters. For information about ODPS Spark nodes, see Overview.
    Three options are available for the Spark Version parameter, and two options are available for the Language parameter. The parameters displayed on the configuration tab vary based on the value of the Language parameter. You can configure the parameters as prompted.
    • The following parameters are displayed on the configuration tab after you set Language to Java/Scala:
      • Spark Version: The Spark version of the node. Valid values: Spark1.x, Spark2.x, and Spark3.x.
      • Language: The programming language used by the node. Set this parameter to Java/Scala.
      • Main JAR Resource: The main JAR resource that you want to reference for the node. Select the desired JAR resource from the drop-down list.
      • Configuration Items: The additional items that you want to configure for the node. Click Add, and enter a key and a value to add a configuration item. Alternatively, you can select a key from the field that appears. In this case, a value is automatically entered for the key.
      • Main Class: The name of the main class of the node.
      • Parameters: The additional parameters that you want to configure for the node. Separate multiple parameters with spaces. You can configure variables for the node in the ${Variable name} format in the Parameters field of the configuration tab, and assign scheduling parameters to the variables as values in the Parameters section of the Properties tab. For information about the value assignment formats of scheduling parameters, see Supported formats of scheduling parameters.
        Note After you configure the scheduling parameters, you must continue to configure the parameters for the node. The scheduling parameters and node parameters are run in sequence.
      • JAR Resources: The JAR resources that you want to reference for the node. The system displays all uploaded JAR resources. Select the desired JAR resources from the drop-down list.
      • File Resources: The file resources that you want to reference for the node. The system displays all uploaded file resources. Select the desired file resources from the drop-down list.
      • Archive Resources: The archive resources that you want to reference for the node. The system displays all uploaded compressed archive resources. Select the desired archive resources from the drop-down list.
    • The following parameters are displayed on the configuration tab after you set Language to Python:
      • Spark Version: The Spark version of the node. Valid values: Spark1.x, Spark2.x, and Spark3.x.
      • Language: The programming language used by the node. Set this parameter to Python.
      • Main Python Resource: The main Python resource that you want to reference for the node. Select the desired Python resource from the drop-down list.
      • Configuration Items: The additional items that you want to configure for the node. Click Add, and enter a key and a value to add a configuration item. Alternatively, you can select a key from the field that appears. In this case, a value is automatically entered for the key.
      • Parameters: The additional parameters that you want to configure for the node. Separate multiple parameters with spaces. You can configure variables for the node in the ${Variable name} format in the Parameters field of the configuration tab, and assign scheduling parameters to the variables as values in the Parameters section of the Properties tab. For information about the value assignment formats of scheduling parameters, see Supported formats of scheduling parameters. For an example of how the parameters reach your code, see the sketch after this list.
      • Python Resources: The Python resources that you want to reference for the node. The system displays all uploaded Python resources. Select the desired Python resources from the drop-down list.
      • File Resources: The file resources that you want to reference for the node. The system displays all uploaded file resources. Select the desired file resources from the drop-down list.
      • Archive Resources: The archive resources that you want to reference for the node. The system displays all uploaded compressed archive resources. Select the desired archive resources from the drop-down list.
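    At run time, the values in the Parameters field are passed to the main resource as application arguments. The following is a minimal sketch for a Python node; it assumes that the Parameters field is set to ${bizdate} and that the bizdate variable is assigned a scheduling parameter on the Properties tab. The variable name is hypothetical:

        # -*- coding: utf-8 -*-
        # Minimal sketch; assumes the node's Parameters field contains ${bizdate},
        # so the resolved value arrives as the first application argument.
        import sys

        if __name__ == '__main__':
            bizdate = sys.argv[1]  # value substituted for ${bizdate} at run time
            print("processing data for business date: %s" % bizdate)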
  5. On the node configuration tab, click Properties in the right-side navigation pane. On the Properties tab, configure the scheduling properties for the node. For more information, see Configure basic properties.
  6. Save and commit the node.
    Important You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the node.
    1. Click the Save icon in the top toolbar to save the node.
    2. Click the Submit icon in the toolbar.
    3. In the Commit Node dialog box, configure the Change description parameter.
    4. Click OK.
    If the workspace that you use is in standard mode, you must click Deploy in the upper-right corner to deploy the node after you commit it. For more information, see Deploy nodes.
  7. Perform O&M operations on the node. For more information, see Perform basic O&M operations on auto triggered nodes.