You can use a MaxCompute Spark node to process data by using Java, Scala, or Python. This topic describes how to create and configure a MaxCompute Spark node.

Background information

Python resources can be referenced in Python user-defined functions (UDFs). However, the usage of Python resources in UDFs is limited because only a small number of third-party dependency packages are available.
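The following is a minimal sketch of a MaxCompute Python UDF that reads a referenced file resource. The resource name lookup.txt, the UDF class name, and the comma-separated mapping format are illustrative assumptions, not part of this topic.

    # A minimal sketch of a Python UDF that references a file resource.
    # The resource name 'lookup.txt' and the comma-separated format are
    # hypothetical; replace them with your own resource and logic.
    from odps.udf import annotate
    from odps.distcache import get_cache_file

    @annotate("string->string")
    class LookupUdf(object):
        def __init__(self):
            self.mapping = {}
            f = get_cache_file('lookup.txt')  # open the referenced resource
            for line in f:
                key, value = line.strip().split(',')
                self.mapping[key] = value
            f.close()

        def evaluate(self, key):
            return self.mapping.get(key)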

PyODPS 2 and PyODPS 3 nodes support Python resources. For more information, see Create a PyODPS 2 node and Create a PyODPS 3 node.

This topic describes how to create and upload JAR and Python resources based on your business requirements, and how to reference these resources when you create a MaxCompute Spark node.

Create a JAR resource

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. On the DataStudio page, move the pointer over the Create icon and choose MaxCompute > Resource > JAR.
    Alternatively, you can find your desired workflow, right-click the workflow name, and then choose Create > MaxCompute > Resource > JAR.
  3. In the Create Resource dialog box, specify Resource Name and Location.
    Note
    • If multiple MaxCompute compute engine instances are bound to the current workspace, you must select one from the Engine Instance MaxCompute drop-down list.
    • If the JAR package that you select has already been uploaded by using the MaxCompute client (see the example after these steps), clear Upload to MaxCompute. Otherwise, an error occurs during the upload process.
    • The resource name can be different from the name of the uploaded file.
    • The resource name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.). The name is not case-sensitive. A JAR resource name must end with .jar, and a Python resource name must end with .py.
  4. Click Upload and select the file to upload.
  5. Click Create.
  6. Click the Submit icon in the top toolbar to commit the code.
  7. In the Commit Node dialog box, enter your comments in the Change description field and click OK.
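If the JAR package is uploaded by using the MaxCompute client, as mentioned in the note above, the command is similar to the following one. The path and file name are placeholders, and the -f option overwrites an existing resource that has the same name.

    add jar /path/to/spark-examples.jar -f;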

Create a Python resource

  1. On the DataStudio page, move the pointer over the Create icon and choose MaxCompute > Resource > Python.
    Alternatively, you can find your desired workflow, right-click the workflow name, and then choose Create > MaxCompute > Resource > Python.
  2. In the Create Resource dialog box, specify Resource Name and Location.
    Note
    • If multiple MaxCompute compute engine instances are bound to the current workspace, you must select one from the Engine Instance MaxCompute drop-down list.
    • The resource name can contain letters, digits, periods (.), underscores (_), and hyphens (-), and must end with .py.
    • The Python resource that you create can contain only Python 2.x or Python 3.x code.
  3. Click Create.
  4. On the node configuration tab, enter the Python code.
    The following sample code checks only whether input values are numeric. It does not contain data processing logic. For an example of how this script can read the arguments that you configure on the node, see the sketch after these steps.
    # -*- coding: utf-8 -*-
    import sys
    from pyspark.sql import SparkSession

    try:
        # Python 2: set the default encoding to UTF-8.
        reload(sys)
        sys.setdefaultencoding('utf8')
    except NameError:
        # Python 3: reload() does not exist and no action is required.
        pass

    def is_number(s):
        # Return True if the string s can be parsed as a number.
        try:
            float(s)
            return True
        except ValueError:
            pass

        try:
            import unicodedata
            unicodedata.numeric(s)
            return True
        except (TypeError, ValueError):
            pass

        return False

    if __name__ == '__main__':
        spark = SparkSession.builder\
            .appName("spark sql")\
            .config("spark.sql.broadcastTimeout", 20 * 60)\
            .config("spark.sql.crossJoin.enabled", True)\
            .config("odps.exec.dynamic.partition.mode", "nonstrict")\
            .config("spark.sql.catalogImplementation", "odps")\
            .getOrCreate()

        print(is_number('foo'))    # False
        print(is_number('1'))      # True
        print(is_number('1.3'))    # True
        print(is_number('-1.37'))  # True
        print(is_number('1e3'))    # True
  5. Click the Submit icon in the top toolbar to commit the code.
  6. In the Commit Node dialog box, enter your comments in the Change description field and click OK.
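If you later attach this resource to a MaxCompute Spark node as the main Python resource and configure the Arguments parameter, the argument values are passed to the script and can be read from sys.argv. The following minimal sketch assumes a single hypothetical argument, such as a business date.

    # A minimal sketch: read the arguments that are configured on the node.
    # The argument meaning (a business date such as 20250101) is a
    # hypothetical example.
    import sys

    if __name__ == '__main__':
        args = sys.argv[1:]  # values from the Arguments parameter
        bizdate = args[0] if args else None
        print("bizdate = {}".format(bizdate))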

Create a MaxCompute Spark node

  1. On the DataStudio page, move the pointer over the Create icon and choose MaxCompute > ODPS Spark.
    Alternatively, you can find your desired workflow, right-click the workflow name, and then choose Create > MaxCompute > ODPS Spark.
  2. In the Create Node dialog box, set the Node Name and Location parameters.
    Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  3. Click Commit.
  4. On the configuration tab of the MaxCompute Spark node, configure the parameters. For more information about MaxCompute Spark nodes, see Overview.
    Two options are available for each of the Spark Version and Language parameters of the MaxCompute Spark node. The parameters that are displayed on this tab vary based on the value of the Language parameter. Configure the parameters as prompted.
    • The following list describes the parameters that are displayed when you set Language to Java/Scala.
      Spark Version: The Spark version of the node. Valid values: Spark1.x and Spark2.x.
      Language: The programming language that is used by the node. Select Java/Scala.
      Main JAR Resource: The main JAR resource that is referenced by the node. Select the desired JAR resource from the drop-down list.
      Configuration Items: The configuration items of the node. Click Add, and enter a key and a value to add a configuration item. Alternatively, you can select a key from the field that appears. In this case, a value is automatically entered for the key. For examples of commonly used configuration items, see the examples after these parameter descriptions.
      Main Class: The name of the main class of the node.
      Arguments: The arguments of the node. Separate multiple arguments with spaces. You can use scheduling parameters in the arguments. To configure scheduling parameters, click Properties in the right-side navigation pane. For more information, see Scheduling parameters.
      Note: After you configure scheduling parameters on the Properties tab, you must reference them in the Arguments field so that they are replaced with actual values at run time.
      JAR Resources: The JAR resources that are referenced by the node. All uploaded JAR resources are displayed. Select the desired JAR resources from the drop-down list.
      File Resources: The file resources that are referenced by the node. All uploaded file resources are displayed. Select the desired file resources from the drop-down list.
      Archive Resources: The archive resources that are referenced by the node. All uploaded archive resources (compressed packages) are displayed. Select the desired archive resources from the drop-down list.
    • The following list describes the parameters that are displayed when you set Language to Python.
      Spark Version: The Spark version of the node. Valid values: Spark1.x and Spark2.x.
      Language: The programming language that is used by the node. Select Python.
      Main Python Resource: The main Python resource that is referenced by the node. Select the desired Python resource from the drop-down list.
      Configuration Items: The configuration items of the node. Click Add, and enter a key and a value to add a configuration item. Alternatively, you can select a key from the field that appears. In this case, a value is automatically entered for the key.
      Arguments: The arguments of the node. Separate multiple arguments with spaces.
      Python Resources: The Python resources that are referenced by the node. All uploaded Python resources are displayed. Select the desired Python resources from the drop-down list.
      File Resources: The file resources that are referenced by the node. All uploaded file resources are displayed. Select the desired file resources from the drop-down list. For an example of how a referenced file resource can be accessed in code, see the examples after these parameter descriptions.
      Archive Resources: The archive resources that are referenced by the node. All uploaded archive resources (compressed packages) are displayed. Select the desired archive resources from the drop-down list.
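    The following examples supplement the preceding parameter descriptions. The first one lists Spark configuration items that are commonly set in the Configuration Items parameter. The values are illustrative only; tune them based on your workload.
      spark.executor.instances    4
      spark.executor.cores        2
      spark.executor.memory       4g
      spark.driver.memory         2g
    The second one is a minimal sketch of how code can access a referenced file resource. It assumes that referenced file resources are placed in the working directory of the driver and executors, in a way similar to spark-submit --files, and that a hypothetical file resource named config.json is referenced by the node.
      # A minimal sketch, assuming that a referenced file resource named
      # config.json (hypothetical) is available in the working directory.
      import json

      with open('config.json') as f:
          settings = json.load(f)
      print(settings)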
  5. On the node configuration tab, click Properties in the right-side navigation pane. On the Properties tab, configure the scheduling properties for the node. For more information, see Basic properties.
  6. Save and commit the node.
    Notice You must set the Rerun and Parent Nodes parameters before you can commit the node.
    1. Click the Save icon in the toolbar to save the node.
    2. Click the Commit icon in the toolbar.
    3. In the Commit Node dialog box, enter your comments in the Change description field.
    4. Click OK.
    In a workspace in standard mode, you must click Deploy in the upper-right corner after you commit the node. For more information, see Deploy nodes.
  7. Test the node. For more information, see View auto triggered nodes.