An ODPS Spark node processes data by using Java, Scala, or Python. This topic describes how to create and configure an ODPS Spark node.

Background information

Python resources can be referenced in Python user-defined functions (UDFs). However, the usage of Python resources is limited because only a small number of dependent third-party packages are available.

PyODPS 2 and PyODPS 3 nodes support Python resources. For more information, see Create a PyODPS 2 node and Create a PyODPS 3 node.

This topic describes how to create and upload JAR or Python resources based on your business needs, and how to reference these resources after you create an ODPS Spark node.

Create a JAR resource

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. On the Data Development tab, move the pointer over the Create icon and choose MaxCompute > Resources > JAR.
    Alternatively, you can click a workflow in the Business process section, right-click MaxCompute, and then choose New > Resource > JAR.
  3. In the New resource dialog box, set the Resource Name and Destination folder parameters.
    Note
    • If multiple MaxCompute compute engines are bound to the current workspace, you must select one MaxCompute compute engine before you create the resource.
    • If the selected JAR package has already been uploaded from the MaxCompute client, clear Upload as an ODPS resource. Otherwise, an error is reported during the upload.
    • The resource name can be different from the name of the uploaded file.
    • The resource name can contain letters, digits, underscores (_), and periods (.), and is not case-sensitive. It must be 1 to 128 characters in length. A JAR resource name must end with .jar, and a Python resource name must end with .py.
  4. Click Upload and select the file that you want to upload.
  5. Click Confirm.
  6. Click the Save and Submit icons in the toolbar to save and commit the resource.
  7. In the Submit New version dialog box, enter your comments in the Change description field and click OK.

Create a Python resource

  1. On the Data Development tab, move the pointer over the Create icon and choose MaxCompute > Resources > Python.
    Alternatively, you can click a workflow in the Business process section, right-click MaxCompute, and then choose New > Resource > Python.
  2. In the New resource dialog box, set the Resource Name and Destination folder parameters.
    Note
    • If multiple MaxCompute compute engines are bound to the current workspace, you must select one MaxCompute compute engine before you create the resource.
    • The resource name can contain letters, digits, periods (.), underscores (_), and hyphens (-), and must end with .py.
  3. Click Confirm.
  4. On the resource configuration tab, enter the Python code.
    The Python code in the following example verifies whether parameter values are numbers. It does not implement actual data processing logic.
    # -*- coding: utf-8 -*-
    import sys
    from pyspark.sql import SparkSession
    
    try:
        # Python 2 only: reset the default encoding to UTF-8.
        reload(sys)
        sys.setdefaultencoding('utf8')
    except NameError:
        # Python 3: reload() is not a built-in, and UTF-8 is already the default.
        pass
    
    def is_number(s):
        # Return True if the string s can be interpreted as a number.
        try:
            float(s)
            return True
        except ValueError:
            pass
    
        try:
            import unicodedata
            unicodedata.numeric(s)
            return True
        except (TypeError, ValueError):
            pass
    
        return False
    
    if __name__ == '__main__':
        spark = SparkSession.builder\
            .appName("spark sql")\
            .config("spark.sql.broadcastTimeout", 20 * 60)\
            .config("spark.sql.crossJoin.enabled", True)\
            .config("odps.exec.dynamic.partition.mode", "nonstrict")\
            .config("spark.sql.catalogImplementation", "odps")\
            .getOrCreate()
    
        print(is_number('foo'))    # False
        print(is_number('1'))      # True
        print(is_number('1.3'))    # True
        print(is_number('-1.37'))  # True
        print(is_number('1e3'))    # True
  5. Click the Save and Submit icons in the toolbar to save and commit the resource.
  6. In the Submit New version dialog box, enter your comments in the Change description field and click OK.
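A Python resource does not have to be a self-contained script. You can also upload a helper module as a separate Python resource and import it from the main Python resource at run time. The following is a minimal sketch under that assumption; the file name utils.py and the function are illustrative, and the module only becomes importable if you later attach it to the ODPS Spark node under Select python resources.

    # -*- coding: utf-8 -*-
    # utils.py: a helper module uploaded as a separate Python resource.
    # When attached to an ODPS Spark node under "Select python resources",
    # it is distributed with the job, so the main resource can use:
    #     from utils import normalize
    
    def normalize(value):
        # Strip surrounding whitespace and lowercase a string value.
        return value.strip().lower()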

Create an ODPS Spark node

  1. On the Data Development tab, move the pointer over the Create icon and choose MaxCompute > ODPS Spark.
    Alternatively, you can click a workflow in the Business process section, right-click MaxCompute, and then choose New > ODPS Spark.
  2. In the Create Node dialog box, set the Node Name and Location parameters.
    Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  3. Click Commit.
  4. On the configuration tab of the ODPS Spark node, set the parameters as required. For more information, see Spark on MaxCompute overview.
    The spark version and Language parameters each provide two options. The remaining parameters vary based on the value of the Language parameter. Set the parameters as required.
    • The following parameters are available when you set the Language parameter to Java/Scala.
      • spark version: The Spark version of the node. Valid values: Spark1.x and Spark2.x.
      • Language: The language used by the node. In this example, select Java/Scala.
      • Select the main jar resource: The main JAR resource referenced by the node. Select a JAR resource that you uploaded from the drop-down list.
      • Configuration Items: The configuration items of the node. Click Add and enter a key and a value to add a configuration item.
      • Main Class: The name of the main class of the node.
      • Parameters: The parameters of the node. Separate multiple parameters with spaces. If you need to set scheduling parameters, click the Scheduling configuration tab in the right-side navigation pane. For more information, see Scheduling parameters.
        Note After you set scheduling parameters, set the parameters for the node. The scheduling parameters and node parameters are executed in sequence.
      • Select jar resources: The additional JAR resources referenced by the node. Select the JAR resources that you uploaded from the drop-down list. The ODPS Spark node automatically finds uploaded JAR resources based on the resource type.
      • Select file resource: The file resources referenced by the node. Select the files that you uploaded from the drop-down list. The ODPS Spark node automatically finds uploaded files based on the file type.
      • Select archives Resources: The archive resources referenced by the node. Select the archive resources that you uploaded from the drop-down list. The ODPS Spark node automatically finds uploaded archive resources based on the resource type. Only compressed files are displayed.
    • The following parameters are available when you set the Language parameter to Python.
      • spark version: The Spark version of the node. Valid values: Spark1.x and Spark2.x.
      • Language: The language used by the node. In this example, select Python.
      • Select the main python resource: The main Python resource referenced by the node. Select a Python resource that you uploaded from the drop-down list.
      • Configuration Items: The configuration items of the node. Click Add and enter a key and a value to add a configuration item.
      • Parameters: The parameters of the node. Separate multiple parameters with spaces. A sketch of how a main Python resource can read these parameters follows this list.
      • Select python resources: The additional Python resources referenced by the node. Select the Python resources that you uploaded from the drop-down list. The ODPS Spark node automatically finds uploaded Python resources based on the resource type.
      • Select file resource: The file resources referenced by the node. Select the files that you uploaded from the drop-down list. The ODPS Spark node automatically finds uploaded files based on the file type.
      • Select archives Resources: The archive resources referenced by the node. Select the archive resources that you uploaded from the drop-down list. The ODPS Spark node automatically finds uploaded archive resources based on the resource type. Only compressed files are displayed.
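      The following minimal sketch shows how a main Python resource might consume the values from the Parameters field. The file name, the variable names, and the sample value "2021-01-01 100" are illustrative assumptions, not fixed conventions.
      # -*- coding: utf-8 -*-
      # main.py: a minimal main Python resource.
      # If the node's Parameters field is set to "2021-01-01 100", the two
      # space-separated values arrive in sys.argv in the same order.
      import sys
      from pyspark.sql import SparkSession
      
      if __name__ == '__main__':
          # sys.argv[0] is the script path; node parameters start at index 1.
          bizdate = sys.argv[1]
          row_limit = int(sys.argv[2])
      
          spark = SparkSession.builder.appName("parameter demo").getOrCreate()
          print("bizdate=%s, row_limit=%d" % (bizdate, row_limit))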
  5. On the node configuration tab, click the Scheduling configuration tab in the right-side navigation pane. On the Scheduling configuration tab, set the scheduling properties for the node. For more information, see Basic properties.
  6. Save and commit the node.
    Notice You must set the Rerun and Parent Nodes parameters before you can commit the node.
    1. Click the Save icon in the toolbar to save the node.
    2. Click the Commit icon in the toolbar.
    3. In the Commit Node dialog box, enter your comments in the Change description field.
    4. Click OK.
    In a workspace in standard mode, you must click Deploy in the upper-right corner after you commit the node. For more information, see Deploy nodes.
  7. Test the node. For more information, see View auto triggered nodes.