DataWorks supports PyODPS nodes, which are integrated with the Python SDK of MaxCompute. You can edit Python code in PyODPS nodes of DataWorks to process data in MaxCompute.

You can also use the Python SDK of MaxCompute to process data in MaxCompute.

Note The Python version of PyODPS nodes is 2.7.

PyODPS nodes are designed to use the Python SDK of MaxCompute. If you want to run pure Python code, you can create a Shell node to run the Python scripts uploaded to DataWorks.

Each PyODPS node can process a maximum of 50 MB of data and occupy a maximum of 1 GB of memory. If a node exceeds either limit, DataWorks terminates it. Avoid writing code that processes large amounts of data directly in a PyODPS node.

Create a PyODPS node

  1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the target workspace and click Data Analytics in the Actions column.
  2. Move the pointer over the Create icon and choose MaxCompute > PyODPS.

    You can also find the target workflow, right-click MaxCompute, and choose Create > PyODPS.

  3. In the Create Node dialog box that appears, enter the node name, select the target folder, and click Commit.
    Note A node name can be up to 128 characters in length.
  4. Edit the code of the PyODPS node on the node configuration tab.

    After you edit the code of the PyODPS node, you can save the code and commit the node. For more information, see PyODPS node configuration tab.

    1. Use the MaxCompute entry.
      Each PyODPS node includes the global variable odps or o, which is the MaxCompute entry. Therefore, you do not need to manually specify the MaxCompute entry.
    2. Run SQL statements.
      In PyODPS nodes, you can run MaxCompute SQL statements to query data and obtain the query results. You can use the execute_sql or run_sql method to run MaxCompute job instances.
      Note Some statements that can be run in the MaxCompute console cannot be run directly by using the execute_sql or run_sql method. Statements other than DDL and DML statements require dedicated methods: to run a GRANT or REVOKE statement, use the run_security_query method, and to run a PAI command, use the run_xflow or execute_xflow method.
      o.execute_sql('select * from dual')  # Run the SQL statement in synchronous mode. Subsequent code is blocked until the statement finishes.
      instance = o.run_sql('select * from dual')  # Run the SQL statement in asynchronous mode.
      print(instance.get_logview_address())  # Obtain the Logview URL of the instance.
      instance.wait_for_success()  # Block until the SQL statement finishes.
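The GRANT/REVOKE and PAI cases mentioned in the note can be sketched as follows. This is a minimal sketch: the user name, table name, and xflow arguments are placeholders, and the helpers simply wrap the entry object that a PyODPS node provides as the global variable o.

```python
# Hedged sketch: helpers that wrap the PyODPS entry object.
# All names below (test_user, dual, the xflow arguments) are placeholders.

def grant_select(o, user, table):
    # GRANT and REVOKE statements must go through run_security_query,
    # not execute_sql or run_sql.
    return o.run_security_query('GRANT SELECT ON TABLE %s TO USER %s' % (table, user))

def run_pai_command(o, xflow_name, project, parameters):
    # PAI commands are submitted as xflows. run_xflow is asynchronous;
    # execute_xflow blocks until the xflow finishes.
    return o.run_xflow(xflow_name, project, parameters)
```

In a PyODPS node, the predefined entry can be passed directly, for example grant_select(o, 'test_user', 'dual').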
    3. Set runtime parameters.
      You can use the hints parameter to set the runtime parameters. The type of the hints parameter is DICT.
      o.execute_sql('select * from PyODPS_iris', hints={'odps.sql.mapper.split.size': 16})
      If you set the sql.settings parameter in the global configuration, the relevant runtime parameters are automatically set each time you run the code.
      from odps import options
      options.sql.settings = {'odps.sql.mapper.split.size': 16}
      o.execute_sql('select * from PyODPS_iris')  # The hints parameter is automatically set based on global configuration.
    4. Obtain SQL query results.
      You can use the open_reader method to obtain query results in the following scenarios:
      • The SQL statement returns structured data.
        with o.execute_sql('select * from dual').open_reader() as reader:
            for record in reader:  # Process each record.
                print(record)
      • SQL statements such as DESC are run. In this case, you can use the reader.raw property to obtain raw query results.
        with o.execute_sql('desc dual').open_reader() as reader:
            print(reader.raw)  # Obtain the raw query results.
        Note If you use a custom time variable, you must set the variable to a fixed time. PyODPS nodes do not support relative time variables.
  5. Configure the node properties.

    Click the Properties tab in the right-side navigation pane. On the Properties tab that appears, set the relevant parameters. For more information, see Properties.

    1. If you use a system-defined scheduling parameter in the node code, you can directly obtain the value of the scheduling parameter.
      Note A PyODPS node cannot access public IP addresses when it runs on the default resource group. If you need to access public IP addresses in a PyODPS node, we recommend that you use a custom resource group or exclusive resources for scheduling. Custom resource groups are available only in DataWorks Professional Edition, whereas exclusive resources for scheduling are available in all DataWorks editions. For more information, see DataWorks exclusive resources.
    2. After assigning the value to a variable, commit the node and go to Operation Center. In Operation Center, test the node to view the assignment result.
    3. You can set custom parameters in the General section of the Properties tab.
      Note You must specify a custom parameter in the format of args['Parameter name'], for example, print(args['ds']).
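The access pattern for custom parameters can be sketched as follows. This is a simulation: in a real node, DataWorks injects the args variable, so the assignment on the first line would not appear in your code, and the table name is a placeholder.

```python
# Hedged sketch: DataWorks provides `args` to the node at run time.
# Here we simulate it to show the access pattern.
args = {'ds': '20240101'}  # Simulated; injected by DataWorks in a real node.

ds = args['ds']  # Read a custom parameter by name.
sql = "select * from PyODPS_iris where ds = '%s'" % ds  # Placeholder table name.
print(sql)
```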
  6. Commit the node.

    After the node properties are configured, click the Save icon in the upper-left corner. Then, click Commit or Commit and Unlock to commit the node to the development environment.

  7. Deploy the node.

    For more information, see Deploy a node.

  8. Test the node in the production environment.

Built-in modules for PyODPS nodes

A PyODPS node contains the following built-in modules:
  • setuptools
  • cython
  • psutil
  • pytz
  • dateutil
  • requests
  • pyDes
  • numpy
  • pandas
  • scipy
  • scikit_learn
  • greenlet
  • six
  • Other built-in modules in Python 2.7, such as smtplib
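The built-in modules listed above can be imported directly in a PyODPS node without any installation. A minimal sketch using numpy and pandas:

```python
# Hedged sketch: numpy and pandas are among the built-in modules of a
# PyODPS node, so they can be imported without installing anything.
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 3.0, 4.0])
df = pd.DataFrame({'value': values})
print(df['value'].mean())  # -> 2.5
```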