DataWorks supports PyODPS 2 nodes, which are integrated with MaxCompute SDK for Python. You can edit Python code in PyODPS 2 nodes of DataWorks to process data in MaxCompute.

Background information

MaxCompute provides SDK for Python. You can use the SDK for Python to process data in MaxCompute. For more information, see SDK for Python.
Note
  • The Python version of PyODPS 2 nodes is 2.7.
  • Each PyODPS 2 node can process a maximum of 50 MB of data and can occupy a maximum of 1 GB of memory. If either limit is exceeded, the PyODPS 2 node stops running. Avoid processing large amounts of data locally in a PyODPS 2 node (see the sketch after this note).
  • For more information about the hints parameter, see SET operations.
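  To stay within these limits, submit heavy computation to MaxCompute instead of downloading raw data into the node. The following minimal sketch uses the o entry object that is described later in this topic; the table and column names are hypothetical:
  instance = o.execute_sql('select ds, count(*) as cnt from my_table group by ds')  # The aggregation runs in MaxCompute.
  with instance.open_reader() as reader:  # Only the small aggregated result is read into the node.
      for record in reader:
          print(record['cnt'])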

PyODPS 2 nodes are designed to use MaxCompute SDK for Python. If you want to run pure Python code, you can create a Shell node to run the Python scripts that are uploaded to DataWorks. For information about how to reference a third-party package in a PyODPS 2 node, see Reference a third-party package in a PyODPS node.

Create a PyODPS 2 node

  1. Move the pointer over the Create icon and choose MaxCompute > PyODPS 2.
    Alternatively, you can click the required workflow in the Business Flow section, right-click MaxCompute, and then choose Create > PyODPS 2.

    For more information about how to create a workflow, see Manage workflows.

  2. Edit the PyODPS 2 node.
    1. Use the MaxCompute entry.
      In DataWorks, each PyODPS 2 node includes the global variable odps or o, which is the MaxCompute entry. Therefore, you do not need to manually specify the MaxCompute entry.
      print(odps.exist_table('PyODPS_iris'))
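      As a minimal sketch, you can also use the entry to inspect a table. The following example assumes that the table PyODPS_iris already exists in the current project:
      t = o.get_table('PyODPS_iris')  # Obtain a table object through the MaxCompute entry.
      print(t.schema)  # Print the schema of the table.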
    2. Execute SQL statements.
      PyODPS 2 supports MaxCompute SQL queries and allows you to obtain the query results. The execute_sql and run_sql methods return an instance object.

      Not all SQL statements that you can execute on the MaxCompute client are supported by PyODPS 2. To execute statements other than data definition language (DDL) and data manipulation language (DML) statements, use other methods.

      For example, to execute a GRANT or REVOKE statement, use the run_security_query method (a sketch follows the code below). To run a Machine Learning Platform for AI (PAI) command, use the run_xflow or execute_xflow method.
      o.execute_sql('select * from dual')  # Execute the statement in synchronous mode. The code is blocked until the SQL statement finishes.
      instance = o.run_sql('select * from dual')  # Execute the statement in asynchronous mode.
      print(instance.get_logview_address())  # Obtain the Logview URL of the instance.
      instance.wait_for_success()  # Block until the SQL statement finishes.
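      For example, the run_security_query method can be called as shown in the following minimal sketch. The user name test_user is hypothetical, and the call requires the corresponding MaxCompute permissions:
      o.run_security_query('grant select on table dual to user test_user')  # Run a GRANT statement through the security query API.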
    3. Set runtime parameters.
      You can use the hints parameter to set the runtime parameters. The type of the hints parameter is dict.
      o.execute_sql('select * from PyODPS_iris', hints={'odps.sql.mapper.split.size': 16})
      If you set the sql.settings parameter in the global configuration, the relevant runtime parameters are automatically set each time you run the code.
      from odps import options
      options.sql.settings = {'odps.sql.mapper.split.size': 16}
      o.execute_sql('select * from PyODPS_iris')  # The hints parameter is set based on global configuration.
    4. Obtain SQL query results.
      You can perform the open_reader operation on the instance that executes the SQL statement in the following scenarios:
      • The SQL statement returns structured data. You can iterate over the returned records, as shown in the following example. A sketch of accessing fields by name follows at the end of this step.
        with o.execute_sql('select * from dual').open_reader() as reader:
            for record in reader:  # Process each record.
                print(record)
      • SQL statements such as DESC are executed. In this case, you can use the reader.raw property to obtain raw SQL query results.
        with o.execute_sql('desc dual').open_reader() as reader:
            print(reader.raw)
        Note If you use a custom scheduling parameter and run the PyODPS 2 node on the configuration tab, you must set the parameter to a fixed time. PyODPS 2 nodes do not support direct replacement of scheduling parameters in code.
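      When the SQL statement returns structured data, each record also supports access by column name. The following minimal sketch assumes that the table PyODPS_iris has a column named sepallength:
        with o.execute_sql('select * from PyODPS_iris').open_reader() as reader:
            for record in reader:
                print(record['sepallength'])  # Access a field in the record by column name.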
  3. On the node configuration tab, click Properties in the right-side navigation pane. Set the scheduling properties for the node. For more information, see Basic properties.
    PyODPS 2 nodes can use built-in scheduling parameters and custom scheduling parameters:
    • If the PyODPS 2 node uses built-in scheduling parameters, you can assign values to the parameters on the node configuration tab.
      Note The shared resource group cannot be connected to the Internet. If you want to connect to the Internet, we recommend that you use a custom resource group or exclusive resources for scheduling. Only DataWorks Professional Edition supports custom resource groups. You can purchase exclusive resources for scheduling in all DataWorks editions. For more information, see DataWorks exclusive resources.
    • You can also set custom scheduling parameters in the Properties > General section.
      Note You must reference a custom parameter in the format args['Parameter name'], for example, print(args['ds']).
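      For example, the following minimal sketch reads a custom parameter named ds and uses it to filter a table. The table name my_table is hypothetical, and the sketch assumes that ds is configured in the Properties > General section:
      ds = args['ds']  # Read the custom scheduling parameter that DataWorks passes to the node.
      o.execute_sql("select * from my_table where ds = '%s'" % ds)  # Use the parameter value in an SQL statement.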
  4. Commit the node.
    Notice Before you commit the node, you must set the Rerun and Parent Nodes parameters.
    1. Click the Submit icon in the toolbar.
    2. In the Commit Node dialog box, enter your comments in the Change description field.
    3. Click OK.
    In a workspace in standard mode, you must click Deploy in the upper-right corner after you commit the node.
  5. Test the node. For more information, see View auto triggered nodes.

Built-in packages for a PyODPS node

A PyODPS node contains the following built-in packages:
  • setuptools
  • cython
  • psutil
  • pytz
  • dateutil
  • requests
  • pyDes
  • numpy
  • pandas
  • scipy
  • scikit_learn
  • greenlet
  • six
  • Other built-in modules in Python 2.7, such as smtplib
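For example, the preinstalled packages can be imported directly in node code, as in the following minimal sketch:
  import numpy as np
  import pandas as pd

  df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])  # Build a small DataFrame with the built-in numpy and pandas packages.
  print(df.describe())  # Compute summary statistics locally in the node.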