DataWorks supports PyODPS 2 nodes, which are integrated with MaxCompute SDK for Python. You can edit Python code on PyODPS 2 nodes in the DataWorks console to process data in MaxCompute.

Background information

MaxCompute provides an SDK for Python. You can use the SDK for Python to process data in MaxCompute. For more information, see SDK for Python.
  • The Python version of PyODPS 2 nodes is Python 2.7.
  • Each PyODPS 2 node can process a maximum of 50 MB of data and can occupy a maximum of 1 GB of memory. If a PyODPS 2 node exceeds either limit, the node is terminated. Therefore, we recommend that you do not write code that processes large amounts of data in a PyODPS 2 node.
  • For more information about the hints parameter, see SET operations.

PyODPS 2 nodes are designed to use MaxCompute SDK for Python. If you want to run pure Python code, you can create a Shell node to run Python scripts that are uploaded to DataWorks. You can call a third-party package in a PyODPS 2 node. For more information, see Use a PyODPS node to reference a third-party package.

Create a PyODPS 2 node

  1. On the DataStudio page, move the pointer over the Create icon and choose Create Node > MaxCompute > PyODPS 2.
    Alternatively, you can click the required workflow in the Business Flow section, right-click MaxCompute, and then choose Create Node > PyODPS 2.

    For more information about how to create a workflow, see Create a workflow.

  2. In the Create Node dialog box, configure the Name and Path parameters.
    Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  3. Click Commit.
  4. Edit the PyODPS 2 node.
    1. Use the MaxCompute entry point.
      In DataWorks, each PyODPS 2 node includes the global variable odps or o, which is the MaxCompute entry point. Therefore, you do not need to manually specify the MaxCompute entry point.
    2. Execute SQL statements.
      PyODPS 2 nodes allow you to execute MaxCompute SQL statements to query data and obtain the query results. The execute_sql and run_sql methods return an instance object that represents the SQL task.

      Not all SQL statements that you can execute on the MaxCompute client are supported by PyODPS 2 nodes. To execute statements other than DDL or DML statements, you must use alternative methods.

      For example, if you want to execute a GRANT or REVOKE statement, use the run_security_query method. If you want to run a Machine Learning Platform for AI (PAI) command, use the run_xflow or execute_xflow method.
      o.execute_sql('select * from dual') # Execute SQL statements in synchronous mode. Threads are blocked until the execution of the SQL statements is complete. 
      instance = o.run_sql('select * from dual') # Execute SQL statements in asynchronous mode. 
      print(instance.get_logview_address()) # Obtain the Logview URL of an instance. 
      instance.wait_for_success()  # Threads are blocked until the execution of the SQL statements is complete. 
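The difference between the asynchronous run_sql call and the blocking wait_for_success call above can be sketched with a standard-library thread standing in for the remote SQL task. FakeInstance is a simulation for illustration only, not the PyODPS API:

```python
import threading

class FakeInstance:
    """Stands in for the instance object that run_sql returns (illustrative)."""
    def __init__(self, work):
        self._thread = threading.Thread(target=work)
        self._thread.start()       # the "SQL task" starts running immediately

    def wait_for_success(self):
        self._thread.join()        # block until the task completes

results = []
inst = FakeInstance(lambda: results.append('done'))  # returns at once (asynchronous)
inst.wait_for_success()                              # blocks, like the last line above
print(results)  # ['done']
```

In the real API, run_sql likewise returns immediately with a running instance, and wait_for_success blocks the calling thread until that instance finishes.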
    3. Configure runtime parameters.
      You can use the hints parameter to configure runtime parameters. The hints parameter is of the DICT type.
      o.execute_sql('select * from PyODPS_iris', hints={'odps.sql.mapper.split.size': 16})
      If you want runtime parameters to take effect globally, configure the sql.settings parameter. The specified runtime parameters are then automatically applied each time the code runs.
      from odps import options
      options.sql.settings = {'odps.sql.mapper.split.size': 16}
      o.execute_sql('select * from PyODPS_iris') # Configure the hints parameter based on global configurations. 
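The relationship between global settings and per-call hints can be illustrated with plain dictionaries. This is a stdlib-only simulation of the merge behavior, not the PyODPS internals; the function name is made up for this sketch:

```python
# Global configuration, analogous to options.sql.settings above.
global_settings = {'odps.sql.mapper.split.size': 16}

def effective_hints(call_hints=None):
    """Illustrative: combine global settings with per-call hints."""
    merged = dict(global_settings)    # start from the global configuration
    merged.update(call_hints or {})   # per-call hints take precedence
    return merged

print(effective_hints())                                       # global value applies
print(effective_hints({'odps.sql.mapper.split.size': 32}))     # per-call hint overrides
```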
    4. Obtain query results of SQL statements.
      You can use the open_reader method to obtain query results in the following scenarios:
      • The SQL statements return structured data.
        with o.execute_sql('select * from dual').open_reader() as reader:
            for record in reader:
                print(record)  # Process each record. 
      • SQL statements such as DESC are executed. In this case, you can use the reader.raw property to obtain raw query results.
        with o.execute_sql('desc dual').open_reader() as reader:
            print(reader.raw)  # Obtain the raw query results. 
        Note If you use a custom scheduling parameter and run the PyODPS 2 node on the configuration tab of the node, you must set the scheduling parameter to a constant value that specifies a fixed time. The values of custom scheduling parameters cannot be dynamically replaced when a PyODPS node is run on the configuration tab.
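The record-iteration pattern in the step above can be simulated with plain Python objects. FakeReader and FakeRecord below are stand-ins for illustration only, not the PyODPS classes:

```python
class FakeRecord:
    """Stands in for a result record (illustrative, not the PyODPS API)."""
    def __init__(self, values):
        self.values = values

class FakeReader:
    """Mimics a reader returned by open_reader(): supports with and iteration."""
    def __init__(self, rows):
        self._rows = rows
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False
    def __iter__(self):
        return (FakeRecord(r) for r in self._rows)

total = 0
with FakeReader([(1, 'a'), (2, 'b')]) as reader:
    for record in reader:          # process each record
        total += record.values[0]  # e.g. sum the first column
print(total)  # 3
```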
  5. On the right side of the configuration tab for the node, click the Properties tab. In the Properties panel, configure scheduling properties for the node. For more information, see Configure basic properties.
    PyODPS 2 nodes support built-in scheduling parameters and custom scheduling parameters:
    • If you want to configure built-in scheduling parameters for the PyODPS 2 node, you can assign values on the configuration tab for the node.
      Note The shared resource group cannot be connected to the Internet. If you want to connect to the Internet, we recommend that you use a custom resource group or exclusive resources for scheduling. You can use custom resource groups only in DataWorks Professional Edition. You can purchase exclusive resources for scheduling in all DataWorks editions. For more information, see Overview.
    • You can also configure custom scheduling parameters in the General section of the Properties panel.
      Note Custom parameters must be in the args['Parameter name'] format, such as print(args['ds']).
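A minimal sketch of how node code reads a custom scheduling parameter. The args dict is simulated here; in DataWorks, the scheduler injects it into the node's namespace:

```python
# Simulated: in DataWorks, the scheduler injects args into the node's
# namespace, so the assignment below is only a stand-in for that injection.
args = {'ds': '20240101'}

# Custom scheduling parameters are read with args['Parameter name']:
partition = 'ds=' + args['ds']
print(partition)  # ds=20240101
```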
  6. Commit the node.
    Notice Before you commit the node, you must configure the Rerun and Parent Nodes parameters.
    1. Click the Submit icon on the toolbar to commit the node.
    2. In the Commit Node dialog box, enter information in the Change description field.
    3. Click OK.
    In a workspace in standard mode, you must click Deploy in the upper-right corner after you commit the node. For more information, see Deploy nodes.
  7. Perform O&M operations on the node. For more information, see O&M overview of auto triggered nodes.

FAQ: How do I determine whether a custom Python script is successfully run?

The logic for determining whether a custom Python script is successfully run is the same as the logic for determining whether a custom Shell script is successfully run. For more information, see Create a Shell node.
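As with Shell scripts, success is determined by the process exit code: 0 means success, and any nonzero code marks the run as failed. A minimal stdlib illustration (shown in Python 3, while the node runtime itself is Python 2.7):

```python
import subprocess
import sys

# A script that exits with code 0 is treated as successful; any nonzero
# exit code marks the run as failed (the same convention as Shell nodes).
ok = subprocess.run([sys.executable, '-c', 'import sys; sys.exit(0)'])
failed = subprocess.run([sys.executable, '-c', 'import sys; sys.exit(1)'])
print(ok.returncode, failed.returncode)  # 0 1
```

In a custom script, call sys.exit(0) on success and sys.exit with a nonzero value on failure so that the scheduler reports the correct node status.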

Built-in packages for a PyODPS 2 node

A PyODPS 2 node provides the following built-in third-party packages and modules:
  • setuptools
  • cython
  • psutil
  • pytz
  • dateutil
  • requests
  • pyDes
  • numpy
  • pandas
  • scipy
  • scikit_learn
  • greenlet
  • six
  • Other built-in modules of Python 2.7, such as smtplib
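To check which of the packages listed above a given runtime actually provides, a quick stdlib-only availability probe can be used; the package list below is just a sample:

```python
import importlib.util

# Probe for a few of the packages named above without importing them fully.
for name in ['numpy', 'pandas', 'smtplib']:
    available = importlib.util.find_spec(name) is not None
    print(name, 'available' if available else 'missing')
```

smtplib is part of the standard library, so it should always report as available; the third-party packages depend on the environment.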