Create a PyODPS 2 node

DataWorks provides PyODPS 2 nodes. PyODPS integrates with MaxCompute SDK for Python. Therefore, you can develop PyODPS nodes by using the PyODPS syntax. You can edit Python code on PyODPS 2 nodes in the DataWorks console to process data in MaxCompute.

Background information

DataWorks allows you to create Python resources in a visualized manner. If you want to use a PyODPS node to reference a third-party package, use an exclusive resource group for scheduling and install the package on the O&M Assistant page of the resource group.
A third-party package that you install on the O&M Assistant page can be referenced only when you run a PyODPS node on an exclusive resource group for scheduling. For information about how to reference third-party packages in MaxCompute Python user-defined functions (UDFs), see Reference third-party packages in Python UDFs.
If you want to use a PyODPS node to access a data source or service that is deployed in a special network environment, such as a virtual private cloud (VPC) or data center, use an exclusive resource group for scheduling to run the node, and establish a network connection between the resource group and the data source or service. For more information, see Establish a network connection between a resource group and a data source.
For more information about the PyODPS syntax, see PyODPS documentation.
PyODPS nodes are classified into PyODPS 2 nodes and PyODPS 3 nodes. The two types of PyODPS nodes use different Python versions at the underlying layer. PyODPS 2 nodes use Python 2, and PyODPS 3 nodes use Python 3. You can create a PyODPS node based on the Python version in use.

Precautions

Due to the specifications of resources in the resource group that is used to run a node, we recommend that you use a PyODPS node to process no more than 50 MB of on-premises data. If a PyODPS node processes more than 50 MB of on-premises data, an out-of-memory (OOM) exception may occur, and the system may report Got killed. We recommend that you do not write excessive data processing code for a PyODPS node. For more information, see Best practices for efficient use of PyODPS nodes.
If the system reports Got killed, the memory usage exceeds the limit, and the system terminates the related processes. We recommend that you do not perform local data operations. However, the limits on the memory usage and CPU utilization do not apply to SQL or DataFrame tasks that are initiated by PyODPS. Take note that to_pandas tasks are excluded.
You can use the NumPy and pandas libraries that are pre-installed in DataWorks to run functions other than UDFs. Third-party packages that contain binary code are not supported.
For compatibility reasons, options.tunnel.use_instance_tunnel is set to False in DataWorks by default. If you want to globally enable InstanceTunnel, you must set this parameter to True.
The Python version of PyODPS 2 nodes is Python 2.7.

For more information about the usage and limits of PyODPS nodes, see Installation guide and limits.

Procedure

For more information about how to use PyODPS 2 nodes, see Create a PyODPS 3 node.


Configuration	Description
Create a PyODPS 2 node	DataWorks provides PyODPS 2 nodes. PyODPS integrates with MaxCompute SDK for Python. You can create a PyODPS 2 node and edit Python code for the node.
Use the MaxCompute entry point	In DataWorks, each PyODPS 2 node includes the global variable odps or o, which is the MaxCompute entry point. Therefore, you do not need to manually specify the MaxCompute entry point.
Execute SQL statements	PyODPS 2 nodes allow you to execute SQL statements to query data.
Configure runtime parameters	You can use the hints parameter to configure runtime parameters. The hints parameter is of the DICT type.
Obtain query results of SQL statements	You can obtain query results of SQL statements.
Use DataFrame to process data	You can use DataFrame to process data.
Configure scheduling properties	If you want the system to periodically run a PyODPS node, you must configure the scheduling properties for the node.
Commit a node	You can commit a node. After you commit a node in a workspace in standard mode, the node takes effect only in the development environment, and the system does not automatically schedule the node in the development environment. You must deploy the node to the production environment before the node can be scheduled to run periodically.

Go to the DataStudio page.
1. Log on to the DataWorks console.
2. In the left-side navigation pane, click Workspaces.
3. In the top navigation bar, select the region where the workspace resides. On the Workspaces page, find the workspace in which you want to create tables, and click DataStudio in the Actions column.
On the DataStudio page, move the pointer over the and choose Create Node > MaxCompute > PyODPS 2.
Alternatively, you can click the required workflow in the Business Flow section, right-click MaxCompute, and then choose Create Node > PyODPS 2.
For more information about how to create a workflow, see Create an auto triggered workflow.
In the Create Node dialog box, configure the Name and Path parameters.
Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
Click Confirm.

Configure scheduling properties

In the right-side navigation pane of the configuration tab of the node, click the Properties tab. In the Parameters section of the Properties tab, configure scheduling parameters for the node. The method of configuring scheduling parameters for PyODPS nodes is different from that of configuring scheduling parameters for SQL nodes. For more information, see Configure scheduling parameters for different types of nodes.

Different from SQL nodes in DataWorks, strings such as ${param_name} are not replaced in the code of a PyODPS node. Instead, a dictionary named args is added to the PyODPS node as a global variable before the node code is run. You can obtain the scheduling parameters from the dictionary. This way, Python code is not affected. For example, if you set ds=${yyyymmdd} in the Parameters section of the Properties tab, you can run the following command to obtain the parameter value:

print('ds=' + args['ds'])
ds=20161116

Note You can run the following command to obtain the partition named ds=${yyyymmdd}:

o.get_table('table_name').get_partition('ds=' + args['ds'])

Commit a node

Important Before you commit the node, you must configure the Rerun and Parent Nodes parameters.

Click the icon in the top toolbar.
In the Submit dialog box, enter information in the Change description field.
Click Confirm.

In a workspace in standard mode, you must click Deploy in the upper-right corner after you commit the node. For more information, see Deploy nodes.

For more information about node O&M, see Perform basic O&M operations on auto triggered nodes.

FAQ: How do I determine whether a custom Python script is successfully run?

The logic for determining whether a custom Python script is successfully run is the same as the logic for determining whether a custom Shell script is successfully run. For more information, see Create a Shell node.

FAQ

FAQ about PyODPS