MaxCompute: Use PyODPS in DataWorks

Last Updated: Mar 25, 2026

PyODPS is the Python SDK for MaxCompute. In DataWorks, you can create PyODPS nodes to write and run Python code against MaxCompute — with the MaxCompute entry point pre-configured, so no authentication setup is needed.

This topic covers the DataWorks-specific behaviors of PyODPS nodes: environment limits, key capabilities, and code examples.

Prerequisites

Before you begin, ensure that you have:

  • A DataWorks workspace with MaxCompute enabled

  • A PyODPS node created in DataStudio — either PyODPS 2 (Python 2) or PyODPS 3 (Python 3)

For instructions on creating a node, see Develop a PyODPS 2 task and Develop a PyODPS 3 task.

Limitations

Memory

If memory usage exceeds the node limit, the node reports Got killed and terminates. To avoid this, push data processing tasks to MaxCompute for distributed execution instead of downloading data to the DataWorks environment and processing it locally. For a comparison of the two approaches, see the "Precautions" section in Overview.
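
For example, the following sketch (assuming the pyodps_iris sample table used later in this topic) aggregates on the MaxCompute side and downloads only the small aggregated result, rather than pulling the full table into the node's memory:

instance = o.execute_sql('select name, count(*) as cnt from pyodps_iris group by name')
with instance.open_reader() as reader:
    # Only the aggregated rows are transferred to the DataWorks node.
    for record in reader:
        print(record['name'], record['cnt'])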

Data records

By default, options.tunnel.use_instance_tunnel is False in DataWorks PyODPS nodes. This means instance.open_reader uses the Result interface, which returns up to 10,000 data records and has limited support for complex data types. To read all records or handle complex types such as Arrays, enable InstanceTunnel. See Read SQL execution results for details.

Packages

Most common data science packages come pre-installed (see Pre-installed packages). The following restrictions apply:

  • atexit is not supported. Use try-finally instead (see the sketch after this list).

  • matplotlib is not available, which affects the DataFrame plot function.

  • DataFrame UDFs run in a Python sandbox and can only use pure Python libraries and NumPy. Other third-party libraries such as pandas are not supported inside UDFs. For non-UDF operations, NumPy and pandas are available.

  • Third-party packages that contain binary code are not supported.
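
As noted in the first restriction above, a try-finally block can take the place of an atexit hook. The following minimal sketch assumes a hypothetical temporary table used for cleanup:

temp_table = 'tmp_pyodps_cleanup_example'  # Hypothetical temporary table name.
o.create_table(temp_table, 'idx bigint', if_not_exists=True)
try:
    # ... work that uses the temporary table ...
    pass
finally:
    # Runs even if the work above raises, similar to an atexit hook.
    o.delete_table(temp_table, if_exists=True)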

MaxCompute entry point

Each PyODPS node exposes odps and o as global variables — both refer to the MaxCompute entry point. DataWorks configures these automatically, so no authentication setup is needed in your code.

# Check whether the pyodps_iris table exists.
print(o.exist_table('pyodps_iris'))

The entry object o can access MaxCompute only. It cannot access other Alibaba Cloud services, and additional authentication cannot be obtained using methods such as o.from_global.

Execute SQL statements

Use execute_sql() or run_sql() to run DDL (data definition language) and DML (data manipulation language) statements.

o.execute_sql('select * from pyodps_iris')

For non-DDL/DML statements, use the appropriate method:

  • run_security_query — for GRANT and REVOKE statements (see the sketch after this list)

  • run_xflow or execute_xflow — for XFlow tasks, such as PAI commands
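
For GRANT and REVOKE statements, a minimal sketch of run_security_query (the account placeholder is hypothetical):

o.run_security_query('GRANT Describe, Select ON TABLE pyodps_iris TO USER ALIYUN$<account_name>')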

For full SQL documentation, see SQL.

Read SQL execution results

Because InstanceTunnel is not enabled by default in DataWorks, instance.open_reader is limited to 10,000 records and may not support complex data types. If your project does not have data protection enabled and you need to read all records or use complex types such as Arrays, enable InstanceTunnel.

Option 1: Enable globally

Applies to all subsequent open_reader calls in the node.

from odps import options

options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False  # Remove the 10,000-record limit.

with instance.open_reader() as reader:
    # InstanceTunnel is active. Use reader.count to get the total number of records.
    pass

Option 2: Enable per reader

Applies only to the current open_reader call.

with instance.open_reader(tunnel=True, limit=False) as reader:
    # InstanceTunnel is active for this reader. All records are accessible.
    pass

For more information, see Obtain the execution results of SQL statements.

DataFrame

To run DataFrame operations in DataWorks, you must explicitly call an immediately executed method such as execute or persist. Otherwise, the operation is not triggered.

from odps.df import DataFrame

iris = DataFrame(o.get_table('pyodps_iris'))
# Use execute() to trigger the operation and iterate over results.
for record in iris[iris.sepalwidth < 3].execute():
    print(record)
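
persist is another immediately executed method. The following minimal sketch writes the filtered result to a MaxCompute table; the target table name is hypothetical:

iris[iris.sepalwidth < 3].persist('pyodps_iris_narrow_sepal')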

By default, options.verbose is enabled in DataWorks, so execution details such as Logview URLs are printed automatically.

For more information, see DataFrame (not recommended).

Scheduling parameters

DataWorks injects scheduling parameters into PyODPS nodes differently from SQL nodes:

  • SQL nodes: ${param_name} is substituted directly into the SQL string.

  • PyODPS nodes: A global dictionary args is populated before the code runs. Read parameter values using args['param_name'], not ${param_name}.

This design avoids unintended string substitutions in Python code.

Example: On the Scheduling configuration tab of a PyODPS node, set ds=${yyyymmdd} in the Parameters field under Basic properties. Then read the value in code:

# Print the value of the ds scheduling parameter, for example ds=20161116.
print('ds=' + args['ds'])

To query data from the partition that ds points to:

o.get_table('table_name').get_partition('ds=' + args['ds'])
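
Because ${param_name} is not substituted in PyODPS code, build SQL statements from the args dictionary explicitly. A minimal sketch that reuses the table_name placeholder:

# ${ds} would not be replaced here, so format the value from args into the statement.
o.execute_sql("select * from table_name where ds = '%s'" % args['ds'])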

For more information, see Configure and use scheduling parameters.

Runtime hints

Use the hints parameter to pass runtime settings to execute_sql. The value must be a dict.

o.execute_sql('select * from pyodps_iris', hints={'odps.sql.mapper.split.size': 16})

To apply the same settings to all SQL executions in the node, configure options.sql.settings globally:

from odps import options

options.sql.settings = {'odps.sql.mapper.split.size': 16}
o.execute_sql('select * from pyodps_iris')  # The hints are applied automatically.

Third-party packages

Pre-installed packages

The following packages are pre-installed in DataWorks nodes:

Package        Python 2 node    Python 3 node
requests       2.11.1           2.26.0
numpy          1.16.6           1.18.1
pandas         0.24.2           1.0.5
scipy          0.19.0           1.3.0
scikit_learn   0.18.1           0.22.1
pyarrow        0.16.0           2.0.0
lz4            2.1.4            3.1.10
zstandard      0.14.1           0.17.0
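
To confirm the versions that a specific node provides, you can print them at run time. A minimal sketch:

import numpy
import pandas
print(numpy.__version__, pandas.__version__)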

Install a custom package

If the package you need is not pre-installed, use pyodps-pack to bundle it and load_resource_package to load it in the node.

  1. Bundle the package. The following example bundles the ipaddress package:

    pyodps-pack -o ipaddress-bundle.tar.gz ipaddress

    For Python 2 nodes, add --dwpy27:

    pyodps-pack --dwpy27 -o ipaddress-bundle.tar.gz ipaddress

    To reduce bundle size, exclude packages that are already pre-installed in DataWorks:

    pyodps-pack -o bundle.tar.gz --exclude numpy --exclude pandas <your-package>

    The total size of downloaded packages cannot exceed 100 MB.

  2. Upload and submit the .tar.gz file as a MaxCompute resource.

  3. In the PyODPS node, load and import the package:

    load_resource_package("ipaddress-bundle.tar.gz")
    import ipaddress

For more information, see Generate a third-party package for PyODPS and Reference a third-party package in a PyODPS node.

Access MaxCompute with a different account

By default, the entry object o uses the credentials provided by DataWorks for the current workspace. To access a MaxCompute project using a different Alibaba Cloud account, use the as_account method to create a separate entry object.

Important

as_account requires PyODPS 0.11.3 or later.

Procedure

  1. Grant the new account the required permissions on the project. See Appendix: Grant permissions to another account.

  2. In the PyODPS node, create an entry object for the new account:

    import os
    
    # Store credentials in environment variables rather than hardcoding them.
    new_odps = o.as_account(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET')
    )
  3. Verify that the account switch succeeded by checking the current user:

    print(new_odps.get_project().current_user)

    If the output matches the AccessKey ID of the new account, the switch was successful.

Example

This example creates a table, queries it using a different account, and prints the results.

  1. Create the pyodps_iris table and import sample data. For instructions, see Create tables and upload data.

    CREATE TABLE IF NOT EXISTS pyodps_iris
    (
      sepallength  DOUBLE COMMENT 'sepal length (cm)',
      sepalwidth   DOUBLE COMMENT 'sepal width (cm)',
      petallength  DOUBLE COMMENT 'petal length (cm)',
      petalwidth   DOUBLE COMMENT 'petal width (cm)',
      name         STRING COMMENT 'type'
    );
  2. Grant the new account permissions on the project and table. See Appendix: Grant permissions to another account.

  3. Create a PyODPS 3 node and run the following code. For instructions, see Develop a PyODPS 3 task.

    from odps import ODPS
    import os
    
    # Store credentials in environment variables rather than hardcoding them.
    os.environ['ALIBABA_CLOUD_ACCESS_KEY_ID'] = '<your-access-key-id>'
    os.environ['ALIBABA_CLOUD_ACCESS_KEY_SECRET'] = '<your-access-key-secret>'
    
    od = o.as_account(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET')
    )
    
    # Query rows where sepallength > 5.
    with od.execute_sql('SELECT * FROM pyodps_iris WHERE sepallength > 5').open_reader() as reader:
        print(reader.raw)
        for record in reader:
            print(record["sepallength"], record["sepalwidth"],
                  record["petallength"], record["petalwidth"], record["name"])
    
    # Verify the current user.
    print(od.get_project().current_user)
  4. Run the node. The output looks similar to the following:

    Executing user script with PyODPS 0.11.4.post0
    
    "sepallength","sepalwidth","petallength","petalwidth","name"
    5.4,3.9,1.7,0.4,"Iris-setosa"
    ...
    <User 139xxxxxxxxxxxxx>

Diagnostics

If node execution hangs with no output, add the following comment to the top of your code. DataWorks prints the stack trace of all threads every 30 seconds.

# -*- dump_traceback: true -*-

This feature requires PyODPS 3 nodes running versions later than 0.11.4.1.

Check the PyODPS version

Run the following code in a PyODPS node to print the installed version:

import odps
print(odps.__version__)
# Example output: 0.11.2.3

The version is also shown in the node's runtime log.

Appendix: Grant permissions to another account

To let a different Alibaba Cloud account access projects and tables in the current workspace, create an ODPS SQL node and run the following commands. For instructions on creating the node, see Create an ODPS SQL node. For more information about permissions, see Users and permissions.

-- Add the account to the project.
ADD USER ALIYUN$<account_name>;

-- Grant CreateInstance permission on the project.
GRANT CreateInstance ON PROJECT <project_name> TO USER ALIYUN$<account_name>;

-- Grant Describe and Select permissions on the table.
GRANT Describe, Select ON TABLE <table_name> TO USER ALIYUN$<account_name>;

-- Verify the permissions.
SHOW GRANTS FOR ALIYUN$<account_name>;

Appendix: Sample data

The examples in this topic use the pyodps_iris table. To create the table and import the iris dataset, follow Step 1 in Use a PyODPS node to query data based on specific criteria.

What's next