PyODPS is the MaxCompute SDK for Python. It provides easy-to-use Python programming interfaces. Similar to pandas, PyODPS provides fast, flexible, and expressive data structures: its DataFrame API offers data processing features similar to those of pandas. This topic describes how to use PyODPS in your projects.

Prerequisites

Background information

You can develop most programs in MaxCompute by using SQL statements. However, complex business logic and user-defined functions (UDFs) require Python. For example, you must use Python in the following scenarios:
  • Interface connection

    An external service must provide the required information to complete authentication before it can access data records in MaxCompute tables over HTTP interfaces.

  • Asynchronous calls

    In most cases, the system creates a separate node for each of thousands of tasks that share similar data processing logic. These nodes are difficult to manage and consume excessive resources when they run at the same time. PyODPS allows you to use queues to run SQL tasks asynchronously at high concurrency and to manage all tasks in a unified manner.
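    The fan-out pattern described above can be sketched in plain Python. The `submit_sql` function below is a hypothetical stand-in; in a real PyODPS node it would wrap `o.run_sql(sql)` (which submits the statement asynchronously) followed by `wait_for_success()` on the returned instance.

```python
# Sketch: run many similar SQL tasks from one PyODPS node instead of
# creating one DataWorks node per task.
from __future__ import print_function
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7, the version used by PyODPS nodes

def submit_sql(sql):
    # Hypothetical stand-in for o.run_sql(sql).wait_for_success().
    return 'finished: ' + sql

def worker(tasks, results):
    # Drain the shared queue until it is empty.
    while True:
        try:
            sql = tasks.get_nowait()
        except queue.Empty:
            return
        results.append(submit_sql(sql))

tasks = queue.Queue()
for i in range(20):
    tasks.put('SELECT {0};'.format(i))

results = []
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 20
```

    Because all 20 tasks run inside a single node, they can be throttled by the pool size (4 workers here) and monitored in one place.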

  • UDF development

    If the built-in functions of MaxCompute cannot meet your requirement, you can develop UDFs. For example, you can develop a UDF to format data of the DECIMAL type with thousands separators.
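    The formatting logic of such a UDF can be sketched as a plain Python function. In MaxCompute you would register it as a function decorated with `odps.udf.annotate`; the exact output format below (`'{:,}'`) is only one assumption about the desired separators.

```python
# Sketch of the formatting logic for a thousands-separator UDF.
# In MaxCompute this would be wrapped with odps.udf.annotate and
# registered as a function; here it is plain Python so it runs anywhere.
def format_thousands(value):
    # '{:,}' inserts comma thousands separators, e.g. 1234567 -> '1,234,567'.
    return '{:,}'.format(value)

print(format_thousands(1234567))     # 1,234,567
print(format_thousands(1234567.89))  # 1,234,567.89
```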

  • SQL performance improvement

    To check whether database transactions follow the first-in, first-out (FIFO) rule, you must compare each new record with the historical records in a single transaction table, which generally contains a large number of records. If you perform the comparison with SQL statements, the time complexity is O(N²), and the statements may fail to return the expected result. With Python, you can cache the records and traverse the transaction table only once. In this case, the time complexity is O(N), and the efficiency is improved significantly.
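    The single-pass idea can be sketched as follows. The event format, a sequence of `(action, item_id)` tuples, is a hypothetical simplification of the transaction table; the point is that a cached deque lets each record be examined once, for O(N) total work.

```python
from collections import deque

def follows_fifo(events):
    """One-pass FIFO check: each 'out' must remove the oldest cached 'in'.

    events is a hypothetical simplification of the transaction table:
    a sequence of (action, item_id) tuples, where action is 'in' or 'out'.
    """
    pending = deque()          # cache of items that entered but have not left
    for action, item_id in events:
        if action == 'in':
            pending.append(item_id)
        elif not pending or pending.popleft() != item_id:
            return False       # left out of order, or left without entering
    return True

print(follows_fifo([('in', 'a'), ('in', 'b'), ('out', 'a'), ('out', 'b')]))  # True
print(follows_fifo([('in', 'a'), ('in', 'b'), ('out', 'b')]))                # False
```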

  • Other scenarios

    You can use Python to develop an amount allocation model. For example, you can develop a model that allocates USD 10 among three persons. In this model, you must define the logic that handles the cash over and short, that is, the remainder left over after rounding.
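    A minimal sketch of such an allocation, under the assumption that each share is rounded down to the cent and the cash over and short is assigned to the last person:

```python
from decimal import Decimal, ROUND_DOWN

def allocate(total, people):
    """Split `total` evenly, assigning the rounding remainder to the last person."""
    total = Decimal(total)
    share = (total / people).quantize(Decimal('0.01'), rounding=ROUND_DOWN)
    shares = [share] * people
    shares[-1] += total - share * people   # the cash over and short
    return shares

print(allocate('10.00', 3))  # [Decimal('3.33'), Decimal('3.33'), Decimal('3.34')]
```

    Using Decimal rather than float keeps the shares exact, so the allocated amounts always sum to the original total.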

Procedure

  1. Create a PyODPS node.
    This section describes how to create a PyODPS 2 node in the DataWorks console. For more information, see Create a PyODPS 2 node.
    Note
    • The Python version of a PyODPS node is 2.7.
    • Each PyODPS node can process a maximum of 50 MB of data and occupy a maximum of 1 GB of memory. If a node exceeds these limits, DataWorks terminates it. Do not write Python code that processes a large amount of data in a PyODPS node.
    • Writing and debugging code in DataWorks is inefficient. We recommend that you install IntelliJ IDEA locally to write code.
    1. Create a workflow.
      Log on to the DataWorks console and go to the DataStudio page. On the left side of the page, right-click Business Flow and click Create Workflow.
    2. Create a PyODPS 2 node.
      Find the newly created workflow, right-click the workflow name, and then choose Create > MaxCompute > PyODPS 2. In the Create Node dialog box, specify Node Name and click Commit.
  2. Configure and run the PyODPS 2 node.
    1. Write the code of the PyODPS 2 node.
      Write the test code in the code editor of the PyODPS 2 node. In this example, write the following code, which demonstrates common table operations. For more information about table operations and SQL operations, see Tables and SQL.
      from odps import ODPS
      import sys

      reload(sys)
      # Set UTF-8 as the default encoding format. This statement is required if the data contains Chinese characters.
      sys.setdefaultencoding('utf8')

      # Create a non-partitioned table named my_new_table with the specified field names and data types.
      # Each PyODPS node in DataWorks contains the global variable odps or o, which is the MaxCompute entry. You can use the entry without defining it. For more information, see Use PyODPS in DataWorks.
      table = o.create_table('my_new_table', 'num bigint, id string', if_not_exists=True)

      # Write data to the my_new_table table.
      records = [[111, 'aaa'],
                 [222, 'bbb'],
                 [333, 'ccc'],
                 [444, 'Chinese']]
      o.write_table(table, records)

      # Read data from the my_new_table table.
      for record in o.read_table(table):
          print record[0], record[1]

      # Read data from the my_new_table table by executing an SQL statement.
      result = o.execute_sql('select * from my_new_table;', hints={'odps.sql.allow.fullscan': 'true'})

      # Obtain the execution result of the SQL statement.
      with result.open_reader() as reader:
          for record in reader:
              print record[0], record[1]

      # Delete the table.
      table.drop()
    2. Run the code.
      After you write the code, click the Run icon in the top toolbar. When the run completes, you can view the result of the PyODPS 2 node on the Runtime Log tab. If the log reports no errors, the code ran successfully.