Use a PyODPS node to read data from a partitioned table - MaxCompute

This topic describes how to use a PyODPS node to read data from a partitioned table.

Prerequisites

The following operations are completed:

MaxCompute is activated. For more information, see Activate MaxCompute.
DataWorks is activated. For more information, see Activate DataWorks.
A workflow is created in the DataWorks console. In this example, a workflow is created for a DataWorks workspace in basic mode. For more information, see Create a workflow.

Procedure

Prepare test data.
1. Create a partitioned table and a source table, and import data to the source table. For more information, see Create tables and upload data.
  In this example, use the following table creation statements and source data.
  - Execute the following statement to create a partitioned table named user_detail:
```
create table if not exists user_detail
(
userid    BIGINT comment 'user ID',
job       STRING comment 'job type',
education STRING comment 'education level'
) comment 'user information table'
partitioned by (dt STRING comment 'date',region STRING comment 'region');
```
  - Execute the following statement to create a source table named user_detail_ods:
```
create table if not exists user_detail_ods
(
  userid    BIGINT comment 'user ID',
  job       STRING comment 'job type',
  education STRING comment 'education level',
  dt STRING comment 'date',
  region STRING comment 'region'
);
```
  - Create a source data file named user_detail.txt and save the following data to the file. Import the data to the user_detail_ods table.
```
0001,Internet,bachelor,20190715,beijing
0002,education,junior college,20190716,beijing
0003,finance,master,20190715,shandong
0004,Internet,master,20190715,beijing
```
2. Log on to the DataWorks console and click Workspaces in the left-side navigation pane. On the Workspaces page, find the target workspace and click Data Analytics in the Actions column. On the Data Development tab, right-click the target workflow in the Business process section and choose New > MaxCompute > ODPS SQL.
3. In the New node dialog box, set the Node name parameter and click Submit.
4. On the configuration tab of the ODPS SQL node, enter the following code in the code editor:
```
insert overwrite table user_detail partition (dt,region)
select userid,job,education,dt,region from user_detail_ods;
```
5. Click the Run icon in the toolbar to insert the data from the user_detail_ods table to the user_detail table.

Use a PyODPS node to read data from a partitioned table.

Log on to the DataWorks console.
In the left-side navigation pane, click Workspaces.
On the Workspaces page, find the target workspace and click Data Analytics in the Actions column.
On the Data Development tab, right-click the target workflow in the Business process section and choose New > MaxCompute > PyODPS 2.
In the New node dialog box, set the Node name parameter and click Submit.

On the configuration tab of the PyODPS 2 node, enter the following code in the code editor:

import sys
from odps import ODPS
reload(sys)
print('dt=' + args['dt'])
# Set UTF-8 as the default encoding format.
sys.setdefaultencoding('utf8')
# Obtain the partitioned table.
t = o.get_table('user_detail')
# Check whether the specified partition exists.
print t.exist_partition('dt=20190715,region=beijing')
# View all partitions in the partitioned table.
for partition in t.partitions:
    print partition.name
# You can use one of the following methods to query data in the partitioned table:
# Method 1
with t.open_reader(partition='dt=20190715,region=beijing') as reader1:
    count = reader1.count
print("Query data in the partitioned table by using Method 1:")
for record in reader1:
    print record[0],record[1],record[2]
# Method 2
print("Query data in the partitioned table by using Method 2:")
reader2 = t.open_reader(partition='dt=20190715,region=beijing')
for record in reader2:
    print record["userid"],record["job"],record["education"]
# Method 3
print("Query data in the partitioned table by using Method 3:")
for record in o.read_table('user_detail', partition='dt=20190715,region=beijing'):
    print record["userid"],record["job"],record["education"]

Click Advanced run (run with parameters) in the toolbar.
In the Parameters dialog box, set the parameters and click Confirm.
Parameter description:
- Scheduling Resource Group: Set this parameter to Common scheduler resource group.
- dt: Set this parameter to dt=20190715.
View the running result of the PyODPS 2 node on the Run Log tab.