PyODPS provides three ways to filter rows from a MaxCompute table inside a DataWorks PyODPS node: raw SQL via execute_sql(), DataFrame boolean indexing, and the query() string method. This tutorial walks through all three using the Iris dataset.
Prerequisites
Before you begin, ensure that you have:
-
Created a workflow in DataWorks — the steps in this topic are based on a workspace in basic mode that has not joined the public preview of DataStudio
Step 1: Prepare the dataset
-
Download the Iris dataset, decompress the archive, and rename
iris.datatoiris.csv. -
Create a table named
pyodps_irisand uploadiris.csv. For details, see Create a table and upload data. Use the following DDL to create the table:CREATE TABLE if not exists pyodps_iris ( sepallength DOUBLE comment 'sepal length (cm)', sepalwidth DOUBLE comment 'sepal width (cm)', petallength DOUBLE comment 'petal length (cm)', petalwidth DOUBLE comment 'petal width (cm)', name STRING comment 'type' );
Step 2: Create a PyODPS node
-
Log on to the DataWorks console.
-
In the left navigation pane, click Workspace.
-
In the Actions column, choose Shortcuts > Data Development.
-
On the Data Development page, right-click the flow and select Create Node > MaxCompute > PyODPS 2.
-
In the Create Node dialog box, enter a node name and click Confirm.
Step 3: Write the query code
Choose the method that fits your use case:
| Method | Best for |
|---|---|
execute_sql() |
Existing SQL queries, full result sets, complex joins |
DataFrame boolean indexing (iris[condition]) |
Chaining multiple filter steps in Python |
query() |
Readable multi-condition filters written as a string |
Enter the following code in the PyODPS node. All three methods filter rows from the same pyodps_iris table.
import sys
reload(sys)
# Change the default encoding of the system.
sys.setdefaultencoding('utf8')
iris = DataFrame(o.get_table('pyodps_iris'))
# Method 1: Run a SQL query directly.
# Equivalent SQL: SELECT * FROM pyodps_iris WHERE sepallength > 5
with o.execute_sql('select * from pyodps_iris WHERE sepallength > 5 ').open_reader() as reader4:
print reader4.raw
for record in reader4:
print record["sepallength"],record["sepalwidth"],record["petallength"],record["petalwidth"],record["name"]
# Method 2: Filter using DataFrame boolean indexing.
# The expression iris.sepallength > 5 produces a boolean series;
# only rows where the value is True are returned.
print iris[iris.sepallength > 5].head(5)
# Method 3: Filter using the query() string method.
print iris.query("(sepallength < 5) and (petallength > 1.5)").head(5)
Methods 2 and 3 use the PyODPS DataFrame API, which translates Python expressions into SQL and submits them as MaxCompute jobs. Calling.head(n)triggers execution and returns the firstnrows.
Step 4: Run the node and view results
-
Click
.
-
View the output in the Log panel. Method 1 prints each matching record and shows the compiled SQL:
Executing user script with PyODPS 0.8.0 "sepallength","sepalwidth","petallength","petalwidth","name" 5.1,3.5,1.4,0.2,"Iris-setosa" 5.4,3.9,1.7,0.4,"Iris-setosa" ... Sql compiled: CREATE TABLE tmp_pyodps_xxxx LIFECYCLE 1 AS SELECT * FROM xxx.`pyodps_iris` t1 WHERE t1.`sepallength` > 5Method 2 prints the top 5 rows as a DataFrame table:
sepallength sepalwidth petallength petalwidth name 0 5.1 3.5 1.4 0.2 Iris-setosa 1 5.4 3.9 1.7 0.4 Iris-setosa 2 5.4 3.7 1.5 0.2 Iris-setosa 3 5.8 4.0 1.2 0.2 Iris-setosa 4 5.7 4.4 1.5 0.4 Iris-setosaMethod 3 prints the top 5 rows matching both conditions:
sepallength sepalwidth petallength petalwidth name 0 4.8 3.4 1.6 0.2 Iris-setosa 1 4.8 3.4 1.9 0.2 Iris-setosa 2 4.7 3.2 1.6 0.2 Iris-setosa 3 4.8 3.1 1.6 0.2 Iris-setosa 4 4.9 2.4 3.3 1.0 Iris-versicolor