Sort data in DataWorks using a PyODPS node - MaxCompute

This tutorial shows how to sort DataFrame data using the PyODPS sort() method in a DataWorks PyODPS 2 node. Each call to sort() compiles to an ORDER BY clause in MaxCompute SQL.

Prerequisites

Before you begin, ensure that you have:

Activated MaxCompute and DataWorks. See Activate MaxCompute and DataWorks and Purchase guide
Created a workflow in a DataWorks workspace. This tutorial uses a workspace in basic mode. See Create a workflow

Step 1: Create a table and import data

Download the iris dataset from iris and rename iris.data to iris.csv.

Create a table named pyodps_iris and upload iris.csv to the table. See Create tables and upload data. Use the following DDL statement to create the table:

CREATE TABLE if not exists pyodps_iris
(
sepallength  DOUBLE comment 'sepal length (cm)',
sepalwidth   DOUBLE comment 'sepal width (cm)',
petallength  DOUBLE comment 'petal length (cm)',
petalwidth   DOUBLE comment 'petal width (cm)',
name         STRING comment 'type'
);

Step 2: Create a PyODPS 2 node

Log on to the DataWorks console.
In the left-side navigation pane, click Workspace.
Find your workspace and choose Shortcuts > Data Development in the Actions column.
On the DataStudio page, right-click your workflow and choose Create Node > MaxCompute > PyODPS 2.
In the Create Node dialog box, enter a name and click Confirm.

Step 3: Write and run the sorting code

On the configuration tab of the PyODPS 2 node, enter the following code in the code editor, then click the Run icon in the toolbar.

The sort() method accepts a column name string or a column reference object. All examples below call .head(5) to return only the first five rows.

Sort by a single column in ascending order

Ascending order is the default. Pass the column name as a string.

from  odps.df import DataFrame

iris = DataFrame(o.get_table('pyodps_iris'))

print iris.sort('sepalwidth').head(5)

Expected output:

Sql compiled:
CREATE TABLE tmp_pyodps_d1b06785_dc18_4288_ad34_de860de1be08 LIFECYCLE 1 AS
SELECT *
FROM WB_BestPractice_dev.`pyodps_iris` t1
ORDER BY sepalwidth
LIMIT 10000
Instance ID: 20191010061554817gwml0lim

   sepallength  sepalwidth  petallength  petalwidth             name
0          5.0         2.0          3.5         1.0  Iris-versicolor
1          6.0         2.2          5.0         1.5   Iris-virginica
2          6.2         2.2          4.5         1.5  Iris-versicolor
3          6.0         2.2          4.0         1.0  Iris-versicolor
4          5.5         2.3          4.0         1.3  Iris-versicolor

Sort by a single column in descending order

Two syntax options are available. Choose based on how you reference the column:

Use ascending=False when the column name is a string.
Use a hyphen prefix (-iris.column) when you have a column reference object.

Option 1: `ascending=False`

print iris.sort('sepalwidth', ascending=False).head(5)

Expected output:

Sql compiled:
CREATE TABLE tmp_pyodps_3cb90bb2_fb95_43fb_ae84_f2b5a27d72dc LIFECYCLE 1 AS
SELECT *
FROM WB_BestPractice_dev.`pyodps_iris` t1
ORDER BY sepalwidth DESC
LIMIT 10000
Instance ID: 20191010061601287gs086792

   sepallength  sepalwidth  petallength  petalwidth         name
0          5.7         4.4          1.5         0.4  Iris-setosa
1          5.5         4.2          1.4         0.2  Iris-setosa
2          5.2         4.1          1.5         0.1  Iris-setosa
3          5.8         4.0          1.2         0.2  Iris-setosa
4          5.4         3.9          1.3         0.4  Iris-setosa

Option 2: Hyphen prefix on a column reference

print iris.sort(-iris.sepalwidth).head(5)

Expected output:

Sql compiled:
CREATE TABLE tmp_pyodps_97b080bb_e014_48e8_a310_4b45fcd6a2ed LIFECYCLE 1 AS
SELECT *
FROM WB_BestPractice_dev.`pyodps_iris` t1
ORDER BY sepalwidth DESC
LIMIT 10000
Instance ID: 20191010061606927g6emz192

   sepallength  sepalwidth  petallength  petalwidth         name
0          5.7         4.4          1.5         0.4  Iris-setosa
1          5.5         4.2          1.4         0.2  Iris-setosa
2          5.2         4.1          1.5         0.1  Iris-setosa
3          5.8         4.0          1.2         0.2  Iris-setosa
4          5.4         3.9          1.3         0.4  Iris-setosa

Both options produce the same SQL (ORDER BY sepalwidth DESC) and the same results.

Sort by multiple columns in the same direction

Pass a list of column name strings. All columns sort in ascending order by default.

print iris.sort(['sepalwidth', 'petallength']).head(5)

Expected output:

Sql compiled:
CREATE TABLE tmp_pyodps_6fe37b6e_6705_4052_b733_211eb9bd16ac LIFECYCLE 1 AS
SELECT *
FROM WB_BestPractice_dev.`pyodps_iris` t1
ORDER BY sepalwidth, petallength
LIMIT 10000
Instance ID: 20191010061611714gn586792

   sepallength  sepalwidth  petallength  petalwidth             name
0          5.0         2.0          3.5         1.0  Iris-versicolor
1          6.0         2.2          4.0         1.0  Iris-versicolor
2          6.2         2.2          4.5         1.5  Iris-versicolor
3          6.0         2.2          5.0         1.5   Iris-virginica
4          4.5         2.3          1.3         0.3      Iris-setosa

Sort by multiple columns in different directions

Two syntax options are available:

Use ascending with a list of boolean values — one value per column.
Use a hyphen prefix on column references for descending columns within the list.

Option 1: Boolean list

print iris.sort(['sepalwidth', 'petallength'], ascending=[True, False]).head(5)

Expected output:

Sql compiled:
CREATE TABLE tmp_pyodps_a52c805c_94a1_4a75_a6af_4fc9ed06ae68 LIFECYCLE 1 AS
SELECT *
FROM WB_BestPractice_dev.`pyodps_iris` t1
ORDER BY sepalwidth, petallength DESC
LIMIT 10000
Instance ID: 20191010061616553gw3m9592

   sepallength  sepalwidth  petallength  petalwidth             name
0          5.0         2.0          3.5         1.0  Iris-versicolor
1          6.0         2.2          5.0         1.5   Iris-virginica
2          6.2         2.2          4.5         1.5  Iris-versicolor
3          6.0         2.2          4.0         1.0  Iris-versicolor
4          6.3         2.3          4.4         1.3  Iris-versicolor

Option 2: Hyphen prefix on column references

print iris.sort(['sepalwidth', -iris.petallength]).head(5)

Expected output:

Sql compiled:
CREATE TABLE tmp_pyodps_aac5538e_9b40_4078_b3c6_852b99c663c1 LIFECYCLE 1 AS
SELECT *
FROM WB_BestPractice_dev.`pyodps_iris` t1
ORDER BY sepalwidth, petallength DESC
LIMIT 10000
Instance ID: 20191010061621329gvmkc292

   sepallength  sepalwidth  petallength  petalwidth             name
0          5.0         2.0          3.5         1.0  Iris-versicolor
1          6.0         2.2          5.0         1.5   Iris-virginica
2          6.2         2.2          4.5         1.5  Iris-versicolor
3          6.0         2.2          4.0         1.0  Iris-versicolor
4          6.3         2.3          4.4         1.3  Iris-versicolor

Both options produce identical SQL and results. The boolean list is easier to read when column names are long; the hyphen prefix is more concise when you already have column reference objects.

View the run log

After the node runs, view the output on the Run Log tab.

Each sort operation generates a separate SQL statement and instance ID.