This topic describes how to use a PyODPS node to de-duplicate data.
Prerequisites
The following operations are completed:
Procedure
- Prepare test data.
- Download the iris dataset named iris.data and rename it iris.csv.
- Create a table named pyodps_iris and import the data in the iris.csv dataset to the table. For more information, see Create tables and import data.
In this example, execute the following statement to create the pyodps_iris table:
CREATE TABLE if not exists pyodps_iris
(
sepallength DOUBLE comment 'sepal length (cm)',
sepalwidth DOUBLE comment 'sepal width (cm)',
petallength DOUBLE comment 'petal length (cm)',
petalwidth DOUBLE comment 'petal width (cm)',
name STRING comment 'type'
);
- Log on to the DataWorks console.
- In the left-side navigation pane, click Workspaces.
- On the Workspaces page, find the target workspace and click Data Analytics in the Actions column.
- On the Data Development tab, right-click the target workflow in the Business process
section and choose.
- In the New node dialog box, set the Node name parameter and click Submit.
- On the configuration tab of the PyODPS 2 node, enter the code of the node in the code
editor.
In this example, enter the following code:
from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
print iris[['name']].distinct()
print iris.distinct('name')
print iris.distinct('name','sepallength').head(3)
# You can call the unique method to de-duplicate data in the specified sequence. The sequence whose data is de-duplicated by using the unique method cannot be selected as a column.
print iris.name.unique()
- Click the Run icon in the toolbar.
- View the running result of the PyODPS 2 node on the Run Log tab.

In this example, the following information appears on the Run Log tab:
Sql compiled:
CREATE TABLE tmp_pyodps_ed85ebd5_d678_44dd_9ece_bff1822376f6 LIFECYCLE 1 AS
SELECT DISTINCT t1.`name`
FROM WB_BestPractice_dev.`pyodps_iris` t1
Instance ID: 2019101006391142g2cp5692
name
0 Iris-setosa
1 Iris-versicolor
2 Iris-virginica
Sql compiled:
CREATE TABLE tmp_pyodps_8ce6128f_9c6f_45af_b9de_c73ce9d5ba51 LIFECYCLE 1 AS
SELECT DISTINCT t1.`name`
FROM WB_BestPractice_dev.`pyodps_iris` t1
Instance ID: 20191010063915987gmuws592
name
0 Iris-setosa
1 Iris-versicolor
2 Iris-virginica
Sql compiled:
CREATE TABLE tmp_pyodps_a3dc338e_0fea_4d5f_847c_79fb19ec1c72 LIFECYCLE 1 AS
SELECT DISTINCT t1.`name`, t1.`sepallength`
FROM WB_BestPractice_dev.`pyodps_iris` t1
Instance ID: 2019101006392210gj056292
name sepallength
0 Iris-setosa 4.3
1 Iris-setosa 4.4
2 Iris-setosa 4.5
Sql compiled:
CREATE TABLE tmp_pyodps_bc0917bb_f10c_426b_9b75_47e94478382a LIFECYCLE 1 AS
SELECT t2.`name`
FROM (
SELECT DISTINCT t1.`name`
FROM WB_BestPractice_dev.`pyodps_iris` t1
) t2
Instance ID: 20191010063927189g9fsz192
name
0 Iris-setosa
1 Iris-versicolor
2 Iris-virginica