This topic describes how to use a PyODPS node to de-duplicate data.
Prerequisites
Before you begin, make sure that the following operations are complete:
MaxCompute and DataWorks have been activated. For more information, see Activate MaxCompute and DataWorks and Purchase guide.
A workflow is created in the DataWorks console. In this example, a workflow is created for a DataWorks workspace in basic mode. For more information, see Create a workflow.
Procedure
Create a table and import data to the table.
Download the iris.data dataset from iris and rename the file to iris.csv.
Create a table named pyodps_iris and upload the dataset iris.csv to the table. For more information, see Create tables and upload data.
Sample statement:
CREATE TABLE IF NOT EXISTS pyodps_iris
(
    sepallength DOUBLE COMMENT 'sepal length (cm)',
    sepalwidth  DOUBLE COMMENT 'sepal width (cm)',
    petallength DOUBLE COMMENT 'petal length (cm)',
    petalwidth  DOUBLE COMMENT 'petal width (cm)',
    name        STRING COMMENT 'type'
);
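Before you upload iris.csv, it can help to confirm that each row matches the five columns of pyodps_iris. The following is a minimal local sketch in Python 3, not code for the PyODPS node. It assumes that the file has no header row and may end with blank lines. The helper name parse_iris_rows and the two-line sample are hypothetical and only illustrate the layout.

```python
import csv
import io

# Column names and types, taken from the CREATE TABLE statement above.
SCHEMA = [
    ("sepallength", float),
    ("sepalwidth", float),
    ("petallength", float),
    ("petalwidth", float),
    ("name", str),
]


def parse_iris_rows(text):
    """Parse header-less iris CSV text into dicts keyed by column name."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            # Assumption: the raw file may contain trailing blank lines.
            continue
        if len(record) != len(SCHEMA):
            raise ValueError(
                "expected %d fields, got %d" % (len(SCHEMA), len(record))
            )
        rows.append(
            {name: cast(value) for (name, cast), value in zip(SCHEMA, record)}
        )
    return rows


# Hypothetical two-line sample in the same layout as iris.csv.
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n\n"
rows = parse_iris_rows(sample)
print(rows[0]["name"], rows[0]["sepallength"])
```

A row with a missing or extra field raises ValueError, so a malformed file is caught before the upload rather than after it.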
Log on to the DataWorks console.
In the left-side navigation pane, click Workspace.
Find your workspace and go to the DataStudio page from the Actions column.
On the DataStudio page, right-click the created workflow and create a PyODPS 2 node.
In the Create Node dialog box, specify Name and click Confirm.
On the configuration tab of the PyODPS 2 node, enter the code of the node in the code editor.
In this example, enter the following code:
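from odps.df import DataFrame

# o is the MaxCompute entry object that the PyODPS node provides.
iris = DataFrame(o.get_table('pyodps_iris'))

# Select the name column, then de-duplicate it.
print iris[['name']].distinct()

# De-duplicate on the name column directly.
print iris.distinct('name')

# De-duplicate on the combination of name and sepallength and show the first three rows.
print iris.distinct('name','sepallength').head(3)

# You can call the unique method on a sequence to de-duplicate its values.
# A sequence that is de-duplicated by using the unique method cannot be used in a column selection.
print iris.name.unique()
Click the Run icon in the toolbar.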
View the running result of the PyODPS 2 node on the Run Log tab.

In this example, the following information appears on the Run Log tab:
Sql compiled:
CREATE TABLE tmp_pyodps_ed85ebd5_d678_44dd_9ece_bff1822376f6 LIFECYCLE 1 AS
SELECT DISTINCT t1.`name`
FROM WB_BestPractice_dev.`pyodps_iris` t1
Instance ID: 2019101006391142g2cp5692
              name
0      Iris-setosa
1  Iris-versicolor
2   Iris-virginica

Sql compiled:
CREATE TABLE tmp_pyodps_8ce6128f_9c6f_45af_b9de_c73ce9d5ba51 LIFECYCLE 1 AS
SELECT DISTINCT t1.`name`
FROM WB_BestPractice_dev.`pyodps_iris` t1
Instance ID: 20191010063915987gmuws592
              name
0      Iris-setosa
1  Iris-versicolor
2   Iris-virginica

Sql compiled:
CREATE TABLE tmp_pyodps_a3dc338e_0fea_4d5f_847c_79fb19ec1c72 LIFECYCLE 1 AS
SELECT DISTINCT t1.`name`, t1.`sepallength`
FROM WB_BestPractice_dev.`pyodps_iris` t1
Instance ID: 2019101006392210gj056292
          name  sepallength
0  Iris-setosa          4.3
1  Iris-setosa          4.4
2  Iris-setosa          4.5

Sql compiled:
CREATE TABLE tmp_pyodps_bc0917bb_f10c_426b_9b75_47e94478382a LIFECYCLE 1 AS
SELECT t2.`name`
FROM (
  SELECT DISTINCT t1.`name`
  FROM WB_BestPractice_dev.`pyodps_iris` t1
) t2
Instance ID: 20191010063927189g9fsz192
              name
0      Iris-setosa
1  Iris-versicolor
2   Iris-virginica
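
Each result above comes from a SELECT DISTINCT over one or more columns. To see what that de-duplication does without connecting to MaxCompute, the following local Python 3 sketch mimics it on a handful of rows. It is not PyODPS code: the distinct function and the sample rows are hypothetical, and the sorting is added only to make the output stable, because SELECT DISTINCT by itself does not guarantee an order.

```python
def distinct(rows, *columns):
    """Return the unique combinations of the given columns.

    With no columns given, whole rows are de-duplicated.
    Results are sorted only to keep the output deterministic.
    """
    if not rows:
        return []
    keys = columns or tuple(rows[0].keys())
    seen = {tuple(row[k] for k in keys) for row in rows}
    return [dict(zip(keys, values)) for values in sorted(seen)]


# Hypothetical sample rows with two of the pyodps_iris columns.
rows = [
    {"name": "Iris-setosa", "sepallength": 4.4},
    {"name": "Iris-setosa", "sepallength": 4.3},
    {"name": "Iris-setosa", "sepallength": 4.3},
    {"name": "Iris-versicolor", "sepallength": 7.0},
    {"name": "Iris-virginica", "sepallength": 6.3},
    {"name": "Iris-virginica", "sepallength": 6.3},
]

# Analogous to iris.distinct('name'): one row per species.
names = distinct(rows, "name")
print([r["name"] for r in names])

# Analogous to iris.distinct('name', 'sepallength').head(3):
# one row per (name, sepallength) pair, then the first three.
pairs = distinct(rows, "name", "sepallength")
print(pairs[:3])
```

De-duplicating on more columns can only keep the same number of rows or more, which is why the two-column result in the log lists several Iris-setosa rows while the one-column result lists each species once.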