This topic describes how to use a PyODPS node to de-duplicate data.

Prerequisites

  • MaxCompute is activated. For more information, see Activate MaxCompute.
  • DataWorks is activated. For more information, see Activate DataWorks.
  • A workflow is created in the DataWorks console. In this example, a workflow is created for a DataWorks workspace in basic mode. For more information, see Create a workflow.

Procedure

  1. Create a table and import data to it.
    1. Download the Iris dataset named iris.data and rename it iris.csv.
    2. Create a table named pyodps_iris and upload the iris.csv dataset. For more information, see Create tables and import data.
      In this example, execute the following statement to create the pyodps_iris table:
      CREATE TABLE if not exists pyodps_iris
      (
      sepallength  DOUBLE comment 'sepal length (cm)',
      sepalwidth   DOUBLE comment 'sepal width (cm)',
      petallength  DOUBLE comment ''petal length (cm)',
      petalwidth   DOUBLE comment 'petal width (cm)',
      name         STRING comment 'type'
      );
  2. Log on to the DataWorks console.
  3. In the left-side navigation pane, click Workspaces.
  4. On the page that appears, find the workspace that you want to manage and click Data Analytics in the Actions column.
  5. On the DataStudio page, right-click a workflow and choose New > MaxCompute > PyODPS 2.
  6. In the New node dialog box, specify the Node name parameter and click Submit.
  7. On the configuration tab of the PyODPS 2 node, enter the code of the node in the code editor.
    In this example, enter the following code:
    from  odps.df import DataFrame
    iris = DataFrame(o.get_table('pyodps_iris'))
    print iris[['name']].distinct()
    print iris.distinct('name')
    print iris.distinct('name','sepallength').head(3)
    # You can call the unique method to de-duplicate data in the specified sequence. The sequence whose data is de-duplicated by using the unique method cannot be selected as a column.
    print iris.name.unique()
  8. Click the Run icon in the toolbar.
  9. View the running result of the PyODPS 2 node on the Run Log tab.Run Log
    In this example, the following information appears on the Run Log tab:
    Sql compiled:
    CREATE TABLE tmp_pyodps_ed85ebd5_d678_44dd_9ece_bff1822376f6 LIFECYCLE 1 AS
    SELECT DISTINCT t1.`name`
    FROM WB_BestPractice_dev.`pyodps_iris` t1
    Instance ID: 2019101006391142g2cp5692
    
    
                  name
    0      Iris-setosa
    1  Iris-versicolor
    2   Iris-virginica
    Sql compiled:
    CREATE TABLE tmp_pyodps_8ce6128f_9c6f_45af_b9de_c73ce9d5ba51 LIFECYCLE 1 AS
    SELECT DISTINCT t1.`name`
    FROM WB_BestPractice_dev.`pyodps_iris` t1
    
    Instance ID: 20191010063915987gmuws592
    
    
                  name
    0      Iris-setosa
    1  Iris-versicolor
    2   Iris-virginica
    Sql compiled:
    CREATE TABLE tmp_pyodps_a3dc338e_0fea_4d5f_847c_79fb19ec1c72 LIFECYCLE 1 AS
    SELECT DISTINCT t1.`name`, t1.`sepallength`
    FROM WB_BestPractice_dev.`pyodps_iris` t1
    
    Instance ID: 2019101006392210gj056292
    
    
              name  sepallength
    0  Iris-setosa          4.3
    1  Iris-setosa          4.4
    2  Iris-setosa          4.5
    Sql compiled:
    CREATE TABLE tmp_pyodps_bc0917bb_f10c_426b_9b75_47e94478382a LIFECYCLE 1 AS
    SELECT t2.`name`
    FROM (
      SELECT DISTINCT t1.`name`
      FROM WB_BestPractice_dev.`pyodps_iris` t1
    ) t2
    Instance ID: 20191010063927189g9fsz192
    
                  name
    0      Iris-setosa
    1  Iris-versicolor
    2   Iris-virginica