MaxCompute: Use a PyODPS node to de-duplicate data

Last Updated: Jan 23, 2025

This topic describes how to use a PyODPS node to de-duplicate data.

Prerequisites

The following operations are performed:

Procedure

  1. Create a table and import data to the table.

    1. Download the iris.data file of the iris dataset and rename it iris.csv.

    2. Create a table named pyodps_iris and upload the dataset iris.csv to the table. For more information, see Create tables and upload data.

      Sample statement:

      CREATE TABLE IF NOT EXISTS pyodps_iris
      (
      sepallength  DOUBLE COMMENT 'sepal length (cm)',
      sepalwidth   DOUBLE COMMENT 'sepal width (cm)',
      petallength  DOUBLE COMMENT 'petal length (cm)',
      petalwidth   DOUBLE COMMENT 'petal width (cm)',
      name         STRING COMMENT 'type'
      );
  2. Log on to the DataWorks console.

  3. In the left-side navigation pane, click Workspace.

  4. Find your workspace, and choose Shortcuts > Data Development in the Actions column.

  5. On the DataStudio page, right-click the created workflow and choose Create Node > MaxCompute > PyODPS 2.

  6. In the Create Node dialog box, specify Name and click Confirm.

  7. On the configuration tab of the PyODPS 2 node, enter the code of the node in the code editor.

    In this example, enter the following code:

    from odps.df import DataFrame
    # o is the MaxCompute entry object that DataWorks provides in PyODPS nodes.
    iris = DataFrame(o.get_table('pyodps_iris'))
    # Distinct values of the name column.
    print iris[['name']].distinct()
    # Same result: de-duplicate on the name column.
    print iris.distinct('name')
    # Distinct (name, sepallength) combinations; show the first three rows.
    print iris.distinct('name','sepallength').head(3)
    # unique de-duplicates a single sequence (column). A sequence on which
    # unique is called cannot be used in column selection.
    print iris.name.unique()
  8. Click the Run icon in the toolbar.

  9. View the running result of the PyODPS 2 node on the Run Log tab.

    In this example, the following information appears on the Run Log tab:

    Sql compiled:
    CREATE TABLE tmp_pyodps_ed85ebd5_d678_44dd_9ece_bff1822376f6 LIFECYCLE 1 AS
    SELECT DISTINCT t1.`name`
    FROM WB_BestPractice_dev.`pyodps_iris` t1
    Instance ID: 2019101006391142g2cp5692
    
    
                  name
    0      Iris-setosa
    1  Iris-versicolor
    2   Iris-virginica
    Sql compiled:
    CREATE TABLE tmp_pyodps_8ce6128f_9c6f_45af_b9de_c73ce9d5ba51 LIFECYCLE 1 AS
    SELECT DISTINCT t1.`name`
    FROM WB_BestPractice_dev.`pyodps_iris` t1
    
    Instance ID: 20191010063915987gmuws592
    
    
                  name
    0      Iris-setosa
    1  Iris-versicolor
    2   Iris-virginica
    Sql compiled:
    CREATE TABLE tmp_pyodps_a3dc338e_0fea_4d5f_847c_79fb19ec1c72 LIFECYCLE 1 AS
    SELECT DISTINCT t1.`name`, t1.`sepallength`
    FROM WB_BestPractice_dev.`pyodps_iris` t1
    
    Instance ID: 2019101006392210gj056292
    
    
              name  sepallength
    0  Iris-setosa          4.3
    1  Iris-setosa          4.4
    2  Iris-setosa          4.5
    Sql compiled:
    CREATE TABLE tmp_pyodps_bc0917bb_f10c_426b_9b75_47e94478382a LIFECYCLE 1 AS
    SELECT t2.`name`
    FROM (
      SELECT DISTINCT t1.`name`
      FROM WB_BestPractice_dev.`pyodps_iris` t1
    ) t2
    Instance ID: 20191010063927189g9fsz192
    
                  name
    0      Iris-setosa
    1  Iris-versicolor
    2   Iris-virginica
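
The difference between de-duplicating on one column and on several columns can be checked locally without a MaxCompute project. The following is a minimal sketch in plain Python, not PyODPS code: the `distinct` helper and the `sample` rows are hypothetical and hand-written for illustration, and the helper only mimics what the compiled `SELECT DISTINCT` statements in the log do. It keeps rows in first-seen order, whereas the row order of a `SELECT DISTINCT` result should not be relied on.

```python
# Local, stdlib-only analogue of the distinct calls in the node code.
# This is NOT PyODPS: distinct() and sample are hypothetical illustrations.

def distinct(rows, *columns):
    """Return the unique combinations of the given columns, in first-seen order.

    If no columns are given, all columns of the first row are compared.
    """
    if not rows:
        return []
    keys = columns or tuple(rows[0].keys())
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[c] for c in keys)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(keys, key)))
    return result


# Hand-written sample rows, not the real iris dataset.
sample = [
    {"name": "Iris-setosa", "sepallength": 5.1},
    {"name": "Iris-setosa", "sepallength": 4.9},
    {"name": "Iris-setosa", "sepallength": 5.1},
    {"name": "Iris-versicolor", "sepallength": 7.0},
    {"name": "Iris-virginica", "sepallength": 6.3},
    {"name": "Iris-virginica", "sepallength": 6.3},
]

# Analogue of iris.distinct('name'): one row per species.
by_name = distinct(sample, "name")

# Analogue of iris.distinct('name', 'sepallength'): one row per unique pair.
by_name_and_length = distinct(sample, "name", "sepallength")

print(by_name)
print(by_name_and_length)
```

Adding a column to the de-duplication key can only keep the number of result rows the same or increase it, which is why the third query in the log returns several Iris-setosa rows while the first two return one row per species.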