This topic describes how to use a PyODPS node to scale data.

Prerequisites

  • MaxCompute is activated. For more information, see Activate MaxCompute.
  • DataWorks is activated. For more information, see Activate DataWorks.
  • A workflow is created in the DataWorks console. In this example, a workflow is created for a DataWorks workspace in basic mode. For more information, see Create a workflow.

Procedure

  1. Create a table and import data to it.
    1. Download the Iris dataset named iris.data and rename it iris.csv.
    2. Create a table named pyodps_iris and upload the iris.csv dataset. For more information, see Create tables and import data.
      In this example, execute the following statement to create the pyodps_iris table:
      CREATE TABLE IF NOT EXISTS pyodps_iris
      (
      sepallength  DOUBLE COMMENT 'sepal length (cm)',
      sepalwidth   DOUBLE COMMENT 'sepal width (cm)',
      petallength  DOUBLE COMMENT 'petal length (cm)',
      petalwidth   DOUBLE COMMENT 'petal width (cm)',
      name         STRING COMMENT 'type'
      );
  2. Log on to the DataWorks console.
  3. In the left-side navigation pane, click Workspaces.
  4. On the DataStudio page, right-click the workflow and choose New > MaxCompute > PyODPS 2.
  5. In the New node dialog box, specify the Node name parameter and click Submit.
  6. On the configuration tab of the PyODPS 2 node, enter the code of the node in the code editor:
    In this example, enter the following code. The code operates on a table named pytable, which contains the name, id, and fid columns shown in the run log:
    # Import the DataFrame class of PyODPS.
    from odps.df import DataFrame
    
    # Create a DataFrame from the pytable table.
    df = DataFrame(o.get_table('pytable'))
    
    # Call the min_max_scale method to normalize the fid column. By default, data is scaled to the (0,1) range.
    print df.min_max_scale(columns=['fid']).head()
    
    # Use the feature_range parameter to specify an output value range for the min_max_scale method. For example, specify (-1,1) as the output value range.
    print df.min_max_scale(columns=['fid'], feature_range=(-1, 1)).head()
    
    # To keep the original column, set the preserve parameter to True. In this case, a new column that contains the scaled data is added. The new column is named in the format <Original column name>_scaled. You can change the _scaled suffix by using the suffix parameter.
    print df.min_max_scale(columns=['fid'], preserve=True).head()
    
    # Use the group parameter to specify one or more group columns. The minimum and maximum values are then computed within each group.
    print df.min_max_scale(columns=['fid'], group=['name']).head()
    
    # The std_scale method standardizes data based on the mean and standard deviation instead of the minimum and maximum values. Like the min_max_scale method, it also supports the group parameter to specify one or more group columns.
    print df.std_scale(columns=['fid']).head()
  7. Click the Run icon in the toolbar.
  8. View the running result of the PyODPS 2 node on the Run Log tab.
    In this example, the following information appears on the Run Log tab:
    Sql compiled:
    CREATE TABLE tmp_pyodps_59cb5533_6dc3_4b64_9312_b2da29d70cee LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, (IF((t5.`fid_max_1570691133` - t5.`fid_min_1570691133`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691133`) / (t5.`fid_max_1570691133` - t5.`fid_min_1570691133`)) * 1) + 0 AS `fid`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691133`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691133`, MIN(t4.`fid`) AS `fid_min_1570691133`, MAX(t4.`fid`) AS `fid_max_1570691133`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691133` == t5.`idx_col_1570691133`
    Instance ID: 2019101007053495gl8by72m
    
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  0.526316
    2  name2   3  0.000000
    3  name1   4  0.710526
    4  name1   3  0.184211
    5  name1   3  0.684211
    Sql compiled:
    CREATE TABLE tmp_pyodps_88a24967_3bdc_41ce_85d7_4bbd23e4ce01 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, (IF((t5.`fid_max_1570691139` - t5.`fid_min_1570691139`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691139`) / (t5.`fid_max_1570691139` - t5.`fid_min_1570691139`)) * 2) + -1 AS `fid`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691139`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691139`, MIN(t4.`fid`) AS `fid_min_1570691139`, MAX(t4.`fid`) AS `fid_max_1570691139`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691139` == t5.`idx_col_1570691139`
    Instance ID: 20191010070539772gjo56292
    
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  0.052632
    2  name2   3 -1.000000
    3  name1   4  0.421053
    4  name1   3 -0.631579
    5  name1   3  0.368421
    Sql compiled:
    CREATE TABLE tmp_pyodps_439117fc_9ef7_4086_899d_a5bf77d653e5 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, t2.`fid`, (IF((t5.`fid_max_1570691146` - t5.`fid_min_1570691146`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691146`) / (t5.`fid_max_1570691146` - t5.`fid_min_1570691146`)) * 1) + 0 AS `fid_scaled`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691146`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691146`, MIN(t4.`fid`) AS `fid_min_1570691146`, MAX(t4.`fid`) AS `fid_max_1570691146`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691146` == t5.`idx_col_1570691146`
    
    Instance ID: 20191010070546769g0c14f72
    
        name  id  fid  fid_scaled
    0  name1   4  5.3    1.000000
    1  name2   2  3.5    0.526316
    2  name2   3  1.5    0.000000
    3  name1   4  4.2    0.710526
    4  name1   3  2.2    0.184211
    5  name1   3  4.1    0.684211
    Sql compiled:
    CREATE TABLE tmp_pyodps_d3839b4b_1087_4d52_91f5_52763b72f272 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t3)*/ t1.`name`, t1.`id`, (IF((t3.`fid_max_1570691151` - t3.`fid_min_1570691151`) == 0, 0, (t1.`fid` - t3.`fid_min_1570691151`) / (t3.`fid_max_1570691151` - t3.`fid_min_1570691151`)) * 1) + 0 AS `fid`
    FROM WB_BestPractice_dev.`pytable` t1
    INNER JOIN
      (
        SELECT t2.`name`, MAX(t2.`fid`) AS `fid_max_1570691151`, MIN(t2.`fid`) AS `fid_min_1570691151`
        FROM WB_BestPractice_dev.`pytable` t2
        GROUP BY t2.`name`
      ) t3
    ON t1.`name` == t3.`name`
    Instance ID: 20191010070551756gtf14f72
    
    
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  1.000000
    2  name2   3  0.000000
    3  name1   4  0.645161
    4  name1   3  0.000000
    5  name1   3  0.612903
    Sql compiled:
    CREATE TABLE tmp_pyodps_1ea6e5b4_129f_4d1e_a6a7_410e08d77ae6 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, IF(t5.`fid_std_1570691157` == 0, 0, (t2.`fid` - t5.`fid_mean_1570691157`) / t5.`fid_std_1570691157`) AS `fid`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691157`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691157`, AVG(t4.`fid`) AS `fid_mean_1570691157`, STDDEV(t4.`fid`) AS `fid_std_1570691157`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691157` == t5.`idx_col_1570691157`
    Instance ID: 20191010070557788gdl14f72
    
        name  id       fid
    0  name1   4  1.436467
    1  name2   2  0.026118
    2  name2   3 -1.540938
    3  name1   4  0.574587
    4  name1   3 -0.992468
    5  name1   3  0.496234
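The scaled values in the run logs can be reproduced locally. The following pure-Python sketch applies the same formulas as the generated SQL: min_max_scale computes (x - min) / (max - min), stretched to feature_range, and std_scale computes (x - mean) / std, where std is the population standard deviation (as the STDDEV call in the generated SQL computes it). The fid values are taken from the preserve=True result above. The function names mirror the PyODPS methods but are local stand-ins, not the PyODPS API.

```python
def min_max_scale(values, feature_range=(0, 1)):
    # (x - min) / (max - min), stretched to the requested output range.
    # When all values are equal, the generated SQL emits the lower bound.
    lo, hi = min(values), max(values)
    span = hi - lo
    out_lo, out_hi = feature_range
    return [out_lo if span == 0 else (x - lo) / span * (out_hi - out_lo) + out_lo
            for x in values]

def std_scale(values):
    # (x - mean) / population standard deviation.
    mean = sum(values) / len(values)
    std = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5
    return [0.0 if std == 0 else (x - mean) / std for x in values]

# fid values from the preserve=True result in the run log.
fid = [5.3, 3.5, 1.5, 4.2, 2.2, 4.1]

print([round(v, 6) for v in min_max_scale(fid)])
# [1.0, 0.526316, 0.0, 0.710526, 0.184211, 0.684211]
print([round(v, 6) for v in min_max_scale(fid, feature_range=(-1, 1))])
# [1.0, 0.052632, -1.0, 0.421053, -0.631579, 0.368421]
print([round(v, 6) for v in std_scale(fid)])
# matches the std_scale result in the run log (1.436467, 0.026118, -1.540938, ...)
```

The outputs match the first, second, and last result tables in the run log, which confirms how each parameter changes the formula.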
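Similarly, the group=['name'] result can be reproduced by computing the minimum and maximum within each name group, which is what the GROUP BY subquery in the generated SQL does. A sketch using the (name, fid) pairs from the run log:

```python
from collections import defaultdict

def min_max_scale_by_group(rows):
    # rows: (name, fid) pairs. Scale fid within each name group:
    # (x - group_min) / (group_max - group_min).
    groups = defaultdict(list)
    for name, fid in rows:
        groups[name].append(fid)
    bounds = {name: (min(v), max(v)) for name, v in groups.items()}
    result = []
    for name, fid in rows:
        lo, hi = bounds[name]
        result.append((name, 0.0 if hi == lo else (fid - lo) / (hi - lo)))
    return result

rows = [('name1', 5.3), ('name2', 3.5), ('name2', 1.5),
        ('name1', 4.2), ('name1', 2.2), ('name1', 4.1)]
for name, scaled in min_max_scale_by_group(rows):
    print(name, round(scaled, 6))
# name1 1.0
# name2 1.0
# name2 0.0
# name1 0.645161
# name1 0.0
# name1 0.612903
```

Note that both name2 rows map to 1.0 and 0.0 because each group is scaled against its own extremes, which matches the fourth result table in the run log.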