This topic describes how to use a PyODPS node to scale data.

Prerequisites

The following operations are completed:

Procedure

  1. Prepare test data.
    1. Download the iris dataset file iris.data and rename it to iris.csv.
    2. Create a table named pyodps_iris and import the data in the iris.csv file into the table. For more information, see Create tables and import data. A sample import script is provided after the table creation statement below.
      In this example, execute the following statement to create the pyodps_iris table:
      CREATE TABLE IF NOT EXISTS pyodps_iris
      (
          sepallength DOUBLE COMMENT 'sepal length (cm)',
          sepalwidth  DOUBLE COMMENT 'sepal width (cm)',
          petallength DOUBLE COMMENT 'petal length (cm)',
          petalwidth  DOUBLE COMMENT 'petal width (cm)',
          name        STRING COMMENT 'type'
      );
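      You can import the data by using the method described in Create tables and import data. For reference, the following is a minimal PyODPS sketch that uploads the rows from a local machine. The AccessKey, project, and endpoint placeholders are assumptions and must be replaced with your own values.
      # A minimal local import sketch. Assumption: PyODPS is installed on your
      # machine and the placeholders below are replaced with your own values.
      import csv
      from odps import ODPS
      
      o = ODPS('<AccessKey ID>', '<AccessKey Secret>',
               project='<project_name>', endpoint='<MaxCompute endpoint>')
      
      # iris.csv (renamed from iris.data) has no header row: four numeric
      # columns followed by the type name.
      with open('iris.csv') as f:
          records = [[float(r[0]), float(r[1]), float(r[2]), float(r[3]), r[4]]
                     for r in csv.reader(f) if r]
      
      # Write the rows into the pyodps_iris table created above.
      o.write_table('pyodps_iris', records)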
  2. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces.
  3. On the Workspaces page, find the target workspace and click Data Analytics in the Actions column. The Data Development tab appears.
  4. On the Data Development tab, right-click the target workflow in the Business process section and choose New > MaxCompute > PyODPS 2.
  5. In the New node dialog box, set the Node name parameter and click Submit.
  6. On the configuration tab of the PyODPS 2 node, enter the code of the node in the code editor:
    In this example, enter the following code. The code reads a table named pytable that contains the name, id, and fid columns, as shown in the run log in step 8. A local sketch of the scaling formulas follows the code.
    # Scale data. The MaxCompute entry object o is automatically available in a PyODPS node.
    from odps.df import DataFrame
    
    df = DataFrame(o.get_table('pytable'))
    
    # Call the min_max_scale method to normalize data.
    print df.min_max_scale(columns=['fid']).head()
    
    # Use the feature_range parameter to specify an output value range for the min_max_scale method. For example, to specify (-1,1) as the output value range, execute the following statement:
    print df.min_max_scale(columns=['fid'],feature_range=(-1,1)).head()
    
    # To keep the original values, set the preserve parameter to True. In this case, a new column that contains the scaled data is added. The new column is named by appending the suffix _scaled to the original column name. You can change the suffix by using the suffix parameter.
    print df.min_max_scale(columns=['fid'],preserve=True).head()
    
    # Use the group parameter to specify one or more group columns. The minimum and maximum values are calculated within each group.
    print df.min_max_scale(columns=['fid'],group=['name']).head()
    
    # The std_scale method standardizes data based on the mean and standard deviation of the column. Like the min_max_scale method, the std_scale method also supports the group parameter to specify one or more group columns.
    print df.std_scale(columns=['fid']).head()
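    For reference, the following is a minimal local sketch of the scaling formulas, which can also be read off the compiled SQL in the run log in step 8. The fid values and the expected results are taken from that run log; the helper function names are illustrative only and are not part of the PyODPS API.
    # fid column values shown in the run log in step 8.
    fid = [5.3, 3.5, 1.5, 4.2, 2.2, 4.1]
    
    def min_max(values, feature_range=(0, 1)):
        # x' = (x - min) / (max - min), rescaled to the requested output range.
        lo, hi = min(values), max(values)
        a, b = feature_range
        return [((x - lo) / (hi - lo) if hi != lo else 0) * (b - a) + a for x in values]
    
    def std(values):
        # x' = (x - mean) / stddev, where stddev is the population standard
        # deviation (STDDEV in the compiled SQL).
        mean = sum(values) / len(values)
        sd = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5
        return [(x - mean) / sd if sd != 0 else 0 for x in values]
    
    print(min_max(fid))           # 5.3 -> 1.000000, 3.5 -> 0.526316, 1.5 -> 0.000000
    print(min_max(fid, (-1, 1)))  # 5.3 -> 1.000000, 3.5 -> 0.052632, 1.5 -> -1.000000
    print(std(fid))               # 5.3 -> 1.436467, 1.5 -> -1.540938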
  7. Click the Run icon in the toolbar.
  8. View the running result of the PyODPS 2 node on the Run Log tab.
    In this example, the following information appears on the Run Log tab:
    Sql compiled:
    CREATE TABLE tmp_pyodps_59cb5533_6dc3_4b64_9312_b2da29d70cee LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, (IF((t5.`fid_max_1570691133` - t5.`fid_min_1570691133`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691133`) / (t5.`fid_max_1570691133` - t5.`fid_min_1570691133`)) * 1) + 0 AS `fid`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691133`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691133`, MIN(t4.`fid`) AS `fid_min_1570691133`, MAX(t4.`fid`) AS `fid_max_1570691133`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691133` == t5.`idx_col_1570691133`
    Instance ID: 2019101007053495gl8by72m
    
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  0.526316
    2  name2   3  0.000000
    3  name1   4  0.710526
    4  name1   3  0.184211
    5  name1   3  0.684211
    Sql compiled:
    CREATE TABLE tmp_pyodps_88a24967_3bdc_41ce_85d7_4bbd23e4ce01 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, (IF((t5.`fid_max_1570691139` - t5.`fid_min_1570691139`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691139`) / (t5.`fid_max_1570691139` - t5.`fid_min_1570691139`)) * 2) + -1 AS `fid`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691139`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691139`, MIN(t4.`fid`) AS `fid_min_1570691139`, MAX(t4.`fid`) AS `fid_max_1570691139`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691139` == t5.`idx_col_1570691139`
    Instance ID: 20191010070539772gjo56292
    
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  0.052632
    2  name2   3 -1.000000
    3  name1   4  0.421053
    4  name1   3 -0.631579
    5  name1   3  0.368421
    Sql compiled:
    CREATE TABLE tmp_pyodps_439117fc_9ef7_4086_899d_a5bf77d653e5 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, t2.`fid`, (IF((t5.`fid_max_1570691146` - t5.`fid_min_1570691146`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691146`) / (t5.`fid_max_1570691146` - t5.`fid_min_1570691146`)) * 1) + 0 AS `fid_scaled`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691146`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691146`, MIN(t4.`fid`) AS `fid_min_1570691146`, MAX(t4.`fid`) AS `fid_max_1570691146`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691146` == t5.`idx_col_1570691146`
    
    Instance ID: 20191010070546769g0c14f72
    
        name  id  fid  fid_scaled
    0  name1   4  5.3    1.000000
    1  name2   2  3.5    0.526316
    2  name2   3  1.5    0.000000
    3  name1   4  4.2    0.710526
    4  name1   3  2.2    0.184211
    5  name1   3  4.1    0.684211
    Sql compiled:
    CREATE TABLE tmp_pyodps_d3839b4b_1087_4d52_91f5_52763b72f272 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t3)*/ t1.`name`, t1.`id`, (IF((t3.`fid_max_1570691151` - t3.`fid_min_1570691151`) == 0, 0, (t1.`fid` - t3.`fid_min_1570691151`) / (t3.`fid_max_1570691151` - t3.`fid_min_1570691151`)) * 1) + 0 AS `fid`
    FROM WB_BestPractice_dev.`pytable` t1
    INNER JOIN
      (
        SELECT t2.`name`, MAX(t2.`fid`) AS `fid_max_1570691151`, MIN(t2.`fid`) AS `fid_min_1570691151`
        FROM WB_BestPractice_dev.`pytable` t2
        GROUP BY t2.`name`
      ) t3
    ON t1.`name` == t3.`name`
    Instance ID: 20191010070551756gtf14f72
    
    
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  1.000000
    2  name2   3  0.000000
    3  name1   4  0.645161
    4  name1   3  0.000000
    5  name1   3  0.612903
    Sql compiled:
    CREATE TABLE tmp_pyodps_1ea6e5b4_129f_4d1e_a6a7_410e08d77ae6 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, IF(t5.`fid_std_1570691157` == 0, 0, (t2.`fid` - t5.`fid_mean_1570691157`) / t5.`fid_std_1570691157`) AS `fid`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691157`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691157`, AVG(t4.`fid`) AS `fid_mean_1570691157`, STDDEV(t4.`fid`) AS `fid_std_1570691157`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691157` == t5.`idx_col_1570691157`
    Instance ID: 20191010070557788gdl14f72
    
        name  id       fid
    0  name1   4  1.436467
    1  name2   2  0.026118
    2  name2   3 -1.540938
    3  name1   4  0.574587
    4  name1   3 -0.992468
    5  name1   3  0.496234
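    As a cross-check of the group result above, the minimum and maximum of fid within each name group (name1: 2.2 to 5.3, name2: 1.5 to 3.5) reproduce the values printed for min_max_scale with group=['name']. The following is a minimal pandas sketch that rebuilds the six rows shown in the run log; it is only for local verification and is not part of the node code.
    import pandas as pd
    
    # The six rows shown in the preserve=True output in the run log.
    df = pd.DataFrame({'name': ['name1', 'name2', 'name2', 'name1', 'name1', 'name1'],
                       'id':   [4, 2, 3, 4, 3, 3],
                       'fid':  [5.3, 3.5, 1.5, 4.2, 2.2, 4.1]})
    
    # Min-max scaling within each name group, matching the group=['name'] output.
    scaled = df.groupby('name')['fid'].transform(lambda s: (s - s.min()) / (s.max() - s.min()))
    print(scaled.round(6).tolist())  # [1.0, 1.0, 0.0, 0.645161, 0.0, 0.612903]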