全部产品
Search
文档中心

云原生大数据计算服务 MaxCompute:PyODPS的数据缩放

更新时间:May 31, 2023

本文介绍如何进行PyODPS的数据缩放。

前提条件

请提前完成如下操作:

操作步骤

  1. 创建表并导入数据。

    1. 下载鸢尾花数据集iris.data,重命名为iris.csv

    2. 创建表pyodps_iris并上传数据集iris.csv。操作方法请参见建表并上传数据

      建表语句如下。

      CREATE TABLE if not exists pyodps_iris
      (
      sepallength  DOUBLE comment '片长度(cm)',
      sepalwidth   DOUBLE comment '片宽度(cm)',
      petallength  DOUBLE comment '瓣长度(cm)',
      petalwidth   DOUBLE comment '瓣宽度(cm)',
      name         STRING comment '种类'
      );
  2. 登录DataWorks控制台
  3. 在左侧导航栏上单击工作空间列表

  4. 在数据开发页面,右键单击已经创建的业务流程,选择新建节点 > MaxCompute > PyODPS 2

  5. 新建节点对话框,输入节点名称,并单击确认

  6. 在PyODPS 2节点中输入数据缩放代码。
    示例代码如下。
    #数据缩放
    df = DataFrame(o.get_table('pytable'))
    
    #使用min_max_scale方法进行归一化
    print df.min_max_scale(columns=['fid']).head()
    
    #min_max_scale还支持使用feature_range参数指定输出值的范围。例如,需要使输出值在 (-1, 1) 范围内
    print df.min_max_scale(columns=['fid'],feature_range=(-1,1)).head()
    
    #如果需要保留原始值,可以使用preserve参数。此时,缩放后的数据将会以新增列的形式追加到数据中,列名默认为原列名追加_scaled后缀,该后缀可使用suffix参数更改
    print df.min_max_scale(columns=['fid'],preserve=True).head()
    
    #使用group分组
    print df.min_max_scale(columns=['fid'],group=['name']).head()
    
    #min_max_scale也支持使用group参数指定一个或多个分组列,在分组列中分别取最值进行缩放
    print df.std_scale(columns=['fid']).head()
  7. 单击运行运行节点.png
  8. 运行日志中查看运行结果。
    完整的运行结果如下。
    Sql compiled:
    CREATE TABLE tmp_pyodps_59cb5533_6dc3_4b64_9312_b2da29d70cee LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, (IF((t5.`fid_max_1570691133` - t5.`fid_min_1570691133`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691133`) / (t5.`fid_max_1570691133` - t5.`fid_min_1570691133`)) * 1) + 0 AS `fid`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691133`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691133`, MIN(t4.`fid`) AS `fid_min_1570691133`, MAX(t4.`fid`) AS `fid_max_1570691133`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691133` == t5.`idx_col_1570691133`
    Instance ID: 2019101007053495gl8by72m
    
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  0.526316
    2  name2   3  0.000000
    3  name1   4  0.710526
    4  name1   3  0.184211
    5  name1   3  0.684211
    Sql compiled:
    CREATE TABLE tmp_pyodps_88a24967_3bdc_41ce_85d7_4bbd23e4ce01 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, (IF((t5.`fid_max_1570691139` - t5.`fid_min_1570691139`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691139`) / (t5.`fid_max_1570691139` - t5.`fid_min_1570691139`)) * 2) + -1 AS `fid`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691139`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691139`, MIN(t4.`fid`) AS `fid_min_1570691139`, MAX(t4.`fid`) AS `fid_max_1570691139`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691139` == t5.`idx_col_1570691139`
    Instance ID: 20191010070539772gjo56292
    
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  0.052632
    2  name2   3 -1.000000
    3  name1   4  0.421053
    4  name1   3 -0.631579
    5  name1   3  0.368421
    Sql compiled:
    CREATE TABLE tmp_pyodps_439117fc_9ef7_4086_899d_a5bf77d653e5 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, t2.`fid`, (IF((t5.`fid_max_1570691146` - t5.`fid_min_1570691146`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691146`) / (t5.`fid_max_1570691146` - t5.`fid_min_1570691146`)) * 1) + 0 AS `fid_scaled`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691146`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691146`, MIN(t4.`fid`) AS `fid_min_1570691146`, MAX(t4.`fid`) AS `fid_max_1570691146`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691146` == t5.`idx_col_1570691146`
    
    Instance ID: 20191010070546769g0c14f72
    
        name  id  fid  fid_scaled
    0  name1   4  5.3    1.000000
    1  name2   2  3.5    0.526316
    2  name2   3  1.5    0.000000
    3  name1   4  4.2    0.710526
    4  name1   3  2.2    0.184211
    5  name1   3  4.1    0.684211
    Sql compiled:
    CREATE TABLE tmp_pyodps_d3839b4b_1087_4d52_91f5_52763b72f272 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t3)*/ t1.`name`, t1.`id`, (IF((t3.`fid_max_1570691151` - t3.`fid_min_1570691151`) == 0, 0, (t1.`fid` - t3.`fid_min_1570691151`) / (t3.`fid_max_1570691151` - t3.`fid_min_1570691151`)) * 1) + 0 AS `fid`
    FROM WB_BestPractice_dev.`pytable` t1
    INNER JOIN
      (
        SELECT t2.`name`, MAX(t2.`fid`) AS `fid_max_1570691151`, MIN(t2.`fid`) AS `fid_min_1570691151`
        FROM WB_BestPractice_dev.`pytable` t2
        GROUP BY t2.`name`
      ) t3
    ON t1.`name` == t3.`name`
    Instance ID: 20191010070551756gtf14f72
    
    
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  1.000000
    2  name2   3  0.000000
    3  name1   4  0.645161
    4  name1   3  0.000000
    5  name1   3  0.612903
    Sql compiled:
    CREATE TABLE tmp_pyodps_1ea6e5b4_129f_4d1e_a6a7_410e08d77ae6 LIFECYCLE 1 AS
    SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, IF(t5.`fid_std_1570691157` == 0, 0, (t2.`fid` - t5.`fid_mean_1570691157`) / t5.`fid_std_1570691157`) AS `fid`
    FROM (
      SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691157`
      FROM WB_BestPractice_dev.`pytable` t1
    ) t2
    INNER JOIN
      (
        SELECT 1 AS `idx_col_1570691157`, AVG(t4.`fid`) AS `fid_mean_1570691157`, STDDEV(t4.`fid`) AS `fid_std_1570691157`
        FROM (
          SELECT t3.`fid`
          FROM WB_BestPractice_dev.`pytable` t3
        ) t4
      ) t5
    ON t2.`idx_col_1570691157` == t5.`idx_col_1570691157`
    Instance ID: 20191010070557788gdl14f72
    
        name  id       fid
    0  name1   4  1.436467
    1  name2   2  0.026118
    2  name2   3 -1.540938
    3  name1   4  0.574587
    4  name1   3 -0.992468
    5  name1   3  0.496234