This topic describes how to use a PyODPS node to scale data.
Prerequisites
Before you begin, make sure that the following operations are performed:
MaxCompute and DataWorks have been activated. For more information, see Activate MaxCompute and DataWorks and Purchase guide.
A workflow is created in the DataWorks console. In this example, a workflow is created for a DataWorks workspace in basic mode. For more information, see Create a workflow.
Procedure
Create a table and import data to the table.
Download the dataset file iris.data from Iris and rename it iris.csv.
Create a table named pyodps_iris and upload the dataset iris.csv to the table. For more information, see Create tables and upload data.
Sample statement:
CREATE TABLE IF NOT EXISTS pyodps_iris (
    sepallength DOUBLE COMMENT 'sepal length (cm)',
    sepalwidth  DOUBLE COMMENT 'sepal width (cm)',
    petallength DOUBLE COMMENT 'petal length (cm)',
    petalwidth  DOUBLE COMMENT 'petal width (cm)',
    name        STRING COMMENT 'type'
);
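The iris.data file has no header row; each line contains the four measurements followed by the class label, in the same order as the columns of pyodps_iris. As a quick sanity check before uploading, the following is a minimal plain-Python sketch (run locally; the sample line is hypothetical) of how one CSV row maps onto the table columns:

```python
import csv
from io import StringIO

# A hypothetical sample line in the same format as iris.data / iris.csv.
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n"

# Column order matches the pyodps_iris table definition above.
columns = ["sepallength", "sepalwidth", "petallength", "petalwidth", "name"]

reader = csv.reader(StringIO(sample))
row = next(reader)
record = dict(zip(columns, row))

# The four measurements are DOUBLE columns; the class label is a STRING.
record = {k: (float(v) if k != "name" else v) for k, v in record.items()}
print(record)
```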
- Log on to the DataWorks console.
- In the left-side navigation pane, click Workspace.
- On the DataStudio page, right-click the created workflow and choose the command to create a PyODPS 2 node.
- In the Create Node dialog box, specify Name and click Confirm.
- On the configuration tab of the PyODPS 2 node, enter the code of the node in the code editor. In this example, the code reads a sample table named pytable that contains the name, id, and fid columns:

# Scale data.
from odps.df import DataFrame

df = DataFrame(o.get_table('pytable'))
# Call the min_max_scale method to normalize data.
print df.min_max_scale(columns=['fid']).head()
# Use the feature_range parameter to specify an output value range for the
# min_max_scale method. For example, to specify (-1,1) as the output value
# range, execute the following statement:
print df.min_max_scale(columns=['fid'], feature_range=(-1, 1)).head()
# To keep the original values, set the preserve parameter to True. In this
# case, a new column that contains the scaled data is added. The new column
# is named in the format <original column name>_scaled. You can change the
# suffix _scaled by using the suffix parameter.
print df.min_max_scale(columns=['fid'], preserve=True).head()
# Use the group parameter to specify a group column. The minimum and maximum
# values are then obtained within each group.
print df.min_max_scale(columns=['fid'], group=['name']).head()
# The std_scale method standardizes data based on the column mean and
# standard deviation. You can also use the group parameter to specify one or
# more group columns for the std_scale method.
print df.std_scale(columns=['fid']).head()
- Click the Run icon in the toolbar.
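The min_max_scale variants can be checked locally before the node runs. The following is only a plain-Python sketch of the underlying formula (value - min) / (max - min) mapped into feature_range, optionally per group; it is not the PyODPS API and runs locally under Python 3. The fid and name values are the sample values that appear in the run log.

```python
# Plain-Python sketch of the min_max_scale math; this is not the PyODPS API.
def min_max_scale(values, feature_range=(0, 1)):
    lo, hi = feature_range
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    # Guard a zero range the way the generated SQL does with IF(... == 0, 0, ...).
    return [((v - vmin) / span if span else 0.0) * (hi - lo) + lo for v in values]

def min_max_scale_by_group(groups, values, feature_range=(0, 1)):
    # Scale each value with the min/max of its own group, as group=['name'] does.
    lo, hi = feature_range
    out = []
    for g, v in zip(groups, values):
        member = [x for gg, x in zip(groups, values) if gg == g]
        vmin, vmax = min(member), max(member)
        span = vmax - vmin
        out.append(((v - vmin) / span if span else 0.0) * (hi - lo) + lo)
    return out

# Sample fid and name values of the pytable table, taken from the run log.
fid = [5.3, 3.5, 1.5, 4.2, 2.2, 4.1]
name = ["name1", "name2", "name2", "name1", "name1", "name1"]

print([round(x, 6) for x in min_max_scale(fid)])
print([round(x, 6) for x in min_max_scale(fid, feature_range=(-1, 1))])
print([round(x, 6) for x in min_max_scale_by_group(name, fid)])
```

Each printed list matches the fid column of the corresponding result set in the run log.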

- View the running result of the PyODPS 2 node on the Run Log tab.
In this example, the following information appears on the Run Log tab:

Sql compiled:
CREATE TABLE tmp_pyodps_59cb5533_6dc3_4b64_9312_b2da29d70cee LIFECYCLE 1 AS
SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`,
  (IF((t5.`fid_max_1570691133` - t5.`fid_min_1570691133`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691133`) / (t5.`fid_max_1570691133` - t5.`fid_min_1570691133`)) * 1) + 0 AS `fid`
FROM (
  SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691133`
  FROM WB_BestPractice_dev.`pytable` t1
) t2
INNER JOIN (
  SELECT 1 AS `idx_col_1570691133`, MIN(t4.`fid`) AS `fid_min_1570691133`, MAX(t4.`fid`) AS `fid_max_1570691133`
  FROM ( SELECT t3.`fid` FROM WB_BestPractice_dev.`pytable` t3 ) t4
) t5
ON t2.`idx_col_1570691133` == t5.`idx_col_1570691133`
Instance ID: 2019101007053495gl8by72m

    name  id       fid
0  name1   4  1.000000
1  name2   2  0.526316
2  name2   3  0.000000
3  name1   4  0.710526
4  name1   3  0.184211
5  name1   3  0.684211

Sql compiled:
CREATE TABLE tmp_pyodps_88a24967_3bdc_41ce_85d7_4bbd23e4ce01 LIFECYCLE 1 AS
SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`,
  (IF((t5.`fid_max_1570691139` - t5.`fid_min_1570691139`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691139`) / (t5.`fid_max_1570691139` - t5.`fid_min_1570691139`)) * 2) + -1 AS `fid`
FROM (
  SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691139`
  FROM WB_BestPractice_dev.`pytable` t1
) t2
INNER JOIN (
  SELECT 1 AS `idx_col_1570691139`, MIN(t4.`fid`) AS `fid_min_1570691139`, MAX(t4.`fid`) AS `fid_max_1570691139`
  FROM ( SELECT t3.`fid` FROM WB_BestPractice_dev.`pytable` t3 ) t4
) t5
ON t2.`idx_col_1570691139` == t5.`idx_col_1570691139`
Instance ID: 20191010070539772gjo56292

    name  id       fid
0  name1   4  1.000000
1  name2   2  0.052632
2  name2   3 -1.000000
3  name1   4  0.421053
4  name1   3 -0.631579
5  name1   3  0.368421

Sql compiled:
CREATE TABLE tmp_pyodps_439117fc_9ef7_4086_899d_a5bf77d653e5 LIFECYCLE 1 AS
SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`, t2.`fid`,
  (IF((t5.`fid_max_1570691146` - t5.`fid_min_1570691146`) == 0, 0, (t2.`fid` - t5.`fid_min_1570691146`) / (t5.`fid_max_1570691146` - t5.`fid_min_1570691146`)) * 1) + 0 AS `fid_scaled`
FROM (
  SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691146`
  FROM WB_BestPractice_dev.`pytable` t1
) t2
INNER JOIN (
  SELECT 1 AS `idx_col_1570691146`, MIN(t4.`fid`) AS `fid_min_1570691146`, MAX(t4.`fid`) AS `fid_max_1570691146`
  FROM ( SELECT t3.`fid` FROM WB_BestPractice_dev.`pytable` t3 ) t4
) t5
ON t2.`idx_col_1570691146` == t5.`idx_col_1570691146`
Instance ID: 20191010070546769g0c14f72

    name  id  fid  fid_scaled
0  name1   4  5.3    1.000000
1  name2   2  3.5    0.526316
2  name2   3  1.5    0.000000
3  name1   4  4.2    0.710526
4  name1   3  2.2    0.184211
5  name1   3  4.1    0.684211

Sql compiled:
CREATE TABLE tmp_pyodps_d3839b4b_1087_4d52_91f5_52763b72f272 LIFECYCLE 1 AS
SELECT /*+mapjoin(t3)*/ t1.`name`, t1.`id`,
  (IF((t3.`fid_max_1570691151` - t3.`fid_min_1570691151`) == 0, 0, (t1.`fid` - t3.`fid_min_1570691151`) / (t3.`fid_max_1570691151` - t3.`fid_min_1570691151`)) * 1) + 0 AS `fid`
FROM WB_BestPractice_dev.`pytable` t1
INNER JOIN (
  SELECT t2.`name`, MAX(t2.`fid`) AS `fid_max_1570691151`, MIN(t2.`fid`) AS `fid_min_1570691151`
  FROM WB_BestPractice_dev.`pytable` t2
  GROUP BY t2.`name`
) t3
ON t1.`name` == t3.`name`
Instance ID: 20191010070551756gtf14f72

    name  id       fid
0  name1   4  1.000000
1  name2   2  1.000000
2  name2   3  0.000000
3  name1   4  0.645161
4  name1   3  0.000000
5  name1   3  0.612903

Sql compiled:
CREATE TABLE tmp_pyodps_1ea6e5b4_129f_4d1e_a6a7_410e08d77ae6 LIFECYCLE 1 AS
SELECT /*+mapjoin(t5)*/ t2.`name`, t2.`id`,
  IF(t5.`fid_std_1570691157` == 0, 0, (t2.`fid` - t5.`fid_mean_1570691157`) / t5.`fid_std_1570691157`) AS `fid`
FROM (
  SELECT t1.`name`, t1.`id`, t1.`fid`, 1 AS `idx_col_1570691157`
  FROM WB_BestPractice_dev.`pytable` t1
) t2
INNER JOIN (
  SELECT 1 AS `idx_col_1570691157`, AVG(t4.`fid`) AS `fid_mean_1570691157`, STDDEV(t4.`fid`) AS `fid_std_1570691157`
  FROM ( SELECT t3.`fid` FROM WB_BestPractice_dev.`pytable` t3 ) t4
) t5
ON t2.`idx_col_1570691157` == t5.`idx_col_1570691157`
Instance ID: 20191010070557788gdl14f72

    name  id       fid
0  name1   4  1.436467
1  name2   2  0.026118
2  name2   3 -1.540938
3  name1   4  0.574587
4  name1   3 -0.992468
5  name1   3  0.496234
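The std_scale figures in the last result set can be reproduced the same way. The following is a plain-Python sketch (not the PyODPS API; run locally under Python 3) of (value - mean) / stddev, where stddev is the population standard deviation that the generated STDDEV call computes:

```python
import math

# Plain-Python sketch of the std_scale math; this is not the PyODPS API.
def std_scale(values):
    # Standardize to zero mean and unit variance, guarding a zero standard
    # deviation the way the generated SQL does with IF(std == 0, 0, ...).
    mean = sum(values) / len(values)
    # MaxCompute's STDDEV is the population standard deviation (divide by N).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [((v - mean) / std if std else 0.0) for v in values]

# Sample fid values of the pytable table, taken from the run log.
fid = [5.3, 3.5, 1.5, 4.2, 2.2, 4.1]
print([round(x, 6) for x in std_scale(fid)])
```

The printed list matches the fid column of the std_scale result set in the run log.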