Standard Scaler Train

In data preprocessing, you can standardize data to reduce the effect of differing magnitudes and value ranges across columns. After standardization, the data in different columns is on a comparable scale. The standardization component assumes that the input data conforms to a normal distribution.

Limits

The supported compute engines are MaxCompute and Realtime Compute for Apache Flink.

Introduction

Standardization rescales the values of a column by using the mean and standard deviation of that column, under the assumption that the data conforms to a normal distribution. During training, this component calculates the mean and standard deviation of each selected column and outputs them as a model.
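The calculation described above can be sketched in plain Python. This is a minimal illustration rather than the component's actual implementation: the function names fit_standard_scaler and standardize are hypothetical, and the sketch assumes the population standard deviation, whereas the component may use the sample standard deviation.

```python
import statistics


def fit_standard_scaler(values):
    """Training step: compute the mean and standard deviation of one column.

    Assumption: population standard deviation (divide by n).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every value in the column."""
    if std == 0:
        # A constant column has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_standard_scaler(column)  # mean is 5.0 and std is 2.0 here
scaled = standardize(column, mean, std)
print(mean, std)
print(scaled)
```

After scaling, the column has a mean of 0 and a standard deviation of 1, which is what puts different columns on a comparable scale.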

Configure the component in Machine Learning Designer

Input ports

Input port (from left to right)	Data type	Recommended upstream component	Required
data	Integer	Read Table, Read CSV File	Yes

Component parameters

Tab	Parameter	Description
Field Setting	selectedCols	The names of the columns that you want to process. Multiple columns can be selected. The data in the columns can only be of a numeric type.
Parameter Setting	withMean	Specifies whether to center the data by subtracting the mean. By default, the mean is used.
Parameter Setting	withStd	Specifies whether to scale the data by dividing by the standard deviation. By default, the standard deviation is used.
Execution Tuning	Number of Workers	The number of workers. Use this parameter together with the "Memory per worker, unit MB" parameter. The value must be a positive integer. Valid values: 1 to 9999.
Execution Tuning	Memory per worker, unit MB	The memory size of each worker. Valid values: 1024 to 65536. Unit: MB.
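To illustrate how the withMean and withStd parameters might affect the result, the following sketch toggles centering and scaling independently. It assumes the common convention in which withMean controls subtraction of the mean and withStd controls division by the standard deviation. The function scale_column and its arguments are hypothetical names, not part of the component, and the population standard deviation is assumed.

```python
import statistics


def scale_column(values, with_mean=True, with_std=True):
    """Center and/or scale one numeric column.

    Assumptions: with_mean subtracts the column mean, with_std divides by
    the population standard deviation. A zero spread leaves values unscaled.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    shift = mean if with_mean else 0.0
    divisor = std if (with_std and std > 0) else 1.0
    return [(v - shift) / divisor for v in values]


column = [1.0, 2.0, 3.0]
both = scale_column(column)                                    # center and scale
centered_only = scale_column(column, with_std=False)           # subtract mean only
scaled_only = scale_column(column, with_mean=False)            # divide by std only
unchanged = scale_column(column, with_mean=False, with_std=False)

for name, result in [("both", both), ("centered_only", centered_only),
                     ("scaled_only", scaled_only), ("unchanged", unchanged)]:
    print(name, result)
```

With both options enabled, the output has a mean of 0 and a standard deviation of 1. Disabling both leaves the column as it was.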

Output ports

Output port (from left to right)	Storage location	Recommended downstream component	Model type
model	N/A	Standard Scaler Batch Predict	None

Example

You can copy the following code to the code editor of the PyAlink Script component. This allows the PyAlink Script component to function like the Standard Scaler Train component.

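from pyalink.alink import *

def main(sources, sinks, parameter):
    # Read the table that is connected to the first input port.
    data = sources[0]
    # Numeric columns to standardize. Replace these with your own column names.
    selectedColNames = ["col2", "col3"]
    # Train the scaler model on the selected columns.
    modelop = StandardScalerTrainBatchOp()\
        .setSelectedCols(selectedColNames)
    result = modelop.linkFrom(data)
    # Send the trained model to the first output port, then run the job.
    result.link(sinks[0])
    BatchOperator.execute()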