
Platform For AI: Standard Scaler Train

Last Updated: Mar 08, 2024

In data preprocessing, users can standardize data to reduce the effect of differing magnitudes and value ranges across columns. After standardization, the data in different columns falls within a comparable range. The standardization component assumes that the input data follows a normal distribution.

Limits

The supported compute engines are MaxCompute and Realtime Compute for Apache Flink.

Introduction

Standardization transforms values by using the mean and variance of the data, under the assumption that the data follows a normal distribution. During training, the standardization method calculates the mean and standard deviation of the input data; these statistics are stored in the model and applied during prediction.
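The computation can be sketched as follows. This is a hypothetical NumPy illustration of what the training step records (per-column mean and standard deviation) and how prediction applies them as z = (x - mean) / std; the sample data is made up, and NumPy's default population standard deviation is used for simplicity.

```python
import numpy as np

# Made-up training data: two numeric columns.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# "Training": record each column's mean and standard deviation.
mean = data.mean(axis=0)   # per-column mean
std = data.std(axis=0)     # per-column (population) standard deviation

# "Prediction": standardize each column with the recorded statistics.
scaled = (data - mean) / std

# After standardization, every column has mean 0 and standard deviation 1.
print(scaled.mean(axis=0), scaled.std(axis=0))
```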

Configure the component in Machine Learning Designer

Input ports

| Input port (from left to right) | Data type | Recommended upstream component | Required |
| --- | --- | --- | --- |
| data | Integer | Read Table, Read CSV File | Yes |

Component parameters

| Tab | Parameter | Description |
| --- | --- | --- |
| Field Setting | selectedCols | The names of the columns that you want to process. Multiple columns can be selected. The columns must be of a numeric type. |
| Parameter Setting | withMean | Specifies whether to use the mean. By default, the mean is used. |
| Parameter Setting | withStd | Specifies whether to use the standard deviation. By default, the standard deviation is used. |
| Execution Tuning | Number of Workers | The number of workers. This parameter must be used together with the Memory per worker, unit MB parameter. The value must be a positive integer. Valid values: [1,9999]. |
| Execution Tuning | Memory per worker, unit MB | The memory size of each worker. Unit: MB. Valid values: 1024 to 65536. |
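The effect of the withMean and withStd switches can be sketched with NumPy, assuming they follow the common scaler convention: withMean controls whether the mean is subtracted (centering), and withStd controls whether values are divided by the standard deviation (scaling). The data is made up, and the population standard deviation is used for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
mean, std = x.mean(), x.std()

with_both = (x - mean) / std  # withMean=True, withStd=True (default): center and scale
no_mean = x / std             # withMean=False: scale only; the offset is preserved
no_std = x - mean             # withStd=False: center only; the spread is preserved
```

With both switches on, the result has mean 0 and standard deviation 1; disabling one switch preserves the corresponding property of the original data.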

Output ports

| Output port (from left to right) | Storage location | Recommended downstream component | Model type |
| --- | --- | --- | --- |
| model | N/A | Standard Scaler Batch Predict | None |

Example

You can copy the following code into the code editor of the PyAlink Script component so that the component functions like the Standard Scaler Train component.

from pyalink.alink import *

def main(sources, sinks, parameter):
    # The first input port provides the training data.
    data = sources[0]
    # Columns to standardize; they must be of a numeric type.
    selectedColNames = ["col2", "col3"]
    # Train a standard scaler model on the selected columns.
    modelop = StandardScalerTrainBatchOp()\
        .setSelectedCols(selectedColNames)
    result = modelop.linkFrom(data)
    # Write the trained model to the first output port and run the batch job.
    result.link(sinks[0])
    BatchOperator.execute()