In data preprocessing, you can standardize data to reduce the effect of differing units and value ranges across columns. After standardization, data in different columns is on a comparable scale. The standardization component assumes that the input data follows a normal distribution.
Limits
The supported compute engines are MaxCompute and Realtime Compute for Apache Flink.
Introduction
Standardization rescales values by using the mean and standard deviation of the data, under the assumption that the data follows a normal distribution. During training, the standardization method calculates the mean and standard deviation of the data in each selected column.
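The calculation described above can be sketched in plain Python. This is a minimal illustration and not the component's implementation: the function names `fit_standard_scaler` and `standardize` are hypothetical, the `with_mean` and `with_std` flags are assumed to behave like the withMean and withStd parameters, and the use of the sample standard deviation (divisor n - 1) is an assumption.

```python
from statistics import fmean, stdev


def fit_standard_scaler(values):
    """Training step: compute the mean and standard deviation of one column."""
    mean = fmean(values)
    # Assumption: sample standard deviation (divisor n - 1).
    std = stdev(values)
    return mean, std


def standardize(values, mean, std, with_mean=True, with_std=True):
    """Apply the trained statistics: (value - mean) / std."""
    shift = mean if with_mean else 0.0
    # Fall back to 1.0 for constant columns to avoid division by zero.
    scale = std if (with_std and std != 0) else 1.0
    return [(v - shift) / scale for v in values]


column = [2.0, 4.0, 6.0, 8.0]
mean, std = fit_standard_scaler(column)
scaled = standardize(column, mean, std)
centered_only = standardize(column, mean, std, with_std=False)
```

With both flags enabled, the scaled column has a mean of 0 and a standard deviation of 1. Disabling `with_std` only centers the values, and disabling `with_mean` only rescales them.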
Configure the component in Machine Learning Designer
Input ports
Input port (from left to right) | Data type | Recommended upstream component | Required |
data | Integer | | Yes |
Component parameters
Tab | Parameter | Description |
Field Setting | selectedCols | The names of the columns that you want to process. Multiple columns can be selected. The data in the columns can only be of a numeric type. |
Parameter Setting | withMean | Specifies whether to use the mean. By default, the mean is used. |
Parameter Setting | withStd | Specifies whether to use the standard deviation. By default, the standard deviation is used. |
Execution Tuning | Number of Workers | The number of workers. This parameter must be used together with the Memory per worker, unit MB parameter. The value of this parameter must be a positive integer. Valid values: [1,9999]. |
Execution Tuning | Memory per worker, unit MB | The memory size of each worker. Valid values: 1024 to 65536. Unit: MB. |
Output ports
Output port (from left to right) | Storage location | Recommended downstream component | Model type |
model | N/A | None |
Example
You can copy the following code to the code editor of the PyAlink Script component. This allows the PyAlink Script component to function like the Standard Scaler Train component.
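from pyalink.alink import *

def main(sources, sinks, parameter):
    # The first input port supplies the training data.
    data = sources[0]
    # Numeric columns for which the mean and standard deviation are computed.
    selectedColNames = ["col2", "col3"]
    modelop = StandardScalerTrainBatchOp()\
        .setSelectedCols(selectedColNames)
    # Train the scaler model and write it to the first output port.
    result = modelop.linkFrom(data)
    result.link(sinks[0])
    BatchOperator.execute()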