Data Pivoting

The Data Pivoting component provided by Machine Learning Designer allows you to view the distribution of values in feature columns and label columns, which facilitates follow-up data analysis. The component supports both sparse and dense data formats. This topic describes how to configure the component and provides an example of how to use it.

Configure the component

You can use one of the following methods to configure the Data Pivoting component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Data Pivoting component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.


Tab	Parameter	Description
Fields Setting	Feature Columns	The columns that represent the features of data in training samples.
	Target Column	The label column, which contains the target values for training.
	Enumeration Features	The features that you want to use as enumeration features.
	Sparse Format (K:V,K:V)	Specifies whether data in the sparse format is used.
Parameters Setting	Continuous Feature Discretization Intervals	The maximum number of intervals for the equal-distance division of continuous features.
Tuning	Cores	The number of cores used in computing. The value must be a positive integer.
Tuning	Memory Size per Core	The memory size of each core. Valid values: 1 to 65536. Unit: MB.
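
The equal-distance division controlled by Continuous Feature Discretization Intervals can be illustrated with a minimal Python sketch. It is not the component's implementation: the function name equal_width_counts is hypothetical, and the sketch assumes that exactly max_bins intervals are used and that the maximum value is counted in the last interval. The sample values are the age column from the example input in this topic.

```python
def equal_width_counts(values, max_bins):
    """Count how many values fall into each of max_bins equal-width intervals.

    Hypothetical helper for illustration; the component's exact boundary
    handling is an assumption here.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so a single interval holds everything.
        return [len(values)]
    width = (hi - lo) / max_bins
    counts = [0] * max_bins
    for v in values:
        # Clamp the index so that the maximum value lands in the last interval.
        idx = min(int((v - lo) / width), max_bins - 1)
        counts[idx] += 1
    return counts


# age column from the example input
ages = [39, 50, 38, 53, 28, 37, 49, 52, 31, 42]
print(equal_width_counts(ages, 5))  # [2, 1, 3, 0, 4]
```

With five intervals over the range 28 to 53, each interval is 5 wide, and the fourth interval (43 to 48) is empty for this sample.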

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI
-name fe_meta_runner
-project algo_public
-DinputTable="pai_dense_10_10"
-DoutputTable="pai_temp_2263_20384_1"
-DmapTable="pai_temp_2263_20384_2"
-DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome"
-DlabelCol="y"
-DcategoryCols="previous"
-Dlifecycle="28"
-DmaxBins="5";


Parameter	Required	Description	Default value
inputTable	Yes	The name of the input table.	None
inputTablePartitions	No	The partitions that are selected from the input table for training. Supported formats: partition_name=value for a single partition, and name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,).	None
outputTable	Yes	The name of the output table.	None
mapTable	Yes	The output mapping table. The Data Pivoting component maps STRING-type data to INT-type data for PAI to use for training.	None
selectedCols	Yes	The columns that are selected from the input table.	None
labelCol	No	The label column, which contains the target values for training.	None
categoryCols	No	The INT- or DOUBLE-type columns that you want to use as enumeration features.	None
maxBins	No	The maximum number of intervals for the equal-distance division of continuous features.	100
isSparse	No	Specifies whether the input data is sparse. Valid values: true and false.	false
itemSpliter	No	The delimiter that is used to separate key-value pairs if data in the input table is in the sparse format.	,
kvSpliter	No	The delimiter that is used to separate keys and values if data in the input table is in the sparse format.	:
lifecycle	No	The lifecycle of the output table.	28
coreNum	No	The number of cores used in computing. The value must be a positive integer. Valid values: 1 to 9999.	Determined by the system
memSizePerCore	No	The memory size of each core. Valid values: 1 to 65536. Unit: MB.	Determined by the system
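
To make the roles of itemSpliter and kvSpliter concrete, the following Python sketch parses one sparse-format value of the form K:V,K:V. It is an illustration only: parse_sparse_row is a hypothetical helper, not part of PAI, and it assumes that every value is numeric.

```python
def parse_sparse_row(row, item_spliter=",", kv_spliter=":"):
    """Split a sparse-format string such as "1:0.5,3:2" into a {key: value} dict.

    Hypothetical helper; the defaults mirror the itemSpliter and kvSpliter
    defaults listed in the parameter table.
    """
    features = {}
    for item in row.split(item_spliter):
        item = item.strip()
        if not item:
            continue  # tolerate empty input or a trailing delimiter
        key, value = item.split(kv_spliter, 1)
        features[key.strip()] = float(value)  # assumption: values are numeric
    return features


print(parse_sparse_row("1:0.5,3:2,7:1"))
# Custom delimiters, comparable to setting itemSpliter=";" and kvSpliter="="
print(parse_sparse_row("a=1;b=2", item_spliter=";", kv_spliter="="))
```

Keys that do not appear in a row are simply absent from the result, which is what distinguishes the sparse format from the dense one.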

Examples

Input data


age	workclass	fwlght	edu	edu_num	married	c	family	race	sex	gail	work_year	country	income
39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174.0	40.0	United-States	<=50K
50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0.0	13.0	United-States	<=50K
38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0.0	40.0	United-States	<=50K
53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0.0	40.0	United-States	<=50K
28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0.0	40.0	Other	<=50K
37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0.0	40.0	United-States	<=50K
49	Private	160187	9th	5	Married-spouse-absent	Other-service	Not-in-family	Black	Female	0.0	16.0	Jamaica	<=50K
52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0.0	45.0	United-States	>50K
31	Private	45781	Masters	14	Never-married	Prof-specialty	Not-in-family	White	Female	14084.0	50.0	United-States	>50K
42	Private	159449	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	5178.0	40.0	United-States	>50K

Modeling

Click the Data Pivoting component and then click the Fields Setting tab. Set the Target Column parameter to income and specify the other 13 columns for the Feature Columns parameter. The BIGINT-type values in the edu_num column are used as enumeration values.

Result

- Right-click Data Pivoting and choose View Data > Output Port. The STRING-type values in the family, race, sex, and income columns are converted into numeric values that PAI can use for training, similar to a data format conversion.
- Right-click Data Pivoting and choose View Data > String Column Feature Mapping Table.
  Note: If the Feature Columns parameter does not include any STRING-type columns, the String Column Feature Mapping Table is empty in the output.
- Right-click Data Pivoting and choose View Data > Output Meta Table. The distribute_info field indicates the number of records in each interval, where the intervals evenly divide the range between the minimum value and the maximum value.
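
The relationship between the output port and the String Column Feature Mapping Table can be sketched in Python as follows. This is a simplified illustration, not the component's logic: the helper names are hypothetical, and assigning integer codes in first-seen order is an assumption, because this topic does not state how the component orders the codes. The sample values are the race column from the input data.

```python
def build_string_mapping(column_values):
    """Assign each distinct string an integer code in first-seen order.

    Hypothetical helper; the ordering rule is an assumption.
    """
    mapping = {}
    for value in column_values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def encode_column(column_values, mapping):
    """Replace each string with its integer code from the mapping."""
    return [mapping[value] for value in column_values]


# race column from the example input
race = ["White", "White", "White", "Black", "Black",
        "White", "Black", "White", "White", "White"]
mapping = build_string_mapping(race)
print(mapping)                       # {'White': 0, 'Black': 1}
print(encode_column(race, mapping))  # [0, 0, 0, 1, 1, 0, 1, 0, 0, 0]
```

In this sketch, the mapping dictionary plays the role of the mapping table and the encoded list plays the role of the converted column in the output port. A column with no string values produces no mapping entries, which matches the note about an empty mapping table.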