A random forest is a classifier that contains multiple decision trees. The classification result is determined by the mode of the output classes from the individual trees.
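As a minimal sketch of the voting step described above, the snippet below takes the class predicted by each tree and returns the most frequent one. The function name `majority_vote` is hypothetical and only illustrates the idea; it is not part of the component.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most frequent class label among the per-tree predictions."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    # most_common(1) returns a one-element list: [(label, count)].
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


# Three trees vote; two of them predict "good".
result = majority_vote(["good", "bad", "good"])
print(result)
```

How ties are broken is not specified in this topic, so the sketch simply relies on the behavior of `Counter`.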
Component configuration
You can configure the parameters for the random forest component in one of the following ways.
Method 1: Use the GUI
You can configure the component parameters on the Designer workflow page.
| Tab | Parameter | Description |
| --- | --- | --- |
| Field Settings | Feature Columns | By default, all columns are selected except for the label column and weight column. |
| Field Settings | Excluded Columns | The columns that are not used for training. This parameter cannot be used together with Feature Columns. |
| Field Settings | Forced Conversion Columns | Columns are parsed according to their data types. Note: To parse a BIGINT column as CATEGORICAL, you must use the forceCategorical parameter to specify the type. |
| Field Settings | Weight Column Name | The column used to weight each sample row. Only numeric types are supported. |
| Field Settings | Label Column | The label column in the input table. STRING and numeric types are supported. |
| Parameter Settings | Number of Trees in the Forest | The value must be an integer from 1 to 1,000. |
| Parameter Settings | Position of an Individual Tree in the Forest | If the forest has N trees and algorithmTypes=[a,b], trees [0,a) use the ID3 algorithm, trees [a,b) use the CART algorithm, and trees [b,N) use the C4.5 algorithm. For example, in a five-tree forest (trees 0 to 4), if you set this parameter to [2,4], trees 0 and 1 use the ID3 algorithm, trees 2 and 3 use the CART algorithm, and tree 4 uses the C4.5 algorithm. If you enter None, the algorithms are evenly distributed among the trees in the forest. |
| Parameter Settings | Number of Random Features for a Single Tree | The value must be in the range of [1,N], where N is the number of features. |
| Parameter Settings | Minimum Number of Records on a Leaf Node | A positive integer. The default value is 2. |
| Parameter Settings | Minimum Ratio of Records on a Leaf Node to Its Parent Node | The value must be in the range of [0,1]. The default value is 0. |
| Parameter Settings | Maximum Depth of a Single Tree | The value must be in the range of [1,+∞). The default value is infinity. |
| Parameter Settings | Number of Random Records for a Single Tree | The value must be in the range of (1000,1000000]. The default value is 100,000. |
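The position rule for individual trees can be sketched as follows. This is an illustration under stated assumptions, not the component's implementation: it assumes 0-based tree indices and half-open ranges, so that [a,b] assigns ID3 to trees [0,a), CART to trees [a,b), and C4.5 to trees [b,N). The function name `assign_algorithms` is hypothetical.

```python
def assign_algorithms(tree_num, algorithm_types):
    """Return the algorithm assumed for each tree index 0..tree_num-1.

    Assumption: trees [0, a) use ID3, [a, b) use CART, [b, tree_num) use C4.5.
    """
    a, b = algorithm_types
    if not 0 <= a <= b <= tree_num:
        raise ValueError("expected 0 <= a <= b <= tree_num")
    return ["ID3"] * a + ["CART"] * (b - a) + ["C4.5"] * (tree_num - b)


# Five-tree forest with the parameter set to [2,4].
assignment = assign_algorithms(5, [2, 4])
for index, algorithm in enumerate(assignment):
    print(index, algorithm)
```

With these assumptions, a five-tree forest and [2,4] yields two ID3 trees, two CART trees, and one C4.5 tree.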
Method 2: Use a PAI command
You can configure the component parameters using a PAI command, which you can run from the SQL script component. For more information, see the SQL script topic.
PAI -name randomforests
-project algo_public
-DinputTableName="pai_rf_test_input"
-DmodelName="pai_rf_test_model"
-DforceCategorical="f1"
-DlabelColName="class"
-DfeatureColNames="f0,f1"
-DmaxRecordSize="100000"
-DminNumPer="0"
-DminNumObj="2"
-DtreeNum="3";
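If you generate such commands from a script, the layout above can be assembled from a dictionary of parameters. The helper below is a hypothetical convenience sketch, not part of PAI; it only formats text in the same `-Dname="value"` style as the sample command and does not submit anything.

```python
def build_pai_command(name, project, params):
    """Format a PAI command string in the same layout as the sample above."""
    lines = [f"PAI -name {name}", f"-project {project}"]
    # Each parameter becomes one -Dkey="value" line, in insertion order.
    lines += [f'-D{key}="{value}"' for key, value in params.items()]
    return "\n".join(lines) + ";"


command = build_pai_command(
    "randomforests",
    "algo_public",
    {
        "inputTableName": "pai_rf_test_input",
        "modelName": "pai_rf_test_model",
        "labelColName": "class",
        "treeNum": 3,
    },
)
print(command)
```

The resulting string can be pasted into the SQL script component.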
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The input table. | None |
| inputTablePartitions | No | The partitions in the input table that are used for training. Note: If you specify multiple partitions, separate them with commas (,). | All partitions |
| labelColName | Yes | The name of the label column in the input table. | None |
| modelName | Yes | The name of the output model. | None |
| treeNum | Yes | The number of trees in the forest. The value must be an integer from 1 to 1000. | 100 |
| excludedColNames | No | The columns that are not used for training. This parameter cannot be used together with featureColNames. | Empty |
| weightColName | No | The name of the weight column in the input table. | None |
| featureColNames | No | The names of the feature columns in the input table that are used for training. | All columns except for the ones specified by labelColName and weightColName. |
| forceCategorical | No | Columns are parsed according to their data types. Note: To parse a BIGINT column as CATEGORICAL, you must use the forceCategorical parameter to specify the type. | INT is parsed as a continuous type. |
| algorithmTypes | No | The position of the algorithm for a single tree in the forest. If a forest has N trees and algorithmTypes=[a,b] is specified, trees [0,a) use the ID3 algorithm, trees [a,b) use the CART algorithm, and trees [b,N) use the C4.5 algorithm. For example, in a forest that has five trees (trees 0 to 4), if you set this parameter to [2,4], trees 0 and 1 use the ID3 algorithm, trees 2 and 3 use the CART algorithm, and tree 4 uses the C4.5 algorithm. If you enter None, the algorithms are evenly distributed in the forest. | The algorithms are evenly distributed in the forest. |
| randomColNum | No | The number of random features selected for each split when a single tree is generated. The value must be in the range of [1,N], where N is the number of features. | log₂N |
| minNumObj | No | The minimum number of records on a leaf node. The value must be a positive integer. | 2 |
| minNumPer | No | The minimum ratio of records on a leaf node to its parent node. The value must be in the range of [0,1]. | 0.0 |
| maxTreeDeep | No | The maximum depth of a single tree. The value must be in the range of [1,+∞). | infinity |
| maxRecordSize | No | The number of random records for a single tree. The value must be in the range of (1000,1000000]. | 100000 |
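The value ranges listed in the table can be checked before a command is submitted. The following is a minimal sketch that assumes the parameters are held in a Python dictionary keyed by the names above; `validate_rf_params` is a hypothetical helper, and the component itself may validate differently or report different messages.

```python
def validate_rf_params(params, feature_count):
    """Return a list of problems; an empty list means all checked values are in range."""
    problems = []
    tree_num = params.get("treeNum", 100)
    if not (isinstance(tree_num, int) and 1 <= tree_num <= 1000):
        problems.append("treeNum must be an integer from 1 to 1000")
    if "randomColNum" in params and not 1 <= params["randomColNum"] <= feature_count:
        problems.append("randomColNum must be in [1, N], N = number of features")
    min_num_obj = params.get("minNumObj", 2)
    if not (isinstance(min_num_obj, int) and min_num_obj >= 1):
        problems.append("minNumObj must be a positive integer")
    if not 0 <= params.get("minNumPer", 0.0) <= 1:
        problems.append("minNumPer must be in [0, 1]")
    if "maxTreeDeep" in params and params["maxTreeDeep"] < 1:
        problems.append("maxTreeDeep must be at least 1")
    # The range (1000, 1000000] excludes 1000 and includes 1000000.
    if not 1000 < params.get("maxRecordSize", 100000) <= 1000000:
        problems.append("maxRecordSize must be in (1000, 1000000]")
    if "excludedColNames" in params and "featureColNames" in params:
        problems.append("excludedColNames cannot be used with featureColNames")
    return problems


ok = validate_rf_params(
    {"treeNum": 3, "minNumObj": 2, "minNumPer": 0, "maxRecordSize": 100000},
    feature_count=2,
)
bad = validate_rf_params({"treeNum": 0, "maxRecordSize": 1000}, feature_count=2)
print(ok)
print(bad)
```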
Examples
- Use an SQL statement to generate training data.
create table pai_rf_test_input as
select * from (
  select 1 as f0, 2 as f1, "good" as class
  union all
  select 1 as f0, 3 as f1, "good" as class
  union all
  select 1 as f0, 4 as f1, "bad" as class
  union all
  select 0 as f0, 3 as f1, "good" as class
  union all
  select 0 as f0, 4 as f1, "bad" as class
) tmp;
- Submit the parameters for the random forest component using a PAI command.
PAI -name randomforests
-project algo_public
-DinputTableName="pai_rf_test_input"
-DmodelName="pai_rf_test_model"
-DforceCategorical="f1"
-DlabelColName="class"
-DfeatureColNames="f0,f1"
-DmaxRecordSize="100000"
-DminNumPer="0"
-DminNumObj="2"
-DtreeNum="3";
- View the Predictive Model Markup Language (PMML) of the model.
<?xml version="1.0" encoding="utf-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.2" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
  <Header copyright="Copyright (c) 2014, Alibaba Inc." description="">
    <Application name="ODPS/PMML" version="0.1.0"/>
    <Timestamp>Tue, 12 Jul 2016 07:04:48 GMT</Timestamp>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="f0" optype="continuous" dataType="integer"/>
    <DataField name="f1" optype="continuous" dataType="integer"/>
    <DataField name="class" optype="categorical" dataType="string">
      <Value value="bad"/>
      <Value value="good"/>
    </DataField>
  </DataDictionary>
  <MiningModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
    <MiningSchema>
      <MiningField name="f0" usageType="active"/>
      <MiningField name="f1" usageType="active"/>
      <MiningField name="class" usageType="target"/>
    </MiningSchema>
    <Segmentation multipleModelMethod="majorityVote">
      <Segment id="0">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="good">
              <SimplePredicate field="f1" operator="equal" value="2"/>
              <ScoreDistribution value="good" recordCount="1"/>
            </Node>
            <Node id="3" score="good">
              <SimplePredicate field="f1" operator="equal" value="3"/>
              <ScoreDistribution value="good" recordCount="2"/>
            </Node>
            <Node id="4" score="bad">
              <SimplePredicate field="f1" operator="equal" value="4"/>
              <ScoreDistribution value="bad" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
      <Segment id="1">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="good">
              <SimpleSetPredicate field="f1" booleanOperator="isIn">
                <Array n="2" type="integer">2 3</Array>
              </SimpleSetPredicate>
              <ScoreDistribution value="good" recordCount="3"/>
            </Node>
            <Node id="3" score="bad">
              <SimpleSetPredicate field="f1" booleanOperator="isNotIn">
                <Array n="2" type="integer">2 3</Array>
              </SimpleSetPredicate>
              <ScoreDistribution value="bad" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
      <Segment id="2">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="bad">
              <SimplePredicate field="f0" operator="lessOrEqual" value="0.5"/>
              <ScoreDistribution value="bad" recordCount="1"/>
              <ScoreDistribution value="good" recordCount="1"/>
            </Node>
            <Node id="3" score="good">
              <SimplePredicate field="f0" operator="greaterThan" value="0.5"/>
              <ScoreDistribution value="bad" recordCount="1"/>
              <ScoreDistribution value="good" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>
- View the visual output of the model.
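To make the PMML in this example easier to read, the three trees can be transcribed by hand and combined with a majority vote, which is the `multipleModelMethod` named in the `Segmentation` element. The sketch below is a manual transcription for illustration only; it does not parse PMML, and the function names are hypothetical.

```python
from collections import Counter

# Training rows from the example table: (f0, f1, class).
ROWS = [(1, 2, "good"), (1, 3, "good"), (1, 4, "bad"), (0, 3, "good"), (0, 4, "bad")]


def tree_0(f0, f1):
    # Segment 0: one leaf per value of f1 seen in training.
    return {2: "good", 3: "good", 4: "bad"}[f1]


def tree_1(f0, f1):
    # Segment 1: f1 in {2, 3} -> good, otherwise bad.
    return "good" if f1 in (2, 3) else "bad"


def tree_2(f0, f1):
    # Segment 2: f0 <= 0.5 -> bad, f0 > 0.5 -> good.
    return "bad" if f0 <= 0.5 else "good"


def forest_predict(f0, f1):
    """Combine the three transcribed trees with a majority vote."""
    votes = [tree(f0, f1) for tree in (tree_0, tree_1, tree_2)]
    return Counter(votes).most_common(1)[0][0]


predictions = [forest_predict(f0, f1) for f0, f1, _label in ROWS]
labels = [label for _f0, _f1, label in ROWS]
print(predictions == labels)
```

On the five training rows, the vote reproduces every label, even though the third tree, which splits only on f0, misclassifies some rows on its own. Note that `tree_0` as transcribed handles only the f1 values that appear in the training data.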
