Random forest

A random forest is a classifier that contains multiple decision trees. The classification result is the mode of the classes output by the individual trees, that is, the class that the most trees predict.
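
The voting rule can be illustrated with a minimal Python sketch. This is not the component's implementation: the function name `majority_vote` is hypothetical, and the tie-breaking behavior (the class seen first wins) is an assumption, because this topic does not specify how ties are resolved.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class among the per-tree predictions.

    Ties go to the class that appears first in the list. This is an
    assumption of the sketch, not documented behavior of the component.
    """
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    return Counter(tree_predictions).most_common(1)[0][0]

# Three trees vote on one sample: two predict "good", one predicts "bad".
print(majority_vote(["good", "bad", "good"]))
```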

Component configuration

You can configure the parameters for the random forest component in one of the following ways.

Method 1: Use the GUI

You can configure the component parameters on the Designer workflow page.

Tab	Parameter	Description
Field Settings	Feature Columns	By default, all columns are selected except for the label column and weight column.
	Excluded Columns	The columns that are not used for training. This parameter cannot be used with Feature Columns.
	Forced Conversion Columns	The columns are parsed based on the following rules: Columns of the STRING, BOOLEAN, and DATETIME types are parsed as discrete types. Columns of the DOUBLE and BIGINT types are parsed as continuous types. Note: To parse a BIGINT column as a categorical (discrete) type, specify the column in the forceCategorical parameter.
	Weight Column Name	The column used to weight each sample row. Only numeric types are supported.
	Label Column	The label column in the input table. STRING and numeric types are supported.
Parameter Settings	Number of Trees in the Forest	The value must be an integer from 1 to 1,000.
	Position of an Individual Tree in the Forest	If the forest has N trees, indexed from 0, and algorithmTypes=[a,b], then: trees in the range [0,a) use the ID3 algorithm, trees in the range [a,b) use the CART algorithm, and trees in the range [b,N) use the C4.5 algorithm. For example, in a five-tree forest, if you set this parameter to [2,4], trees 0 and 1 use the ID3 algorithm, trees 2 and 3 use the CART algorithm, and tree 4 uses the C4.5 algorithm. If you enter None, the algorithms are evenly distributed among the trees in the forest.
	Number of Random Features for a Single Tree	The value must be in the range of [1,N], where N is the number of features.
	Minimum Number of Records on a Leaf Node	A positive integer. The default value is 2.
	Minimum Ratio of Records on a Leaf Node to Its Parent Node	The value must be in the range of [0,1]. The default value is 0.
	Maximum Depth of a Single Tree	The value must be in the range of [1,+∞). The default value is infinity.
	Number of Random Records for a Single Tree	The value must be in the range of (1000,1000000]. The default value is 100,000.
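
The mapping from tree position to algorithm can be sketched in Python as follows. This is an illustration only: the helper name `assign_algorithms` is hypothetical, and the sketch assumes that trees are indexed from 0 and that every tree at index b or higher uses C4.5.

```python
def assign_algorithms(tree_num, algorithm_types):
    """Map each 0-based tree index to an algorithm name.

    algorithm_types is the pair [a, b]: indexes in [0, a) use ID3,
    indexes in [a, b) use CART, and the remaining indexes use C4.5.
    """
    a, b = algorithm_types
    if not 0 <= a <= b <= tree_num:
        raise ValueError("expected 0 <= a <= b <= tree_num")
    return [
        "ID3" if i < a else "CART" if i < b else "C4.5"
        for i in range(tree_num)
    ]

# A five-tree forest with algorithmTypes=[2,4].
print(assign_algorithms(5, [2, 4]))
# ['ID3', 'ID3', 'CART', 'CART', 'C4.5']
```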

Method 2: Use a PAI command

You can configure the component parameters using a PAI command. You can use the SQL script component to run PAI commands. For more information, see the SQL script topic.

PAI -name randomforests
     -project algo_public
     -DinputTableName="pai_rf_test_input"
     -DmodelName="pai_rf_test_model"
     -DforceCategorical="f1"
     -DlabelColName="class"
     -DfeatureColNames="f0,f1"
     -DmaxRecordSize="100000"
     -DminNumPer="0"
     -DminNumObj="2"
     -DtreeNum="3";

Parameter	Required	Description	Default value
inputTableName	Yes	The input table.	None
inputTablePartitions	No	The partitions in the input table that are used for training. The following formats are supported: Partition_name=value for a single-level partition, and name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,).	All partitions
labelColName	Yes	The name of the label column in the input table.	None
modelName	Yes	The name of the output model.	None
treeNum	Yes	The number of trees in the forest. The value must be an integer from 1 to 1000.	100
excludedColNames	No	The columns that are not used for training. This parameter cannot be used with featureColNames.	Empty
weightColName	No	The name of the weight column in the input table.	None
featureColNames	No	The names of the feature columns in the input table that are used for training.	All columns except for the ones specified by labelColName and weightColName.
forceCategorical	No	The following parsing rules apply: Columns of the STRING, BOOLEAN, and DATETIME types are parsed as discrete types. Columns of the DOUBLE and BIGINT types are parsed as continuous types. Note: To parse a BIGINT column as a categorical (discrete) type, specify the column in the forceCategorical parameter.	INT is parsed as a continuous type.
algorithmTypes	No	The position of the algorithm for a single tree in the forest. If the forest has N trees, indexed from 0, and algorithmTypes=[a,b] is specified: trees in the range [0,a) use the ID3 algorithm, trees in the range [a,b) use the CART algorithm, and trees in the range [b,N) use the C4.5 algorithm. For example, in a forest that has five trees, if you set this parameter to [2,4], trees 0 and 1 use the ID3 algorithm, trees 2 and 3 use the CART algorithm, and tree 4 uses the C4.5 algorithm. If you enter None, the algorithms are evenly distributed in the forest.	The algorithms are evenly distributed in the forest.
randomColNum	No	The number of random features selected for each split when a single tree is generated. The value must be in the range of [1,N], where N is the number of features.	log₂N
minNumObj	No	The minimum number of records on a leaf node. The value must be a positive integer.	2
minNumPer	No	The minimum ratio of records on a leaf node to its parent node. The value must be in the range of [0,1].	0.0
maxTreeDeep	No	The maximum depth of a single tree. The value must be in the range of [1,+∞).	infinity
maxRecordSize	No	The number of random records for a single tree. The value must be in the range of (1000,1000000].	100000
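
As a hedged illustration of how the parameters in the preceding table fit together, the following Python sketch checks a few of the documented constraints and renders the command text. The helper `build_pai_command` is hypothetical and is not part of any PAI SDK. It only produces a string that you could paste into an SQL script component.

```python
def build_pai_command(params):
    """Render a randomforests PAI command from a dict of -D parameters."""
    # Parameters marked as required in the table above.
    for required in ("inputTableName", "labelColName", "modelName", "treeNum"):
        if required not in params:
            raise ValueError(f"missing required parameter: {required}")
    # Range checks copied from the parameter descriptions.
    if not 1 <= int(params["treeNum"]) <= 1000:
        raise ValueError("treeNum must be an integer from 1 to 1000")
    if "minNumPer" in params and not 0 <= float(params["minNumPer"]) <= 1:
        raise ValueError("minNumPer must be in the range [0,1]")
    if "maxRecordSize" in params and not 1000 < int(params["maxRecordSize"]) <= 1000000:
        raise ValueError("maxRecordSize must be in the range (1000,1000000]")
    if "featureColNames" in params and "excludedColNames" in params:
        raise ValueError("excludedColNames cannot be used with featureColNames")
    lines = ["PAI -name randomforests", "     -project algo_public"]
    lines += [f'     -D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"

cmd = build_pai_command({
    "inputTableName": "pai_rf_test_input",
    "modelName": "pai_rf_test_model",
    "labelColName": "class",
    "featureColNames": "f0,f1",
    "treeNum": 3,
})
print(cmd)
```

Validating the values before you submit the command surfaces range errors locally instead of at job run time.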

Examples

Use an SQL statement to generate training data.

create table pai_rf_test_input as
select * from
(
  select 1 as f0,2 as f1, "good" as class
  union all
  select 1 as f0,3 as f1, "good" as class
  union all
  select 1 as f0,4 as f1, "bad" as class
  union all
  select 0 as f0,3 as f1, "good" as class
  union all
  select 0 as f0,4 as f1, "bad" as class
)tmp;

Submit the parameters for the random forest component using a PAI command.

PAI -name randomforests
     -project algo_public
     -DinputTableName="pai_rf_test_input"
     -DmodelName="pai_rf_test_model"
     -DforceCategorical="f1"
     -DlabelColName="class"
     -DfeatureColNames="f0,f1"
     -DmaxRecordSize="100000"
     -DminNumPer="0"
     -DminNumObj="2"
     -DtreeNum="3";

View the Predictive Model Markup Language (PMML) of the model.

<?xml version="1.0" encoding="utf-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.2" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
  <Header copyright="Copyright (c) 2014, Alibaba Inc." description="">
    <Application name="ODPS/PMML" version="0.1.0"/>
    <Timestamp>Tue, 12 Jul 2016 07:04:48 GMT</Timestamp>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="f0" optype="continuous" dataType="integer"/>
    <DataField name="f1" optype="continuous" dataType="integer"/>
    <DataField name="class" optype="categorical" dataType="string">
      <Value value="bad"/>
      <Value value="good"/>
    </DataField>
  </DataDictionary>
  <MiningModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
    <MiningSchema>
      <MiningField name="f0" usageType="active"/>
      <MiningField name="f1" usageType="active"/>
      <MiningField name="class" usageType="target"/>
    </MiningSchema>
    <Segmentation multipleModelMethod="majorityVote">
      <Segment id="0">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="good">
              <SimplePredicate field="f1" operator="equal" value="2"/>
              <ScoreDistribution value="good" recordCount="1"/>
            </Node>
            <Node id="3" score="good">
              <SimplePredicate field="f1" operator="equal" value="3"/>
              <ScoreDistribution value="good" recordCount="2"/>
            </Node>
            <Node id="4" score="bad">
              <SimplePredicate field="f1" operator="equal" value="4"/>
              <ScoreDistribution value="bad" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
      <Segment id="1">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="good">
              <SimpleSetPredicate field="f1" booleanOperator="isIn">
                <Array n="2" type="integer">2 3</Array>
              </SimpleSetPredicate>
              <ScoreDistribution value="good" recordCount="3"/>
            </Node>
            <Node id="3" score="bad">
              <SimpleSetPredicate field="f1" booleanOperator="isNotIn">
                <Array n="2" type="integer">2 3</Array>
              </SimpleSetPredicate>
              <ScoreDistribution value="bad" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
      <Segment id="2">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="bad">
              <SimplePredicate field="f0" operator="lessOrEqual" value="0.5"/>
              <ScoreDistribution value="bad" recordCount="1"/>
              <ScoreDistribution value="good" recordCount="1"/>
            </Node>
            <Node id="3" score="good">
              <SimplePredicate field="f0" operator="greaterThan" value="0.5"/>
              <ScoreDistribution value="bad" recordCount="1"/>
              <ScoreDistribution value="good" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>
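
To make the PMML easier to read, the three segments can be hand-translated into Python and combined by majority vote, as the majorityVote method of the Segmentation element indicates. This is a sketch for illustration, not how PAI scores the model: the function names are hypothetical, and ignoring a tree that matches no branch is an assumption of the sketch.

```python
from collections import Counter

def tree_0(row):
    # Segment 0: one branch per value of f1 (2 and 3 -> good, 4 -> bad).
    # Returns None when f1 matches no branch.
    return {2: "good", 3: "good", 4: "bad"}.get(row["f1"])

def tree_1(row):
    # Segment 1: f1 in {2, 3} -> good, otherwise bad.
    return "good" if row["f1"] in (2, 3) else "bad"

def tree_2(row):
    # Segment 2: split on f0 at 0.5 (f0 <= 0.5 -> bad, f0 > 0.5 -> good).
    return "bad" if row["f0"] <= 0.5 else "good"

def predict(row):
    # Collect one vote per tree and return the most frequent class.
    # Trees that return None are skipped (assumption of this sketch).
    votes = [v for v in (tree_0(row), tree_1(row), tree_2(row)) if v is not None]
    return Counter(votes).most_common(1)[0][0]

# The five rows of pai_rf_test_input.
rows = [
    {"f0": 1, "f1": 2, "class": "good"},
    {"f0": 1, "f1": 3, "class": "good"},
    {"f0": 1, "f1": 4, "class": "bad"},
    {"f0": 0, "f1": 3, "class": "good"},
    {"f0": 0, "f1": 4, "class": "bad"},
]
for row in rows:
    print(row["f0"], row["f1"], "->", predict(row))
```

On these five training rows, the vote of the three trees reproduces the class column, even though the third tree alone misclassifies two of the rows.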

View the visual output of the model.