The Random Forest component is a classifier that consists of multiple decision trees. The predicted class is the mode of the classes output by the individual trees, that is, the class that the largest number of trees vote for.
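
As a minimal illustration of the voting step, the following sketch takes the class predicted by each tree and returns the most frequent one. The function name majority_vote and the tie-breaking behavior (the class seen first wins) are assumptions made for this sketch. How the component itself resolves ties is not described here.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most frequent class among the per-tree predictions."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    # most_common(1) returns [(class, count)] for the most frequent class.
    # Classes with equal counts keep first-seen order (an assumption of
    # this sketch, not documented behavior of the component).
    return Counter(tree_predictions).most_common(1)[0][0]


# Three trees vote; two of them predict "good".
print(majority_vote(["good", "bad", "good"]))  # prints "good"
```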

Configure the component

You can configure the component by using one of the following methods:
  • Use the Machine Learning Platform for AI console
    The console parameters are grouped by tab. Each entry lists the parameter and its description.
    Fields Setting tab:
      Feature Columns: By default, all columns except the label column and the weight column are selected.
      Excluded Columns: The columns that are not used for training. These columns cannot be used as feature columns.
      Forced Conversion Column: Columns are parsed based on the following rules:
        • Columns of the STRING, BOOLEAN, or DATETIME type are parsed as discrete columns.
        • Columns of the DOUBLE or BIGINT type are parsed as continuous columns.
        Note: To parse a BIGINT column as a categorical column, specify the column in the forceCategorical parameter.
      Weight Columns: The column that contains the weight of each sample row. Columns of numeric data types are supported.
      Label Column: The label column in the input table. Columns of the STRING type and numeric data types are supported.
    Parameters Setting tab:
      Number of Decision Trees in the Forest: The number of trees. Valid values: 1 to 1000.
      Single Decision Tree Algorithm: If the forest has N trees, numbered from 0, and algorithmTypes=[a,b]:
        • [0,a) indicates the ID3 algorithm.
        • [a,b) indicates the CART algorithm.
        • [b,N) indicates the C4.5 algorithm.
        For example, if a forest has five trees and algorithmTypes=[2,4], trees 0 and 1 use the ID3 algorithm, trees 2 and 3 use the CART algorithm, and tree 4 uses the C4.5 algorithm. If the value is None, tree algorithms are evenly allocated across the forest.
      Number of Random Features for Each Decision Tree: Valid values: [1,N], where N is the number of features.
      Minimum Number of Leaf Nodes: Valid values: positive integers. Default value: 2.
      Minimum Ratio of Leaf Nodes to Parent Nodes: Valid values: [0,1]. Default value: 0.
      Maximum Decision Tree Depth: Valid values: [1,+∞). Default value: ∞.
      Number of Random Data Input for Each Decision Tree: Valid values: (1000,1000000]. Default value: 100000.
  • Use commands
     PAI -name randomforests
         -project algo_public
         -DinputTableName="pai_rf_test_input"
         -DmodelName="pai_rf_test_model"
         -DforceCategorical="f1"
         -DlabelColName="class"
         -DfeatureColNames="f0,f1"
         -DmaxRecordSize="100000"
         -DminNumPer="0"
         -DminNumObj="2"
         -DtreeNum="3";
    Each parameter is listed with whether it is required and its default value, followed by its description.
    inputTableName (required; default: N/A): The name of the input table.
    inputTablePartitions (optional; default: all partitions): The partitions that are selected from the input table for training. Specify this parameter in one of the following formats:
      • Partition_name=value
      • name1=value1/name2=value2: multi-level partitions
      Note: If you specify multiple partitions, separate them with commas (,).
    labelColName (required; default: N/A): The name of the label column that is selected from the input table.
    modelName (required; default: N/A): The name of the output model.
    treeNum (required; default: 100): The number of trees in the forest. Valid values: 1 to 1000.
    excludedColNames (optional; default: empty string): The columns that are not used for training. These columns cannot be used as feature columns.
    weightColName (optional; default: N/A): The name of the weight column in the input table.
    featureColNames (optional; default: all columns except the label column specified by the labelColName parameter and the weight column specified by the weightColName parameter): The feature columns that are selected from the input table for training.
    forceCategorical (optional; default: INT is a continuous type): Columns are parsed based on the following rules:
      • Columns of the STRING, BOOLEAN, or DATETIME type are parsed as discrete columns.
      • Columns of the DOUBLE or BIGINT type are parsed as continuous columns.
      Note: To parse a BIGINT column as a categorical column, specify the column in the forceCategorical parameter.
    algorithmTypes (optional; default: evenly allocated): The location of each tree algorithm in the forest. If the forest has N trees, numbered from 0, and algorithmTypes=[a,b]:
      • [0,a) indicates the ID3 algorithm.
      • [a,b) indicates the CART algorithm.
      • [b,N) indicates the C4.5 algorithm.
      For example, if a forest has five trees and algorithmTypes=[2,4], trees 0 and 1 use the ID3 algorithm, trees 2 and 3 use the CART algorithm, and tree 4 uses the C4.5 algorithm. If the value is None, tree algorithms are evenly allocated across the forest.
    randomColNum (optional; default: log2(N)): The number of random features that are selected for each split when a single tree is generated. Valid values: [1,N], where N is the number of features.
    minNumObj (optional; default: 2): The minimum number of data records on a leaf node. The value must be a positive integer.
    minNumPer (optional; default: 0.0): The minimum ratio of the data on a leaf node to the data on its parent node. Valid values: [0,1].
    maxTreeDeep (optional): The maximum depth of a single tree. Valid values: [1,+∞).
    maxRecordSize (optional; default: 100000): The number of random data records that are input for a tree. Valid values: (1000,1000000].
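
The interval rule for algorithmTypes can be sketched in Python as follows. The helper assign_algorithms is hypothetical and only mirrors the intervals described for algorithmTypes=[a,b], with trees numbered from 0. It does not cover the None case, because how the component allocates algorithms evenly is not specified here.

```python
def assign_algorithms(tree_num, algorithm_types):
    """Map each tree index to an algorithm name for algorithmTypes=[a, b].

    Hypothetical helper for illustration only; it is not part of PAI.
    """
    a, b = algorithm_types
    if not 0 <= a <= b <= tree_num:
        raise ValueError("expected 0 <= a <= b <= tree_num")
    algorithms = []
    for index in range(tree_num):
        if index < a:
            algorithms.append("ID3")    # trees in [0, a)
        elif index < b:
            algorithms.append("CART")   # trees in [a, b)
        else:
            algorithms.append("C4.5")   # remaining trees, starting at b
    return algorithms


# Five trees with algorithmTypes=[2,4], as in the example for algorithmTypes.
print(assign_algorithms(5, (2, 4)))
# prints ['ID3', 'ID3', 'CART', 'CART', 'C4.5']
```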

Example

  1. Execute the following SQL statements to generate training data:
    create table pai_rf_test_input as
    select * from
    (
      select 1 as f0,2 as f1, "good" as class from dual
      union all
      select 1 as f0,3 as f1, "good" as class from dual
      union all
      select 1 as f0,4 as f1, "bad" as class from dual
      union all
      select 0 as f0,3 as f1, "good" as class from dual
      union all
      select 0 as f0,4 as f1, "bad" as class from dual
    )tmp;
  2. Run the following PAI command to submit the parameters of the Random Forest component:
    PAI -name randomforests
         -project algo_public
         -DinputTableName="pai_rf_test_input"
         -DmodelName="pai_rf_test_model"
         -DforceCategorical="f1"
         -DlabelColName="class"
         -DfeatureColNames="f0,f1"
         -DmaxRecordSize="100000"
         -DminNumPer="0"
         -DminNumObj="2"
         -DtreeNum="3";
  3. View the Predictive Model Markup Language (PMML) of the model.
    <?xml version="1.0" encoding="utf-8"?>
    <PMML xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.2" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
      <Header copyright="Copyright (c) 2014, Alibaba Inc." description="">
        <Application name="ODPS/PMML" version="0.1.0"/>
        <Timestamp>Tue, 12 Jul 2016 07:04:48 GMT</Timestamp>
      </Header>
      <DataDictionary numberOfFields="2">
        <DataField name="f0" optype="continuous" dataType="integer"/>
        <DataField name="f1" optype="continuous" dataType="integer"/>
        <DataField name="class" optype="categorical" dataType="string">
          <Value value="bad"/>
          <Value value="good"/>
        </DataField>
      </DataDictionary>
      <MiningModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
        <MiningSchema>
          <MiningField name="f0" usageType="active"/>
          <MiningField name="f1" usageType="active"/>
          <MiningField name="class" usageType="target"/>
        </MiningSchema>
        <Segmentation multipleModelMethod="majorityVote">
          <Segment id="0">
            <True/>
            <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
              <MiningSchema>
                <MiningField name="f0" usageType="active"/>
                <MiningField name="f1" usageType="active"/>
                <MiningField name="class" usageType="target"/>
              </MiningSchema>
              <Node id="1">
                <True/>
                <ScoreDistribution value="bad" recordCount="2"/>
                <ScoreDistribution value="good" recordCount="3"/>
                <Node id="2" score="good">
                  <SimplePredicate field="f1" operator="equal" value="2"/>
                  <ScoreDistribution value="good" recordCount="1"/>
                </Node>
                <Node id="3" score="good">
                  <SimplePredicate field="f1" operator="equal" value="3"/>
                  <ScoreDistribution value="good" recordCount="2"/>
                </Node>
                <Node id="4" score="bad">
                  <SimplePredicate field="f1" operator="equal" value="4"/>
                  <ScoreDistribution value="bad" recordCount="2"/>
                </Node>
              </Node>
            </TreeModel>
          </Segment>
          <Segment id="1">
            <True/>
            <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
              <MiningSchema>
                <MiningField name="f0" usageType="active"/>
                <MiningField name="f1" usageType="active"/>
                <MiningField name="class" usageType="target"/>
              </MiningSchema>
              <Node id="1">
                <True/>
                <ScoreDistribution value="bad" recordCount="2"/>
                <ScoreDistribution value="good" recordCount="3"/>
                <Node id="2" score="good">
                  <SimpleSetPredicate field="f1" booleanOperator="isIn">
                    <Array n="2" type="integer">2 3</Array>
                  </SimpleSetPredicate>
                  <ScoreDistribution value="good" recordCount="3"/>
                </Node>
                <Node id="3" score="bad">
                  <SimpleSetPredicate field="f1" booleanOperator="isNotIn">
                    <Array n="2" type="integer">2 3</Array>
                  </SimpleSetPredicate>
                  <ScoreDistribution value="bad" recordCount="2"/>
                </Node>
              </Node>
            </TreeModel>
          </Segment>
          <Segment id="2">
            <True/>
            <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
              <MiningSchema>
                <MiningField name="f0" usageType="active"/>
                <MiningField name="f1" usageType="active"/>
                <MiningField name="class" usageType="target"/>
              </MiningSchema>
              <Node id="1">
                <True/>
                <ScoreDistribution value="bad" recordCount="2"/>
                <ScoreDistribution value="good" recordCount="3"/>
                <Node id="2" score="bad">
                  <SimplePredicate field="f0" operator="lessOrEqual" value="0.5"/>
                  <ScoreDistribution value="bad" recordCount="1"/>
                  <ScoreDistribution value="good" recordCount="1"/>
                </Node>
                <Node id="3" score="good">
                  <SimplePredicate field="f0" operator="greaterThan" value="0.5"/>
                  <ScoreDistribution value="bad" recordCount="1"/>
                  <ScoreDistribution value="good" recordCount="2"/>
                </Node>
              </Node>
            </TreeModel>
          </Segment>
        </Segmentation>
      </MiningModel>
    </PMML>
  4. View the visualized output of the Random Forest component.
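
To see how the three trees in the PMML output combine, the following sketch transcribes each Segment by hand into a Python function and applies the majorityVote rule to the five training rows. The function names are hypothetical, and the transcription is manual rather than generated by the component, so treat it as an illustration of the model above and not as a PMML evaluator.

```python
from collections import Counter


def tree_0(f0, f1):
    # Segment 0: one branch per value of f1 (2, 3, or 4).
    # Other values are not covered by this tree and raise KeyError here.
    return {2: "good", 3: "good", 4: "bad"}[f1]


def tree_1(f0, f1):
    # Segment 1: f1 in {2, 3} scores "good"; every other value scores "bad".
    return "good" if f1 in (2, 3) else "bad"


def tree_2(f0, f1):
    # Segment 2: f0 <= 0.5 scores "bad"; f0 > 0.5 scores "good".
    return "bad" if f0 <= 0.5 else "good"


def predict(f0, f1):
    """Combine the three trees with a majority vote."""
    votes = [tree(f0, f1) for tree in (tree_0, tree_1, tree_2)]
    # Three votes over two classes cannot tie, so the mode is unambiguous.
    return Counter(votes).most_common(1)[0][0]


# The rows of pai_rf_test_input: (f0, f1, class).
rows = [(1, 2, "good"), (1, 3, "good"), (1, 4, "bad"), (0, 3, "good"), (0, 4, "bad")]
for f0, f1, label in rows:
    print(f0, f1, label, predict(f0, f1))
```

In this transcription, the third tree alone misclassifies the rows (1, 4) and (0, 3), but the other two trees outvote it, so the forest reproduces the label of every training row.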