Platform for AI: Random Forest

Last Updated: Oct 26, 2023

The Random Forest component is a classifier that consists of multiple decision trees. The class that the forest outputs is the mode of the classes predicted by the individual trees, that is, the result of a majority vote.
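The voting rule is simple to picture. The following Python sketch is illustrative only: it is not PAI's internal implementation, and tree_predictions is a hypothetical input holding one predicted class per tree. The forest returns the most frequent class.

    from collections import Counter

    def forest_predict(tree_predictions):
        """Return the majority-vote (mode) class among the trees' predictions.

        tree_predictions: one predicted class label per decision tree,
        for example ["good", "bad", "good"].
        """
        votes = Counter(tree_predictions)
        # most_common(1) returns [(label, count)] for the most frequent label.
        return votes.most_common(1)[0][0]

    # Three trees voting "good", "bad", "good" make the forest output "good".
    print(forest_predict(["good", "bad", "good"]))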

Configure the component

You can use one of the following methods to configure the Random Forest component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Random Forest component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer is formerly known as Machine Learning Studio. The following parameters are available, grouped by tab.

Fields Setting tab

  • Feature Columns: The columns that are used for training. By default, all columns except the label column and weight column are selected.

  • Excluded Columns: The columns that are excluded from training. These columns cannot be used as feature columns.

  • Forced Conversion Column: Columns are parsed based on the following rules:

      • Columns of the STRING, BOOLEAN, or DATETIME type are parsed as discrete (categorical) columns.

      • Columns of the DOUBLE or BIGINT type are parsed as continuous columns.

    Note: To parse a column of the BIGINT type as a categorical column, you must specify the column in the forceCategorical parameter.

  • Weight Columns: The column that contains the weight of each sample row. Columns of numeric data types are supported.

  • Label Column: The label column in the input table. Columns of the STRING type and numeric data types are supported.

Parameters Setting tab

  • Number of Decision Trees in the Forest: The number of trees in the forest. Valid values: 1 to 1000.

  • Single Decision Tree Algorithm: The allocation of single-tree algorithms across the forest. If the forest has N trees and algorithmTypes=[a,b], trees in [0,a) use the ID3 algorithm, trees in [a,b) use the CART algorithm, and trees in [b,N) use the C4.5 algorithm. For example, if the forest has five trees and the value is [2,4], trees 0 and 1 use ID3, trees 2 and 3 use CART, and tree 4 uses C4.5. If the value is None, the three algorithms are evenly allocated across the forest. See the sketch after this list.

  • Number of Random Features for Each Decision Tree: The number of random features selected for each split. Valid values: [1,N], where N is the number of features.

  • Minimum Number of Leaf Nodes: Valid values: positive integers. Default value: 2.

  • Minimum Ratio of Leaf Nodes to Parent Nodes: Valid values: [0,1]. Default value: 0.

  • Maximum Decision Tree Depth: Valid values: [1,+∞). Default value: +∞.

  • Number of Random Data Inputs for Each Decision Tree: Valid values: (1000,1000000]. Default value: 100000.
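The interval rule for Single Decision Tree Algorithm (the algorithmTypes parameter) can be written out as a short sketch. The function below is hypothetical and only restates the rule described above; how PAI rounds the split when the value is None is an assumption.

    def allocate_algorithms(tree_num, algorithm_types=None):
        """Map each tree index in a forest of tree_num trees to an algorithm.

        algorithm_types is the [a, b] pair described above: trees in [0, a)
        use ID3, trees in [a, b) use CART, and the remaining trees use C4.5.
        """
        if algorithm_types is None:
            # Assumption: spread the three algorithms roughly evenly.
            third = tree_num / 3.0
            algorithm_types = [round(third), round(2 * third)]
        a, b = algorithm_types
        return ["ID3" if i < a else "CART" if i < b else "C4.5"
                for i in range(tree_num)]

    # Five trees with algorithmTypes=[2, 4]:
    # trees 0 and 1 use ID3, trees 2 and 3 use CART, and tree 4 uses C4.5.
    print(allocate_algorithms(5, [2, 4]))
    # ['ID3', 'ID3', 'CART', 'CART', 'C4.5']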

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

 PAI -name randomforests
     -project algo_public
     -DinputTableName="pai_rf_test_input"
     -DmodelName="pai_rf_test_model"
     -DforceCategorical="f1"
     -DlabelColName="class"
     -DfeatureColNames="f0,f1"
     -DmaxRecordSize="100000"
     -DminNumPer="0"
     -DminNumObj="2"
     -DtreeNum="3";
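If you prefer to submit the same job from Python instead of the SQL Script component, PyODPS provides the execute_xflow method for running PAI algorithms. The sketch below assumes a working PyODPS installation and valid account credentials; the parameter names mirror the -D options in the command above.

    from odps import ODPS

    # Assumption: fill in the credentials, project, and endpoint of your own
    # MaxCompute environment.
    o = ODPS("<access_id>", "<access_key>", "<your_project>",
             endpoint="<your_endpoint>")

    # execute_xflow submits the algorithm task and waits for it to finish.
    o.execute_xflow(
        "randomforests",
        "algo_public",
        parameters={
            "inputTableName": "pai_rf_test_input",
            "modelName": "pai_rf_test_model",
            "forceCategorical": "f1",
            "labelColName": "class",
            "featureColNames": "f0,f1",
            "maxRecordSize": "100000",
            "minNumPer": "0",
            "minNumObj": "2",
            "treeNum": "3",
        },
    )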

  • inputTableName (required): The name of the input table. No default value.

  • inputTablePartitions (optional): The partitions that are selected from the input table for training. Specify this parameter in one of the following formats:

      • partition_name=value

      • name1=value1/name2=value2: multi-level partitions

    Note: If you specify multiple partitions, separate them with commas (,). Default value: all partitions.

  • labelColName (required): The name of the label column that is selected from the input table. No default value.

  • modelName (required): The name of the output model. No default value.

  • treeNum (required): The number of trees in the forest. Valid values: 1 to 1000. Default value: 100.

  • excludedColNames (optional): The columns that are excluded from training. These columns cannot be used as feature columns. Default value: an empty string.

  • weightColName (optional): The name of the weight column in the input table. No default value.

  • featureColNames (optional): The feature columns that are selected from the input table for training. Default value: all columns except the label column specified by labelColName and the weight column specified by weightColName.

  • forceCategorical (optional): Columns are parsed based on the following rules:

      • Columns of the STRING, BOOLEAN, or DATETIME type are parsed as discrete (categorical) columns.

      • Columns of the DOUBLE or BIGINT type are parsed as continuous columns.

    Note: To parse a column of the BIGINT type as a categorical column, you must specify the column in the forceCategorical parameter. Default behavior: integer columns are treated as continuous.

  • algorithmTypes (optional): The allocation of single-tree algorithms across the forest. If the forest has N trees and algorithmTypes=[a,b], trees in [0,a) use the ID3 algorithm, trees in [a,b) use the CART algorithm, and trees in [b,N) use the C4.5 algorithm. For example, if the forest has five trees and the value is [2,4], trees 0 and 1 use ID3, trees 2 and 3 use CART, and tree 4 uses C4.5. If the value is None, the three algorithms are evenly allocated across the forest. Default value: evenly allocated.

  • randomColNum (optional): The number of random features that are selected for each split when a single tree is generated. Valid values: [1,N], where N is the number of features. Default value: log2(N).

  • minNumObj (optional): The minimum amount of data on a leaf node. The value must be a positive integer. Default value: 2.

  • minNumPer (optional): The minimum ratio of data on a leaf node to data on its parent node. Valid values: [0,1]. Default value: 0.0.

  • maxTreeDeep (optional): The maximum depth of a single tree. Valid values: [1,+∞). Default value: +∞.

  • maxRecordSize (optional): The number of random data inputs for a single tree. Valid values: (1000,1000000]. Default value: 100000.
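The ranges above are easy to violate when a command is assembled programmatically. The helper below is hypothetical and not part of PAI; it only restates the documented ranges and the log2(N) default of randomColNum.

    import math

    def check_rf_params(tree_num, min_num_obj=2, min_num_per=0.0,
                        max_record_size=100000, random_col_num=None,
                        feature_count=None):
        """Validate Random Forest parameters against the documented ranges."""
        assert 1 <= tree_num <= 1000, "treeNum must be in [1, 1000]"
        assert min_num_obj >= 1, "minNumObj must be a positive integer"
        assert 0.0 <= min_num_per <= 1.0, "minNumPer must be in [0, 1]"
        assert 1000 < max_record_size <= 1000000, \
            "maxRecordSize must be in (1000, 1000000]"
        if feature_count is not None:
            if random_col_num is None:
                # Documented default: log2(N), where N is the number of features.
                random_col_num = max(1, int(math.log2(feature_count)))
            assert 1 <= random_col_num <= feature_count, \
                "randomColNum must be in [1, N]"
        return random_col_num

    # Values from the example command in this topic: treeNum=3, minNumObj=2,
    # minNumPer=0, and two feature columns (f0, f1).
    check_rf_params(tree_num=3, feature_count=2)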

Example

  1. Execute the following SQL statements to generate training data:

    create table pai_rf_test_input as
    select * from
    (
      select 1 as f0,2 as f1, "good" as class from dual
      union all
      select 1 as f0,3 as f1, "good" as class from dual
      union all
      select 1 as f0,4 as f1, "bad" as class from dual
      union all
      select 0 as f0,3 as f1, "good" as class from dual
      union all
      select 0 as f0,4 as f1, "bad" as class from dual
    )tmp;
  2. Run the following PAI command to submit the parameters of the Random Forest component:

    PAI -name randomforests
         -project algo_public
         -DinputTableName="pai_rf_test_input"
         -DmodelName="pai_rf_test_model"
         -DforceCategorical="f1"
         -DlabelColName="class"
         -DfeatureColNames="f0,f1"
         -DmaxRecordSize="100000"
         -DminNumPer="0"
         -DminNumObj="2"
         -DtreeNum="3";
  3. View the Predictive Model Markup Language (PMML) of the model.

    <?xml version="1.0" encoding="utf-8"?>
    <PMML xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.2" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
      <Header copyright="Copyright (c) 2014, Alibaba Inc." description="">
        <Application name="ODPS/PMML" version="0.1.0"/>
        <Timestamp>Tue, 12 Jul 2016 07:04:48 GMT</Timestamp>
      </Header>
      <DataDictionary numberOfFields="2">
        <DataField name="f0" optype="continuous" dataType="integer"/>
        <DataField name="f1" optype="continuous" dataType="integer"/>
        <DataField name="class" optype="categorical" dataType="string">
          <Value value="bad"/>
          <Value value="good"/>
        </DataField>
      </DataDictionary>
      <MiningModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
        <MiningSchema>
          <MiningField name="f0" usageType="active"/>
          <MiningField name="f1" usageType="active"/>
          <MiningField name="class" usageType="target"/>
        </MiningSchema>
        <Segmentation multipleModelMethod="majorityVote">
          <Segment id="0">
            <True/>
            <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
              <MiningSchema>
                <MiningField name="f0" usageType="active"/>
                <MiningField name="f1" usageType="active"/>
                <MiningField name="class" usageType="target"/>
              </MiningSchema>
              <Node id="1">
                <True/>
                <ScoreDistribution value="bad" recordCount="2"/>
                <ScoreDistribution value="good" recordCount="3"/>
                <Node id="2" score="good">
                  <SimplePredicate field="f1" operator="equal" value="2"/>
                  <ScoreDistribution value="good" recordCount="1"/>
                </Node>
                <Node id="3" score="good">
                  <SimplePredicate field="f1" operator="equal" value="3"/>
                  <ScoreDistribution value="good" recordCount="2"/>
                </Node>
                <Node id="4" score="bad"
                  <SimplePredicate field="f1" operator="equal" value="4"/>
                  <ScoreDistribution value="bad" recordCount="2"/>
                </Node>
              </Node>
            </TreeModel>
          </Segment>
          <Segment id="1">
            <True/>
            <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
              <MiningSchema>
                <MiningField name="f0" usageType="active"/>
                <MiningField name="f1" usageType="active"/>
                <MiningField name="class" usageType="target"/>
              </MiningSchema>
              <Node id="1">
                <True/>
                <ScoreDistribution value="bad" recordCount="2"/>
                <ScoreDistribution value="good" recordCount="3"/>
                <Node id="2" score="good">
                  <SimpleSetPredicate field="f1" booleanOperator="isIn">
                    <Array n="2" type="integer">2 3</Array>
                  </SimpleSetPredicate>
                  <ScoreDistribution value="good" recordCount="3"/>
                </Node>
                <Node id="3" score="bad">
                  <SimpleSetPredicate field="f1" booleanOperator="isNotIn">
                    <Array n="2" type="integer">2 3</Array>
                  </SimpleSetPredicate>
                  <ScoreDistribution value="bad" recordCount="2"/>
                </Node>
              </Node>
            </TreeModel>
          </Segment>
          <Segment id="2">
            <True/>
            <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
              <MiningSchema>
                <MiningField name="f0" usageType="active"/>
                <MiningField name="f1" usageType="active"/>
                <MiningField name="class" usageType="target"/>
              </MiningSchema>
              <Node id="1">
                <True/>
                <ScoreDistribution value="bad" recordCount="2"/>
                <ScoreDistribution value="good" recordCount="3"/>
                <Node id="2" score="bad">
                  <SimplePredicate field="f0" operator="lessOrEqual" value="0.5"/>
                  <ScoreDistribution value="bad" recordCount="1"/>
                  <ScoreDistribution value="good" recordCount="1"/>
                </Node>
                <Node id="3" score="good">
                  <SimplePredicate field="f0" operator="greaterThan" value="0.5"/>
                  <ScoreDistribution value="bad" recordCount="1"/>
                  <ScoreDistribution value="good" recordCount="2"/>
                </Node>
              </Node>
            </TreeModel>
          </Segment>
        </Segmentation>
      </MiningModel>
    </PMML>
  4. View the visualized output of the Random Forest component.
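The PMML shown in step 3 is plain XML, so it can also be inspected with a standard XML library. The sketch below is not a full PMML evaluator; it assumes the PMML has been saved to a local file named pai_rf_test_model.xml (a hypothetical name) and uses Python's xml.etree.ElementTree to list the trees in the majorityVote segmentation.

    import xml.etree.ElementTree as ET

    # Assumption: the PMML from step 3 is stored in this local file.
    PMML_FILE = "pai_rf_test_model.xml"
    NS = {"pmml": "http://www.dmg.org/PMML-4_2"}

    root = ET.parse(PMML_FILE).getroot()

    # Each Segment under the majorityVote Segmentation holds one decision tree.
    segments = root.findall(".//pmml:Segmentation/pmml:Segment", NS)
    print("trees in the forest:", len(segments))  # 3 in the example above

    for segment in segments:
        nodes = segment.findall(".//pmml:Node", NS)
        leaf_scores = [n.get("score") for n in nodes if n.get("score") is not None]
        print("segment", segment.get("id"),
              "- nodes:", len(nodes), "- leaf scores:", leaf_scores)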