This topic describes the Box Plot component provided by Machine Learning Designer.

A box plot chart shows the distribution features of a set of raw data. It can also be used to compare the distribution features of multiple sets of data.
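As a rough illustration of what a box plot chart summarizes, the following Python sketch computes the five-number summary (minimum, first quartile, median, third quartile, and maximum) for the first nine age values of the sample input used later in this topic. The function name `five_number_summary` is a hypothetical helper, and the quartile method shown here is an assumption; the Box Plot component may calculate its percentiles differently.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of at least two numbers."""
    data = sorted(values)
    # quantiles(..., n=4) returns the three quartile cut points.
    # method="inclusive" treats the data as the whole population (an assumption).
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# The first nine age values from the sample input table.
ages = [50, 53, 28, 39, 55, 30, 37, 39, 36]
summary = five_number_summary(ages)
print(summary)
```

The box of the chart spans the first to the third quartile, and the line inside the box marks the median.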

Limits

The visualized report of this component is available only in Machine Learning Studio.

Configure the component

You can configure the component by using one of the following methods:

Method 1: Using the Machine Learning Platform for AI console

Configure the component parameters on the pipeline configuration page of Machine Learning Designer.
| Tab | Parameter | Description |
| --- | --- | --- |
| Field Setting | Continuous Features | The column that represents the continuous feature. |
| | Enumeration Feature | The column that represents the enumeration feature. Note: Machine Learning Studio allows you to select only one field, whereas Machine Learning Designer allows you to select multiple fields. |
| | Stratified Samples | The number of stratified samples to draw. |

Method 2: Using Machine Learning Platform for AI (PAI) commands

Configure the component parameters by using a PAI command. You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following sample command is followed by a table that describes the parameters of the PAI command.
PAI -name box_plot -project algo_public
    -DinputTable="boxplot"
    -DcontinueCols="age"
    -DcategoryCol="y"
    -DoutputTable="pai_temp_6075_97181_1"
    -DsampleSize="1000"
    -Dlifecycle="7";
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTable | Yes | The name of the input table. | N/A |
| inputTablePartitions | No | The partitions that are selected from the input table for training. The following formats are supported: Partition_name=value; name1=value1/name2=value2 (multi-level partitions). Note: If you specify multiple partitions, separate them with commas (,). | N/A |
| outputTable | Yes | The name of the output table that stores the box plot chart and samples. | N/A |
| continueCols | Yes | The column that represents the continuous feature. | N/A |
| categoryCol | Yes | The column that represents the enumeration feature. | N/A |
| sampleSize | No | The number of samples used to draw the disturbance points of each feature. | 1000 |
| lifecycle | No | The lifecycle of the output table. Unit: days. | 28 |
| coreNum | No | The number of cores that are used in computing. The value must be a positive integer. | Automatically allocated |
| memSizePerCore | No | The memory size of each core. Valid values: 1 to 65536. Unit: MB. | Automatically allocated |
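If you generate PAI commands from a script, a small helper can check that the required parameters are present before the command is submitted. The following Python sketch is one possible approach. The function `build_box_plot_command` is a hypothetical helper and not part of any PAI SDK. It only renders the command text, which you would then run in the SQL Script component.

```python
def build_box_plot_command(params):
    """Render a `PAI -name box_plot` command from a dict of parameters."""
    # Parameters marked as required in the parameter table.
    required = ("inputTable", "outputTable", "continueCols", "categoryCol")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    lines = ["PAI -name box_plot -project algo_public"]
    for name, value in params.items():
        # Each parameter is passed as -D<name>="<value>".
        lines.append(f'    -D{name}="{value}"')
    return "\n".join(lines) + ";"


command = build_box_plot_command({
    "inputTable": "boxplot",
    "continueCols": "age",
    "categoryCol": "y",
    "outputTable": "pai_temp_6075_97181_1",
    "sampleSize": 1000,
    "lifecycle": 7,
})
print(command)
```

Optional parameters that are left out of the dict, such as coreNum and memSizePerCore, are omitted from the command, so the component uses their default values.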

Examples

  • Input data
    create table boxplot as select age, y from bank_data limit 100;
    age y
    50 0
    53 0
    28 1
    39 0
    55 1
    30 0
    37 0
    39 0
    36 1
    27 0
    34 0
    41 0
    55 1
    33 0
    26 0
    52 0
    35 1
    27 1
    28 0
    26 0
    41 0
    35 0
    40 0
    32 0
    41 0
    34 0
    49 0
    37 0
    35 0
    38 0
    47 0
    46 0
    27 0
    29 1
    32 0
    36 0
    29 0
    47 0
    44 0
    54 0
    36 0
    42 0
    44 0
    72 1
    48 0
    36 0
    35 0
    43 0
    56 0
    42 0
    31 0
    32 0
    33 0
    31 0
    39 0
    30 1
    24 0
    24 0
    38 0
    26 0
    41 0
    34 0
    30 0
    37 0
    68 0
    31 0
    48 0
    33 0
    59 0
    44 0
    28 0
    50 0
    33 0
    45 0
    40 0
    45 0
    43 0
    54 0
    53 0
    35 0
    30 0
    25 0
    35 0
    54 1
    30 0
    38 0
    35 0
    47 0
    32 0
    27 0
    40 1
    31 0
    42 0
    40 0
    31 0
    57 0
    38 1
    39 0
    37 0
    44 0
  • Parameter settings

    Specify the age column as the continuous feature column, and the y column as the enumeration feature column. Retain the default values of other parameters.

  • Output
    • Output description
      Right-click Box Plot and choose View Data > Output Port to view the output. The output contains the following parameters:
      • percent_points: the calculated percentiles.
      • percent_count: the number of data entries in each interval. The intervals are delimited by the percentiles.
      • sample_list: the samples that are selected from each stratum based on the sampling rate. The sampling rate is calculated by using the following formula: Sampling rate = Number of stratified samples/Total number of data entries. If the sampling rate is so low that the number of data entries in a stratum multiplied by the sampling rate is less than 10, the sampling rate is recalculated.
    • The following figure shows a box plot chart. [Figure: Box plot chart]
    • The following figure shows the distribution of disturbance points. [Figure: Distribution of disturbance points]
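The sampling rate formula in the sample_list description can be sketched in Python as follows. The function name `sampling_plan`, the `min_per_stratum` argument, and the stratum sizes are illustrative assumptions. This topic does not describe how the component recalculates the sampling rate, so the sketch only reports the strata for which a recalculation would be triggered.

```python
def sampling_plan(stratum_counts, sample_size, min_per_stratum=10):
    """Return the sampling rate and the strata whose expected sample count is too low.

    Sampling rate = number of stratified samples / total number of data entries.
    """
    total = sum(stratum_counts.values())
    if total == 0:
        raise ValueError("the input contains no data entries")
    rate = sample_size / total
    # A stratum is flagged when its entry count multiplied by the rate is
    # below the threshold. The recalculation itself is not modeled here.
    too_small = [key for key, count in stratum_counts.items()
                 if count * rate < min_per_stratum]
    return rate, too_small


# Hypothetical stratum sizes, not taken from the sample input table.
rate, too_small = sampling_plan({"0": 9000, "1": 950, "2": 50}, sample_size=1000)
print(rate, too_small)
```

In this hypothetical case, the stratum "2" would contribute only about 5 samples at the initial rate, which is below the threshold of 10.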