This topic introduces the concept of feature binning and describes the syntax used for feature binning in LindormTSDB.

Introduction

Feature binning, which is also known as called discrete binning or discrete bucketing, is a technique used to preprocess data. In feature engineering,feature binning is a common method used to discretize continuous data into multiple bins that contains a small amount of data.

Limits

  • Feature binning must be used together with SAMPLE BY 0 clause, which indicates that feature binning is performed on the Field column of all time series.
  • Feature binning cannot be used together with other functions such as RATE, DELTA, MAX, and MIN.

Feature binning policies

The following table describes the feature binning policies supported by LindormTSDB.
PolicyDescription
Equal width binning
  • The original data is divided into bins with the same width. The number of data entries in each bin may be different. For example, if you use equal width binning to divide data within a range from 0 to 100, data is divided into two bins with the same width: [0,50) and [50,100].
  • Data is divided into bins based on only the width and boundary of the bins.
  • Data divided by using this policy is prone to be affected by abnormal data.
Equal frequency binningData is equally divided into multiple bins based on the distribution of data. If you use this policy to divide data, the number of data entries in each bin is about the same.

Parameters

You can configure custom parameters by specifying options to specify the feature binning policy and how the results of feature binning are output. Take note of the following items when you configure custom parameters
  • The parameter name is case-insensitive.
  • The values of custom parameters can be digits and strings and cannot be NULL.
  • The values of custom parameters must be within the specific ranges.
ParameterTypeDescription
n_binsINTEGERThe number of bins to which you want to divide the original data. The value of this parameter must be a positive integer. Valid values: (1, 1000). Default value: 5.
output_typeVARCHARThe output method of the feature binning results. Default value: interval. Valid values:
  • interval: The intervals of the bins are returned.
  • ordinal: Ordinal encoding is performed on the bins and the ordinal numbers of the bins are returned
strategyVARCHARThe feature binning policy. Default value: quantile. Valid values:
  • quantile: The equal frequency binning policy is used.
  • uniform: The equal width binning policy is used.

Syntax

select_sample_by_statement ::=  SELECT ( select_clause )
                                FROM table_name
                                WHERE where_clause
                                SAMPLE BY 0
select_clause              ::=  selector [ AS identifier ] ( ',' selector [ AS identifier ] )
selector                   ::=  tag_name | time | bins '(' field_name [ ',' options] ')'
where_clause               ::=  relation ( AND relation )* (OR relation)*
relation                   ::=  ( field_name | tag_name ) operator term
operator                   ::=  '=' | '<' | '>' | '<=' | '>=' | '!=' | IN | CONTAINS | CONTAINS KEY
Note For more information about the function you can use to perform feature binning, see Feature binning functions.

Examples

  • Perform feature binning on the temperature data within the specified time range in a time series table named sensor.
    SELECT device_id, region, time, bins(temperature) AS temperature_bin FROM sensor WHERE time >= '2021-04-22 00:00:00' AND time < '2022-04-23 00:01:00' sample by 0;
    The following output is returned:
    +-----------+----------+---------------------------+-----------------+
    | device_id |  region  |           time            | temperature_bin |
    +-----------+----------+---------------------------+-----------------+
    | F07A1260  | north-cn | 2021-04-22T15:33:00+08:00 | [10.60,12.10)   |
    | F07A1260  | north-cn | 2021-04-22T15:33:10+08:00 | [12.10,13.20]   |
    | F07A1260  | north-cn | 2021-04-22T15:33:20+08:00 | [10.60,10.60)   |
    | F07A1261  | south-cn | 2021-04-22T15:51:20+08:00 | [10.60,12.10)   |
    | F07A1261  | south-cn | 2021-04-22T15:51:21+08:00 | [12.10,13.20)   |
    | F07A1261  | south-cn | 2021-04-22T15:51:27+08:00 | [10.60,10.60)   |
    | F07A1261  | south-cn | 2021-04-22T15:51:47+08:00 | [13.20,20.60]   |
    +-----------+----------+---------------------------+-----------------+
  • Perform feature binning on the temperature data of a device whose ID is F07A1260 within the specified time range in a time series table named sensor.
    SELECT device_id, region, time, bins(temperature) AS temperature_bin FROM sensor WHERE device_id in ('F07A1260') AND time >= '2021-04-22 00:00:00' AND time < '2022-04-23 00:01:00' sample by 0;
    The following output is returned:
    +-----------+----------+---------------------------+-----------------+
    | device_id |  region  |           time            | temperature_bin |
    +-----------+----------+---------------------------+-----------------+
    | F07A1260  | north-cn | 2021-04-22T15:33:00+08:00 | [10.60,12.10)   |
    | F07A1260  | north-cn | 2021-04-22T15:33:10+08:00 | [12.10,13.20]   |
    | F07A1260  | north-cn | 2021-04-22T15:33:20+08:00 | [10.60,10.60)   |
    +-----------+----------+---------------------------+-----------------+
  • Perform feature binning on the temperature data of a device whose ID is F07A1260 within the specified time range in a time series table named sensor. Specify that the temperature data is divided into two bins by using the equal width binning policy, and the ordinary numbers of the bins are returned.
    SELECT device_id, region, time, bins(temperature, 'n_bins=2, output_type=ordinal, strategy=uniform') AS temperature_bin FROM sensor WHERE device_id in ('F07A1260') AND time >= '2021-04-22 00:00:00' AND time < '2022-04-23 00:01:00' sample by 0;
    The following output is returned:
    +-----------+----------+---------------------------+-----------------+
    | device_id |  region  |           time            | temperature_bin |
    +-----------+----------+---------------------------+-----------------+
    | F07A1260  | north-cn | 2021-04-22T15:33:00+08:00 | 1               |
    | F07A1260  | north-cn | 2021-04-22T15:33:10+08:00 | 1               |
    | F07A1260  | north-cn | 2021-04-22T15:33:20+08:00 | 0               |
    +-----------+----------+---------------------------+-----------------+