This topic introduces the concept of feature binning and describes the syntax used for feature binning in LindormTSDB.
Introduction
Feature binning, which is also known as called discrete binning or discrete bucketing, is a technique used to preprocess data. In feature engineering,feature binning is a common method used to discretize continuous data into multiple bins that contains a small amount of data.
Limits
- Feature binning must be used together with
SAMPLE BY 0
clause, which indicates that feature binning is performed on the Field column of all time series. - Feature binning cannot be used together with other functions such as RATE, DELTA, MAX, and MIN.
Feature binning policies
The following table describes the feature binning policies supported by LindormTSDB.
Policy | Description |
---|---|
Equal width binning |
|
Equal frequency binning | Data is equally divided into multiple bins based on the distribution of data. If you use this policy to divide data, the number of data entries in each bin is about the same. |
Parameters
You can configure custom parameters by specifying options to specify the feature binning policy and how the results of feature binning are output. Take note of the following items when you configure custom parameters
- The parameter name is case-insensitive.
- The values of custom parameters can be digits and strings and cannot be NULL.
- The values of custom parameters must be within the specific ranges.
Parameter | Type | Description |
---|---|---|
n_bins | INTEGER | The number of bins to which you want to divide the original data. The value of this parameter must be a positive integer. Valid values: (1, 1000) . Default value: 5. |
output_type | VARCHAR | The output method of the feature binning results. Default value: interval. Valid values:
|
strategy | VARCHAR | The feature binning policy. Default value: quantile. Valid values:
|
Syntax
select_sample_by_statement ::= SELECT ( select_clause )
FROM table_name
WHERE where_clause
SAMPLE BY 0
select_clause ::= selector [ AS identifier ] ( ',' selector [ AS identifier ] )
selector ::= tag_name | time | bins '(' field_name [ ',' options] ')'
where_clause ::= relation ( AND relation )* (OR relation)*
relation ::= ( field_name | tag_name ) operator term
operator ::= '=' | '<' | '>' | '<=' | '>=' | '!=' | IN | CONTAINS | CONTAINS KEY
Note For more information about the function you can use to perform feature binning, see Feature binning functions.
Examples
- Perform feature binning on the temperature data within the specified time range in a time series table named sensor.
The following output is returned:SELECT device_id, region, time, bins(temperature) AS temperature_bin FROM sensor WHERE time >= '2021-04-22 00:00:00' AND time < '2022-04-23 00:01:00' sample by 0;
+-----------+----------+---------------------------+-----------------+ | device_id | region | time | temperature_bin | +-----------+----------+---------------------------+-----------------+ | F07A1260 | north-cn | 2021-04-22T15:33:00+08:00 | [10.60,12.10) | | F07A1260 | north-cn | 2021-04-22T15:33:10+08:00 | [12.10,13.20] | | F07A1260 | north-cn | 2021-04-22T15:33:20+08:00 | [10.60,10.60) | | F07A1261 | south-cn | 2021-04-22T15:51:20+08:00 | [10.60,12.10) | | F07A1261 | south-cn | 2021-04-22T15:51:21+08:00 | [12.10,13.20) | | F07A1261 | south-cn | 2021-04-22T15:51:27+08:00 | [10.60,10.60) | | F07A1261 | south-cn | 2021-04-22T15:51:47+08:00 | [13.20,20.60] | +-----------+----------+---------------------------+-----------------+
- Perform feature binning on the temperature data of a device whose ID is F07A1260 within the specified time range in a time series table named sensor.
The following output is returned:SELECT device_id, region, time, bins(temperature) AS temperature_bin FROM sensor WHERE device_id in ('F07A1260') AND time >= '2021-04-22 00:00:00' AND time < '2022-04-23 00:01:00' sample by 0;
+-----------+----------+---------------------------+-----------------+ | device_id | region | time | temperature_bin | +-----------+----------+---------------------------+-----------------+ | F07A1260 | north-cn | 2021-04-22T15:33:00+08:00 | [10.60,12.10) | | F07A1260 | north-cn | 2021-04-22T15:33:10+08:00 | [12.10,13.20] | | F07A1260 | north-cn | 2021-04-22T15:33:20+08:00 | [10.60,10.60) | +-----------+----------+---------------------------+-----------------+
- Perform feature binning on the temperature data of a device whose ID is F07A1260 within the specified time range in a time series table named sensor. Specify that the temperature data is divided into two bins by using the equal width binning policy, and the ordinary numbers of the bins are returned.
The following output is returned:SELECT device_id, region, time, bins(temperature, 'n_bins=2, output_type=ordinal, strategy=uniform') AS temperature_bin FROM sensor WHERE device_id in ('F07A1260') AND time >= '2021-04-22 00:00:00' AND time < '2022-04-23 00:01:00' sample by 0;
+-----------+----------+---------------------------+-----------------+ | device_id | region | time | temperature_bin | +-----------+----------+---------------------------+-----------------+ | F07A1260 | north-cn | 2021-04-22T15:33:00+08:00 | 1 | | F07A1260 | north-cn | 2021-04-22T15:33:10+08:00 | 1 | | F07A1260 | north-cn | 2021-04-22T15:33:20+08:00 | 0 | +-----------+----------+---------------------------+-----------------+