This topic describes the how to use Lindorm machine learning (ML) to perform machine learning in databases, including model training, model inference, and model management.

Overview

Lindorm ML is an out-of-the-box in-database machine learning service provided by LindormTSDB. The process of Lindorm ML includes the following steps: model training, model inference, and model management. After you enable Lindorm ML, you can execute SQL statements to perform the full process of machine learning within databases to explore more data value. Lindorm ML has the following benefits:

  • Easy-to-use: You can execute standard SQL statements to perform the full process of machine learning within databases even if you do not have professional knowledge for machine learning.
  • Data retention: You do not need to export data to external platforms for machine learning. This improves the efficiency of machine learning and helps your data meet regulation requirements.
  • Enterprise features: In Lindorm ML, models and data are both stored in databases. Therefore, you can use the enterprise features provided for databases in machine learning, such as permission management, data audit, and data encryption.

Enable Lindorm ML

For more information about how to enable Lindorm ML, see Enable Lindorm ML.

Model training

When you train a model in Lindorm ML, you must specify the type of tasks that need to be executed by the model. Lindorm ML provides different algorithms for different types of tasks. You can use the extended CREATE MODEL statement in your database to train a model.

Syntax

CREATE MODEL model_name
FROM { table_name | (select_statement) }
[ TARGET column_name ]
TASK ( TIME_SERIES_FORECAST | TIME_SERIES_ANOMALY_DETECTION )
ALGORITHM ( DEEPAR | TFT | esd | nsigma | ttest )
[ PREPROCESSORS 'string' ]
SETTINGS (
    EPOCHS integer,
    ...
)

Parameters

  • model_name: the name of the model. The model name is unique within a schema.
  • FROM { table_name | (select_statement) }: the name of table or the query statement that contains the data used to train the model. The results of the statement must contain at least two columns, and one of the columns must be the time column.
  • TARGET column_name: the name of the column that is used to specify the target of time series forecasting or time series anomaly detection.
  • TASK: the type of tasks that need to be executed by the model. The following table describes the task types supported by Lindorm ML.
    Task typeKeyword
    Time series forecastingTIME_SERIES_FORECAST
    Time series anomaly detectionTIME_SERIES_ANOMALY_DETECTION
  • ALGORITHM: the algorithm used by the model. The following table describes the algorithms supported by Lindorm ML.
    Task typeAlgorithmDescription
    Time series forecastingDeepARThe DeepAR algorithm is a deep neural network algorithm based on recurrent neural network (RNN). For more information, see related papers.
    TFTThe Temporal Fusion Transformer (TFT) algorithm is a deep neural network algorithm based on the Transformer mechanism. For more information, see related papers.
    Time series anomaly detectionesdThis algorithm is developed by Alibaba DAMO Academy and is applicable to spiked anomalies, such as spikes in monitoring curves and scenarios in which a small number of data points are significantly different from other data points. For more information, see Time series anomaly detection.
    nsigmaThis algorithm is developed by Alibaba DAMO Academy. It is simple and easy to analyze the causes of anomalies. For more information, see Time series anomaly detection.
    ttestThis algorithm is developed by Alibaba DAMO Academy. It is used to identify whether the metrics related to time series data are abnormal because of a change in the average value. For more information, see Time series anomaly detection.
  • PREPROCESSORS 'string': the preprocessing operations for specific columns. This parameter is optional. In general, the preprocessing operations are specified by JSON strings.

    The value of the PREPROCESSORS parameter contains two components: Columns that indicates the columns that need to be preprocessed and Transformers that indicates the preprocessing operations to be performed. Operations specified by Transformers are performed in the specified order. Each Transformers component contains two fields: Name that specifies the name of the preprocessing operation and Parameters that specifies parameters related to the preprocessing operation. The following example shows a sample value of the PREPROCESSORS parameter:

    PREPROCESSORS '[
    {
      "Columns":[
        "c1"
      ],
      "Transformers":[
        {
          "Name": "Imputer",
          "Parameters": {"value": 0}
        },
        {
          "Name": "StandardScaler"
        }
      ]
    },
    {
      "Columns":[
        "c2",
        "c3"
      ],
      "Transformers":[
        {
          "Name": "OrdinalEncoder"
        }
      ]
    }
    ]'
    Note In the value of the PREPROCESSORS parameter, the Columns component and the Parameters field are both optional.

    The preprocessing operations specified during model training are automatically performed in model inference. The following table describes the preprocessing operations supported by Lindorm ML.

    Preprocessing operationParameterDescription
    OneHotEncoderNoneEncode categorical features into binary values. This operation is applicable to categorical features that are not comparable in size.
    OrdinalEncoderNoneEncode categorical features into integers from 0. This operation is applicable to categorical features that can be compared in size.
    Imputer
    • method: a string that indicates the interpolation method. Valid values: dummy, mean, median, most_frequent, roll7, and last. Default value: dummy.
    • value: an integer that needs to be interpolated. Default value: 0. This parameter is optional.
    Interpolate missing values. You can select multiple interpolation policies.
    StandardScalerNoneConvert data into values that follow a standard normal distribution with a mean of 0 and a standard deviation of 1. This operation is also known as the z-score normalization.
    MinMaxScaler
    • min: an integer that indicates the minimum value of the range.
    • max: an integer that indicates the maximum value of the range. This parameter is optional.
    Scale the data into a range (min,max). By default, the value range is (0,1).
    LogTransformerNoneConverts values into their logarithms.
  • SETTINGS: other parameters that you can configure. You can specify different parameters for different task types. The following table describes the parameters supported by Lindorm ML.

    Parameters related to time series forecasting

    ParameterTypeDescriptionRequired
    epochsINTEGERThe number of epochs for model training. This parameter is applicable only to time series forecasting. Default value: 80 Yes
    time_columnVARCHARThe time column. Yes
    group_columnsVARCHARThe group column. This column is used as the TAG column to determine the time series. Yes
    freqVARCHARThe frequency of the time series data. Example: 1D. Yes
    prediction_lengthINTEGERThe step size of time series forecasting. Yes
    feat_static_columnsVARCHARThe set of static feature columns. Separate multiple columns with commas (,). No

    Parameters related to time series anomaly detection.

    For more information about the training parameters supported by time series anomaly detection, see Time series anomaly detection.

Examples

CREATE MODEL tft_model
FROM (SELECT * FROM fresh_sales WHERE `time` > '2021-02-08T00:00:00+08:00')
TARGET sales
TASK time_series_forecast
ALGORITHM tft
SETTINGS
(
  time_column 'time',
  group_columns 'id_code',
  feat_static_columns 'cate1_id,cate2_id,brand_id',
  context_length '28',
  prediction_length '6',
  epochs '5',
  freq '1D'
);

Model management

After the CREATE MODEL statement is executed, you can execute SQL statements to check whether the model is in the Ready state.

View the information about all models in the database

You can use the SHOW MODELS statement to view the basic information about all models in the database.

The following statement provides an example on how to view the information about all models in a database:

SHOW MODELS;

The following information is returned:

+-------------------+--------+--------------------+-------------------------------+-------------------------------+
|       name        | status |    sql_function    |         created_time          |          update_time          |
+-------------------+--------+--------------------+-------------------------------+-------------------------------+
| tft_model         | Ready  | forecast           | 2022-11-04T11:38:05.873+08:00 | 2022-11-04T11:39:14.046+08:00 |
+-------------------+--------+--------------------+-------------------------------+-------------------------------+
1 rows in set (524 ms)

The following table describes the returned values.

Returned valueDescription
nameThe name of the model.
statusThe state of the model. Valid values: Init, Training, Ready, and Failed. The value Init indicates that the model is being initialized. The value Training indicates that the model is being trained. The value Ready indicates that the model is trained. The value Failed indicates that the model failed to be trained.
sql_functionThe inference function.
created_timeThe time when the model is created.
update_timeThe time when the model is last updated.

View the information about a specified model

You can use the SHOW MODEL model_name statement to view the detailed information about a specified model.

The following statement provides an example on how to view the detailed information about a specified model:

SHOW MODEL tft_model;

The following information is returned:

+-----------+--------+--------------+----------------------+-----------+------------------------------------+---------------+-------------------------------------------------+--------------------------------+-------------------------------+-------------------------------+
|   name    | status | sql_function |     task_type        | algorithm |               query                | preprocessors |                    settings                     |            metrics             |         created_time          |          update_time          |
+-----------+--------+--------------+----------------------+-----------+------------------------------------+---------------+-------------------------------------------------+--------------------------------+-------------------------------+-------------------------------+
| tft_model | Ready  | forecast     | TIME_SERIES_FORECAST | TFT       | SELECT `time`, FIRST(`sales`) AS   | []            | {time_column=time, group_columns=id_code,       | {MAPE=0.35002756118774414,     | 2022-11-04T11:38:05.873+08:00 | 2022-11-04T11:39:14.046+08:00 |
|           |        |              |                      |           | `sales`, `id_code`, `cate1_id`,    |               | feat_static_columns=cate1_id,cate2_id,brand_id, | MASE=0.41281554008773325,      |                               |                               |
|           |        |              |                      |           | `cate2_id`, `brand_id` FROM        |               | context_length=28, prediction_length=6,         | MSE=456.3769938151042}         |                               |                               |
|           |        |              |                      |           | `fresh_sales` WHERE `time` >       |               | epochs=5, freq=1D,train_mode=LOCAL,             |                                |                               |                               |
|           |        |              |                      |           | '2021-02-08T00:00:00+08:00'        |               | past_length=28,                      |                                |                               |                               |
|           |        |              |                      |           | sample by 1D fill zero             |               | forecast_start=2022-07-31 08:00:00}             |                                |                               |                               |
+-----------+--------+--------------+----------------------+-----------+------------------------------------+---------------+-------------------------------------------------+--------------------------------+-------------------------------+-------------------------------+
1 rows in set (334 ms)

The following table describes the returned values.

Returned valueDescription
task_typeThe type of tasks that need to be executed by the model.
algorithmThe name of the algorithm used by the model.
queryThe query statement used to train the model.
preprocessorsThe syntax for preprocessing operations.
settingsThe parameters related to the model.
metricsThe metrics for the model.

Deletes a specified model

You can use the DROP MODEL model_name statement to delete a specified model.

The following statement provides an example on how to delete a specified model:

DROP MODEL tft_model;

The following information is returned:

No rows affected (0.397 seconds)

Model inference

If the model is in the Ready state, you can use system functions to perform model inference. You can use different functions for different types of tasks.

Syntax

SELECT function_name(field_name, model_name, params) FROM table_name [WHERE clause] SAMPLE BY 0;
function_name: the name of the system function. Valid values:
  • FORECAST: the function used for time series forecasting. For more information about how to configure the field_name, model_name, and params parameters of the FORECAST function, see Time series forecasting function.
  • ANOMALY_DETECT: the function used for time series anomaly detection. For more information about how to configure the field_name, model_name, and params parameters of the ANOMALY_DETECT function, see Time series anomaly detection function.

Examples

SELECT device_id, region, `time`, raw(temperature) as temperature, anomaly_detect(temperature, ad_model) as detect_result from sensor WHERE time >= '2022-01-01 00:00:00' and time < '2022-01-01 00:01:00' SAMPLE BY 0;