RDS SQLFlow syntax specification - - Alibaba Cloud Documentation Center

You can use RDS SQLFlow statements to efficiently perform model training, prediction, and evaluation. This topic describes the syntax specification of RDS SQLFlow in detail.

Introduction

You can use RDS SQLFlow statements to efficiently perform model training, prediction, and evaluation without writing complex Python code. RDS SQLFlow automatically parses these statements and converts them into equivalent Python programs, which significantly simplifies the model training process. The syntax of RDS SQLFlow covers various features, including the following:

Training syntax
Prediction syntax
Explanation syntax
Evaluation syntax
Optimization syntax

For example, you can use the iris dataset and train a TensorFlow DNNClassifier algorithm. In the dataset, the first four columns represent features that indicate the botanical shape of each iris flower, and the last column is the label that represents the subspecies of each iris flower.

sepal_length	sepal_width	petal_length	petal_width	class
6.4	2.8	5.6	2.2	2
5.0	2.3	3.3	1.0	1
...	...	...	...	...

With the following simple RDS SQLFlow statement, you can easily train a DNNClassifier model with two hidden layers, each containing ten hidden units. After training, the model will be saved to the sqlflow_models.my_dnn_model table for subsequent prediction. RDS SQLFlow automatically parses these statements and converts them into equivalent Python programs, which significantly simplifies the model training process.

SELECT * FROM iris.train
TO TRAIN DNNClassifier
WITH model.hidden_units = [10, 10], model.n_classes = 3, train.epoch= 10
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;

Training syntax

RDS SQLFlow training statements consist of SELECT, TRAIN, COLUMN, LABEL, and INTO clauses.

SELECT select_expr [, select_expr ...]
FROM table_references
  [WHERE where_condition]
  [LIMIT row_count]
TO TRAIN model_identifier
[WITH
  model_attr_expr [, model_attr_expr ...]
  [, train_attr_expr ...]]
COLUMN column_expr [, column_expr ...]
  | COLUMN column_expr [, column_expr ...] FOR column_name
    [COLUMN column_expr [, column_expr ...] FOR column_name ...]
[LABEL label_expr]
INTO table_references;

SELECT clause

The SELECT clause is used to retrieve data from a specified table.

Syntax:

SELECT select_expr [, select_expr ...]
FROM table_references
  [WHERE where_condition]
  [LIMIT row_count]

Parameter description:

Parameter	Description
select_expr	The data columns to retrieve. At least one column must be specified.
table_references	The table from which to retrieve data.
where_condition	An expression used to filter rows.
row_count	The maximum number of rows to retrieve.

Example:

For example, to quickly train a binary classification model on the sample dataset (iris.train):

SELECT * FROM iris.train
WHERE class = 0 OR class = 1
LIMIT 1000
TO TRAIN ...

TRAIN clause

The TRAIN clause is used to describe a specific model type and its training algorithm.

Syntax:

TO TRAIN model_identifier
WITH
  model_attr_expr [, model_attr_expr ...]
  [, train_attr_expr ...]

Parameter description:

Parameter	Description
model_identifier	The model type, such as DNNClassifier. For more models, see Models and algorithms.
model_attr_expr	Model attributes, such as `model.n_classes = 3`.
train_attr_expr	Training attributes, such as `train.epoch = 10`.

Example:

For example, to train a DNNClassifier model with two hidden layers, each containing ten hidden units, and with 10 training epochs:

SELECT * FROM iris.train
TO TRAIN DNNClassifier
WITH
  model.hidden_units = [10, 10],
  model.n_classes = 3,
  train.epoch = 10
...

COLUMN clause

The COLUMN clause is used to specify the field names of training features and their optional pre-processing methods.

Syntax:

COLUMN column_expr [, column_expr ...]
  | COLUMN column_expr [, column_expr ...] FOR column_name
    [COLUMN column_expr [, column_expr ...] FOR column_name ...]

Parameter description:

Parameter	Description
column_expr	The field name and pre-processing method for its content, such as `sepal_length,DENSE(dense, 3)`.
column_name	The feature column name for model input.

Example:

For example, to use the fields sepal_length, sepal_width, petal_length, and petal_width as features:

SELECT * FROM iris.train
TO TRAIN DNNClassifier WITH
  model.hidden_units = [10, 10],
  model.n_classes = 3,
  train.epoch = 10
COLUMN sepal_length, sepal_width, petal_length, petal_width
...

LABEL clause

The LABEL clause is used to indicate the field name of the training label and its optional pre-processing method.

Syntax:

LABEL label_expr

Parameter description:

Parameter	Description
label_expr	The field name and pre-processing method for its content, such as `class`.

Example:

SELECT * FROM iris.train
TO TRAIN DNNClassifier WITH
  model.hidden_units = [10, 10],
  model.n_classes = 3,
  train.epoch = 10
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
...

INTO clause

The INTO clause is used to specify the table name where the trained model will be saved.

Syntax:

INTO table_references

Parameter description:

Parameter	Description
table_references	The table name where the trained model will be saved.

Example:

SELECT * FROM iris.train
TO TRAIN DNNClassifier WITH
  model.hidden_units = [10, 10],
  model.n_classes = 3,
  train.epoch = 10
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;

Prediction syntax

RDS SQLFlow prediction statements consist of SELECT, PREDICT, and USING clauses.

SELECT select_expr [, select_expr ...]
FROM table_references
  [WHERE where_condition]
  [LIMIT row_count]
TO PREDICT result_table_reference
[WITH
  attr_expr [, attr_expr ...]]
USING model_table_reference;

Important

The data structure of the SELECT clause before TO PREDICT must be consistent with the data structure used in the TO TRAIN operation in the training syntax to ensure that the data structure used in the prediction and explanation processes matches the data structure used during the training phase.

PREDICT and USING clauses

PREDICT is used to describe the table where prediction results will be written, USING is used to describe the table from which the model is loaded, and WITH is used to describe the necessary configurations for prediction.

Syntax:

TO PREDICT result_table_reference
[WITH
  attr_expr [, attr_expr ...]]
USING model_table_reference;

Parameter description:

Parameter	Description
result_table_reference	The table that stores prediction results.
attr_expr	Configuration attributes, such as `predict.batch_size = 1`.
model_table_reference	The table from which the model is loaded for prediction.

Example:

For example, to use a model stored in the sqlflow.my_dnn_model table and save the prediction results to a column in the iris.predict table:

SELECT * FROM iris.train
TO PREDICT iris.predict.class
USING sqlflow.my_dnn_model;

Explanation syntax

RDS SQLFlow explanation statements consist of a series of SELECT, EXPLAIN, and USING statements.

SELECT select_expr [, select_expr ...]
FROM table_references
  [WHERE where_condition]
  [LIMIT row_count]
TO EXPLAIN model_table_reference
[WITH
  attr_expr [, attr_expr ...]]
[USING explainer];

Important

The data structure of the SELECT clause that precedes TO EXPLAIN must be consistent with the data structure used by the TO TRAIN operation in the training syntax.

EXPLAIN and USING clauses

The EXPLAIN clause is used to specify the table from which the model is loaded and the necessary configuration attributes. The USING clause is used to specify the explainer.

Syntax:

TO EXPLAIN model_table_reference
[WITH
  attr_expr [, attr_expr ...]]
USING explainer;

Parameter description:

Parameter	Description
model_table_reference	The table from which the model is loaded for prediction.
attr_expr	Configuration attributes, such as `summary.plot_type="bar"`.
explainer	The type of explainer, such as `TreeExplainer`.

Example:

For example, to use TreeExplainer to explain a model stored in sqlflow_models.my_xgb_regression_model and plot the analysis results in sorted order:

SELECT * FROM boston.train
TO EXPLAIN sqlflow_models.my_xgb_regression_model
WITH
    summary.sort=True
USING TreeExplainer;

Evaluation syntax

RDS SQLFlow evaluation statements consist of SELECT, EVALUATE, and INTO statements.

SELECT select_expr [, select_expr ...]
FROM table_references
  [WHERE where_condition]
  [LIMIT row_count]
TO EVALUATE model_table_reference
[WITH
  attr_expr [, attr_expr ...]]
LABEL class
INTO evaluate_result_table;

Important

The data structure of the SELECT clause before TO EVALUATE must be consistent with the data structure used in the Training syntax TO TRAIN operation.
According to machine learning theory, it is inappropriate to use the training dataset for evaluation, as this might result in an overly idealized evaluation result.

EVALUATE and INTO clauses

The TO EVALUATE clause is used to specify the model to be evaluated and which metrics will be used for evaluation. The INTO clause is used to specify the table where evaluation results will be saved.

Syntax:

TO EVALUATE model_table_reference
[WITH
  attr_expr [, attr_expr ...]]
INTO evaluate_result_table;

Parameter description:

Parameter	Description
model_table_reference	The model to be evaluated, whose evaluation results represent the model's performance when predicting new data.
attr_expr	Attributes for evaluation. You can specify which metrics will be output to the result table by setting `validation.metrics`, such as `validation.metrics="Accuracy,AUC"`.
evaluate_result_table	The table that stores evaluation results.

Example:

SELECT * FROM iris.test 
TO EVALUATE sqlflow_models.my_dnn_model 
WITH validation.metrics = Accuracy 
LABEL class  
INTO sqlflow_models.evaluate_result_table;

Optimization syntax

RDS SQLFlow uses TO MAXIMIZE and TO MINIMIZE clauses to describe and solve mathematical programming problems.

Syntax:

SELECT select_expr [, select_expr ...]
FROM table_references
  [WHERE where_condition]
  [LIMIT row_count]
TO MAXIMIZE|MINIMIZE
  objective_expr
CONSTRAINT
  constraint_expr [GROUP BY column_name]
  [, constraint_expr [GROUP BY column_name]...]
WITH
  variables="variable_value(column_name, ...)"
  var_type="Integers|Reals|Binary|NonNegativeIntegers|..."
[USING solver_name]
INTO result_table_name;

Example:

SELECT c1, c2, c3 FROM my_db.my_table
TO MAXIMIZE SUM(x * c1)
CONSTRAINT
    SUM(x) <= c2 GROUP BY c1
    x <= 3
WITH
    variables="x(c3)",
    var_type="NonNegativeIntegers"
USING glpk
INTO my_db.my_result_table;

Parameter description:

Parameter	Description
TO MAXIMIZE	`objective_expr` is `SUM(x * c1)`, indicating that the value of `SUM(x * c1)` should be maximized.
CONSTRAINT	`SUM(x) <= c2 GROUP BY c1`: `GROUP BY` means that for each different `c1`, there will be a constraint expression `SUM(x) <= c2`. `x <= 3`: For each `x`, the condition `x <= 3` must be satisfied.
WITH	`variables=x(c3)`: Indicates that the optimization target column is `c3`; `x` represents the variable value to be solved. `var_type="NonNegativeIntegers"`: Indicates that the variable `x` is a non-negative integer. The supported data types for variables are as follows: `Binary`: The variable value is 0 or 1. `Integers`: The variable value is an integer. `PositiveIntegers`, `NegativeIntegers`: The variable value is a positive or negative integer. `NonPositiveIntegers`, `NonNegativeIntegers`: The variable value is a non-positive or non-negative integer. `Reals`: The variable value is a real number. `PositiveReals`, `NegativeReals`: The variable value is a positive or negative real number. `NonPositiveReals`, `NonNegativeReals`: The variable value is a non-positive or non-negative real number.
USING	`glpk`: Indicates using glpk to solve this problem. RDS SQLFlow supports: GLPK: Used for solving linear programming problems. BARON: Used for solving non-linear programming problems.
INTO	`my_db.my_result_table`: The table that stores the solution results.

References

Introduction to RDS Custom