RDS SQLFlow models and algorithms

The RDS SQLFlow model library provides various models suitable for different scenarios. You can select an appropriate model based on the type of problem you need to solve. This topic describes the different models and their algorithms.

Model overview

Classification models

Classification models determine which category new data belongs to, based on a dataset that contains multiple categories. They answer binary (yes or no) questions and multi-class (A, B, or C) questions. For more information, see Classification models.

Scenarios

Image classification: tasks such as handwritten digit recognition and object recognition
Text classification: tasks such as sentiment analysis and spam email identification
Recommendation system: prediction of the purchase intentions of users in the e-commerce field
Medical diagnosis: analysis of patient information to predict disease risks

Regression models

Regression models are suitable for data that has quantifiable relationships. A regression model estimates the relationship between input variables (features) and output variables (results) and uses the estimated relationship to make predictions for new data. For more information, see Regression models.

Scenarios

Economics: prediction of economic indicators such as housing prices and stock prices. In this scenario, linear regression helps efficiently understand the relationships between different variables.
Medicine: prediction of health outcomes such as blood glucose levels based on features such as lifestyle and genetic information.
Marketing: analysis of the relationship between advertising expenditures and sales revenue to optimize marketing strategies.
Engineering: prediction of product performance based on various design parameters.
Social science: analysis of social phenomena such as the impact of education levels on income.

Clustering models

Clustering models are an unsupervised machine learning (ML) method that groups the samples in a dataset based on feature similarities. In the results, samples in the same group are similar to each other, and samples in different groups are significantly different. Clustering models are widely used in data analytics and pattern recognition because they can efficiently identify potential patterns in unlabeled data. For more information, see Clustering models.

Scenarios

Customer classification: Vendors want to use clustering analysis to conduct in-depth research on the behavior patterns and preferences of their users to implement precision marketing strategies.
Image processing: During image classification and object recognition, pixels with similar features need to be classified into the same category through clustering.
Anomaly detection: Samples that do not belong to any clusters need to be identified to detect potential outliers, such as attacks in the field of cybersecurity.
Document clustering: Many documents need to be classified by topic or content to facilitate information retrieval and data collation.
Bioinformatics: In the analysis of gene expression data, researchers want to use cluster analysis to identify similar gene groups.

Classification models

DNNClassifier

Introduction

Deep Neural Network Classifier (DNNClassifier) is a high-level API provided by TensorFlow to build deep neural network (DNN) models for binary or multi-class classification. The API uses a multi-layer, fully connected network to capture non-linear relationships in data. During training, the weights and biases are optimized to minimize the loss between the model output and the actual category.

The following list describes the benefits of DNNClassifier:

Ease of use: DNNClassifier is a high-level API that simplifies the building of deep learning models.
Modularity: DNNClassifier can be used to flexibly configure the network structure, such as the number and size of hidden layers.
Parallel training: DNNClassifier lets you use multi-core CPUs or GPUs to accelerate training.
Multiple input features: DNNClassifier supports the combination of numerical and categorical features.

Parameters

Parameter	Description
model.hidden_units	The number of neurons at each hidden layer of a DNN model. For example, if the value is [128, 32], the model contains two hidden layers, and the numbers of neurons at these layers are 128 and 32, respectively.
model.n_classes	The total number of categories that the model can classify (applicable to classifier models). For example, a value of 2 indicates that the model classifies data into two categories.
model.optimizer	The optimizer that is used for model training.
model.batch_norm	Specifies whether to apply batch normalization after each hidden layer. Valid values: True (enabled) and False (disabled, default).
model.dropout	The dropout probability. Example value: 0.5. Default value: None.
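
To make the roles of model.hidden_units and model.n_classes more concrete, the following is a minimal pure-Python sketch of the forward pass of a fully connected classifier. It is not the TensorFlow implementation. The helper names (init_network, predict_proba), the random initialization, and the ReLU and softmax choices are assumptions made for illustration. Training, dropout, and batch normalization are omitted.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_network(n_features, hidden_units, n_classes, seed=0):
    """Create one (weights, biases) pair per layer (illustrative random init)."""
    rng = random.Random(seed)
    sizes = [n_features] + list(hidden_units) + [n_classes]
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def predict_proba(layers, features):
    """Forward pass: ReLU on hidden layers, softmax on the output layer."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    out_weights, out_biases = layers[-1]
    return softmax(dense(activations, out_weights, out_biases))


# Four input features, hidden_units = [128, 32], n_classes = 3.
network = init_network(n_features=4, hidden_units=[128, 32], n_classes=3)
probabilities = predict_proba(network, [5.1, 3.5, 1.4, 0.2])
print(len(network), len(probabilities))
```

With hidden_units set to [128, 32], the sketch builds three layers: two hidden layers plus one output layer that has one neuron per category.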

Example

## Training
SELECT * FROM iris.train TO TRAIN DNNClassifier WITH
  model.n_classes = 3,
  train.epoch = 10
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;

## Prediction
SELECT * FROM iris.test
TO PREDICT iris.predict.class
USING sqlflow_models.my_dnn_model;

XGBoost gbtree

Introduction

Extreme Gradient Boosting (XGBoost) is an efficient gradient boosting tree algorithm that is widely used for classification and regression. The XGBoost gbtree model uses decision trees as base learners.

The algorithm has the following characteristics:

High performance: In various classification tasks, the boosted trees of XGBoost often achieve higher accuracy than traditional methods.
Adaptability: XGBoost can automatically evaluate the importance of features and adapt to different data distributions.
Interpretability: XGBoost analyzes the importance of features and lets you understand how a model makes decisions.
Overfitting prevention: XGBoost uses regularization techniques to limit overfitting, including when deep trees are built.
Ability to handle missing values: Missing values in a sample can be appropriately handled to avoid pre-processing issues caused by missing values.
Parallel and distributed training: Models can be efficiently trained on large datasets.
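
The idea of boosting with decision trees as base learners can be sketched in a few lines. The following example is plain gradient boosting for squared-error regression with one-split trees (stumps) on a single feature. It is a simplified illustration, not the XGBoost algorithm itself: XGBoost additionally uses second-order gradient information, regularization terms, and deeper trees. All function names and the sample data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on one feature that minimizes squared error."""
    best = None
    # Exclude the largest value so that the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def stump_predict(stump, x):
    """Return the leaf value of the side that x falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    """Sum the base value and the scaled outputs of all stumps."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
model = fit_boosted_stumps(xs, ys)
mse = sum((boosted_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline_mse = sum((sum(ys) / len(ys) - y) ** 2 for y in ys) / len(ys)
print(mse < baseline_mse)
```

Because each stump is fitted to what the previous stumps got wrong, the training error of the ensemble is lower than the error of predicting the mean alone.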

Example

## Used for binary classification
SELECT * FROM train_table
TO TRAIN xgboost.gbtree
WITH objective="binary:logistic", validation.select="SELECT * FROM val_table"
LABEL class
INTO my_xgb_classification_model;

## Used for multi-class classification
SELECT * FROM train_table
TO TRAIN xgboost.gbtree
WITH objective="multi:softmax", num_class=3, validation.select="SELECT * FROM val_table"
LABEL class
INTO my_xgb_classification_model;

## Used for regression tasks
SELECT * FROM train_table
TO TRAIN xgboost.gbtree
WITH objective="reg:squarederror", validation.select="SELECT * FROM val_table"
LABEL target
INTO my_xgb_regression_model;

GCN

Introduction

A Graph Convolutional Network (GCN) is a deep learning model designed to process graph-structured data. A GCN takes a graph that is composed of nodes and edges as input. In many business scenarios, such as social networks, recommendation systems, biological networks, and knowledge graphs, data can be naturally represented as a graph. GCNs extend the convolution operation of convolutional neural networks (CNNs) to graphs: each layer updates the representation of a node by aggregating feature information from its neighboring nodes. This way, the representation of a node includes the features of its neighbors, and GCNs can capture complex relationships between nodes and the deep structural features of graphs.
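
The neighbor aggregation described above can be sketched as a single graph convolution layer in pure Python. The sketch assumes the commonly used formulation with self-loops and symmetric degree normalization. The source does not state which variant RDS SQLFlow uses, and the function name gcn_layer and the toy graph are hypothetical.

```python
import math


def gcn_layer(adjacency, features, weights):
    """One graph convolution: normalize, aggregate neighbors, transform, ReLU."""
    n = len(adjacency)
    # Add self-loops so that each node keeps part of its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degrees = [sum(row) for row in a_hat]
    # Symmetric normalization: a_hat[i][j] / sqrt(degree_i * degree_j).
    norm = [[a_hat[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
            for i in range(n)]
    n_in = len(features[0])
    n_out = len(weights[0])
    # Aggregate the features of each node's neighborhood.
    aggregated = [[sum(norm[i][k] * features[k][f] for k in range(n))
                   for f in range(n_in)] for i in range(n)]
    # Linear transform followed by ReLU.
    return [[max(0.0, sum(aggregated[i][f] * weights[f][o] for f in range(n_in)))
             for o in range(n_out)] for i in range(n)]


# Path graph 0 - 1 - 2; only node 0 starts with a non-zero feature.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]
identity_weights = [[1.0]]  # 1x1 weight matrix, so only aggregation is visible

hidden = gcn_layer(adjacency, features, identity_weights)
two_hop = gcn_layer(adjacency, hidden, identity_weights)
print(hidden)
print(two_hop)
```

After one layer, node 1 has received information from node 0, but node 2 has not. After the second layer, node 2 also carries a non-zero value, which shows how stacking layers widens the neighborhood that each node representation covers.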

Scenarios

Social network analysis: GCNs can be used to perform tasks such as user recommendation, relationship prediction, and community detection in social networks.
Node classification: GCNs can be used to classify nodes or entities in a knowledge graph or similar structure and predict the category to which a specific node or entity belongs. This can be used to classify brands or products.
Image segmentation: GCNs can model an image as a graph of pixels to improve the performance of image segmentation and object recognition.

Regression models

DNNRegressor

Introduction

DNNRegressor is a high-level API provided by TensorFlow for regression tasks based on deep neural networks. DNNRegressor belongs to the TensorFlow Estimator module and can be used to process structured data such as data in tables or data in the CSV format. DNNRegressor can also learn complex non-linear relationships in data to predict continuous values such as prices or probabilities.

Parameters

Parameter	Description
n_classes	The total number of categories that the model can classify. This parameter applies only to classifier models. For example, a value of 2 indicates that the model classifies data into two categories.
optimizer	The optimizer that is used for model training.
sparse_combiner	Specifies how values are combined when a categorical column contains multiple values in one sample. Valid values: `mean`, `sqrtn`, and `sum`.

Example

## Training
SELECT f9, target FROM housing.train
TO TRAIN DNNRegressor WITH model.hidden_units = [10, 20]
COLUMN EMBEDDING(CATEGORY_ID(f9, 1000), 2, "sum")
LABEL target
INTO housing.dnn_model;

## Prediction
SELECT f9, target FROM housing.test
TO PREDICT housing.predict.class USING housing.dnn_model;

RNNBasedTimeSeriesModel

Introduction

RNNBasedTimeSeriesModel is a time series prediction model based on a recurrent neural network (RNN) with long short-term memory (LSTM) units. It predicts future values using past values as features. An RNN is a neural network that processes sequential data. At each time step, an RNN receives the data for the current point in time and also takes into account the state from the previous point in time. RNNs are suitable for processing time series data such as stock prices and temperature readings, and other sequential data such as natural language text.

The following list describes the benefits of RNNs:

Memory capability: RNNs can store the previous input information. This enables RNNs to capture long-term dependencies in time series.
Dynamic input length: RNNs can handle input sequences of any length, which is especially important for time series data.
Non-linear mapping: RNNs can establish non-linear relationships between input and output data. This enables RNNs to complete complex prediction tasks on time series data.
Sequence generation: RNNs can be used to complete prediction tasks and generate time series data, such as generating music and text information.

Parameters

Parameter	Description
model.stack_units	The number of LSTM layers and the size of each layer. Default value: [500, 500].
model.n_in	The number of input points in time.
model.n_out	The number of output points in time.

Example

## Training
SELECT sepal_length, sepal_width, petal_length, concat(petal_width,',',class) as class 
FROM iris.train 
TO TRAIN sqlflow_models.RNNBasedTimeSeriesModel WITH
    model.n_in=3,
    model.stack_units = [10, 10],
    model.n_out=2,
    model.model_type="lstm",
    validation.metrics= "MeanAbsoluteError,MeanSquaredError"
LABEL class
INTO sqlflow_models.my_dnn_regts_model_2;

## Prediction
SELECT sepal_length, sepal_width, petal_length, concat(petal_width,',',class) as class 
FROM iris.test 
TO PREDICT iris.predict_ts_2.class 
USING sqlflow_models.my_dnn_regts_model_2;

Clustering models

DeepEmbeddingClusterModel

Introduction

DeepEmbeddingClusterModel is a model that combines deep learning and clustering methods. DeepEmbeddingClusterModel is mainly used to automatically learn low-dimensional embedding representations of data and perform clustering in the embedding space.

The following list describes the benefits of DeepEmbeddingClusterModel:

Deep learning: DeepEmbeddingClusterModel can use deep neural networks for feature extraction, automatically learn useful representations of the data, and capture complex non-linear relationships.
Embedding representation: DeepEmbeddingClusterModel can project high-dimensional data into a lower-dimensional embedding space to simplify clustering and improve the clustering performance.
Clustering capabilities: DeepEmbeddingClusterModel can directly perform clustering in the embedding space to reduce the distance between data points of the same type. This helps improve the accuracy and interpretability of the clustering.
Generalization capabilities: In most cases, DeepEmbeddingClusterModel can process noisy and complex data due to the advantages of deep learning.
Scalability: DeepEmbeddingClusterModel can process large datasets and iteratively update the model during deep learning training.

Parameters

Parameter	Description
model.n_clusters	The number of clusters. Default value: 10.
model.kmeans_init	The number of times that the K-means algorithm is run to obtain the best cluster centers. Default value: 20.
model.run_pretrain	Specifies whether to perform pre-training. Default value: True.
model.pretrain_dims	The dimension of each hidden layer when the autoencoder is pre-trained. Default value: [500, 500, 2000, 10].
model.pretrain_activation_func	The activation function of the autoencoder. Default value: `'relu'`.
model.pretrain_batch_size	The batch size for autoencoder training. Default value: 256.
model.train_batch_size	The batch size for training. Default value: 256.
model.pretrain_epochs	The number of pre-training epochs. Default value: 10.
model.pretrain_initializer	The initialization method for the autoencoder parameters. Default value: `'glorot_uniform'`.
model.pretrain_lr	The learning rate of pre-training. Default value: 1.
model.train_lr	The learning rate of training. Default value: 0.1.
model.train_max_iters	The maximum number of training iteration steps. Default value: 8000.
model.update_interval	The interval, in steps, at which the target distribution is updated. Default value: 100.
model.tol	The tolerance. Default value: 0.001.
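
The distribution that is refreshed every model.update_interval steps can be illustrated with the formulation commonly used for deep embedded clustering: a soft assignment of each embedded point to the cluster centers, and a sharpened target distribution derived from it. The sketch below assumes that formulation. This topic does not confirm the exact formulas used by DeepEmbeddingClusterModel, and the function names and sample points are hypothetical.

```python
def soft_assign(embeddings, centers):
    """q[i][j]: Student's t similarity between point i and cluster center j."""
    q = []
    for point in embeddings:
        sims = [1.0 / (1.0 + sum((p - c) ** 2 for p, c in zip(point, center)))
                for center in centers]
        total = sum(sims)
        q.append([s / total for s in sims])
    return q


def target_distribution(q):
    """p[i][j]: squared q, divided by soft cluster frequency, then normalized."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


# Four points in a 2-D embedding space and two cluster centers.
embeddings = [[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [2.2, 1.9]]
centers = [[0.0, 0.0], [2.0, 2.0]]

q = soft_assign(embeddings, centers)
p = target_distribution(q)
labels = [row.index(max(row)) for row in q]
print(labels)
```

In this formulation, the target distribution puts more weight on assignments that are already confident, and training pulls the soft assignments toward it between updates.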

Example

## Training
SELECT (sepal_length - 4.4) / 3.5 as sepal_length, (sepal_width - 2.0) / 2.2 as sepal_width, (petal_length - 1) / 5.9 as petal_length, (petal_width - 0.1) / 2.4 as petal_width
FROM iris.train
TO TRAIN sqlflow_models.DeepEmbeddingClusterModel
WITH
  model.pretrain_dims = [10,10,3],
  model.n_clusters = 3,
  model.pretrain_epochs=5,
  train.batch_size=10,
  train.verbose=1
INTO sqlflow_models.my_clustering_model;

## Prediction
SELECT (sepal_length - 4.4) / 3.5 as sepal_length, (sepal_width - 2.0) / 2.2 as sepal_width, (petal_length - 1) / 5.9 as petal_length, (petal_width - 0.1) / 2.4 as petal_width
FROM iris.test
TO PREDICT iris.predict.class
USING sqlflow_models.my_clustering_model;
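
The SELECT clauses in this example rescale each column as (value - offset) / range before training and prediction. This looks like min-max scaling with constants computed in advance, although the topic does not say how the constants were obtained. The following sketch shows the same transformation in Python. The function name and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using their minimum and maximum."""
    low = min(values)
    span = max(values) - low
    if span == 0:
        # All values are identical, so there is nothing to spread out.
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]


# Illustrative sample whose minimum is 4.4 and whose range is 3.5.
sepal_length = [4.4, 5.1, 6.3, 7.9]
scaled = min_max_scale(sepal_length)
print(scaled)
```

Apply the same offsets and ranges to the training data and to the data used for prediction, as the two SELECT statements above do, so that both are on the same scale.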