DeepFM algorithm: automatic feature interaction capture for prediction models - PolarDB

Deep Factorization Machine (DeepFM) is a machine learning algorithm built into PolarDB for MySQL that automatically captures feature interactions for classification and ranking tasks—no manual feature engineering required.

DeepFM combines a deep neural network (DNN) with a factorization machine (FM). The DNN captures high-order feature interactions, while the FM captures low-order ones. With the default binary task, the two components together produce a floating-point score between 0 and 1 that represents the probability that the target label equals 1. This output is suitable for binary classification and ranking.
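To make the combination concrete, the following is a minimal sketch in plain Python of how an FM term and a small DNN can share feature embeddings and feed a single sigmoid. It is illustrative only and is not PolarDB's internal implementation: the function names, the single hidden layer, and all toy weights are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fm_second_order(embeddings):
    """Sum of pairwise dot products <v_i, v_j> over all i < j,
    computed with the 0.5 * ((sum v)^2 - sum v^2) identity."""
    k = len(embeddings[0])
    total = 0.0
    for f in range(k):
        s = sum(v[f] for v in embeddings)
        sq = sum(v[f] ** 2 for v in embeddings)
        total += 0.5 * (s * s - sq)
    return total

def dnn_output(embeddings, hidden_weights, output_weights):
    """One ReLU hidden layer over the concatenated embeddings (assumed depth)."""
    x = [value for v in embeddings for value in v]
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)))
              for row in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

def deepfm_score(first_order, embeddings, hidden_weights, output_weights):
    """Add the FM (low-order) and DNN (high-order) parts, then squash to (0, 1)."""
    fm = sum(first_order) + fm_second_order(embeddings)
    deep = dnn_output(embeddings, hidden_weights, output_weights)
    return sigmoid(fm + deep)

# Toy input: three active features with embedding size 2 (hypothetical values).
embeddings = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
first_order = [0.05, -0.02, 0.01]
hidden_weights = [[0.1] * 6, [-0.1] * 6]
output_weights = [0.5, 0.5]
score = deepfm_score(first_order, embeddings, hidden_weights, output_weights)
```

In this sketch both components read the same embeddings, which is what lets the model learn low-order and high-order interactions without hand-crafted cross features.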

Key concepts

DeepFM accepts two types of input features:

Categorical feature: a string value representing a discrete category, such as gender (male, female) or product category (clothing, electronics).
Numerical feature: an integer or floating-point value representing a measurable quantity, such as user activity count or product price.
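As a rough illustration of the difference between the two feature types, the sketch below maps categorical strings to integer indices and passes numerical values through as floats. This is a common preprocessing pattern, not a description of what PolarDB does internally; the helper names and sample rows are hypothetical.

```python
def build_vocabulary(values):
    """Assign each distinct category string an integer index, in first-seen order."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab

def encode(row, vocabs, categorical_cols, numerical_cols):
    """Split one row into categorical indices and numerical values."""
    cat_ids = [vocabs[c][row[c]] for c in categorical_cols]
    nums = [float(row[c]) for c in numerical_cols]
    return cat_ids, nums

# Hypothetical sample rows using the feature examples from the list above.
rows = [
    {"gender": "male", "category": "clothing", "price": 19.9},
    {"gender": "female", "category": "electronics", "price": 249.0},
    {"gender": "female", "category": "clothing", "price": 35.5},
]
categorical_cols = ["gender", "category"]
numerical_cols = ["price"]

vocabs = {}
for col in categorical_cols:
    vocabs[col] = build_vocabulary([r[col] for r in rows])

encoded = [encode(r, vocabs, categorical_cols, numerical_cols) for r in rows]
```

Each categorical index would then select an embedding vector, while numerical values can be used directly or after scaling.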

Use cases

DeepFM is effective when your features do not directly predict outcomes, but their combinations do. It works well in two common business scenarios:

Click-through rate (CTR) prediction: predict whether a user will click on an ad or product recommendation, based on historical behavior data (clicks, non-clicks, purchases). DeepFM converts sparse user-behavior features into dense representations that capture interaction patterns across features.

Purchase intent prediction: rank products by likelihood of purchase in a personalized recommendation feed. DeepFM handles complex, non-obvious interactions between user profile features and product attributes that manual feature engineering would likely miss.

Parameters

Configure the parameters below in the model_parameter clause of your CREATE MODEL statement.

All parameters have default values that work for most binary classification tasks. Start with the defaults, then tune epochs, batch_size, and optimizer if model performance needs improvement.

Parameter	Default	Valid values	Description
`task`	`binary`	`binary`, `regression`	The task type. Use `binary` for click prediction and ranking. Use `regression` for continuous-value prediction.
`metrics`	`accuracy`	`accuracy`, `binary_crossentropy`, `mse`	The evaluation metric. `accuracy` measures classification accuracy. `binary_crossentropy` measures cross-entropy loss for binary classification. `mse` measures mean squared error for regression.
`loss`	`binary_crossentropy`	`binary_crossentropy`, `mean_squared_error`	The loss function. Use `binary_crossentropy` for binary classification and `mean_squared_error` for regression.
`optimizer`	`adam`	`adam`, `sgd`, `rmsprop`	The optimizer algorithm. `adam` adapts the learning rate per parameter and handles sparse gradients well, so it is recommended for most cases. `sgd` uses stochastic gradient descent. `rmsprop` improves on AdaGrad by using an exponentially weighted moving average of squared gradients.
`epochs`	`6`	Any positive integer	The number of training iterations over the full dataset. Increase this if the model underfits.
`batch_size`	`64`	Any positive integer	The number of samples per gradient update. Smaller batches produce noisier updates and take longer per epoch; larger batches run faster but need more memory.
`validation_split`	`0.2`	A float between 0 and 1	The fraction of training data held out for validation.
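To see how these parameters interact, the sketch below estimates the number of gradient updates per epoch and checks that task, loss, and metrics are consistent with the descriptions in the table. It is a planning aid only: the helper names, the 10,000-row table size, and the rounding of the validation split are assumptions, and PolarDB may validate or split data differently.

```python
import math

def updates_per_epoch(num_rows, batch_size=64, validation_split=0.2):
    """Gradient updates in one epoch after holding out the validation rows."""
    held_out = round(num_rows * validation_split)  # rounding rule is an assumption
    train_rows = num_rows - held_out
    return math.ceil(train_rows / batch_size)

def check_task_settings(task, loss, metrics):
    """Return a list of mismatches between task, loss, and metrics."""
    expected = {
        "binary": ("binary_crossentropy", {"accuracy", "binary_crossentropy"}),
        "regression": ("mean_squared_error", {"mse"}),
    }
    if task not in expected:
        return [f"unknown task: {task}"]
    expected_loss, expected_metrics = expected[task]
    problems = []
    if loss != expected_loss:
        problems.append(f"loss {loss} does not match task {task}")
    if metrics not in expected_metrics:
        problems.append(f"metrics {metrics} does not match task {task}")
    return problems

steps = updates_per_epoch(10_000)   # hypothetical 10,000-row training table
total_updates = steps * 6           # default epochs = 6
issues = check_task_settings("regression", "binary_crossentropy", "mse")
```

With the defaults, 8,000 of the 10,000 hypothetical rows remain for training, which gives 125 updates per epoch and 750 updates over 6 epochs. Halving batch_size doubles the number of updates per epoch.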

Examples

The examples below use the db4ai.airlines dataset to predict flight delays. All three statements use the /*polar4ai*/ hint to route the query to the PolarDB AI engine.

Train a model

/*polar4ai*/CREATE MODEL airline_deepfm WITH
(model_class = 'deepfm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay',
model_parameter = (epochs = 6))
AS (SELECT * FROM db4ai.airlines);

Evaluate the model

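Run EVALUATE on a labeled sample to measure accuracy against known labels. The example below reuses 20 rows from the training table for brevity; for a meaningful estimate, evaluate on rows that were not used for training.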

/*polar4ai*/SELECT Airline FROM EVALUATE(MODEL airline_deepfm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay',
metrics = 'acc');

The query returns one row per evaluated record. Each row includes the selected column (Airline) and the evaluation result for that record.
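For reference, the sketch below shows how an accuracy metric is conventionally computed from known labels and predicted scores. It is not the PolarDB implementation: the 0.5 decision threshold, the function name, and the sample values are assumptions.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of records whose thresholded score matches the known label."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if not labels:
        raise ValueError("at least one record is required")
    correct = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == label:
            correct += 1
    return correct / len(labels)

# Invented labels (for example, Delay) and model scores for five records.
labels = [1, 0, 1, 1, 0]
scores = [0.91, 0.20, 0.40, 0.75, 0.55]
result = accuracy(labels, scores)
```

Three of the five invented records are classified correctly here, so the accuracy is 0.6.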

Run predictions

/*polar4ai*/SELECT Airline FROM PREDICT(MODEL airline_deepfm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length');

The query returns one row per input record. Each row includes the selected column (Airline) and a predicted probability score between 0 and 1.
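Once the scores are in your application, you can turn them into class labels or a ranked list. The following sketch assumes the result rows have been fetched as (Airline, score) pairs; the airline codes, score values, and the 0.5 threshold are invented for illustration.

```python
def to_label(score, threshold=0.5):
    """Map a probability score to a 0/1 class label."""
    return 1 if score >= threshold else 0

def rank_by_score(rows):
    """Order rows from most to least likely positive."""
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hypothetical fetched rows: (Airline, predicted probability of delay).
predictions = [("CO", 0.81), ("US", 0.34), ("AA", 0.62)]

labeled = [(airline, to_label(score)) for airline, score in predictions]
ranked = [airline for airline, _ in rank_by_score(predictions)]
```

Thresholding suits binary classification, while sorting by score suits ranking tasks such as recommendation feeds. Choose the threshold based on how you weigh false positives against false negatives.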