Introduction and Implementation of the DBMTL Multi-Task Learning Model
Multi-task background
The recommendation algorithms currently used in industry are not limited to a single target such as click-through rate (CTR). They also need to account for subsequent conversion steps, such as whether the user comments, bookmarks, adds to cart, or purchases, and how long the user watches.
A common multi-objective model starts from a separate network for each optimization objective and lets these networks share parameters at the bottom layers, which balances the independence of and the correlation between the objective-specific models. This type of framework can be summarized by the structure in the above figure. Regardless of how the bottom parameters are shared, the networks keep independent branches in the last few layers to predict the final value of each target. The probability model of such a network can be described by the following formula:
$$P(l, m \mid x, H) = P(l \mid x, H)\, P(m \mid x, H)$$
Here l and m are the targets, x is the sample feature, and H is the model. The assumption is that the targets are independent of each other given x and H.
Introduction to DBMTL
One of the starting points of DBMTL (Deep Bayesian Multi-Target Learning) is to drop this independence assumption. Applying the basic Bayesian (product) rule, the probability model can be written without it as:
$$P(l, m \mid x, H) = P(l \mid x, H)\, P(m \mid l, x, H)$$
As shown in the figure below, the main difference between DBMTL and the traditional MTL structure (which treats the targets as independent) is that DBMTL builds a Bayesian network between the target nodes and explicitly models the possible causal relationships between targets. In real business scenarios, user behaviors often have clear sequential dependencies. For example, in a feed scenario, users must first click through to the article detail page before they can browse, comment, forward, or favorite. DBMTL encodes these relationships in the model structure and therefore tends to learn better.
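The cost of ignoring such a dependency can be seen with a small numeric sketch. The probabilities below are made-up toy values, and the function names are illustrative only. The sketch compares the joint probability of (click, view) under the chain rule with the one obtained by treating the two targets as independent, for the case where a view can never happen without a click.

```python
# Toy numbers (assumed for illustration): 10% of impressions are clicked,
# half of the clicked items are then viewed, and a view never happens
# without a click.
P_CLICK = 0.1
P_VIEW_GIVEN_CLICK = {1: 0.5, 0: 0.0}


def joint_chain_rule(click, view):
    """P(click, view) = P(click) * P(view | click)."""
    p_click = P_CLICK if click else 1.0 - P_CLICK
    p_view = P_VIEW_GIVEN_CLICK[click]
    return p_click * (p_view if view else 1.0 - p_view)


def joint_independent(click, view):
    """P(click, view) = P(click) * P(view), using the marginal P(view)."""
    p_click = P_CLICK if click else 1.0 - P_CLICK
    # Marginal view rate: sum over click of P(click) * P(view | click).
    p_view = (P_CLICK * P_VIEW_GIVEN_CLICK[1]
              + (1.0 - P_CLICK) * P_VIEW_GIVEN_CLICK[0])
    return p_click * (p_view if view else 1.0 - p_view)


if __name__ == "__main__":
    # "Viewed without clicking" is impossible under the dependency ...
    print(joint_chain_rule(click=0, view=1))
    # ... but the independent model still assigns it probability mass.
    print(joint_independent(click=0, view=1))
```

Under the chain rule the impossible event gets zero probability, while the independent factorization spreads mass onto it. This is the kind of mismatch that modeling the dependency between targets is meant to avoid.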
The following figure shows the concrete implementation of the DBMTL model. The network consists of an input layer, a shared embedding layer, a shared layer, a specific layer, and a Bayesian layer.
• The shared embedding layer is a lookup table shared by all targets during training.
• The shared and specific layers are generic multilayer perceptrons (MLPs) that model the shared and the target-specific representations, respectively.
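• The Bayesian layer is the most important part of DBMTL. For three targets l, m, and n, it implements the following probabilistic model:
$$P(l, m, n \mid x, H) = P(l \mid x, H)\, P(m \mid l, x, H)\, P(n \mid l, m, x, H)$$
Its corresponding log-likelihood loss function is:
$$L(x, H) = -\log P(l \mid x, H) - \log P(m \mid l, x, H) - \log P(n \mid l, m, x, H)$$
In practice, adjusting the weights of the different targets still has a significant effect. Assigning weights w_1, w_2, and w_3 to the targets is equivalent to re-expressing the loss function as:
$$L(x, H) = -w_1 \log P(l \mid x, H) - w_2 \log P(m \mid l, x, H) - w_3 \log P(n \mid l, m, x, H)$$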
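As a minimal sketch of the weighting idea, assume the loss is the negative log-likelihood of the factorized model, so that weighting simply scales each target's log term. The function name `weighted_nll` and the example probabilities are illustrative and not part of any library.

```python
import math


def weighted_nll(cond_probs, weights=None):
    """Return -sum_k w_k * log(p_k).

    cond_probs: probability the model assigns to the observed label of each
        target, each already conditioned on its upstream targets.
    weights: per-target weights; defaults to 1.0 for every target.
    """
    if weights is None:
        weights = [1.0] * len(cond_probs)
    if len(weights) != len(cond_probs):
        raise ValueError("need one weight per target")
    return -sum(w * math.log(p) for w, p in zip(weights, cond_probs))


if __name__ == "__main__":
    probs = [0.8, 0.6, 0.9]  # hypothetical P(l|x), P(m|l,x), P(n|l,m,x)
    print(weighted_nll(probs))                   # all targets weighted equally
    print(weighted_nll(probs, [1.0, 2.0, 0.5]))  # emphasize the second target
```

With all weights equal to 1 the result is the negative log of the joint probability. Other weights trade off how strongly each target drives the gradient.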
In the Bayesian layer of the network, the functions f1, f2, and f3 are implemented as fully connected MLPs that learn the implicit causal relationships between targets. Each takes as input the concatenated embeddings of the function's input variables and outputs an embedding that represents the function's output variable. The embedding of each target then goes through one more MLP layer that outputs the final probability of that target.
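The data flow described above can be sketched in plain Python. This is an untrained toy forward pass with randomly initialized weights, not the EasyRec implementation; `dense`, `init_layer`, and `bayesian_layer` are hypothetical helper names, and the feature values are made up.

```python
import math
import random


def dense(inputs, weights, bias, activation=None):
    """One fully connected layer: out[j] = act(bias[j] + sum_i x[i] * W[i][j])."""
    out = []
    for j in range(len(bias)):
        z = bias[j] + sum(x * weights[i][j] for i, x in enumerate(inputs))
        out.append(activation(z) if activation else z)
    return out


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(rng, n_in, n_out):
    """Random weights of shape [n_in][n_out] and a zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)]
               for _ in range(n_in)]
    return weights, [0.0] * n_out


def bayesian_layer(tower_features, parents, relation_dim, rng):
    """tower_features: name -> feature list, iterated in insertion order.
    parents: name -> list of upstream target names."""
    relation_features, probs = {}, {}
    for name, feat in tower_features.items():
        # Concatenate this target's own features with the embeddings
        # already computed for its upstream targets.
        relation_input = list(feat)
        for parent in parents.get(name, []):
            relation_input += relation_features[parent]
        w, b = init_layer(rng, len(relation_input), relation_dim)
        relation_features[name] = dense(relation_input, w, b, relu)
        # A final single-unit layer turns the embedding into a probability.
        w_out, b_out = init_layer(rng, relation_dim, 1)
        probs[name] = dense(relation_features[name], w_out, b_out, sigmoid)[0]
    return relation_features, probs


if __name__ == "__main__":
    feats = {"is_click": [0.2, -0.1], "is_view": [0.4, 0.3]}
    _, probs = bayesian_layer(feats, {"is_view": ["is_click"]}, 3,
                              random.Random(0))
    print(probs)
```

The point of the sketch is the wiring: the input width of a downstream target's MLP grows by one relation embedding per upstream target, which is how information about upstream targets reaches downstream predictions.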
Code
We have implemented the DBMTL algorithm on top of the EasyRec recommendation algorithm framework. The full implementation is available on GitHub: EasyRec-DBMTL.
Introduction to EasyRec: EasyRec is a large-scale distributed recommendation algorithm framework open-sourced by the machine learning PAI team of Alibaba Cloud Computing Platform. It incorporates feature engineering methods that have achieved excellent results, integrates training, evaluation, and deployment, and connects seamlessly with Alibaba Cloud products, so that a cutting-edge recommendation system can be built with EasyRec in a short period of time. As a leading product of Alibaba Cloud, it has been stably serving hundreds of enterprise customers.
Model Feedforward Network
def build_predict_graph(self):
  """Forward function.

  Returns:
    self._prediction_dict: Prediction results of all tasks.
  """
  # Start from the tensor produced by the shared embedding layer
  # (self._features); its generation logic is omitted here.

  # Shared layer
  if self._model_config.HasField('bottom_dnn'):
    bottom_dnn = dnn.DNN(
        self._model_config.bottom_dnn,
        self._l2_reg,
        name='bottom_dnn',
        is_training=self._is_training)
    bottom_fea = bottom_dnn(self._features)
  else:
    bottom_fea = self._features

  # Optional MMoE block: produces one input tensor per task
  if self._model_config.HasField('expert_dnn'):
    mmoe_layer = mmoe.MMOE(
        self._model_config.expert_dnn,
        l2_reg=self._l2_reg,
        num_task=self._task_num,
        num_expert=self._model_config.num_expert)
    task_input_list = mmoe_layer(bottom_fea)
  else:
    task_input_list = [bottom_fea] * self._task_num

  # Specific layer: one DNN tower per task
  tower_features = {}
  for i, task_tower_cfg in enumerate(self._model_config.task_towers):
    tower_name = task_tower_cfg.tower_name
    if task_tower_cfg.HasField('dnn'):
      tower_dnn = dnn.DNN(
          task_tower_cfg.dnn,
          self._l2_reg,
          name=tower_name + '/dnn',
          is_training=self._is_training)
      tower_fea = tower_dnn(task_input_list[i])
      tower_features[tower_name] = tower_fea
    else:
      tower_features[tower_name] = task_input_list[i]

  # Bayesian network: each tower also consumes the relation features of
  # the towers it depends on, so those towers must be configured first.
  tower_outputs = {}
  relation_features = {}
  for task_tower_cfg in self._model_config.task_towers:
    tower_name = task_tower_cfg.tower_name
    relation_dnn = dnn.DNN(
        task_tower_cfg.relation_dnn,
        self._l2_reg,
        name=tower_name + '/relation_dnn',
        is_training=self._is_training)
    tower_inputs = [tower_features[tower_name]]
    for relation_tower_name in task_tower_cfg.relation_tower_names:
      tower_inputs.append(relation_features[relation_tower_name])
    relation_input = tf.concat(
        tower_inputs, axis=1, name=tower_name + '/relation_input')
    relation_fea = relation_dnn(relation_input)
    # Store the tensor for this tower (not the dict itself) so that
    # downstream towers can concatenate it.
    relation_features[tower_name] = relation_fea

    output_logits = tf.layers.dense(
        relation_fea,
        task_tower_cfg.num_class,
        kernel_regularizer=self._l2_reg,
        name=tower_name + '/output')
    tower_outputs[tower_name] = output_logits

  self._add_to_prediction_dict(tower_outputs)
  return self._prediction_dict
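Because the Bayesian-network loop reads relation_features[relation_tower_name] for towers processed earlier in the same loop, a tower can only depend on towers listed before it in the config. A small stand-alone check, sketched here with the hypothetical helper name `check_tower_order` and plain tuples instead of the real config objects, can catch a misordered config before the graph is built.

```python
def check_tower_order(towers):
    """Validate that every tower only depends on towers defined before it.

    towers: list of (tower_name, relation_tower_names) in config order.
    Returns the tower names in order, or raises ValueError.
    """
    seen = set()
    for name, relation_names in towers:
        missing = [r for r in relation_names if r not in seen]
        if missing:
            raise ValueError(
                'tower %r depends on %s, which must be listed before it' %
                (name, missing))
        seen.add(name)
    return [name for name, _ in towers]


if __name__ == '__main__':
    # Example tower names chosen for illustration.
    towers = [
        ('is_click', []),
        ('is_view', ['is_click']),
        ('view_costtime', ['is_click', 'is_view']),
    ]
    print(check_tower_order(towers))
```

Without such a check, a misordered config would surface as a KeyError on the relation_features lookup while the graph is being built.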
Loss calculation
def build(loss_type, label, pred, loss_weight=1.0, num_class=1, **kwargs):
  # Map a configured loss type to the corresponding TensorFlow loss op.
  if loss_type == LossType.CLASSIFICATION:
    if num_class == 1:
      return tf.losses.sigmoid_cross_entropy(
          label, logits=pred, weights=loss_weight, **kwargs)
    else:
      return tf.losses.sparse_softmax_cross_entropy(
          labels=label, logits=pred, weights=loss_weight, **kwargs)
  elif loss_type == LossType.CROSS_ENTROPY_LOSS:
    return tf.losses.log_loss(label, pred, weights=loss_weight, **kwargs)
  elif loss_type in [LossType.L2_LOSS, LossType.SIGMOID_L2_LOSS]:
    logging.info('%s is used' % LossType.Name(loss_type))
    return tf.losses.mean_squared_error(
        labels=label, predictions=pred, weights=loss_weight, **kwargs)
  elif loss_type == LossType.PAIR_WISE_LOSS:
    return pairwise_loss(pred, label)
  else:
    raise ValueError('unsupported loss type: %s' % LossType.Name(loss_type))


def _build_loss_impl(self,
                     loss_type,
                     label_name,
                     loss_weight=1.0,
                     num_class=1,
                     suffix=''):
  # Pick the prediction tensor that matches the loss type of this tower.
  loss_dict = {}
  if loss_type == LossType.CLASSIFICATION:
    loss_name = 'cross_entropy_loss' + suffix
    pred = self._prediction_dict['logits' + suffix]
  elif loss_type in [LossType.L2_LOSS, LossType.SIGMOID_L2_LOSS]:
    loss_name = 'l2_loss' + suffix
    pred = self._prediction_dict['y' + suffix]
  else:
    raise ValueError('invalid loss type: %s' % LossType.Name(loss_type))

  loss_dict[loss_name] = build(loss_type,
                               self._labels[label_name],
                               pred,
                               loss_weight, num_class)
  return loss_dict


def build_loss_graph(self):
  """Build loss graph for multi task model."""
  for task_tower_cfg in self._task_towers:
    tower_name = task_tower_cfg.tower_name
    loss_weight = task_tower_cfg.weight * self._sample_weight

    # Optionally reweight samples inside and outside the task space.
    if (hasattr(task_tower_cfg, 'task_space_indicator_label') and
        task_tower_cfg.HasField('task_space_indicator_label')):
      in_task_space = tf.to_float(
          self._labels[task_tower_cfg.task_space_indicator_label] > 0)
      loss_weight = loss_weight * (
          task_tower_cfg.in_task_space_weight * in_task_space +
          task_tower_cfg.out_task_space_weight * (1 - in_task_space))

    # The EasyRec framework automatically sums the losses in self._loss_dict.
    self._loss_dict.update(
        self._build_loss_impl(
            task_tower_cfg.loss_type,
            label_name=self._label_name_dict[tower_name],
            loss_weight=loss_weight,
            num_class=task_tower_cfg.num_class,
            suffix='_%s' % tower_name))

  return self._loss_dict
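The task-space reweighting in build_loss_graph can be read as a per-sample switch between two weights. The following scalar sketch mirrors that arithmetic without TensorFlow; the function name `task_space_loss_weight` and the default values of 1.0 are assumptions made for this illustration.

```python
def task_space_loss_weight(base_weight,
                           indicator_label,
                           in_task_space_weight=1.0,
                           out_task_space_weight=1.0):
    """Per-sample loss weight after task-space reweighting.

    A sample is "in task space" when its indicator label is positive,
    for example a duration target that is only meaningful after a click.
    """
    in_task_space = 1.0 if indicator_label > 0 else 0.0
    return base_weight * (in_task_space_weight * in_task_space +
                          out_task_space_weight * (1.0 - in_task_space))


if __name__ == '__main__':
    # Count a downstream loss only for samples whose indicator is positive.
    print(task_space_loss_weight(1.0, indicator_label=1,
                                 out_task_space_weight=0.0))
    print(task_space_loss_weight(1.0, indicator_label=0,
                                 out_task_space_weight=0.0))
```

Setting the out-of-space weight to 0 restricts a downstream target's loss to samples where the upstream event happened, while a small positive value keeps some signal from the remaining samples.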
Application
Thanks to its strong performance, DBMTL is widely used on PAI.
Take a live-streaming recommendation business as an example. The scenario has five objectives: is_click, is_view, view_costtime, is_on_mic, and on_mic_duration. Of these, is_click, is_view, and is_on_mic are binary classification tasks, while view_costtime and on_mic_duration are regression tasks that predict durations. The dependencies between user behaviors are:
• is_click => is_view
• is_click + is_view => view_costtime
• is_click => is_on_mic
• is_click + is_on_mic => on_mic_duration
So the configuration is as follows:
model_config {
  # Other model_config fields (for example feature groups) are omitted here.
  dbmtl {
    bottom_dnn {
      hidden_units: [512, 256]
    }
    task_towers {
      tower_name: "is_click"
      label_name: "is_click"
      loss_type: CLASSIFICATION
      metrics_set: {
        auc {}
      }
      dnn {
        hidden_units: [128, 96, 64]
      }
      relation_dnn {
        hidden_units: [32]
      }
      weight: 1.0
    }
    task_towers {
      tower_name: "is_view"
      label_name: "is_view"
      loss_type: CLASSIFICATION
      metrics_set: {
        auc {}
      }
      dnn {
        hidden_units: [128, 96, 64]
      }
      relation_tower_names: ["is_click"]
      relation_dnn {
        hidden_units: [32]
      }
      weight: 1.0
    }
    task_towers {
      tower_name: "view_costtime"
      label_name: "view_costtime"
      loss_type: L2_LOSS
      metrics_set: {
        mean_squared_error {}
      }
      dnn {
        hidden_units: [128, 96, 64]
      }
      relation_tower_names: ["is_click", "is_view"]
      relation_dnn {
        hidden_units: [32]
      }
      weight: 1.0
    }
    task_towers {
      tower_name: "is_on_mic"
      label_name: "is_on_mic"
      loss_type: CLASSIFICATION
      metrics_set: {
        auc {}
      }
      dnn {
        hidden_units: [128, 96, 64]
      }
      relation_tower_names: ["is_click"]
      relation_dnn {
        hidden_units: [32]
      }
      weight: 1.0
    }
    task_towers {
      tower_name: "on_mic_duration"
      label_name: "on_mic_duration"
      loss_type: L2_LOSS
      metrics_set: {
        mean_squared_error {}
      }
      dnn {
        hidden_units: [128, 96, 64]
      }
      relation_tower_names: ["is_click", "is_on_mic"]
      relation_dnn {
        hidden_units: [32]
      }
      weight: 1.0
    }
    l2_regularization: 1e-6
  }
  embedding_regularization: 5e-6
}
It is worth mentioning that after the DBMTL model was launched, the online onlooker rate increased by 18% and the on-mic rate increased by 14% compared with the GBDT+FM model (a single-target model trained on the onlooker target).