【DSW Gallery】Gbdt-FM model

Introduction: The GBDT+FM model extends GBDT+LR. It uses GBDT to automatically select and combine features, producing a new discrete feature vector that is then used as the input to an FM model, which generates the final prediction. The model can draw on user, item, and context features to produce more comprehensive recommendations, and it is widely used in click-through-rate (CTR) prediction scenarios.

Use directly
Please open the Gbdt-FM model and click "Open in DSW" in the upper right corner.

Gbdt + FM integrated model training and service deployment

1. The GBDT+FM model extends GBDT+LR: GBDT automatically selects and combines features to produce a new discrete feature vector, which is then used as the input to an FM model that generates the final prediction. The model can draw on user, item, and context features and is widely used in CTR prediction scenarios.
2. This article introduces how to quickly build a GBDT+FM model with Alink on DSW, and how to easily deploy the trained model as a service.

Operating environment requirements
1. PyAlink is pre-installed in the official PAI-DSW image; 4 GB of memory or more is required.
2. This notebook can be run and viewed directly; no additional files are needed.

from pyalink.alink import *

# Run the Alink job locally (inside the DSW container) with a parallelism of 2
useLocalEnv(2)

Scale to Larger Data

In this example, we use useLocalEnv to run the Alink job locally (that is, inside the DSW container), using multiple threads to simulate distributed computing.
For larger-scale data, you can use usePAIEnv to submit the job to a large cluster; run help(usePAIEnv) for detailed usage.

Data preparation

Adult data source https://archive.ics.uci.edu/ml/datasets/Adult
Algorithm related documents:
• https://www.yuque.com/pinshu/alink_doc/csvsourcebatchop
The Adult dataset (the "Census Income" dataset) is extracted from the US census database and contains 48,842 records in total: 23.93% with annual income above 50k USD and 76.07% with annual income at or below 50k USD. It is pre-split into 32,561 training records and 16,281 test records. The class variable is whether annual income exceeds 50k USD; the 14 attribute variables cover important information such as age, work class, education, and occupation, of which 8 are categorical discrete variables and the other 6 are continuous numerical variables. It is a classification dataset for predicting whether annual income exceeds 50k USD.
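As a minimal sketch, the data could be read with the CsvSourceBatchOp component linked above. The schema string below follows the UCI "adult.names" description; the column names are illustrative labels of my own choosing, and the URL is the standard UCI download location:

```python
# Schema for the 14 Adult attributes plus the income label.
# Column names are illustrative; types follow the UCI description
# (6 continuous columns as bigint, 8 categorical plus the label as string).
ADULT_SCHEMA = ("age bigint, workclass string, fnlwgt bigint, "
                "education string, education_num bigint, "
                "marital_status string, occupation string, "
                "relationship string, race string, sex string, "
                "capital_gain bigint, capital_loss bigint, "
                "hours_per_week bigint, native_country string, "
                "label string")


def load_adult(path):
    """Build a CsvSourceBatchOp reading the Adult data from `path`."""
    from pyalink.alink import CsvSourceBatchOp
    return (CsvSourceBatchOp()
            .setFilePath(path)
            .setSchemaStr(ADULT_SCHEMA))

# Usage (standard UCI location for the training split):
# train_data = load_adult(
#     "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data")
```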

Training model

Algorithm related documents:
• https://www.yuque.com/pinshu/alink_doc/intro
• https://www.yuque.com/pinshu/alink_doc/gbdtencoder
• https://www.yuque.com/pinshu/alink_doc/fmclassifier
We complete the integrated training of the model by putting the two operators, GbdtEncoder and FmClassifier, into one Pipeline. GbdtEncoder encodes the input data, and the encoded result is fed to the FM model for training. The result is a pipeline model that can be used to run inference on data and can also be deployed as a service.
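The pipeline described above can be sketched as follows. This is a hedged outline, not the notebook's exact code: the operator and parameter names follow the GbdtEncoder and FmClassifier documentation linked above, but column names like "gbdt_vec", "pred", and "pred_detail" are illustrative assumptions:

```python
# The 14 Adult attribute columns (illustrative names matching the schema
# used when loading the data).
FEATURE_COLS = ["age", "workclass", "fnlwgt", "education", "education_num",
                "marital_status", "occupation", "relationship", "race", "sex",
                "capital_gain", "capital_loss", "hours_per_week",
                "native_country"]


def build_gbdt_fm_pipeline():
    """Sketch of the GBDT+FM training pipeline described in the text."""
    from pyalink.alink import Pipeline, GbdtEncoder, FmClassifier
    return (Pipeline()
            # GbdtEncoder encodes each row as a discrete vector derived from
            # the leaf nodes of the trained GBDT trees.
            .add(GbdtEncoder()
                 .setFeatureCols(FEATURE_COLS)
                 .setLabelCol("label")
                 .setPredictionCol("gbdt_vec"))
            # FmClassifier consumes the encoded vector and produces the
            # final prediction.
            .add(FmClassifier()
                 .setVectorCol("gbdt_vec")
                 .setLabelCol("label")
                 .setPredictionCol("pred")
                 .setPredictionDetailCol("pred_detail")))

# Usage: model = build_gbdt_fm_pipeline().fit(train_data)
```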

Model evaluation

Algorithm related documents:
• https://www.yuque.com/pinshu/alink_doc/evalbinaryclassbatchop
• https://www.yuque.com/pinshu/alink_doc/jsonvaluebatchop
In the model evaluation phase, we first use the trained model to run inference on testData, then evaluate the inference results with the EvalBinaryClassBatchOp component, and finally extract the evaluation metrics with the JsonValueBatchOp component.
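The three evaluation steps might be chained as below. This is a sketch under assumptions: the "Data" column name and the "$.AUC" / "$.Accuracy" JSON paths follow common Alink evaluation examples but should be checked against the linked documentation, and "pred_detail" is an illustrative column name:

```python
def evaluate_model(model, test_data):
    """Run inference, evaluate it, and extract AUC/Accuracy as columns."""
    from pyalink.alink import EvalBinaryClassBatchOp, JsonValueBatchOp
    return (model.transform(test_data)          # 1. inference on testData
            .link(EvalBinaryClassBatchOp()      # 2. binary-class evaluation
                  .setLabelCol("label")
                  .setPredictionDetailCol("pred_detail"))
            .link(JsonValueBatchOp()            # 3. pull metrics out of the
                  .setSelectedCol("Data")       #    JSON evaluation result
                  .setOutputCols(["AUC", "Accuracy"])
                  .setJsonPath(["$.AUC", "$.Accuracy"])))

# Usage: evaluate_model(model, test_data).print()
```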

Comparison with Gbdt+LR
Algorithm related documents:
• https://www.yuque.com/pinshu/alink_doc/evalbinaryclassbatchop
• https://www.yuque.com/pinshu/alink_doc/jsonvaluebatchop
• https://www.yuque.com/pinshu/alink_doc/logisticregression
• https://www.yuque.com/pinshu/alink_doc/gbdtencoder
The comparison shows that Gbdt+FM outperforms Gbdt+LR: on the same data, its AUC is about 0.7 percentage points higher.
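For the baseline in this comparison, the only change to the pipeline is swapping FmClassifier for LogisticRegression (documentation linked above). A hedged sketch, with the same illustrative column names as before:

```python
def build_gbdt_lr_pipeline(feature_cols):
    """Sketch of the Gbdt+LR baseline: same encoder, LR instead of FM."""
    from pyalink.alink import Pipeline, GbdtEncoder, LogisticRegression
    return (Pipeline()
            .add(GbdtEncoder()
                 .setFeatureCols(feature_cols)
                 .setLabelCol("label")
                 .setPredictionCol("gbdt_vec"))
            # LogisticRegression replaces FmClassifier; it reads the same
            # encoded vector column.
            .add(LogisticRegression()
                 .setVectorCol("gbdt_vec")
                 .setLabelCol("label")
                 .setPredictionCol("pred")
                 .setPredictionDetailCol("pred_detail")))
```

Training and evaluating both pipelines on the same train/test split makes the AUC figures directly comparable.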

Writing out the model
Algorithm related documents:
• https://www.yuque.com/pinshu/alink_doc/aksinkbatchop
In the model writing phase, we use AkSinkBatchOp to write the model to the file system. This can be the local file system (as shown in the code) or a network file system such as OSS. You can build an OSS path with the following code:
fs = OssFileSystem("3.4.1", "oss-cn-hangzhou-zmf.aliyuncs.com", "name", "************", "****** ****")
filePath = FilePath("/model/gbdt_fm_model.ak", fs)

This constructs the network file system path; pass it to the AkSinkBatchOp component as a parameter:
AkSinkBatchOp().setFilePath(filePath).setOverwriteSink(True)
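Putting the pieces together, the write-out step might look like the sketch below. It assumes, per my reading of the linked docs, that the fitted pipeline model exposes a save() that yields a batch operator which can be linked to AkSinkBatchOp; the local path is illustrative:

```python
def save_model(model, path):
    """Write a fitted pipeline model to `path` via AkSinkBatchOp."""
    from pyalink.alink import AkSinkBatchOp, BatchOperator
    # model.save() is assumed to return a batch operator carrying the
    # model data, which AkSinkBatchOp then writes to the file system.
    model.save().link(AkSinkBatchOp()
                      .setFilePath(path)
                      .setOverwriteSink(True))
    BatchOperator.execute()

# Usage (illustrative local path):
# save_model(model, "/tmp/gbdt_fm_model.ak")
```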

Load the model and infer
The path used to load the model is the same as the one used when writing it out; it can be on the local file system (as shown in the code) or on a network file system such as OSS.
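A minimal sketch of loading and inference, assuming PipelineModel.load accepts the same path used when writing the model out (the path and the firstN sample size are illustrative):

```python
def load_and_predict(path, test_data):
    """Reload a saved pipeline model and run inference on `test_data`."""
    from pyalink.alink import PipelineModel
    model = PipelineModel.load(path)   # same path as used by the sink
    return model.transform(test_data)  # inference result as a batch op

# Usage (illustrative):
# load_and_predict("/tmp/gbdt_fm_model.ak", test_data).firstN(5).print()
```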

Model deployment

The model can be deployed using the eascmd command-line tool:
!./eascmd64 -i {EAS AccessKeyId} -k {EAS AccessKeySecret} -e pai-eas.cn-beijing.aliyuncs.com create config.json
• https://www.yuque.com/pinshu/alink_tutorial/pai_designer
You can also deploy with one click through the interactive interface of Alibaba Cloud PAI after filling in a few parameters. For details, please refer to the document:
• https://help.aliyun.com/document_detail/110981.html
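The config.json passed to eascmd describes the service. A minimal sketch of its general shape, with placeholder values; the exact fields and the processor name for Alink models should be taken from the EAS documentation linked above:

```json
{
  "name": "gbdt_fm_service",
  "model_path": "oss://your-bucket/model/gbdt_fm_model.ak",
  "processor": "your_processor_name",
  "metadata": {
    "instance": 1,
    "cpu": 2,
    "memory": 4000
  }
}
```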
