Use EasyTransfer to develop a text classification model

EasyTransfer is designed to help developers develop transfer learning models in natural language processing (NLP) scenarios. This topic uses text classification as an example to describe how to use EasyTransfer to train models, evaluate models, use models to make predictions, export model files, and deploy models in Data Science Workshop (DSW) of Machine Learning Platform for AI (PAI).

Prerequisites

A DSW instance is created and the software version requirements are met. For more information, see Create a DSW instance and Limits.

Note

We recommend that you use a GPU-accelerated DSW instance.

Background information

Transfer learning is a machine learning method that applies knowledge acquired from one solved problem to a different problem. Industry has a growing need to apply transfer learning in NLP applications. In emerging industries, conventional machine learning requires a large investment in manpower and resources to accumulate enough training data. To resolve this issue, developers can reuse the training data of an existing task to improve learning performance on a new task. PAI provides EasyTransfer, a deep learning framework, to help developers develop transfer learning models for NLP applications.

Limits

EasyTransfer supports the following Python and TensorFlow versions:

Python: Python 2.7, Python 3.4, or versions later than Python 3.4.
Image: the official image tensorflow:1.12PAI-gpu-py36-cu101-ubuntu18.04.

Step 1: Prepare data

Go to the development environment of Data Science Workshop (DSW).
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
3. In the upper-left corner of the page, select the region where you want to use PAI.
4. In the left-side navigation pane, choose Model Development and Training > Interactive Modeling (DSW).
5. (Optional.) On the Interactive Modeling (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.
6. Find the DSW instance and click Launch in the Actions column.
In the development environment of DSW, click Terminal in the top navigation bar and follow the on-screen instructions to launch Terminal.
Run the following commands in the terminal to download the sample datasets:
```
wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/tutorial/ez_text_classify/zqkd_sample/train.csv
wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/tutorial/ez_text_classify/zqkd_sample/dev.csv
```
Note
The datasets used in this example are only for demonstration. You may need a larger dataset when you train a news classification model.
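
Before you start training, it can help to check that each downloaded file has the two columns that the later commands expect (content and label). The following Python sketch counts rows per label. It assumes that the files are tab-delimited, have no header row, and store the label in the second column. This topic does not state the file layout, so treat these as assumptions and adjust `delimiter` and `label_index` as needed. `count_labels` is a hypothetical helper and is not part of EasyTransfer.

```python
import csv
import io
from collections import Counter


def count_labels(lines, delimiter="\t", label_index=1):
    """Count rows per label in an iterable of delimited text lines.

    Assumptions: no header row, and the label is in column `label_index`.
    Rows with too few columns are counted under "<malformed>".
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip blank lines
        if len(row) <= label_index:
            counts["<malformed>"] += 1
        else:
            counts[row[label_index]] += 1
    return counts


# Made-up in-memory rows that stand in for train.csv.
sample = io.StringIO("some news text\t科技\nanother text\t体育\nthird text\t科技\n")
sample_counts = count_labels(sample)
print(sample_counts.most_common())
```

To run the check on a real file, open it with `open("train.csv", encoding="utf-8", newline="")` and pass the file object to `count_labels`.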

Step 2: Start a training task in the current directory

Run the following command to start a training task:

easy_transfer_app \
  --mode=train \
  --modelName=text_classify_bert \
  --inputTable="./train.csv,./dev.csv" \
  --inputSchema=content:str:1,label:str:1 \
  --firstSequence=content \
  --labelName=label \
  --labelEnumerateValues="教育,三农,娱乐,健康,美文,搞笑,美食,财经,科技,旅游,汽车,时尚,科学,文化,房产,热点,母婴,家居,体育,国际,育儿,宠物,游戏,健身,职场,读书,艺术,动漫" \
  --sequenceLength=128 \
  --checkpointDir=./classify_models \
  --batchSize=64 \
  --numEpochs=3 \
  --optimizerType=adam \
  --learningRate=3e-5 \
  --advancedParameters='\
    pretrain_model_name_or_path=pai-bert-base-zh \
    '

The following table describes the parameters.

Parameter	Required	Description	Default value	Type
mode	Yes	The mode that is used. Valid values: train, evaluate, predict, and export.	None	STRING
modelName	No	The name of the model. Valid values: text_classify_bert (BERT model for text classification), text_classify_dgcnn (DGCNN model for text classification), text_match_bert (BERT model for text matching), text_match_bert_two_tower (two-tower BERT model for text matching), text_match_bicnn (BiCNN model, a two-tower CNN model), text_match_hcnn (HCNN model), text_match_dam (DAM model), text_match_damplus (DAM+ model), text_classify_cnn (TextCNN model), text_comprehension_bert (BERT model for reading comprehension), text_comprehension_bert_hae (BERT-HAE model), and sequence_labeling_bert (BERT model for sequence labeling).	text_match_bert	STRING
inputTable	Yes	The input table for model training. Separate multiple tables with commas (,). Example: `./train.csv,./dev.csv`.	None	STRING
inputSchema	Yes	The schema of the columns in the input table, in the Column name:Type:Length format. Separate multiple columns with commas (,). Valid values of Type: int, str, and float. In most cases, Length is 1. If the column is a comma-separated array, Length is the length of the array. Example: content:str:1,label:str:1.	None	STRING
firstSequence	Yes	The column that corresponds to the first text sequence in the input table.	None	STRING
labelName	No	The name of the label column in the input table.	Empty string ""	STRING
labelEnumerateValues	No	The enumerate values of labels. You can specify the values by using one of the following methods: Directly specify the enumerate values and separate them with commas (,). Specify the path of a TXT file. The TXT file contains the enumerate values that are separated by line feeds.	Empty string ""	STRING
sequenceLength	No	The maximum sequence length. Valid values: 1 to 512.	128	INT
checkpointDir	Yes	The directory of the model. Example: `./classify_models`.	None	STRING
batchSize	No	The size of each training batch. If multiple GPUs are used for model training, this parameter specifies the size of each batch scheduled to each GPU.	32	INT
numEpochs	No	The number of epochs for model training.	1	INT
optimizerType	No	The type of optimizer. Valid values: adam, lamb, adagrad, and adadelta.	adam	STRING
learningRate	No	The learning rate.	2e-5	FLOAT
advancedParameters	No	Other advanced parameters. For more information, refer to the following table.	None	STRING
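
To make the inputSchema format concrete, the following Python sketch splits a schema string into its Column name:Type:Length parts and checks the type against the values listed in the table. It only illustrates the format. `ColumnSpec` and `parse_input_schema` are hypothetical names, and this is not how EasyTransfer itself parses the parameter.

```python
from typing import List, NamedTuple


class ColumnSpec(NamedTuple):
    name: str
    dtype: str
    length: int


VALID_TYPES = {"int", "str", "float"}


def parse_input_schema(schema: str) -> List[ColumnSpec]:
    """Split 'name:type:length,...' into ColumnSpec entries and validate them."""
    specs = []
    for field in schema.split(","):
        parts = field.strip().split(":")
        if len(parts) != 3:
            raise ValueError(f"expected name:type:length, got {field!r}")
        name, dtype, length = parts
        if dtype not in VALID_TYPES:
            raise ValueError(f"unsupported type {dtype!r} in {field!r}")
        specs.append(ColumnSpec(name, dtype, int(length)))
    return specs


# The schema string used by the training command in this topic.
schema = parse_input_schema("content:str:1,label:str:1")
print(schema)
```

A quick check like this catches typos such as a missing length before a training run is started.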

The following table describes the advanced parameters.

Parameter	Required	Description	Default value	Type
pretrain_model_name_or_path	No	The pre-trained model. You can specify a pre-trained model provided by EasyTransfer or specify the Object Storage Service (OSS) path of a custom pre-trained model.	pai-bert-base-zh	STRING

Step 3: Evaluate the model

After you train the model, run the following command to test or evaluate the training result:

easy_transfer_app \
  --mode=evaluate \
  --inputTable=./dev.csv \
  --checkpointPath=./classify_models/model.ckpt-64 \
  --batchSize=10

The following table describes the parameters.

Parameter	Required	Description	Default value	Type
mode	Yes	The mode that is used. Valid values: train, evaluate, predict, and export.	None	STRING
inputTable	Yes	The input table for model evaluation. Separate multiple tables with commas (,). Example: `./dev.csv`. Important: The column schemas of the datasets for model training and model evaluation must be the same.	None	STRING
checkpointPath	Yes	The directory of the CKPT file for the model. Example: ./classify_models/model.ckpt-32.	None	STRING
batchSize	No	The size of each evaluation batch. If multiple GPUs are used, this parameter specifies the size of each batch scheduled to each GPU.	32	INT
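
The checkpointPath value is a checkpoint prefix such as ./classify_models/model.ckpt-64, so you need to know which training steps were saved. The following Python sketch lists the prefixes in a directory. It assumes the usual TensorFlow checkpoint layout, in which each prefix `model.ckpt-<step>` has a companion `model.ckpt-<step>.index` file. This topic does not describe the directory contents, so verify the layout in your own checkpointDir. `list_checkpoint_prefixes` is a hypothetical helper.

```python
import os
import re
import tempfile


def list_checkpoint_prefixes(checkpoint_dir):
    """Return checkpoint prefixes in `checkpoint_dir`, sorted by training step.

    Assumption: every checkpoint `model.ckpt-<step>` is accompanied by a
    `model.ckpt-<step>.index` file, as in the usual TensorFlow layout.
    """
    pattern = re.compile(r"^(model\.ckpt-(\d+))\.index$")
    found = []
    for filename in os.listdir(checkpoint_dir):
        match = pattern.match(filename)
        if match:
            prefix = os.path.join(checkpoint_dir, match.group(1))
            found.append((int(match.group(2)), prefix))
    return [prefix for _, prefix in sorted(found)]


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (32, 64, 8):
        open(os.path.join(demo_dir, f"model.ckpt-{step}.index"), "w").close()
    demo_prefixes = [os.path.basename(p) for p in list_checkpoint_prefixes(demo_dir)]

print(demo_prefixes)  # ['model.ckpt-8', 'model.ckpt-32', 'model.ckpt-64']
```

Sorting by the numeric step rather than by file name keeps model.ckpt-8 ahead of model.ckpt-32.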

Step 4: Use the model to make predictions

After you train the model, run the following command to use the model to process a file. The file can be unlabeled.

easy_transfer_app \
  --mode=predict \
  --inputSchema=content:str:1,label:str:1 \
  --inputTable=dev.csv \
  --outputTable=dev.pred.csv \
  --firstSequence=content \
  --appendCols=label \
  --outputSchema=predictions,probabilities,logits \
  --checkpointPath=./classify_models/ \
  --batchSize=100

The following table describes the parameters.

Parameter	Required	Description	Default value	Type
mode	Yes	The mode that is used. Valid values: train, evaluate, predict, and export.	None	STRING
inputTable	Yes	The input table to be processed by the model. Example: `./dev.csv`.	None	STRING
outputTable	Yes	The output table that stores the prediction result. Example: `./dev.pred.csv`.	None	STRING
inputSchema	Yes	The schema of the columns in the input table, in the Column name:Type:Length format. Separate multiple columns with commas (,). Valid values of Type: int, str, and float. In most cases, Length is 1. If the column is a comma-separated array, Length is the length of the array. Example: content:str:1,label:str:1.	None	STRING
firstSequence	Yes	The column that corresponds to the first text sequence in the input table.	None	STRING
appendCols	No	The columns to be appended from the input table to the output table.	Empty string ""	STRING
outputSchema	No	The types of predicted values that you want the model to output. Separate multiple types with commas (,). The following types of predicted values are supported: predictions: If you use a single-label classification model, the model outputs the IDs of all categories that are sorted in the same order as the enumerate values specified in the labelEnumerateValues parameter. If you use a multi-label classification model, the model outputs multi-hot vectors that are separated by commas (,). probabilities: The model outputs the probabilities of all categories that are separated by commas (,). logits: The model outputs the logit values of all categories that are separated by commas (,).	predictions	STRING
checkpointPath	Yes	The directory of the model. Example: `./classify_models/`.	None	STRING
batchSize	No	The size of each prediction batch. If multiple GPUs are used for prediction, this parameter specifies the size of each batch scheduled to each GPU.	32	INT
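
The probabilities output is a comma-separated list with one value per category. The following Python sketch converts such a field into the most likely label name. It assumes that the i-th probability belongs to label ID i, which means the same order as the values in labelEnumerateValues. This topic does not confirm that ordering for the probabilities column, so verify it on a few rows where the label is known. `top_label` is a hypothetical helper, and the three-label list is shortened for the demonstration.

```python
def top_label(probabilities_field, labels):
    """Return (label, probability) for the highest-scoring category.

    Assumption: probabilities are ordered by label ID, which matches the
    order of the values passed to labelEnumerateValues.
    """
    probs = [float(p) for p in probabilities_field.split(",")]
    if len(probs) != len(labels):
        raise ValueError(f"got {len(probs)} probabilities for {len(labels)} labels")
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Shortened label list and a made-up probabilities field.
demo_labels = ["教育", "三农", "娱乐"]
demo_result = top_label("0.1,0.7,0.2", demo_labels)
print(demo_result)  # ('三农', 0.7)
```

The length check guards against pairing the output with a label list from a different training run.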

Step 5: Export the model files and deploy the model as an online Elastic Algorithm Service (EAS) service

Export the model files.

By default, the system automatically exports the variables and the saved_model.pb file of the last checkpoint after the model is trained. If you want to export the training results of other checkpoints, run the following command:

easy_transfer_app \
  --mode=export \
  --exportType=app_model \
  --checkpointPath=./classify_models/model.ckpt-64 \
  --exportDirBase=./export_model \
  --batchSize=100

The following table describes the parameters.

Parameter	Required	Description	Default value	Type
mode	Yes	The mode that is used. Valid values: train, evaluate, predict, and export.	None	STRING
exportType	Yes	The type of model files that you want to export. Valid values: app_model: Export fine-tuned model files. ez_bert_feat: Export model files that are required by text vectorization components.	None	STRING
checkpointPath	Yes	The directory of the CKPT file for the model.	None	STRING
exportDirBase	Yes	The directory of the exported model files.	None	STRING
batchSize	No	The size of each evaluation batch. If multiple GPUs are used, this parameter specifies the size of each batch scheduled to each GPU.	32	INT

Package the model files.
Package the exported variables, saved_model.pb, and vocab.txt files and the label_mapping file that is used to customize input. For example, the label_mapping file of a news classification model is label_mapping.json. The label IDs in the file must be of the INT type. The label IDs must be sorted in the same order as the enumerate values specified in the labelEnumerateValues parameter. The following code block shows an example of the label_mapping.json file:
```
{"教育": 0,
 "三农": 1,
 ...,
 "动漫": 27}
```
You can find the label_mapping.json file in the directory specified in the checkpointDir parameter.
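If you want to check that file, or rebuild it for a custom label set, the ordering rule above can be sketched in Python as follows. The sketch derives the mapping from the same string that was passed to labelEnumerateValues in Step 2. `build_label_mapping` is a hypothetical helper and is not part of EasyTransfer.

```python
import json

# The value passed to --labelEnumerateValues in the training command.
label_enumerate_values = (
    "教育,三农,娱乐,健康,美文,搞笑,美食,财经,科技,旅游,汽车,时尚,科学,文化,"
    "房产,热点,母婴,家居,体育,国际,育儿,宠物,游戏,健身,职场,读书,艺术,动漫"
)


def build_label_mapping(enumerate_values):
    """Map each label to an INT ID that follows the order of the input string."""
    labels = [v.strip() for v in enumerate_values.split(",") if v.strip()]
    if len(set(labels)) != len(labels):
        raise ValueError("duplicate labels in enumerate values")
    return {label: index for index, label in enumerate(labels)}


label_mapping = build_label_mapping(label_enumerate_values)
# ensure_ascii=False keeps the Chinese labels readable in the JSON text.
label_mapping_json = json.dumps(label_mapping, ensure_ascii=False)
print(label_mapping_json)
```

Comparing this result with the generated label_mapping.json is a quick way to confirm that the IDs follow the expected order.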
The following figure shows the files that are packaged.
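One way to build the package is with the Python standard library, as in the following sketch. It assumes that saved_model.pb, vocab.txt, label_mapping.json, and the variables directory have been gathered in one folder, and that the archive should store them at its top level. This topic does not state the required archive layout, so check it against the EasyTransfer Processor topic before you deploy. `package_model` and the folder names are hypothetical.

```python
import os
import tempfile
import zipfile

REQUIRED = ("saved_model.pb", "vocab.txt", "label_mapping.json", "variables")


def package_model(model_dir, zip_path):
    """Zip the required entries of `model_dir`, storing paths relative to it."""
    missing = [n for n in REQUIRED if not os.path.exists(os.path.join(model_dir, n))]
    if missing:
        raise FileNotFoundError(f"missing from {model_dir}: {missing}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in REQUIRED:
            path = os.path.join(model_dir, name)
            if os.path.isdir(path):
                # Add every file under the directory, keeping its relative path.
                for root, _, files in os.walk(path):
                    for filename in files:
                        full = os.path.join(root, filename)
                        archive.write(full, os.path.relpath(full, model_dir))
            else:
                archive.write(path, name)
    return zip_path


# Demonstration with empty placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "model")
    os.makedirs(os.path.join(model_dir, "variables"))
    for name in ("saved_model.pb", "vocab.txt", "label_mapping.json",
                 os.path.join("variables", "variables.index")):
        open(os.path.join(model_dir, name), "w").close()
    archive_path = package_model(model_dir, os.path.join(tmp, "your_model.zip"))
    with zipfile.ZipFile(archive_path) as archive:
        packaged_names = sorted(archive.namelist())

print(packaged_names)
```

The check for missing entries fails early, before an incomplete package is uploaded.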
Upload the package to OSS and record the OSS path of the package. Example: oss://xxx/your_model.zip.
Deploy the model. For more information, see EasyTransfer Processor.