【DSW Gallery】RoBERTa Chinese Text Matching Based on EasyNLP
EasyNLP (https://github.com/alibaba/EasyNLP) is an easy-to-use and feature-rich NLP algorithm framework developed by the Alibaba Cloud PAI algorithm team on top of PyTorch. It offers a one-stop NLP development experience from training to deployment, providing a variety of model training and prediction functions that aim to help NLP developers quickly build models and apply them in production.
Taking text matching as an example, this article shows how to use RoBERTa in PAI-DSW, based on EasyNLP, to quickly build, train, evaluate, and run predictions with a Chinese text matching model.
About RoBERTa
RoBERTa is an improved BERT-based pre-trained language representation model proposed by Facebook (now Meta) AI Research in July 2019; its full name is Robustly Optimized BERT Pretraining Approach. Like BERT, RoBERTa uses the Transformer encoder architecture. However, it adds extra training data and improves on BERT's pre-training strategy, yielding a more powerful pre-trained language model that significantly outperforms BERT on natural language understanding (NLU) tasks.
Operating environment requirements
It is recommended to use Python 3.6, a PyTorch 1.8 image, a P100 or V100 GPU, and at least 32 GB of memory.
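Before installing, you can optionally confirm that the image meets these requirements. The sketch below uses only standard Python and PyTorch calls; the exact version strings will vary by image.
import sys
import torch

# Report the interpreter and framework versions, and check GPU availability.
print("Python:", sys.version)
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))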
EasyNLP installation
It is recommended to download the EasyNLP source code from GitHub and install it with the following commands:
!git clone https://github.com/alibaba/EasyNLP.git
!pip install -r EasyNLP/requirements.txt
!cd EasyNLP && python setup.py install
You can verify the installation with the following command:
!which easynlp
If the easynlp CLI tool is found on your system, the EasyNLP code library has been installed successfully.
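You can also check from Python that the package imports correctly; this quick check is not part of the original tutorial.
# Importing the package and printing its location confirms the install.
import easynlp
print(easynlp.__file__)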
Data preparation
First, enter the tutorial directory and download the training and development sets used in this example. (The folder that saves the model is specified later via args.checkpoint_dir.) In a notebook, use the %cd magic so the directory change persists across cells:
%cd EasyNLP/examples/appzoo_tutorials/text_match/single_tower
!wget http://atp-modelzoo.oss-cn-hangzhou.aliyuncs.com/release/tutorials/ez_text_match/afqmc_public/train.csv
!wget http://atp-modelzoo.oss-cn-hangzhou.aliyuncs.com/release/tutorials/ez_text_match/afqmc_public/dev.csv
After the download completes, you can view the first 5 rows with the following code. Each row is one example, and each column is a field value, including the two sentences to be matched and the corresponding match label.
print('Training data sample:')
!head -n 5 train.csv
print('Development set data sample:')
!head -n 5 dev.csv
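If you prefer to inspect the data in Python, here is a short pandas sketch. The column names follow the input_schema used later in this tutorial; the files have no header row, and the separator is assumed to be a tab (adjust sep if your copy is comma-separated).
import pandas as pd

# Column names mirror the input_schema defined later in this tutorial.
cols = ["example_id", "sent1", "sent2", "label", "cate", "score"]
train_df = pd.read_csv("train.csv", sep="\t", header=None, names=cols)
print(train_df.head())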
Initialization
In the Python 3.6 environment, we first import the libraries required to run the model from the newly installed EasyNLP and perform some initialization. In this tutorial we use chinese-roberta-wwm-ext. EasyNLP integrates a rich library of pre-trained models; if you want to try others, such as BERT or ALBERT, you can modify user_defined_parameters accordingly. The specific model names can be found in the model list.
# To avoid conflicts between EasyNLP's args and the Jupyter system, argv must be set manually; otherwise initialization fails.
# If you run this code on the command line or in a .py file, you can skip the following two lines.
import sys
sys.argv = ['main.py']
import torch.cuda
from easynlp.appzoo import SingleTowerDataset
from easynlp.appzoo import get_application_predictor, get_application_model, get_application_evaluator, get_application_model_for_evaluation
from easynlp.core import Trainer, PredictorManager
from easynlp.utils import initialize_easynlp, get_args, get_pretrain_model_path
from easynlp.utils.global_vars import parse_user_defined_parameters
initialize_easynlp()
args = get_args()
user_defined_parameters = parse_user_defined_parameters('pretrain_model_name_or_path=hfl/chinese-roberta-wwm-ext loss_type=hinge_loss margin=0.45 gamma=32 embedding_size=256')
args.checkpoint_dir = "./text_match_single_tower_model_dir"
Note: If the above code raises an "Address already in use" error, run the following commands to clean up the process occupying the port:
!netstat -tunlp | grep 6000
!kill -9 PID
(Replace PID with the process ID shown in the output of the previous command.)
Load data
We use the SingleTowerDataset built into EasyNLP to load the training and development data. The main parameters are as follows:
• pretrained_model_name_or_path: the pre-trained model name or path; here we use the get_pretrain_model_path helper to resolve the model name "hfl/chinese-roberta-wwm-ext" to a local path, downloading the model automatically if needed
• max_seq_length: the maximum text length; longer inputs are truncated and shorter ones are padded
• first_sequence, second_sequence, label_name: specify which fields in input_schema serve as the input sentence pair and the label column
• label_enumerate_values: the enumeration of label values
• is_training: whether this is the training stage; True for train_dataset, False for valid_dataset
train_dataset = SingleTowerDataset(
pretrained_model_name_or_path=get_pretrain_model_path("hfl/chinese-roberta-wwm-ext"),
data_file="train.csv",
max_seq_length=128,
input_schema="example_id:str:1,sent1:str:1,sent2:str:1,label:str:1,cate:str:1,score:str:1",
first_sequence="sent1",
second_sequence="sent2",
label_name="label",
label_enumerate_values="0,1",
is_training=True)
valid_dataset = SingleTowerDataset(
pretrained_model_name_or_path=get_pretrain_model_path("hfl/chinese-roberta-wwm-ext"),
data_file="dev.csv",
max_seq_length=128,
input_schema="example_id:str:1,sent1:str:1,sent2:str:1,label:str:1,cate:str:1,score:str:1",
first_sequence="sent1",
second_sequence="sent2",
label_name="label",
label_enumerate_values="0,1",
is_training=False)
Since we chose hfl/chinese-roberta-wwm-ext earlier, the pre-trained model is automatically downloaded and loaded here.
Model training
After processing the data and loading the model, we start training. We use EasyNLP's get_application_model function to build the model for training; its parameters are as follows:
• app_name: the task name; here we select text matching, "text_match"
• pretrained_model_name_or_path: the pre-trained model name or path; as above, get_pretrain_model_path resolves the model name to a local path and downloads the model automatically if needed
• num_labels: the number of classes; the dataset in this example is a binary classification dataset
• user_defined_parameters: the custom parameters parsed above; pass user_defined_parameters directly
model = get_application_model(app_name="text_match",
pretrained_model_name_or_path=get_pretrain_model_path("hfl/chinese-roberta-wwm-ext"),
num_labels=2,
user_defined_parameters=user_defined_parameters)
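As a quick sanity check before training, you can count the model's parameters. This is a generic PyTorch sketch, not part of the original tutorial.
# Total and trainable parameter counts for the assembled model.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total:,}  trainable: {trainable:,}")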
The log shows that the pre-trained model's parameters have been loaded. Next, we use the Trainer class in EasyNLP to create a training instance and run training.
trainer = Trainer(model=model,
train_dataset=train_dataset,
evaluator=get_application_evaluator(app_name="text_match",
valid_dataset=valid_dataset,
eval_batch_size=32,
user_defined_parameters=user_defined_parameters))
trainer.train()
Model evaluation
After training completes, the trained model is saved to the checkpoint_dir specified at the beginning, i.e. the local path "./text_match_single_tower_model_dir/". We can now evaluate the trained model's performance. First, we use EasyNLP's get_application_model_for_evaluation method to build the evaluation model.
model = get_application_model_for_evaluation(app_name="text_match",
pretrained_model_name_or_path="./text_match_single_tower_model_dir/",
user_defined_parameters=user_defined_parameters)
Then we use get_application_evaluator in EasyNLP to initialize the evaluator, move the model to the current device, and run the evaluation.
evaluator = get_application_evaluator(app_name="text_match",
valid_dataset=valid_dataset,
eval_batch_size=32,
user_defined_parameters=user_defined_parameters)
model.to(torch.cuda.current_device())
evaluator.evaluate(model=model)
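Model prediction
The imports above also bring in get_application_predictor and PredictorManager, which EasyNLP provides for batch prediction. The sketch below shows how they are typically wired together in EasyNLP's AppZoo tutorials; the output_schema values and the output file name here are illustrative, so check the repository's examples for the exact signature.
predictor = get_application_predictor(app_name="text_match",
model_dir="./text_match_single_tower_model_dir/",
first_sequence="sent1",
second_sequence="sent2",
sequence_length=128,
user_defined_parameters=user_defined_parameters)
predictor_manager = PredictorManager(predictor=predictor,
input_file="dev.csv",
input_schema="example_id:str:1,sent1:str:1,sent2:str:1,label:str:1,cate:str:1,score:str:1",
output_file="dev.pred.csv",
output_schema="predictions,probabilities,logits",
append_cols="label",
batch_size=32)
predictor_manager.run()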