What Is a Human-Machine Dialogue Model

Dialogue Management Model Background

Since the early days of artificial intelligence research, people have worked to build highly intelligent human-computer dialogue systems. Alan Turing proposed the Turing test in 1950 [1]: if a human cannot tell whether they are talking with a machine or with another human, the machine can be said to have passed the test and to possess a high degree of intelligence. The first generation of dialogue systems was mainly rule-based. For example, the ELIZA system [2], developed at MIT in 1966, is a psychotherapy chatbot built on template matching. Another example is the flowchart-based dialogue system that became popular in the 1970s, in which the state transitions of the dialogue flow are modeled with a finite state automaton. The advantage of these systems is that their internal logic is transparent and easy to analyze and debug, but they depend heavily on manual work by experts, and their flexibility and scalability are poor.

With the rise of big data technology, a second generation of data-driven dialogue systems based on statistical methods (hereinafter, statistical dialogue systems) emerged. At this stage, reinforcement learning also began to be widely studied and applied. The most representative work is the statistical dialogue system based on the Partially Observable Markov Decision Process (POMDP), proposed by Professor Steve Young of Cambridge University in 2005 [3]. This system is significantly more robust than rule-based dialogue systems. It performs Bayesian inference on the observed speech recognition results to maintain the dialogue state in each turn, then selects a dialogue policy according to that state and generates a natural language response. The POMDP-based dialogue system adopts the reinforcement learning framework: it optimizes the dialogue policy by trial and error, interacting continuously with user simulators or real users and collecting reward scores. A statistical dialogue system is modular, which avoids heavy dependence on experts, but its models are difficult to maintain and its scalability is relatively limited.
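The Bayesian tracking step described above can be sketched as follows. This is a minimal illustration rather than the system in [3]: the cuisine states, the per-turn observation likelihoods, and the simplification that the user's goal does not change between turns are all assumptions made for this example.

```python
def update_belief(belief, obs_likelihood):
    """One Bayesian update: posterior(s) is proportional to P(observation | s) * prior(s)."""
    unnormalized = {s: obs_likelihood.get(s, 0.0) * p for s, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        return dict(belief)  # the observation rules out every state; keep the prior
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical example: which cuisine does the user want?
belief = {"chinese": 1 / 3, "italian": 1 / 3, "indian": 1 / 3}

# Assumed noisy speech recognition scores, P(observed ASR result | true user goal).
turn1 = {"chinese": 0.6, "italian": 0.3, "indian": 0.1}
turn2 = {"chinese": 0.7, "italian": 0.2, "indian": 0.1}

for observation in (turn1, turn2):
    belief = update_belief(belief, observation)

best = max(belief, key=belief.get)  # most likely dialogue state after two turns
```

Because the belief is a distribution rather than a single guess, one misrecognized turn only lowers the probability of the correct state instead of discarding it, which is the source of the robustness mentioned above.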

In recent years, with the major breakthroughs of deep learning in image, speech, and text processing, a third generation of dialogue systems with deep learning as the main method has emerged. These systems keep the framework of the statistical dialogue system, but every module is a neural network model. Because neural networks have strong representation ability, their language classification and generation abilities have improved greatly. An important trend is therefore that natural language understanding models have moved from generative models (such as Bayesian networks) to deep discriminative models (such as CNN, DNN, RNN) [5]: the dialogue state is no longer inferred through Bayesian posterior estimation but obtained by directly computing the maximum conditional probability. For dialogue policy optimization, researchers have also begun to adopt deep reinforcement learning models [6]. On the other hand, the success of end-to-end sequence-to-sequence techniques in machine translation has made it possible to design end-to-end dialogue systems. Facebook researchers proposed a task-oriented dialogue system based on memory networks [4], which opened a new direction for end-to-end task-oriented dialogue research in third-generation dialogue systems. In general, third-generation dialogue systems perform better than second-generation systems but require a large amount of labeled data for effective training. Improving the cross-domain transfer and extension capabilities of the model has therefore become a popular research direction.

Common dialogue systems can be divided into three categories: chat, task-oriented, and question-answer.

The goal of chat dialogue is to generate interesting and informative natural responses so that the human-computer dialogue can continue [7].
Question-and-answer dialogue refers to one question and one answer. The user asks a question, and the system returns the correct answer by analyzing the question and searching the knowledge base [8].
Task-oriented dialogue (hereinafter also called task-based dialogue) refers to multi-turn dialogue driven by a task. The machine needs to determine the user's goal through understanding, active inquiry, clarification, and so on, then call the corresponding API and return the correct result to satisfy the user's need. Task-based dialogue can usually be understood as a sequential decision process: during the dialogue, the machine updates and maintains an internal dialogue state by understanding the user's utterances, and then chooses the next optimal action according to the current dialogue state (such as confirming a need, asking about constraints, or providing results) to complete the task.

Task-based dialogue systems can be divided into two types structurally. One is the pipeline system, which adopts a modular structure [5] (as shown in Figure 1), and generally includes four key modules:

Natural Language Understanding (NLU): Identify and analyze the user's text input, and obtain computer-understandable semantic tags such as slot values and intentions.
Dialog State Tracking (DST): Maintain the current dialogue state according to the dialogue history. The dialogue state is the cumulative semantic representation of the entire dialogue history, generally a set of slot-value pairs.
Dialog Policy: Output the next system action according to the current dialogue state. The dialog state tracking module and the dialog policy module are generally referred to together as the dialog management module (Dialog Manager, DM).
Natural Language Generation (NLG): Convert system actions into natural language output.

This modular system structure is highly interpretable and easy to implement. Most of the practical task-based dialogue systems in the industry adopt this structure. However, its disadvantage is that it is not flexible enough, each module is relatively independent, and it is difficult to jointly optimize and adapt to changing application scenarios. And because the errors between modules will accumulate layer by layer, the upgrade of a single module may also require the entire system to be adjusted together.
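To make the division of labor between the four modules concrete, here is a toy pipeline. It is only a sketch: the keyword rules, the `REQUIRED_SLOTS` tuple, and the response templates are invented for this example, and each function stands in for what would normally be a trained model.

```python
import re

REQUIRED_SLOTS = ("cuisine", "area")  # hypothetical ontology for a restaurant task


def nlu(utterance):
    """Toy NLU: keyword rules that produce an intent and slot values."""
    text = utterance.lower()
    slots = {}
    match = re.search(r"\b(chinese|italian|indian)\b", text)
    if match:
        slots["cuisine"] = match.group(1)
    match = re.search(r"\b(north|south|centre)\b", text)
    if match:
        slots["area"] = match.group(1)
    return {"intent": "inform" if slots else "unknown", "slots": slots}


def dst(state, semantic_frame):
    """Toy DST: accumulate slot-value pairs over the dialogue history."""
    new_state = dict(state)
    new_state.update(semantic_frame["slots"])
    return new_state


def policy(state):
    """Toy policy: request the first missing slot, otherwise offer a result."""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return ("request", slot)
    return ("offer", dict(state))


def nlg(action):
    """Toy template-based NLG: turn a system action into text."""
    act, arg = action
    if act == "request":
        return f"What {arg} would you like?"
    return f"Searching for a {arg['cuisine']} restaurant in the {arg['area']}."


def respond(state, utterance):
    """Run one turn through NLU -> DST -> policy -> NLG."""
    state = dst(state, nlu(utterance))
    return state, nlg(policy(state))


state = {}
state, reply1 = respond(state, "I want Chinese food")
state, reply2 = respond(state, "Somewhere in the north please")
```

Each stage only sees the output of the previous one, which is what makes the structure easy to inspect, and also why an NLU mistake propagates unchanged into the state, the policy decision, and the final reply.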

Another implementation of the task-based dialogue system is the end-to-end system, which has been a popular research direction in academia in recent years [9][11] (as shown in Figure 2). Such a model aims to learn a single overall mapping from the user's natural language input to the system's natural language output. It is flexible and highly scalable, reduces labor cost in the design process, and breaks down the isolation between traditional modules. However, the end-to-end model places high demands on the quantity and quality of data, and it does not model processes such as slot filling and API calls explicitly enough. At this stage its effectiveness in industrial applications is limited, and it is still being explored.

As users' expectations of product experience rise, real dialogue scenarios become more complex, and the dialogue management module needs further improvement and innovation. The traditional dialogue management model is usually built on a well-defined conversational script (search first, then inquire, and finally end), and generally predefines the system action space, the user intent space, and the dialogue ontology. In practice, however, user behavior is varied and hard to predict, while the system's ability to respond is very limited. This gives traditional dialogue systems poor extensibility: they struggle with situations outside the predefined ones. In addition, many real industry scenarios suffer from cold start problems: sufficient labeled dialogue data is lacking, and data cleaning and labeling are expensive. As for model training, dialogue management models based on deep reinforcement learning generally require a large amount of data. Experiments in most papers show that training a dialogue model usually requires hundreds of complete dialogue sessions. Such low training efficiency hinders the rapid development and iteration of dialogue systems in practice.

To sum up, in view of the many limitations of the traditional dialogue management model, in recent years, researchers in academia and industry have begun to focus on how to strengthen the practicability of the dialogue management model. Specifically, there are three major problems:

Poor scalability
Less labeled data
Low training efficiency

Below we introduce recent research results along these three directions.

Introduction to the Frontiers of Dialogue Management Model Research
Pain point 1 of the dialogue management model: Poor scalability

As mentioned earlier, the dialogue manager consists of two parts: the dialogue state tracker (DST) and the dialogue policy. In traditional DST research, the most representative work is the neural belief tracker (NBT) [12], proposed by Cambridge University researchers in 2017, which uses neural networks to solve dialogue state tracking for complex single-domain dialogues. NBT uses representation learning to encode the previous system action, the current user utterance, and each candidate slot-value pair, and computes semantic similarity in a high-dimensional space to detect the slot values mentioned by the user in the current turn. With the help of word-vector representations of slot-value pairs, NBT can therefore recognize slot values that are semantically similar to known ones but were not seen in the training set, without relying on hand-built semantic dictionaries, which makes the set of slot values extensible. The Cambridge researchers later improved NBT [13]: the input slot-value pairs become domain-slot-value triples, the per-turn recognition results are accumulated by a learned model instead of hand-written rules, and all data is used to train a single model. This shares knowledge across domains, and the total number of model parameters does not grow with the number of domains. In traditional dialogue policy research, the most representative work is the ACER-based policy optimization proposed by Cambridge researchers [6].

By combining experience replay, the authors tried both the trust region actor-critic model and the episodic natural actor-critic model, and verified that actor-critic (AC) deep reinforcement learning algorithms achieved the best performance at the time in sample efficiency, convergence, and dialogue success rate.

However, the traditional dialogue management model still needs to be improved in terms of scalability, specifically in three aspects:

How to handle changing user intents,
How to handle changing slots and slot values,
How to handle changing system actions.

Changing user intent

In practical applications, a dialogue system often gives an unreasonable answer because it did not account for a particular user intent. In the example shown in Figure 3, the user's "confirm" intent was not considered, and new response scripts need to be added to help the system handle this situation.

A traditional model outputs a fixed one-hot vector over the old intent categories. Once a new user intent appears that was not seen in the training set, that vector has to be changed to include the new category, and the new model must be retrained from scratch, which reduces the maintainability and scalability of the model. The paper [15] proposed a "teacher-student" learning framework to alleviate this problem. The old model, together with logic rules written for the new user intents, acts as the "teacher", and the new model acts as the "student". The architecture uses knowledge distillation. Specifically, for the old intent set, the probability output of the old model directly guides the training of the new model; for the newly added intent, the corresponding logic rules serve as new labeled data for training the new model. As a result, the new model does not have to be retrained through new interactions with the environment. The paper conducts experiments on the DSTC2 dataset: it first deliberately removes the confirm intent, then adds it back to the dialogue ontology as a new intent, and checks whether the new model adapts well. Figure 4 shows the experimental results, comparing the paper's new model (the Extended System) with a model trained directly on data containing all intents (the Contrast System). The experiments show that the new model recognizes the new intent well under the different noise conditions tested, so it extends effectively to new intents.
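The way the teacher builds training targets for the student can be sketched as follows. This is an illustrative reading of the description above, not the implementation from [15]: the intent names, the `rule_fires` keyword rule, and the plain cross-entropy loss are assumptions made for this example.

```python
import math

OLD_INTENTS = ["inform", "request", "bye"]
NEW_INTENTS = OLD_INTENTS + ["confirm"]  # "confirm" is the newly added intent


def rule_fires(utterance):
    """Hypothetical logic rule that detects the new intent from surface patterns."""
    return utterance.lower().startswith(("is it", "did you say"))


def teacher_target(utterance, old_model_probs):
    """Build the target distribution over the extended intent set."""
    if rule_fires(utterance):
        # New intent: the logic rule supplies a hard label.
        return [1.0 if name == "confirm" else 0.0 for name in NEW_INTENTS]
    # Old intents: reuse the old model's probabilities; the new intent gets 0.
    return list(old_model_probs) + [0.0]


def cross_entropy(target, student_probs, eps=1e-12):
    """Distillation loss between the teacher target and the student output."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, student_probs))


target_old = teacher_target("I want cheap food", [0.8, 0.15, 0.05])
target_new = teacher_target("Is it in the north?", [0.5, 0.4, 0.1])
loss = cross_entropy(target_new, [0.1, 0.1, 0.1, 0.7])
```

The student is trained on these targets with ordinary supervised learning, which is why no new reinforcement learning interaction with the environment is needed.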

Of course, this architecture still requires some training of the system. The paper [16] proposes a semantic similarity matching model, CDSSM, that can handle new user intents without relying on labeled data or model retraining. CDSSM first uses the natural language descriptions of the user intents in the training data to learn an intent embedding encoder, which maps any intent description into a high-dimensional semantic space. At test time, the model can then generate the intent vector for a new intent directly from its natural language description and perform intent recognition with it. As the following sections show, many models that improve scalability adopt a similar idea: move the label from the output side of the model to the input side, and use a neural network to semantically encode the label (the label name itself or a natural language description of it) into a semantic vector, which is then used for semantic similarity matching.
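The idea of moving the label to the input side can be illustrated with a small matcher. This is not CDSSM: the bag-of-words `encode` function is a deliberately crude stand-in for a learned encoder, and the intent names and descriptions are invented for the example.

```python
import math
from collections import Counter


def encode(text):
    """Stand-in encoder: bag-of-words counts (a real system would use a trained network)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(utterance, intent_descriptions):
    """Pick the intent whose description is semantically closest to the utterance."""
    u = encode(utterance)
    return max(intent_descriptions, key=lambda name: cosine(u, encode(intent_descriptions[name])))


intents = {
    "book_flight": "book a flight ticket",
    "find_restaurant": "find a restaurant to eat",
}

# A new intent is added at test time by supplying only its description.
intents["play_music"] = "play a song or music"
predicted = classify("please play some music", intents)
```

Because the classifier has no fixed output layer, adding an intent changes only the dictionary of descriptions and not the model, which is the property that removes the need for retraining.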

The paper [43] offers another approach: it brings human customer service agents into the online operation of the system, handling user intents unseen in the training set through human-machine collaboration. The model uses an additional neural decision module (a judge) that decides, based on the dialogue state vector extracted by the current model, whether to request human help. If so, the current dialogue is handed to an online human agent to answer; if not, the model makes the prediction itself. Because the judge learned from data has some ability to tell whether the current dialogue contains a new intent, and human replies are assumed correct by default, this human-machine collaboration neatly handles unseen user behavior during online testing while maintaining relatively high dialogue accuracy.

Changing slots and slot values

In dialogue state tracking for multiple domains or complex domains, handling changes in slots and slot values has always been a difficult problem. For some slots the values are not enumerable, such as times, places, and person names, and for others the set of slot values changes dynamically, such as flights or the movies currently showing in cinemas. Traditional dialogue state tracking usually assumes by default that the set of slots and slot values is fixed, which greatly reduces the scalability of the system.

To address non-enumerable slot values, Google researchers [17] proposed the idea of a candidate set. For each slot, the system maintains a candidate set with a fixed upper bound on its size, containing up to k possible slot values from the dialogue so far, and assigns each slot value a score indicating the user's preference for that value in the current dialogue. The system first uses a bidirectional RNN model to find the values of a given slot contained in the current user utterance, then rescores and ranks them together with the existing values in the candidate set. In each turn, DST therefore only has to make its decision over a limited set of slot values, which solves the tracking problem for non-enumerable slot values. For tracking unseen slot values, one can generally use a sequence labeling model [18] or a semantic similarity matching model such as the Neural Belief Tracker [12].
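The bounded candidate set can be sketched as follows. The scoring function here (how often a value was mentioned) is only a placeholder for the learned scorer in [17], and the detected values per turn are assumed to come from a separate tagger.

```python
def update_candidates(candidates, new_values, score_fn, k=3):
    """Merge newly detected values into the candidate set, rescore, and keep the top k."""
    pool = set(candidates) | set(new_values)
    scored = {value: score_fn(value) for value in pool}
    # Highest score first; ties are broken alphabetically to keep the result deterministic.
    top = sorted(scored, key=lambda v: (-scored[v], v))[:k]
    return {value: scored[value] for value in top}


history = []      # every slot value mentioned so far
candidates = {}   # bounded candidate set for one slot, e.g. "time"

# Hypothetical slot values detected in four consecutive user turns.
turns = [["7pm"], ["8pm", "7pm"], ["9pm"], ["6pm"]]
for detected in turns:
    history.extend(detected)
    candidates = update_candidates(candidates, detected, history.count, k=3)
```

However many distinct values the user mentions, the tracker only ever ranks at most k of them, so the slot does not need a predefined vocabulary.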

The above covers the case where slot values are not fixed. What if the slots in the dialogue ontology also change? The paper [19] uses a slot description encoder to encode the natural language description of any slot (seen or unseen) into a semantic vector representing that slot. This vector is fed into a Bi-LSTM model together with the user utterance, and the model outputs the identified slot values by sequence labeling, as shown in Figure 5. The paper makes the reasonable assumption that a natural language description of any slot is easy to obtain, and on that basis designs a concept tagging architecture that is shared across domains. The slot description encoder is implemented simply as the sum of word vectors. Experiments show that the model adapts quickly to new slots and that the method is far more scalable than traditional methods.

With the development of sequence-to-sequence techniques in recent years, directly using an end-to-end neural network to generate the DST result as a sequence has also become a popular direction. Common techniques such as the attention mechanism and the copy mechanism can be used to improve generation quality. On the well-known multi-domain MultiWOZ dataset, the team of Professor Pascale Fung at the Hong Kong University of Science and Technology used a copy network to significantly improve recognition accuracy for non-enumerable slots [20]. Their TRADE model is shown in Figure 6. Each time a slot value is to be detected, the model semantically encodes the combination of domain and slot and uses the encoding as the initial input of the RNN decoder, and the decoder then generates the corresponding slot value directly through the copy network. With this generative approach, the same model handles both non-enumerable slot values and changing slot values, shares slot value information across domains, and generalizes much better.

A clear recent trend is to treat multi-domain DST as a machine reading comprehension task and to turn the generative TRADE model into a discriminative model [45]. Non-enumerable slots are tracked with a machine reading comprehension task similar to SQuAD [46], which finds the text span corresponding to the slot value in the dialogue history and question. Enumerable slots are tracked as a multiple-choice reading comprehension task, which selects the correct value from the candidate values as the predicted slot value. By incorporating deep contextual word representations such as ELMo and BERT, these newly proposed models achieved the best results on the MultiWOZ dataset at the time.

Changing System Actions

The final aspect of the scalability problem is that the system action space is difficult to predefine. As shown in Figure 7, when designing a recommendation system for electronic products, the designers may not anticipate that users will ask how to upgrade the product's operating system, but in reality you cannot restrict users to asking only the questions the system can solve. If the system action space is fixed in advance, a new kind of user question leads to a series of irrelevant answers and a very poor user experience.

What needs to be considered here is how to design a better dialogue policy network so that the system can quickly take on new actions. The first attempt comes from Microsoft [21], who modified the classic DQN structure to enable reinforcement learning over an unbounded action space. The dialogue task in the paper is a text game: the action in each turn is a sentence, the number of available actions varies, and choosing different actions leads the storyline in different directions. The authors propose a new model, the Deep Reinforcement Relevance Network (DRRN), which obtains the Q function by semantically matching the current dialogue state against each available system action one by one. Specifically, in a given turn, each variable-length action text is encoded by a neural network into a fixed-length system action vector, the story's background text is encoded by another neural network into a fixed-length dialogue state vector, and the final Q-value is produced from the two vectors by an interaction function such as the dot product. Figure 8 shows the model structure designed in the paper. Experiments show that DRRN performs better than traditional DQN (using padding techniques) on the two text games "Saving John" and "Machine of Death".
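The DRRN scoring scheme can be outlined in a few lines. The sketch below only shows the structure (encode state and action separately, then combine with a dot product); the `embed` function is a hand-made stand-in for the two learned networks in [21], so the resulting Q-values carry no real meaning.

```python
def embed(text, dim=16):
    """Stand-in for a learned text encoder: maps text of any length to a fixed-length vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # Deterministic toy hashing of each word into one of `dim` buckets.
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec


def q_value(state_text, action_text):
    """Q(s, a) as the dot product of a state vector and an action vector."""
    s = embed(state_text)    # DRRN uses a dedicated state network here
    a = embed(action_text)   # and a separate action network here
    return sum(x * y for x, y in zip(s, a))


def best_action(state_text, actions):
    """Score every available action against the state; the list may have any length."""
    return max(actions, key=lambda a: q_value(state_text, a))


state = "you are standing in front of a locked door"
actions = ["open the door", "walk away", "look for the key under the mat"]
chosen = best_action(state, actions)
```

Since each action is scored independently, the network has no output layer whose size depends on the number of actions, so new actions can be offered at any turn without changing the model.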

The paper [22] approaches this problem from the perspective of the dialogue system as a whole. The authors propose an Incremental Dialogue System (IDS), shown in Figure 9. The system first encodes the dialogue history with a Dialogue Embedding module to obtain a context vector, then uses a VAE-based Uncertainty Estimation module to estimate, from the context vector, how confident the current system is that it can give the correct answer. In a manner similar to active learning, if the confidence is above a threshold, the dialogue manager scores all currently available actions one by one and predicts a probability distribution with the softmax function. If the confidence is below the threshold, a human annotator is asked to label the reply for the current turn (by selecting the correct reply or creating a new one), and the new data is added to the data pool and used to update the model online. Through this human-teaching approach, the IDS system not only solves the learning problem of an unbounded action space but also collects high-quality data quickly, which is very close to real production use.
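The confidence-based routing in IDS can be sketched as below. The sketch assumes, as the human-teaching description suggests, that low-confidence turns are sent to a human annotator; the function names, the threshold value, and the way confidence and action scores are supplied are all hypothetical, since in [22] they come from trained modules.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ids_step(context, confidence, action_scores, actions, ask_human, data_pool, threshold=0.5):
    """Answer automatically when confident, otherwise ask a human and store the new example."""
    if confidence >= threshold:
        probs = softmax(action_scores)
        return actions[probs.index(max(probs))]
    reply = ask_human(context, actions)   # the human picks an existing reply or writes a new one
    if reply not in actions:
        actions.append(reply)             # the action space grows incrementally
    data_pool.append((context, reply))    # new labeled data for the online model update
    return reply


actions = ["greet", "recommend_product"]
pool = []
auto_reply = ids_step("any laptops?", 0.9, [0.1, 2.0], actions, None, pool)
human_reply = ids_step("how do I upgrade the OS?", 0.2, [0.1, 2.0], actions,
                       lambda ctx, acts: "explain_upgrade", pool)
```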

Pain point 2 of the dialogue management model: Less labeled data

As dialogue systems are applied in more and more fields, the demand for data also becomes more diverse. Training a task-based dialogue system usually requires as much data from the target field as possible, but high-quality labeled data comes at a high cost. Researchers have explored several ways to address this, which fall into three main lines of thought:

Use machines to automatically label data to reduce the cost of data labeling;
Dialogue structure mining, using unlabeled data as efficiently as possible;
Strengthen the data collection strategy to efficiently obtain high-quality data.

Automatic labeling by machine

Because manual data labeling is expensive and slow, researchers hope to use machines to assist it. The methods fall roughly into two categories: supervised and unsupervised. The paper [23] proposes an architecture called auto-dialabel, which uses unsupervised hierarchical clustering to automatically group the intents and slots in dialogue data and thereby label the data automatically (the specific label for each category still has to be assigned manually). The approach rests on the assumption that expressions of the same intent may share similar background features. The initial features extracted by the model include word vectors, POS tags, noun word clusters, and LDA. Each feature is converted by an autoencoder into a vector of the same dimension, and the vectors are concatenated. The RBF (radial basis function) is then used to compute inter-class distances for dynamic hierarchical clustering: the closest classes are merged automatically until the inter-class distance exceeds a preset threshold. The model framework is shown in Figure 10.
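The threshold-stopped merging described above can be sketched as follows. The exact distance used in [23] is not given here, so the `1 - exp(-gamma * d^2)` form, the average linkage, and the threshold of 0.5 are assumptions for illustration; the two-dimensional points stand in for the concatenated autoencoder features.

```python
import math


def rbf_distance(x, y, gamma=1.0):
    """Assumed RBF-based distance: 0 for identical points, approaching 1 for distant ones."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1.0 - math.exp(-gamma * sq)


def cluster_distance(c1, c2):
    """Average-linkage distance between two clusters."""
    return sum(rbf_distance(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))


def agglomerate(points, threshold=0.5):
    """Merge the two closest clusters until the smallest distance exceeds the threshold."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        pairs = [(cluster_distance(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical utterance feature vectors forming two obvious groups.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters = agglomerate(features)
```

Stopping on a distance threshold rather than a target cluster count means the number of intents does not have to be known in advance; a person only needs to name the resulting groups.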

The paper [24] uses a supervised clustering method for machine labeling. The authors treat each piece of dialogue data as a graph node and treat clustering as the process of finding a minimum spanning forest. The model first uses an SVM, trained with supervision on a question answering dataset, to learn a distance scoring model between pairs of nodes. It then combines a structured model with a minimum spanning tree algorithm to infer the category of each piece of dialogue data as a latent variable, and outputs the best clustering structure to represent the user intent categories.

Dialogue Structure Mining

Because high-quality labeled data for training dialogue systems is scarce, how to fully mine the hidden dialogue structure or information in unlabeled dialogue data has become a current research hotspot. Mining this structure helps with the design of dialogue policies and the training of dialogue models.

The paper [25] proposes an unsupervised method that uses a variational RNN (VRNN) to automatically learn hidden structures in dialogue data. The authors give two models for capturing the dynamics of a dialogue: Discrete-VRNN (D-VRNN) and Direct-Discrete-VRNN (DD-VRNN). As shown in Figure 11, x_t is the t-th turn of the dialogue, h_t is the hidden variable for the dialogue history, and z_t is the hidden variable for the dialogue structure (a one-hot discrete variable). The difference between the two models is that in D-VRNN the hidden variable z_t depends on h_(t-1), while in DD-VRNN the hidden variable z_t depends on z_(t-1). VRNN estimates the posterior distribution of the hidden variable z_t by maximizing the likelihood of the entire dialogue, using some common VAE techniques.

Experiments in the paper show that VRNN outperforms traditional HMM methods, and that adding dialogue structure information to the reward function also helps the reinforcement learning model converge faster. Figure 12 visualizes the transition probabilities of the hidden variable z_t in the restaurant domain, as mined by D-VRNN.

CMU researchers [26] also used the VAE method, inferring system actions as latent variables and using them directly for dialogue policy selection, which alleviates the problems caused by an incomplete set of predefined system actions. As shown in Figure 13, the paper uses an end-to-end dialogue system framework for simplicity. The baseline is a word-level reinforcement learning model (that is, the dialogue actions are words in the vocabulary): the dialogue history is encoded by an encoder, a decoder then generates the dialogue reply, and the reward is obtained directly by comparing the generated reply with the real reply. The latent action model proposed by the authors differs from the baseline in that it adds posterior inference over discrete latent variables between the encoder and the decoder. Dialogue actions are represented by these discrete latent variables, with no manually defined actions. The final experiments show that the end-to-end reinforcement learning model based on latent actions outperforms the baseline in both the diversity of generated sentences and the task completion rate.

Data collection strategy

Recently, Google researchers proposed a method for quickly collecting dialogue data [27] (see Figure 14). First, two rule-based simulators interact to generate a dialogue outline, that is, a skeleton of the dialogue flow represented by semantic tags. Templates are then used to turn the semantic tags into natural language dialogues. Finally, crowd workers rewrite the sentences so that the language of the dialogue data becomes more diverse. This reverse data collection method is efficient, yields complete annotations and highly usable data, and avoids the cost of collecting in-domain data and of extensive manual processing.
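The outline-to-template step can be sketched as follows. For brevity the outline is written by hand instead of being produced by two interacting simulators, and the dialogue acts, slots, and templates are invented for this example rather than taken from [27].

```python
# Hypothetical dialogue outline: (speaker, dialogue act, slot, value) per turn.
OUTLINE = [
    ("user", "inform", "cuisine", "italian"),
    ("system", "request", "area", None),
    ("user", "inform", "area", "north"),
    ("system", "offer", "name", "Roma"),
]

# One assumed template per dialogue act; unused fields are simply ignored.
TEMPLATES = {
    "inform": "I want {slot} to be {value}.",
    "request": "What {slot} would you like?",
    "offer": "How about {value}?",
}


def realize(outline):
    """Turn each semantic frame into template text while keeping its annotation."""
    dialogue = []
    for speaker, act, slot, value in outline:
        text = TEMPLATES[act].format(slot=slot, value=value)
        dialogue.append({"speaker": speaker, "text": text,
                         "act": act, "slot": slot, "value": value})
    return dialogue


dialogue = realize(OUTLINE)
```

Crowd workers would then paraphrase only the `text` field. The labels exist before the sentences do, so the rewritten data is fully annotated without a separate labeling pass.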

The above method is a machine-to-machine (M2M) data collection strategy: first generate semantic labels for dialogue data with broad coverage, then use crowdsourcing to produce a large amount of dialogue text. Its disadvantage is that the generated dialogues are relatively limited and cannot cover every possibility in real scenarios, and the result depends on the quality of the simulator.

Two other methods are commonly used in academia to collect dialogue system data: human-to-machine (H2M) and human-to-human (H2H). The H2H method has a user (played by one crowd worker) and a customer service agent (played by another crowd worker) conduct a multi-turn dialogue. The user is responsible for making requests based on specified dialogue goals (such as buying a plane ticket), and the agent is responsible for annotating the dialogue labels and writing the dialogue replies. This mode is called the Wizard-of-Oz framework, and many datasets for dialogue research, such as WOZ [5] and MultiWOZ [28], were collected this way. The H2H method yields the dialogue data closest to real business scenarios, but it requires designing a different interactive interface for each task and a lot of manual work to clean up wrong labels, which is quite expensive. The H2M strategy lets users talk directly to a machine that has already been trained to a certain level, collects data online, and uses reinforcement learning to keep improving the dialogue management model. The well-known DSTC2 and DSTC3 datasets were collected this way. The effectiveness of the H2M method generally depends on the initial quality of the dialogue management model, and the data collected online is noisy and costly to clean, which reduces the efficiency of model optimization.

Pain point 3 of the dialogue management model: Low training efficiency

Following the great success of deep reinforcement learning in the game of Go, the method has also been widely applied to task-oriented dialogue. For example, the ACER dialogue management method in the paper [6] uses model-free deep reinforcement learning. By combining experience replay, trust region constraints, pre-training, and other techniques, it greatly improves the training efficiency and stability of reinforcement learning algorithms in task-based dialogue.

However, simply applying reinforcement learning algorithms is not enough for practical dialogue systems. The main reason is that, unlike Go, the dialogue domain has no clear rules or reward function, no simple and well-defined action space, and no perfect environment simulator that can generate hundreds of millions of high-quality interactions. Dialogue tasks generally involve many slot values and action intents, which makes the action space of the dialogue system grow sharply and makes it hard to predefine. Traditional flat reinforcement learning one-hot encodes all system actions and therefore suffers from the curse of dimensionality, so it is no longer suitable for complex dialogue problems with very large action spaces. Researchers have made many attempts to address this, in three directions: model-free RL, model-based RL, and human-in-the-loop.

Model-free Reinforcement Learning – Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning (HRL) is based on the idea of divide and conquer: it decomposes a complex task into multiple subtasks, which avoids the curse of dimensionality of traditional flat reinforcement learning. The paper [29] was the first to apply HRL to task-oriented dialogue. The authors used expert knowledge to split a complex dialogue task into multiple subtasks along the temporal dimension. For example, a complex travel problem can be decomposed into subproblems such as booking flights, booking hotels, and renting a car. Based on this split, they designed a two-level dialogue policy network: one level selects and schedules the subtasks, and the other level executes the specific subtask.

Their proposed dialogue management model (shown in Figure 15) includes:

Top-level policy, used to select subtasks based on dialog state;
Low-level policy, used to complete a specific dialogue action of a subtask;
Global dialogue state tracking, recording the overall dialogue state. After the entire dialogue task is completed, the top-level policy receives an external reward.

In addition, the model adds an internal critic module, which estimates from the dialogue state how likely the subtask is to be completed (the degree to which the subtask's slots are filled). The low-level policy receives an intrinsic reward from the internal critic according to the degree of subtask completion.

In complex dialogue problems, traditional reinforcement learning selects a basic system action at every decision step, such as requesting a slot value or confirming a constraint. Hierarchical reinforcement learning instead first uses the top-level policy to select a set of basic actions, and then uses the low-level policy to select a basic action from the current set, as shown in Figure 16. This hierarchical division of the action space can capture the temporal constraints between different subtasks, which helps in completing composite dialogue tasks. By adding intrinsic rewards, the paper also effectively alleviates the sparse reward problem, speeds up reinforcement learning training, and to some extent avoids frequent switching between subtasks within a dialogue, improving the accuracy of action prediction. The hierarchical design of actions does rely heavily on expert knowledge, since experts must determine the types of subtasks. Recently, some work has appeared on automatically discovering dialogue subtasks, which segments dialogues automatically and avoids constructing the subtask structure by hand.
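The two-level control flow described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the subtasks, slot names, reward values, and the always-cooperative user are hypothetical, and the hand-written rules stand in for the learned policy networks of [29].

```python
# Hypothetical composite travel task: subtask -> slots it must fill.
SUBTASKS = {
    "book_flight": ["depart_city", "arrive_city"],
    "book_hotel": ["hotel_city", "nights"],
}


def top_level_policy(state):
    """Select the next unfinished subtask (stand-in for a learned top-level policy)."""
    for subtask, slots in SUBTASKS.items():
        if any(slot not in state for slot in slots):
            return subtask
    return None  # every subtask is complete


def low_level_policy(state, subtask):
    """Select a primitive action inside the subtask: request its next unfilled slot."""
    for slot in SUBTASKS[subtask]:
        if slot not in state:
            return ("request", slot)
    raise ValueError("subtask already complete")


def internal_critic(state, subtask):
    """Intrinsic reward: +1 when the subtask's slots are all filled, else a step penalty."""
    done = all(slot in state for slot in SUBTASKS[subtask])
    return (1.0 if done else -0.05), done


def run_episode():
    state, trace = {}, []
    subtask = top_level_policy(state)
    while subtask is not None:
        done = False
        while not done:
            act, slot = low_level_policy(state, subtask)
            state[slot] = "user_value"  # assumed: the user always answers the request
            intrinsic, done = internal_critic(state, subtask)
            trace.append((subtask, (act, slot), intrinsic))
        subtask = top_level_policy(state)
    extrinsic = 1.0  # assumed: external reward arrives only once the whole task ends
    return trace, extrinsic


trace, extrinsic = run_episode()
```

Because the low-level policy is rewarded at the end of each subtask rather than only at the end of the dialogue, it receives a learning signal much earlier, which is the sparse-reward benefit the paragraph describes.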

Model-free Reinforcement Learning – Feudal Reinforcement Learning

Feudal Reinforcement Learning (FRL) is another reinforcement learning method suited to high-dimensional problems. Hierarchical reinforcement learning divides the dialogue policy into sub-policies along the time dimension, according to the stages of the task, thereby reducing the complexity of policy learning. Feudal reinforcement learning instead divides the policy along the spatial dimension: it restricts the range of actions each sub-policy is responsible for, carving out "jurisdictions" and thereby reducing the complexity of each sub-policy. FRL does not divide the task into subtasks. Instead, it applies abstraction functions to the state space to extract useful features from the dialogue state. This abstraction helps FRL apply to large-scale problems and transfer between domains, giving it strong scalability.

Cambridge researchers were the first to apply feudal reinforcement learning [32] to task-oriented dialogue systems. They divide the action space according to whether an action is related to a slot, so that only the natural structure of the action space is used and no additional expert knowledge is needed. They proposed the feudal policy structure shown in Figure 17, whose decision process has two steps:

Decide whether the next action requires a slot as a parameter;
According to the decision of the first step and the corresponding slots, use different low-level policies to select the next action.
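The two-step decision can be sketched as below. The Q-values are hand-set numbers rather than outputs of trained networks, and the function and branch names are hypothetical; the sketch only shows how the master decision routes to slot-independent or slot-dependent sub-policies.

```python
def argmax(scores):
    """Return the key with the highest score."""
    return max(scores, key=scores.get)


def feudal_select(master_q, independent_q, dependent_q):
    """Two-step feudal decision.

    master_q:      {"slot_independent": q, "slot_dependent": q}
    independent_q: {action: q} for actions that take no slot
    dependent_q:   {slot: {action: q}}, one sub-policy output per slot
    Returns (action, slot), where slot is None for slot-independent actions.
    """
    # Step 1: decide whether the next action needs a slot as a parameter.
    branch = argmax(master_q)
    if branch == "slot_independent":
        return argmax(independent_q), None
    # Step 2: each slot's sub-policy scores its own actions; take the best pair.
    candidates = [
        (q, slot, action)
        for slot, action_qs in dependent_q.items()
        for action, q in action_qs.items()
    ]
    _, slot, action = max(candidates, key=lambda c: c[0])
    return action, slot


independent_q = {"hello": 0.3, "bye": 0.1}
dependent_q = {
    "food": {"request": 0.1, "confirm": 0.4},
    "area": {"request": 0.9, "confirm": 0.3},
}
choice = feudal_select(
    {"slot_independent": 0.2, "slot_dependent": 0.7}, independent_q, dependent_q
)
```

Each sub-policy only ever scores a handful of actions, so adding a slot adds one more small sub-policy output instead of enlarging a single flat output layer.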

In general, hierarchical reinforcement learning (HRL) and feudal reinforcement learning (FRL) split a high-dimensional, complex action space in different ways to address the low training efficiency that the large action space causes in traditional RL. HRL divides tasks in a way that matches human understanding, but it requires expert knowledge to define the subtasks. FRL splits complex problems directly according to the logical structure of the actions themselves, but it does not take the mutual constraints between different subtasks into account.

Model-based reinforcement learning

The methods discussed above are all model-free reinforcement learning: they obtain a large amount of weakly supervised data through trial-and-error interaction with the environment and then train a value network or policy network, without modeling the environment itself. The alternative is model-based reinforcement learning, whose learning process is shown in Figure 18. It models the environment directly: data obtained from interacting with the environment is used to learn a transition probability function over states and rewards, that is, an environment model. The system can then interact with the environment model to generate more training data. Model-based reinforcement learning is therefore generally more sample-efficient than model-free reinforcement learning, especially when interaction with the environment is expensive, but its effectiveness depends on how well the environment is modeled.

Using model-based reinforcement learning to improve training efficiency is a recent research hotspot. Microsoft first brought the classic Dyna-Q framework to dialogue in the form of Deep Dyna-Q (DDQ) [33]. As shown in Figure 19c, before DDQ training starts, a small amount of existing dialogue data is used to pre-train the policy model and the environment model (world model). DDQ training then cycles through three steps:

Direct reinforcement learning - update the policy model and store dialogue data through online dialogue interaction with real users;
Train the environment model - update the environment model with collected real dialogue data;
Planning - train the policy model using dialogue data obtained from interacting with the environment model.
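The three steps above can be illustrated with a tabular Dyna-Q loop. This is a deliberately simplified sketch and not the DDQ implementation of [33]: lookup tables replace the neural policy and world model, the three steps run after every turn rather than per training epoch, and the toy environment, reward values, and hyperparameters are all assumptions made for the example.

```python
import random

ACTIONS = ["request", "bye"]
GOAL = 2  # toy assumption: the task succeeds once two slots are filled


def real_env_step(state, action):
    """Toy stand-in for a real user. Returns (reward, next_state, done)."""
    if action == "bye":
        return (10.0 if state == GOAL else -10.0), state, True
    return -1.0, min(state + 1, GOAL), False


def greedy(q, state):
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))


def q_update(q, s, a, r, s2, done, alpha=0.5, gamma=0.9):
    """One Q-learning update on a single transition."""
    future = 0.0 if done else gamma * max(q.get((s2, b), 0.0) for b in ACTIONS)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + future - old)


def dyna_q(episodes=200, planning_steps=10, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q, world_model = {}, {}
    for _ in range(episodes):
        s, done, turns = 0, False, 0
        while not done and turns < 20:
            a = rng.choice(ACTIONS) if rng.random() < epsilon else greedy(q, s)
            r, s2, done = real_env_step(s, a)
            q_update(q, s, a, r, s2, done)        # 1. direct reinforcement learning
            world_model[(s, a)] = (r, s2, done)   # 2. train the environment model
            for _ in range(planning_steps):       # 3. planning with simulated data
                ps, pa = rng.choice(sorted(world_model))
                pr, ps2, pdone = world_model[(ps, pa)]
                q_update(q, ps, pa, pr, ps2, pdone)
            s, turns = s2, turns + 1
    return q, world_model


q, world_model = dyna_q()
```

Each real turn here is followed by ten simulated updates, so most of the learning signal comes from the world model rather than from the costly real interaction. That is the efficiency argument for the Dyna family, and it holds only as far as the world model is accurate.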

The environment model (as shown in Figure 20) is a neural network that probabilistically models the state transitions and rewards of the environment. Its input is the current dialogue state and the system action, and its output is the next user action, the environment reward, and a dialogue termination variable. The environment model lets DDQ reduce the amount of human-machine interaction data needed for online reinforcement learning (as shown in Figure 19a), and also avoids the low quality of interactions with a user simulator (as shown in Figure 19b).

The environment model is similar to the user simulator used in dialogue research: both can simulate the actions of real users and interact with the system's dialogue management module. The difference is that the user simulator is essentially an external environment of the system, used to simulate real users, while the environment model is part of the system and is an internal model.

Building on DDQ, Microsoft researchers made several extensions. To make the dialogue data generated by the environment model more realistic, they proposed [34] using adversarial training to improve the quality of the generated data. The paper [35] discusses feasible solutions to the question of when to use data from interaction with the real environment and when to use data from interaction with the environment model. To bring human interaction into the loop, the paper [36] gives a unified dialogue framework. This idea of human teaching is also a hot topic in industry for building dialogue management models, and we explain it further in the next section.


Human-in-the-loop

We hope to make full use of human knowledge and experience to generate high-quality data and improve model training efficiency. Human-in-the-loop reinforcement learning [37] is a method that brings humans into the robot training process. Through a well-designed human-machine interaction scheme, humans can efficiently guide the training of reinforcement learning models. To further improve the training efficiency of task-oriented dialogue systems, designing effective human-in-the-loop methods that fit the characteristics of dialogue problems has become a new direction for researchers.

Google researchers proposed a hybrid learning method that combines human teaching and reinforcement learning [37] (as shown in Figure 21). It adds a human teaching stage between supervised pre-training and online reinforcement learning, in which humans can step in and label data, avoiding the covariate shift problem caused by supervised pre-training [42]. Amazon researchers proposed a similar human teaching framework [37]: in each dialogue turn, the system recommends four replies to a customer service expert; the expert then decides whether to choose one of the four or to write a new reply; finally, the expert sends the chosen or edited reply to the user. With this method, developers can quickly update the dialogue system's capabilities, which makes it well suited to production deployment.

In the approaches above, the system passively accepts human annotations, but a good system should also learn to ask questions actively and seek help from people. The paper [40] proposes a companion learning architecture (as shown in Figure 22) that adds the role of a teacher (a human) to the traditional reinforcement learning framework. The teacher can correct the replies of the dialogue system, that is, the student (the switch on the left in Figure 22), and can evaluate the student's replies in the form of an internal reward (the switch on the right in Figure 22). To realize active learning, the authors propose the concept of decision certainty: they use dropout to sample the student policy network multiple times, obtain an approximate estimate of the maximum probability over candidate actions, and then take the running average of this maximum probability over several dialogue turns as the decision certainty of the student policy network. If the certainty is below a target value, the gap between the certainty and the target determines whether the teacher steps in to correct mistakes and provide rewards.
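A rough sketch of a dropout-based certainty estimate follows. It is one possible reading of the description above, not the implementation from [40]: the toy linear policy, the use of vote frequency as the estimate of the maximum action probability, the fixed threshold rule, and all names (`decision_certainty`, `TeacherGate`) are assumptions introduced for the example.

```python
import random
from collections import Counter, deque


def policy_logits(features, weights, rng, dropout_rate):
    """One stochastic forward pass of a toy linear policy with input dropout.

    dropout_rate must satisfy 0 <= dropout_rate < 1.
    """
    keep = 1.0 - dropout_rate
    dropped = [f / keep if rng.random() < keep else 0.0 for f in features]
    return [sum(w * x for w, x in zip(row, dropped)) for row in weights]


def decision_certainty(features, weights, rng, dropout_rate=0.5, n_samples=50):
    """Fraction of dropout samples that agree on the most frequently chosen action."""
    votes = Counter()
    for _ in range(n_samples):
        logits = policy_logits(features, weights, rng, dropout_rate)
        votes[logits.index(max(logits))] += 1
    return votes.most_common(1)[0][1] / n_samples


class TeacherGate:
    """Running average of certainty over recent turns, compared with a target value."""

    def __init__(self, target=0.8, window=3):
        self.target = target
        self.history = deque(maxlen=window)

    def should_ask_teacher(self, certainty):
        self.history.append(certainty)
        average = sum(self.history) / len(self.history)
        # Simplification: a hard threshold instead of a gap-dependent decision.
        return average < self.target


rng = random.Random(0)
weights = [[1.0, 0.0], [0.0, 1.0]]  # two actions, two input features
certainty = decision_certainty([1.0, 0.5], weights, rng, dropout_rate=0.5)
ask = TeacherGate(target=0.8).should_ask_teacher(certainty)
```

With dropout disabled, every sample selects the same action and the certainty is exactly 1. With dropout enabled, disagreement between samples lowers the estimate, and the gate only calls the teacher while the running average stays below the target.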

The key to active learning is estimating how confident the dialogue system is in its own decisions. Besides the dropout-based method for the policy network described above, other methods include computing the Jensen-Shannon divergence of the policy network's distributions conditioned on latent variables [22], and judging by the current dialogue success rate of the system [36].
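For reference, the Jensen-Shannon divergence between several action distributions can be computed as in the sketch below. It assumes equal weights for all distributions and that each input is a policy output obtained under a different sample of the latent variable; how [22] actually samples and thresholds these distributions is not specified here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(distributions):
    """Equal-weight Jensen-Shannon divergence among two or more distributions."""
    n = len(distributions)
    mixture = [sum(column) / n for column in zip(*distributions)]
    return sum(kl_divergence(d, mixture) for d in distributions) / n


# Policy outputs under different latent samples (made-up numbers).
agreeing = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1]]
conflicting = [[1.0, 0.0], [0.0, 1.0]]
low = js_divergence(agreeing)
high = js_divergence(conflicting)
```

A divergence near zero means the policy chooses similarly whatever the latent sample is, which can be read as high confidence. For two distributions the value is bounded above by log 2, reached when they put their mass on different actions.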

Dialogue management framework of the Conversational AI team
To ensure stability and interpretability, most dialogue management modules in industry currently use a rule-based approach. The Conversational AI team at Alibaba DAMO Academy began modeling dialogue management last year and has explored it in depth. In building a real dialogue system, we need to solve two problems:

How to obtain a large amount of dialogue data for a specific scene?
How to use algorithms to give full play to the value of data?

For the entire modeling framework design, we currently plan a four-step route (as shown in Figure 23):

In the first step, we use the dialog studio developed in-house by the Conversational AI team to quickly build a dialogue engine (called TaskFlow) based on rule-based dialogue flows, and use similar dialogue flows to build a user simulator. The user simulator and the dialogue engine then interact continuously using the M2M method to accumulate a large amount of dialogue data.

In the second step, once a certain amount of dialogue data is available, we use supervised learning to train a neural network, building a dialogue management model whose capabilities are roughly equivalent to those of the rule-based dialogue engine. This gives a preliminary model of dialogue management. The model design uses both semantic similarity matching and end-to-end generation to achieve scalability, and for dialogue tasks with a large action space, HRL is used to partition the actions.

In the third step, with a preliminary dialogue management model in place, we let the system interact with an improved user simulator or with AI trainers during the development stage, and continuously strengthen the system's dialogue ability through the off-policy ACER reinforcement learning algorithm.

In the fourth step, once the human-machine dialogue experience is good enough for initial practical use, the system can go online. At this stage we introduce human factors: we collect real user interaction data, make it easy for users to give feedback through UI design, and continuously update and strengthen the model. The large amount of accumulated human-machine dialogue data will also be further analyzed and mined for customer insights.
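The M2M data generation in the first step can be pictured with the toy loop below, in which a rule-based engine and a rule-based user simulator talk to each other and the exchange is logged as training data. This is not TaskFlow or the team's simulator: the act names, slot names, and rules are hypothetical stand-ins.

```python
def dialogue_engine(state, required_slots):
    """Rule-based system: request the next missing slot, then confirm the booking."""
    for slot in required_slots:
        if slot not in state:
            return ("request", slot)
    return ("confirm_booking",)


def user_simulator(goal, system_act):
    """Rule-based user: answer requests from the goal, leave once booking is confirmed."""
    if system_act[0] == "request":
        slot = system_act[1]
        return ("inform", slot, goal[slot])
    if system_act[0] == "confirm_booking":
        return ("bye",)
    return ("intent", "book_room")


def simulate_dialogue(goal, max_turns=10):
    """Let the engine and the simulator interact and log every act as labeled data."""
    required_slots = list(goal)
    state, log = {}, []
    user_act = ("intent", "book_room")
    for _ in range(max_turns):
        log.append(("user", user_act))
        if user_act[0] == "bye":
            break
        if user_act[0] == "inform":
            state[user_act[1]] = user_act[2]
        system_act = dialogue_engine(state, required_slots)
        log.append(("system", system_act))
        user_act = user_simulator(goal, system_act)
    return log


log = simulate_dialogue({"date": "Monday", "size": "6"})
```

Since both sides are rule-driven, every logged turn already carries its dialogue act label, so running many sampled goals through such a loop yields annotated dialogues without manual labeling.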

At present, the reinforcement-learning-based dialogue management model we have built achieves a completion rate of 80% when interacting with the user simulator on a medium-scale, complex dialogue task, booking a conference room, as shown in Figure 24.

This review gives a detailed introduction to recent cutting-edge research on the dialogue management (DM) model, organized around three major pain points of traditional dialogue management:

Poor scalability;
Scarce labeled data;
Low training efficiency.

In terms of scalability, we introduced common methods for handling changes in user intents, the dialogue ontology, and the system action space, mainly semantic similarity matching, knowledge distillation, and sequence generation. For the scarcity of labeled data, we covered three topics: automatic labeling by machines, effective mining of dialogue structure, and efficient data collection strategies. For the low training efficiency of RL models in traditional DM, academia has tried HRL, FRL, and related methods to partition the action space hierarchically, and has used model-based RL to model the environment and improve training efficiency. Introducing human-in-the-loop into the dialogue system training framework is also a very active research direction. Finally, we reported in some detail on the current progress of the Alibaba DAMO Academy Conversational AI team in DM modeling. We hope this review provides some inspiration for your dialogue management research.
