What Help Does AutoML Provide?
Foreword
The implementation of AI applications is inseparable from model training. Model training requires different engines, tools, and resources depending on the domain, and is often time-consuming and labor-intensive. The artificial intelligence laboratory maintains many products and applications, each backed by many model training processes, so that online products can be iterated continuously and feel ever more intelligent to users. From data production to model training and deployment, we have explored AutoML and, from an engineering perspective, accelerated the iteration of complete AI applications.
Background
At present, the implementation of an AI application generally consists of the following parts.
As can be seen from the above figure, a complete AI application development process is lengthy and cumbersome, requiring substantial manpower and resources. The dotted lines mark the parts that we have concluded can be automated.
Status Quo
State of the AutoML industry (outside the group)
The table above compares traditional machine learning with AutoML and lists some representative companies.
One-sentence summary: traditional machine learning requires experts to participate in every step, while AutoML aims to solve many different problems with the same automated approach.
Business Status
All the models currently in use have been tuned by algorithm experts, and their training parameters are relatively stable and rarely need modification, so the demand for automatic parameter tuning is not strong. The real pain points are the heavy workload, the low efficiency of model production, the high interaction cost at each stage, and the constraints of data security. The AutoAI concept we propose is grounded in actual business scenarios: we hope to simplify the implementation of our AI applications through automation, platformization, and dataization, on the premise of safety and efficiency.
I will elaborate below:
Automation: The mission of AutoAI is to become a fully managed, end-to-end automated AI application platform. The service covers the entire machine learning workflow: labeling and preparing data, selecting algorithms, training, tuning and optimizing for deployment, making predictions, and taking action. With an automated pipeline, models can be put into production faster, with less effort and at lower cost.
Platformization: AutoAI integrates all machine learning workflows into platform-based product tools, helping scientists carry out data collation, preprocessing selection, and model evaluation within the AutoAI platform. In practice, it is very difficult for AutoML to completely replace algorithm engineers, so we spend more of our time on enabling scientists to train and release models more efficiently. Different algorithm types are operated on specialized platforms, such as the NLP algorithm platform, the ASR algorithm platform, and the image algorithm platform.
Dataization: A common oversight in the AI field today is the lack of tracking and statistics on the data itself. This leaves many blind spots across the production of our AI applications: How much do we spend on data labeling? What fraction of the manually labeled data is actually used? How much does this data improve model quality? How accurate is the manual labeling? With these statistics, we could even improve the quality of our manual annotations, which in turn improves the quality of our models.
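To make this concrete, here is a minimal sketch of the kind of labeling statistics we have in mind; the record fields and numbers are hypothetical, purely for illustration:

    # Minimal sketch of labeling-statistics tracking; record fields and
    # numbers are hypothetical, for illustration only.
    labeling_records = [
        # cost per item, whether the item was ever used in training,
        # and whether it passed a quality spot-check
        {"cost": 0.8, "used_in_training": True,  "passed_qc": True},
        {"cost": 0.8, "used_in_training": False, "passed_qc": True},
        {"cost": 1.2, "used_in_training": True,  "passed_qc": False},
    ]

    total_cost = sum(r["cost"] for r in labeling_records)
    utilization = sum(r["used_in_training"] for r in labeling_records) / len(labeling_records)
    accuracy = sum(r["passed_qc"] for r in labeling_records) / len(labeling_records)

    print(f"labeling cost: {total_cost:.2f}")
    print(f"utilization rate: {utilization:.0%}")
    print(f"manual labeling accuracy: {accuracy:.0%}")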
Our desired development trajectory is:
High-quality and efficient data labeling capabilities;
Provide algorithm personnel with the ability to customize pre-processed data and feature data;
Try different models to help algorithm personnel find the most suitable model training configuration;
In the model iteration stage, accelerate the iteration cycle and improve iteration quality through automation;
Give each model intuitive evaluation metrics;
Models can be easily deployed;
Introduce a regression testing platform to help us observe the effect of the deployed model.
Architecture
From an architectural point of view, we have closed the loop over the entire AI application life cycle.
Through the Ark process platform, steps that originally had to be chained together manually are now scheduled automatically.
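Conceptually, this orchestration can be pictured as a linear pipeline of stages driven by a scheduler. The sketch below is a minimal illustration of the idea only; it is not the Ark platform's actual API, and every function name is hypothetical:

    # Minimal sketch of automated pipeline scheduling; the stage functions
    # are hypothetical placeholders, not the Ark platform's real API.
    def produce_data(ctx):
        ctx["dataset"] = "labeled-dataset-v1"
        return ctx

    def extract_features(ctx):
        ctx["features"] = f"features({ctx['dataset']})"
        return ctx

    def train_model(ctx):
        ctx["model"] = f"model({ctx['features']})"
        return ctx

    def evaluate_model(ctx):
        ctx["metrics"] = {"accuracy": 0.93}  # made-up number
        return ctx

    def deploy_model(ctx):
        ctx["endpoint"] = f"serving({ctx['model']})"
        return ctx

    PIPELINE = [produce_data, extract_features, train_model, evaluate_model, deploy_model]

    def run_pipeline(stages):
        """Run each stage in order, passing a shared context dict along."""
        ctx = {}
        for stage in stages:
            ctx = stage(ctx)
            print(f"finished {stage.__name__}")
        return ctx

    run_pipeline(PIPELINE)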
The following is a description of each link in the above figure:
Data Production
With the vigorous development of supervised training, exemplified by deep learning, massive, high-quality labeled data brings significant improvements to models, making data production a booster for the entire AI application. The artificial intelligence laboratory has a powerful data annotation platform:
Built-in annotation types for autonomous driving, image, audio, text, and other data, meeting the requirements of high-precision annotation;
Advanced support for linked 2D/3D image and point cloud annotation;
Support for 2D/3D tracking group labeling, with no limit on the number of labeled items in each group;
Support for custom template configuration to meet custom labeling needs;
Support for task distribution, with complete task management and task tracking.
Closed Training Environment: The Guardian of Data Security
Currently AutoAI supports many types of training environments.
The TensorFlow training environment of the Alibaba machine learning platform
In AILabs, NLP algorithms are mostly trained with TensorFlow on Alibaba's machine learning platform. AutoAI encapsulates this TensorFlow training environment and adds algorithm version management across environments.
Customized Docker distributed training
AILabs' acoustic training requires a specific acoustic model training framework, but the machine learning platform currently offers no corresponding framework. AILabs algorithm engineers teamed up with colleagues from the machine learning platform team to build a Docker environment containing the custom acoustic training framework, which is used for model training and data processing.
PAI-speech is a training engine built on MR/MPI together with the machine learning platform team.
Users run data processing and model training with PAI-speech inside the customized PAI-speech Docker image.
The generated models are stored on the mounted OSS.
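As a rough sketch of what launching such a containerized job might look like with the Docker Python SDK (the real PAI-speech submission interface is not shown here; the image name, command, and mount paths below are made up):

    # Hypothetical sketch: run an acoustic training job inside a custom
    # Docker image and write the model to a mounted OSS path. Requires
    # the `docker` Python SDK and a local Docker daemon; the image name,
    # command, and paths are made up for illustration.
    import docker

    client = docker.from_env()
    logs = client.containers.run(
        image="registry.example.com/pai-speech:latest",  # hypothetical image
        command="train --config /workspace/acoustic.conf "
                "--output /mnt/oss/models/asr-v2",
        volumes={"/data/oss-mount": {"bind": "/mnt/oss", "mode": "rw"}},  # OSS mount
        remove=True,
    )
    print(logs.decode("utf-8"))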
In AILabs, there is a natural pursuit of data security. How can we make development and debugging convenient for algorithm engineers without exposing the training data? The straightforward idea is to create an isolated training environment: algorithm users operate through a console, while high-risk operations such as data export are blocked. Inside the isolated environment we do not restrict ordinary operations, only the dangerous ones, smoothing the tension between security and ease of use.
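The isolation idea can be caricatured as a console gateway that lets ordinary operations through and rejects anything on a high-risk list; a toy sketch under that assumption (the operation names are hypothetical):

    # Toy sketch of the isolation idea: a console gateway that rejects
    # high-risk operations such as data export (operation names hypothetical).
    HIGH_RISK_OPERATIONS = {"export_data", "download_dataset", "copy_to_external"}

    def submit_operation(name, *args):
        if name in HIGH_RISK_OPERATIONS:
            raise PermissionError(f"operation '{name}' is blocked in the closed environment")
        print(f"running {name}{args} inside the isolated environment")

    submit_operation("train_model", "acoustic-v2")   # ordinary operation: allowed
    # submit_operation("export_data", "dataset-v1")  # would raise PermissionError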
Business Platforms
We have incubated multiple business platforms for different algorithm business lines.
Data Production Platform
As mentioned before, we already have a very good data annotation platform. But how does data enter the labeling platform, and how does labeled data flow back into asset management? Different business parties generally have different needs here.
Improving the efficiency of data production is a very important part of AI applications, because it speeds up the model iteration cycle.
Orchestration through the Ark process platform makes it easy to import data into the Dianjin data labeling platform for manual labeling.
NLP Algorithm Platform
A platform specialized for training NLP text algorithms:
Acoustic Training Platform
Deploy and Go Live
In the actual production environment, version control for deployment is very important. The current deployment and launch process is as follows:
This process carries heavy manual communication costs. Another hard problem is that the evaluation data is often stale and may go unrefreshed for months; such evaluation data obviously cannot reflect recent changes in online data.
Therefore, we propose the following improved model:
In the improved process, no human participates except at the final acceptance step, where a person decides whether the model goes live; and the latest online data is used to measure how well the model adapts to recent data.
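One way to picture the automated gate before that final acceptance step: score the candidate model against the currently deployed one on a fresh sample of online traffic, and only surface it for human acceptance if it does not regress. A hypothetical sketch, with a stand-in scorer:

    # Hypothetical sketch of the automated gate: score the candidate model
    # against the currently deployed one on freshly sampled online data.
    import random

    def evaluate(model_name, samples):
        """Stand-in scorer; a real system would run inference and compare
        predictions against references."""
        random.seed(model_name)  # deterministic fake scores for the demo
        return sum(random.random() for _ in samples) / len(samples)

    online_samples = [f"utterance-{i}" for i in range(100)]  # latest online data

    current = evaluate("model-current", online_samples)
    candidate = evaluate("model-candidate", online_samples)

    if candidate >= current:
        print(f"candidate passes ({candidate:.3f} >= {current:.3f}); forward to human acceptance")
    else:
        print("candidate regresses; reject automatically")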
Features
Having covered all of the above, let's look at what AutoAI can do.
Active Learning - A Booster for Model Optimization
Active learning is widely used in the NLP and ASR applications of the artificial intelligence laboratory. It brings two significant benefits:
Reduce the amount of manually labeled data;
Compared with traditional random sampling plus manual labeling, active learning achieves better results faster.
Let's take our active learning for ASR as an example:
Component Description:
Data set to be labeled: consists of three parts: raw data sampled from the full data according to sampling rules; unreliable results from automatic labeling; and badcases produced by model evaluation.
Feature file extraction: both labeled data and raw data are asset data and cannot be consumed directly by algorithms. They must be converted into feature files on the data asset platform (Zangjin Pavilion) and exported to AutoAI for training.
AutoAI training: the PAI-based custom acoustic training environment mentioned above.
Dual ASR engines: two acoustic models with different algorithms, trained on the labeled data.
Selection strategy: if the two ASR engines agree and the confidence exceeds a threshold, the sample enters the training set (see the sketch after this list).
The outputs of the two ASR models are compared; consistent, high-confidence predictions are added directly to the training set.
If the results are inconsistent or the confidence is low, the sample enters the pool awaiting manual labeling.
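The selection strategy is simple to state precisely. Below is a minimal sketch of the rule as described; the threshold value and the (transcript, confidence) result format are assumptions for illustration:

    # Minimal sketch of the selection strategy: agreement between the two
    # engines plus a confidence threshold decides the target pool.
    # The threshold value and result format are hypothetical.
    CONFIDENCE_THRESHOLD = 0.9

    def select(result_a, result_b):
        """Each result is a (transcript, confidence) pair from one engine."""
        (text_a, conf_a), (text_b, conf_b) = result_a, result_b
        if text_a == text_b and min(conf_a, conf_b) >= CONFIDENCE_THRESHOLD:
            return "training_set"        # auto-labeled, skips manual labeling
        return "manual_labeling_pool"    # inconsistent or low confidence

    print(select(("turn on the light", 0.95), ("turn on the light", 0.97)))
    print(select(("turn on the light", 0.95), ("turn off the light", 0.97)))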
Specific steps:
The dual ASR models attempt to label the training data automatically via online prediction; per the selection strategy, unconfirmed data is handed back to manual labeling. Data that passes the dual-ASR check skips manual labeling entirely and proceeds directly to training, greatly reducing labor costs and turnaround time.
Combine manual and automatically labeled data for model training.
Badcases from model evaluation flow back into the pool to be labeled; they let us verify whether the new model introduces new badcases and whether previous badcases have been fixed.
Judge whether performance has reached the target; if not, repeat steps 1-3. A minimal sketch of this loop follows.
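Putting the steps together, the iteration loop looks roughly like this; every function below is a placeholder standing in for the components described above, not our actual implementation:

    # Rough sketch of the active-learning loop; every function here is a
    # placeholder standing in for the components described above.
    def auto_label(pool):
        confident = [x for i, x in enumerate(pool) if i % 3 == 0]  # fake "confident" subset
        return confident, [x for x in pool if x not in confident]

    def manual_label(pool):
        return list(pool)  # pretend humans labeled the rest

    def train(data):
        return f"model(n={len(data)})"

    def evaluate(model):
        return 0.91, ["utt-7", "utt-13"]  # made-up (score, badcases)

    pool, target = [f"utt-{i}" for i in range(20)], 0.9
    while True:
        auto, rest = auto_label(pool)             # step 1: dual-ASR auto-labeling
        model = train(auto + manual_label(rest))  # step 2: train on combined data
        score, badcases = evaluate(model)         # step 3: badcases flow back
        pool = badcases
        if score >= target:
            break
    print("target reached with", model)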
One-click model training
AutoAI has integrated the deep learning algorithms commonly used in NLP; with minimal work and no deep machine learning expertise, training can be completed with one click. There is no complicated configuration: this is genuinely one-click model training. The parameters each algorithm requires are encapsulated in training templates, and a template is chosen according to the business requirement, as shown in the screenshot of the NLP training platform above.
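To make "one click" concrete: all hyperparameters live in pre-built templates keyed by business scenario, so launching a job reduces to naming a template and a dataset. A hypothetical sketch of the idea, with made-up template names and values:

    # Hypothetical sketch of template-based one-click training: the
    # parameters are pre-encapsulated per business scenario, so the user
    # only picks a template and a dataset. All names/values are made up.
    TRAINING_TEMPLATES = {
        "text_classification": {"model": "textcnn", "lr": 1e-3, "epochs": 10},
        "sequence_labeling":   {"model": "bilstm_crf", "lr": 5e-4, "epochs": 20},
    }

    def one_click_train(template_name, dataset_id):
        params = TRAINING_TEMPLATES[template_name]  # no manual configuration
        print(f"submitting job: dataset={dataset_id}, params={params}")
        return f"job-{template_name}-{dataset_id}"

    job = one_click_train("text_classification", "nlp-dataset-42")
    print("submitted", job)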
Detailed Model Evaluation
Detailed and complete model evaluation results accurately reflect how well the model fits the data.
False-positive examples surface the model's badcases and help algorithm scientists understand its fit more accurately.
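For instance, with scikit-learn one can derive both the aggregate metrics and the concrete false-positive examples from a set of predictions; the toy texts and labels below are made up for illustration:

    # Toy sketch: aggregate metrics plus the concrete false-positive
    # examples, using scikit-learn; labels and texts are made up.
    from sklearn.metrics import classification_report

    examples = ["good service", "slow reply", "loved it", "meh", "great", "bad app"]
    y_true = [1, 0, 1, 0, 1, 0]  # gold labels (1 = positive)
    y_pred = [1, 1, 1, 0, 1, 1]  # model predictions

    print(classification_report(y_true, y_pred, target_names=["neg", "pos"]))

    false_positives = [ex for ex, t, p in zip(examples, y_true, y_pred)
                       if t == 0 and p == 1]
    print("false positives (badcases):", false_positives)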
Fully Managed Deployment
A trained model can be deployed to the prediction servers with one click. Deployment also supports grayscale release, A/B testing, and, in the future, model distribution to thousands of endpoints.
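Grayscale release and A/B testing both reduce to deterministic traffic splitting between model versions. A toy sketch of that routing logic; the split ratio and version names are hypothetical:

    # Toy sketch of grayscale / A/B traffic routing between model versions;
    # the split ratio and version names are hypothetical.
    import hashlib

    def route(request_id, grayscale_percent=10):
        """Deterministically send a fixed share of traffic to the new model."""
        bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
        return "model-v2" if bucket < grayscale_percent else "model-v1"

    for rid in ["req-1", "req-2", "req-3", "req-4"]:
        print(rid, "->", route(rid))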
Planning
At the current stage, AutoAI undertakes most of the model training of the artificial intelligence laboratory, covering NLP, acoustics, images, and more. Beyond enabling our algorithm engineers to train models faster and better, we will focus on the following directions:
Conclusion
Efficiency Improvement
By introducing AutoAI, we have greatly shortened model training cycles, simplified the launch process for complex businesses, and increased the number of model iterations.
Security
Data security is the first priority of the artificial intelligence laboratory: we must ensure that private user data does not leak and that labeled asset data is used only in a secure environment. AutoAI's end-to-end process and the closed training environment keep all data transfers within the platform. Users can operate on all data on the platform, but the data, including feature files and model data, never leaves it.