Machine Learning Practice Based on Spark and TensorFlow
EMR E-Learning Platform
The EMR E-Learning platform is built on big data and AI technology: it uses algorithms to build machine learning models from historical data for training and prediction. Machine learning is now widely applied in many fields, such as face recognition, natural language processing, recommendation systems, and computer vision. In recent years, advances in big data and computing power have driven the rapid development of AI technology.
The three key elements of machine learning are algorithms, data, and computing power. EMR is itself a big data platform that hosts many kinds of data, such as traditional data warehouse data and image data. EMR also has strong scheduling capabilities and can allocate and schedule GPU and CPU resources effectively. Combined with machine learning algorithms, it can therefore serve as a solid AI platform.
The typical AI development process is shown in the following figure. First comes data collection: data from mobile devices, routers, or application logs flows into the big data framework's Data Lake. Next is data processing: the collected data is processed through traditional big data ETL or feature engineering. Then comes model training on the processed data. Finally, the trained model is evaluated and deployed. The model's prediction results are fed back into the big data platform for processing and analysis, and the whole cycle repeats.
The following figure shows the current state of AI development. On the left is a single machine or cluster used mainly for AI training and evaluation, with its own data storage; on the right is a big data cluster used mainly for big data processing such as feature engineering. The model trained on the left is transferred over and used for prediction.
This state of AI development has two main pain points:
• Complex operation and maintenance of two clusters: as the figure shows, the two clusters involved in AI development are separate and must be maintained independently, which makes operation and maintenance costly and error-prone.
• Low training efficiency: the two clusters require heavy data and model transfer between them, resulting in high end-to-end training latency.
As a unified big data platform, EMR is built in layers. The lowest infrastructure layer supports both GPU and CPU machines. The data storage layer includes HDFS and Alibaba Cloud OSS. The data access layer includes Kafka and Flume. The resource scheduling layer includes YARN, K8s, and ZooKeeper. At the core of the computing engine is the E-Learning platform, which is based on the popular open-source system Spark; the Spark used here is Jindo Spark, a version modified and optimized by the EMR team for AI scenarios, alongside PAI TensorFlow on Spark. Finally, the computing and analysis layer provides data analysis, feature engineering, AI training, and Notebook functions for users.
The EMR platform has the following features:
• Unified resource management and scheduling: supports fine-grained scheduling and allocation of CPU, memory, and GPU resources, on both the YARN and K8s resource scheduling frameworks;
• Multiple framework support: including TensorFlow, MXNet, Caffe, etc.;
• Spark as a general data processing framework: the Data Source API makes it easy to read various data sources, and the MLlib pipeline is widely used for feature engineering;
• Spark + deep learning frameworks: integrated support for Spark together with deep learning frameworks, including efficient data transfer between Spark and TensorFlow and a Spark resource scheduling model that supports distributed deep learning training;
• Resource monitoring and alerting: the EMR APM system provides complete application and cluster monitoring with multiple alert modes;
• Ease of use: Jupyter Notebook and multi-environment Python deployment support, an end-to-end machine learning training process, and more.
EMR E-Learning integrates PAI TensorFlow, which is optimized for deep learning and large-scale sparse scenarios.
TensorFlow on Spark
Market research found that before deep learning, most customers use the open-source computing framework Spark for data ETL and feature engineering, while TensorFlow is widely used in the later stages; this motivated the goal of combining TensorFlow and Spark organically. TensorFlow on Spark has six specific design objectives, shown in the figure below.
TensorFlow on Spark is essentially a PySpark application framework that wraps the underlying components. The framework first schedules the user's feature engineering tasks and then schedules the deep learning TensorFlow tasks; it must also transfer the feature engineering output efficiently to the underlying PAI TensorFlow Runtime for deep learning and machine learning training. Because Spark does not currently support heterogeneous resource scheduling, running distributed TensorFlow requires launching two kinds of tasks simultaneously (PS tasks and worker tasks) and generating different Spark executors according to the resources each requires; the PS tasks and worker tasks register their services through ZooKeeper. After the framework starts, the user's feature engineering tasks are scheduled onto the executors. When they finish, the framework hands the data to the underlying PAI TensorFlow Runtime for training, and after training the model is saved to the Data Lake to facilitate later model release.
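To make the PS/worker scheduling concrete, the following sketch shows how endpoints registered by the tasks (e.g. via ZooKeeper) might be assembled into a TF_CONFIG-style cluster description. The function names and the "host:port" endpoint format are hypothetical illustrations, not EMR's actual internals.

```python
import json

def build_cluster_spec(endpoints, num_ps):
    """Split registered executor endpoints into PS and worker role lists.

    `endpoints` mimics what each task might publish to a registry such
    as ZooKeeper (hypothetical "host:port" strings); the first `num_ps`
    entries become parameter servers, the rest become workers.
    """
    return {"ps": endpoints[:num_ps], "worker": endpoints[num_ps:]}

def tf_config_for(endpoints, num_ps, role, index):
    """Assemble a TF_CONFIG-style JSON string for one task."""
    return json.dumps({
        "cluster": build_cluster_spec(endpoints, num_ps),
        "task": {"type": role, "index": index},
    })

hosts = ["10.0.0.1:2222", "10.0.0.2:2222", "10.0.0.3:2222"]
spec = build_cluster_spec(hosts, num_ps=1)
# spec: {"ps": ["10.0.0.1:2222"], "worker": ["10.0.0.2:2222", "10.0.0.3:2222"]}
```

Each executor could then export its own TF_CONFIG string before starting the TensorFlow runtime, so every task knows its role and peers.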
In machine learning and deep learning, data exchange is a key efficiency lever, so TensorFlow on Spark applies a series of optimizations to this part. Specifically, Apache Arrow is used for high-speed data transfer, and the training data is fed directly to the PAI TensorFlow Runtime to accelerate the whole training process.
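For reference, open-source Spark exposes its own Arrow-based columnar transfer through configuration. The properties below are standard Spark settings (the first name applies to Spark 3.x; Spark 2.3+ uses `spark.sql.execution.arrow.enabled`); they illustrate the general Arrow data-path idea, not EMR's internal mechanism.

```
spark.sql.execution.arrow.pyspark.enabled     true
spark.sql.execution.arrow.maxRecordsPerBatch  10000
```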
The fault-tolerance mechanism of TensorFlow on Spark is shown in the following figure. The bottom layer relies on TensorFlow's checkpoint mechanism: checkpoints of the training model are periodically written to the Data Lake, and when a TensorFlow task restarts, it reads the latest checkpoint and resumes training. Fault handling differs by mode. For distributed PS tasks, the PS and worker tasks each run under a daemon process that monitors the corresponding task. For MPI tasks, Spark's Barrier Execution mechanism is used for fault tolerance: if one task fails, the failure is marked, all tasks are restarted, and all environment variables are reconfigured. In both cases, the TF task is responsible for reading the latest checkpoint.
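The "resume from the latest checkpoint" step can be sketched in plain Python. The `model.ckpt-<step>` file naming is a hypothetical stand-in loosely modeled on TensorFlow's checkpoint convention; a real job would use TensorFlow's own checkpoint-discovery utilities against the Data Lake path.

```python
import os
import re
import tempfile

def latest_checkpoint(ckpt_dir):
    """Return the path of the highest-numbered checkpoint file, or None.

    Assumes files named like "model.ckpt-<step>" (hypothetical naming).
    """
    pattern = re.compile(r"model\.ckpt-(\d+)$")
    best, best_step = None, -1
    for name in os.listdir(ckpt_dir):
        m = pattern.match(name)
        if m and int(m.group(1)) > best_step:
            best, best_step = name, int(m.group(1))
    return os.path.join(ckpt_dir, best) if best else None

# Simulate a checkpoint directory with three saved steps.
ckpt_dir = tempfile.mkdtemp()
for step in (100, 250, 50):
    open(os.path.join(ckpt_dir, f"model.ckpt-{step}"), "w").close()

print(latest_checkpoint(ckpt_dir))  # path ending in "model.ckpt-250"
```

A restarted task would pass this path to its restore logic so training continues from step 250 rather than from scratch.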
The functions and ease of use of TensorFlow on Spark are mainly reflected in the following points:
• Multiple deployment environments: supports a specified conda environment, a packaged Python virtualenv runtime, or a specified Docker image
• TensorFlow architecture support: supports both the native distributed TensorFlow PS architecture and the distributed Horovod MPI architecture
• TensorFlow API support: supports both the high-level distributed TensorFlow Estimator API and the low-level distributed TensorFlow Session API
• Fast onboarding of new frameworks: additional AI frameworks, such as MXNet, can be added according to customer needs
Many EMR customers are Internet companies, where advertising and push-notification scenarios are common. The following figure shows a typical advertising push scenario. Customers push log data into the Data Lake in real time through Kafka. TensorFlow on Spark handles the first half of the pipeline: the real-time and offline data go through ETL and feature engineering with Spark tools such as Spark SQL and MLlib. The processed data is then fed efficiently through the TensorFlow framework to the PAI TensorFlow Runtime for large-scale training and optimization, and the resulting model is stored in the Data Lake.
At the API level, TensorFlow on Spark provides a base class with three methods that users implement: pre_train, train, and shutdown. pre_train covers the data reading, ETL, and feature engineering work the user needs to do, and returns a Spark DataFrame; shutdown releases the user's long-lived connection resources; train is the code the user previously wrote in TensorFlow, such as the choice of model, optimizer, and operators. Finally, the pl_submit command is used to submit a TensorFlow on Spark task.
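A minimal sketch of that base class is shown below. The method names follow the article; the class name, exact signatures, and the plain-list stand-in for a Spark DataFrame are assumptions for illustration only.

```python
from abc import ABC, abstractmethod

class TFoSBase(ABC):
    """Hypothetical sketch of the TensorFlow on Spark base class."""

    @abstractmethod
    def pre_train(self):
        """Read data and run ETL / feature engineering.
        In a real job, returns a Spark DataFrame of training features."""

    @abstractmethod
    def train(self, dataset):
        """The user's TensorFlow code: model, optimizer, training loop."""

    def shutdown(self):
        """Release long-lived resources (connections, sessions)."""

class MyJob(TFoSBase):
    def pre_train(self):
        # A plain list stands in for a Spark DataFrame here.
        return [(1.0, 0), (2.0, 1)]

    def train(self, dataset):
        return f"trained on {len(dataset)} rows"

job = MyJob()
print(job.train(job.pre_train()))  # trained on 2 rows
job.shutdown()
```

The split keeps feature engineering (Spark-side) and model code (TensorFlow-side) in one class while letting the framework schedule each phase on the right resources.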
The following figure shows an FM (Factorization Machine) example for a recommendation system; FM is a fairly common recommendation algorithm. The scenario is movie rating: recommending potential movies to users based on their previous ratings, the movies' genres, and their release dates. On the left is the feature engineering part: users read movie and rating data through the Spark Data Source API, and all Spark operations, such as joins and ETL processing, are natively supported. On the right is the TensorFlow part, used to choose the model and optimizer. The code for the whole example has been open-sourced on GitHub.
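For readers unfamiliar with FM, the sketch below implements the second-order FM prediction in plain Python, using the standard O(k·n) reformulation of the pairwise term. This illustrates the algorithm itself, not the GitHub example's actual code, which builds the model in TensorFlow.

```python
def fm_score(x, w0, w, V):
    """Factorization Machine prediction (second order):
    y = w0 + sum_i w_i*x_i + sum_{i<j} <V_i, V_j> * x_i * x_j

    x:  feature vector
    w0: global bias
    w:  linear weights, one per feature
    V:  list of k-dimensional factor vectors, one per feature
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    k = len(V[0])
    pairwise = 0.0
    for f in range(k):  # O(k*n) trick: 0.5 * ((sum)^2 - sum of squares)
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

# Two features with 1-dim factors [1.0] and [2.0]:
# pairwise term = <V_0, V_1> * x_0 * x_1 = 2.0
print(fm_score([1.0, 1.0], 0.0, [0.0, 0.0], [[1.0], [2.0]]))  # 2.0
```

In the movie scenario, x would encode the user, the movie, its genres, and release time, and the learned factor vectors V capture user-movie interactions.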
In summary, the EMR E-Learning platform tightly combines big data processing, deep learning, machine learning, data lake, and GPU capabilities to provide a one-stop big data and machine learning platform. TensorFlow on Spark provides an efficient data exchange path and a complete machine learning training workflow, combining Spark with TensorFlow and helping users accelerate training with PAI TensorFlow. The E-Learning platform currently serves a range of public cloud customers, with successful deployments including CPU cluster sizes exceeding 1,000 and GPU cluster sizes exceeding 100.