Use E-MapReduce to build a data lake on the cloud

Data Lake

The concept of the data lake was proposed in 2015 and has become extremely popular over the last two or three years. In Gartner's view, the data lake is a technology with great investment and exploration value.

What is a data lake? Previously, we used data warehouses to manage structured data. After the rise of Hadoop, large volumes of structured and unstructured data came to be stored together on HDFS. However, as data accumulates, some of it may not yet have a suitable application scenario, so we store it first and develop and mine it further when the business needs arise.

As data volume keeps growing, we can use object storage such as OSS, or HDFS, for unified storage. At the same time, we face different computing scenarios, such as ad hoc query, offline computing, real-time computing, machine learning, and deep learning, and in each scenario we must choose among different engines. Across all of these scenarios, we also need unified monitoring, authorization, auditing, and accounting.

The first part is data acquisition (the leftmost box in the figure), which collects relational database records, logs, and user click streams into unified storage. Different computing services then process the data; the results feed an AI platform for machine learning or deep learning and are finally applied to the business, while capabilities such as search and metadata management further increase the value of the data. Beyond computing and storage, data also requires a set of governance and audit mechanisms.

More than 10 years have passed since the birth of big data technology. In the beginning, people built open source software in their own IDCs. As business keeps growing, data accumulates rapidly and workloads fluctuate sharply; a business may suddenly become explosive.

The offline IDC procurement cycle is very long, making it hard to meet the computing demands of a fast-growing business. Business also has peaks and valleys: daytime workloads may be relatively light (mostly ad hoc queries), while at night resources may need to be expanded for offline report computation. In the IDC model, it is difficult to match computing power to such patterns.

About five or six years ago, a large number of enterprises began migrating to the cloud. As business data keeps growing, enterprises can quickly add instances through the cloud supply chain to meet growth. However, if you build your own Hadoop cluster, or even use EMR on the cloud with HDFS underneath, some problems remain: because HDFS is still the underlying storage, storage cost grows linearly with data volume, and when local disks are used on the cloud, operations and maintenance become very complex.

For large-scale clusters (hundreds or thousands of nodes), disk failure is a routine event, and handling it well is itself a challenge. The architecture has therefore gradually evolved into a data lake centered on OSS. With OSS's tiered storage capability, different data can be stored in different ways at different costs. Meanwhile, operating HDFS's NameNode in an HA setup is very complex: once a cluster exceeds 100 nodes, keeping the NameNode stable becomes a serious challenge that can consume a great deal of energy and manpower, and maintaining the HA architecture may remain an unsolved problem in the long run. Using OSS is another option: by adopting the storage architecture of a cloud service, we can avoid these unsolved problems in the HDFS architecture.

Using EMR to build enterprise-level data lake services

EMR is positioned to take advantage of Alibaba Cloud's ecosystem, remain 100% open source, and provide enterprises with stable and reliable open source big data services. EMR was launched in June 2016 and has iterated to EMR 4.4. With EMR, users can choose from more than 10 ECS instance types and create elastic clusters in minutes.

EMR supports Alibaba Cloud OSS, and its self-developed JindoFS significantly improves OSS performance. EMR also integrates with the Alibaba Cloud ecosystem: DataWorks and PAI connect seamlessly to EMR, and for storage products such as Log Service and MaxCompute, EMR can serve as the compute engine over the data stored in them. All EMR components are Apache open source versions; as the community versions keep upgrading and evolving, the EMR team makes a series of optimizations and improvements to components such as Spark, Hadoop, and Kafka.

EMR adopts a semi-managed architecture, under which users get an experience very similar to an offline IDC: they can log on to the ECS nodes in the cluster and deploy and manage those servers themselves. EMR also provides a series of enterprise-level features, including APM-like alerting and diagnostics at the host, job, and service levels; support for MIT Kerberos, RAM, and HAS as authentication platforms; and Ranger as a unified permission management platform.

The figure below shows EMR's entire open source big data ecosystem, including software and hardware.

It is organized into several layers.

For example, JindoFS sits on the storage layer (OSS). JindoFS is a component developed by the EMR team, mainly used to accelerate reading and computing over OSS data. In actual comparison tests, JindoFS performs far better than an offline HDFS service.

Delta Lake is Databricks' open source storage layer for the data lake. The EMR team has made a series of optimizations around Delta Lake's integration with Presto, Kudu, and Hive, significantly improving performance over the open source version. It is worth mentioning that EMR's Flink is the Ververica enterprise edition, which offers better performance, manageability, and maintainability.

EMR has four main node types: Master, Core, Task, and Gateway.

Master nodes mainly deploy services such as the NameNode, the ResourceManager, and HBase's HMaster, which provide centralized, unified cluster management. If HA is enabled when a production cluster is created, a highly available cluster is set up automatically.

Core nodes mainly deploy YARN's NodeManager and HDFS's DataNode, so they handle both computation and storage. For data reliability, since data is stored on these nodes, they cannot be elastically scaled or use preemptible instances.

Task nodes deploy only the NodeManager and can be elastically scaled in the data lake scenario. When all user data is stored uniformly in object storage, users can exploit the elasticity of Task nodes to respond quickly to business changes and expand computing resources elastically; ECS preemptible instances can also be used to reduce costs. Task nodes support GPU instances as well: many machine learning or deep learning jobs have very short computing cycles (running only once every few days or weeks), and because GPU instances are expensive, scaling them in and out on demand greatly reduces cost.

Gateway nodes mainly deploy client components such as Spark, Hive, and Flink, so that different departments can use different clients or client configurations in complete isolation, while avoiding frequent user logins to the cluster itself.


HDFS has existed for more than 10 years, and its community features are relatively mature and complete. However, it has shortcomings in use. The HA architecture is complex: to implement HA you must deploy JournalNodes and ZKFC, and at very large cluster sizes you must also consider HDFS Federation. At large business scale, DataNode decommissioning becomes very slow: taking nodes offline after a host or disk failure can take 1 to 2 days, and dedicated staff may even be needed to manage decommissioning. Restarting a NameNode may take half a day.

What are the advantages of OSS? OSS is Alibaba Cloud's service-oriented object storage. Its management and O&M costs are very low, and it offers several storage tiers (such as Standard, Infrequent Access, and Archive), which effectively reduce users' costs. Because it is a managed service, users need not worry about the NameNode or Federation, and data durability is excellent (eleven nines, i.e. 99.999999999%). That is why a large number of customers use OSS to build enterprise data lakes. A typical feature of OSS is its openness: basically all cloud products support OSS as their backing storage.
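The cost effect of tiered storage can be sketched with some simple arithmetic. The per-GB prices below are purely hypothetical placeholders, not actual OSS pricing; the point is only that moving cold data to cheaper tiers shrinks the bill when most data is rarely read.

```python
# Sketch: how tiered storage cuts cost when only a fraction of data is hot.
# The per-GB monthly prices are HYPOTHETICAL placeholders, not OSS pricing.
HYPOTHETICAL_PRICE_PER_GB = {
    "standard": 0.020,   # hot data, frequent access
    "ia": 0.010,         # infrequent access
    "archive": 0.004,    # cold data, rarely read
}

def monthly_cost(gb_by_tier: dict) -> float:
    """Total monthly storage cost for data spread across tiers."""
    return sum(HYPOTHETICAL_PRICE_PER_GB[t] * gb for t, gb in gb_by_tier.items())

# ~1 PB total, with only about 10% of it genuinely hot:
flat = monthly_cost({"standard": 1_000_000})
tiered = monthly_cost({"standard": 100_000, "ia": 300_000, "archive": 600_000})
print(flat, tiered)   # with these placeholder prices, tiered is well under half
```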

At the same time, OSS has its problems. Object storage was originally designed to serve general business systems, not big data scenarios, so it runs into performance problems when adapted to big data compute engines (Spark, Flink). A rename is actually implemented as a copy followed by a delete of the original object, so unlike on a Linux file system it is not fast. A list operation must enumerate all matching objects and becomes extremely slow when there are many of them. The window of eventual consistency can also be relatively long, so data inconsistency may occur between reads and writes.
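The rename problem can be made concrete with a toy model (an illustrative sketch, not OSS's actual implementation): because an object store has no metadata-only move, renaming a key means copying the object's bytes to the new key and deleting the old one, so the cost is proportional to object size rather than O(1).

```python
# Toy object store illustrating why "rename" is slow on object storage:
# rename = full copy + delete, not a metadata update as on a POSIX filesystem.
class ToyObjectStore:
    def __init__(self):
        self.objects = {}       # key -> bytes
        self.bytes_copied = 0   # tracks the hidden cost of every "rename"

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src, dst):
        data = self.objects[src]
        self.bytes_copied += len(data)   # the whole object is copied
        self.objects[dst] = data
        del self.objects[src]

store = ToyObjectStore()
store.put("tmp/part-0000", b"x" * 1024)
store.rename("tmp/part-0000", "table/part-0000")  # copies all 1024 bytes
```

This is exactly the pattern a Spark job hits when it commits output by renaming a temporary directory: every file in it is physically copied.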

JindoFS, developed by the EMR team, is based on the open source ecosystem: basically all compute engines can use JindoFS to read, compute over, and query OSS. On one hand, JindoFS brings out the advantages of OSS, such as massive (EB-level) data storage; on the other, it stays flexible: because it preserves OSS semantics, almost all compute engines (as well as other compute products or BI reporting tools) can access the data quickly through a common interface.

JindoFS is also used at large scale on the cloud. When processing HDFS and OSS data, it avoids the performance problems of file operations such as rename and list.

The architecture of JindoFS is shown in the figure below. The master service is the Namespace Service and the worker service is the Storage Service. The master service can be deployed on one or more nodes; the worker service is deployed on every node, and the client is deployed on every EMR machine. When reading or writing data, the client first asks the master service, via the worker service, for the file's location; if the file is not present locally, it is fetched from OSS and cached locally. JindoFS implements an HA architecture, backed locally by RocksDB and remotely by OTS, so it offers both performance and high reliability. JindoFS can also manage permissions through Ranger, and the JindoFS SDK makes it easy to migrate offline HDFS data to OSS for archiving or further use.
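The read path just described can be modeled in a few lines (a simplified illustrative sketch, not the real JindoFS implementation): check node-local storage first, and on a miss fetch from the backing OSS store and cache the result so repeated reads stay local.

```python
# Simplified model of a cache-accelerated read path over object storage:
# first read goes to the remote store, later reads are served locally.
class CachingReader:
    def __init__(self, backing_store: dict):
        self.backing = backing_store   # stands in for OSS
        self.cache = {}                # stands in for node-local storage
        self.remote_reads = 0          # how often we actually hit OSS

    def read(self, path: str) -> bytes:
        if path not in self.cache:                  # cache miss
            self.remote_reads += 1
            self.cache[path] = self.backing[path]   # fetch from OSS, cache it
        return self.cache[path]

oss = {"oss://bucket/log.txt": b"hello"}
reader = CachingReader(oss)
reader.read("oss://bucket/log.txt")   # first read fetches from OSS
reader.read("oss://bucket/log.txt")   # second read is served from the cache
```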

JindoFS supports Block and Cache modes. In Block mode, metadata is kept in local RocksDB and remote OTS rather than relying on OSS's ordinary metadata. When data volume is large (more than a few hundred TB), Block mode performs better, but it is less universal: file block locations and details can only be obtained through JindoFS's own metadata. Block mode also lets users mark which data is hot, warm, or cold. JindoFS can effectively reduce O&M complexity.

Cache mode uses local storage while keeping OSS's own semantics, such as oss://bucket/path. Its advantage is versatility: it can be used not only on EMR but also with other compute engines. Its disadvantage is performance: with very large data volumes, it performs worse than Block mode.

Which of the two modes to use should be decided based on your own business needs.
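The trade-off above can be summarized as a toy decision helper (illustrative only; the "few hundred TB" threshold comes straight from the text, not from any official sizing guide).

```python
# Toy helper capturing the Block-vs-Cache trade-off described above.
# The threshold is an ASSUMED reading of "a few hundred TB", not official guidance.
FEW_HUNDRED_TB = 300 * 1024**4   # ~300 TiB, in bytes

def suggest_jindofs_mode(total_bytes: int, needs_external_engines: bool) -> str:
    """Prefer Cache mode when other engines must read oss:// paths directly;
    otherwise prefer Block mode once data grows past a few hundred TB."""
    if needs_external_engines:
        return "cache"   # universal oss:// semantics matter most
    return "block" if total_bytes > FEW_HUNDRED_TB else "cache"

print(suggest_jindofs_mode(500 * 1024**4, needs_external_engines=False))  # block
print(suggest_jindofs_mode(500 * 1024**4, needs_external_engines=True))   # cache
```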

Elastic scaling

EMR can scale elastically by time or by cluster load (collected from YARN metrics; users can also specify rules manually). When scaling, you can select multiple instance types to avoid job failures caused by a single instance type being out of stock, and you can use preemptible instances to reduce costs.

EMR's data lake solution

As shown in the figure below, this is an offline computing architecture. Data can be synchronized to object storage through Kafka, Log Service, Data Integration (in DataWorks), Lightning Cube (Alibaba Cloud's offline migration product), and other channels, and EMR can read and compute over the data directly. DataWorks or Airflow can be used for workflow scheduling, and the data applications on top include reports, dashboards, APIs, and so on.

The advantage of this offline architecture is that it supports very large scale (EB-level) data storage. OSS connects seamlessly to the big data compute engines in EMR while maintaining excellent performance.

Combined with the capabilities of OSS, we can achieve high-performance reads. On the governance side, permissions for all open source components can be managed uniformly through Ranger.

The following figure shows the scenario of using EMR for ad hoc queries. This scenario has real-time data inflow: real-time computation is accelerated through JindoFS and then connected to BI reports and dashboards through open source compute engines such as Presto and Impala. This scenario is common in real-time data warehouse businesses.

New functions and features of EMR

Customer stories

A typical case of migrating from IDC to the cloud

The customer's data is at the PB scale. After moving to the cloud, they found that genuinely hot data accounts for only about 10% of the total, and tiered object storage effectively reduced costs. When a cluster is large, pressure on the masters grows, and this customer runs a large number of masters: EMR dynamically scales the masters based on cluster size and load, and the number of masters per compute cluster can also be customized. The customer also uses elastic scaling: a relatively independent, elastically scaled Spark cluster serves AI and ETL workloads. For big data development, the customer uses DataWorks.

High computing performance and strong permission management increase enterprise efficiency

This customer runs a large number of compute clusters. Cluster permission management and metadata management have been unified and abstracted for centralized control: for example, the Hive metastore and JindoFS metadata are shared across multiple clusters. The number of clusters and cluster nodes is small in the daytime but is expanded to meet computing demand during the evening business peak. Airflow is used for workflow scheduling.
