This article has three parts:
1) Data lakes
2) EMR data lake solution
3) Customer case studies
Alibaba Cloud E-MapReduce (EMR) is built on the Alibaba Cloud ecosystem. Its components are 100% open source, and it provides enterprises with stable and highly reliable big data services. EMR was first released in June 2016, and its latest version at the time of writing is 4.4. With EMR, you can choose from more than 10 types of ECS instances to create auto scaling clusters in minutes.
EMR supports OSS, and its self-developed JindoFS greatly improves the performance of OSS. EMR also integrates with the Alibaba Cloud ecosystem: for example, DataWorks and PAI connect seamlessly to EMR. You can also use EMR as a computing engine to process data stored in services such as Log Service and MaxCompute. All EMR components are Apache open-source versions. As the community versions are upgraded, the EMR team makes a series of optimizations and improvements in application and performance for components such as Spark, Hadoop, and Kafka.
EMR adopts a semi-hosted architecture. Users can log on to the ECS nodes in a cluster to deploy and manage their own servers, which gives an experience very similar to that of an on-premises data center. EMR also offers a series of enterprise-grade features, including APM-style alerting and diagnosis at the host, service, and job levels. For authentication, it supports MIT Kerberos, RAM, and HAS, and it uses Ranger as a unified permission management platform.
The following figure shows the overall open-source big data ecosystem of EMR, which covers both software and hardware.
Several planes are involved here.
For example, JindoFS sits on top of the storage layer (OSS).
JindoFS is a set of components developed by the EMR team to accelerate reading and computing on OSS data. In actual comparative tests, JindoFS performed much better than offline HDFS.
Delta Lake is an open-source storage layer for data lakes. The EMR team made a series of optimizations around integrating Delta Lake with Presto, Kudu, and Hive, and significantly improved performance compared to the open-source version. Flink in EMR is the Ververica enterprise version and provides better performance, management, and maintainability.
EMR consists of four node types: master, core, task, and gateway.
Services such as the HDFS NameNode, the YARN ResourceManager, and the HBase HMaster are deployed on master nodes to achieve centralized cluster management. When you create a production cluster, you can enable the high availability (HA) feature so that the cluster is created in HA mode automatically.
Core nodes mainly host the YARN NodeManager and the HDFS DataNode, so they perform both computing and storage. To protect data reliability, core nodes do not support auto scaling or preemptible (spot) instances.
Only a NodeManager is deployed on a task node, so task nodes provide computing capacity only and are well suited to scaling in data lake scenarios. When all user data is stored in OSS, you can use the auto scaling feature of task nodes to respond quickly to business changes and scale computing resources flexibly. You can also use ECS preemptible instances to reduce costs. Task nodes support GPU instances as well. In many machine learning or deep learning scenarios, the computation runs only briefly and infrequently (once every few days or weeks). Because GPU instances are expensive, manually scaling them out only when they are needed greatly reduces costs.
Gateway nodes are used to hold various client components, such as Spark, Hive, and Flink. Departments can use different clients or client configurations for isolation. This also prevents users from frequently logging on to the cluster and performing operations.
HDFS was launched more than 10 years ago, and its community and supporting features are relatively mature. However, it has some drawbacks. The HA architecture is complex: if HA is required, JournalNode and ZKFC must be deployed. When a cluster grows too large, HDFS Federation is needed. At large scale, decommissioning a DataNode also takes a long time. If a host or disk fails, taking the node offline can take 1-2 days and may even require dedicated personnel to manage the decommissioning. Restarting a NameNode may take half a day.
What are the advantages of OSS? OSS is a service-oriented object storage product in Alibaba Cloud with very low management and O&M costs. It provides multiple storage classes (such as standard storage, infrequent access storage, and archive storage), which effectively reduces user costs. Because OSS is a managed service, users do not need to care about NameNode or Federation, and data reliability is very high (11 nines). Therefore, many customers use OSS to build enterprise data lakes. OSS is also highly open: almost all cloud products support OSS as backend storage.
OSS also has some problems. In the beginning, OSS was mainly used to store data in big data scenarios in conjunction with business systems. Because OSS is designed for general scenarios, performance problems appear when it is adapted to big data computing engines such as Spark and Flink. A rename is actually carried out as a move, which means the file is physically copied, so it is far slower than a rename in a Linux file system. A list operation requests all objects, so it becomes extremely slow when there are too many objects. The eventual consistency window is also relatively long, so data inconsistency may occur while data is being read or written.
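To make the rename and list costs concrete, here is a minimal sketch using a toy in-memory object store. It is not the OSS API: the class `ToyObjectStore`, its methods, and the `job/_temporary/` key layout are all hypothetical and only illustrate why a flat key space turns a directory rename into copy-and-delete.

```python
class ToyObjectStore:
    """Toy flat key-value store (hypothetical, not the OSS SDK)."""

    def __init__(self):
        self.objects = {}      # key -> bytes
        self.bytes_copied = 0  # data physically moved by "renames"

    def put(self, key, data):
        self.objects[key] = data

    def list(self, prefix):
        # There is no directory tree, so every key has to be checked.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, src, dst):
        # "Renaming a directory" copies each object to a new key,
        # then deletes the old key.
        for key in self.list(src):
            data = self.objects[key]
            self.objects[dst + key[len(src):]] = data
            self.bytes_copied += len(data)
            del self.objects[key]


store = ToyObjectStore()
for i in range(3):
    store.put(f"job/_temporary/part-{i}", b"x" * 1000)

# Committing job output by "renaming" the temporary directory.
store.rename_prefix("job/_temporary/", "job/output/")
print(store.list("job/"))
print(store.bytes_copied)
```

In this toy model the cost of the rename grows with the total size of the data, whereas a rename in a local file system only updates metadata.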
JindoFS is developed based on the open-source ecosystem. You can use JindoFS to read data from OSS and query it in almost all computing engines. JindoFS inherits the advantages of OSS, including EB-level data storage. It also offers high flexibility: because it uses OSS semantics, other computing services and BI report tools can all access the data quickly. In this sense, JindoFS is a generic API.
JindoFS is widely used in the cloud. When it processes data in HDFS and OSS, it avoids the performance problems of rename, list, and other file operations.
The following figure shows the architecture of JindoFS. The namespace service is the master service, and the storage service is the slave service. The master service is deployed on one or more nodes, the slave service is deployed on every node, and the client is deployed on each EMR machine. When data is read or written, the system first sends a request to the master service through the slave service to obtain the location of the file. If the file does not exist locally, it is fetched from OSS and cached locally. JindoFS implements an HA architecture: local HA is implemented through RocksDB, and remote HA is implemented through OTS. As a result, JindoFS achieves both performance and reliability. JindoFS uses Ranger for permission management. You can use the JindoFS SDK to migrate data from on-premises HDFS to OSS for archiving or further use.
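The read path described above can be sketched as a read-through cache. This is a simplified illustration and not JindoFS code: `ToyNamespace`, `ToyStorageNode`, the single-node setup, and the use of a plain dictionary in place of OSS are all assumptions made for the example.

```python
class ToyNamespace:
    """Stand-in for the master (namespace) service: tracks where a file is cached."""

    def __init__(self):
        self.cached_on = {}  # path -> name of the node holding a local copy

    def locate(self, path):
        return self.cached_on.get(path)

    def register(self, path, node_name):
        self.cached_on[path] = node_name


class ToyStorageNode:
    """Stand-in for the slave (storage) service on one node."""

    def __init__(self, name, namespace, oss):
        self.name = name
        self.namespace = namespace
        self.oss = oss            # dict standing in for the remote object store
        self.local_cache = {}
        self.oss_reads = 0

    def read(self, path):
        # 1. Ask the master service where the file lives.
        if self.namespace.locate(path) == self.name:
            return self.local_cache[path]
        # 2. Not cached locally: fetch from the remote store and cache it.
        data = self.oss[path]
        self.oss_reads += 1
        self.local_cache[path] = data
        self.namespace.register(path, self.name)
        return data


oss = {"oss://bucket/logs/a.txt": b"hello"}
node = ToyStorageNode("node-1", ToyNamespace(), oss)
first = node.read("oss://bucket/logs/a.txt")   # goes to the remote store
second = node.read("oss://bucket/logs/a.txt")  # served from the local cache
print(node.oss_reads)
```

Only the first read touches the remote store; repeated reads of the same file are served locally, which is the effect the caching layer is meant to achieve.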
JindoFS supports block storage and cache modes. In the block storage mode, its metadata is stored in the local RocksDB and in remote OTS. The block storage mode delivers better performance but is less universal: the location and details of file blocks can only be obtained through the JindoFS metadata. The block storage mode also allows you to designate hot, warm, and cold data. Moreover, JindoFS effectively simplifies O&M.
The cache mode uses local storage as a cache, and its semantics remain those of OSS, with paths such as oss://bucket/path. The advantage of the cache mode is its universality: the data can be used not only in EMR but also by other computing engines. Its disadvantage is performance. When a large amount of data is involved, it is slower than the block storage mode.
Select a mode based on your business requirements.
EMR supports auto scaling based on time and on cluster load (YARN metrics are collected, and the scaling rules can be specified manually). When you use auto scaling, select multiple instance types to avoid job failures caused by insufficient resources. You can also use preemptible instances to reduce costs.
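As a rough sketch of how a time-based rule and a load-based rule can be combined, consider the function below. It is not the EMR auto scaling API: the function name, the batch window, the thresholds, and the `available_memory_pct` input (standing in for a YARN metric) are all hypothetical values chosen for illustration.

```python
from datetime import time

# Hypothetical time-based rule: keep at least 10 task nodes during a nightly batch window.
BATCH_START = time(1, 0)
BATCH_END = time(5, 0)
BATCH_FLOOR = 10


def desired_task_nodes(current, now, available_memory_pct,
                       min_nodes=2, max_nodes=20):
    """Return the target number of task nodes for the next scaling step."""
    target = current

    # Load-based rules (thresholds are assumptions, not EMR defaults).
    if available_memory_pct < 20:
        target += 2   # cluster is short of memory: scale out
    elif available_memory_pct > 70:
        target -= 1   # cluster is mostly idle: scale in

    # Time-based rule acts as a floor inside the batch window.
    if BATCH_START <= now < BATCH_END:
        target = max(target, BATCH_FLOOR)

    # Never leave the configured bounds.
    return max(min_nodes, min(max_nodes, target))


print(desired_task_nodes(4, time(12, 0), available_memory_pct=10))
print(desired_task_nodes(4, time(2, 0), available_memory_pct=90))
```

Clamping the result to a minimum and maximum keeps a misconfigured rule from draining the cluster or growing it without limit.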
In this article, we will discuss how you can use EMR Spark Relational Cache to accelerate queries by quickly extracting the target data from a cube that contains a large volume of data. Specifically, you can do so through techniques such as columnar storage, file indexing, and Z-order. These techniques let you identify the target data quickly through filtering, which greatly reduces the actual I/O volume, avoids I/O bottlenecks, and improves overall query performance.
We will look at how you can use file indexing to improve efficiency by narrowing the scope of queries. Following this, we will discuss what file indexing does and what it means for you.
If the total data volume is large, the number of stored files will also be high. In this case, even if the Parquet footers give good filtering results, tasks still have to be started just to read those footers. In an actual Spark implementation, the number of footer-reading tasks is normally similar to the number of files, so scheduling them can be time-consuming, especially when cluster computing resources are limited. To reduce the scheduling overhead of Spark jobs and improve execution efficiency, you can index files.
File indexes are similar to independent footers. With file indexes, you can collect the maximum and minimum data values of each field in each file in advance, and then store these values in a hidden data table. By doing this, you will only need to read the independent table and perform filtering at the file level to obtain the target file.
This is an independent stage, and each file corresponds to only one record in the hidden table. Therefore, reading the hidden table requires far fewer tasks than reading the footers of all data files, and the number of tasks in subsequent stages is also significantly reduced. In access scenarios with Relational Cache, the overall acceleration effect of this solution is quite noticeable.
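The idea of file-level pruning with minimum and maximum values can be sketched in a few lines. This is an illustration only and not the Relational Cache implementation: the `file_index` layout, the file names, the `day` column, and the `prune_files` helper are hypothetical.

```python
# One record per data file, holding the min and max of each indexed column.
file_index = [
    {"file": "part-0.parquet", "min": {"day": 1},  "max": {"day": 10}},
    {"file": "part-1.parquet", "min": {"day": 11}, "max": {"day": 20}},
    {"file": "part-2.parquet", "min": {"day": 21}, "max": {"day": 31}},
]


def prune_files(index, column, low, high):
    """Keep only files whose [min, max] range overlaps the query range [low, high]."""
    return [
        entry["file"]
        for entry in index
        if entry["max"][column] >= low and entry["min"][column] <= high
    ]


# Query: WHERE day BETWEEN 12 AND 15
selected = prune_files(file_index, "day", 12, 15)
print(selected)
```

Only the small index is scanned to decide which files to open, so files whose value range cannot match the filter are never scheduled for reading.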
Alibaba Clouder - August 10, 2020