Empowering Open-source Cloud Ecosystems: Development of Alibaba Cloud's Open-source Big Data Platform

By Xiali

Alibaba Cloud E-MapReduce (EMR) is a platform as a service (PaaS) big data processing solution built on Alibaba Cloud Elastic Compute Service (ECS). EMR is based on the open-source Hadoop, Spark, HBase, Hive, and Flink ecosystems. It allows you to use open-source technologies to develop cloud-based big data solutions for a variety of scenarios, such as data warehousing, offline batch processing, online streaming, real-time queries, and machine learning. At the big data ecosystem session of the 2019 Apsara Conference held in Hangzhou, Xiali, a senior product expert at Alibaba, shared his knowledge about how Alibaba Cloud EMR empowers open-source cloud ecosystems.

His presentation was divided into four parts:

Development Milestones
Current Situation of the Cloud
Best Practices of Open-source Cloud Ecosystems
Prospects for Open-source Big Data Platforms

1. Development Milestones

When Alibaba started to develop an open-source big data platform in 2015, three options were available: the open-source Hadoop system, commercial distributions such as Cloudera's Distribution Including Apache Hadoop (CDH) and Hortonworks Data Platform (HDP), and MaxCompute (formerly ODPS). At that time, Alibaba Cloud wanted to draw on the experience of Amazon EMR to build an open-source big data platform that would deeply integrate big data capabilities and cloud-native capabilities.

Alibaba Cloud started to independently develop an open-source big data platform in June 2015. The first version, which integrated images and scripts, was launched quickly and published to GitHub. With this version, users were able to set up a Spark environment in very little time. The APIs used at the time were similar to the current orchestration service. However, these APIs only supported one-time setup, which made maintenance extremely complex.

Alibaba Cloud EMR was launched as an independent cloud service in November 2015. The name combines two ideas: MapReduce is the programming model that Google designed for large-scale data processing, and the letter E indicates the elasticity of the service.

Now, four years after the launch of Alibaba Cloud EMR, EMR 4.0 is about to be released. The new version will support Hadoop 3.0 and other new features.

Alibaba Cloud EMR will provide a basic platform for open-source ecosystems. This platform allows you to select from a variety of open-source products while using custom capabilities as needed. In addition, Alibaba Cloud EMR is intended to provide the computing capabilities of Alibaba's cloud intelligence platform to external users, allowing them to use cloud-native capabilities and elastic scheduling capabilities.

In the future, Alibaba Cloud EMR will integrate diverse open-source technologies and optimize them further. This will make it easier to adapt capabilities to different needs and will improve stability and performance. In addition, Alibaba Cloud hopes to better integrate cloud-native technology through EMR. This will allow you to obtain higher performance by integrating your open-source or user-created big data solution with cloud-native technology and infrastructure.

In short, Alibaba Cloud EMR has reached three development milestones. The first milestone was that Alibaba Cloud EMR was initially developed as a big data solution similar to Amazon EMR.

The second milestone came when Alibaba Cloud EMR was adjusted to pay more attention to resident clusters and improve job performance and scheduling capabilities. This adjustment was made one year after Alibaba Cloud EMR was launched because the purely dynamic operation mode of Amazon EMR did not suit the scenarios in China.

The third milestone came when Alibaba Cloud EMR was adjusted for the second time to better support all types of business scenarios. After the second adjustment, Alibaba Cloud EMR provided complete web console capabilities, ensured cluster high availability and security, and supported peripheral software such as Impala, Kafka, and Druid. Alibaba Cloud EMR also supports deep learning through the machine learning algorithms optimized by Alibaba. Alibaba Cloud EMR continues to be adjusted to provide a more sophisticated big data platform, more intelligent service capabilities, and lighter-weight underlying capabilities. This allows the computing platform to provide all its capabilities to external users.

2. Current Situation of the Cloud

Overview of Cloud Ecosystems

The following figure shows the big data ecosystem of Alibaba Cloud. Two types of data sources are available: (1) open-source systems, such as Hadoop Distributed File System (HDFS) and Kafka, and (2) services provided by Alibaba, such as Object Storage Service (OSS), Server Load Balancer (SLB), ApsaraDB for RDS, and Message Queue (MQ). All data can be computed by using open-source components, such as Hive, Spark, Flink, Presto, and TensorFlow, and also by using the Alibaba-developed MaxCompute, Flink, and TensorFlow. The results can be used in Alibaba systems such as DataWorks, DataV, and Quick BI. At present, a cloud-based big data solution can be viewed as a semi-hosted service, allowing you to perform O&M and use related supporting services on the Alibaba Cloud platform.

Diverse Storage Options

In Alibaba Cloud, big data can be stored in Apache HDFS, Alibaba HDFS, and OSS.

Apache HDFS supports three storage modes: Elastic Block Storage (EBS), D1, and combined I1 and I2. EBS stores data reliably in cloud disks. However, it creates multiple data replicas in the background, which increases costs, and it retrieves data over the network, which reduces performance. D1 and combined I1 and I2 store data in local disks, which is highly cost-effective. However, the stored data is easily lost, and the maintenance costs are high.

Alibaba HDFS stores data reliably at a reasonable cost. All the stored data is transmitted over the network, so local computing is not provided.

OSS provides standard data storage. NativeOSS, which is an optimized version of OSS released by Alibaba, allows you to directly read and write data from and to OSS in Hadoop. NativeOSS stores data reliably at low cost. However, its performance is relatively low. Alibaba further improved NativeOSS and developed JindoFS. In addition to reliable and low-cost data storage, JindoFS features high performance and good versatility. However, this comes with higher storage costs.

Elasticity Practices

Elasticity is key to cloud computing because it helps maximize the value of the cloud. To leverage cloud elasticity, each cloud service provider deploys big data capabilities in an architecture that consists of primary nodes and a group of worker nodes, also known as task nodes. Task nodes only implement computing and do not store data. This allows you to scale task nodes in and out when you run computing tasks on the cloud. You can also stop instances as needed to reduce costs. For example, you can buy extra task nodes to run more computing tasks. After these tasks are completed, you can release the extra task nodes. On the Alibaba Cloud platform, task nodes can be scaled in and out by time or by load.
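The two scaling modes mentioned above, by time and by load, can be sketched roughly as follows. This is only an illustration of the idea, not the Alibaba Cloud EMR API: the names `TimeRule`, `nodes_by_time`, and `nodes_by_load`, as well as the thresholds and step size, are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import time
from typing import Sequence


@dataclass
class TimeRule:
    """Hypothetical rule: run `nodes` task nodes from `start` up to (not including) `end`."""
    start: time
    end: time
    nodes: int


def nodes_by_time(now: time, rules: Sequence[TimeRule], default: int) -> int:
    """Scale by time: return the node count of the first matching rule, else `default`."""
    for rule in rules:
        if rule.start <= now < rule.end:
            return rule.nodes
    return default


def nodes_by_load(current: int, load: float, min_nodes: int, max_nodes: int,
                  high: float = 0.8, low: float = 0.2, step: int = 2) -> int:
    """Scale by load: add `step` nodes above `high`, remove `step` below `low`.

    `load` is an assumed utilization ratio between 0 and 1. The result is
    clamped to the range [min_nodes, max_nodes].
    """
    if load > high:
        target = current + step
    elif load < low:
        target = current - step
    else:
        target = current
    return max(min_nodes, min(max_nodes, target))
```

Because task nodes hold no data, either function can shrink the group without any data migration, which is what makes this kind of rule safe to apply to task nodes only.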

Cluster Architecture

The following figure shows the cloud-based cluster architecture recommended by Alibaba Cloud. In the left part of the figure, the Hadoop cluster stores data in OSS at the bottom layer. The upper layer consists of independent computing clusters, such as Hive, Spark, and Presto. These clusters can be released as needed. In the right part of the figure, the Hadoop cluster receives requests through external gateways and clients. If you do not use OSS or a combination of OSS and HDFS to store data on the cloud, this architecture can help you overcome data storage barriers.

3. Best Practices of Open-source Cloud Ecosystems

Selection and Optimization of Storage Methods

In 2015, Alibaba only provided cloud disks, including high-end options such as solid-state disks (SSDs), to store the data of big data platforms deployed in Alibaba Cloud. This resulted in extremely high costs. In 2016, Alibaba Cloud EMR was integrated with OSS, but its user base was small due to limited bandwidth. In 2017, the Alibaba Cloud intelligence team and ECS team worked together to develop the local disk model and the D1 data storage mode, both of which were later adapted to common scenarios in China. Since then, Alibaba Cloud EMR has been improved to support data storage through JindoFS and Alibaba HDFS.

IaaS Layer Upgrade

The infrastructure as a service (IaaS) layer has been upgraded multiple times to make Alibaba Cloud EMR more user-friendly. The IaaS layer was upgraded from the first-generation D1 and I1 to the second-generation D2 and I2. This upgrade increased the network bandwidth and significantly improved the performance of local disks. However, it also increased O&M costs. The system now implements a complete automatic O&M process, including hardware monitoring, warning, notification, and replacement. The hot disk replacement feature helps improve the user experience provided by O&M supporting services.

JindoFS: An Optimized Storage Access Solution

JindoFS is designed to make better use of architectures with separated storage and computing. In JindoFS, all data is stored in OSS, and all computing tasks run in dynamic clusters. Computing can be scaled in and out as needed. JindoFS supports high-performance data access and provides unlimited elastic storage capabilities at low cost. JindoFS uses local caching to significantly reduce latency and improve performance and efficiency. This solves the daunting challenge posed by network bandwidth between OSS and compute clusters. Because JindoFS is based on an architecture with separated storage and computing, losing cached data does not result in data loss.
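The general idea of a local cache in front of remote object storage can be illustrated with a minimal read-through cache. This is a sketch of the technique only and does not reflect how JindoFS is actually implemented: `RemoteStore` is an in-memory stand-in for a service such as OSS, and `CachedReader` with its least-recently-used eviction policy is an assumption made for this example.

```python
from collections import OrderedDict


class RemoteStore:
    """In-memory stand-in for a remote object store (hypothetical)."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0  # counts simulated network round trips

    def get(self, key):
        self.reads += 1
        return self._objects[key]


class CachedReader:
    """Read-through cache that evicts the least recently used entry when full."""

    def __init__(self, store, capacity):
        self._store = store
        self._capacity = capacity
        self._cache = OrderedDict()

    def read(self, key):
        if key in self._cache:
            # Cache hit: serve locally and mark the entry as recently used.
            self._cache.move_to_end(key)
            return self._cache[key]
        # Cache miss: fetch from the remote store and keep a local copy.
        data = self._store.get(key)
        self._cache[key] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # drop the least recently used entry
        return data

    def drop_cache(self):
        """Simulate releasing a compute node: the local cache is discarded."""
        self._cache.clear()
```

In this sketch, repeated reads of the same key reach the remote store only once, and calling `drop_cache()` costs a later re-fetch but never loses data, because the remote store remains the source of truth.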

Integration and Enhancement of More Products

Alibaba Cloud EMR is integrated with more products, such as Spark, Flink, TensorFlow, Elasticsearch, and DataWorks, and new enhancements have been made to the system.

4. Prospects for Open-source Big Data Platforms

Alibaba Cloud EMR will be improved to provide more platform-based solutions and better empower customers' business scenarios. For example, the real-time data warehouse solution and Spark Streaming SQL can synchronize data from business databases to Apache Kudu in real time. They also provide online analytical processing (OLAP) capabilities to analyze the data in business databases in real time.

In the future, Alibaba Cloud EMR will be integrated with Kubernetes to help you further drive down costs. Then, you will be able to complete all kinds of tasks in Alibaba Cloud by using your own resources. For example, you can add Kubernetes nodes to supplement Hadoop nodes when running computing tasks during off-peak hours. During peak hours, you can release the Kubernetes nodes so that they can continue to handle your businesses. This makes better use of resources and improves computing capabilities without creating extra costs.
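The resource-sharing idea can be sketched as a simple capacity calculation: Kubernetes nodes that the online business does not need at a given hour are lent to Hadoop jobs. The functions `lendable_nodes` and `plan_day`, the `reserve` safety margin, and the demand figures are all hypothetical and only illustrate the reasoning.

```python
def lendable_nodes(total_nodes: int, service_demand: int, reserve: int = 1) -> int:
    """Nodes that can be lent to batch jobs while keeping `reserve` spare for services."""
    return max(0, total_nodes - service_demand - reserve)


def plan_day(total_nodes: int, hourly_demand: dict, reserve: int = 1) -> dict:
    """Map each hour to the number of Kubernetes nodes lent to Hadoop jobs.

    `hourly_demand` maps an hour (0-23) to the nodes the online business needs.
    """
    return {hour: lendable_nodes(total_nodes, demand, reserve)
            for hour, demand in hourly_demand.items()}


# Example with assumed numbers: a 10-node pool, quiet at 03:00 and busy at 20:00.
plan = plan_day(10, {3: 2, 20: 9})
```

With these assumed figures, most of the pool is available to batch jobs at night and none during the evening peak, so no extra nodes have to be purchased for the batch workload.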

Alibaba provides a function to transmit cold data to dynamic storage through a private line. This allows you to build both multi-cloud and hybrid cloud environments without having to modify your offline data centers. In addition, Alibaba Cloud EMR allows you to dynamically leverage elastic cloud computing to quickly deliver resources for offline clusters.
