Empowering the Open Source Ecosystem on the Cloud

1. Development process

In 2015, when Alibaba first started to build an open source big data platform, there were three options: the open source Hadoop system, distributions such as CDH and HDP, and Alibaba's own ODPS (now MaxCompute). At that time, AWS on the other side of the ocean already had a big data product called EMR, so Alibaba Cloud hoped to learn from AWS's experience and build an open source big data platform that deeply combined big data capabilities with cloud native capabilities.

Alibaba Cloud began to develop its own open source big data platform in June 2015 and implemented the first "image + script" version. This version could set up a Spark environment in a very short time, and it went live and was published on GitHub soon afterwards. The API used at that time was very similar to today's orchestration service, but the drawback was that a cluster could only be built once, which made maintenance very troublesome.

In November 2015, Alibaba Cloud officially launched E-MapReduce as an independent cloud product. As everyone knows, the idea of MapReduce came from Google and represents the theoretical foundation of big data, so Alibaba named the product E-MapReduce so that its main function would be clear from the name alone.

Four years have passed since Alibaba Cloud E-MapReduce was launched. E-MapReduce 4.0 will be released soon, and the new version will support Hadoop 3.0 and other new features.

Today, Alibaba Cloud E-MapReduce provides a basic platform for the open source ecosystem. On this platform, you can choose from a variety of open source products without giving up the ability to customize. In addition, Alibaba Cloud E-MapReduce also aims to export the computing power of Alibaba's cloud intelligence platform to everyone, providing cloud native capabilities and elastic scheduling capabilities. In the future, E-MapReduce will gradually integrate more open source technologies and capabilities, and offer better-optimized versions built on top of them, improving both stability and performance. The last point is the combination with cloud native: it is often difficult to combine open source or self-built big data solutions with cloud native technology and infrastructure, or to obtain high performance from such combinations, so Alibaba hopes that E-MapReduce can integrate much better with cloud native technology.

To summarize the development history of Alibaba E-MapReduce: at the very beginning, an AWS EMR-like product was implemented. After one year of operation, it was found that AWS's purely dynamic (transient cluster) model did not suit domestic scenarios, so a first adjustment was made, paying more attention to long-running resident clusters and strengthening job scheduling capabilities. After the first adjustment, Alibaba Cloud found that E-MapReduce's capabilities still could not meet customer needs, so in a second adjustment it provided a complete web console, supported high availability and high security for clusters, and added software such as Impala, Kafka, and Druid to better support various business scenarios. It also supports deep learning scenarios and provides machine learning algorithms optimized by Alibaba itself on the platform. Today, E-MapReduce is still evolving, aiming to provide a more complete big data platform and more intelligent service capabilities, make the underlying layer lighter, and export the overall capabilities of the computing platform to the outside world.

2. Current status on the cloud

Ecosystem overview on the cloud

The figure below shows an overview of Alibaba Cloud's big data ecosystem. On the data source side, the open source options include HDFS and Kafka, while Alibaba provides services such as OSS, SLS, RDS, and message queues. All of this data can be processed with open source engines such as Hive, Spark, Flink, Presto, and TensorFlow, as well as Alibaba services such as MaxCompute and the hosted Flink/TensorFlow offerings, and it can also be integrated with Alibaba's own systems such as DataWorks, DataV, and QuickBI. At present, the big data solution on the cloud can be regarded as a semi-managed service: Alibaba Cloud helps customers with operation and maintenance and provides O&M support services.
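To make the flow through this ecosystem concrete, here is a minimal PySpark sketch of the typical path from a data source to a compute engine and back to storage for downstream BI. The bucket name, paths, and schema are hypothetical, and OSS access is assumed to be already configured on the E-MapReduce cluster.

```python
# Minimal sketch: read raw logs from OSS, aggregate with Spark SQL,
# and write the result back for downstream tools such as Hive, Presto, or QuickBI.
# Bucket, paths, and schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ecosystem-overview-demo").getOrCreate()

# Source: JSON logs landed in OSS (could equally be HDFS, Kafka, RDS, etc.)
logs = spark.read.json("oss://example-bucket/raw/app_logs/")

# Compute: a simple daily page-view aggregation
daily_pv = (
    logs.groupBy(F.to_date("event_time").alias("day"), "page")
        .agg(F.count("*").alias("pv"))
)

# Sink: columnar output that downstream engines can query later
daily_pv.write.mode("overwrite").parquet("oss://example-bucket/dw/daily_pv/")

spark.stop()
```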

Various storage options

On Alibaba Cloud, there are three main options for big data storage: Hadoop HDFS, Alibaba HDFS, and OSS. Hadoop HDFS itself has three storage choices. EBS cloud disks store data reliably, but because multiple copies are kept in the background, the cost is high and the performance of reading data over the network is low. D1 local disks and I1/I2 local disks offer relatively high performance at relatively low cost, but data is easily lost and the operation and maintenance cost is high. Another option is Alibaba HDFS, which stores data reliably at a medium cost, with all data transferred over the network and no local computation. OSS standard storage can be read and written directly from Hadoop after Alibaba's transformation and optimization; this is the so-called NativeOSS. NativeOSS stores data reliably, with low cost and good compatibility, but its performance is relatively low. Therefore, NativeOSS was further strengthened into JindoFS. JindoFS achieves reliable data, low cost, high performance, and good compatibility, but requires additional storage cost.
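For reference, this is roughly what pointing Spark at OSS looks like through the Apache hadoop-aliyun connector. On E-MapReduce the connector and credentials are normally preconfigured, so treat the endpoint, keys, and property values below as assumptions for a self-managed setup rather than required settings.

```python
# Sketch: configuring Spark to read oss:// paths via the Apache hadoop-aliyun
# connector. On E-MapReduce this is usually preconfigured; the endpoint and
# credentials below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("native-oss-demo")
        .config("spark.hadoop.fs.oss.impl",
                "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem")
        .config("spark.hadoop.fs.oss.endpoint",
                "oss-cn-hangzhou-internal.aliyuncs.com")
        .config("spark.hadoop.fs.oss.accessKeyId", "<ACCESS_KEY_ID>")
        .config("spark.hadoop.fs.oss.accessKeySecret", "<ACCESS_KEY_SECRET>")
        .getOrCreate()
)

# With the connector in place, oss:// paths behave like any Hadoop filesystem.
df = spark.read.parquet("oss://example-bucket/dw/daily_pv/")
df.show(10)
```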

Elasticity practices

Computing on the cloud needs to make full use of elasticity, otherwise the true value of the cloud cannot be realized. To exploit this elasticity, the big data offerings of the major cloud vendors all have Master nodes plus a set of worker (Task) nodes. Task nodes only perform computation and do not store data, so they can be elastically scaled when running computing jobs on the cloud, and cost can also be reduced by stopping instances: Task nodes are purchased during peak computing periods and released after the peak has passed. Alibaba Cloud also provides customers with a scaling mechanism that can scale by time or by load, as illustrated in the sketch below.
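The actual scaling rules are configured in the E-MapReduce console or API; the snippet below is only a conceptual illustration of how a load-based rule might derive the desired number of Task nodes from YARN queue pressure. The metric names and thresholds are hypothetical.

```python
# Conceptual sketch of a load-based scaling rule for Task nodes
# (not the actual E-MapReduce auto-scaling API; thresholds are hypothetical).

def desired_task_nodes(current_nodes: int,
                       pending_containers: int,
                       running_containers: int,
                       min_nodes: int = 0,
                       max_nodes: int = 50) -> int:
    """Scale out when YARN has a backlog of pending containers,
    scale in when the cluster is mostly idle."""
    if running_containers == 0 and pending_containers == 0:
        return max(min_nodes, current_nodes - 2)          # idle: shrink slowly
    backlog_ratio = pending_containers / max(running_containers, 1)
    if backlog_ratio > 0.5:                                # heavy backlog: grow
        return min(max_nodes, current_nodes + max(1, current_nodes // 4))
    if backlog_ratio < 0.1:                                # light load: shrink
        return max(min_nodes, current_nodes - 1)
    return current_nodes                                   # otherwise hold steady

# Example: 20 Task nodes, 30 pending vs. 40 running containers -> scale out.
print(desired_task_nodes(20, 30, 40))   # 25
```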

Cluster architecture

The figure below shows the cloud cluster architecture recommended by Alibaba Cloud. The Hadoop cluster shown on the left side of the figure is a long-running, resident cluster. OSS is used for data storage at the bottom layer, and on top of OSS sit several independent computing clusters, such as Hive, Spark, and Presto, all of which can be elastically created and destroyed. On the right side there is also a resident Hadoop cluster, with a Gateway and Client provided on the outside to accept requests. In addition, many customers may not use OSS on the cloud, or may use OSS and HDFS in combination; this cluster architecture can help such users overcome data storage barriers.

3. Best practices of open source ecology on the cloud

Storage Selection and Optimization

In 2015, if you wanted to deploy a big data platform on Alibaba Cloud, you could only choose cloud disk storage, such as high-efficiency cloud disks or SSDs, so the cost was very high. Around 2017, Alibaba Cloud's intelligence team worked with the ECS team on local-disk instance types and later produced D1, which is better adapted to domestic scenarios. In 2016, E-MapReduce was integrated with OSS; at that time, due to bandwidth limitations, few customers used it. Today, building on this development and cooperation experience, E-MapReduce can use JindoFS, Alibaba HDFS, and other options for storage.

IaaS layer upgrade

In order to allow customers to make better use of E-MapReduce, the IaaS layer has undergone several upgrades. The first generation was D1 and I1, and the second generation is D2 and I2, which provide higher network bandwidth and extremely high local disk performance, but also increase the cost of operation and maintenance. Through hot swapping of disks, the whole operation and maintenance support chain offers a better experience, and a complete set of hardware monitoring, early warning, notification, and replacement operations forms a proactive operation and maintenance process.

Storage access optimization scheme JindoFS

The purpose of JindoFS is to make it easier to use an architecture that separates storage and compute. Under the JindoFS architecture, all data is stored on OSS, and all computation is performed on dynamic clusters that can be scaled at any time. JindoFS provides customers with high-performance data access as well as cost-effective, unlimitedly scalable elastic storage. The biggest challenge here is the network bandwidth between OSS and the computing clusters; in the JindoFS solution, local caching technology greatly reduces latency and improves performance. At the same time, because JindoFS is built on the separation of storage and compute, customers do not have to worry about losing cached data.
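From the application's point of view, JindoFS is meant to be largely transparent. The sketch below assumes a JindoFS-enabled E-MapReduce cluster: in cache mode, jobs keep using oss:// paths while hot data is cached on local disks, and in block mode, data is addressed through a jfs:// namespace. The paths, namespace name, and column names are hypothetical, and the exact schemes and configuration depend on the EMR/JindoFS version.

```python
# Sketch of how a job sees JindoFS on an E-MapReduce cluster
# (cluster-side configuration is omitted; paths and names are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jindofs-demo").getOrCreate()

# Cache mode: the job keeps using oss:// paths; JindoFS transparently caches
# hot data on local disks, while OSS remains the source of truth.
events = spark.read.parquet("oss://example-bucket/dw/events/")

# Block mode: data is addressed through a jfs:// namespace backed by OSS.
# (Scheme and namespace layout depend on the EMR/JindoFS version.)
profiles = spark.read.parquet("jfs://example-namespace/dw/profiles/")

events.join(profiles, "user_id").groupBy("country").count().show()
```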

More product integrations and enhancements

Alibaba Cloud E-MapReduce integrates more products, such as Spark, Flink, TensorFlow, Elasticsearch, and DataWorks, and makes enhancements on top of these products.

4. Development prospect of open source big data platform

Alibaba Cloud E-MapReduce hopes to build more solutions on top of the platform to better empower customers' business scenarios. For example, in the real-time data warehouse solution, Spark Streaming SQL is used to synchronize business database data to Kudu in real time, enabling real-time OLAP analysis of the data in the business database.
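The EMR solution itself uses Spark Streaming SQL; since its SQL syntax varies by version, here is a roughly equivalent sketch using the standard Structured Streaming API together with the open source kudu-spark connector. The Kafka topic, record schema, Kudu master address, and table name are assumptions, and the kudu-spark package must be available on the cluster.

```python
# Rough equivalent of the real-time sync pipeline: Kafka change records -> Kudu,
# using Structured Streaming and the open source kudu-spark connector.
# Topic, schema, Kudu master, and table name are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType, StringType, DoubleType

spark = SparkSession.builder.appName("cdc-to-kudu").getOrCreate()

order_schema = StructType([
    StructField("order_id", LongType()),
    StructField("status", StringType()),
    StructField("amount", DoubleType()),
])

# Change records from the business database, published to Kafka (e.g. by a CDC tool).
changes = (
    spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka-broker:9092")
        .option("subscribe", "orders_cdc")
        .load()
        .select(F.from_json(F.col("value").cast("string"), order_schema).alias("o"))
        .select("o.*")
)

def write_to_kudu(batch_df, batch_id):
    # kudu-spark writes each micro-batch into the target table
    # (insert vs. upsert behavior depends on the connector version/options).
    (batch_df.write.format("org.apache.kudu.spark.kudu")
        .option("kudu.master", "kudu-master:7051")
        .option("kudu.table", "impala::default.orders")
        .mode("append")
        .save())

(changes.writeStream
    .foreachBatch(write_to_kudu)
    .option("checkpointLocation", "oss://example-bucket/checkpoints/orders_cdc/")
    .start()
    .awaitTermination())
```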

In the future, EMR will integrate with Kubernetes (K8s), hoping to help customers save further costs: users can bring their own resources to run various workloads on Alibaba Cloud, and customers can add K8s nodes to the Hadoop cluster as supplementary compute, returning them to the business during business peaks. In this way, computing power can be increased without additional cost and resources can be utilized more fully.

Many users also want multi-cloud and hybrid-cloud deployments, so Alibaba hopes to give customers the same experience as their offline IDC: cold data can be moved to elastic cloud storage over dedicated lines, E-MapReduce's elastic capabilities can be used in combination with offline clusters, and both offline and online capacity can be fully utilized.
