Cloud native open source big data application practice

With the deep integration of open source technology and cloud native infrastructure, Alibaba Cloud's open source big data platform has accumulated rich practical experience in functionality, ease of use, and security. It has served thousands of enterprises, helping them focus on their core business, shorten development cycles, simplify operation and maintenance, and explore more business innovation. On October 29, Alibaba Cloud released the session "How to Build a Cloud Native Open Source Big Data Platform" and invited technical experts from Alibaba Cloud, Weimiao, and Inmobi to share their cloud practices.

This article mainly shares hands-on practice with cloud native open source big data applications across scenarios such as data lakes, real-time data warehouses, and retrieval and analysis.

Speaker: Liu Yuquan, Big Data Solution Architect, Alibaba Cloud Intelligence

Video address: https://yqh.aliyun.com/live/bigdataop

1. Preface

As data volumes grow across industries, self-built big data infrastructure has gradually exposed problems such as long procurement cycles, high operation and maintenance costs, and complex technology stacks. Moving to the cloud has become the better way to address these problems: enterprises gain not only powerful cloud infrastructure but also a rich ecosystem and elastic computing and storage, which reduce costs and improve efficiency.

Internet applications generally have requirements such as content recommendation, real-time risk control, and information retrieval, and meeting these requirements demands strong computing and storage capabilities.



2. Cloud native open source big data unified platform

The cloud native open source big data unified platform is built on cloud IaaS infrastructure and cloud native resources, such as ECS (virtual machines) and OSS (object storage). On top of these, the platform provides basic capabilities such as elastic scaling, intelligent diagnosis, data development, and monitoring and alerting.

At the same time, the entire cloud native big data product line is built on the open source Apache Hadoop ecosystem and the Kubernetes (K8s) resource management platform, and the products can be roughly divided into two categories:

• Semi-managed form

Represented by E-MapReduce (EMR for short, the same below), this form provides users with mainstream open source big data components such as Hive, Spark, Kafka, Presto, and ClickHouse. Users can freely combine them, and EMR provides cluster management and control features that make these open source components easier to use.

• Fully managed form

Mainstream computing engines and platforms such as Flink, Spark, Hadoop, Kafka, and Elasticsearch are offered as fully managed services. The fully managed real-time computing platform Flink (Ververica), the Databricks platform from the original creators of Spark, Cloudera's CDP platform, and Alibaba Cloud Elasticsearch give different users different ways to consume open source products.

At the same time, Alibaba Cloud also offers a data lake product, Data Lake Formation (DLF), which provides unified metadata access and management and combines with other products to form a complete data lake solution.

3. Open source big data on the cloud

Open source big data platform E-MapReduce

Open source big data on the cloud is mainly delivered through EMR, a big data processing solution that runs on the Alibaba Cloud platform. It optimizes and enhances the open source components, achieving performance far higher than the open source versions, and it keeps pace with upstream upgrades, adapting to each component and maintaining stability while ensuring compatibility:

• Compatible with open source big data components

Components such as Spark, Hadoop, and Kafka are optimized and enhanced on top of their open source versions, greatly improving performance.

• Semi-managed form

In the semi-managed architecture, users retain a high degree of autonomy and control, and existing big data workloads can be migrated seamlessly.

• Cloud native

EMR supports dozens of ECS instance families, including compute-optimized, memory-optimized, general-purpose, big data, and GPU heterogeneous computing instances, matching different big data scenarios. It provides minute-level cluster creation and expansion, and supports elastic scaling and preemptible (spot) instances.

Big data workloads generally have obvious peaks and valleys. For example, batch tasks in the early morning need a relatively high SLA guarantee, while the daytime is usually a resource valley used mainly for development tasks, so elastic scale-in during the valley saves cost.

• Cloud native support for Alibaba Cloud's object storage OSS

JindoFS is used to accelerate OSS access and reduce data storage costs. Through deep integration with other Alibaba Cloud products, EMR can serve as the compute and storage engine behind DataWorks, and by integrating with Data Lake Formation (DLF) it achieves unified metadata management across multiple engines in data lake scenarios.

• Enterprise features

For example, EMR's APM provides monitoring, alerting, and diagnosis at the cluster, host, service, and job levels. It supports Kerberos as the authentication platform and Ranger together with RAM for permission management to ensure data security, and Alibaba Cloud resource groups and tags make cost accounting easier for enterprises.



Open source big data cloud solution

Moving open source big data to the cloud means migrating self-built or third-party big data platforms in an IDC to Alibaba Cloud, continuing the open source technology stack through EMR and linking the Alibaba Cloud ecosystem with open source big data.

Depending on data scale and budget, data and jobs can be migrated to the cloud efficiently via Lightning Cube (offline data transport devices), dedicated lines, or the public network, and after migration they can integrate with the entire Alibaba Cloud data ecosystem:

• Integrated DataWorks

Provides an efficient, secure, and reliable one-stop big data development and governance platform.

• Integrated object storage OSS

All computing engines in EMR support OSS as storage, which can be used like HDFS, with JindoFS accelerating OSS reads and writes. JindoFS is an important component of the data lake: a large number of users now use it to build data lakes on the cloud, implement tiered data storage, and reduce storage costs (a PySpark sketch of accessing OSS follows this list).

• Integrated with Data Lake Formation (DLF)

By default, EMR supports using DLF for metadata management, which is convenient in data lake scenarios. DLF works together with Alibaba Cloud OSS, which serves as the unified storage of the cloud data lake.

On the cloud, a variety of computing engines can be used for different big data scenarios on top of a unified data lake storage layer, avoiding the complexity of data synchronization and part of the operation and maintenance cost.
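As a concrete illustration of using OSS like HDFS, below is a minimal PySpark sketch that reads raw data from an OSS path and writes aggregated results back. The bucket name, paths, endpoint, and credentials are placeholders; on an EMR cluster the OSS/JindoFS access configuration is normally preset, so the explicit fs.oss.* settings may not be needed.

```python
# Minimal PySpark sketch: reading and writing OSS paths as if they were HDFS.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("oss-data-lake-example")
    # Hadoop OSS connector settings; skip these if the cluster already provides them.
    .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-internal.aliyuncs.com")
    .config("spark.hadoop.fs.oss.accessKeyId", "<ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.oss.accessKeySecret", "<ACCESS_KEY_SECRET>")
    .getOrCreate()
)

# Read raw events from the data lake, aggregate, and write the result back to OSS.
events = spark.read.parquet("oss://my-datalake-bucket/raw/events/dt=2021-10-29/")
daily_pv = events.groupBy("page_id").count().withColumnRenamed("count", "pv")
daily_pv.write.mode("overwrite").parquet("oss://my-datalake-bucket/dw/daily_pv/dt=2021-10-29/")
```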



Cloud native elasticity: 20% cost optimization

Cloud native elasticity can bring a 20% cost optimization.

Traditional big data workloads are highly periodic, with obvious peaks and valleys: load is high in the early morning and low during the day. In the traditional model, users plan cluster capacity for the peak, with a certain degree of resource redundancy.

The EMR JindoFS + OSS data lake solution achieves roughly the same performance as HDFS. It upgrades the customer's big data architecture on the cloud, separates storage from compute, lets customers benefit from elastic scaling of both compute and storage, and allows them to focus more on application-layer development.

With the storage-compute separation architecture, the cluster scales elastically according to the business cycle and load, using a fixed resource pool together with an elastic resource pool as follows:

For relatively stable computing workloads, use the fixed resource pool to guarantee that resources are reserved to complete the computation.

For peak or burst computing tasks, use the elastic resource pool to reduce wasted computing resources.
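The sketch below illustrates the fixed-plus-elastic pool idea in a few lines of Python. The node counts, the peak window, and the scale_to call are hypothetical and for illustration only; in practice this role is played by the cluster's time-based or load-based auto-scaling rules.

```python
# Hypothetical sketch of the fixed + elastic resource pool idea.
from datetime import datetime

FIXED_NODES = 20           # always-on pool sized for routine daytime development work
PEAK_EXTRA_NODES = 60      # elastic (e.g. preemptible/spot) nodes for the nightly batch window
PEAK_HOURS = range(0, 6)   # assume SLA-critical batch jobs run between 00:00 and 06:00

def desired_cluster_size(now: datetime) -> int:
    """Return the target number of task nodes for the given time."""
    extra = PEAK_EXTRA_NODES if now.hour in PEAK_HOURS else 0
    return FIXED_NODES + extra

if __name__ == "__main__":
    target = desired_cluster_size(datetime.now())
    print(f"scale the task node group to {target} nodes")
    # scale_to(target)  # placeholder for a call to the cluster's scaling API (hypothetical)
```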

4. Real-time applications on the cloud

Typical real-time application scenarios

After many years of big data development, large-scale computing capability is no longer the bottleneck; timeliness has become the key feature.

In Alibaba's Double 11 scenario, almost all data analysis on the day is real time and updated at second-level latency. Because offline processing increasingly cannot keep up with business needs, more real-time processing capabilities are required, such as real-time data warehouses, real-time dashboards, and real-time reports.

Marketing teams adjust advertising placement strategies in real time based on real-time campaign statistics; real-time recommendation computes a user's interests from their real-time behavior and helps select appropriate content; real-time risk control judges from behavioral features whether a user is cheating and applies penalties to cheating users in real time.

Realtime Compute for Apache Flink

Alibaba Cloud Realtime Compute for Apache Flink is a one-stop real-time big data analysis platform built on Apache Flink. It provides standard SQL, lowers the barrier to business development, and helps enterprises move toward real-time, intelligent big data computing.

As a stream computing engine, Flink can process a variety of real-time data, including online service logs, IoT sensor data, and binlogs from RDS.

Flink subscribes to Kafka, analyzes and processes the real-time data in the message queue, and writes the results to different data stores in real time, such as ClickHouse, Hologres, and Elasticsearch, which in turn serve upper-layer data applications through data services.
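As a minimal sketch of this pattern, the PyFlink SQL snippet below declares a Kafka topic as a streaming source table. The topic name, broker address, and schema are assumptions for illustration, and the Kafka SQL connector jar must be available to the Flink runtime.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a streaming TableEnvironment.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare the Kafka topic as a dynamic table that Flink reads continuously.
t_env.execute_sql("""
    CREATE TABLE user_events (
        user_id    STRING,
        page_id    STRING,
        event_type STRING,
        ts         TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'user_events',
        'properties.bootstrap.servers' = 'kafka-broker:9092',
        'properties.group.id' = 'realtime-etl',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")
```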

Realtime Compute for Apache Flink is built on a platform base that provides serverless services and fully managed containers:

• Computing engine runtime

It includes deep optimizations such as the self-developed streaming state storage engine Gemini, improved SQL operators and job scheduling, and a rich set of out-of-the-box Flink connectors; the development platform provides full lifecycle management of jobs.

• AutoPilot intelligent tuning

While keeping the performance of each operator and the upstream and downstream of the streaming job stable, AutoPilot adjusts job parallelism and resource allocation to optimize the job globally, solving tuning problems such as full-link backpressure caused by insufficient throughput and resource waste.

• Prometheus full-link monitoring and alerting

Flink jobs in production need to run 24/7, so a complete monitoring system is needed to provide comprehensive job metrics, check the health of running jobs, and notify the relevant people promptly when a job behaves abnormally.

Best practices for an open source real-time data warehouse on the cloud

On the cloud, a full-link real-time data warehouse can be built with Flink, Kafka, and ClickHouse.

Business logs and business data are collected into the Kafka message engine in real time; Flink performs real-time ETL and aggregation; and the results are stored in ClickHouse, which supports upper-layer big data applications such as real-time reports, real-time marketing analysis, and A/B experiments.
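Continuing the sketch above, the snippet below aggregates page views per minute and writes the result to ClickHouse through a JDBC-style sink table. The table schema and connection options are assumptions, and it presumes a Flink connector capable of writing to ClickHouse (for example a JDBC-based one) is on the classpath; the exact connector name and options may differ depending on the connector used.

```python
# Sink table in ClickHouse, declared via a JDBC-style connector (an assumption, see above).
t_env.execute_sql("""
    CREATE TABLE pv_per_minute (
        window_start TIMESTAMP(3),
        page_id      STRING,
        pv           BIGINT
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:clickhouse://clickhouse-host:8123/default',
        'table-name' = 'pv_per_minute',
        'username' = '<USER>',
        'password' = '<PASSWORD>'
    )
""")

# One-minute tumbling-window aggregation, continuously written into ClickHouse.
t_env.execute_sql("""
    INSERT INTO pv_per_minute
    SELECT TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           page_id,
           COUNT(*) AS pv
    FROM user_events
    GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE), page_id
""")
```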

5. Retrieval applications on the cloud

Typical retrieval application scenarios

Today every mobile internet user queries all kinds of information every day, such as nearby restaurants and hotels, shopping orders, and logistics updates.

This makes it necessary to help users obtain information efficiently and to provide an information retrieval service over massive data.

For example, we search for products we are interested in when shopping, look for distinctive restaurants and cafes nearby when dining with friends, and R&D engineers analyze and troubleshoot problems through logs when a business system throws exceptions.

Alibaba Cloud Elasticsearch

All the scenarios above require an information retrieval service. Elasticsearch has powerful full-text retrieval capabilities, supporting complex combined conditions and fuzzy queries, and can easily handle retrieval over all kinds of text and geographic location data.
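As a small illustration, the query below combines a fuzzy full-text match with a geographic distance filter using the Python Elasticsearch client (8.x-style keyword arguments assumed). The index name, field names, and coordinates are assumptions, and the location field is assumed to be mapped as geo_point.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fuzzy full-text match on the name field, filtered to within 2 km of a point.
resp = es.search(
    index="restaurants",
    query={
        "bool": {
            "must": [
                {"match": {"name": {"query": "coffe", "fuzziness": "AUTO"}}}
            ],
            "filter": [
                {"geo_distance": {"distance": "2km",
                                  "location": {"lat": 30.27, "lon": 120.16}}}
            ],
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_score"])
```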

Alibaba Cloud Elasticsearch provides a fully managed ELK service that is 100% compatible with open source and includes the X-Pack commercial plug-in for free, available on demand. It also delivers deep functional and kernel-level performance optimizations, richer analysis and retrieval capabilities, and more secure and highly available services:

• Overall cost reduction

Compared with self-built Elasticsearch, the fully managed service removes the need to operate the underlying resources, lowering operation and maintenance costs.

• Cluster control

Provides elastic scale-out and scale-in of clusters, along with EYou intelligent operation and maintenance and unified monitoring.

• Capability differences from self-built ES

The X-Pack plug-in is provided free of charge, together with NLP word segmentation and vector retrieval plug-ins.

• Safe and highly available

The X-Pack security components and field-level access control meet security requirements, along with high availability and automatic data backup, and the intra-city active-active architecture achieves 99.999999% service reliability.

Best practices for information retrieval applications on the cloud

In the course of operation, every enterprise produces a large amount of data, both structured and semi-structured.

Examples include industry knowledge, geographic location information, order information, and audio and video data, which may be stored in RDS databases, OSS object storage, or big data storage engines.

Through data integration tools, this data can be synchronized to a message engine or a data warehouse, processed in real time with Flink or computed offline with MaxCompute or EMR Spark and Hive, and the results are then saved in Elasticsearch, which provides retrieval services such as full-text search and address search for the upper-layer data applications.
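For illustration, the sketch below bulk-indexes a couple of processed order records into Elasticsearch so they become searchable by full text and by address. The index name and document fields are assumptions; in a real pipeline, Flink or Spark jobs would usually write to Elasticsearch through their own connectors rather than a standalone script.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Example documents produced by the upstream ETL (fields are illustrative).
orders = [
    {"order_id": "o-1001", "user_id": "u-42",
     "address": "No. 969 West Wenyi Road, Hangzhou", "status": "shipped"},
    {"order_id": "o-1002", "user_id": "u-7",
     "address": "Century Avenue, Pudong, Shanghai", "status": "created"},
]

# Bulk-index the documents; each becomes searchable for full-text and address queries.
actions = (
    {"_index": "orders", "_id": doc["order_id"], "_source": doc}
    for doc in orders
)
helpers.bulk(es, actions)
```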
