How Alibaba Cloud Helps Build a New Cloud-Native Data Lake System

Amid the surging wave of informatization, big data technology keeps iterating, data management tools have developed rapidly, and new concepts have emerged along the way. Since the data lake concept was introduced in 2011, its positioning, architecture design, and related technologies have developed rapidly and been put into practice, and the data lake is increasingly seen as the next-generation basic data platform. This article introduces how to combine Alibaba Cloud's basic services and rich computing engines to create a new cloud-native data lake system.

Cloud-native data lake system

The arrival of the cloud-native era has brought the data lake into a new stage of "cloud-lake symbiosis". In this context, Alibaba Cloud has launched an enterprise-level data lake solution based on cloud-native technology. The solution adopts an architecture that separates storage from computing. The storage layer is built on Alibaba Cloud Object Storage Service (OSS) and connects seamlessly with Alibaba Cloud's Data Lake Analytics (DLA), Data Lake Formation, E-MapReduce, DataWorks, and other computing engines, while remaining compatible with the rich open-source computing engine ecosystem.

(1) Data lake storage uses cloud-based object storage OSS plus JindoFS in place of HDFS, which increases data scale, reduces storage costs, and separates computing from storage;
(2) The Data Lake Formation service provides unified metadata and unified permission management, and supports access from multiple engines;
(3) Cloud-native computing engines such as Spark on EMR make better use of elastic computing resources;
(4) DataWorks, the cloud-based data development and governance platform, addresses data lake metadata governance, data integration, and data development.

Alibaba Cloud's cloud-native data lake system can support EB-level data lakes, store more than 100,000 databases, 100 million tables, and partitions on the order of 1 billion, serve more than 3 billion metadata service requests per day, and support more than 10 open-source computing engines as well as cloud-native data warehouse engines such as MaxCompute and Hologres.
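The unified metadata and permission management described above can be pictured as a single catalog that every engine consults before it touches storage. The following is a minimal in-memory sketch of that idea, not the actual Data Lake Formation API: the `Catalog` and `Table` classes, the method names, and the `oss://example-bucket/...` path are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Table:
    """One table entry: where its data lives and which partitions exist."""
    name: str
    location: str  # hypothetical object storage path
    partitions: List[str] = field(default_factory=list)


@dataclass
class Catalog:
    """Hypothetical unified catalog: database -> table -> partitions, plus grants."""
    databases: Dict[str, Dict[str, Table]] = field(default_factory=dict)
    grants: Set[Tuple[str, str, str]] = field(default_factory=set)

    def register(self, database: str, table: Table) -> None:
        # Every engine sees the same definition once it is registered here.
        self.databases.setdefault(database, {})[table.name] = table

    def grant(self, principal: str, database: str, table_name: str) -> None:
        self.grants.add((principal, database, table_name))

    def lookup(self, principal: str, database: str, table_name: str) -> Table:
        # Permission is checked in one place, regardless of the calling engine.
        if (principal, database, table_name) not in self.grants:
            raise PermissionError(f"{principal} may not read {database}.{table_name}")
        return self.databases[database][table_name]


catalog = Catalog()
catalog.register(
    "sales",
    Table("orders", "oss://example-bucket/sales/orders/", ["dt=2021-01-01"]),
)
catalog.grant("spark_job", "sales", "orders")
orders = catalog.lookup("spark_job", "sales", "orders")
```

In this sketch a second engine that has not been granted access (for example a principal named `presto_job`) would get a `PermissionError` from the same `lookup` call, which is the point of keeping metadata and permissions in one service rather than in each engine.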

At the same time, the storage cost of the Alibaba Cloud data lake is more than 10 times lower than that of high-efficiency cloud disks, the query performance is more than 3 times faster than that of traditional object storage, and the query engine is highly elastic: it can start more than 1,000 Spark Executors within 30 seconds. Together, Alibaba Cloud's storage and computing capabilities make for an industry-leading data lake system. To take the lead in the era of big data, you need a system that can quickly connect to a variety of computing platforms while retaining the data in its original form.

E-MapReduce

Alibaba Cloud Elastic MapReduce (E-MapReduce) is a one-stop enterprise big data platform built on the open-source ecosystem, including Hadoop, Spark, Kafka, Flink, Storm, and other components. It provides cluster, job, and data management services as a system solution for big data processing on the Alibaba Cloud platform. E-MapReduce runs on Alibaba Cloud ECS elastic virtual machines and is based on open-source Apache Hadoop and Apache Spark, so you can easily use other systems in the Hadoop and Spark ecosystems (such as Apache Hive, Apache Kafka, Flink, Druid, and TensorFlow) to analyze and process your own data. Through E-MapReduce you can also easily process data held in other Alibaba Cloud data storage systems, such as OSS, SLS, and RDS.

Data Lake Storage OSS

Alibaba Cloud Object Storage Service (OSS) is the unified storage layer of the data lake. It is designed for 99.9999999999% (twelve 9s) data durability, can store data of any scale, and can be connected to business applications and to various computing and analysis platforms, which makes it well suited for enterprises building data lakes. Compared with HDFS, OSS can store a large number of small files and greatly reduces the unit storage cost through techniques such as hot and cold tiering, high-density storage, and high-compression algorithms. At the same time, OSS is compatible with the Hadoop ecosystem and connects seamlessly to Alibaba Cloud computing platforms. For data analysis scenarios, OSS provides functions such as OSS Select, Shallow Copy, and multi-version support to speed up data processing and enhance data consistency.
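To get a feel for the 99.9999999999% durability figure above, a back-of-the-envelope calculation helps. The sketch below assumes, as an interpretation that the text does not state, that the figure is the probability that a single object survives one year and that losses are independent. Exact fractions are used so that the tiny probability is not distorted by floating-point rounding.

```python
from fractions import Fraction

# Assumption: twelve 9s is a per-object, per-year survival probability,
# so the annual loss probability per object is 1 in 10**12.
DURABILITY_NINES = 12
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10 ** DURABILITY_NINES)


def expected_annual_loss(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the assumption above."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_loss(object_count)


one_billion = 10 ** 9
loss_per_year = expected_annual_loss(one_billion)    # Fraction(1, 1000)
years_between = years_per_lost_object(one_billion)   # Fraction(1000, 1)
```

Under these assumptions, a bucket holding one billion objects would expect to lose about one object every thousand years.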

Alibaba Cloud Object Storage Service (OSS) provides industry-leading scalability, durability, and performance. Customers of all sizes and industries can use it to store and protect data for any number of use cases, such as backup and recovery, content distribution, data lakes, websites, mobile applications, data archives, and IoT devices.

Data Lake Governance

DataWorks provides comprehensive data governance that gives Alibaba Cloud customers a unified view of their data. It helps them understand the current state of their data assets, improve data quality, obtain data more efficiently, ensure data security and compliance, and speed up data query and analysis. It effectively supports the construction of offline big data warehouses, federated data query and analysis, low-frequency interactive queries over massive data, the building of intelligent reports, and the implementation of data lake solutions.

Built on big data computing engines such as MaxCompute, EMR, and Hologres, DataWorks is a professional, efficient, secure, and reliable one-stop platform for big data development and governance. It comes with Alibaba's data center and data governance best practices, enabling the digital transformation of various industries. Every day, tens of thousands of data and algorithm engineers within Alibaba Group use DataWorks, which carries 99% of the group's data business construction.

Data Lake Formation

A data lake is a centralized repository that can store structured and unstructured data at any scale, supporting big data and AI computing. As a core component of the cloud-native data lake architecture, Data Lake Formation helps you build cloud-native data lake solutions easily and quickly. It provides unified management of metadata on the lake and enterprise-level permission control, and connects seamlessly to multiple computing engines.

• Easy data collection: systematic data collection capabilities, massive storage services, support for structured/semi-structured/unstructured data sources
• More flexible architecture: separation of computing and storage, more flexible resource planning and architecture, reducing cost waste, improving efficiency, and responding to rapid business changes
• Easy data management: unified storage with hot and cold tiered lifecycle management, which solves operation and maintenance problems such as data scattered across clusters and repeated data copying
• Easy to extract value: connect multiple computing and analysis platforms through the data lake, break data silos and gain insight into business value
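The hot and cold tiered lifecycle management mentioned in the list above can be illustrated with a small rule that picks a storage tier from an object's age. This is only a sketch of the idea, not an OSS lifecycle configuration: the function name and the 30-day and 180-day thresholds are assumptions chosen for illustration, and the tier labels simply mirror the Standard, Infrequent Access (IA), and Archive storage classes.

```python
from datetime import date

# Hypothetical thresholds; real policies are set per bucket and per workload.
IA_AFTER_DAYS = 30
ARCHIVE_AFTER_DAYS = 180


def storage_class_for(last_modified: date, today: date) -> str:
    """Pick a tier from the number of days since the object was last modified."""
    age_days = (today - last_modified).days
    if age_days >= ARCHIVE_AFTER_DAYS:
        return "Archive"   # cold data: cheapest to keep, slowest to read
    if age_days >= IA_AFTER_DAYS:
        return "IA"        # warm data: infrequent access
    return "Standard"      # hot data: frequently read


today = date(2024, 3, 1)
hot = storage_class_for(date(2024, 2, 20), today)   # 10 days old
warm = storage_class_for(date(2024, 1, 1), today)   # 60 days old
cold = storage_class_for(date(2023, 3, 1), today)   # 366 days old
```

Keeping the rule in one place, next to the unified storage layer, is what removes the need for each cluster to manage its own copies and retention.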

Cloud-native data lake analytics

Using cloud-native Data Lake Analytics (DLA), you can analyze your data on Alibaba Cloud cost-effectively and efficiently through standard SQL and existing business intelligence (BI) tools, without ETL.

Cloud-native Data Lake Analytics (DLA) is a serverless, interactive analysis service on the cloud. Enterprise users have no infrastructure or management costs and no instances to maintain, and they pay as they go. The service offers zero startup time, transparent upgrades, and QoS-based elastic service. Users interact with DLA entirely through SQL; it is compatible with standard SQL and supports rich built-in functions. DLA supports access to and analysis of multiple data sources and provides analysis capabilities for diverse and heterogeneous data. Customers can not only analyze the data in Alibaba Cloud OSS and Table Store separately but also run correlation (join) analysis across the two. DLA fully integrates MPP and DAG technologies, scales analysis out horizontally, and applies vectorized execution and operator pipeline optimizations. It also has good resource isolation and priority scheduling capabilities.
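The correlation analysis across two data sources described above amounts to a standard SQL join whose tables live in different systems. The sketch below illustrates that pattern with Python's built-in `sqlite3` module as a stand-in: two separate in-memory databases play the roles of the two sources. It does not use DLA or its SQL dialect, and the schema name `kv_store`, the table names, and the sample rows are all hypothetical.

```python
import sqlite3


def order_totals_by_customer():
    """Join tables from two separate databases with one standard SQL query."""
    conn = sqlite3.connect(":memory:")  # stand-in for the first data source
    # A second, independent in-memory database stands in for the other source.
    conn.execute("ATTACH DATABASE ':memory:' AS kv_store")

    conn.execute("CREATE TABLE main.oss_orders (customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE kv_store.customers (customer_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO main.oss_orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.5), (2, 7.0)],
    )
    conn.executemany(
        "INSERT INTO kv_store.customers VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )

    # One query spans both sources; no data is copied between them first.
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount) AS total
        FROM main.oss_orders AS o
        JOIN kv_store.customers AS c ON o.customer_id = c.customer_id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows


totals = order_totals_by_customer()
```

The design point is that the query author writes ordinary SQL against schema-qualified table names, while the engine is responsible for reaching each underlying source.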
