Alibaba Cloud Full-Link Data Lake Development and Governance Solution

Alibaba Cloud has launched a full-link data lake solution built on four core products: the open source big data platform E-MapReduce (EMR), the one-stop big data development and governance platform DataWorks, Data Lake Formation (DLF) for building the data lake, and Object Storage Service (OSS).

Recently, Alibaba Cloud EMR launched the new DataLake cluster, which is 100% compatible with the community's open source big data components and is highly flexible. It supports DLF for building the data lake, OSS and OSS-HDFS for lake storage, and the Delta Lake, Hudi, and Iceberg lake formats. The new DataLake cluster also integrates with DataWorks, Alibaba Cloud's one-stop big data development and governance platform, which distills more than ten years of Alibaba's big data construction methodology. Together they provide full-link data lake development and governance, covering ingestion, modeling, development, scheduling, governance, and security, and help customers use their data more efficiently.

In addition, the solution provides comprehensive data lake capabilities: unified metadata management, data ingestion, data storage, cache acceleration, elastic computing, containers, data analysis, task scheduling, operations and maintenance (O&M) management, and security. It has passed the special big data capability evaluation of the Ministry of Industry and Information Technology of the People's Republic of China and was awarded the "special evaluation certificate of cloud native data lake basic capability".

Alibaba Cloud full-link data lake development and governance solution architecture

Alibaba Cloud's full-link data lake development and governance solution uses OSS/OSS-HDFS as the data lake storage, DLF to build and manage the data lake, JindoFS for cache acceleration, EMR as the elastic compute engine, and DataWorks for data development and governance. Each DataWorks module is deeply integrated with DataLake to deliver one-stop data lake development and governance.

The new EMR DataLake cluster

Core operations and maintenance (O&M) management and control capabilities


1. Elastic scaling supports two modes: scaling by cluster load and scaling by time

2. Elastic scaling groups support multiple instance specifications

3. Preemptible instances are supported (over 80% cheaper than pay-as-you-go)

4. A cost optimization mode is supported (a flexible mix of pay-as-you-go and monthly subscription instances)
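
To make the cost optimization mode concrete, the following sketch estimates the hourly cost of a node group that mixes monthly subscription, pay-as-you-go, and preemptible (spot) nodes. The function name, the per-node prices, and the fixed 80% discount are hypothetical values chosen for illustration; actual prices depend on region, instance type, and time.

```python
def hourly_cost(subscription_nodes, payg_nodes, preemptible_nodes,
                subscription_price, payg_price, preemptible_discount):
    """Estimate the hourly cost of a mixed node group.

    All prices are hypothetical per-node hourly prices. The discount is the
    fraction saved on a preemptible node relative to pay-as-you-go.
    """
    if not 0 <= preemptible_discount <= 1:
        raise ValueError("preemptible_discount must be between 0 and 1")
    preemptible_price = payg_price * (1 - preemptible_discount)
    return (subscription_nodes * subscription_price
            + payg_nodes * payg_price
            + preemptible_nodes * preemptible_price)


# Ten nodes billed entirely as pay-as-you-go, versus a mix of billing methods.
all_payg = hourly_cost(0, 10, 0, subscription_price=0.6,
                       payg_price=1.0, preemptible_discount=0.8)
mixed = hourly_cost(4, 2, 4, subscription_price=0.6,
                    payg_price=1.0, preemptible_discount=0.8)
print(f"all pay-as-you-go: {all_payg:.2f}, mixed: {mixed:.2f}")
```

The mix keeps a subscription baseline for steady load and uses the cheaper preemptible nodes only for the elastic part, where an interruption is easier to tolerate.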

Cluster control capability

1. Clusters are created and scaled out within minutes, with no need to deploy or start services manually

2. A comprehensive cluster monitoring and alerting system covers hardware and engine services and supports configurable alert templates

Advantages of the new DataLake cluster over the Hadoop cluster

Better performance

• Faster scaling

Node groups in the new DataLake cluster scale out significantly faster: scaling out a single large batch of nodes is 80% faster.

• Concurrent scaling

Task node groups (the TASK node type) and multiple node groups can be scaled out and in in parallel, which covers more usage scenarios and doubles business efficiency.

More complete functions

• More flexible scaling

Time-based and load-based scaling rules can be configured at the same time. Nodes with low load can be decommissioned first. Rules can be edited whether or not an elastic scaling activity is running, and a change affects only the next trigger.

• Execution logic that is closer to real usage

The execution logic was designed after studying real user scenarios from multiple angles, so it matches how the business actually works. For example:

1) An elastic scaling policy can list multiple instance types, which are tried in order when inventory is insufficient. Elastic scaling supports graceful decommissioning and, by default, selects the nodes to remove by load, which reduces the impact of scaling on running cluster tasks.

2) When multiple elastic scaling rules of the same node group are triggered at the same time, they take effect in the user-defined order by default, which accommodates multiple usage scenarios.
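
The three behaviors described above (trying instance types in order, removing low-load nodes first, and applying simultaneously triggered rules in user order) can be sketched as follows. This is a minimal illustration and not EMR's implementation: the names `ScalingRule`, `pick_instance_type`, `nodes_to_release`, and `apply_rules_in_order`, the instance type labels, and the minimum and maximum node group sizes are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScalingRule:
    name: str
    delta: int  # positive = scale out, negative = scale in


def pick_instance_type(preferred_types, inventory, count):
    """Return the first type, in user order, with enough stock for `count` nodes."""
    for instance_type in preferred_types:
        if inventory.get(instance_type, 0) >= count:
            return instance_type
    return None  # no listed type has enough inventory


def nodes_to_release(node_loads, count):
    """Choose the `count` nodes with the lowest load for decommissioning."""
    ranked = sorted(node_loads.items(), key=lambda item: item[1])
    return [node for node, _ in ranked[:count]]


def apply_rules_in_order(current_size, triggered_rules, min_size, max_size):
    """Apply simultaneously triggered rules one by one, in the user-defined order."""
    size = current_size
    for rule in triggered_rules:
        size = max(min_size, min(max_size, size + rule.delta))
    return size


chosen = pick_instance_type(["type-a", "type-b"], {"type-a": 0, "type-b": 10}, 4)
released = nodes_to_release({"n1": 0.9, "n2": 0.1, "n3": 0.4}, 2)
final_size = apply_rules_in_order(
    5, [ScalingRule("scale-out", 4), ScalingRule("scale-in", -2)],
    min_size=2, max_size=8)
print(chosen, released, final_size)
```

Clamping after every rule, rather than once at the end, is one possible design choice; it means the order of the rules can change the final size, which is why a deterministic order matters.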

• Improved operating experience

Richer configuration hints and operation guidance, plus new pre-validation of configuration items, reduce the learning curve and the likelihood of failed operations.

Lower cost

• Better flexibility and scalability, and wider functional coverage

Elastic scaling takes effect faster and supports more functions. Users can manage hardware resources faster and more effectively: they set policies according to business needs, the cluster size changes automatically, and less hardware is wasted.

• Further cost reduction through flexible use of preemptible instances

When a node group is added, users can configure a complete preemptible instance strategy together with a fallback strategy, according to their business needs. Using preemptible instances reduces costs further.

Introduction to DataWorks full-link development and governance capabilities

Built on big data engines such as EMR-DataLake, EMR-ClickHouse, and CDP, DataWorks provides a unified full-link big data development and governance platform for data lake, data warehouse, and lakehouse solutions. As the builder of Alibaba's data middle platform, DataWorks has been distilling Alibaba's big data construction methodology since 2009. It has served tens of thousands of customers in government, finance, retail, internet, energy, and manufacturing through six full-link data governance capabilities: intelligent data modeling, global data integration, efficient data development, active data governance (data quality, data map, and so on), comprehensive data security, and rapid analysis services. Together these capabilities support the digital upgrade of each industry.

Intelligent data modeling

DataWorks intelligent data modeling distills the modeling methodology of Alibaba's data middle platform. Based on dimensional modeling, it describes business data from a business perspective through four areas: data warehouse planning, data standards, dimensional modeling, and data metrics, so that data warehouse construction becomes standardized and sustainable. The intelligent data modeling capability for DataLake will be officially released in August 2022.

Global data integration

DataWorks Data Integration is developed by the team behind the commercial version of the open source DataX. In data lake scenarios it supports offline synchronization among more than 50 data sources, including HDFS, Hive, HBase, OSS, and Kafka, and databases such as MySQL, Oracle, and SQL Server. It also provides network connectivity solutions for synchronization scenarios such as IDC to cloud, cloud vendor to cloud vendor, cloud product to cloud product, and cloud account to cloud account, so that customers keep fast, stable data movement across complex network environments and diverse heterogeneous data sources.

Efficient data development

DataWorks data development (DataStudio) and the Operation Center target engines such as EMR-DataLake, EMR-ClickHouse (EMR-CK), and CDH. They provide the main interface for visual development, with intelligent code development, multi-engine hybrid workflows, and standardized task release. This helps users build data lakes, offline data warehouses, real-time data warehouses, and ad hoc analysis systems with ease while keeping data production efficient and stable.

Data development - core development and scheduling capabilities

• Supports EMR Hive, EMR MR, EMR Spark SQL, EMR Spark, EMR Shell, EMR Presto, EMR Impala, and EMR Spark Streaming

• Large-scale scheduling and stability far beyond open source tools (on the order of ten million task instances in a single day during Double 11)

• Multiple scheduling cycles: minute, hour, day, week, and month

• Global parameters for workflows and context parameters passed between nodes

Data development - visual data object management and control nodes

• Visual upload of resource files (HDFS/OSS)

• Visual management of UDFs (Java)

• Visual table creation (HDFS/OSS supported)

• Control nodes such as merge, assignment, sequence, loop, and branch

• Mixed scheduling of multiple scheduling cycles

• Visual workflow orchestration

Data development - intelligent SQL editor

• Syntax highlighting

• Automatic completion of keywords

• Table/field information prompt

• Function information prompt

Task operation and maintenance - run diagnostics

Run diagnostics helps users quickly locate the cause of a task error, such as:

• Incomplete upstream dependencies

• Insufficient scheduling resources

• Interception by a data quality rule

• Baseline breach

It also supports data backfill, which helps users handle O&M incidents quickly. For alerting, the Operation Center supports multiple alert methods:

• Multi-channel alerts through Webhook (DingTalk, WeChat, Feishu), phone, SMS, email, and more

• Alert recipients can be configured from an on-call schedule

Task operation and maintenance - intelligent baseline

Intelligent baseline is a monitoring technology originated by DataWorks and protected by a national patent. Users do not need to configure an alert time for every task; they only configure the alert time of the final output node. Based on historical task runs, the intelligent baseline predicts whether the core task may fail to produce its output on time and warns early, which protects the production stability of core tasks.
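
The idea of watching only the final output node can be illustrated with a simple estimate over the task dependency graph. The sketch below is not the patented DataWorks algorithm; it assumes each task has a known average historical run time and that a task starts as soon as all of its upstream tasks finish. The function names, the task names, and the safety margin are hypothetical.

```python
def estimated_finish(task, upstream, avg_minutes, start_minute=0):
    """Estimate when `task` finishes, in minutes after midnight.

    `upstream` maps a task to the tasks it depends on; `avg_minutes` maps a
    task to its average historical run time.
    """
    cache = {}

    def finish(name):
        if name not in cache:
            ready = max((finish(dep) for dep in upstream.get(name, [])),
                        default=start_minute)
            cache[name] = ready + avg_minutes[name]
        return cache[name]

    return finish(task)


def baseline_at_risk(final_task, upstream, avg_minutes, start_minute,
                     deadline_minute, margin_minutes=0):
    """Return True when the estimated finish plus a margin passes the deadline."""
    finish = estimated_finish(final_task, upstream, avg_minutes, start_minute)
    return finish + margin_minutes > deadline_minute


upstream = {"report": ["join"], "join": ["orders", "users"]}
avg_minutes = {"orders": 30, "users": 10, "join": 20, "report": 15}

finish = estimated_finish("report", upstream, avg_minutes)
print(finish)  # the slowest chain is orders -> join -> report
print(baseline_at_risk("report", upstream, avg_minutes, 0,
                       deadline_minute=70, margin_minutes=10))
```

Only the deadline of `report` is configured; the estimate for every upstream task follows from the dependency graph and the run history.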

Active data governance

DataWorks data governance includes the Data Governance Center, Data Quality, Data Map, and other products, covering the data lifecycle before, during, and after an event. Through capabilities such as governance health scores, quality rules, and large-scale data lineage, written data governance specifications are turned into platform product capabilities, so that data governance is no longer a "phased project" but a "continuously operated project".

Data quality

EMR Hive nodes support DataWorks data quality rules, with 37 built-in rule templates. Rules can be configured visually and in batches, which makes rule configuration more efficient. The module is also deeply integrated with data development scheduling: rules are triggered by scheduling, which saves computing resources and surfaces problems promptly.

• Supports 37 built-in data quality rule templates

• Supports batch configuration of rules and rule templates

• Supports binding to the scheduling engine and blocking the workflow when a quality alert fires

• Supports dynamic thresholds (based on technology published at a top academic conference; an algorithm determines the alert threshold automatically)

• Supports custom SQL rules

• Supports SMS, email, and DingTalk alerts

• Supports customized data quality reports

• Supports records of how quality problems were handled

Data quality also supports strong and weak rule settings for flexible O&M control.

• Strong rules block downstream tasks directly, which prevents problem data from polluting downstream tables and wasting downstream computing resources

• Weak rules only raise a warning and do not block task execution; they suit non-core tasks
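
The difference between strong and weak rules can be sketched as a small gate that runs after a task writes its output. This is an illustrative model only, not the DataWorks Data Quality API; the `QualityRule` class, the `evaluate` function, and the example rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QualityRule:
    name: str
    check: Callable[[List[Dict]], bool]  # returns True when the data passes
    strong: bool                         # strong rules block downstream tasks


def evaluate(rows, rules):
    """Run every rule; return (block_downstream, names_of_failed_rules)."""
    failed = []
    block_downstream = False
    for rule in rules:
        if not rule.check(rows):
            failed.append(rule.name)  # every failure produces a warning
            if rule.strong:
                block_downstream = True  # only strong failures block
    return block_downstream, failed


rows = [{"id": 1, "amount": 10}, {"id": None, "amount": 5}]
id_not_null = QualityRule(
    "id_not_null", lambda data: all(r["id"] is not None for r in data), strong=True)
min_row_count = QualityRule(
    "min_row_count", lambda data: len(data) >= 3, strong=False)

print(evaluate(rows, [id_not_null, min_row_count]))
print(evaluate(rows, [min_row_count]))
```

In the first call the failed strong rule blocks downstream tasks; in the second call only a weak rule fails, so a warning is recorded and downstream tasks continue.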

Data map

The data map supports the complete EMR-DataLake metadata system. Users can quickly search table names and field names and browse upstream and downstream relationships based on table and field lineage to find tables quickly. Capabilities include:

• Support table basic information, business description information, output information, etc.

• Support detailed information and change records of partitions and fields

• Support output information analysis of tables (including scheduling tasks for writing data to tables or creating partitions)

• Support lineage analysis of tables and fields (real-time analysis)

• Support categorizing tables by level and adding tables to favorites

• Support global search, navigation search by category, and filtering by category
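
Browsing upstream and downstream relationships amounts to traversing a lineage graph, where an edge runs from a source table to each table derived from it. The following sketch shows a breadth-first traversal over such a graph. The table names and the helper functions `reachable` and `invert` are hypothetical and do not reflect how the data map stores or queries lineage.

```python
from collections import deque


def reachable(graph, start):
    """Return every table reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


def invert(graph):
    """Reverse every edge so that downstream lineage becomes upstream lineage."""
    inverted = {}
    for source, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


# Each key is a source table; each value lists the tables built from it.
lineage = {
    "ods_orders": ["dwd_orders"],
    "dwd_orders": ["dws_sales", "ads_report"],
    "dws_sales": ["ads_report"],
}

downstream = reachable(lineage, "ods_orders")
upstream = reachable(invert(lineage), "ads_report")
print(downstream)
print(upstream)
```

The downstream view answers "which tables are affected if this table changes", and the upstream view answers "where does this table's data come from".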

Figure: table basic information

Figure: table lineage information

Comprehensive data security

For data security, DataWorks supports security management across the data lifecycle of the DataLake engine, in the following five areas:

Data transmission security

• Data source access control

Data storage security

• Storage encryption

• Data backup

Data processing security

• Fine-grained data authorization control with Ranger

• A standardized development process, with separate identity management for the development and production environments

Data exchange security

• Data masking

General data security

• RBAC permission model

• Operation behavior audit

• LDAP authentication management
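
As an illustration of the data masking (desensitization) item under data exchange security, the sketch below hides part of a phone number and part of an email address before the value leaves the platform. The masking formats and the function names are assumptions chosen for the example; they are not the masking rules that DataWorks ships.

```python
def mask_phone(value):
    """Keep the first 3 and last 4 characters; replace the rest with '*'."""
    if len(value) <= 7:
        return "*" * len(value)  # too short to keep any part safely
    return value[:3] + "*" * (len(value) - 7) + value[-4:]


def mask_email(value):
    """Keep the first character of the local part and the full domain."""
    local, separator, domain = value.partition("@")
    if not separator or not local:
        return "*" * len(value)  # not a well-formed address, mask everything
    return local[0] + "*" * (len(local) - 1) + "@" + domain


print(mask_phone("13812345678"))
print(mask_email("alice@example.com"))
```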
