The MaxCompute data lakehouse solution combines the flexibility of a data lake with the robust capabilities of a data warehouse. By integrating MaxCompute with Data Lake Formation (DLF), it establishes a comprehensive data management platform. This topic describes how to use Dataphin to manage data assets in a data lakehouse built on MaxCompute and DLF.
Background information
The MaxCompute data lakehouse is implemented by connecting MaxCompute with a data lake and supports two primary construction methods:
Construct a data lakehouse using MaxCompute, DLF, and Object Storage Service (OSS): In this approach, DLF manages all metadata (schemas) of the data lake. MaxCompute uses DLF's metadata management for OSS, improving support for open-source formats stored in OSS, such as Delta Lake, Hudi, Avro, CSV, JSON, Parquet, and ORC. For details on DLF and OSS, see Data Lake Formation (DLF) and Object Storage Service (OSS).
Construct a data lakehouse using MaxCompute and Hadoop: The Hadoop platform can be deployed in an on-premises data center, on cloud-based virtual machines, or on Alibaba Cloud E-MapReduce. Once the VPC network connecting MaxCompute and the Hadoop platform is established, MaxCompute can directly access the Hive metadata service and map its metadata to an external MaxCompute project.
Prerequisites
Before managing a data lakehouse built on MaxCompute, DLF, and OSS with Dataphin, ensure the following requirements are met:
Activate the DLF service.
Activate the OSS service.
Activate the MaxCompute service and create a MaxCompute project.
Create an external project in MaxCompute to map the metadata managed by DLF. Sample command:
create externalproject -source dlf
    -name external_project                                -- Required. The name of the external project to be created.
    -ref maxcompute_project                               -- The name of the created MaxCompute project.
    -comment "DLF"
    -region "cn-hangzhou"                                 -- The region ID of the region where DLF resides. For more information about region IDs, see Get RegionID and VPC ID.
    -db metadata_store                                    -- The name of the DLF metadatabase.
    -endpoint "dlf-share.cn-hangzhou.aliyuncs.com"        -- The endpoint of DLF.
    -ossEndpoint "oss-cn-hangzhou-internal.aliyuncs.com"; -- The endpoint of the region where OSS resides.
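After the command succeeds, you can verify that the tables in the DLF metadatabase are visible in MaxCompute. A minimal check, run in odpscmd or any MaxCompute SQL client:
use external_project;  -- Switch to the external project.
show tables;           -- List the tables mapped from the DLF metadatabase.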
MaxCompute access authorization
For building a data lakehouse with MaxCompute and Hadoop, the authorization process is as follows:
Authorize MaxCompute to create an elastic network interface (ENI) in your VPC to establish network connectivity. Log on to Alibaba Cloud with the VPC owner account and complete the one-click authorization.
For building a data lakehouse with MaxCompute, DLF, and OSS, the authorization process is as follows:
Without authorization, the MaxCompute project account cannot access DLF. Authorization can be performed using the following methods:
One-click authorization: Use this method when the account that creates the MaxCompute project is the same as the account that deploys DLF. One-click authorization is recommended in this case.
Custom authorization: Use this method regardless of whether the account that creates the MaxCompute project is the same as the account that deploys DLF. For more information, refer to Custom authorize DLF.
Manage MaxCompute data lakehouse through Dataphin
DLF facilitates metadata discovery and management for OSS. MaxCompute can create external projects based on DLF, registering the managed metadata into MaxCompute's external projects. Dataphin enables data processing (offline development and standardized modeling), metadata management, access control, security auditing, data quality assessment, and computing resource administration for the data lakehouse built on MaxCompute and DLF.
Create MaxCompute computing source and bind it to Dataphin project
Create a MaxCompute computing source and register the MaxCompute external project in it. Because the external project has no computing resources of its own, specify an additional internal MaxCompute project to run tasks, verify quality rules, scan security rules, and implement security policies. For guidance on creating a MaxCompute computing source, see Create MaxCompute computing source.
After the computing source is created, create a Dataphin project and bind the new computing source to it as the project's MaxCompute computing source.
Standardized modeling and data processing based on the data of MaxCompute data lakehouse external projects
Once the MaxCompute computing source is created and bound to the Dataphin project, you can use standardized modeling to generate logical tables from the source tables in the external project, and MaxCompute SQL tasks can use the computing resources of the bound internal project to process data in the external project.
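The following is a minimal sketch of such a MaxCompute SQL task. It assumes the external project external_project from the sample command above and a hypothetical internal table dwd_orders; all table and column names are illustrative:
-- Runs with the computing resources of the internal project bound to the
-- Dataphin project; reads a source table mapped from DLF into the external
-- project and writes the result to an internal table.
INSERT OVERWRITE TABLE dwd_orders PARTITION (ds='20240101')
SELECT  order_id,
        buyer_id,
        pay_amount
FROM    external_project.ods_orders
WHERE   ds = '20240101';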
View metadata information and manage permissions of the data lakehouse
Supports viewing metadata information.
Supports asset search and queries on tables and fields in external projects.
Supports data preview.
Supports generating SELECT statements and DDL statements, as shown in the sketch after this list.
Supports requesting permissions on tables and fields in external projects.
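For illustration, the generated statements for a hypothetical external-project table might look as follows; the table and column names are assumptions, not actual product output:
-- Generated SELECT statement:
SELECT order_id, buyer_id, pay_amount FROM external_project.ods_orders;
-- Generated DDL statement:
CREATE TABLE IF NOT EXISTS ods_orders (
    order_id   BIGINT COMMENT 'Order ID',
    buyer_id   BIGINT COMMENT 'Buyer ID',
    pay_amount DOUBLE COMMENT 'Payment amount'
) PARTITIONED BY (ds STRING);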
Audit data quality and manage security of the data lakehouse
Supports configuring quality rules for physical tables in external projects.
Supports executing MaxCompute SQL tasks to verify quality rules, as shown in the sketch after this list.
Supports security rule scanning and security policy implementation.
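A minimal sketch of the verification SQL that a quality rule might execute against an external-project table, using the computing resources of the bound internal project; the table, field, and partition are hypothetical:
-- Hypothetical null-rate check on a key field of an external-project table.
SELECT  COUNT(*) AS total_rows,
        SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_ids
FROM    external_project.ods_orders
WHERE   ds = '20240101';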