MaxCompute allows you to build a data lakehouse by using MaxCompute and Hadoop for unified management, storage, and analysis of large amounts of data. The data lakehouse provides an integrated data platform that processes structured and semi-structured data and supports high-concurrency data analysis scenarios. This topic describes how to build a data lakehouse by using MaxCompute and Hadoop and how to manage data lakehouse projects.
Prerequisites
MaxCompute is activated and a MaxCompute project is created. For more information, see Activate MaxCompute and Create a MaxCompute project.
Note: If MaxCompute is activated, you can directly use MaxCompute. If MaxCompute is not activated, we recommend that you enable the Hive-compatible data type edition when you activate MaxCompute.
If you want to build a data lakehouse by using MaxCompute and Hadoop, we recommend that the virtual private cloud (VPC) in which Hadoop is deployed be in the same region as MaxCompute. This way, you are not charged for cross-region network connections.
Before you build a data lakehouse by using MaxCompute and Hadoop, make sure that the high availability (HA) feature is enabled for Hadoop. For more information, contact O&M engineers of the Hadoop cluster.
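You can quickly confirm that HA is enabled by checking the NameNode states from a node in the cluster. The following commands are a minimal sketch; the NameNode IDs nn1 and nn2 are placeholders that depend on your cluster configuration.

    # List the configured NameNodes. An HA cluster returns two NameNode hosts.
    hdfs getconf -namenodes
    # Check the state of each NameNode. One NameNode should be active and the other standby.
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2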
Limits
The data lakehouse is supported in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt).
The VPC in which Hadoop is deployed must be in the same region as MaxCompute.
Procedure
To build the data lakehouse, perform the following steps:
Step 1: Authorize MaxCompute to access your cloud resources
To build a data lakehouse by using MaxCompute and Hadoop, you must authorize MaxCompute to create an elastic network interface (ENI) in your VPC. After the ENI is created, MaxCompute can connect to your VPC. Log on to the Resource Access Management (RAM) console with the Alibaba Cloud account to which the VPC belongs and complete the authorization on the Cloud Resource Access Authorization page.
Step 2: Create a data lakehouse in the DataWorks console
Log on to the DataWorks console, and select a region in which the data lakehouse is supported.
Note: For more information about the regions in which the data lakehouse is supported, see Limits.
In the left-side navigation pane of the DataWorks console, click Lake and Warehouse Integration (Data Lakehouse).
On the Lake and Warehouse Integration (Data Lakehouse) page, click Start.
On the Create Data Lakehouse page, configure the parameters. The following tables describe the parameters.
Table 1. Create Data Warehouse
Parameter
Description
External Project Name
The custom name of the external project. The name must meet the following requirements:
The name can contain only letters, digits, and underscores (_), and must start with a letter.
The name must be 1 to 128 characters in length.
Note: For more information about external projects, see Project.
MaxCompute Project
Select a MaxCompute project from the drop-down list. If no MaxCompute project exists, you can click Create Project in MaxCompute Console to create a project. For more information, see Create a MaxCompute project.
Note: If you cannot select a project from the MaxCompute Project drop-down list, you must associate the project with a DataWorks workspace in the DataWorks console. For more information, see Associate a compute engine with a workspace.
Table 2. Create Data Lake Connection
Parameter
Description
Heterogeneous Data Platform Type
Alibaba Cloud E-MapReduce/Hadoop Cluster: Select this option if you want to build a data lakehouse by using MaxCompute and Hadoop.
Alibaba Cloud DLF + OSS: Select this option if you want to build a data lakehouse by using MaxCompute, DLF, and OSS.
The following parameters are displayed if you select Alibaba Cloud E-MapReduce/Hadoop Cluster:
Network Connection
From the drop-down list, select a network connection from MaxCompute to the VPC in which the external data source is deployed. The external data source can be an Alibaba Cloud EMR Hadoop cluster or a self-managed Hadoop cluster. You can also establish a new connection. For more information about this parameter, see Step c "Establish a network connection between MaxCompute and the destination VPC" of "Access over a VPC (dedicated connection)" in Network connection process. An optional reachability check is shown after this table.
Note: You are not charged for network connections during the public preview phase of the data lakehouse.
For more information about network connections, see Terms.
External Data Source
An external data source stores the information that MaxCompute requires to connect to the external system, such as Uniform Resource Locators (URLs), port numbers, and user authentication information. Select an EMR Hadoop cluster or a self-managed Hadoop cluster from the drop-down list. You can also create an external data source for an EMR Hadoop cluster or a self-managed Hadoop cluster. For more information, see Step 3 of "Create an external data source" in Manage external data sources.
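Before you create the external data source, you can optionally verify that the Hadoop service ports are reachable, for example from an ECS instance in the same VPC as the Hadoop cluster. The following commands are a sketch; the hostname is a placeholder, and the ports assume the default NameNode port 8020 and HMS port 9083 that are described in the next table.

    # Check that the NameNode RPC port is reachable.
    nc -vz emr-header-1.cluster-20**** 8020
    # Check that the Hive Metastore Service port is reachable.
    nc -vz emr-header-1.cluster-20**** 9083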
Table 3. Create External Data Source
Parameter
Description
Select MaxCompute Project
Select the desired MaxCompute project from the drop-down list. You can view the name of the MaxCompute project on the Projects page.
External Data Source Name
The custom name of the external data source. The name must meet the following requirements:
The name can contain only lowercase letters, digits, and underscores (_).
The name must be less than 128 characters in length.
Network Connection Object
The network connection from MaxCompute to the VPC in which the E-MapReduce (EMR) Hadoop cluster or the self-managed Hadoop cluster is deployed. For more information, see Network connection process.
NameNode Address
The IP addresses and port numbers of the active and standby NameNode processes in the Hadoop cluster. In most cases, the port number is 8020. For more information, contact the Hadoop cluster administrator.
HMS Service Address
The IP addresses and port numbers of the active and standby Hive Metastore Service (HMS) processes in the Hadoop cluster. In most cases, the port number is 9083. For more information, contact the Hadoop cluster administrator.
Cluster Name
The name of the cluster. For an HA Hadoop cluster, the value of this parameter is the same as the nameservice name that the active and standby NameNode processes use. For a self-managed Hadoop cluster, you can obtain the cluster name from the dfs.nameservices parameter in the hdfs-site.xml file. The command sketch after this table shows one way to read this value.
Authentication Type
MaxCompute uses account mappings to obtain metadata and data from Hadoop clusters. The mapped Hadoop accounts are protected by using an authentication mechanism, such as Kerberos authentication. Therefore, the files that contain authentication information of the mapped accounts are required. Select an authentication type based on your business requirements. For more information, contact O&M engineers of the Hadoop cluster.
No Authentication: Select this option if the Kerberos authentication mechanism is disabled for the Hadoop cluster.
Kerberos Authentication: Select this option if the Kerberos authentication mechanism is enabled for the Hadoop cluster.
Configuration File: You can click Upload KRB5.conf File to upload the krb5.conf file of the Hadoop cluster.
Note: If the Hadoop cluster runs a Linux operating system, the krb5.conf file is stored in the /etc directory on the master node of the Hadoop cluster where the HDFS NameNode processes are deployed.
hmsPrincipals: the identity of the HMS service. You can run the list_principals command on the Kerberos terminal of the Hadoop cluster to obtain the value of this parameter. Example value: hive/emr-header-1.cluster-20****@EMR.20****.COM,hive/emr-header-2.cluster-20****@EMR.20****.COM
Note: The value of this parameter is a comma-delimited string. One principal is mapped to one HMS IP address.
Add Account Mapping
Account: the Alibaba Cloud account that can access the Hadoop cluster by using MaxCompute.
Kerberos Account: the Kerberos-authenticated Hadoop user account that is allowed to access the HMS service.
Upload File: You can click Upload Keytab File to upload the keytab file of the Kerberos-authenticated Hadoop user account. For more information about how to create a keytab file, see Create a keytab configuration file. A command sketch for exporting and verifying a keytab file is shown after this table.
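Most of the values in this table can be read directly from the Hadoop cluster. The following commands are a sketch of one way to collect them; the configuration file path, principal names, realm, and keytab path are examples that depend on your environment.

    # NameNode Address: list the NameNode hosts. The RPC port is 8020 in most cases.
    hdfs getconf -namenodes
    # Cluster Name: read the dfs.nameservices value from the cluster configuration.
    hdfs getconf -confKey dfs.nameservices
    # HMS Service Address: the hive.metastore.uris property lists the HMS hosts and port (9083 in most cases).
    # The path of hive-site.xml varies by distribution.
    grep -A 1 "hive.metastore.uris" /etc/hive/conf/hive-site.xml
    # hmsPrincipals: list the HMS principals on the Kerberos admin terminal.
    kadmin.local -q "list_principals hive/*"
    # Keytab file for the mapped Hadoop account: export it with ktadd (-norandkey keeps the existing password) and verify it with klist.
    kadmin.local -q "ktadd -norandkey -k /tmp/hadoop_user.keytab hadoop_user@EMR.20****.COM"
    klist -kt /tmp/hadoop_user.keytab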
Table 4. Create Data Mapping
Parameter
Description
External Data Source Object
By default, this parameter is set to the value of External Data Source.
Destination Database
The database in the Hadoop cluster that you want to map to the external project.
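To confirm the database name before you create the mapping, you can list the databases that the Hive Metastore Service manages from a node in the Hadoop cluster. The following commands are a minimal sketch that assumes the Hive CLI or Beeline is installed; the HiveServer2 address is a placeholder.

    # List databases with the Hive CLI.
    hive -e "SHOW DATABASES;"
    # Or with Beeline, replacing the address with your HiveServer2 endpoint.
    beeline -u "jdbc:hive2://emr-header-1.cluster-20****:10000" -e "SHOW DATABASES;"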
Step 3: Manage the data lakehouse in the DataWorks console
Use the data lakehouse
In the left-side navigation pane of the DataWorks console, click Lake and Warehouse Integration (Data Lakehouse).
On the Lake and Warehouse Integration (Data Lakehouse) page, find the external project that you want to use.
Use the data lakehouse.
Click Use Data Lakehouse in the Actions column of the external project. You can then query the tables of the mapped database from MaxCompute (see the query sketch after this list).
Update the external project.
Click Project configuration in the Actions column of the external project. In the Project configuration dialog box, update the information about the external project.
Note: You can change the database name of the external data source that is mapped to the external project, and select another external data source. If you want to delete an external data source, go to the Manage External Data Sources tab in the MaxCompute console, find the external data source, and click Delete in the Actions column. You cannot update an external data source.
Delete the external project.
Click Delete in the Actions column of the external project.
Note: The external project is logically deleted and enters the silent state. The external project is completely deleted after 15 days. You cannot create an external project with the same name during this period of time.
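After you click Use Data Lakehouse, you can query the tables of the mapped Hive database from MaxCompute by referencing them as external_project_name.table_name. The following command is a sketch that uses the MaxCompute client (odpscmd); the project and table names are placeholders.

    # Query a Hive table through the external project from the MaxCompute client.
    odpscmd -e "SELECT * FROM ext_hadoop_project.sale_detail LIMIT 10;"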
View the metadata of an external project in the data lakehouse
In the left-side navigation pane of the DataWorks console, click Workspaces.
On the Workspaces page, find the workspace that is mapped to your external project and choose Shortcuts > Data Map in the Actions column.
On the Data Map page, enter the table name in the search box and click Search. You can also go to the All Data tab, select your external project from the Project drop-down list, enter the table name in the search box, and then click Search.
Note: The Apply for Permission and View Lineage features of the table are unavailable.
Metadata of tables in the external data source, such as the Hive database that is mapped to the external project, is updated on a T+1 basis. Changes that you make to these tables are synchronized to Data Map of DataWorks one day later. Metadata in MaxCompute is updated in real time.
References
For more information about how to build a data lakehouse that supports the Delta Lake or Apache Hudi storage mechanism based on Hadoop clusters, see Delta Lake or Apache Hudi storage mechanism based on Hadoop clusters.