MaxCompute: Build a data lakehouse by using MaxCompute and Hadoop

Last Updated: Jan 25, 2024

MaxCompute allows you to build a data lakehouse by using MaxCompute and Hadoop for unified management, storage, and analysis of large amounts of data. The data lakehouse provides an integrated data platform that can process structured and semi-structured data and support high-concurrency data analysis scenarios. This topic describes how to build a data lakehouse by using MaxCompute and Hadoop and how to manage data lakehouse projects.

Prerequisites

  • MaxCompute is activated and a MaxCompute project is created. For more information, see Activate MaxCompute and Create a MaxCompute project.

    Note
    • If MaxCompute is already activated, you can use it directly. If MaxCompute is not activated, we recommend that you select the Hive-compatible data type edition when you activate MaxCompute.

    • If you want to build a data lakehouse by using MaxCompute and Hadoop, we recommend that the virtual private cloud (VPC) in which Hadoop is deployed be in the same region as MaxCompute. This way, you are not charged for cross-region network connections.

  • Before you build a data lakehouse by using MaxCompute and Hadoop, make sure that the high availability (HA) feature is enabled for Hadoop. For more information, contact O&M engineers of the Hadoop cluster.
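
The following commands are a minimal sketch of how you might verify these prerequisites. They assume that the MaxCompute client (odpscmd) is configured on your machine and that you have shell access to the master node of the Hadoop cluster; property names and command options can vary by version, so confirm them with your MaxCompute and Hadoop administrators.

    # On a machine where odpscmd is configured: list the project properties and
    # check the data type flags (for example, odps.sql.hive.compatible).
    odpscmd -e "setproject;"

    # On the master node of the Hadoop cluster: if HA is enabled, dfs.nameservices
    # is set and both NameNodes report an active or standby state.
    hdfs getconf -confKey dfs.nameservices
    hdfs haadmin -getAllServiceState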

Limits

  • The data lakehouse is supported in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt).

  • The VPC in which Hadoop is deployed must be in the same region as MaxCompute.

Procedure

Step 1: Authorize MaxCompute to access your cloud resources

If you want to build a data lakehouse by using MaxCompute and Hadoop, you must authorize MaxCompute to create an elastic network interface (ENI) in your VPC. After the ENI is created, MaxCompute can connect to your VPC. To complete the authorization, log on to the Resource Access Management (RAM) console with the Alibaba Cloud account to which the VPC belongs and complete the authorization on the Cloud Resource Access Authorization page.

Step 2: Create a data lakehouse in the DataWorks console

  1. Log on to the DataWorks console, and select a region in which the data lakehouse is supported.

    Note

    For more information about the regions in which the data lakehouse is supported, see Limits.

  2. In the left-side navigation pane of the DataWorks console, click Lake and Warehouse Integration (Data Lakehouse).

  3. On the Lake and Warehouse Integration (Data Lakehouse) page, click Start.

  4. On the Create Data Lakehouse page, configure the parameters. The following tables describe the parameters.

    Table 1. Create Data Warehouse

    • External Project Name: the custom name of the external project. The name must meet the following requirements:

      • The name can contain only letters, digits, and underscores (_), and must start with a letter.

      • The name must be 1 to 128 characters in length.

      Note

      For more information about external projects, see Project.

    • MaxCompute Project: Select a MaxCompute project from the drop-down list. If no MaxCompute project exists, you can click Create Project in MaxCompute Console to create a project. For more information, see Create a MaxCompute project.

      Note

      If you cannot select a project from the MaxCompute Project drop-down list, you must associate the project with a DataWorks workspace in the DataWorks console. For more information, see Associate a compute engine with a workspace.

    Table 2. Create Data Lake Connection

    • Heterogeneous Data Platform Type: Select one of the following options:

      • Alibaba Cloud E-MapReduce/Hadoop Cluster: Select this option if you want to build a data lakehouse by using MaxCompute and Hadoop.

      • Alibaba Cloud DLF + OSS: Select this option if you want to build a data lakehouse by using MaxCompute, DLF, and OSS.

    The following parameters apply if you select Alibaba Cloud E-MapReduce/Hadoop Cluster:

    • Network Connection: the network connection from MaxCompute to the VPC in which the external data source is deployed. The external data source can be an Alibaba Cloud EMR Hadoop cluster or a self-managed Hadoop cluster. Select an existing connection from the drop-down list, or establish a new connection. For more information, see Step c ("Establish a network connection between MaxCompute and the destination VPC") of "Access over a VPC (dedicated connection)" in Network connection process.

      Note

      • You are not charged for network connections during the public preview phase of the data lakehouse.

      • For more information about network connections, see Terms.

    • External Data Source: the external data source object, which stores the information that MaxCompute requires to connect to the data source, such as Uniform Resource Locators (URLs), port numbers, and user authentication information. Select an EMR Hadoop cluster or a self-managed Hadoop cluster from the drop-down list, or create one. For more information, see Step 3 of "Create an external data source" in Manage external data sources.

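    Before you fill in Table 3, you can optionally confirm that the Hadoop service ports are reachable from the VPC that the network connection targets. The following is a minimal sketch run from an ECS instance in that VPC; the host name is a placeholder, and the ports are the common defaults described in Table 3.

      # Run from an ECS instance in the same VPC as the Hadoop cluster.
      # Replace emr-header-1.cluster-20**** with your NameNode or HMS host.
      nc -zv emr-header-1.cluster-20**** 8020    # NameNode RPC port
      nc -zv emr-header-1.cluster-20**** 9083    # Hive Metastore Service (HMS) port
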
    Table 3. Create External Data Source

    • Select MaxCompute Project: Select the desired MaxCompute project from the drop-down list. You can view the name of the MaxCompute project on the Projects page.

    • External Data Source Name: the custom name of the external data source. The name must meet the following requirements:

      • The name can contain only lowercase letters, digits, and underscores (_).

      • The name must be less than 128 characters in length.

    • Network Connection Object: the network connection from MaxCompute to the VPC in which the E-MapReduce (EMR) Hadoop cluster or the self-managed Hadoop cluster is deployed. For more information, see Network connection process.

    • NameNode Address: the IP addresses and port numbers of the active and standby NameNode processes in the Hadoop cluster. In most cases, the port number is 8020. For more information, contact the Hadoop cluster administrator.

    • HMS Service Address: the IP addresses and port numbers of the active and standby Hive Metastore Service (HMS) processes in the Hadoop cluster. In most cases, the port number is 9083. For more information, contact the Hadoop cluster administrator.

    • Cluster Name: the name of the cluster. For an HA Hadoop cluster, the value of this parameter is the same as the nameservice name used by the NameNode processes. For a self-managed Hadoop cluster, you can obtain the cluster name from the dfs.nameservices parameter in the hdfs-site.xml file.

    • Authentication Type: MaxCompute uses account mappings to obtain metadata and data from Hadoop clusters. The mapped Hadoop accounts may be protected by an authentication mechanism, such as Kerberos authentication. In this case, the files that contain the authentication information of the mapped accounts are required. Select an authentication type based on your business requirements. For more information, contact the O&M engineers of the Hadoop cluster.

      • No Authentication: Select this option if the Kerberos authentication mechanism is disabled for the Hadoop cluster.

      • Kerberos Authentication: Select this option if the Kerberos authentication mechanism is enabled for the Hadoop cluster.

        • Configuration File: Click Upload KRB5.conf File to upload the krb5.conf file of the Hadoop cluster.

          Note

          If the Hadoop cluster runs a Linux operating system, the krb5.conf file is stored in the /etc directory on the master node of the Hadoop cluster, where the HDFS NameNode processes run.

        • hmsPrincipals: the identity of the HMS service. You can run the list_principals command on the Kerberos terminal of the Hadoop cluster to obtain the value of this parameter. Example value:

          hive/emr-header-1.cluster-20****@EMR.20****.COM,hive/emr-header-2.cluster-20****@EMR.20****.COM

          Note

          The value of this parameter is a comma-delimited string. One principal is mapped to one HMS IP address.

        • Add Account Mapping

          • Account: the Alibaba Cloud account that can access the Hadoop cluster by using MaxCompute.

          • Kerberos Account: the Kerberos-authenticated Hadoop user account that is allowed to access the HMS service.

          • Upload File: Click Upload Keytab File to upload the keytab file of the Kerberos-authenticated Hadoop user account. For more information about how to create a keytab file, see Create a keytab configuration file.

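    The following commands illustrate one way to collect the values described in Table 3 from the Hadoop cluster. They are a sketch only: configuration file paths, service IDs, and keytab paths depend on your Hadoop distribution, so treat every path below as a placeholder.

      # On the master node: NameNode hosts and the nameservice name (Cluster Name).
      hdfs getconf -namenodes
      hdfs getconf -confKey dfs.nameservices

      # HMS address: hive.metastore.uris usually points to port 9083
      # (the path to hive-site.xml may vary by distribution).
      grep -A 1 "hive.metastore.uris" /etc/hive/conf/hive-site.xml

      # On the Kerberos terminal (KDC): list the HMS principals, verify the keytab
      # that you plan to upload, and locate the krb5.conf file.
      kadmin.local -q "list_principals hive/*"
      klist -kt /path/to/your.keytab
      ls -l /etc/krb5.conf
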
    Table 4. Create Data Mapping

    • External Data Source Object: By default, this parameter is set to the value of External Data Source.

    • Destination Database: the database in the Hadoop cluster.

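    Before you create the mapping, you can confirm that the destination database exists on the Hadoop side. The following is a minimal sketch, assuming that the Hive CLI or Beeline is available on the cluster; the HiveServer2 address is a placeholder.

      # The Destination Database must be one of the databases known to HMS.
      hive -e "SHOW DATABASES;"
      # Or, with Beeline:
      # beeline -u "jdbc:hive2://<hiveserver2-host>:10000" -e "SHOW DATABASES;"
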
Step 3: Manage the data lakehouse in the DataWorks console

Use the data lakehouse

  1. In the left-side navigation pane of the DataWorks console, click Lake and Warehouse Integration (Data Lakehouse).

  2. On the Lake and Warehouse Integration (Data Lakehouse) page, find the external project that you want to use.

    • Use the data lakehouse.

      Click Use Data Lakehouse in the Actions column of the external project. You can then query the Hive tables in the mapped database from MaxCompute, as shown in the sketch after this list.

    • Update the external project.

      Click Project configuration in the Actions column of the external project. In the Project configuration dialog box, update the information about the external project.

      Note

      You can change the database name of the external data source that is mapped to the external project, and select another external data source. If you want to delete an external data source, go to the Manage External Data Sources tab in the MaxCompute console, find the external data source, and click Delete in the Actions column. You cannot update an external data source.

  3. Delete the external project.

    Click Delete in the Actions column of the external project.

    Note

    The external project is logically deleted and enters the silent state. The external project is completely deleted after 15 days. You cannot create an external project with the same name during this period of time.
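
After you click Use Data Lakehouse, the Hive tables in the mapped database can be queried from MaxCompute through the external project. The following is a minimal sketch that uses the MaxCompute client (odpscmd); the project and table names are placeholders. Depending on the data types involved, you may also need to set Hive-compatibility flags in the session.

    # Query a Hive table through the external project from the associated MaxCompute project.
    # my_mc_project, ext_hadoop_project, and sale_detail are placeholder names.
    odpscmd --project=my_mc_project -e "SELECT * FROM ext_hadoop_project.sale_detail LIMIT 10;"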

View the metadata of an external project in the data lakehouse

  1. In the left-side navigation pane of the DataWorks console, click Workspaces.

  2. On the Workspaces page, find the workspace that is mapped to your external project and choose Shortcuts > Data Map in the Actions column.

  3. On the Data Map page, enter the table name in the search box and click Search. You can also go to the All Data tab, select your external project from the Project drop-down list, enter the table name in the search box, and then click Search.

    Note
    • The Apply for Permission and View Lineage features of the table are unavailable.

    • Table metadata is updated on a T+1 basis. Changes that you make to tables in the external data source that is mapped to the external project, such as a Hive database, are synchronized to Data Map of DataWorks one day later. Metadata in MaxCompute is updated in real time.

References

For more information about how to build a data lakehouse that supports the Delta Lake or Apache Hudi storage mechanism based on Hadoop clusters, see Delta Lake or Apache Hudi storage mechanism based on Hadoop clusters.