Connect MaxCompute to your Hadoop cluster to query and analyze structured and semi-structured data across both systems without moving data. The resulting data lakehouse is an integrated platform that supports both data processing and high-concurrency data analysis. This topic describes how to set up the connection and manage the data lakehouse project.
How it works
To build the data lakehouse, you create three objects in sequence:
- A network connection, which links MaxCompute to the virtual private cloud (VPC) where your Hadoop cluster runs, using an elastic network interface (ENI).
- An external data source, which stores the Hive Metastore Service (HMS) address, NameNode addresses, cluster name, and authentication credentials for the Hadoop cluster.
- An external project, which maps a Hadoop database to a MaxCompute project, giving you a unified query interface.
RAM authorization must be completed before you create these objects.
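Once all three objects exist, tables in the mapped Hadoop database can be read directly in MaxCompute SQL by qualifying them with the external project name. A minimal sketch, assuming a hypothetical external project named ext_hadoop_demo that maps a Hive database containing a sale_detail table:

```sql
-- Query a Hive table through the external project.
-- ext_hadoop_demo (external project) and sale_detail (Hive table)
-- are hypothetical names used for illustration.
SELECT region, COUNT(*) AS order_cnt
FROM ext_hadoop_demo.sale_detail
GROUP BY region;
```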
Prerequisites
Before you begin, ensure the following requirements are met.
Environment requirements:
- MaxCompute is activated and a MaxCompute project is created. See Activate MaxCompute and Create a MaxCompute project.
- If MaxCompute is not yet activated, enable the Hive-compatible data type edition when you activate it.
- High availability (HA) is enabled for the Hadoop cluster. Contact the O&M engineers of the Hadoop cluster to confirm, or check the NameNode states yourself as shown after this list.
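If you want to verify HA yourself, one common check is to query the state of each NameNode with the standard hdfs haadmin tool on the cluster. The NameNode IDs nn1 and nn2 below are placeholders; use the IDs defined in your cluster's hdfs-site.xml:

```bash
# An HA cluster reports one active and one standby NameNode.
# nn1 and nn2 are placeholder NameNode IDs.
hdfs haadmin -getServiceState nn1   # expected: active
hdfs haadmin -getServiceState nn2   # expected: standby
```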
Network requirements:
- Deploy the Hadoop cluster in a VPC in the same region as MaxCompute. Cross-region connections incur additional charges.
- The data lakehouse is supported in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt).
Permission requirements:
- Log on to the Resource Access Management (RAM) console with the Alibaba Cloud account that owns the VPC and complete authorization on the Cloud Resource Access Authorization page. This grants MaxCompute permission to create an ENI in your VPC, which connects MaxCompute to the Hadoop cluster's network.
Create a data lakehouse
- Log on to the DataWorks console and select a supported region.
- In the left-side navigation pane, click Lake and Warehouse Integration (Data Lakehouse).
- On the Lake and Warehouse Integration (Data Lakehouse) page, click Start.
- On the Create Data Lakehouse page, configure the parameters described in the following tables. The Authentication Type parameter in Table 3 supports the following options:
- No Authentication: select this if Kerberos is disabled on the Hadoop cluster.
- Kerberos Authentication: select this if Kerberos is enabled. Configure the following sub-parameters:

| Sub-parameter | Description |
| --- | --- |
| Configuration File | Upload the krb5.conf file from the Hadoop cluster. On Linux, the file is in the /etc directory on the master node that runs the NameNode process. |
| hmsPrincipals | The identity of the HMS service. Run list_principals on the Kerberos terminal to get this value. Provide a comma-delimited string with one principal per HMS IP address. Example: `hive/emr-header-1.cluster-20****@EMR.20****.COM,hive/emr-header-2.cluster-20****@EMR.20****.COM` |
| Add Account Mapping | Maps Alibaba Cloud accounts to Kerberos-authenticated Hadoop user accounts. For each mapping, provide: Account (the Alibaba Cloud account that accesses the Hadoop cluster through MaxCompute), Kerberos Account (the Kerberos-authenticated Hadoop user authorized to access HMS), and Upload File (the Kerberos keytab file for that account; see Create a keytab configuration file). |
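As a way to collect the hmsPrincipals value, you can list the Hive service principals from the Kerberos admin terminal mentioned above. A sketch, assuming you have admin access on the Kerberos node; the realm and host names are placeholders:

```bash
# List all principals and keep the Hive (HMS) service principals.
# Host names and realm below are placeholders.
kadmin.local -q "list_principals" | grep '^hive/'
#   hive/emr-header-1.cluster-20****@EMR.20****.COM
#   hive/emr-header-2.cluster-20****@EMR.20****.COM
# Join the results with commas to form the hmsPrincipals string.
```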
Table 1: Create data warehouse
| Parameter | Description |
| --- | --- |
| External Project Name | A custom name for the external project. Must contain only letters, digits, and underscores (_), start with a letter, and be 1–128 characters in length. For more information about external projects, see Project. |
| MaxCompute Project | Select a MaxCompute project from the drop-down list. If no project exists, click Create Project in MaxCompute Console. If the drop-down list is empty, associate the MaxCompute project with a DataWorks workspace first. See Associate a compute engine with a workspace. |

Table 2: Create data lake connection
| Parameter | Description |
| --- | --- |
| Heterogeneous Data Platform Type | Select Alibaba Cloud E-MapReduce/Hadoop Cluster to connect to a Hadoop cluster. Select Alibaba Cloud DLF + OSS to build the data lakehouse on DLF and OSS instead. |
| Network Connection (EMR/Hadoop only) | Select the network connection from MaxCompute to the VPC where the external data source is deployed. To create a new connection, follow Step c "Establish a network connection between MaxCompute and the destination VPC" in Network connection process. Network connections are free during the public preview phase of the data lakehouse. |
| External Data Source | Select an existing EMR Hadoop cluster or self-managed Hadoop cluster from the drop-down list. To create one, see Table 3: Create external data source below. |

Table 3: Create external data source
| Parameter | Description |
| --- | --- |
| Select MaxCompute Project | Select the MaxCompute project to associate with this external data source. |
| External Data Source Name | A custom name for the external data source. Must contain only lowercase letters, digits, and underscores (_), and be fewer than 128 characters in length. |
| Network Connection Object | The network connection from MaxCompute to the VPC where the EMR Hadoop cluster or self-managed Hadoop cluster is deployed. See Network connection process. |
| NameNode Address | The IP addresses and port numbers of the active and standby NameNode processes. The default port is 8020. Contact the Hadoop cluster administrator for the exact values. |
| HMS Service Address | The IP addresses and port numbers of the Hive Metastore Service (HMS) processes. The default port is 9083. Contact the Hadoop cluster administrator for the exact values. |
| Cluster Name | For an HA Hadoop cluster: the same as the NameNode process name. For a self-managed Hadoop cluster: the value of the dfs.nameservices parameter in hdfs-site.xml. |
| Authentication Type | Select based on your Hadoop cluster's security configuration. See the authentication options above. |
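If you are unsure of the dfs.nameservices value for a self-managed cluster, it appears in hdfs-site.xml on the cluster nodes. A representative entry, with a placeholder value:

```xml
<!-- hdfs-site.xml: the nameservice ID doubles as the cluster name.
     "emr-cluster" is a placeholder; use your cluster's actual value. -->
<property>
  <name>dfs.nameservices</name>
  <value>emr-cluster</value>
</property>
```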
Table 4: Create data mapping

| Parameter | Description |
| --- | --- |
| External Data Source Object | Defaults to the external data source selected in Table 2. |
| Destination Database | The database in the Hadoop cluster to map to this external project. |
Manage data lakehouse projects
Use, update, or delete an external project
- In the left-side navigation pane of the DataWorks console, click Lake and Warehouse Integration (Data Lakehouse).
- On the Lake and Warehouse Integration (Data Lakehouse) page, find the external project and choose an action from the Actions column:
  - Use Data Lakehouse: start using the data lakehouse.
  - Project configuration: update the external project. You can change the mapped database name or select a different external data source. External data sources themselves cannot be updated. To delete an external data source, go to the Manage External Data Sources tab in the MaxCompute console, find the external data source, and click Delete in the Actions column.
  - Delete: delete the external project.
    Warning: Deleting an external project is a two-phase process. The project is logically deleted and enters a silent state immediately, and is permanently deleted after 15 days. During this period, you cannot create a new external project with the same name.
View metadata of an external project
- In the left-side navigation pane of the DataWorks console, click Workspaces.
- On the Workspaces page, find the workspace mapped to your external project and choose Shortcuts > Data Map in the Actions column.
- On the Data Map page, search for tables by name. Alternatively, go to the All Data tab, select your external project from the Project drop-down list, and search by table name.
The following features are unavailable for external project tables: Apply for Permission and View Lineage.
Metadata for Hive database tables is updated on a T+1 basis: changes made to tables in the external data source (such as the Hive database) appear in DataWorks Data Map the following day. Metadata for MaxCompute tables is updated in real time.
What's next
To build a data lakehouse that uses Delta Lake or Apache Hudi storage formats with Hadoop clusters, see Delta Lake or Apache Hudi storage mechanism based on Hadoop clusters.