MaxCompute allows you to create external data sources and use them to connect to Hadoop clusters. After a connection is established, you can implement a data lakehouse solution. This topic describes how to create, view, and delete an external Hadoop data source.

Background information

You can map external Hadoop data sources to external MaxCompute projects. These mappings allow you to use MaxCompute to query data from one or more external Hadoop data sources at a time. MaxCompute allows you to create, view, and delete external Hadoop data sources, as described in the following sections.

Usage notes

  • You can create external data sources for MaxCompute in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Zhangjiakou), and Singapore (Singapore).
  • You can bind one external data source to only one external MaxCompute project.
  • You can create, view, and delete external data sources, but you cannot update the external data sources.

Create an external data source

  1. Log on to the MaxCompute console and select a region.
  2. On the Manage External Data Sources tab, click Create External Data Source.
  3. In the Create External Data Source dialog box, configure the parameters and click OK. The following table describes the parameters.
    Parameter Description
    Select MaxCompute Project Select the desired MaxCompute project from the drop-down list. You can view the name of the MaxCompute project on the Project management tab.
    External Data Source Name The custom name of the external data source. The name must meet the following requirements:
    • The name can contain only lowercase letters, digits, and underscores (_).
    • The name must be less than 128 characters in length.
    Network Connection Object The network connection from MaxCompute to the virtual private cloud (VPC) in which the E-MapReduce (EMR) Hadoop cluster or the self-managed Hadoop cluster is deployed. For more information, see VPC connection scheme.
    NameNode Address The IP addresses and port numbers of the active and standby NameNode processes in the Hadoop cluster. In most cases, the port number is 8020. For more information, contact the Hadoop cluster administrator.
    HMS Service Address The IP addresses and port numbers of the active and standby Hive Metastore Service (HMS) processes in the Hadoop cluster. In most cases, the port number is 9083. For more information, contact the Hadoop cluster administrator.
    Cluster Name The name of the cluster. For a high availability (HA) Hadoop cluster, this is the logical name service that the active and standby NameNode processes share. For a self-managed Hadoop cluster, you can obtain the cluster name from the dfs.nameservices parameter in the hdfs-site.xml file.
    Authentication Type MaxCompute uses account mappings to obtain metadata and data from Hadoop clusters. The mapped Hadoop accounts may be protected by an authentication mechanism such as Kerberos. In that case, the files that contain the authentication information of the mapped accounts are required. Select an authentication type based on your business requirements. For more information, contact the O&M engineers of the Hadoop cluster.
    • No Authentication: Select this option if the Kerberos authentication mechanism is disabled for the Hadoop cluster.
    • Kerberos Authentication: Select this option if the Kerberos authentication mechanism is enabled for the Hadoop cluster.
      • Configuration File: You can click Upload KRB5.conf File to upload the krb5.conf file of the Hadoop cluster.
        Note If the Hadoop cluster runs on Linux, the krb5.conf file is stored in the /etc directory on the master node of the Hadoop cluster, where the HDFS NameNode processes run.
      • hmsPrincipals: the identity of the HMS service. You can run the list_principals command on the Kerberos terminal of the Hadoop cluster to obtain the value of this parameter. Example value:
        hive/emr-header-1.cluster-20****@EMR.20****.COM,hive/emr-header-2.cluster-20****@EMR.20****.COM
        Note The value of this parameter is a comma-delimited string. One principal is mapped to one HMS IP address.
      • Add Account Mapping
        • Account: the Alibaba Cloud account that can access the Hadoop cluster by using MaxCompute.
        • Kerberos Account: the Kerberos-authenticated Hadoop user account that is allowed to access the HMS service.
        • Upload File: You can click Upload Keytab File to upload the keytab file of the Kerberos-authenticated Hadoop user account. For more information about how to create a keytab file, see Create a keytab configuration file.
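Two of the values in the table above, Cluster Name and hmsPrincipals, come from cluster-side configuration rather than the console. The following is a minimal sketch of how to extract the cluster name from hdfs-site.xml; it creates a sample file so the commands are self-contained, and the /etc/hadoop/conf path, hostnames, and Kerberos realm mentioned in the comments are illustrative assumptions, so confirm the actual values with your Hadoop cluster administrator.

```shell
# Cluster Name: read dfs.nameservices from hdfs-site.xml. A sample file is
# created here; on a real cluster the file typically lives under
# /etc/hadoop/conf/ on the master node (path is an assumption).
cat > /tmp/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>emr-cluster</value>
  </property>
</configuration>
EOF

# Pull the <value> element that follows the dfs.nameservices <name> element.
cluster_name=$(grep -A1 '<name>dfs.nameservices</name>' /tmp/hdfs-site.xml \
  | grep -o '<value>[^<]*</value>' | sed 's/<[^>]*>//g')
echo "Cluster Name: $cluster_name"

# hmsPrincipals is a comma-delimited string, one principal per HMS address.
# Splitting on commas shows the individual principals (sample values only).
principals='hive/host-1.example.com@EXAMPLE.COM,hive/host-2.example.com@EXAMPLE.COM'
echo "$principals" | tr ',' '\n'
```

On a live cluster, you would run the same grep against the real hdfs-site.xml and obtain the principals with the list_principals command on the Kerberos terminal, as described above.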

View or delete an external data source

  1. Log on to the MaxCompute console and select a region.
  2. On the Manage External Data Sources tab, find the external data source that you want to manage. In the Actions column, click Details to view the data source, or click Delete to delete it.
    Note If the external data source is bound to an external project, you can delete the external data source only after you delete the external project or unbind the external data source from the external project.