E-MapReduce: Access a Hadoop cluster with Kerberos authentication enabled

Last Updated: Dec 05, 2025

This topic describes how to configure a Serverless StarRocks instance to securely access a Hadoop cluster that has Kerberos authentication enabled, so that you can query and analyze data efficiently without compromising data access security.

Prerequisites

  • Instance and cluster preparation:

    • You have created an EMR Serverless StarRocks instance. For more information, see Create an instance.

    • You have created a self-managed Hadoop cluster or an EMR on ECS cluster (such as a cluster of the DataLake or Custom type) that includes the HDFS and Hive services and has Kerberos authentication enabled. For more information, see Create a cluster.

      This topic uses an EMR-5.18.1 DataLake cluster created on EMR on ECS as an example.

  • Network connectivity:

    • Make sure that the Serverless StarRocks instance and the Hadoop cluster are in the same VPC or that the networks are connected.

    • Configure security group rules to allow the StarRocks instance to access the relevant ports of the Hadoop cluster.

      Important

      When you configure security group rules, open only the required ports in Port Range based on your actual requirements.
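
      As a quick connectivity check, you can run commands similar to the following from a host in the same VPC as the StarRocks instance. This is only a sketch: the hostname is the sample cluster address used in this topic, the ports (8020 for NameNode RPC, 9083 for Hive Metastore, and 88 for the Kerberos KDC) are common EMR defaults that may differ in your cluster, and the nc tool must be installed on the host.

        # Test whether the EMR master node is reachable on the required ports.
        # Replace the hostname with your own cluster address.
        nc -vz master-1-1.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com 8020   # NameNode RPC
        nc -vz master-1-1.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com 9083   # Hive Metastore
        nc -vz master-1-1.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com 88     # Kerberos KDC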

Procedure

Step 1: Configure StarRocks instance parameters

  1. Go to the Parameter Configuration page.

    1. Log on to the E-MapReduce console.

    2. In the left-side navigation pane, choose EMR Serverless > StarRocks.

    3. In the top navigation bar, select a region based on your business requirements.

    4. Find the desired instance and click the name of the instance.

    5. Click the Parameter Configuration tab.

  2. On the Parameter Configuration page, click Add Configuration Item to add the configuration items described in the following table. (A sketch for looking up these values on the EMR cluster follows the table.)

    | File | Configuration Item | Description | Reference Value |
    | --- | --- | --- | --- |
    | hdfs-site.xml | dfs.data.transfer.protection | The data transfer protection level that ensures data security during transmission. The value must be consistent with that of the DataLake cluster; you can find it in the hdfs-site.xml file of the HDFS service in the DataLake cluster. | integrity |
    | hdfs-site.xml | dfs.datanode.kerberos.principal | The Kerberos principal name of the DataNode. The value must be consistent with that of the DataLake cluster; you can find it in the hdfs-site.xml file of the HDFS service. | hdfs/_HOST@EMR.C-AAA**********CCC.COM |
    | hdfs-site.xml | dfs.namenode.kerberos.principal | The Kerberos principal name of the NameNode. The value must be consistent with that of the DataLake cluster; you can find it in the hdfs-site.xml file of the HDFS service. | hdfs/_HOST@EMR.C-AAA**********CCC.COM |
    | core-site.xml | hadoop.security.authentication | Enables the Kerberos authentication mechanism. | kerberos |
    | hive-site.xml | hive.metastore.sasl.enabled | Specifies whether to enable SASL authentication. The default value is true. | true |
    | hive-site.xml | hive.metastore.kerberos.principal | The Kerberos principal name of the Hive Metastore. The value must be consistent with that of the DataLake cluster; you can find it in the hive-site.xml file of the Hive service. | hive/_HOST@EMR.C-AAA**********CCC.COM |
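
If you are unsure where to find these values, the following sketch shows one way to look them up on an EMR master node. The hdfs-conf path is an assumption inferred from the hive-conf path used later in this topic and may differ in your environment.

  # Run on an EMR master node to look up the reference values above.
  # The -A 2 option prints the <value> lines that follow each <name> match.
  grep -A 2 'dfs.data.transfer.protection' /etc/taihao-apps/hdfs-conf/hdfs-site.xml
  grep -A 2 'dfs.namenode.kerberos.principal' /etc/taihao-apps/hdfs-conf/hdfs-site.xml
  grep -A 2 'hive.metastore.kerberos.principal' /etc/taihao-apps/hive-conf/hive-site.xml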

(Optional) Step 2: Additional configuration for an HDFS cluster in HA mode

If you need to access an EMR cluster in HA mode, you must also add the following configurations to the hdfs-site.xml file on the Parameter Configuration page of the StarRocks instance.

Note

You can find the values of these parameters in the hdfs-site.xml file on the Configuration tab of the HDFS service under the Services tab of the EMR cluster.

| Parameter | Description | Reference Value |
| --- | --- | --- |
| dfs.nameservices | The logical name of the HDFS service. You can set a custom name. | hdfs-cluster |
| dfs.ha.namenodes.hdfs-cluster | The custom names of the NameNodes. Separate multiple names with commas (,). hdfs-cluster is the custom name specified in dfs.nameservices. | nn1,nn2,nn3 |
| dfs.namenode.rpc-address.hdfs-cluster.nn1 | The address that the NameNode uses for remote procedure calls (RPCs). nn1, nn2, and nn3 are the NameNode names configured in dfs.ha.namenodes.hdfs-cluster. | master-1-1.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:8020 |
| dfs.namenode.rpc-address.hdfs-cluster.nn2 |  | master-1-2.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:8020 |
| dfs.namenode.rpc-address.hdfs-cluster.nn3 |  | master-1-3.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:8020 |
| dfs.client.failover.proxy.provider.hdfs-cluster | The provider class that the client uses to connect to the active NameNode. | org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider |
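
To confirm the HA values before you copy them into the StarRocks instance, you can query the HDFS client configuration on an EMR master node. This is a sketch that assumes the hdfs command is on the PATH of the master node:

  # Print the logical nameservice name, for example, hdfs-cluster.
  hdfs getconf -confKey dfs.nameservices

  # Show the active/standby state of each NameNode (nn1, nn2, and nn3).
  hdfs haadmin -getAllServiceState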

Step 3: Configure Kerberos authentication

  1. Configure the kerberos.keytab file.

    1. Obtain the Base64-encoded Keytab file.

      1. Log on to the EMR cluster using SSH. For more information, see Log on to a cluster.

      2. Run the following command to Base64-encode the hive.keytab file that StarRocks requires to access Hive. The -w 0 option ensures that the encoded output contains no line feeds.

        base64 -w 0 /etc/taihao-apps/hive-conf/keytab/hive.keytab
        Note

        Make sure that hive.keytab is the keytab file that Serverless StarRocks requires to access Hive. If the keytab content is incorrect, authentication may fail. (A sketch for verifying the keytab appears at the end of this step.)

    2. On the Parameter Configuration page of the StarRocks instance, click kerberos.keytab on the left.

    3. Enter the generated Base64 string in the Content configuration item of kerberos.keytab.

  2. Configure the krb5.conf file.

    1. On the Parameter Configuration page of the StarRocks instance, click krb5.conf on the left.

    2. Enter the following content in the Content configuration item of krb5.conf. The content must be consistent with the krb5.conf file in the DataLake cluster.

      Note

      You can log on to the EMR cluster and then run the cat /etc/krb5.conf command to obtain the content of the krb5.conf file in the DataLake cluster.

      [libdefaults]
        default_realm = EMR.C-AAA**********CCC.COM
        dns_lookup_realm = false
        dns_lookup_kdc = false
        ticket_lifetime = 24h
        renew_lifetime = 7d
        forwardable = true
        rdns = false
        dns_canonicalize_hostname = true
        pkinit_anchors = FILE:/etc/pki/tls/certs/ca-bundle.crt
        kdc_timeout = 30s
        max_retries = 3

      [realms]
        EMR.C-AAA**********CCC.COM = {
          kdc = master-1-1.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:88
          kdc = master-1-2.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:88
          admin_server = master-1-1.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:749
        }

      The following tables describe the parameters in the file:

      • [libdefaults] section

        | Parameter | Description |
        | --- | --- |
        | default_realm | The default Kerberos realm used to identify the authentication scope. |
        | dns_lookup_realm | Specifies whether to look up realm information through DNS. Typically set to false to avoid DNS resolution issues. |
        | dns_lookup_kdc | Specifies whether to look up KDC (Key Distribution Center) addresses through DNS. Typically set to false. |
        | ticket_lifetime | The validity period of a Kerberos ticket. In this example, 24h, which means that a ticket is valid for 24 hours. |
        | renew_lifetime | The maximum period for which a ticket can be renewed. In this example, 7d, which means that a ticket can be renewed for up to 7 days. |
        | forwardable | Specifies whether tickets can be forwarded. Setting this to true supports cross-service authentication. |
        | rdns | Specifies whether to use reverse DNS resolution. Setting this to false helps avoid DNS resolution conflicts. |
        | dns_canonicalize_hostname | Specifies whether to canonicalize hostnames. Setting this to true ensures consistent hostname resolution. |
        | pkinit_anchors | The path to the PKINIT anchor certificates used to support public key-based authentication. |
        | kdc_timeout | The timeout period for KDC requests. In this example, 30s, which means that each request waits for at most 30 seconds. |
        | max_retries | The maximum number of retries. In this example, 3, which means that the system attempts to connect to the KDC up to 3 times. |

      • [realms] section

        | Parameter | Description |
        | --- | --- |
        | EMR.C-AAA**********CCC.COM | Defines the Kerberos realm. |
        | kdc | The address and port of the KDC server that distributes tickets. You can configure multiple KDC addresses to improve availability. |
        | admin_server | The address and port of the administration server that manages Kerberos principals and tickets. |
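
Before you proceed to verification, you can sanity-check the keytab and Kerberos settings on the EMR master node, as referenced in the keytab step above. This is a sketch: the sample principal shown here is a placeholder, so substitute a principal that the klist command actually prints.

  # List the principals in the keytab to confirm that it is the Hive
  # service keytab that Serverless StarRocks uses.
  klist -kt /etc/taihao-apps/hive-conf/keytab/hive.keytab

  # Verify that the keytab can obtain a ticket. Replace the principal
  # with one printed by the klist command above.
  kinit -kt /etc/taihao-apps/hive-conf/keytab/hive.keytab hive/master-1-1.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com@EMR.C-AAA**********CCC.COM
  klist      # the new ticket should be listed
  kdestroy   # discard the test ticket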

Step 4: Verify the configuration

  1. Connect to the StarRocks instance through EMR StarRocks Manager.

  2. Create a Hive catalog.

    CREATE EXTERNAL CATALOG hive_catalog
    PROPERTIES (
      "type" = "hive",
      "hive.metastore.uris" = "thrift://master-1-1.c-7ebc1ff2****.cn-hangzhou.emr.aliyuncs.com:9083"
    );
    Note

    hive.metastore.uris is the URI of the Hive Metastore. You can search for and view the hive.metastore.uris parameter value in the hive-site.xml file of the Hive service in the DataLake cluster.

  3. View the databases in the catalog.

    If the database list is displayed, the access is successful.

    SET CATALOG hive_catalog;
    SHOW DATABASES;
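
    If you want to go beyond listing databases, a few more statements can confirm that data access works end to end. The database and table names below are placeholders; replace them with names returned by the SHOW statements.

    SET CATALOG hive_catalog;
    SHOW DATABASES;
    -- Replace example_db and example_table with names from your cluster.
    USE example_db;
    SHOW TABLES;
    SELECT * FROM example_table LIMIT 10;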