This topic describes how to configure a Serverless StarRocks instance to securely access a Hadoop cluster that has Kerberos authentication enabled, so that you can query and analyze data efficiently without compromising data access security.
Prerequisites
Instance and cluster preparation:
You have created an EMR Serverless StarRocks instance. For more information, see Create an instance.
You have created a self-managed Hadoop cluster or an EMR on ECS cluster (such as a DataLake or Custom cluster) that includes the HDFS and Hive services and has Kerberos authentication enabled. For more information, see Create a cluster.
This topic uses an EMR-5.18.1 DataLake cluster created on EMR on ECS as an example.
Network connectivity:
Make sure that the Serverless StarRocks instance and the Hadoop cluster are in the same VPC or that the networks are connected.
Configure security group rules to allow the StarRocks instance to access the relevant ports of the Hadoop cluster.
Important: When configuring security group rules, open only the ports that are necessary for the Port Range based on your actual requirements.
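Before moving on, it can help to confirm that the required ports are actually reachable from the StarRocks side with a simple TCP probe. The following is a minimal sketch; the hostname is a hypothetical placeholder, and the ports shown (9083 for the Hive Metastore, 88 for the Kerberos KDC) are common defaults that may differ on your cluster.

```shell
# Probe a TCP endpoint and report whether it is reachable.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    echo "${host}:${port} NOT reachable"
  fi
}

# Example probes; replace with your cluster's actual hostnames and ports.
check_port "master-1-1.example.emr.aliyuncs.com" 9083   # Hive Metastore
check_port "master-1-1.example.emr.aliyuncs.com" 88     # Kerberos KDC
```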
Procedure
Step 1: Configure StarRocks instance parameters
Go to the Parameter Configuration page.
Log on to the E-MapReduce console.
In the left-side navigation pane, choose .
In the top navigation bar, select a region based on your business requirements.
Find the desired instance and click the name of the instance.
Click the Parameter Configuration tab.
On the Parameter Configuration page, click Add Configuration Item to add the following configuration items.
| File | Configuration Item | Description | Reference Value |
| --- | --- | --- | --- |
| hdfs-site.xml | dfs.data.transfer.protection | The data transfer protection level that ensures data security during transmission. The parameter value must be consistent with that of the DataLake cluster. Note: You can search for and view the dfs.data.transfer.protection parameter value in the hdfs-site.xml file of the HDFS service in the DataLake cluster. | integrity |
| hdfs-site.xml | dfs.datanode.kerberos.principal | The Kerberos principal name of the DataNode. The parameter value must be consistent with that of the DataLake cluster. Note: You can search for and view the dfs.datanode.kerberos.principal parameter value in the hdfs-site.xml file of the HDFS service in the DataLake cluster. | hdfs/_HOST@EMR.C-AAA**********CCC.COM |
| hdfs-site.xml | dfs.namenode.kerberos.principal | The Kerberos principal name of the NameNode. The parameter value must be consistent with that of the DataLake cluster. Note: You can search for and view the dfs.namenode.kerberos.principal parameter value in the hdfs-site.xml file of the HDFS service in the DataLake cluster. | hdfs/_HOST@EMR.C-AAA**********CCC.COM |
| core-site.xml | hadoop.security.authentication | Enables the Kerberos authentication mechanism. | kerberos |
| hive-site.xml | hive.metastore.sasl.enabled | Specifies whether to enable SASL authentication. The default value is true. | true |
| hive-site.xml | hive.metastore.kerberos.principal | The Kerberos principal name of the Hive Metastore. The parameter value must be consistent with that of the DataLake cluster. Note: You can search for and view the hive.metastore.kerberos.principal parameter value in the hive-site.xml file of the Hive service in the DataLake cluster. | hive/_HOST@EMR.C-AAA**********CCC.COM |
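For orientation, the hdfs-site.xml items above correspond to property entries like the following sketch in the DataLake cluster's own configuration. The values shown are the masked examples from this topic, not literal values for your cluster; always copy the real values from your cluster's hdfs-site.xml.

```xml
<!-- hdfs-site.xml sketch (masked example values; copy the real ones from your cluster) -->
<property>
  <name>dfs.data.transfer.protection</name>
  <value>integrity</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>hdfs/_HOST@EMR.C-AAA**********CCC.COM</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@EMR.C-AAA**********CCC.COM</value>
</property>
```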
(Optional) Step 2: Additional configuration for HA mode HDFS cluster
If you need to access an EMR cluster in HA mode, you must also add the following configurations to the hdfs-site.xml file on the Parameter Configuration page of the StarRocks instance.
You can find the values of the related parameters in the hdfs-site.xml file on the Configuration tab of the HDFS service under the Services tab of the EMR cluster.
| Parameter | Description | Reference Value |
| --- | --- | --- |
| dfs.nameservices | The name of the HDFS service. You can set a custom name. | emr-cluster |
| dfs.ha.namenodes.&lt;nameservice&gt; | The custom names of the NameNodes. Separate multiple names with commas (,). The &lt;nameservice&gt; part is the service name configured in dfs.nameservices. | nn1,nn2 |
| dfs.namenode.rpc-address.&lt;nameservice&gt;.&lt;nn&gt; | The address used by the NameNode for remote procedure calls (RPCs). The &lt;nn&gt; part represents a NameNode name configured in dfs.ha.namenodes.&lt;nameservice&gt;. Add one entry for each NameNode. | master-1-1.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:8020 |
| dfs.client.failover.proxy.provider.&lt;nameservice&gt; | The provider that the client uses to connect to the active NameNode. | org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider |
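Put together, the HA-related entries would look roughly like the following hdfs-site.xml sketch. The nameservice name `emr-cluster`, the NameNode names `nn1`/`nn2`, the masked hostnames, and the RPC port are assumptions for illustration; copy the actual values from the hdfs-site.xml file of your EMR cluster.

```xml
<!-- HA sketch; "emr-cluster", "nn1"/"nn2", hostnames, and ports are assumptions -->
<property>
  <name>dfs.nameservices</name>
  <value>emr-cluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.emr-cluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.emr-cluster.nn1</name>
  <value>master-1-1.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.emr-cluster.nn2</name>
  <value>master-1-2.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.emr-cluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```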
Step 3: Configure Kerberos authentication
Configure the kerberos.keytab file.

1. Obtain the Base64-encoded keytab file.
   1. Log on to the EMR cluster using SSH. For more information, see Log on to a cluster.
   2. Run the following command to Base64-encode the hive.keytab file that StarRocks requires to access Hive, and make sure that the encoded content contains no line feeds.

      ```shell
      base64 -w 0 /etc/taihao-apps/hive-conf/keytab/hive.keytab
      ```

      Note: Ensure that hive.keytab is the keytab file required for Serverless StarRocks to access Hive. If the keytab file content is incorrect, authentication may fail.
2. On the Parameter Configuration page of the StarRocks instance, click kerberos.keytab on the left.
3. Enter the generated Base64 string in the Content configuration item of kerberos.keytab.
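Before pasting the Base64 string into the console, it can help to verify that the encoding really is a single line and decodes back to the original keytab byte-for-byte. The following is a minimal sketch; the helper function names are introduced here for illustration only.

```shell
# Encode a keytab as a single-line Base64 string (-w 0 disables line wrapping).
encode_keytab() {
  base64 -w 0 "$1"
}

# Check that the encoded string has no line feeds and round-trips correctly.
verify_encoding() {
  local keytab=$1 b64=$2
  [ "$(printf '%s' "$b64" | wc -l)" -eq 0 ] || { echo "contains line feeds"; return 1; }
  printf '%s' "$b64" | base64 -d | cmp -s - "$keytab" || { echo "round-trip mismatch"; return 1; }
  echo "encoding OK"
}
```

Example usage on the cluster: `B64=$(encode_keytab /etc/taihao-apps/hive-conf/keytab/hive.keytab); verify_encoding /etc/taihao-apps/hive-conf/keytab/hive.keytab "$B64"`.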
Configure the krb5.conf file.

1. On the Parameter Configuration page of the StarRocks instance, click krb5.conf on the left.
2. Enter the following content in Content. The content must be consistent with the krb5.conf file in the DataLake cluster.

   Note: You can log on to the EMR cluster and run the `cat /etc/krb5.conf` command to obtain the content of the krb5.conf file in the DataLake cluster.

   ```
   [libdefaults]
   default_realm = EMR.C-AAA**********CCC.COM
   dns_lookup_realm = false
   dns_lookup_kdc = false
   ticket_lifetime = 24h
   renew_lifetime = 7d
   forwardable = true
   rdns = false
   dns_canonicalize_hostname = true
   pkinit_anchors = FILE:/etc/pki/tls/certs/ca-bundle.crt
   kdc_timeout = 30s
   max_retries = 3

   [realms]
   EMR.C-AAA**********CCC.COM = {
     kdc = master-1-1.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:88
     kdc = master-1-2.c-aaa**********ccc.cn-beijing.emr.aliyuncs.com:88
     admin_server = master-1-1.aaa**********ccc.cn-beijing.emr.aliyuncs.com:749
   }
   ```

The following tables describe the parameters involved in the preceding code:
[libdefaults] section

| Parameter | Description |
| --- | --- |
| default_realm | The default Kerberos realm used to identify the authentication scope. |
| dns_lookup_realm | Specifies whether to look up realm information through DNS. This is typically set to false to avoid DNS parsing issues. |
| dns_lookup_kdc | Specifies whether to look up KDC (Key Distribution Center) addresses through DNS. This is typically set to false. |
| ticket_lifetime | The validity period of a Kerberos ticket. In this example, it is set to 24h, which means that the ticket is valid for 24 hours. |
| renew_lifetime | The maximum validity period for ticket renewal. In this example, it is set to 7d, which means that the ticket can be renewed for up to 7 days. |
| forwardable | Specifies whether to allow ticket forwarding. Setting this to true supports cross-service authentication. |
| rdns | Specifies whether to use reverse DNS resolution. Setting this to false helps avoid DNS resolution conflicts. |
| dns_canonicalize_hostname | Specifies whether to enable hostname canonicalization. Setting this to true ensures consistency in hostname resolution. |
| pkinit_anchors | The path to the PKINIT anchor certificates used to support public key-based authentication. |
| kdc_timeout | The timeout period for KDC requests. In this example, it is set to 30s, which means that each request waits for a maximum of 30 seconds. |
| max_retries | The maximum number of retries. In this example, it is set to 3, which means that the system attempts to connect to the KDC up to 3 times. |

[realms] section

| Parameter | Description |
| --- | --- |
| EMR.C-AAA**********CCC.COM | Defines the Kerberos realm. |
| kdc | The KDC server address and port used for distributing tickets. You can configure multiple KDC addresses to improve availability. |
| admin_server | The administration server address and port used for managing Kerberos principals and tickets. |
Step 4: Verify the configuration
1. Connect to the StarRocks instance through EMR StarRocks Manager.
2. Create a Hive catalog.

   ```sql
   CREATE EXTERNAL CATALOG hive_catalog
   PROPERTIES (
       "type" = "hive",
       "hive.metastore.uris" = "thrift://master-1-1.c-7ebc1ff2****.cn-hangzhou.emr.aliyuncs.com:9083"
   );
   ```

   Note: hive.metastore.uris is the URI of the Hive Metastore. You can search for and view the hive.metastore.uris parameter value in the hive-site.xml file of the Hive service in the DataLake cluster.
3. View the databases in the catalog. If the database list is displayed normally, the access is successful.

   ```sql
   SET CATALOG hive_catalog;
   SHOW DATABASES;
   ```
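Once the database list appears, a further end-to-end check is to read data from a Hive table through the catalog. The database and table names below are hypothetical placeholders; substitute names that exist in your Hive Metastore.

```sql
-- "sample_db" and "sample_table" are hypothetical; use your own names.
SET CATALOG hive_catalog;
USE sample_db;
SHOW TABLES;
SELECT * FROM sample_table LIMIT 10;
```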