Run Kudu Queries in Impala on EMR Clusters

After integrating Impala with Kudu, you can use Impala SQL to query and manage data in Kudu tables. This topic describes how to connect Impala to a Kudu cluster using the E-MapReduce (EMR) console or the CLI.

Prerequisites

Before you begin, ensure that you have:

An EMR cluster with Impala and Kudu selected as optional services. For more information, see Create a cluster.

How it works

There are two ways to tell Impala where the Kudu master nodes are:

Global flag (kudu_master_hosts): Set once in the Impala service configuration. All Kudu tables created through Impala automatically use this setting.
Per-table property (kudu.master_addresses): Specified in the TBLPROPERTIES clause of each CREATE TABLE statement. Use this approach when you work from the CLI and the global flag is not set.

Integrate Impala with Kudu using the EMR console

Step 1: Configure the Impala service

Go to the Configure tab of the Impala service page. For more information, see Manage configuration items.
Click the impalad.flgs tab, then click Add Configuration Item. Add the following configuration item:

Parameter	Value
`kudu_master_hosts`	`master-1-1:7051`

kudu_master_hosts specifies the hostname and port of the Kudu master node. For multiple master nodes, separate the hostname:port pairs with commas, for example: master-1-1:7051,master-1-2:7051,master-1-3:7051.
Click the catalogd.flgs tab, then click Add Configuration Item. Add the same configuration item:

Parameter	Value
`kudu_master_hosts`	`master-1-1:7051`

Step 2 (Optional): Verify the integration

Connect to Impala. For more information, see Use the Impala shell tool.

Create a test table:

CREATE TABLE my_first_table
(
  id BIGINT,
  name STRING,
  PRIMARY KEY(id)
)
PARTITION BY HASH PARTITIONS 16
STORED AS KUDU
TBLPROPERTIES(
  'kudu.num_tablet_replicas' = '1');

If the output contains "Table has been created.", Impala is successfully integrated with Kudu.
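To double-check that the test table is stored in Kudu, you can also inspect its definition. The statements below are a minimal sketch; the exact output depends on the Impala version, so none is shown here. The TBLPROPERTIES in the SHOW CREATE TABLE output may list kudu.master_addresses with the value taken from the global flag. Because the CLI procedure later in this topic reuses the name my_first_table, drop the test table if you plan to run both procedures.

```sql
-- Show the table definition as Impala recorded it, including STORED AS KUDU.
SHOW CREATE TABLE my_first_table;

-- List the columns and their types.
DESCRIBE my_first_table;

-- Optional cleanup once the check is done.
DROP TABLE IF EXISTS my_first_table;
```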

Integrate Impala with Kudu using the CLI

Step 1: Connect to Impala

Connect to Impala using the Impala shell tool. For more information, see Use the Impala shell tool.

Step 2: Create a Kudu table

Run the following statement to create a table. The kudu.master_addresses property specifies the hostname and port of the Kudu master node.

CREATE TABLE my_first_table
(
   id BIGINT,
   name STRING,
   PRIMARY KEY(id)
)
PARTITION BY HASH PARTITIONS 16
STORED AS KUDU
TBLPROPERTIES(
  'kudu.master_addresses' = 'master-1-1:7051',
  'kudu.num_tablet_replicas' = '1');

Key parameters:

Parameter	Description
`my_first_table`	Table name. Replace with a name of your choice.
`kudu.master_addresses`	Hostname and port of the Kudu master node. For multiple master nodes, separate the `hostname:port` pairs with commas, for example: `master-1-1:7051,master-1-2:7051,master-1-3:7051`. For a Hadoop cluster, replace `master-1-1` with `emr-header-1`.
`kudu.num_tablet_replicas`	Number of tablet replicas. The example uses `'1'`.

If the output contains "Table has been created.", the table is created successfully.
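For a cluster with several Kudu master nodes, the same statement takes a comma-separated address list. The following is a sketch only: the table name my_ha_table is hypothetical, the three hostnames are the ones used as an example in the parameter table above, and the replica count of 3 assumes that the cluster has at least three Kudu tablet servers.

```sql
-- Hypothetical table name; replace the hostnames with those of your master nodes.
CREATE TABLE my_ha_table
(
  id BIGINT,
  name STRING,
  PRIMARY KEY(id)
)
PARTITION BY HASH PARTITIONS 16
STORED AS KUDU
TBLPROPERTIES(
  -- One hostname:port pair per Kudu master, separated by commas.
  'kudu.master_addresses' = 'master-1-1:7051,master-1-2:7051,master-1-3:7051',
  -- Assumes at least three tablet servers; Kudu expects an odd replica count.
  'kudu.num_tablet_replicas' = '3');
```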

Step 3 (Optional): Insert data

INSERT INTO my_first_table VALUES(1, "ss");

Step 4 (Optional): Query data

SELECT * FROM my_first_table;

Expected output:

+----+------+
| id | name |
+----+------+
| 1  | ss   |
+----+------+
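Beyond INSERT and SELECT, Impala also supports UPSERT, UPDATE, and DELETE on Kudu tables. The statements below are a small sketch against my_first_table; the values are arbitrary examples.

```sql
-- Insert the row if id 2 is new, or overwrite it if id 2 already exists.
UPSERT INTO my_first_table VALUES (2, 'tt');

-- Change a non-key column of an existing row. Primary key columns cannot be updated.
UPDATE my_first_table SET name = 'uu' WHERE id = 2;

-- Remove the row again so that the table matches the output shown above.
DELETE FROM my_first_table WHERE id = 2;
```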

To drop the table, run the following statement: DROP TABLE my_first_table;