This topic describes how to use Impala on an E-MapReduce (EMR) cluster to query data stored in OSS-HDFS.
Prerequisites
Before you begin, ensure that you have:
An EMR cluster running EMR V3.42.0 or later, or EMR V5.8.0 or later, with the Impala service selected at cluster creation. See Create a cluster
OSS-HDFS enabled on a bucket, with access permissions granted. See Enable OSS-HDFS and grant access permissions
Usage notes
OSS-HDFS paths use the format
oss://<bucket-name>.<endpoint>/<path>. Specify this format in theLOCATIONclause when creating a database or table.
Query data stored in OSS-HDFS
Step 1: Log on to the EMR cluster
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
Click the EMR cluster that you created.
On the Nodes tab, click
on the left side of the node group.Click the ECS instance ID. On the Instances page, click Connect next to the instance ID.
For details on connecting via SSH key pair or SSH password from Windows or Linux, see Log on to a cluster.
Step 2: Connect to Impala
Run the following command to open an Impala shell session. For more information, see Connect to Impala.
impala-shell -i core-1-1Step 3: Create a database, table, and query data
Create a database with its storage location set to an OSS-HDFS path.
CREATE DATABASE store LOCATION 'oss://<bucket-name>.<endpoint>/impala';Replace
<bucket-name>and<endpoint>with your actual bucket name and endpoint.Switch to the database.
USE store;Create an external table in Parquet format.
CREATE EXTERNAL TABLE customer_demographics ( `cd_demo_sk` INT, `cd_gender` STRING, `cd_marital_status` STRING, `cd_education_status` STRING, `cd_purchase_estimate` INT, `cd_credit_rating` STRING, `cd_dep_count` INT, `cd_dep_employed_count` INT, `cd_dep_college_count` INT) STORED AS PARQUET;Insert sample data.
INSERT INTO customer_demographics VALUES (1, 'Male', 'Single', 'Graduate', 1000, 'AAA', 2, 1, 1), (2, 'Female', 'Married', 'Undergraduate', 2000, 'BBB', 3, 2, 2);Query the data.
SELECT * FROM customer_demographics;