Cloudera's Distribution Including Apache Hadoop (CDH) is an Apache Hadoop distribution. CDH 6 is based on Hadoop 3.0.0, which ships built-in support for Object Storage Service (OSS). This guide describes how to enable CDH 6 components such as Hadoop, Hive, Spark, and Impala to query OSS data, walking you through the full configuration: connecting CDH to OSS, adding the required OSS JARs to Impala's classpath, and verifying the setup with TPC-DS benchmark queries.
This guide uses CDH 6.0.1 as the reference version.
Prerequisites
Before you begin, make sure you have:
A running CDH 6 cluster. For setup instructions, see the Cloudera Installation Guide.
Network access from all cluster nodes to the OSS endpoint in your target region
An Alibaba Cloud AccessKey ID and AccessKey secret with read/write permissions on the target bucket
Step 1: Add OSS configurations
Use the Cloudera Manager (CM) cluster management tool to add the following parameters. If CM is not managing your cluster, add the same parameters to core-site.xml directly.
| Parameter | Description | Recommended value |
|---|---|---|
| fs.oss.endpoint | Endpoint for the region where your bucket is located | oss-cn-zhangjiakou-internal.aliyuncs.com |
| fs.oss.accessKeyId | AccessKey ID used to access OSS | Your AccessKey ID |
| fs.oss.accessKeySecret | AccessKey secret used to access OSS | Your AccessKey secret |
| fs.oss.impl | Hadoop OSS file system implementation class | org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem |
| fs.oss.buffer.dir | Temporary file directory for local buffering | /tmp/oss |
| fs.oss.connection.secure.enabled | Whether to enable HTTPS. Enabling HTTPS reduces throughput. | false |
| fs.oss.connection.maximum | Maximum number of concurrent connections to OSS | 2048 |
For a full parameter reference, see the Hadoop-Aliyun module documentation.
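If CM is not managing your cluster, the same settings go into core-site.xml as standard Hadoop properties. A minimal fragment (the endpoint and credential values below are placeholders to replace with your own):

```xml
<!-- core-site.xml fragment: OSS connector settings (placeholder values) -->
<property>
  <name>fs.oss.endpoint</name>
  <value>oss-cn-zhangjiakou-internal.aliyuncs.com</value>
</property>
<property>
  <name>fs.oss.accessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.oss.accessKeySecret</name>
  <value>YOUR_ACCESS_KEY_SECRET</value>
</property>
<property>
  <name>fs.oss.impl</name>
  <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
</property>
```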
After saving the configuration, restart the cluster as prompted by CM.
Verify the OSS connection
Run these commands to confirm that Hadoop can read from and write to OSS:
```shell
# List objects in your bucket
hadoop fs -ls oss://${your-bucket-name}/

# Create a test directory
hadoop fs -mkdir oss://${your-bucket-name}/hadoop-test
```

If both commands succeed, the configuration is correct. If either fails, check the parameter values, especially fs.oss.endpoint and the AccessKey credentials.
Everything written as ${} is a placeholder; replace it with your own value before running the commands.

Step 2: Configure Apache Impala
CDH 6 includes OSS support by default, but Impala's classpath does not include the required OSS JAR files. Perform the following steps on every Impala node.
Create symbolic links
Navigate to the ${CDH_HOME}/lib/impala directory and create symbolic links to the OSS JAR files:
```shell
cd lib/
ln -s ../../../jars/hadoop-aliyun-3.0.0-cdh6.0.1.jar hadoop-aliyun.jar
ln -s ../../../jars/aliyun-sdk-oss-2.8.3.jar aliyun-sdk-oss-2.8.3.jar
ln -s ../../../jars/jdom-1.1.jar jdom-1.1.jar
```

Update CLASSPATH in Impala startup files
Navigate to the ${CDH_HOME}/bin directory. In each of the three files (impalad, statestored, and catalogd), add the following line immediately before the final exec command:

```shell
export CLASSPATH=${CLASSPATH}:${IMPALA_HOME}/lib/hadoop-aliyun.jar:${IMPALA_HOME}/lib/aliyun-sdk-oss-2.8.3.jar:${IMPALA_HOME}/lib/jdom-1.1.jar
```

After updating all three files, restart the Impala-related processes on all nodes.
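On a cluster with many Impala nodes, the edit can be scripted. The sketch below operates on a mock wrapper file rather than the real ${CDH_HOME}/bin scripts, and the mock's contents are illustrative only; the real files are impalad, statestored, and catalogd, and you should adapt the path handling before running this against them. It inserts the export line immediately before the last exec line:

```shell
set -eu

# Mock standing in for a real startup wrapper such as ${CDH_HOME}/bin/impalad.
# The body below is invented for illustration; real wrappers differ.
mock=$(mktemp)
cat > "$mock" <<'EOF'
#!/bin/bash
SOURCE_ARGS=("$@")
exec ${IMPALA_HOME}/sbin/impalad "${SOURCE_ARGS[@]}"
EOF

export_line='export CLASSPATH=${CLASSPATH}:${IMPALA_HOME}/lib/hadoop-aliyun.jar:${IMPALA_HOME}/lib/aliyun-sdk-oss-2.8.3.jar:${IMPALA_HOME}/lib/jdom-1.1.jar'

# Find the last line beginning with "exec" and insert the export just before it.
last_exec=$(grep -n '^exec ' "$mock" | tail -n 1 | cut -d: -f1)
awk -v line="$export_line" -v n="$last_exec" 'NR == n { print line } { print }' \
    "$mock" > "$mock.new" && mv "$mock.new" "$mock"

result=$(cat "$mock")
printf '%s\n' "$result"
rm -f "$mock"
```

Run the same insertion over all three files on each node (for example in a loop over an ssh host list) so every Impala daemon picks up the OSS JARs.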
Step 3: Verify with TPC-DS queries
Run TPC-DS benchmark queries against OSS data to confirm that Impala can query OSS correctly.
TPC-DS queries are written in HiveQL. Impala SQL is highly compatible with HiveQL, so most TPC-DS queries run on Impala without modification. A small subset of queries use syntax that Impala does not support and must be skipped or adapted.
For reference, see the Apache Impala documentation and the hive-testbench repository.
Generate sample data
Clone the hive-testbench repository and generate a TPC-DS dataset (scale factor 50) in OSS:
```shell
git clone https://github.com/hortonworks/hive-testbench.git
cd hive-testbench
git checkout hive14
./tpcds-build.sh
FORMAT=textfile ./tpcds-setup.sh 50 oss://${your-bucket-name}/
```

Create external tables
Use the DDL statements in ddl-tpcds/text/alltables.sql to create tables pointing to the OSS data. Replace ${LOCATION} in each statement with the corresponding OSS path.
The following example creates the call_center external table:
```sql
CREATE EXTERNAL TABLE call_center(
  cc_call_center_sk bigint
  , cc_call_center_id string
  , cc_rec_start_date string
  , cc_rec_end_date string
  , cc_closed_date_sk bigint
  , cc_open_date_sk bigint
  , cc_name string
  , cc_class string
  , cc_employees int
  , cc_sq_ft int
  , cc_hours string
  , cc_manager string
  , cc_mkt_id int
  , cc_mkt_class string
  , cc_mkt_desc string
  , cc_market_manager string
  , cc_division int
  , cc_division_name string
  , cc_company int
  , cc_company_name string
  , cc_street_number string
  , cc_street_name string
  , cc_street_type string
  , cc_suite_number string
  , cc_city string
  , cc_county string
  , cc_state string
  , cc_zip string
  , cc_country string
  , cc_gmt_offset double
  , cc_tax_percentage double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 'oss://${your-bucket-name}/50/call_center';
```

Use CREATE EXTERNAL TABLE, not CREATE TABLE, for OSS-backed tables.
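Once a table is created, a quick spot-check confirms that Impala can actually read the OSS-backed data; the query below assumes the call_center data was generated at the LOCATION path (the row count depends on the scale factor):

```sql
-- Sanity check: any non-error result means the OSS path is readable
SELECT COUNT(*) FROM call_center;
```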
Verify that all 24 tables were created:
```sql
SHOW TABLES;
```

Run a TPC-DS query
Connect to the Impala shell and run one of the provided SQL files:
```shell
impala-shell -i cdh-slave01 -d default -f sample-queries-tpcds/query13.sql
```

The query in query13.sql joins six tables (store, store_sales, customer_demographics, household_demographics, customer_address, and date_dim) and returns aggregate sales metrics. A successful run produces output similar to:
```
+-------------------+-------------------------+----------------------------+----------------------------+
| avg(ss_quantity)  | avg(ss_ext_sales_price) | avg(ss_ext_wholesale_cost) | sum(ss_ext_wholesale_cost) |
+-------------------+-------------------------+----------------------------+----------------------------+
| 30.87106918238994 | 2352.642327044025       | 2162.600911949685          | 687707.09                  |
+-------------------+-------------------------+----------------------------+----------------------------+
Fetched 1 row(s) in 353.16s
```

The sample-queries-tpcds/ directory contains additional SQL files (query3.sql through query98.sql) for broader performance testing.
Limitations
Keep the following limitations in mind when using Impala to query OSS data:
TPC-DS query compatibility: Most TPC-DS queries run on Impala without modification, but a small subset use HiveQL syntax that Impala does not support. Skip or adapt those queries.
Table type: Always use CREATE EXTERNAL TABLE for OSS-backed tables.
Performance: OSS query latency is higher than HDFS latency. OSS is best suited for cold data or archival workloads where cost efficiency outweighs query speed.