Cloudera's Distribution Including Apache Hadoop (CDH) is an Apache Hadoop distribution. CDH 6 is based on Hadoop 3.0.0, which ships built-in support for Object Storage Service (OSS). This guide describes how to enable CDH 6 components such as Hadoop, Hive, Spark, and Impala to query OSS data, walking you through the full configuration: connecting CDH to OSS, adding the required OSS JARs to Impala's classpath, and verifying the setup with TPC-DS benchmark queries.
This guide uses CDH 6.0.1 as the reference version.
Prerequisites
Before you begin, make sure you have:
A running CDH 6 cluster. For setup instructions, see the Cloudera Installation Guide.
Network access from all cluster nodes to the OSS endpoint in your target region
An Alibaba Cloud AccessKey ID and AccessKey secret with read/write permissions on the target bucket
Step 1: Add OSS configurations
Use the Cloudera Manager (CM) cluster management tool to add the following parameters. If CM is not managing your cluster, add the same parameters to core-site.xml directly.
| Parameter | Description | Recommended value |
|---|---|---|
| fs.oss.endpoint | Endpoint for the region where your bucket is located | oss-cn-zhangjiakou-internal.aliyuncs.com |
| fs.oss.accessKeyId | AccessKey ID used to access OSS | Your AccessKey ID |
| fs.oss.accessKeySecret | AccessKey secret used to access OSS | Your AccessKey secret |
| fs.oss.impl | Hadoop OSS file system implementation class | org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem |
| fs.oss.buffer.dir | Temporary file directory for local buffering | /tmp/oss |
| fs.oss.connection.secure.enabled | Whether to enable HTTPS. Enabling HTTPS reduces throughput. | false |
| fs.oss.connection.maximum | Maximum number of concurrent connections to OSS | 2048 |
For a full parameter reference, see the Hadoop-Aliyun module documentation.
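If CM is not managing your cluster, the same settings go into core-site.xml as standard Hadoop properties. A minimal fragment (the endpoint and credential values below are placeholders to replace with your own):

```xml
<!-- core-site.xml fragment: OSS connector settings (placeholder values) -->
<property>
  <name>fs.oss.endpoint</name>
  <value>oss-cn-zhangjiakou-internal.aliyuncs.com</value>
</property>
<property>
  <name>fs.oss.accessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.oss.accessKeySecret</name>
  <value>YOUR_ACCESS_KEY_SECRET</value>
</property>
<property>
  <name>fs.oss.impl</name>
  <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
</property>
```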
After saving the configuration, restart the cluster as prompted by CM.
Verify the OSS connection
Run these commands to confirm that Hadoop can read from and write to OSS:
```shell
# List objects in your bucket
hadoop fs -ls oss://${your-bucket-name}/

# Create a test directory
hadoop fs -mkdir oss://${your-bucket-name}/hadoop-test
```

If both commands succeed, the configuration is correct. If either fails, check the parameter values, especially fs.oss.endpoint and the AccessKey credentials.
Everything written as ${} is a placeholder; replace it with your own value before running the commands.

Step 2: Configure Apache Impala
CDH 6 includes OSS support by default, but Impala's classpath does not include the required OSS JAR files. Perform the following steps on every Impala node.
Create symbolic links
Navigate to the ${CDH_HOME}/lib/impala directory and create symbolic links to the OSS JAR files:
```shell
cd lib/
ln -s ../../../jars/hadoop-aliyun-3.0.0-cdh6.0.1.jar hadoop-aliyun.jar
ln -s ../../../jars/aliyun-sdk-oss-2.8.3.jar aliyun-sdk-oss-2.8.3.jar
ln -s ../../../jars/jdom-1.1.jar jdom-1.1.jar
```

Update CLASSPATH in Impala startup files
Navigate to the ${CDH_HOME}/bin directory. In each of the three files (impalad, statestored, and catalogd), add the following line immediately before the final exec command:

```shell
export CLASSPATH=${CLASSPATH}:${IMPALA_HOME}/lib/hadoop-aliyun.jar:${IMPALA_HOME}/lib/aliyun-sdk-oss-2.8.3.jar:${IMPALA_HOME}/lib/jdom-1.1.jar
```

After updating all three files, restart the Impala-related processes on all nodes.
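On a cluster with many Impala nodes, the edit can be scripted. The sketch below operates on a mock wrapper file rather than the real ${CDH_HOME}/bin scripts, and the mock's contents are illustrative only; the real files are impalad, statestored, and catalogd, and you should adapt the path handling before running this against them. It inserts the export line immediately before the last exec line:

```shell
set -eu

# Mock standing in for a real startup wrapper such as ${CDH_HOME}/bin/impalad.
# The body below is invented for illustration; real wrappers differ.
mock=$(mktemp)
cat > "$mock" <<'EOF'
#!/bin/bash
SOURCE_ARGS=("$@")
exec ${IMPALA_HOME}/sbin/impalad "${SOURCE_ARGS[@]}"
EOF

export_line='export CLASSPATH=${CLASSPATH}:${IMPALA_HOME}/lib/hadoop-aliyun.jar:${IMPALA_HOME}/lib/aliyun-sdk-oss-2.8.3.jar:${IMPALA_HOME}/lib/jdom-1.1.jar'

# Find the last line beginning with "exec" and insert the export just before it.
last_exec=$(grep -n '^exec ' "$mock" | tail -n 1 | cut -d: -f1)
awk -v line="$export_line" -v n="$last_exec" 'NR == n { print line } { print }' \
    "$mock" > "$mock.new" && mv "$mock.new" "$mock"

result=$(cat "$mock")
printf '%s\n' "$result"
rm -f "$mock"
```

Run the same insertion over all three files on each node (for example in a loop over an ssh host list) so every Impala daemon picks up the OSS JARs.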
Step 3: Verify with TPC-DS queries
Run TPC-DS benchmark queries against OSS data to confirm that Impala can query OSS correctly.
TPC-DS queries are written in HiveQL. Impala SQL is highly compatible with HiveQL, so most TPC-DS queries run on Impala without modification. A small subset of queries use syntax that Impala does not support and must be skipped or adapted.
For reference, see the Apache Impala documentation and the hive-testbench repository.
Generate sample data
Clone the hive-testbench repository and generate a TPC-DS dataset (scale factor 50) in OSS:
```shell
git clone https://github.com/hortonworks/hive-testbench.git
cd hive-testbench
git checkout hive14
./tpcds-build.sh
FORMAT=textfile ./tpcds-setup.sh 50 oss://${your-bucket-name}/
```

Create external tables
Use the DDL statements in ddl-tpcds/text/alltables.sql to create tables pointing to the OSS data. Replace ${LOCATION} in each statement with the corresponding OSS path.
The following example creates the call_center external table:
```sql
CREATE EXTERNAL TABLE call_center(
  cc_call_center_sk bigint
  , cc_call_center_id string
  , cc_rec_start_date string
  , cc_rec_end_date string
  , cc_closed_date_sk bigint
  , cc_open_date_sk bigint
  , cc_name string
  , cc_class string
  , cc_employees int
  , cc_sq_ft int
  , cc_hours string
  , cc_manager string
  , cc_mkt_id int
  , cc_mkt_class string
  , cc_mkt_desc string
  , cc_market_manager string
  , cc_division int
  , cc_division_name string
  , cc_company int
  , cc_company_name string
  , cc_street_number string
  , cc_street_name string
  , cc_street_type string
  , cc_suite_number string
  , cc_city string
  , cc_county string
  , cc_state string
  , cc_zip string
  , cc_country string
  , cc_gmt_offset double
  , cc_tax_percentage double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 'oss://${your-bucket-name}/50/call_center';
```

Use CREATE EXTERNAL TABLE, not CREATE TABLE, for OSS-backed tables.
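Once a table is created, a quick spot-check confirms that Impala can actually read the OSS-backed data; the query below assumes the call_center data was generated at the LOCATION path (the row count depends on the scale factor):

```sql
-- Sanity check: any non-error result means the OSS path is readable
SELECT COUNT(*) FROM call_center;
```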
Verify that all 24 tables were created:
```sql
SHOW TABLES;
```

Run a TPC-DS query
Connect to the Impala shell and run one of the provided SQL files:
```shell
impala-shell -i cdh-slave01 -d default -f sample-queries-tpcds/query13.sql
```

The query in query13.sql joins six tables (store, store_sales, customer_demographics, household_demographics, customer_address, and date_dim) and returns aggregate sales metrics. A successful run produces output similar to:
```
+-------------------+-------------------------+----------------------------+----------------------------+
| avg(ss_quantity)  | avg(ss_ext_sales_price) | avg(ss_ext_wholesale_cost) | sum(ss_ext_wholesale_cost) |
+-------------------+-------------------------+----------------------------+----------------------------+
| 30.87106918238994 | 2352.642327044025       | 2162.600911949685          | 687707.09                  |
+-------------------+-------------------------+----------------------------+----------------------------+
Fetched 1 row(s) in 353.16s
```

The sample-queries-tpcds/ directory contains additional SQL files (query3.sql through query98.sql) for broader performance testing.
Limitations
Keep the following limitations in mind when using Impala to query OSS data:
TPC-DS query compatibility: Most TPC-DS queries run on Impala without modification, but a small subset use HiveQL syntax that Impala does not support. Skip or adapt those queries.
Table type: Always use CREATE EXTERNAL TABLE for OSS-backed tables.
Performance: OSS query latency is higher than HDFS latency. OSS is best suited for cold data or archival workloads where cost efficiency outweighs query speed.