Cloudera's Distribution Including Apache Hadoop (CDH) 5 does not include built-in support for Alibaba Cloud Object Storage Service (OSS) — that support was added in CDH 6.0.1 (Hadoop 3.0.0). This topic walks you through configuring CDH 5 to access OSS as a storage backend, including Impala query support.
Prerequisites
Before you begin, ensure that you have:
A running CDH 5 cluster (this topic uses CDH 5.14.4 as the example version; for installation instructions, see the Cloudera Installation Guide)
Root or sudo access on all CDH nodes
Network access to download packages from Alibaba Cloud CDN
Background
CDH 5 bundles httpclient 4.2.5 and httpcore 4.2.5 to satisfy Resource Manager's dependency requirements. OSS SDKs require newer versions of both components. The setup process replaces the bundled versions with compatible ones and adds the Aliyun OSS JARs to the appropriate classpaths.
All paths enclosed in ${} are environment variables. Replace them with the actual values for your environment.
Step 1: Add OSS configurations
Perform the following steps on all CDH nodes.
1. Check the CDH installation directory structure.
[root@cdh-master CDH-5.14.4-1.cdh5.14.4.p0.3]# ls -lh
The output shows subdirectories including bin, etc, jars, lib, and others. The jars directory is where you place the downloaded OSS package.
2. Download the OSS-compatible CDH package to the `jars` folder.
For CDH 5.14.4:
http://gosspublic.alicdn.com/hadoop-spark/hadoop-oss-cdh-5.14.4.tar.gz
This package embeds the OSS patch from the CDH 5.14.4 Hadoop source. Packages for other versions:
| CDH version | Download URL |
|---|---|
| CDH 5.14.4 | http://gosspublic.alicdn.com/hadoop-spark/hadoop-oss-cdh-5.14.4.tar.gz |
| CDH 5.8.5 | https://gosspublic.alicdn.com/hadoop-spark/hadoop-oss-cdh-5.8.5.tar.gz |
| CDH 5.4.4 | https://gosspublic.alicdn.com/hadoop-spark/hadoop-oss-cdh-5.4.4.tar.gz |
| CDH 6.3.2 | http://gosspublic.alicdn.com/hadoop-spark/hadoop-oss-cdh-6.3.2.tar.gz |
> Note: For CDH 6.3.2, copy the package files to the jars folder, then follow the remaining steps to update aliyun-sdk-oss-3.4.1.jar and create symbolic links for the aliyun-java-sdk-*.jar files.
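The download URLs in the table follow a single pattern. As a convenience, a sketch that derives the URL from a version string; this assumes the version is one listed in the table, since other versions may not exist on the CDN:

```shell
# Build the package URL from the CDH version. Only versions listed in the
# table above are known to be published on the CDN.
CDH_VERSION=5.14.4
URL="http://gosspublic.alicdn.com/hadoop-spark/hadoop-oss-cdh-${CDH_VERSION}.tar.gz"
echo "$URL"
```

From the jars folder, you would then download the file at $URL (for example with wget) before continuing with step 3.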
3. Extract the downloaded package into the jars folder. To list its contents first, run:
[root@cdh-master CDH-5.14.4-1.cdh5.14.4.p0.3]# tar -tvf hadoop-oss-cdh-5.14.4.tar.gz
The package contains the following JARs:
hadoop-oss-cdh-5.14.4/aliyun-java-sdk-sts-3.0.0.jar
hadoop-oss-cdh-5.14.4/httpcore-4.4.4.jar
hadoop-oss-cdh-5.14.4/aliyun-sdk-oss-3.4.1.jar
hadoop-oss-cdh-5.14.4/aliyun-java-sdk-core-3.4.0.jar
hadoop-oss-cdh-5.14.4/aliyun-java-sdk-ram-3.0.0.jar
hadoop-oss-cdh-5.14.4/aliyun-java-sdk-ecs-4.2.0.jar
hadoop-oss-cdh-5.14.4/hadoop-aliyun-2.6.0-cdh5.14.4.jar
hadoop-oss-cdh-5.14.4/httpclient-4.5.2.jar
4. Go to `${CDH_HOME}/lib/hadoop` and replace the bundled HTTP JARs with the newer versions.
Remove the old JARs and create symbolic links to the updated ones:
[root@cdh-master hadoop]# rm -f lib/httpclient-4.2.5.jar
[root@cdh-master hadoop]# rm -f lib/httpcore-4.2.5.jar
[root@cdh-master hadoop]# ln -s ../../jars/hadoop-aliyun-2.6.0-cdh5.14.4.jar hadoop-aliyun-2.6.0-cdh5.14.4.jar
[root@cdh-master hadoop]# ln -s hadoop-aliyun-2.6.0-cdh5.14.4.jar hadoop-aliyun.jar
[root@cdh-master hadoop]# cd lib
[root@cdh-master lib]# ln -s ../../../jars/aliyun-java-sdk-core-3.4.0.jar aliyun-java-sdk-core-3.4.0.jar
[root@cdh-master lib]# ln -s ../../../jars/aliyun-java-sdk-ecs-4.2.0.jar aliyun-java-sdk-ecs-4.2.0.jar
[root@cdh-master lib]# ln -s ../../../jars/aliyun-java-sdk-ram-3.0.0.jar aliyun-java-sdk-ram-3.0.0.jar
[root@cdh-master lib]# ln -s ../../../jars/aliyun-java-sdk-sts-3.0.0.jar aliyun-java-sdk-sts-3.0.0.jar
[root@cdh-master lib]# ln -s ../../../jars/aliyun-sdk-oss-3.4.1.jar aliyun-sdk-oss-3.4.1.jar
[root@cdh-master lib]# ln -s ../../../jars/httpclient-4.5.2.jar httpclient-4.5.2.jar
[root@cdh-master lib]# ln -s ../../../jars/httpcore-4.4.4.jar httpcore-4.4.4.jar
[root@cdh-master lib]# ln -s ../../../jars/jdom-1.1.jar jdom-1.1.jar
5. Update the YARN classpath on the Resource Manager node.
Go to ${CDH_HOME}/lib/hadoop-yarn/bin/ on the Resource Manager node. In the yarn file, replace:
CLASSPATH=${CLASSPATH}:$HADOOP_YARN_HOME/${YARN_DIR}/*
CLASSPATH=${CLASSPATH}:$HADOOP_YARN_HOME/${YARN_LIB_JARS_DIR}/*
with:
CLASSPATH=$HADOOP_YARN_HOME/${YARN_DIR}/*:${CLASSPATH}
CLASSPATH=$HADOOP_YARN_HOME/${YARN_LIB_JARS_DIR}/*:${CLASSPATH}
This change ensures the newer httpclient and httpcore versions take precedence over the ones bundled by Resource Manager.
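If you prefer a scripted edit, the swap can be done with sed. The sketch below operates on a throwaway copy so it is safe to try anywhere; on a real node you would back up and edit ${CDH_HOME}/lib/hadoop-yarn/bin/yarn itself. The sed expression assumes GNU sed:

```shell
# Demo copy of the two CLASSPATH lines from the yarn script.
cat > /tmp/yarn-demo <<'EOF'
CLASSPATH=${CLASSPATH}:$HADOOP_YARN_HOME/${YARN_DIR}/*
CLASSPATH=${CLASSPATH}:$HADOOP_YARN_HOME/${YARN_LIB_JARS_DIR}/*
EOF
# Move everything after "${CLASSPATH}:" to the front of the line so the
# newer httpclient/httpcore JARs are found first.
sed -i 's#^CLASSPATH=\${CLASSPATH}:\(.*\)$#CLASSPATH=\1:${CLASSPATH}#' /tmp/yarn-demo
cat /tmp/yarn-demo
```

Run against the real yarn file, this produces exactly the two target lines shown above.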
6. Add the older HTTP JARs to the Resource Manager lib directory.
Go to ${CDH_HOME}/lib/hadoop-yarn/lib on the Resource Manager node and run:
[root@cdh-master lib]# ln -s ../../../jars/httpclient-4.2.5.jar httpclient-4.2.5.jar
[root@cdh-master lib]# ln -s ../../../jars/httpcore-4.2.5.jar httpcore-4.2.5.jar
7. Set the OSS configuration properties.
Use Cloudera Manager (CM) to add the properties, or edit core-site.xml directly if CM is not available.
Add the following properties to core-site.xml. Replace placeholder values with your actual settings:
<property>
<name>fs.oss.endpoint</name>
<value>oss-cn-hangzhou.aliyuncs.com</value>
<description>
Endpoint of the region where your OSS bucket is located.
For example, use oss-cn-hangzhou.aliyuncs.com for the China (Hangzhou) region.
See Regions and endpoints for a full list.
</description>
</property>
<property>
<name>fs.oss.accessKeyId</name>
<value>your-access-key-id</value>
<description>AccessKey ID used to authenticate with OSS.</description>
</property>
<property>
<name>fs.oss.accessKeySecret</name>
<value>your-access-key-secret</value>
<description>AccessKey secret used to authenticate with OSS.</description>
</property>
<property>
<name>fs.oss.impl</name>
<value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
<description>Class that implements the OSS file system for Hadoop.</description>
</property>
<property>
<name>fs.oss.buffer.dir</name>
<value>/tmp/oss</value>
<description>Directory for temporary files. /tmp/oss is recommended.</description>
</property>
<property>
<name>fs.oss.connection.secure.enabled</name>
<value>false</value>
<description>
Whether to use HTTPS. Enabling HTTPS degrades performance.
Set to false for most deployments.
</description>
</property>
<property>
<name>fs.oss.connection.maximum</name>
<value>2048</value>
<description>Maximum number of simultaneous connections to OSS.</description>
</property>
For a complete list of supported properties, see the Hadoop-Aliyun module documentation.
8. Restart the cluster.
Restart all cluster services as prompted by CM, or restart manually if CM is not available.
9. Verify the setup.
Run the following commands to confirm read and write access to OSS:
# Test read access
hadoop fs -ls oss://<your-bucket-name>/
# Test write access
hadoop fs -mkdir oss://<your-bucket-name>/hadoop-test
A successful ls command returns the bucket contents without errors. A successful mkdir command creates the directory and returns with exit code 0. If either command fails, see Troubleshooting below.
Step 2: Configure OSS support for Impala
By default, Impala queries data stored in the Hadoop Distributed File System (HDFS). After you configure CDH 5 for OSS, perform the following steps on each Impala node so that Impala can also query data stored in OSS.
1. Go to `${CDH_HOME}/lib/impala/lib` and update the HTTP JARs.
[root@cdh-master lib]# rm -f httpclient-4.2.5.jar httpcore-4.2.5.jar
[root@cdh-master lib]# ln -s ../../../jars/httpclient-4.5.2.jar httpclient-4.5.2.jar
[root@cdh-master lib]# ln -s ../../../jars/httpcore-4.4.4.jar httpcore-4.4.4.jar
[root@cdh-master lib]# ln -s ../../../jars/hadoop-aliyun-2.6.0-cdh5.14.4.jar hadoop-aliyun.jar
[root@cdh-master lib]# ln -s ../../../jars/aliyun-java-sdk-core-3.4.0.jar aliyun-java-sdk-core-3.4.0.jar
[root@cdh-master lib]# ln -s ../../../jars/aliyun-java-sdk-ecs-4.2.0.jar aliyun-java-sdk-ecs-4.2.0.jar
[root@cdh-master lib]# ln -s ../../../jars/aliyun-java-sdk-ram-3.0.0.jar aliyun-java-sdk-ram-3.0.0.jar
[root@cdh-master lib]# ln -s ../../../jars/aliyun-java-sdk-sts-3.0.0.jar aliyun-java-sdk-sts-3.0.0.jar
[root@cdh-master lib]# ln -s ../../../jars/aliyun-sdk-oss-3.4.1.jar aliyun-sdk-oss-3.4.1.jar
[root@cdh-master lib]# ln -s ../../../jars/jdom-1.1.jar jdom-1.1.jar
2. Update the CLASSPATH in the Impala startup files.
Go to ${CDH_HOME}/bin. In each of the impalad, statestored, and catalogd files, add the following line before the final exec command:
export CLASSPATH=$CLASSPATH:${IMPALA_HOME}/lib/httpclient-4.5.2.jar:${IMPALA_HOME}/lib/httpcore-4.4.4.jar:${IMPALA_HOME}/lib/hadoop-aliyun.jar:${IMPALA_HOME}/lib/aliyun-java-sdk-core-3.4.0.jar:${IMPALA_HOME}/lib/aliyun-java-sdk-ecs-4.2.0.jar:${IMPALA_HOME}/lib/aliyun-java-sdk-ram-3.0.0.jar:${IMPALA_HOME}/lib/aliyun-java-sdk-sts-3.0.0.jar:${IMPALA_HOME}/lib/aliyun-sdk-oss-3.4.1.jar:${IMPALA_HOME}/lib/jdom-1.1.jar
3. Restart all Impala-related processes on every node.
After the restart, Impala can query data stored in OSS.
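The single long export line in step 2 is easy to mistype. As a convenience, a sketch that assembles the same value from a list; the JAR names match step 1, and IMPALA_HOME is normally set by the startup scripts themselves, so the default below is only for illustration:

```shell
# JARs added for OSS support; names must match the links created in step 1.
OSS_JARS="httpclient-4.5.2.jar httpcore-4.4.4.jar hadoop-aliyun.jar
aliyun-java-sdk-core-3.4.0.jar aliyun-java-sdk-ecs-4.2.0.jar
aliyun-java-sdk-ram-3.0.0.jar aliyun-java-sdk-sts-3.0.0.jar
aliyun-sdk-oss-3.4.1.jar jdom-1.1.jar"
# Illustrative default only; the Impala startup scripts set IMPALA_HOME.
IMPALA_HOME=${IMPALA_HOME:-/opt/cloudera/parcels/CDH/lib/impala}
for jar in $OSS_JARS; do
  CLASSPATH=${CLASSPATH}:${IMPALA_HOME}/lib/${jar}
done
export CLASSPATH
echo "$CLASSPATH"
```

Pasting the resulting export into the three startup files has the same effect as the one-line version, but the list form is easier to audit against step 1.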
Verify the Impala configuration
The following example creates an external table that maps to an OSS-backed TPC-DS call_center dataset, then runs a query to confirm end-to-end access.
# Connect to Impala
[root@cdh-master ~]# impala-shell -i cdh-slave01:21000
-- Drop the table if it exists from a previous run
DROP TABLE IF EXISTS call_center;
-- Create an external table pointing to the OSS location
CREATE EXTERNAL TABLE call_center (
cc_call_center_sk BIGINT,
cc_call_center_id STRING,
cc_rec_start_date STRING,
cc_rec_end_date STRING,
cc_closed_date_sk BIGINT,
cc_open_date_sk BIGINT,
cc_name STRING,
cc_class STRING,
cc_employees INT,
cc_sq_ft INT,
cc_hours STRING,
cc_manager STRING,
cc_mkt_id INT,
cc_mkt_class STRING,
cc_mkt_desc STRING,
cc_market_manager STRING,
cc_division INT,
cc_division_name STRING,
cc_company INT,
cc_company_name STRING,
cc_street_number STRING,
cc_street_name STRING,
cc_street_type STRING,
cc_suite_number STRING,
cc_city STRING,
cc_county STRING,
cc_state STRING,
cc_zip STRING,
cc_country STRING,
cc_gmt_offset DOUBLE,
cc_tax_percentage DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 'oss://<your-bucket-name>/call_center';
-- Run a test query
SELECT cc_country, COUNT(*) FROM call_center GROUP BY cc_country;
Expected output:
+---------------+----------+
| cc_country | count(*) |
+---------------+----------+
| United States | 30 |
+---------------+----------+
Fetched 1 row(s) in 4.71s
If the query returns results, OSS access through Impala is working correctly.
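If you do not have the TPC-DS dataset at hand, you can sanity-check the table definition with a synthetic row before pointing it at real data. The values below are placeholders, not real TPC-DS data; upload the resulting file with hadoop fs -put oss://<your-bucket-name>/call_center/ before querying:

```shell
# Build one pipe-delimited row with the 31 fields the call_center table
# expects (field 5, cc_closed_date_sk, is left empty to exercise NULLs).
row=$(printf '%s|' \
  1 CC_ID_0001 2000-01-01 2001-01-01 '' 2450000 \
  'NY Metro' large 2 1138 8AM-4PM 'B. Manager' 6 class 'market desc' \
  'M. Market' 3 pri 6 corp 730 'Ash Hill' Blvd 'Suite 0' Midway \
  Williamson TN 31904 'United States' -5 0.11)
row=${row%|}                                  # drop the trailing delimiter
echo "$row" > /tmp/call_center.dat
awk -F'|' '{print NF}' /tmp/call_center.dat   # prints 31
```

A field count other than 31 means the file will not line up with the CREATE TABLE column list, and Impala will return NULLs for the trailing columns.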
Troubleshooting
CLASSPATH errors (ClassNotFoundException)
Symptoms: Errors like ClassNotFoundException: org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem when running Hadoop or Impala commands.
Likely cause: The symbolic links were not created correctly, or the JAR files are missing from the jars directory.
Steps to resolve:
1. Confirm that the extracted JARs exist in ${CDH_HOME}/jars/.
2. Check that the symbolic links in ${CDH_HOME}/lib/hadoop/ and its lib/ subdirectory point to valid files:
ls -la ${CDH_HOME}/lib/hadoop/lib/aliyun-*.jar
3. For Impala, verify the links in ${CDH_HOME}/lib/impala/lib/ and confirm the CLASSPATH export appears before exec in the impalad, statestored, and catalogd files.
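The link check can be automated: a symlink that still tests as a link after following it points at a missing target. The demo directory below is a stand-in so the sketch runs anywhere; on a real node set LIB_DIR to ${CDH_HOME}/lib/hadoop/lib (and again to ${CDH_HOME}/lib/impala/lib):

```shell
LIB_DIR=${LIB_DIR:-/tmp/cdh-lib-check}       # stand-in for the real lib dir
mkdir -p "$LIB_DIR"
touch "$LIB_DIR/real.jar"
ln -sf real.jar "$LIB_DIR/good.jar"          # resolves, not reported
ln -sf /no/such/file "$LIB_DIR/broken.jar"   # dangling, should be flagged
# -L follows links; anything still matching -type l is a broken link.
find -L "$LIB_DIR" -type l
```

Any path this prints is a dangling link; recreate it against the JAR actually present in ${CDH_HOME}/jars/.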
Authentication errors
Symptoms: Errors containing InvalidAccessKeyId, SignatureDoesNotMatch, or HTTP 403 responses when accessing OSS.
Likely cause: Incorrect AccessKey ID or AccessKey secret in core-site.xml (or CM configuration).
Steps to resolve:
1. Verify that the fs.oss.accessKeyId and fs.oss.accessKeySecret values match your AccessKey pair. To obtain or rotate your AccessKey pair, see Obtain an AccessKey pair.
2. Confirm that the AccessKey is active and has the required OSS permissions to read and write objects in the target bucket.
3. Restart the cluster after correcting the credentials.
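One common cause worth ruling out first is placeholder credentials left in the configuration file. The sketch below writes a demo file so it can run anywhere; on a real node, point CORE_SITE at your deployed core-site.xml instead (the demo path is an assumption):

```shell
CORE_SITE=${CORE_SITE:-/tmp/core-site-demo.xml}
# Demo file that still carries the placeholder; on a real node this block
# is skipped because the deployed core-site.xml already exists.
[ -f "$CORE_SITE" ] || cat > "$CORE_SITE" <<'EOF'
<property>
  <name>fs.oss.accessKeyId</name>
  <value>your-access-key-id</value>
</property>
EOF
# Flag placeholder values that were never replaced with real credentials.
if grep -q 'your-access-key' "$CORE_SITE"; then
  echo "Placeholder AccessKey values still present in $CORE_SITE"
fi
```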
Endpoint errors
Symptoms: Connection timeouts or UnknownHostException when accessing an OSS path.
Likely cause: The fs.oss.endpoint value does not match the region where your bucket is located.
Steps to resolve:
1. Check your bucket's region in the OSS console.
2. Set fs.oss.endpoint to the corresponding endpoint, for example oss-cn-hangzhou.aliyuncs.com for China (Hangzhou). For a full list of endpoints, see Regions and endpoints.
3. Restart the cluster after updating the endpoint.
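A quick reachability check needs only DNS, no credentials. The sketch below uses getent, which is standard on Linux; a failed lookup points at a typo in the endpoint or a cluster DNS problem rather than an OSS issue:

```shell
ENDPOINT=${ENDPOINT:-oss-cn-hangzhou.aliyuncs.com}
# If the name does not resolve, fs.oss.endpoint is wrong or cluster DNS
# cannot reach public resolvers.
if getent hosts "$ENDPOINT" >/dev/null; then
  STATUS="DNS OK for $ENDPOINT"
else
  STATUS="DNS lookup failed for $ENDPOINT"
fi
echo "$STATUS"
```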
HTTP version conflict (Resource Manager failures)
Symptoms: Resource Manager fails to start or throws errors related to httpclient or httpcore after the configuration changes.
Likely cause: The CLASSPATH order in the yarn file was not updated correctly in step 5, so the older httpclient version still takes precedence.
Steps to resolve:
1. Open ${CDH_HOME}/lib/hadoop-yarn/bin/yarn and confirm that $HADOOP_YARN_HOME/${YARN_DIR}/* appears before ${CLASSPATH}, not after.
2. Restart Resource Manager after making the correction.
What's next
Hadoop-Aliyun module documentation: full list of fs.oss.* configuration properties
Regions and endpoints: endpoint values for each OSS region
Obtain an AccessKey pair: create or manage your AccessKey credentials