CDH是Cloudera提供的包含Apache Hadoop核心組件的企業級巨量資料發行版,已支援Hadoop 3.0.0。本文將詳解如何配置CDH6環境下的Hadoop、Hive、Spark、Impala等組件,以實現對接阿里雲OSS儲存服務進行資料查詢操作。
前提條件
已搭建CDH6 叢集。具體操作,請參見安裝指南。本文以CDH6.0.1版本為例。
步驟一:增加OSS配置
通過叢集管理工具CM來增加配置。
若沒有CM管理的叢集,可以修改core-site.xml。以CM為例,需要增加如下配置:

配置項
說明
fs.oss.endpoint
填寫需要串連的OSS的Endpoint,樣本值為oss-cn-zhangjiakou-internal.aliyuncs.com。
fs.oss.accessKeyId
填寫存取金鑰中的AccessKeyId,存取金鑰要求擁有訪問OSS的許可權。
fs.oss.accessKeySecret
填寫存取金鑰中的AccessKeySecret,存取金鑰要求擁有訪問OSS的許可權。
fs.oss.impl
Hadoop OSS檔案系統實作類別。固定值為
org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem。fs.oss.buffer.dir
填寫臨時檔案目錄。 建議值為
/tmp/oss。fs.oss.connection.secure.enabled
是否開啟 HTTPS。開啟HTTPS可能會影響效能。 建議值為
false。fs.oss.connection.maximum
與OSS的串連數。建議值為
2048。更多參數解釋,請參見 Hadoop-Aliyun module。
根據CM提示重啟叢集。
測試讀寫OSS。
測試讀:
hadoop fs -ls oss://${your-bucket-name}/測試寫:
hadoop fs -mkdir oss://${your-bucket-name}/hadoop-test若測試可以讀寫OSS,則配置成功。若無法讀寫OSS,請檢查配置。
說明本文中所有 ${} 的內容為環境變數,請根據您實際的環境修改。
步驟二:配置Apache Impala
為了使CDH6版本中的Impala能夠支援OSS,您需要手動將OSS的相關JAR包添加到所有Impala節點的CLASSPATH環境變數中。具體操作步驟如下:
進入${CDH_HOME}/lib/impala目錄,建立符號連結:
[root@cdh-master impala]# cd lib/ [root@cdh-master lib]# ln -s ../../../jars/hadoop-aliyun-3.0.0-cdh6.0.1.jar hadoop-aliyun.jar [root@cdh-master lib]# ln -s ../../../jars/aliyun-sdk-oss-2.8.3.jar aliyun-sdk-oss-2.8.3.jar [root@cdh-master lib]# ln -s ../../../jars/jdom-1.1.jar jdom-1.1.jar進入${CDH_HOME}/bin目錄,修改impalad、statestored、catalogd三個檔案,在檔案最後一行exec命令前增加如下內容:
export CLASSPATH=${CLASSPATH}:${IMPALA_HOME}/lib/hadoop-aliyun.jar:${IMPALA_HOME}/lib/aliyun-sdk-oss-2.8.3.jar:${IMPALA_HOME}/lib/jdom-1.1.jar重啟所有節點的Impala相關進程。
步驟三:驗證配置
嘗試使用Apache Impala查詢OSS,並且運行TPC-DS的查詢語句。更多資訊,請參見Apache Impala介紹和TPC-DS查詢。
TPC-DS設計了一系列基於Hive QL的複雜查詢情境。考慮到Impala SQL與Hive QL之間具有較高的相容性,大部分TPC-DS查詢可以直接在Impala上執行。然而,由於兩者間仍存在一些細微差異,部分TPC-DS查詢可能因相容性問題無法直接在Impala環境中運行。
以下以選取大量能在Impala中順利執行的TPC-DS查詢作為實驗案例,通過這些案例來觀察並分析Impala在大規模資料處理上的效能表現。
產生sample資料到OSS。
[root@cdh-master ~]# git clone https://github.com/hortonworks/hive-testbench.git [root@cdh-master ~]# cd hive-testbench [root@cdh-master hive-testbench]# git checkout hive14 [root@cdh-master hive-testbench]# ./tpcds-build.sh [root@cdh-master hive-testbench]# FORMAT=textfile ./tpcds-setup.sh 50 oss://{your-bucket-name}/建表。
Benchmark的建表語句與Apache Impala的建表語句相容。您只需要複製ddl-tpcds/text/alltables.sql中的建表語句,然後修改${LOCATION}即可。
[root@cdh-master hive-testbench]# impala-shell -i cdh-slave01 -d default Starting Impala Shell without Kerberos authentication Connected to cdh-slave01:21000 Server version: impalad version 3.0.0-cdh6.0.1 RELEASE(build9a74a5053de5f7b8dd983802e6d75e58d31472db) *********************************************************************************** Welcome to the Impala shell. (Impala Shell v3.0.0-cdh6.0.1(9a74a50) built on Wed Sep 1911:27:37 PDT 2018) Want to know what version of Impala you're connected to? Run the VERSION command to find out! *********************************************************************************** Query: use `default` [cdh-slave01:21000] default> create external table call_center( > cc_call_center_sk bigint > , cc_call_center_id string > , cc_rec_start_date string > , cc_rec_end_date string > , cc_closed_date_sk bigint > , cc_open_date_sk bigint > , cc_name string > , cc_class string > , cc_employees int > , cc_sq_ft int > , cc_hours string > , cc_manager string > , cc_mkt_id int > , cc_mkt_class string > , cc_mkt_desc string > , cc_market_manager string > , cc_division int > , cc_division_name string > , cc_company int > , cc_company_name string > , cc_street_number string > , cc_street_name string > , cc_street_type string > , cc_suite_number string > , cc_city string > , cc_county string > , cc_state string > , cc_zip string > , cc_country string > , cc_gmt_offset double > , cc_tax_percentage double > ) > row format delimited fields terminated by '|' > location 'oss://{your-bucket-name}/50/call_center'; Query: create external table call_center( cc_call_center_sk bigint , cc_call_center_id string , cc_rec_start_date string , cc_rec_end_date string , cc_closed_date_sk bigint , cc_open_date_sk bigint , cc_name string , cc_class string , cc_employees int , cc_sq_ft int , cc_hours string , cc_manager string , cc_mkt_id int , cc_mkt_class string , cc_mkt_desc string , cc_market_manager string , cc_division int , cc_division_name string , cc_company int , cc_company_name string , cc_street_number string , cc_street_name string , cc_street_type string , cc_suite_number string , cc_city string , cc_county string , cc_state string , cc_zip string , cc_country string , cc_gmt_offset double , cc_tax_percentage double ) row format delimited fields terminated by '|' location 'oss://{your-bucket-name}/50/call_center' +-------------------------+ | summary | +-------------------------+ | Table has been created. | +-------------------------+ Fetched 1 row(s) in 4.10s檢查建表內容。
[cdh-slave01:21000] default> show tables; Query: show tables +------------------------+ | name | +------------------------+ | call_center | | catalog_page | | catalog_returns | | catalog_sales | | customer | | customer_address | | customer_demographics | | date_dim | | household_demographics | | income_band | | inventory | | item | | promotion | | reason | | ship_mode | | store | | store_returns | | store_sales | | time_dim | | warehouse | | web_page | | web_returns | | web_sales | | web_site | +------------------------+ Fetched 24 row(s) in 0.03s在Impala上執行TPC-DS查詢,已經可以正常查詢OSS的資料。
進入待查詢的SQL所在的 sample-queries-tpcds目錄。
[root@cdh-master hive-testbench]# ls sample-queries-tpcds query12.sql query20.sql query27.sql query39.sql query46.sql query54.sql query64.sql query71.sql query7.sql query87.sql query93.sql README.md query13.sql query21.sql query28.sql query3.sql query48.sql query55.sql query65.sql query72.sql query80.sql query88.sql query94.sql testbench.settings query15.sql query22.sql query29.sql query40.sql query49.sql query56.sql query66.sql query73.sql query82.sql query89.sql query95.sql testbench-withATS.settings query17.sql query24.sql query31.sql query42.sql query50.sql query58.sql query67.sql query75.sql query83.sql query90.sql query96.sql query18.sql query25.sql query32.sql query43.sql query51.sql query60.sql query68.sql query76.sql query84.sql query91.sql query97.sql query19.sql query26.sql query34.sql query45.sql query52.sql query63.sql query70.sql query79.sql query85.sql query92.sql query98.sql執行TPC-DS查詢 query13.sql。
[root@cdh-master hive-testbench]# impala-shell -i cdh-slave01 -d default -f sample-queries-tpcds/query13.sql Starting Impala Shell without Kerberos authentication Connected to cdh-slave01:21000 Server version: impalad version 3.0.0-cdh6.0.1 RELEASE(build9a74a5053de5f7b8dd983802e6d75e58d31472db) Query: use`default`Query: selectavg(ss_quantity) ,avg(ss_ext_sales_price) ,avg(ss_ext_wholesale_cost) ,sum(ss_ext_wholesale_cost) from store_sales ,store ,customer_demographics ,household_demographics ,customer_address ,date_dim where store.s_store_sk = store_sales.ss_store_sk and store_sales.ss_sold_date_sk = date_dim.d_date_sk and date_dim.d_year = 2001and((store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk and customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk and customer_demographics.cd_marital_status = 'M'and customer_demographics.cd_education_status = '4 yr Degree'and store_sales.ss_sales_price between100.00and150.00and household_demographics.hd_dep_count = 3 )or (store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk and customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk and customer_demographics.cd_marital_status = 'D'and customer_demographics.cd_education_status = 'Primary'and store_sales.ss_sales_price between50.00and100.00and household_demographics.hd_dep_count = 1 ) or (store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk and customer_demographics.cd_demo_sk = ss_cdemo_sk and customer_demographics.cd_marital_status = 'U'and customer_demographics.cd_education_status = 'Advanced Degree'and store_sales.ss_sales_price between150.00and200.00and household_demographics.hd_dep_count = 1 )) and((store_sales.ss_addr_sk = customer_address.ca_address_sk and customer_address.ca_country = 'United States'and customer_address.ca_state in('KY', 'GA', 'NM') and store_sales.ss_net_profit between100and200 ) or (store_sales.ss_addr_sk = customer_address.ca_address_sk and customer_address.ca_country = 'United States'and customer_address.ca_state in('MT', 'OR', 'IN') and store_sales.ss_net_profit between150and300 ) or (store_sales.ss_addr_sk = customer_address.ca_address_sk and customer_address.ca_country = 'United States'and customer_address.ca_state in('WI', 'MO', 'WV') and store_sales.ss_net_profit between50and250 )) Query submitted at: 2018-10-3011:44:47(Coordinator: http://cdh-slave01:25000) Query progress can be monitored at: http://cdh-slave01:25000/query_plan?query_id=ff4b3157eddfc3c4:8a31c6500000000 +-------------------+-------------------------+----------------------------+----------------------------+ | avg(ss_quantity) | avg(ss_ext_sales_price) | avg(ss_ext_wholesale_cost) | sum(ss_ext_wholesale_cost) | +-------------------+-------------------------+----------------------------+----------------------------+ | 30.87106918238994 | 2352.642327044025 | 2162.600911949685 | 687707.09 | +-------------------+-------------------------+----------------------------+----------------------------+ Fetched 1 row(s) in 353.16s
本次查詢涉及到6張表:store、store_sales、customer_demographics、household_demographics、customer_address、date_dim。
[cdh-slave01:21000] default> showtable STATS store; Query: showtable STATS store +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------+ | -1 | 1 | 37.56KB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/store | +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------+ Fetched 1 row(s) in 0.01s [cdh-slave01:21000] default> showtable STATS store_sales; Query: showtable STATS store_sales +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------------+ | -1 | 50 | 18.75GB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/store_sales | +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------------+ Fetched 1 row(s) in 0.01s [cdh-slave01:21000] default> showtable STATS customer_demographics; Query: showtable STATS customer_demographics +-------+--------+---------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+---------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+ | -1 | 50 | 76.92MB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/customer_demographics | +-------+--------+---------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+ Fetched 1 row(s) in 0.01s [cdh-slave01:21000] default> showtable STATS household_demographics; Query: showtable STATS household_demographics +-------+--------+----------+--------------+-------------------+--------+-------------------+------------------------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+----------+--------------+-------------------+--------+-------------------+------------------------------------------------------------+ | -1 | 1 | 148.10KB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/household_demographics | +-------+--------+----------+--------------+-------------------+--------+-------------------+------------------------------------------------------------+ Fetched 1 row(s) in 0.01s [cdh-slave01:21000] default> showtable STATS customer_address; Query: showtable STATS customer_address +-------+--------+---------+--------------+-------------------+--------+-------------------+------------------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+---------+--------------+-------------------+--------+-------------------+------------------------------------------------------+ | -1 | 1 | 40.54MB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/customer_address | +-------+--------+---------+--------------+-------------------+--------+-------------------+------------------------------------------------------+ Fetched 1 row(s) in 0.01s [cdh-slave01:21000] default> showtable STATS date_dim; Query: showtable STATS date_dim +-------+--------+--------+--------------+-------------------+--------+-------------------+----------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+--------+--------------+-------------------+--------+-------------------+----------------------------------------------+ | -1 | 1 | 9.84MB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/date_dim | +-------+--------+--------+--------------+-------------------+--------+-------------------+----------------------------------------------+ Fetched 1 row(s) in 0.01s