Distribusi Cloudera termasuk Apache Hadoop (CDH) adalah distribusi Apache Hadoop. Hadoop 3.0.0 didukung oleh Object Storage Service (OSS). Topik ini menjelaskan cara mengaktifkan komponen CDH 6 seperti Hadoop, Hive, Spark, dan Impala untuk menanyakan data OSS.
Prasyarat
Cluster CDH 6 telah disiapkan. Untuk informasi lebih lanjut, lihat Panduan Instalasi Cloudera. Topik ini menggunakan CDH 6.0.1 sebagai contoh.
Langkah 1: Tambahkan konfigurasi OSS
Gunakan alat manajemen cluster CM untuk menambahkan konfigurasi.
Jika tidak ada cluster yang dikelola oleh CM, modifikasi file core-site.xml. Tabel berikut menggunakan CM sebagai contoh untuk menjelaskan konfigurasi yang perlu ditambahkan.
Parameter
Deskripsi
fs.oss.endpoint
Masukkan endpoint yang digunakan untuk mengakses wilayah tempat bucket berada. Contoh: oss-cn-zhangjiakou-internal.aliyuncs.com.
fs.oss.accessKeyId
Masukkan ID AccessKey yang digunakan untuk mengakses OSS.
fs.oss.accessKeySecret
Masukkan Rahasia AccessKey yang digunakan untuk mengakses OSS.
fs.oss.impl
Masukkan kelas yang digunakan untuk mengimplementasikan sistem file OSS dengan menggunakan Hadoop. Atur nilainya menjadi
org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.fs.oss.buffer.dir
Masukkan direktori file sementara. Kami sarankan Anda mengatur parameter ini ke
/tmp/oss.fs.oss.connection.secure.enabled
Tentukan apakah akan mengaktifkan HTTPS. Performa mungkin terpengaruh saat HTTPS diaktifkan. Kami sarankan Anda mengatur parameter ini ke
false.fs.oss.connection.maximum
Masukkan jumlah maksimum koneksi ke OSS. Kami sarankan Anda mengatur parameter ini ke
2048.Untuk informasi lebih lanjut tentang deskripsi parameter, kunjungi Modul Hadoop-Aliyun.
Mulai ulang cluster sesuai petunjuk.
Uji membaca data dari dan menulis data ke OSS.
Uji pembacaan data dari OSS:
hadoop fs -ls oss://${your-bucket-name}/Uji penulisan data ke OSS:
hadoop fs -mkdir oss://${your-bucket-name}/hadoop-testJika data dapat dibaca dari dan ditulis ke OSS, konfigurasinya berhasil. Jika tidak, periksa konfigurasi.
CatatanSemua konten dalam ${} adalah variabel lingkungan. Modifikasi variabel lingkungan tersebut.
Langkah 2: Konfigurasikan Apache Impala
Secara default, CDH 6 mendukung OSS. Namun, paket JAR terkait OSS harus ditambahkan secara manual ke variabel CLASSPATH node Impala. Untuk menambahkan paket ini ke CLASSPATH, lakukan operasi berikut pada semua node Impala:
Pergi ke direktori ${CDH_HOME}/lib/impala dan buat tautan simbolis:
[root@cdh-master impala]# cd lib/ [root@cdh-master lib]# ln -s ../../../jars/hadoop-aliyun-3.0.0-cdh6.0.1.jar hadoop-aliyun.jar [root@cdh-master lib]# ln -s ../../../jars/aliyun-sdk-oss-2.8.3.jar aliyun-sdk-oss-2.8.3.jar [root@cdh-master lib]# ln -s ../../../jars/jdom-1.1.jar jdom-1.1.jarPergi ke direktori ${CDH_HOME}/bin. Tambahkan kode berikut sebelum perintah exec di baris terakhir file impalad, statestored, catalogd:
export CLASSPATH=${CLASSPATH}:${IMPALA_HOME}/lib/hadoop-aliyun.jar:${IMPALA_HOME}/lib/aliyun-sdk-oss-2.8.3.jar:${IMPALA_HOME}/lib/jdom-1.1.jarMulai ulang proses terkait Impala di semua node.
Langkah 3: Verifikasi konfigurasi
Gunakan Impala untuk menanyakan data OSS. Jalankan pernyataan query yang berasal dari TPC-DS. Untuk informasi lebih lanjut, kunjungi Apache Impala dan hive-testbench.
TPC-DS dirancang untuk query kompleks yang berbasis HiveQL. Karena Impala SQL sangat kompatibel dengan HiveQL, sebagian besar query TPC-DS dapat langsung dilakukan pada Impala. Namun, ada sedikit perbedaan antara Impala SQL dan HiveQL, dan query TPC-DS tertentu tidak dapat langsung dilakukan pada Impala.
Bagian ini menggunakan query TPC-DS yang dapat dilakukan pada Impala sebagai contoh untuk menunjukkan performa Impala dalam pemrosesan data berat.
Hasilkan data sampel dan simpan di OSS.
[root@cdh-master ~]# git clone https://github.com/hortonworks/hive-testbench.git [root@cdh-master ~]# cd hive-testbench [root@cdh-master hive-testbench]# git checkout hive14 [root@cdh-master hive-testbench]# ./tpcds-build.sh [root@cdh-master hive-testbench]# FORMAT=textfile ./tpcds-setup.sh 50 oss://{your-bucket-name}/Buat tabel.
Pernyataan yang digunakan untuk membuat tabel dengan menggunakan benchmark TPC-DS kompatibel dengan yang menggunakan Impala. Salin pernyataan dalam file ddl-tpcds/text/alltables.sql dan modifikasi ${LOCATION}. Contoh:
[root@cdh-master hive-testbench]# impala-shell -i cdh-slave01 -d default Starting Impala Shell without Kerberos authentication Connected to cdh-slave01:21000 Server version: impalad version 3.0.0-cdh6.0.1 RELEASE(build9a74a5053de5f7b8dd983802e6d75e58d31472db) *********************************************************************************** Welcome to the Impala shell. (Impala Shell v3.0.0-cdh6.0.1(9a74a50) built on Wed Sep 1911:27:37 PDT 2018) Ingin tahu versi Impala yang Anda gunakan? Jalankan perintah VERSION untuk mengetahuinya! *********************************************************************************** Query: use `default` [cdh-slave01:21000] default> create external table call_center( > cc_call_center_sk bigint > , cc_call_center_id string > , cc_rec_start_date string > , cc_rec_end_date string > , cc_closed_date_sk bigint > , cc_open_date_sk bigint > , cc_name string > , cc_class string > , cc_employees int > , cc_sq_ft int > , cc_hours string > , cc_manager string > , cc_mkt_id int > , cc_mkt_class string > , cc_mkt_desc string > , cc_market_manager string > , cc_division int > , cc_division_name string > , cc_company int > , cc_company_name string > , cc_street_number string > , cc_street_name string > , cc_street_type string > , cc_suite_number string > , cc_city string > , cc_county string > , cc_state string > , cc_zip string > , cc_country string > , cc_gmt_offset double > , cc_tax_percentage double > ) > row format delimited fields terminated by '|' > location 'oss://{your-bucket-name}/50/call_center'; Query: create external table call_center( cc_call_center_sk bigint , cc_call_center_id string , cc_rec_start_date string , cc_rec_end_date string , cc_closed_date_sk bigint , cc_open_date_sk bigint , cc_name string , cc_class string , cc_employees int , cc_sq_ft int , cc_hours string , cc_manager string , cc_mkt_id int , cc_mkt_class string , cc_mkt_desc string , cc_market_manager string , cc_division int , cc_division_name string , cc_company int , cc_company_name string , cc_street_number string , cc_street_name string , cc_street_type string , cc_suite_number string , cc_city string , cc_county string , cc_state string , cc_zip string , cc_country string , cc_gmt_offset double , cc_tax_percentage double ) row format delimited fields terminated by '|' location 'oss://{your-bucket-name}/50/call_center' +-------------------------+ | summary | +-------------------------+ | Table has been created. | +-------------------------+ Fetched 1 row(s) in 4.10sPeriksa isi tabel.
[cdh-slave01:21000] default> show tables; Query: show tables +------------------------+ | name | +------------------------+ | call_center | | catalog_page | | catalog_returns | | catalog_sales | | customer | | customer_address | | customer_demographics | | date_dim | | household_demographics | | income_band | | inventory | | item | | promotion | | reason | | ship_mode | | store | | store_returns | | store_sales | | time_dim | | warehouse | | web_page | | web_returns | | web_sales | | web_site | +------------------------+ Fetched 24 row(s) in 0.03sJalankan query TPC-DS pada Impala. Data dapat ditanyakan sesuai harapan.
Pergi ke direktori sample-queries-tpcds yang berisi file SQL.
[root@cdh-master hive-testbench]# ls sample-queries-tpcds query12.sql query20.sql query27.sql query39.sql query46.sql query54.sql query64.sql query71.sql query7.sql query87.sql query93.sql README.md query13.sql query21.sql query28.sql query3.sql query48.sql query55.sql query65.sql query72.sql query80.sql query88.sql query94.sql testbench.settings query15.sql query22.sql query29.sql query40.sql query49.sql query56.sql query66.sql query73.sql query82.sql query89.sql query95.sql testbench-withATS.settings query17.sql query24.sql query31.sql query42.sql query50.sql query58.sql query67.sql query75.sql query83.sql query90.sql query96.sql query18.sql query25.sql query32.sql query43.sql query51.sql query60.sql query68.sql query76.sql query84.sql query91.sql query97.sql query19.sql query26.sql query34.sql query45.sql query52.sql query63.sql query70.sql query79.sql query85.sql query92.sql query98.sqlJalankan query13.sql.
[root@cdh-master hive-testbench]# impala-shell -i cdh-slave01 -d default -f sample-queries-tpcds/query13.sql Starting Impala Shell without Kerberos authentication Connected to cdh-slave01:21000 Server version: impalad version 3.0.0-cdh6.0.1 RELEASE(build9a74a5053de5f7b8dd983802e6d75e58d31472db) Query: use`default`Query: selectavg(ss_quantity) ,avg(ss_ext_sales_price) ,avg(ss_ext_wholesale_cost) ,sum(ss_ext_wholesale_cost) from store_sales ,store ,customer_demographics ,household_demographics ,customer_address ,date_dim where store.s_store_sk = store_sales.ss_store_sk and store_sales.ss_sold_date_sk = date_dim.d_date_sk and date_dim.d_year = 2001and((store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk and customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk and customer_demographics.cd_marital_status = 'M'and customer_demographics.cd_education_status = '4 yr Degree'and store_sales.ss_sales_price between100.00and150.00and household_demographics.hd_dep_count = 3 )or (store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk and customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk and customer_demographics.cd_marital_status = 'D'and customer_demographics.cd_education_status = 'Primary'and store_sales.ss_sales_price between50.00and100.00and household_demographics.hd_dep_count = 1 ) or (store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk and customer_demographics.cd_demo_sk = ss_cdemo_sk and customer_demographics.cd_marital_status = 'U'and customer_demographics.cd_education_status = 'Advanced Degree'and store_sales.ss_sales_price between150.00and200.00and household_demographics.hd_dep_count = 1 )) and((store_sales.ss_addr_sk = customer_address.ca_address_sk and customer_address.ca_country = 'United States'and customer_address.ca_state in('KY', 'GA', 'NM') and store_sales.ss_net_profit between100and200 ) or (store_sales.ss_addr_sk = customer_address.ca_address_sk and customer_address.ca_country = 'United States'and customer_address.ca_state in('MT', 'OR', 'IN') and store_sales.ss_net_profit between150and300 ) or (store_sales.ss_addr_sk = customer_address.ca_address_sk and customer_address.ca_country = 'United States'and customer_address.ca_state in('WI', 'MO', 'WV') and store_sales.ss_net_profit between50and250 )) Query submitted at: 2018-10-3011:44:47(Coordinator: http://cdh-slave01:25000) Query progress can be monitored at: http://cdh-slave01:25000/query_plan?query_id=ff4b3157eddfc3c4:8a31c6500000000 +-------------------+-------------------------+----------------------------+----------------------------+ | avg(ss_quantity) | avg(ss_ext_sales_price) | avg(ss_ext_wholesale_cost) | sum(ss_ext_wholesale_cost) | +-------------------+-------------------------+----------------------------+----------------------------+ | 30.87106918238994 | 2352.642327044025 | 2162.600911949685 | 687707.09 | +-------------------+-------------------------+----------------------------+----------------------------+ Fetched 1 row(s) in 353.16s
Query ini melibatkan enam tabel: store, store_sales, customer_demographics, household_demographics, customer_address, dan date_dim.
[cdh-slave01:21000] default> showtable STATS store; Query: showtable STATS store +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------+ | -1 | 1 | 37.56KB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/store | +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------+ Fetched 1 row(s) in 0.01s [cdh-slave01:21000] default> showtable STATS store_sales; Query: showtable STATS store_sales +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------------+ | -1 | 50 | 18.75GB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/store_sales | +-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------------+ Fetched 1 row(s) in 0.01s [cdh-slave01:21000] default> showtable STATS customer_demographics; Query: showtable STATS customer_demographics +-------+--------+---------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+---------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+ | -1 | 50 | 76.92MB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/customer_demographics | +-------+--------+---------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+ Fetched 1 row(s) in 0.01s [cdh-slave01:21000] default> showtable STATS household_demographics; Query: showtable STATS household_demographics +-------+--------+----------+--------------+-------------------+--------+-------------------+------------------------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+----------+--------------+-------------------+--------+-------------------+------------------------------------------------------------+ | -1 | 1 | 148.10KB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/household_demographics | +-------+--------+----------+--------------+-------------------+--------+-------------------+------------------------------------------------------------+ Fetched 1 row(s) in 0.01s [cdh-slave01:21000] default> showtable STATS customer_address; Query: showtable STATS customer_address +-------+--------+---------+--------------+-------------------+--------+-------------------+------------------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+---------+--------------+-------------------+--------+-------------------+------------------------------------------------------+ | -1 | 1 | 40.54MB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/customer_address | +-------+--------+---------+--------------+-------------------+--------+-------------------+------------------------------------------------------+ Fetched 1 row(s) in 0.01s [cdh-slave01:21000] default> showtable STATS date_dim; Query: showtable STATS date_dim +-------+--------+--------+--------------+-------------------+--------+-------------------+----------------------------------------------+ | #Rows | #Files | Size | Bytes Cached | CacheReplication | Format | Incremental stats | Location | +-------+--------+--------+--------------+-------------------+--------+-------------------+----------------------------------------------+ | -1 | 1 | 9.84MB | NOT CACHED | NOT CACHED | TEXT | false | oss://{your-bucket-name}/50/date_dim | +-------+--------+--------+--------------+-------------------+--------+-------------------+----------------------------------------------+ Fetched 1 row(s) in 0.01s