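This topic describes how to use the JindoTable MoveTo command to migrate Hive table and partition data to the OSS-HDFS service.
Prerequisites
An EMR cluster has been created with version EMR-3.36.0 or later (excluding 3.39.x), or EMR-5.2.0 or later (excluding 5.5.x).
A partitioned table has been created with Hive commands and contains data. This tutorial uses a table named test_table whose partition column is dt and whose partition value is value.
The OSS-HDFS service has been activated and access to it has been authorized. For the steps, see the quick start for connecting non-EMR clusters to the OSS-HDFS service.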
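Background information
After the underlying data has been copied, the MoveTo command automatically updates the metadata, so the table and partition data is migrated to the new path in full. A filter condition lets you copy a large number of partitions in one run. During migration, the command also applies several safeguards to protect data integrity.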
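Procedure
Only one MoveTo process can run on a cluster at a time. If a MoveTo process is already running, a new MoveTo process fails to acquire the configuration lock, exits, and reports the process that is running. In that case, either terminate the running MoveTo process and start a new one, or wait for the running process to finish.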
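Because only one MoveTo process can run at a time, a wrapper script might look for an existing process before starting a new one. The following is a minimal sketch, not part of JindoTable: the helper name has_moveto_process is hypothetical, and matching the text -moveTo in the process list is an assumption about how the process appears on the node. It inspects only the local node, so the lock check performed by MoveTo itself remains authoritative.

```shell
#!/bin/sh
# Hypothetical helper: reads a process listing on stdin and succeeds if any
# line looks like a MoveTo invocation (assumption: the command line contains
# the text "-moveTo"). The bracket form '[-]moveTo' keeps grep from matching
# its own argument when the listing comes from ps.
has_moveto_process() {
    grep -q '[-]moveTo'
}

# Inspect the local process table before launching a new run.
if ps -eo args | has_moveto_process; then
    echo "A MoveTo process appears to be running; wait for it or stop it first."
else
    echo "No MoveTo process found on this node."
fi
```

The function reads from stdin rather than calling ps itself, which makes it easy to try against a saved process listing.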
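- Log on to the cluster over SSH. For details, see the topic on logging on to a cluster.
- Run the following command to view the help information.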
sudo jindo table -help moveTo
The help information is as follows:
<dbName.tableName>  The table to move.
<destination path>  The destination base directory which is always at the same level of a 'table location', where the moved partitions or un-partitioned data would located in.
<condition>/-fullTable  A filter condition to determine which partitions should be moved, supporting common operators (like '>') and built-in UDFs (like to_date) (UDFs not supported yet...), while -fullTable means that all partitions (or a whole un-partitioned table) should be moved. One but only one option must be specified among -c "<condition>" and -fullTable.
<before days>  Optional, saying that table/partitions should be moved only when they are created (not updated or modified) more than some days before from now.
<parallelism>  The maximum concurrency when copying partitions, 1 by default.
<OSS storage policy>  Storage policy for OSS destination, which can be Standard (by default), IA, Archive, or ColdArchive. Not applicable for destinations other than OSS. NOTE: if you are willing to use ColdArchive storage policy, please make sure that Cold Archive has been enabled for your OSS bucket.
-o/-overWrite  Overwriting the final paths where the data would be moved. For partitioned tables this overwrites partitions locations which are subdirectories of <destination path>; for un-partitioned table this overwrites the <destination path> itself.
-r/-removeSource  Let the source data be removed when the corresponding table/partition is successfully moved to the new destination. Otherwise (by default), the source data would be left as it was.
-skipTrash  Applicable only when [-r/-removeSource] is enabled. If present, source data would be immediately deleted from the file system, bypassing the trash.
-e/-explain  If present, the command would not really move data, but only prints the table/partitions that would be moved for given conditions.
<log directory>  A directory to locate log files, '/tmp/<current user>/' by default.
Command format
sudo jindo table -moveTo \
  -t <dbName.tableName> \
  -d <destination path> \
  [-c "<condition>" | -fullTable] \
  [-b/-before <before days>] \
  [-p/-parallel <parallelism>] \
  [-s/-storagePolicy <OSS storage policy>] \
  [-o/-overWrite] \
  [-r/-removeSource] \
  [-skipTrash] \
  [-e/-explain] \
  [-l/-logDir <log directory>]
Parameter description
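| Parameter | Required | Description |
| --- | --- | --- |
| -t <dbName.tableName> | Yes | Name of the table to move, in the format dbName.tableName. The database name and table name are separated by a period (.). The table can be partitioned or non-partitioned. |
| -d <destination path> | Yes | Destination location. Whether you move partitions or an entire non-partitioned table, this path corresponds to the table-level location. When partitions are moved, the full path of a partition is this path plus the partition name, for example <destination path>/p1=v1/p2=v2/. |
| -c "<condition>" or -fullTable | One of the two | Exactly one of the two options must be specified. -fullTable moves the entire table, which can be partitioned or non-partitioned. -c "<condition>" supplies a filter condition that selects the partitions to move and supports common operators such as greater than (>). For example, to select partitions of the String-typed partition column ds whose value is greater than 'd', specify -c " ds > 'd' ". |
| -b/-before <before days> | No | Only tables or partitions that were created more than the specified number of days ago are moved. |
| -p/-parallel <parallelism> | No | Parallelism of the migration (1 by default, according to the help information). |
| -s/-storagePolicy <OSS storage policy> | No | Not supported by the OSS-HDFS service. |
| -o/-overWrite | No | Whether to forcibly overwrite the destination path. For a partitioned table, only the paths of the partitions being moved are cleared, not the entire table path. |
| -r/-removeSource | No | Whether to clean up the source path after the move completes and the metadata has been updated. For a partitioned table, only the source paths of successfully moved partitions are cleaned up. |
| -skipTrash | No | Whether to bypass the trash when the source path is cleaned up. Note: this option must be used together with -r/-removeSource. |
| -e/-explain | No | Enables explain mode: only the list of partitions to be moved is displayed, and no data is actually moved. |
| -l/-logDir <log directory> | No | Directory for log files. Default: /tmp/<current user>/ |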
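The rule that exactly one of -c "<condition>" and -fullTable must be specified can be enforced before the command is ever run. The following is a minimal sketch under stated assumptions: build_moveto_cmd is a hypothetical helper, not part of JindoTable, and the word fullTable as its third argument is a convention invented for this example. It only prints a command line in the format shown above; it does not execute anything.

```shell
#!/bin/sh
# Hypothetical wrapper: prints a MoveTo command line after checking that a
# table, a destination, and exactly one selector were supplied.
# Arguments: <dbName.tableName> <destination path> <condition | fullTable>
build_moveto_cmd() {
    table=$1
    dest=$2
    selector=$3
    if [ -z "$table" ] || [ -z "$dest" ] || [ -z "$selector" ]; then
        echo "usage: build_moveto_cmd <db.table> <destination> <condition|fullTable>" >&2
        return 1
    fi
    if [ "$selector" = "fullTable" ]; then
        # Whole table: pass -fullTable and no condition.
        printf 'sudo jindo table -moveTo -t %s -d %s -fullTable\n' "$table" "$dest"
    else
        # Selected partitions: pass the filter condition in double quotes.
        printf 'sudo jindo table -moveTo -t %s -d %s -c "%s"\n' "$table" "$dest" "$selector"
    fi
}

# Print the command for the partitions of tdb.test_table whose dt is greater than 'v'.
build_moveto_cmd tdb.test_table \
    oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table \
    " dt > 'v' "
```

Printing the command instead of running it keeps the sketch safe to try anywhere; appending -e to the printed command previews the affected partitions without moving data.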
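- Migrate partition data to the OSS-HDFS service.
Check whether the partitions to be migrated are the ones you expect.
With the -e option, the command only lists the partitions to be migrated and does not run the migration.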
sudo jindotable -moveTo -t tdb.test_table -d oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table -c " dt > 'v' " -e
The following result is returned:
Found 1 partitions to move:
dt=value-2
MoveTo finished for table tdb.test_table to destination oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table with condition " dt > 'v' " (explain only).
Migrate the partitions to the OSS-HDFS service.
sudo jindotable -moveTo -t tdb.test_table -d oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table -c " dt > 'v' "
The following result is returned:
Found 1 partitions in total, and all are successfully moved.
Successfully moved partitions:
dt=value-2
No failed partition.
MoveTo finished for table tdb.test_table to destination oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table with condition " dt > 'v' ".
Verify that the partition was migrated by checking its Location property.
sudo hive> desc formatted test_table partition (dt='value-2');
The following result is returned:
OK
# col_name data_type comment
id int
content string
# Partition Information
# col_name data_type comment
dt string
# Detailed Partition Information
Partition Value: [value-2]
Database: tdb
Table: test_table
CreateTime: UNKNOWN
LastAccessTime: UNKNOWN
Location: oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table/dt=value-2
Optional: migrate the partitions from OSS-HDFS back to HDFS.
sudo jindotable -moveTo -t tdb.test_table -d hdfs://<hdfs-path>/user/hive/warehouse/tdb.db/test_table -c " dt > 'v' "
The following result is returned:
No successfully moved partition.
Failed partitions:
dt=value-2 New location is not empty but -overWrite is not enabled.
MoveTo finished for table tdb.test_table to destination hdfs://<hdfs-path>/user/hive/warehouse/tdb.db/test_table with condition -c " dt > 'v' ".
The result reports No successfully moved partition. because the HDFS destination directory is not empty. If you are sure that the destination directory can be discarded, add the -overWrite option to forcibly overwrite it so that the partitions are migrated from OSS-HDFS to HDFS.
sudo jindotable -moveTo -t tdb.test_table -d hdfs://<hdfs-path>/user/hive/warehouse/tdb.db/test_table -c " dt > 'v' " -overWrite
After the migration succeeds, the following result is returned:
Found 1 partitions in total, and all are successfully moved.
Successfully moved partitions:
dt=value-2
No failed partition.
MoveTo finished for table tdb.test_table to destination hdfs:///user/hive/warehouse/tdb.db/test_table with condition " dt > 'v' ", overwriting new locations.
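When MoveTo is run from a script, the summary text can be checked to decide whether to continue. The following is a minimal sketch that relies only on the sample results shown in the steps above: it treats the marker No failed partition. as the sign of success. The helper name moveto_succeeded is hypothetical, and whether the tool also signals failure through its exit status is not covered here.

```shell
#!/bin/sh
# Hypothetical check: reads MoveTo output on stdin and succeeds only if it
# contains the "No failed partition." marker from the sample success output.
moveto_succeeded() {
    grep -q 'No failed partition\.'
}

# Try the check against the sample failure output from the steps above.
sample="No successfully moved partition.
Failed partitions:
dt=value-2 New location is not empty but -overWrite is not enabled."

if printf '%s\n' "$sample" | moveto_succeeded; then
    echo "migration OK"
else
    echo "migration reported failures; review the output and the log directory"
fi

# On a cluster node the real command could be piped in instead, assuming the
# summary is written to stdout or stderr:
#   sudo jindotable -moveTo -t tdb.test_table -d <destination path> -c " dt > 'v' " 2>&1 | moveto_succeeded
```

Matching on the success marker rather than on a failure message means that unexpected or empty output is treated as a failure, which is the safer default before any cleanup of source data.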
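Troubleshooting
If migrating a table or partition fails with the message Conflicts found, handle the problem as follows.
- Make sure that no other command, for example a distributed copy command such as DistCp or JindoDistCp, is migrating data to the same destination path at the same time.
- Delete the destination directory. For a non-partitioned table, delete the table-level directory. For a partitioned table, delete the partition-level directories that have conflicts.
Do not delete the source directory.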
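The cleanup rule above (table-level directory for a non-partitioned table, partition-level directory for a partitioned table) can be sketched as a small helper that prints the delete command for review. This is an illustration under stated assumptions: print_cleanup_cmd is a hypothetical name, and using hadoop fs -rm -r against the destination assumes that the Hadoop client on the node can access that path. Nothing is deleted by the sketch itself.

```shell
#!/bin/sh
# Hypothetical helper: prints the command that would remove a conflicting
# directory under the DESTINATION. It never touches the source directory
# and does not run the command.
# Arguments: <destination path> [<partition name, for example dt=value-2>]
print_cleanup_cmd() {
    dest=${1%/}      # table-level destination path, trailing slash removed
    partition=$2     # empty for a non-partitioned table
    if [ -z "$dest" ]; then
        echo "usage: print_cleanup_cmd <destination path> [<partition name>]" >&2
        return 1
    fi
    if [ -n "$partition" ]; then
        # Partitioned table: remove only the conflicting partition directory.
        printf 'hadoop fs -rm -r %s/%s\n' "$dest" "$partition"
    else
        # Non-partitioned table: remove the table-level directory.
        printf 'hadoop fs -rm -r %s\n' "$dest"
    fi
}

# Print the cleanup command for the partition used in this tutorial.
print_cleanup_cmd \
    oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table \
    dt=value-2
```

Reviewing the printed path before running it is a simple guard against pointing the delete at the source location by mistake.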