Object Storage Service: Migrate Hive tables and partition data to OSS-HDFS

Last updated: Jun 19, 2024

This topic describes how to use the JindoTable MoveTo command to migrate Hive tables and partition data to OSS-HDFS.

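Prerequisites

  • A cluster of EMR-3.36.0 or later (excluding 3.39.x), or of EMR-5.2.0 or later (excluding 5.5.x), has been created.

  • A partitioned table has been created with Hive commands, and data has been written to the table. This tutorial uses a table named test_table whose partition column is dt and whose partition value is value.

  • OSS-HDFS has been activated and access to it has been authorized. For the specific steps, see the quick start for connecting non-EMR clusters to OSS-HDFS.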
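
The version rule in the first prerequisite can be checked mechanically. The following is a minimal sketch, not an official tool. The function name emr_version_supported is hypothetical. It assumes that "EMR-3.36.0 or later" refers to the 3.x line and "EMR-5.2.0 or later" refers to 5.x and above. It reports 4.x as unsupported only because the prerequisite does not mention it.

```shell
# Returns 0 if the version satisfies the prerequisite, 1 if it does not,
# and 2 if the string cannot be parsed as <major>.<minor>.<patch>.
# Assumption: the 3.x and 5.x+ lines are treated separately; 4.x is rejected
# only because the prerequisite does not mention it.
emr_version_supported() {
  ev_version=${1#EMR-}            # accept both "EMR-3.36.0" and "3.36.0"
  case "$ev_version" in
    *.*.*) ;;
    *) return 2 ;;
  esac
  ev_major=${ev_version%%.*}
  ev_rest=${ev_version#*.}
  ev_minor=${ev_rest%%.*}
  case "$ev_major" in ''|*[!0-9]*) return 2 ;; esac
  case "$ev_minor" in ''|*[!0-9]*) return 2 ;; esac

  if [ "$ev_major" -eq 3 ]; then
    # 3.36.0 or later, excluding 3.39.x
    [ "$ev_minor" -ge 36 ] && [ "$ev_minor" -ne 39 ]
  elif [ "$ev_major" -eq 5 ]; then
    # 5.2.0 or later, excluding 5.5.x
    [ "$ev_minor" -ge 2 ] && [ "$ev_minor" -ne 5 ]
  elif [ "$ev_major" -gt 5 ]; then
    return 0
  else
    return 1
  fi
}

# Usage:
#   if emr_version_supported "EMR-3.40.0"; then echo "supported"; fi
```

The function only encodes the rule as stated above. Whether a specific release actually ships the MoveTo command should still be confirmed on the cluster itself.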

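Background information

After the MoveTo command finishes copying the underlying data, it automatically updates the metadata, so that the data of tables and partitions is fully migrated to the new path. You can use a filter condition to copy a large number of partitions in a single run. During migration, the command also applies several safeguards to protect data integrity.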

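Procedure

Important

Only one MoveTo process can run on a cluster at a time. If a MoveTo process is already running, a new MoveTo process exits because it cannot acquire the configuration lock, and it reports the MoveTo process that is currently running. In this case, either terminate the running MoveTo process and then start a new one, or wait for the running MoveTo process to finish.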
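
If MoveTo is launched from your own scripts or scheduled jobs, a client-side guard can keep those launches from overlapping. The following is a minimal sketch under stated assumptions. The lock path and the function names are hypothetical. The guard only serializes launches made through this wrapper on one node. It does not replace or inspect the configuration lock that MoveTo itself takes.

```shell
# Hypothetical lock location; adjust for your environment.
MOVETO_LOCK_DIR="${MOVETO_LOCK_DIR:-/tmp/moveto-wrapper.lock}"

# Try to take the lock. mkdir is atomic: it fails if the directory already
# exists, so only one caller can succeed at a time.
acquire_moveto_lock() {
  mkdir "$1" 2>/dev/null
}

# Release the lock by removing the (empty) lock directory.
release_moveto_lock() {
  rmdir "$1" 2>/dev/null
}

# Usage (not executed here):
#   if acquire_moveto_lock "$MOVETO_LOCK_DIR"; then
#     trap 'release_moveto_lock "$MOVETO_LOCK_DIR"' EXIT
#     sudo jindotable -moveTo -t tdb.test_table -d <destination path> -fullTable
#   else
#     echo "Another MoveTo launch is in progress; wait for it or stop it first." >&2
#   fi
```

A lock directory is used here because creating it is a single atomic operation in POSIX shells. If a wrapper is killed before the trap runs, the directory has to be removed by hand.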

  1. Log on to the cluster over SSH. For details, see Log on to a cluster.
  2. Run the following command to view the help information.

    sudo jindo table -help moveTo

    The help information is as follows.

    <dbName.tableName>      The table to move.
    <destination path>      The destination base directory which is always at the
                              same level of a 'table location', where the moved
                              partitions or un-partitioned data would located in.
    <condition>/-fullTable  A filter condition to determine which partitions should
                              be moved, supporting common operators (like '>') and
                              built-in UDFs (like to_date) (UDFs not supported
                              yet...), while -fullTable means that all partitions (or
                              a whole un-partitioned table) should be moved. One but
                              only one option must be specified among -c
                              "<condition>" and -fullTable.
    <before days>           Optional, saying that table/partitions should be moved
                              only when they are created (not updated or modified)
                              more than some days before from now.
    <parallelism>           The maximum concurrency when copying partitions, 1 by
                              default.
    <OSS storage policy>    Storage policy for OSS destination, which can be Standard
                              (by default), IA, Archive, or ColdArchive. Not applicable for destinations other
                              than OSS. NOTE: if you are willing to use ColdArchive storage policy, please
                              make sure that Cold Archive has been enabled for your OSS bucket.
    
    -o/-overWrite           Overwriting the final paths where the data would be moved.
                              For partitioned tables this overwrites partitions locations
                              which are subdirectories of <destination path>; for
                              un-partitioned table this overwrites the <destination path>
                              itself.
    -r/-removeSource        Let the source data be removed when the corresponding
                              table/partition is successfully moved to the new destination.
                              Otherwise (by default), the source data would be left as it
                              was.
    -skipTrash              Applicable only when [-r/-removeSource] is enabled. If
                              present, source data would be immediately deleted from the
                              file system, bypassing the trash.
    -e/-explain             If present, the command would not really move data, but only
                              prints the table/partitions that would be moved for given
                              conditions.
    <log directory>         A directory to locate log files, '/tmp/<current user>/' by
                              default.
    • Command format

      sudo jindo table -moveTo \
        -t <dbName.tableName> \
        -d <destination path> \
        [-c "<condition>" | -fullTable] \
        [-b/-before <before days>] \
        [-p/-parallel <parallelism>] \
        [-s/-storagePolicy <OSS storage policy>] \
        [-o/-overWrite] \
        [-r/-removeSource] \
        [-skipTrash] \
        [-e/-explain] \
        [-l/-logDir <log directory>]
    • Command description

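      Each entry below lists a parameter, whether it is required, and its description.

      -t <dbName.tableName>
        Required: Yes
        Description: The table to move, in the format database.table. The database name and the table name are separated by a period (.). The table can be partitioned or non-partitioned.

      -d <destination path>
        Required: Yes
        Description: The destination to move the data to. Whether you move partitions or a whole non-partitioned table, this path corresponds to the table level. When partitions are moved, the full path of each partition is this path plus the partition name, for example <destination path>/p1=v1/p2=v2/

      -c "<condition>" | -fullTable
        Required: Yes
        Description: Exactly one of the two must be specified: either -c "<condition>" or -fullTable.
          • -fullTable moves the whole table, which can be partitioned or non-partitioned.
          • -c "<condition>" supplies a filter condition that selects the partitions to move. Common operators are supported, such as the greater-than sign (>).
            For example, for a partition ds of type String, to select partitions whose name is greater than 'd', use -c " ds > 'd' "

      -b/-before <before days>
        Required: No
        Description: Only tables or partitions that were created more than the specified number of days ago are moved.

      -p/-parallel <parallelism>
        Required: No
        Description: The parallelism of the migration. According to the help output, the default is 1.

      -s/-storagePolicy <OSS storage policy>
        Required: No
        Description: OSS-HDFS does not support this option.

      -o/-overWrite
        Required: No
        Description: Specifies whether to forcibly overwrite the destination path. For a partitioned table, only the paths of the partitions being moved are cleared, not the whole table path.

      -r/-removeSource
        Required: No
        Description: Specifies whether to clean up the source path after the move completes and the metadata is updated. For a partitioned table, only the source paths of successfully moved partitions are cleaned up.

      -skipTrash
        Required: No
        Description: Specifies whether to bypass the trash when the source path is cleaned up.
        Note: This option must be used together with -r/-removeSource.

      -e/-explain
        Required: No
        Description: If present, the command runs in explain mode: it only displays the list of partitions to be moved and does not actually move any data.

      -l/-logDir <log directory>
        Required: No
        Description: The directory for log files.
        Default: /tmp/<current user>/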

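  3. Migrate partition data to OSS-HDFS.

    1. Check whether the partitions to migrate are the ones you expect.

      With the -e option, the command only lists the partitions to migrate and does not actually run the migration.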

      sudo jindotable -moveTo -t tdb.test_table -d oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table -c " dt > 'v' " -e

      The following result is returned:

      Found 1 partitions to move:
            dt=value-2
      MoveTo finished for table tdb.test_table to destination oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table with condition " dt > 'v' " (explain only).
    2. Migrate the partition to OSS-HDFS.

      sudo jindotable -moveTo -t tdb.test_table -d oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table  -c " dt > 'v' " 

      The following result is returned:

      Found 1 partitions in total, and all are successfully moved.
      Successfully moved partitions:
          dt=value-2
      No failed partition.
      MoveTo finished for table tdb.test_table to destination oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table with condition " dt > 'v' ".
    3. In the Hive CLI, check the Location property to verify that the partition was migrated.

      hive> desc formatted test_table partition (dt='value-2');

      The following result is returned:

      OK
      # col_name              data_type               comment
      id                      int
      content                 string
      
      # Partition Information
      # col_name              data_type               comment
      dt                      string
      
      # Detailed Partition Information
      Partition Value:        [value-2]
      Database:               tdb
      Table:                  test_table
      CreateTime:             UNKNOWN
      LastAccessTime:         UNKNOWN
      Location:               oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table/dt=value-2
    4. Optional: Migrate the partition from OSS-HDFS to HDFS.

      sudo jindotable -moveTo -t tdb.test_table -d hdfs://<hdfs-path>/user/hive/warehouse/tdb.db/test_table  -c " dt > 'v' "

      The following result is returned:

      No successfully moved partition.
      Failed partitions:
          dt=value-2    New location is not empty but -overWrite is not enabled.
      MoveTo finished for table tdb.test_table to destination hdfs://<hdfs-path>/user/hive/warehouse/tdb.db/test_table with condition -c " dt > 'v' ".

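      The result reports No successfully moved partition. because the HDFS destination directory is not empty. If you are sure that the destination directory can be discarded, add the -overWrite option to forcibly overwrite it, so that the partition is migrated from OSS-HDFS to HDFS.

      sudo jindotable -moveTo -t tdb.test_table -d hdfs://<hdfs-path>/user/hive/warehouse/tdb.db/test_table  -c " dt > 'v' " -overWrite

      After the migration succeeds, the following result is returned: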

      Found 1 partitions in total, and all are successfully moved.
      Successfully moved partitions:
          dt=value-2
      No failed partition.
      MoveTo finished for table tdb.test_table to destination hdfs:///user/hive/warehouse/tdb.db/test_table with condition " dt > 'v' ", overwriting new locations.
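
The preview-then-move sequence in step 3 can be wrapped in a small shell helper so that exactly one of -c "<condition>" and -fullTable is always passed. This is a minimal sketch, not part of JindoTable. The function name moveto and the variable MOVETO_CMD are hypothetical. Only flags that appear in the command format above are used.

```shell
# Command prefix used to launch MoveTo. Left unquoted on purpose when invoked,
# so that "sudo jindotable" splits into two words. MOVETO_CMD is a
# hypothetical variable introduced for this sketch.
MOVETO_CMD="${MOVETO_CMD:-sudo jindotable}"

# moveto <dbName.tableName> <destination path> <condition|-fullTable> [flags...]
# The third argument is either the literal -fullTable or a filter condition,
# which guarantees that exactly one of the two selectors is passed.
moveto() {
  if [ "$#" -lt 3 ]; then
    echo "usage: moveto <db.table> <destination> <condition|-fullTable> [flags...]" >&2
    return 2
  fi
  mt_table=$1 mt_dest=$2 mt_sel=$3
  shift 3
  if [ -z "$mt_sel" ]; then
    echo "moveto: empty condition" >&2
    return 2
  fi
  if [ "$mt_sel" = "-fullTable" ]; then
    $MOVETO_CMD -moveTo -t "$mt_table" -d "$mt_dest" -fullTable "$@"
  else
    $MOVETO_CMD -moveTo -t "$mt_table" -d "$mt_dest" -c "$mt_sel" "$@"
  fi
}

# Usage (not executed here): preview with -e, inspect the list, then move.
#   dest=oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table
#   moveto tdb.test_table "$dest" " dt > 'v' " -e
#   moveto tdb.test_table "$dest" " dt > 'v' "
```

Keeping the preview and the real run on the same helper means both calls use the same table, destination, and condition, and differ only in the trailing -e.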

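Exception handling

If migrating a table or partition fails with the message Conflicts found, handle the issue as follows.

  • Make sure that no other command is migrating data to the same destination path at the same time, for example distributed copy commands such as DistCp and JindoDistCp.

  • Delete the destination directory. For a non-partitioned table, delete the table-level directory. For a partitioned table, delete the partition-level directories that have conflicts.

  • Do not delete the source directory.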
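
When you clean up a conflicting partition, you need the exact partition-level directory under the destination, which is the destination path plus the partition name. The following is a minimal sketch that only computes that path. The function name partition_dir is hypothetical. It rejects empty or malformed partition names so that a typo cannot silently resolve to the table-level directory.

```shell
# partition_dir <destination path> <key=value> [<key=value>...]
# Prints <destination path>/<key=value>/... for the given partition.
partition_dir() {
  if [ "$#" -lt 2 ]; then
    echo "usage: partition_dir <destination> <key=value> [<key=value>...]" >&2
    return 2
  fi
  pd_path=${1%/}                  # drop one trailing slash, if any
  shift
  for pd_part in "$@"; do
    case "$pd_part" in
      */*) echo "partition_dir: '/' not allowed in '$pd_part'" >&2; return 2 ;;
    esac
    case "$pd_part" in
      ?*=?*) ;;                   # must look like key=value
      *) echo "partition_dir: invalid partition '$pd_part'" >&2; return 2 ;;
    esac
    pd_path="$pd_path/$pd_part"
  done
  printf '%s\n' "$pd_path"
}

# Usage (not executed here): print the path, check it, and only then delete it.
#   target=$(partition_dir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table dt=value-2)
#   echo "$target"
#   hadoop fs -rm -r "$target"
```

Printing the path before deleting it gives you a chance to confirm that it points at the destination side and at the partition level, not at the source directory.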