JindoTable allows you to run the MoveTo command to migrate data in tables or partitions. This topic describes how to use the MoveTo command.

Prerequisites

  • Java Development Kit (JDK) 8 is installed on your computer.
  • An E-MapReduce (EMR) cluster is created. For more information, see Create a cluster.

Background information

The MoveTo command can automatically update metadata after the command copies the underlying data. This way, data in a table or partitions can be fully migrated to the destination path. You can configure filter conditions for the MoveTo command to migrate a large number of partitions at the same time. JindoTable also provides some protective measures to ensure data integrity and security when the MoveTo command is used to migrate data.

Limits

The MoveTo command is supported only in EMR V3.36.0 and later minor versions, and in EMR V5.2.0 and later minor versions.

Use the MoveTo command

Notice Only one MoveTo process can run at a time in an EMR cluster. If you start a MoveTo process while another MoveTo process is running on the cluster, the request is rejected because the configuration lock is unavailable, and a message that contains information about the running MoveTo process is displayed. In this case, either wait for the running MoveTo process to end, or terminate it and then start a new MoveTo process.
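
The "wait for the running MoveTo process to end" option can be scripted as a simple retry loop. The following is a minimal sketch, not part of JindoTable: the function name retry_until_free is hypothetical, and the sketch assumes that a rejected request exits with a non-zero status, which this topic does not confirm. Because any other failure would also trigger a retry, check the displayed message before you rely on it.

```shell
# Retry a command until it succeeds or the attempt limit is reached.
# Assumption (not confirmed by this topic): a MoveTo request that is rejected
# because the lock is unavailable exits with a non-zero status.
# Usage: retry_until_free <max attempts> <seconds between attempts> <command...>
retry_until_free() {
  ruf_max_attempts=$1
  ruf_wait_seconds=$2
  shift 2
  ruf_attempt=1
  while [ "$ruf_attempt" -le "$ruf_max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "Attempt $ruf_attempt failed; another MoveTo process may be running." >&2
    ruf_attempt=$((ruf_attempt + 1))
    # Sleep only if another attempt will follow.
    if [ "$ruf_attempt" -le "$ruf_max_attempts" ]; then
      sleep "$ruf_wait_seconds"
    fi
  done
  return 1
}

# Example call (hypothetical table and destination), up to 5 attempts, 60 seconds apart:
# retry_until_free 5 60 jindo table -moveTo -t mydb.sales -d oss://examplebucket/warehouse/sales -fullTable
```

The function returns 0 as soon as the command succeeds and 1 if every attempt fails, so it can be used in a larger script with ordinary exit-status checks.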
  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
  2. Run the following command to obtain help information:
    jindo table -help moveTo
    Information similar to the following output is returned:
    <dbName.tableName>      The table to move.
    <destination path>      The destination base directory which is always at the
                              same level of a 'table location', where the moved
                              partitions or un-partitioned data would located in.
    <condition>/-fullTable  A filter condition to determine which partitions should
                              be moved, supporting common operators (like '>') and
                              built-in UDFs (like to_date) (UDFs not supported
                              yet...), while -fullTable means that all partitions (or
                              a whole un-partitioned table) should be moved. One but
                              only one option must be specified among -c
                              "<condition>" and -fullTable.
    <before days>           Optional, saying that table/partitions should be moved
                              only when they are created (not updated or modified)
                              more than some days before from now.
    <parallelism>           The maximum concurrency when copying partitions, 1 by
                              default.
            <OSS storage policy>: Storage policy for OSS destination, which can be Standard
      (by default), IA, Archive, or ColdArchive. Not applicable for destinations other
      than OSS. NOTE: if you are willing to use ColdArchive storage policy, please
      make sure that Cold Archive has been enabled for your OSS bucket.
    
    -o/-overWrite     Overwriting the final paths where the data would be moved.
                        For partitioned tables this overwrites partitions' locations
                        which are subdirectories of <destination path>; for
                        un-partitioned table this overwrites the <destination path>
                        itself.
    -r/-removeSource  Let the source data be removed when the corresponding
                        table/partition is successfully moved to the new destination.
                        Otherwise (by default), the source data would be left as it
                        was.
    -skipTrash        Applicable only when [-r/-removeSource] is enabled. If
                        present, source data would be immediately deleted from the
                        file system, bypassing the trash.
    -e/-explain       If present, the command would not really move data, but only
                        prints the table/partitions that would be moved for given
                        conditions.
    <log directory>   A directory to locate log files, '/tmp/<current user>/' by
                        default.
    MoveTo syntax:
    jindo table -moveTo \
      -t <dbName.tableName> \
      -d <destination path> \
      [-c "<condition>" | -fullTable] \
      [-b/-before <before days>] \
      [-p/-parallel <parallelism>] \
      [-s/-storagePolicy <OSS storage policy>] \
      [-o/-overWrite] \
      [-r/-removeSource] \
      [-skipTrash] \
      [-e/-explain] \
      [-l/-logDir <log directory>]
    The following list describes the parameters and whether each one is required.

    -t <dbName.tableName>
      Required: Yes
      The table that you want to migrate. Specify the value in the Database name.Table name format, with a period (.) between the database name and the table name. The table can be a partitioned table or a non-partitioned table.

    -d <destination path>
      Required: Yes
      The destination path. This is always a table-level path, regardless of whether you migrate specific partitions or an entire non-partitioned table. When you migrate a partition, the complete path of the partition consists of the value of this parameter followed by the partition name, such as <destination path>/p1=v1/p2=v2/.

    -c "<condition>" | -fullTable
      Required: Yes (specify exactly one of the two)
      • If you specify -fullTable, the entire partitioned or non-partitioned table is migrated.
      • If you specify -c "<condition>", only the partitions that meet the filter condition are migrated. Common operators, such as the greater-than sign (>), are supported.
        For example, if the partition key column is ds, its data type is String, and you want to migrate the partitions whose ds value is greater than 'd', use -c " ds > 'd' ".

    -b/-before <before days>
      Required: No
      Only tables or partitions that were created at least the specified number of days ago are migrated. The creation time is used, not the update or modification time.

    -p/-parallel <parallelism>
      Required: No
      The maximum number of partitions that are copied concurrently. Default value: 1.

    -s/-storagePolicy <OSS storage policy>
      Required: No
      The storage class to use for data that is migrated to Object Storage Service (OSS). This parameter does not apply to destinations other than OSS. Valid values:
      • Standard (default)
      • IA
      • Archive
      • ColdArchive
      Note Make sure that the storage class you want to use is enabled on the destination OSS bucket.

    -o/-overWrite
      Required: No
      The destination path is forcibly cleared and overwritten. For a partitioned table, only the destination paths of the partitions that you migrate are cleared. For a non-partitioned table, the destination path itself is cleared.

    -r/-removeSource
      Required: No
      After data is migrated and metadata is updated, the source path is cleared. For a partitioned table, only the source paths of the migrated partitions are cleared. If you do not specify this option, the source data is left as it was.

    -skipTrash
      Required: No
      The source data is deleted immediately, without being moved to the trash, when the source path is cleared.
      Note You can specify this option only if -r/-removeSource is specified.

    -e/-explain
      Required: No
      The explain mode is used. In explain mode, the list of tables or partitions to be migrated is displayed, but no data is migrated.

    -l/-logDir <log directory>
      Required: No
      The directory in which log files are stored. Default value: /tmp/<current user>/.
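
The parameters above can be combined into a two-step workflow: preview the partitions in explain mode, then run the migration. The following is a minimal sketch under stated assumptions. The table name mydb.sales, the OSS path, and the values passed to -b, -p, and -s are hypothetical, and the run helper is not part of JindoTable. The filter condition is the example from this topic. By default the script only prints the commands; set DRY_RUN=0 to execute them on a cluster node.

```shell
set -eu

TABLE="mydb.sales"                           # hypothetical <dbName.tableName>
DEST="oss://examplebucket/warehouse/sales"   # hypothetical table-level destination path
COND="ds > 'd'"                              # filter condition from the example in this topic

# Helper (not part of JindoTable): print the command when DRY_RUN=1 (default),
# execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: explain mode lists the partitions that match the condition.
# No data is migrated.
preview_move() {
  run jindo table -moveTo -t "$TABLE" -d "$DEST" -c "$COND" -e
}

# Step 2: migrate the matching partitions that were created at least 30 days ago,
# copy up to 4 partitions concurrently, and store the data in the IA storage class.
# The values 30, 4, and IA are illustrative.
do_move() {
  run jindo table -moveTo -t "$TABLE" -d "$DEST" -c "$COND" -b 30 -p 4 -s IA
}

preview_move
```

After you review the preview output, call do_move to run the migration. The source data is kept because -r/-removeSource is not specified, which makes this a conservative starting point.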

Configure a lock directory

The MoveTo command uses a process lock and stores the lock files in an HDFS directory. The default directory is hdfs:///tmp/jindotable-lock/.
Notice The lock directory must be an HDFS path. If you do not have permissions on the default directory, perform the following steps to specify a custom directory.
  1. Go to the HDFS service page.
    1. Log on to the Alibaba Cloud EMR console.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    3. Click the Cluster Management tab.
    4. On the Cluster Management page, find your cluster and click Details in the Actions column.
    5. In the left-side navigation pane of the Cluster Overview page, choose Cluster Service > HDFS.
  2. Add a custom item.
    1. Click the Configure tab. Then, click the hdfs-site or core-site tab in the Service Configuration section.
    2. In the upper-right corner of the Service Configuration section, click Custom Configuration.
    3. In the Add Configuration Item dialog box, add the jindotable.moveto.tablelock.base.dir parameter and set it to an existing HDFS path.
      Notice When you customize a lock directory, make sure that no MoveTo process is running on the nodes of the cluster. Otherwise, the MoveTo process may fail, which may even cause data pollution.
  3. Save the configuration.
    1. In the upper-right corner of the Service Configuration section, click Save.
    2. In the Confirm Changes dialog box, specify Description and click OK.
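
Because jindotable.moveto.tablelock.base.dir must point to an existing HDFS path, you may want to create and verify the directory before you add the configuration item. The following is a minimal sketch that uses the standard hdfs dfs commands. The path hdfs:///user/hadoop/jindotable-lock is hypothetical, and the run helper is not part of HDFS or JindoTable. By default the script only prints the commands; set DRY_RUN=0 to execute them on a cluster node.

```shell
set -eu

# Hypothetical custom lock directory; replace it with a path on which you have permissions.
LOCK_DIR="hdfs:///user/hadoop/jindotable-lock"

# Helper (not part of HDFS or JindoTable): print the command when DRY_RUN=1 (default),
# execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

prepare_lock_dir() {
  # Create the directory, including any missing parent directories.
  run hdfs dfs -mkdir -p "$LOCK_DIR"
  # Exit status 0 confirms that the path exists and is a directory.
  run hdfs dfs -test -d "$LOCK_DIR"
}

prepare_lock_dir
```

Use the same path as the value of jindotable.moveto.tablelock.base.dir, and make sure that no MoveTo process is running when you change the configuration.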