Move Hive Tables & Partitions at Scale with JindoTable

Quick start

Preview which partitions match your filter, then run the migration:

# Step 1: Preview partitions to be migrated (no data is moved)
jindo table -moveTo \
  -t mydb.events \
  -d oss://my-bucket/archive/events \
  -c "ds < '2023-01-01'" \
  -e

# Step 2: Run the migration
jindo table -moveTo \
  -t mydb.events \
  -d oss://my-bucket/archive/events \
  -c "ds < '2023-01-01'"

Prerequisites

Before you begin, ensure that you have:

Java Development Kit (JDK) 8 installed on your computer
An E-MapReduce (EMR) cluster running EMR V3.36.0 or later, or EMR V5.2.0 or later

How it works

When MoveTo runs, it:

Copies the underlying data to the destination path.
Updates the table or partition metadata to point to the new location.
Optionally removes the source data after a successful migration (requires -r).

MoveTo uses a process lock stored in Hadoop Distributed File System (HDFS) to prevent concurrent runs. Only one MoveTo process can run at a time in an EMR cluster. If you start a second process while one is running, the request is rejected with a message that identifies the running process. Either wait for it to finish or stop it before starting a new one.
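
This topic does not show how MoveTo implements its lock. As an illustration only, the following sketch mimics the reject-if-running behavior on a local file system by using an atomic mkdir. The function names and the lock path are hypothetical, and the sketch does not touch the HDFS lock that MoveTo uses.

```shell
# Illustration only: a local-filesystem analogy for a single-process lock.
# acquire_lock, release_lock, and the lock path are hypothetical names,
# not part of JindoTable.

acquire_lock() {
  # mkdir is atomic: it succeeds for exactly one caller and fails if the
  # directory already exists.
  if mkdir "$1" 2>/dev/null; then
    echo "$$" > "$1/owner"          # record which process holds the lock
    return 0
  fi
  echo "rejected: lock held by process $(cat "$1/owner" 2>/dev/null)" >&2
  return 1
}

release_lock() {
  rm -rf -- "${1:?lock directory required}"
}

# Demo in a throwaway directory: the second attempt is rejected.
demo_root="$(mktemp -d)"
acquire_lock "${demo_root}/lock" && echo "first run: started"
acquire_lock "${demo_root}/lock" || echo "second run: rejected"
release_lock "${demo_root}/lock"
rm -rf -- "${demo_root}"
```

The second call fails because the lock directory already exists, which mirrors how a second MoveTo request is rejected while the first is still running.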

Migrate tables or partitions

Follow this three-step workflow to migrate data safely:

Preview — run with -e to verify which partitions will be moved.
Migrate — run the actual migration command.
Clean up — add -r to remove source data only after you confirm the migration succeeded.
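
One way to keep the preview and the real run in sync is to build both commands from the same arguments. The sketch below only prints the commands so that you can review them before running anything. The moveto_cmd function and its mode names are hypothetical helpers, not part of JindoTable.

```shell
# Hypothetical helper: prints a MoveTo command for the given mode.
# Nothing is executed; copy the printed line and run it yourself.
moveto_cmd() {
  local mode="$1" table="$2" dest="$3" condition="$4"
  local cmd
  cmd="jindo table -moveTo -t ${table} -d ${dest} -c \"${condition}\""
  case "${mode}" in
    preview)        printf '%s -e\n' "${cmd}" ;;   # list partitions only
    migrate)        printf '%s\n' "${cmd}" ;;      # copy data, update metadata
    migrate-remove) printf '%s -r\n' "${cmd}" ;;   # also remove source data
    *) echo "unknown mode: ${mode}" >&2; return 1 ;;
  esac
}

# Same table, destination, and filter for both commands.
moveto_cmd preview mydb.events oss://my-bucket/archive/events "ds < '2023-01-01'"
moveto_cmd migrate mydb.events oss://my-bucket/archive/events "ds < '2023-01-01'"
```

Because the filter is typed once, the migration cannot silently target a different set of partitions than the preview did.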

Warning

-r/-removeSource permanently removes source data after migration. When combined with -skipTrash, data is deleted immediately without going to the HDFS trash. Always run with -e/-explain first to verify which partitions will be moved before using these flags.

Important

Do not start a MoveTo process on a cluster where one is already running. The new request will be rejected.

Step 1: Log on to your EMR cluster

Log on to your EMR cluster in SSH mode. For more information, see Log on to a cluster.

Step 2: Preview the migration (recommended)

Run the command with -e to print the list of matching partitions without moving any data:

jindo table -moveTo \
  -t <dbName.tableName> \
  -d <destination path> \
  -c "<condition>" \
  -e

Example: Preview all partitions whose ds value is earlier than 2023-01-01:

jindo table -moveTo \
  -t mydb.events \
  -d oss://my-bucket/archive/events \
  -c "ds < '2023-01-01'" \
  -e

The command prints the list of matching partitions without moving any data.

Step 3: Run the migration

To view all available options for the MoveTo command, run:

jindo table -help moveTo

The full command syntax is:

jindo table -moveTo \
  -t <dbName.tableName> \
  -d <destination path> \
  [-c "<condition>" | -fullTable] \
  [-b/-before <before days>] \
  [-p/-parallel <parallelism>] \
  [-s/-storagePolicy <OSS storage policy>] \
  [-o/-overWrite] \
  [-r/-removeSource] \
  [-skipTrash] \
  [-e/-explain] \
  [-l/-logDir <log directory>]

Parameter	Description	Required
`-t <dbName.tableName>`	The table to migrate, in `database.table` format. Supports both partitioned and non-partitioned tables.	Yes
`-d <destination path>`	The table-level destination path. For partitioned tables, the full partition path is composed as `<destination path>/p1=v1/p2=v2/`.	Yes
`-c "<condition>"` \| `-fullTable`	Specify one of these two options. Use `-fullTable` to move the entire table, or `-c "<condition>"` to move only the partitions that match a filter condition. Common comparison operators such as `>` are supported. Example: `-c "ds > 'd'"`.	Yes (one of the two)
`-b/-before <before days>`	Migrate only tables or partitions created at least the specified number of days ago.	No
`-p/-parallel <parallelism>`	Maximum number of concurrent partition copy operations. Defaults to `1`.	No
`-s/-storagePolicy <OSS storage policy>`	Target storage class for Object Storage Service (OSS) destinations. Valid values: `Standard` (default), `IA`, `Archive`, `ColdArchive`. Not applicable for non-OSS destinations.	No
`-o/-overWrite`	Clear the destination path before writing. For partitioned tables, only the destination path of the migrated partition is cleared.	No
`-r/-removeSource`	Remove the source path after the migration and metadata update succeed. For partitioned tables, only the source path of the migrated partition is removed.	No
`-skipTrash`	Delete source data immediately, bypassing the HDFS trash. Only valid when `-r`/`-removeSource` is specified.	No
`-e/-explain`	Print the list of partitions to be migrated without moving any data. Use this to validate your filter conditions before running the actual migration.	No
`-l/-logDir <log directory>`	Directory for log files. Defaults to `/tmp/<current user>/`.	No
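
To make the path rule for -d concrete, the following sketch composes the destination of a single partition from the table-level path and its partition specs. The partition_dest function and the hour partition column are hypothetical and used only for illustration.

```shell
# Hypothetical helper: composes <destination path>/p1=v1/p2=v2/ from a
# table-level destination and a list of partition specs.
partition_dest() {
  local path="${1%/}"   # drop one trailing slash, if present
  shift
  local spec
  for spec in "$@"; do
    path="${path}/${spec}"
  done
  printf '%s/\n' "${path}"
}

partition_dest oss://my-bucket/archive/events ds=2022-12-31 hour=23
# prints: oss://my-bucket/archive/events/ds=2022-12-31/hour=23/
```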

Examples

Migrate all partitions in mydb.events to an OSS path in the Archive storage class, copying up to 4 partitions concurrently, and remove the source data after a successful migration:

jindo table -moveTo \
  -t mydb.events \
  -d oss://my-bucket/archive/events \
  -fullTable \
  -s Archive \
  -p 4 \
  -r

Note

Before using the ColdArchive storage class, make sure that Cold Archive is enabled on the destination OSS bucket.

Migrate the partitions of mydb.logs whose ds value is earlier than 2023-06-01 and that were created at least 90 days ago, and store them in the IA storage class:

jindo table -moveTo \
  -t mydb.logs \
  -d oss://my-bucket/cold/logs \
  -c "ds < '2023-06-01'" \
  -b 90 \
  -s IA
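
For a recurring job, the hardcoded date in the -c condition can be replaced with a rolling cutoff. The sketch below assumes GNU date (for the -d option) and a ds partition value in YYYY-MM-DD format. The cutoff_date function is a hypothetical helper. The sketch only prints a preview command. Note that -b still filters on creation time, independently of the ds value.

```shell
# Assumes GNU date; cutoff_date is a hypothetical helper, not part of JindoTable.
cutoff_date() {
  # $1: number of days to look back
  # $2: reference date as YYYY-MM-DD (defaults to today in UTC)
  local days="$1" ref="${2:-$(date -u +%F)}"
  date -u -d "${ref} ${days} days ago" +%F
}

cutoff="$(cutoff_date 90)"

# Print (do not run) a preview command that uses the computed cutoff.
echo jindo table -moveTo -t mydb.logs -d oss://my-bucket/cold/logs \
  -c "\"ds < '${cutoff}'\"" -b 90 -s IA -e
```

For example, cutoff_date 90 2023-08-30 prints 2023-06-01.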

Configure a custom lock directory

MoveTo uses a process lock stored in HDFS to prevent concurrent runs. The default lock path is hdfs:///tmp/jindotable-lock/.

Important

The lock path must be an HDFS path. If you do not have write permission on the default path, follow these steps to set a custom path.

Warning

Before changing the lock directory, make sure no MoveTo process is running on the cluster. Changing the lock directory while a process is active may cause the process to fail and could result in data corruption.

1. Go to the HDFS service page in the EMR console.
    1. Log on to the Alibaba Cloud EMR console.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group.
    3. Click the Cluster Management tab.
    4. On the Cluster Management page, find your cluster and click Details in the Actions column.
    5. In the left-side navigation pane of the Cluster Overview page, choose Cluster Service > HDFS.
2. Add a custom configuration item.
    1. Click the Configure tab, then click hdfs-site or core-site in the Service Configuration section.
    2. In the upper-right corner of the Service Configuration section, click Custom Configuration.
    3. In the Add Configuration Item dialog box, add the jindotable.moveto.tablelock.base.dir parameter and set its value to an existing HDFS path.
3. Save the configuration.
    1. In the upper-right corner of the Service Configuration section, click Save.
    2. In the Confirm Changes dialog box, fill in Description and click OK.
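
Because the parameter value must be an existing HDFS path, it can help to check the directory from a cluster node before you save the configuration. The following sketch uses the standard hdfs dfs -test -d command. The check_lock_dir function, the HDFS_CMD override, and the example path are hypothetical.

```shell
# check_lock_dir is a hypothetical helper. HDFS_CMD exists only so that the
# hdfs client can be swapped out for testing; it defaults to the real "hdfs".
check_lock_dir() {
  local dir="$1"
  if "${HDFS_CMD:-hdfs}" dfs -test -d "${dir}"; then
    echo "ok: ${dir} exists"
  else
    echo "missing: ${dir}" >&2
    echo "create it first, for example: hdfs dfs -mkdir -p ${dir}" >&2
    return 1
  fi
}

# Example (run on a cluster node; the path is a placeholder):
#   check_lock_dir hdfs:///user/etl/jindotable-lock
```

The check only confirms that the directory exists. Whether the user that runs MoveTo can write to it still needs to be verified separately.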