Hadoop Distributed File System (HDFS) cannot be scaled out indefinitely due to limits on cluster size and scaling costs. As a result, capacity bottlenecks occur in HDFS. Alibaba Cloud provides Object Storage Service (OSS) and OSS-HDFS, which is compatible with the HDFS API, to seamlessly expand the storage capacity of a cloud-based Hadoop ecosystem. You can use JindoTable to filter Hive data based on the partition key and migrate partitions between HDFS and OSS-HDFS. This topic describes how to use JindoTable to migrate Hive tables and partitions to OSS-HDFS.
Prerequisites
An E-MapReduce (EMR) cluster of V3.42.0 or a later minor version or of V5.8.0 or a later minor version is created. For more information, see Create a cluster.
A partitioned table is created by running a Hive command, and data is written to the table.
In this example, a partitioned table named test_table is created. The partition key is dt and the partition value is value.
OSS-HDFS is enabled and access permissions on OSS-HDFS are granted. For more information, see Enable OSS-HDFS and grant access permissions.
Precautions
You must add the jindotable.moveto.tablelock.base.dir configuration item to the core-site.xml or hdfs-site.xml configuration file of Hadoop. The configuration files reside in the $HADOOP_CONF_DIR directory.
Set the value of this configuration item to an HDFS directory that stores the lock files automatically created while the moveTo tool runs. Make sure that only the moveTo tool has permissions to access the directory. If you do not configure this item, the default directory hdfs:///tmp/jindotable-lock/ is used. If the moveTo tool does not have permissions to access the directory, an error is reported.
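For example, the configuration item can be declared in core-site.xml as follows. This is a sketch only: the HDFS path in the value is an example, and you should choose a directory that only the moveTo tool can access.

```xml
<!-- Example only: replace the value with an HDFS directory that only
     the moveTo tool has permissions to access. -->
<property>
  <name>jindotable.moveto.tablelock.base.dir</name>
  <value>hdfs:///apps/jindotable-lock/</value>
</property>
```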
Use JindoTable
Obtain help information
Run the following command to obtain the help information about the moveTo tool:
jindotable -help moveTo
Parameter description
jindotable -moveTo -t <dbName.tableName> -d <destination path> [-c "<condition>" | -fullTable] [-b/-before <before days>] [-p/-parallel <parallelism>] [-s/-storagePolicy <OSS storage policy>] [-o/-overWrite] [-r/-removeSource] [-skipTrash] [-e/-explain] [-q/-queue <yarn queue>] [-w/-workingDir <working directory>] [-l/-logDir <log directory>]
Parameter | Description | Required |
-t <dbName.tableName> | The table that you want to move. | Yes |
-d <destination path> | The destination directory in which the table is stored. The directories of partitions are automatically created in this directory. | Yes |
-c "<condition>" | -fullTable | The expression of the filter condition for partitions. Basic operators are supported. User-defined functions (UDFs) are not supported. | No |
-b/-before <before days> | The time condition for migrating partitions. Unit: days. Only the partitions that were created the specified number of days ago are migrated. | No |
-p/-parallel <parallelism> | The maximum parallelism of the task that is run by using the moveTo tool. Default value: 1. | No |
-s/-storagePolicy <OSS storage policy> | The storage policy for data files that are copied to OSS. Default value: Standard. Valid values: Standard, IA, Archive, and ColdArchive. | No |
-o/-overWrite | Specifies whether to overwrite data in the destination directory. Only data in the directories of migrated partitions is overwritten. The directories of partitions that are not migrated remain unchanged. | No |
-r/-removeSource | Specifies whether to delete the source data after the migration is complete. | No |
-skipTrash | Specifies whether to bypass the recycle bin when the system deletes source data. | No |
-e/-explain | Specifies whether to use the explain mode. In explain mode, the system displays the partitions to be migrated but does not migrate the partitions. | No |
-q/-queue <yarn queue> | Specifies the YARN queue for distributed copy. | No |
-w/-workingDir <working directory> | The temporary working directory for distributed copy. | No |
-l/-logDir <log directory> | The directory in which log files are stored. Default value: /tmp/<current user>/. | No |
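The parameters above can be combined into a single invocation. The following is a minimal sketch that assembles such a command and prints it for review instead of executing it; the bucket, region, table name, and option values are placeholder assumptions, not values from your cluster.

```shell
#!/bin/sh
# Sketch: build a moveTo command from the parameters described above.
# All names below are placeholders; replace them with your own values.
DB_TABLE="tdb.test_table"
DEST="oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table"

# -p 4 raises parallelism, -s IA selects the IA storage policy, and -e keeps
# the run in explain mode so nothing is actually migrated.
CMD="jindotable -moveTo -t ${DB_TABLE} -d ${DEST} -c \" dt > 'v' \" -p 4 -s IA -e"

# Print the command so it can be reviewed before being run manually.
echo "${CMD}"
```

Running the printed command with -e first is a safe way to confirm which partitions match the filter before removing the flag for the real migration.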
Procedure
Log on to the master node of your cluster in SSH mode. For more information, see Log on to a cluster.
Check whether the partitions that you want to migrate are as expected.
You can use the -e option to list the partitions that you want to migrate. The partitions are not actually migrated.
jindotable -moveTo -t tdb.test_table -d oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table -c " dt > 'v' " -e
Migrate partitions to OSS-HDFS.
jindotable -moveTo -t tdb.test_table -d oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table -c " dt > 'v' "
Run the following command in the Hive CLI to check whether partitions are migrated to OSS-HDFS:
desc formatted test_table partition (dt='value');
Optional. Migrate partitions from OSS-HDFS to HDFS.
jindotable -moveTo -t tdb.test_table -d hdfs://<hdfs-path>/user/hive/warehouse/tdb.db/test_table -c " dt > 'v' "
If the "No successfully moved partition" message appears in the returned result, the destination HDFS directory is not empty. If you confirm that the destination directory can be cleared, you can use the -overWrite option to forcefully migrate the partitions by overwriting data in the destination directory.
jindotable -moveTo -t tdb.test_table -d hdfs://<hdfs-path>/user/hive/warehouse/tdb.db/test_table -c " dt > 'v' " -overWrite
Exception handling
To ensure data security and prevent dirty data, JindoTable automatically checks a destination directory to ensure that no other command is run to copy data to the same directory. If a conflict occurs, the current command for migrating a table or partitions fails. In this case, you must stop all the other copy tasks and clear the destination directory. Then, run the copy command again. For a non-partitioned table, the destination directory is the directory in which the table is stored. For a partitioned table, each partition is copied or moved to its own destination directory. You need to only clear the directories for partitions to be copied or moved.
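When such a conflict occurs, stopping the other copy tasks and clearing the destination directory can be sketched with the following shell commands. The lock directory below is the default path from the Precautions section; the destination partition path is a placeholder example, and the commands are printed for review rather than executed directly.

```shell
#!/bin/sh
# Sketch: inspect leftover moveTo lock files and clear one conflicting
# destination partition directory. Paths below are examples only.
LOCK_DIR="hdfs:///tmp/jindotable-lock/"
DEST_PART="oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/data/tdb.test_table/dt=value"

# Print the commands instead of running them, so destructive steps
# can be double-checked against your cluster first.
echo "hadoop fs -ls ${LOCK_DIR}"
echo "hadoop fs -rm -r ${DEST_PART}"
```

For a partitioned table, only the directories of the partitions being copied or moved need to be cleared, so adjust DEST_PART per conflicting partition.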
If an exception interrupts the migration of a table or partition, manual intervention may be required. If a command is aborted, the copy process is incomplete, but the source data and the metadata of the table remain unchanged. Therefore, the data is still in a safe state. Common causes of unexpected command aborts:
The command process is killed by a user before the running of the command ends.
An exception, such as an out-of-memory error, occurs and terminates the command process.
References
If you use an environment other than EMR, you must install and deploy JindoSDK first. For more information, see Deploy JindoSDK in an environment other than EMR.