Use Hadoop DistCp to copy data from an Object Storage Service (OSS) bucket to a Lindorm file database in a single MapReduce job.
Prerequisites
Before you begin, ensure that you have:
- Activated the file engine for your Lindorm instance. For more information, see Activate the file engine service.
- A Hadoop cluster running Hadoop 2.7.3 or later, configured to access the Lindorm file engine. For more information, see Use open source HDFS clients to connect to and use LindormDFS.
- JDK 1.8 or later installed on all nodes of the Hadoop cluster.
- JindoFS SDK installed on all nodes of the Hadoop cluster. See Install the JindoFS SDK below.
Install the JindoFS SDK
Install the JindoFS SDK on every node in your Hadoop cluster before running the migration.
- Download jindofs-sdk.jar from the JindoFS SDK repository, then copy it to the Hadoop library directory:

      cp ./jindofs-sdk-*.jar ${HADOOP_HOME}/share/hadoop/hdfs/lib/

- Add the following environment variable to /etc/profile on each node:

      export B2SDK_CONF_DIR=/etc/jindofs-sdk-conf

- Create the JindoFS SDK configuration file at /etc/jindofs-sdk-conf/bigboot.cfg:

      [bigboot]
      logger.dir=/tmp/bigboot-log
      [bigboot-client]
      client.oss.retry=5
      client.oss.upload.threads=4
      client.oss.upload.queue.size=5
      client.oss.upload.max.parallelism=16
      client.oss.timeout.millisecond=30000
      client.oss.connection.timeout.millisecond=4000

- Load the environment variable:

      source /etc/profile

- Verify that your OSS bucket is accessible from the Hadoop cluster:

      ${HADOOP_HOME}/bin/hadoop fs -ls oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/

  If the command returns the bucket contents without errors, the SDK is configured correctly.
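The installation steps above leave three things on each node: the JAR in the Hadoop library directory, the B2SDK_CONF_DIR variable, and the bigboot.cfg file. As a rough sketch, a check like the following could be run on each node before the migration. The function name check_jindofs_sdk and its two arguments are hypothetical and not part of the JindoFS SDK. The check only confirms that the files exist, not that their contents are valid.

```shell
# Hypothetical helper (not part of the JindoFS SDK).
# Usage: check_jindofs_sdk HADOOP_HOME CONF_DIR
# Returns 0 when the SDK JAR and bigboot.cfg are both in place, 1 otherwise.
check_jindofs_sdk() (
  hadoop_home="$1"
  conf_dir="$2"
  status=0

  # An unmatched glob stays unexpanded, so test each candidate with -e.
  jar_found=0
  for jar in "${hadoop_home}"/share/hadoop/hdfs/lib/jindofs-sdk-*.jar; do
    if [ -e "${jar}" ]; then
      jar_found=1
    fi
  done
  if [ "${jar_found}" -eq 0 ]; then
    echo "missing: jindofs-sdk JAR in ${hadoop_home}/share/hadoop/hdfs/lib/" >&2
    status=1
  fi

  # CONF_DIR is expected to be the value of B2SDK_CONF_DIR.
  if [ -z "${conf_dir}" ]; then
    echo "missing: B2SDK_CONF_DIR is not set" >&2
    status=1
  elif [ ! -f "${conf_dir}/bigboot.cfg" ]; then
    echo "missing: ${conf_dir}/bigboot.cfg" >&2
    status=1
  fi

  return "${status}"
)
```

After sourcing /etc/profile, the check could be called as check_jindofs_sdk "${HADOOP_HOME}" "${B2SDK_CONF_DIR}". A non-zero exit status means at least one of the earlier steps still needs to be completed on that node.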
Migrate data from the OSS bucket
- Check the size of the data to migrate:

      ${HADOOP_HOME}/bin/hadoop fs -du -h oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data

- Run DistCp to start a MapReduce job that copies the data to the Lindorm file database:

      ${HADOOP_HOME}/bin/hadoop distcp \
        oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data.txt \
        hdfs://<instance-id>/

  Replace <instance-id> with your Lindorm instance ID. The following table describes the parameters:

  | Parameter | Description | Required |
  | --- | --- | --- |
  | accessKeyId | The AccessKey ID used to authenticate OSS API calls. To get your AccessKey pair, see Create an AccessKey pair. | Yes |
  | accessKeySecret | The AccessKey Secret used to authenticate OSS API calls. | Yes |
  | bucket-name.endpoint | The OSS bucket access address, consisting of the bucket name and the endpoint for the region where the bucket is deployed. | Yes |

- Check the job output. The migration is complete when the output shows all of the following:
  - map 100% reduce 0% (DistCp is a map-only job, so the reduce progress stays at 0%)
  - Job job_xxx completed successfully
  - BYTESCOPIED equals BYTESEXPECTED

  Example output:
    20/09/29 12:23:59 INFO mapreduce.Job:  map 100% reduce 0%
    20/09/29 12:23:59 INFO mapreduce.Job: Job job_1601195105349_0015 completed successfully
    20/09/29 12:23:59 INFO mapreduce.Job: Counters: 38
        File System Counters
            FILE: Number of bytes read=0
            FILE: Number of bytes written=122343
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=470
            HDFS: Number of bytes written=47047709
            HDFS: Number of read operations=15
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=4
            OSS: Number of bytes read=0
            OSS: Number of bytes written=0
            OSS: Number of read operations=0
            OSS: Number of large read operations=0
            OSS: Number of write operations=0
        Job Counters
            Launched map tasks=1
            Other local map tasks=1
            Total time spent by all maps in occupied slots (ms)=5194
            Total time spent by all reduces in occupied slots (ms)=0
            Total time spent by all map tasks (ms)=5194
            Total vcore-milliseconds taken by all map tasks=5194
            Total megabyte-milliseconds taken by all map tasks=5318656
        Map-Reduce Framework
            Map input records=1
            Map output records=0
            Input split bytes=132
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=64
            CPU time spent (ms)=2210
            Physical memory (bytes) snapshot=222294016
            Virtual memory (bytes) snapshot=2672074752
            Total committed heap usage (bytes)=110100480
        File Input Format Counters
            Bytes Read=338
        File Output Format Counters
            Bytes Written=0
        org.apache.hadoop.tools.mapred.CopyMapper$Counter
            BYTESCOPIED=47047709
            BYTESEXPECTED=47047709
            COPY=1
    20/09/29 12:23:59 INFO common.AbstractJindoFileSystem: Read total statistics: oss read average -1 us, cache read average -1 us, read oss percent 0%
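Checking the counters by eye is error-prone for long logs. The following is a minimal sketch of an automated check, assuming the DistCp output has been saved to a file. The function name distcp_log_ok is hypothetical. It looks only for the "completed successfully" line and for matching BYTESCOPIED and BYTESEXPECTED counters, as they appear in the example output.

```shell
# Hypothetical helper: succeed only if a saved DistCp log reports success
# and the copied byte count equals the expected byte count.
# Usage: distcp_log_ok LOGFILE
distcp_log_ok() (
  log="$1"

  # The job must report success.
  grep -q 'completed successfully' "${log}" || return 1

  # Extract the two CopyMapper counters (the last occurrence wins).
  copied=$(sed -n 's/.*BYTESCOPIED=\([0-9][0-9]*\).*/\1/p' "${log}" | tail -n 1)
  expected=$(sed -n 's/.*BYTESEXPECTED=\([0-9][0-9]*\).*/\1/p' "${log}" | tail -n 1)

  # Both counters must be present and equal.
  [ -n "${copied}" ] && [ "${copied}" = "${expected}" ]
)
```

One way to produce the log is to append 2>&1 | tee distcp.log to the DistCp command (distcp.log is a hypothetical file name), then run distcp_log_ok distcp.log and inspect its exit status.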
Verify the migration
Check the size of the data that was migrated to the Lindorm file database:
${HADOOP_HOME}/bin/hadoop fs -du -s -h hdfs://<instance-id>/
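The sizes printed with -h are rounded, so an exact comparison is easier with raw byte counts. The sketch below assumes that hadoop fs -du -s without -h prints the byte count as the first field of its output, and that the two paths being compared cover exactly the same data. The function names du_bytes and same_size are hypothetical.

```shell
# Hypothetical helpers for comparing two "hadoop fs -du -s" results.

# Print the first whitespace-separated field of the first line on stdin.
du_bytes() {
  awk 'NR == 1 { print $1 }'
}

# Usage: same_size SRC_DU_OUTPUT DST_DU_OUTPUT
# Succeeds only if both outputs start with the same non-empty byte count.
same_size() (
  src=$(printf '%s\n' "$1" | du_bytes)
  dst=$(printf '%s\n' "$2" | du_bytes)
  [ -n "${src}" ] && [ "${src}" = "${dst}" ]
)

# Example use (placeholders as in the commands above; not executed here):
#   src_out=$("${HADOOP_HOME}/bin/hadoop" fs -du -s "oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/<path>")
#   dst_out=$("${HADOOP_HOME}/bin/hadoop" fs -du -s "hdfs://<instance-id>/<path>")
#   same_size "${src_out}" "${dst_out}" && echo "sizes match"
```

Comparing only the first field keeps the check independent of any additional columns that a given Hadoop version prints after the size.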