Migrate data from OSS to LindormDFS

Prerequisites

Before you begin, ensure that you have:

Activated the file engine for your Lindorm instance. For more information, see Activate the file engine service.
A Hadoop cluster running Hadoop 2.7.3 or later, configured to access the Lindorm file engine. For more information, see Use open source HDFS clients to connect to and use LindormDFS.
JDK 1.8 or later installed on all nodes of the Hadoop cluster.
JindoFS SDK installed on all nodes of the Hadoop cluster. See Install the JindoFS SDK below.

Install the JindoFS SDK

Install the JindoFS SDK on every node in your Hadoop cluster before running the migration.

Download jindofs-sdk.jar from the JindoFS SDK repository, then copy it to the Hadoop library directory:
```
cp ./jindofs-sdk-*.jar ${HADOOP_HOME}/share/hadoop/hdfs/lib/
```
Add the following environment variable to /etc/profile on each node:
```
export B2SDK_CONF_DIR=/etc/jindofs-sdk-conf
```

Create the JindoFS SDK configuration file at /etc/jindofs-sdk-conf/bigboot.cfg:

[bigboot]
logger.dir=/tmp/bigboot-log

[bigboot-client]
client.oss.retry=5
client.oss.upload.threads=4
client.oss.upload.queue.size=5
client.oss.upload.max.parallelism=16
client.oss.timeout.millisecond=30000
client.oss.connection.timeout.millisecond=4000
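Because the same file has to exist on every node, it can be convenient to generate it from a script. The following is a minimal sketch, not part of the JindoFS SDK: the function name `write_bigboot_cfg` is an assumption made for illustration, and the settings are copied from the file shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the JindoFS SDK): writes bigboot.cfg
# with the settings listed above into the directory passed as $1.
write_bigboot_cfg() {
  local conf_dir="${1:?usage: write_bigboot_cfg <conf-dir>}"
  mkdir -p "${conf_dir}" || return 1
  # Quoted heredoc delimiter: the content is written literally, without expansion.
  cat > "${conf_dir}/bigboot.cfg" <<'EOF'
[bigboot]
logger.dir=/tmp/bigboot-log

[bigboot-client]
client.oss.retry=5
client.oss.upload.threads=4
client.oss.upload.queue.size=5
client.oss.upload.max.parallelism=16
client.oss.timeout.millisecond=30000
client.oss.connection.timeout.millisecond=4000
EOF
}

# Example (writing under /etc usually requires root):
#   write_bigboot_cfg /etc/jindofs-sdk-conf
```

The target directory should match the value of `B2SDK_CONF_DIR` set in /etc/profile.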

Load the environment variable:
```
source /etc/profile
```
Verify that your OSS bucket is accessible from the Hadoop cluster:
```
${HADOOP_HOME}/bin/hadoop fs -ls oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/
```
If the command returns the bucket contents without errors, the SDK is configured correctly.
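The `oss://` address in the command above embeds the AccessKey pair, so typing it repeatedly is error-prone and leaves the secret in the shell history. One possible approach is to assemble the address from environment variables. This is a sketch under assumptions: `oss_uri`, `OSS_ACCESS_KEY_ID`, and `OSS_ACCESS_KEY_SECRET` are names chosen for illustration and are not read by Hadoop or the JindoFS SDK.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the oss:// address used in the commands above.
# OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET are assumed variable names;
# they are only read by this function, not by Hadoop or the JindoFS SDK.
oss_uri() {
  local bucket="$1" endpoint="$2" path="${3:-}"
  : "${OSS_ACCESS_KEY_ID:?set OSS_ACCESS_KEY_ID first}"
  : "${OSS_ACCESS_KEY_SECRET:?set OSS_ACCESS_KEY_SECRET first}"
  # Strip one leading slash from the path so the result has a single "/".
  printf 'oss://%s:%s@%s.%s/%s\n' \
    "${OSS_ACCESS_KEY_ID}" "${OSS_ACCESS_KEY_SECRET}" \
    "${bucket}" "${endpoint}" "${path#/}"
}

# Example:
#   "${HADOOP_HOME}/bin/hadoop" fs -ls "$(oss_uri <bucket-name> <endpoint>)"
```

The expanded address is still passed to `hadoop` as a command-line argument, so it remains visible in the process list while the command runs.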

Migrate data from the OSS bucket

Check the size of the data to migrate:

${HADOOP_HOME}/bin/hadoop fs -du -h oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data

Run DistCp to start a MapReduce job that copies the data to LindormDFS:

${HADOOP_HOME}/bin/hadoop distcp \
  oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data.txt \
  hdfs://<instance-id>/

Replace <instance-id> with your Lindorm instance ID.

The following table describes the parameters:

| Parameter | Description | Required |
| --- | --- | --- |
| `accessKeyId` | The AccessKey ID used to authenticate OSS API calls. To get your AccessKey pair, see Create an AccessKey pair. | Yes |
| `accessKeySecret` | The AccessKey Secret used to authenticate OSS API calls. | Yes |
| `bucket-name.endpoint` | The OSS bucket access address, consisting of the bucket name and the endpoint of the region where the bucket is deployed. | Yes |

Check the job output. The migration is complete when the output shows:

map 100% reduce 0%
Job job_xxx completed successfully
BYTESCOPIED equals BYTESEXPECTED

Example output:

20/09/29 12:23:59 INFO mapreduce.Job:  map 100% reduce 0%
20/09/29 12:23:59 INFO mapreduce.Job: Job job_1601195105349_0015 completed successfully
20/09/29 12:23:59 INFO mapreduce.Job: Counters: 38
 File System Counters
  FILE: Number of bytes read=0
  FILE: Number of bytes written=122343
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
  HDFS: Number of bytes read=470
  HDFS: Number of bytes written=47047709
  HDFS: Number of read operations=15
  HDFS: Number of large read operations=0
  HDFS: Number of write operations=4
  OSS: Number of bytes read=0
  OSS: Number of bytes written=0
  OSS: Number of read operations=0
  OSS: Number of large read operations=0
  OSS: Number of write operations=0
 Job Counters
  Launched map tasks=1
  Other local map tasks=1
  Total time spent by all maps in occupied slots (ms)=5194
  Total time spent by all reduces in occupied slots (ms)=0
  Total time spent by all map tasks (ms)=5194
  Total vcore-milliseconds taken by all map tasks=5194
  Total megabyte-milliseconds taken by all map tasks=5318656
 Map-Reduce Framework
  Map input records=1
  Map output records=0
  Input split bytes=132
  Spilled Records=0
  Failed Shuffles=0
  Merged Map outputs=0
  GC time elapsed (ms)=64
  CPU time spent (ms)=2210
  Physical memory (bytes) snapshot=222294016
  Virtual memory (bytes) snapshot=2672074752
  Total committed heap usage (bytes)=110100480
 File Input Format Counters
  Bytes Read=338
 File Output Format Counters
  Bytes Written=0
 org.apache.hadoop.tools.mapred.CopyMapper$Counter
  BYTESCOPIED=47047709
  BYTESEXPECTED=47047709
  COPY=1
20/09/29 12:23:59 INFO common.AbstractJindoFileSystem: Read total statistics: oss read average -1 us, cache read average -1 us, read oss percent 0%
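The completion criteria above can be checked mechanically if the DistCp output is saved to a file, for example by appending `2>&1 | tee distcp.log` to the command. The following is a minimal sketch; `check_distcp_log` and the log file name are assumptions for illustration, and the function relies only on the log lines shown in the example output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a saved DistCp log ($1) and succeeds only if
# the job reports success and BYTESCOPIED equals BYTESEXPECTED.
check_distcp_log() {
  local log="$1" copied expected
  # The job must have finished successfully.
  grep -q 'completed successfully' "${log}" || return 1
  # Extract the numeric values of the two DistCp counters.
  copied="$(sed -n 's/^[[:space:]]*BYTESCOPIED=\([0-9][0-9]*\).*/\1/p' "${log}" | tail -n 1)"
  expected="$(sed -n 's/^[[:space:]]*BYTESEXPECTED=\([0-9][0-9]*\).*/\1/p' "${log}" | tail -n 1)"
  # Fail if a counter is missing or the values differ.
  [ -n "${copied}" ] && [ "${copied}" = "${expected}" ]
}

# Example:
#   check_distcp_log distcp.log && echo "migration complete"
```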

Verify the migration

Check the size of the migrated data in LindormDFS and compare it with the size reported for the OSS source before the migration:

${HADOOP_HOME}/bin/hadoop fs -du -s -h hdfs://<instance-id>/
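To compare the source and destination sizes exactly, the `-h` option can be omitted so that `hadoop fs -du -s` prints the size in bytes as the first field. The sketch below assumes that output layout; `du_bytes` and `same_size` are hypothetical helper names, and the paths in the example follow the placeholders used earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the first field (size in bytes) of the first
# line of `hadoop fs -du -s <path>` output read from stdin.
du_bytes() {
  awk 'NR == 1 { print $1 }'
}

# Hypothetical helper: succeeds if both byte counts are present and equal.
same_size() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example (no -h, so sizes are printed in bytes):
#   src="$("${HADOOP_HOME}/bin/hadoop" fs -du -s "oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data.txt" | du_bytes)"
#   dst="$("${HADOOP_HOME}/bin/hadoop" fs -du -s "hdfs://<instance-id>/test_data.txt" | du_bytes)"
#   same_size "${src}" "${dst}" && echo "sizes match"
```

Matching sizes indicate that the full byte count arrived, but they do not verify file contents.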