Migrate data from an OSS bucket to a Lindorm file database

Last Updated: Jul 09, 2021

This topic describes how to import data from an Object Storage Service (OSS) bucket to a database that is powered by the ApsaraDB for Lindorm (Lindorm) file engine.

Before you begin

  1. Activate the file engine for your Lindorm instance. For more information, see Activate the file engine service.

  2. Create a Hadoop cluster. We recommend that you use Hadoop 2.7.3 or later. In this example, Apache Hadoop 2.7.3 is used. You must modify the Hadoop configuration so that the cluster can access the file engine. For more information, see Use an open source HDFS client to access the file engine.

  3. Install Java Development Kit (JDK) on all the nodes of the Hadoop cluster. The JDK version must be 1.8 or later.

  4. Install the OSS client JindoFS SDK on all the nodes of the Hadoop cluster. For more information about JindoFS SDK, see JindoFS SDK.

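    • Copy the JindoFS SDK JAR package from the current directory to the HDFS library directory of Hadoop.

    cp ./jindofs-sdk-*.jar ${HADOOP_HOME}/share/hadoop/hdfs/lib/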
    • Create a JindoFS SDK configuration file for each node of the Hadoop cluster.

      • Add the following environment variable to the /etc/profile file.

      export B2SDK_CONF_DIR=/etc/jindofs-sdk-conf
      • Create a JindoFS SDK configuration file: /etc/jindofs-sdk-conf/bigboot.cfg.

      [bigboot]
      logger.dir=/tmp/bigboot-log
      [bigboot-client]
      client.oss.retry=5
      client.oss.upload.threads=4
      client.oss.upload.queue.size=5
      client.oss.upload.max.parallelism=16
      client.oss.timeout.millisecond=30000
      client.oss.connection.timeout.millisecond=4000
      • Load the environment variable so that it takes effect in the current shell session.

      source /etc/profile
      • Verify that your OSS bucket can be accessed in the Hadoop cluster.

      ${HADOOP_HOME}/bin/hadoop fs -ls oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/
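The per-node configuration steps above can be collected into a small script. The following is a minimal sketch, not an official installer: the function name write_jindofs_conf and the choice to pass the configuration directory as an argument are assumptions made for illustration. The section names, keys, and values are the ones listed in this topic.

```shell
#!/bin/sh
# Sketch: create the JindoFS SDK configuration file described in this topic.
# write_jindofs_conf is a hypothetical helper name, not part of JindoFS SDK.

write_jindofs_conf() {
  # $1: configuration directory (this topic uses /etc/jindofs-sdk-conf)
  conf_dir="$1"
  mkdir -p "${conf_dir}" || return 1
  # Quoted heredoc delimiter: the content is written literally, without expansion.
  cat > "${conf_dir}/bigboot.cfg" <<'EOF'
[bigboot]
logger.dir=/tmp/bigboot-log
[bigboot-client]
client.oss.retry=5
client.oss.upload.threads=4
client.oss.upload.queue.size=5
client.oss.upload.max.parallelism=16
client.oss.timeout.millisecond=30000
client.oss.connection.timeout.millisecond=4000
EOF
}
```

On each node, you could call write_jindofs_conf /etc/jindofs-sdk-conf with sufficient privileges, append the export B2SDK_CONF_DIR=/etc/jindofs-sdk-conf line to /etc/profile, and then run source /etc/profile.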

Migrate data from an OSS bucket to a Lindorm file database

  1. Determine the size of the data that needs to be migrated.

    ${HADOOP_HOME}/bin/hadoop fs -du -h oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data
  2. Use the Hadoop distributed copy (DistCp) tool to start a MapReduce task to migrate the data to the file database.

    ${HADOOP_HOME}/bin/hadoop distcp  \
    oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data.txt \
    hdfs://${Instance ID}/

    Replace ${Instance ID} with your Lindorm instance ID.

    Configure the parameters based on the description in the following table.

    | Parameter | Description |
    | --- | --- |
    | accessKeyId, accessKeySecret | The AccessKey ID and AccessKey secret of the AccessKey pair that is required when you call the OSS API. For information about how to obtain your AccessKey pair, see Create an AccessKey pair. |
    | bucket-name.endpoint | The access address of the OSS bucket. The address consists of the bucket name and the endpoint that corresponds to the region where the bucket is deployed. |

  3. View the migration result after the task is completed.

    If the output reports that the job completed successfully and the BYTESCOPIED value equals the BYTESEXPECTED value, as in the following example, the data is migrated:

    20/09/29 12:23:59 INFO mapreduce.Job:  map 100% reduce 0%
    20/09/29 12:23:59 INFO mapreduce.Job: Job job_1601195105349_0015 completed successfully
    20/09/29 12:23:59 INFO mapreduce.Job: Counters: 38
     File System Counters
      FILE: Number of bytes read=0
      FILE: Number of bytes written=122343
      FILE: Number of read operations=0
      FILE: Number of large read operations=0
      FILE: Number of write operations=0
      HDFS: Number of bytes read=470
      HDFS: Number of bytes written=47047709
      HDFS: Number of read operations=15
      HDFS: Number of large read operations=0
      HDFS: Number of write operations=4
      OSS: Number of bytes read=0
      OSS: Number of bytes written=0
      OSS: Number of read operations=0
      OSS: Number of large read operations=0
      OSS: Number of write operations=0
     Job Counters
      Launched map tasks=1
      Other local map tasks=1
      Total time spent by all maps in occupied slots (ms)=5194
      Total time spent by all reduces in occupied slots (ms)=0
      Total time spent by all map tasks (ms)=5194
      Total vcore-milliseconds taken by all map tasks=5194
      Total megabyte-milliseconds taken by all map tasks=5318656
     Map-Reduce Framework
      Map input records=1
      Map output records=0
      Input split bytes=132
      Spilled Records=0
      Failed Shuffles=0
      Merged Map outputs=0
      GC time elapsed (ms)=64
      CPU time spent (ms)=2210
      Physical memory (bytes) snapshot=222294016
      Virtual memory (bytes) snapshot=2672074752
      Total committed heap usage (bytes)=110100480
     File Input Format Counters
      Bytes Read=338
     File Output Format Counters
      Bytes Written=0
     org.apache.hadoop.tools.mapred.CopyMapper$Counter
      BYTESCOPIED=47047709
      BYTESEXPECTED=47047709
      COPY=1
    20/09/29 12:23:59 INFO common.AbstractJindoFileSystem: Read total statistics: oss read average -1 us, cache read average -1 us, read oss percent 0%
  4. Verify the migration result.

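    Check the size of the data that is migrated to the Lindorm file database. If the target directory contained no other data before the migration, the size should match the source size that you determined in step 1.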

    ${HADOOP_HOME}/bin/hadoop fs -du -s -h hdfs://${Instance ID}/
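The check in step 3 can also be scripted. The sketch below assumes that the DistCp console output, including both standard output and standard error, was saved to a file. The helper names counter_value and distcp_copy_ok and the log file name are hypothetical and are not part of Hadoop. The script reads the BYTESCOPIED and BYTESEXPECTED counters shown in the sample output and succeeds only if the job reports that it completed successfully and the two counters match. If DistCp skips files that already exist at the target, the counters may differ, so treat this as a starting point rather than a complete validation.

```shell
#!/bin/sh
# Sketch: validate saved DistCp output by using the counters from the sample log.
# counter_value and distcp_copy_ok are hypothetical helpers, not Hadoop commands.

counter_value() {
  # $1: counter name, $2: log file. Prints the number that follows "NAME=".
  sed -n "s/^[[:space:]]*$1=\([0-9][0-9]*\)[[:space:]]*$/\1/p" "$2" | tail -n 1
}

distcp_copy_ok() {
  # $1: file that contains the saved DistCp console output.
  log_file="$1"
  # The job must report success.
  grep -q 'completed successfully' "${log_file}" || return 1
  copied="$(counter_value BYTESCOPIED "${log_file}")"
  expected="$(counter_value BYTESEXPECTED "${log_file}")"
  # Both counters must be present and equal.
  [ -n "${copied}" ] && [ "${copied}" = "${expected}" ]
}
```

For example, you could append 2>&1 | tee distcp.log to the hadoop distcp command in step 2 and then run distcp_copy_ok distcp.log && echo "copy verified".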