Lindorm: Use LTS to import offline data to Lindorm in batches

Last Updated: Aug 28, 2022

This topic describes how to use Lindorm Tunnel Service (LTS) to import offline data to Lindorm in batches.

Benefits

You can use LTS to import files from offline storage services such as Alibaba Cloud MaxCompute, Apache Hive, and AWS S3 to Lindorm in bulkload mode. Compared with common import tools such as open source DataX, the bulkload feature of LTS provides the following benefits:

  • High efficiency: Data is directly written to LindormDFS. In this mode, the write-ahead logging (WAL), flush, and remote procedure call (RPC) forwarding operations that are required for online writes are skipped. This reduces resource usage and makes data import up to 10 times more efficient than writing data through online API calls. LTS also provides multiple performance optimization methods.

  • High stability: The system performs data conversion, encoding, and compression by using offline resources, so these operations consume few online service resources. In addition, the system intelligently monitors the properties of Lindorm tables, such as block encoding and data compression, partition alignment, localization rate, and number of files. This way, you can import large amounts of data while keeping online queries stable and efficient. A query can still be completed within a few milliseconds.

  • High controllability: Five years of use in Alibaba Group have shown that the data import speed is linearly controllable based on distributed scheduling.

Before you begin

  1. An LTS instance is created in the same VPC in which your Lindorm instance is deployed.

  2. The port for Hadoop Distributed File System (HDFS) is open. The LTS file import feature is available only if this port is open. If the port is not open for your Lindorm instance, contact Lindorm technical support to open it.

  3. Create configuration files. Configuration files must be in the JSON format and include read configurations and write configurations. The following section provides sample read configurations for an AWS S3 data source and an Alibaba Cloud MaxCompute data source, and a sample write configuration for the destination Lindorm instance. A quick way to check that the files are valid JSON is shown after the samples.

    • Sample read configurations

      • Sample read configuration for an AWS S3 data source

        {
          "file.type":"csv",
          "accessKey": "test",
          "secret_key":"test",
          "endPoint":"http://localhost:8001",
          "bucket":"testbucket",
          "region":"test",
          "fileKey":"",
          "zipped":"false",
          "column.separator":",",
          "column": [
            "pk|String",
            "value|String"
          ],
          "pks":"pk|String"
        }
      • Sample read configuration for an Alibaba Cloud MaxCompute data source

        {
          "accessId":"****",
          "accessKey":"****",
          "project":"****",
          "odpsServer":"http://service.cn-shanghai.maxcompute.aliyun-inc.com/api",
          "tunnelEndPoint":"http://dt.cn-shanghai.maxcompute.aliyun-inc.com",
          "partition": [
            "ds=???"
          ],
          "column": [
            "pk",
            "value"
          ],
          "table":"xx"
        }

        Note

        For information about how to configure the odpsServer parameter and tunnelEndPoint parameter, see Endpoints.

    • Sample write configuration for the destination Lindorm instance

      Note

      Write configurations are complex. You can contact Alibaba Cloud technical support to obtain information about how to create a write configuration file. For more information, see Expert support.

      {
        "configuration": [
          {
            "bulkerMode":"himporter2",
            "himporter.ipc.call.timeout": "600000",
            "zookeeper.znode.parent":"/test",
            "lindorm.zookeeper.quorum":"test.tbsite.net:2181",
            "lindorm.client.seedserver":"test.tbsite.net:30020",
            "lindorm.client.namespace":"test",
            "lindorm.client.username":"test",
            "lindorm.client.password":"test",
            "fs.defaultFS":"hdfs://test",
            "dfs.nameservices":"test",
            "dfs.client.failover.proxy.provider.test":"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
            "dfs.ha.namenodes.test":"nn1,nn2",
            "dfs.namenode.rpc-address.test.nn1":"test.tbsite.net:8020",
            "dfs.namenode.rpc-address.test.nn2":"test.tbsite.net:8020",
            "importer.job.concurrency": "15"
          }
        ],
        "columns": [
          "pk",
          "value"
        ],
        "nullMode":"skip",
        "namespace":"test",
        "writeBufferSize":2097152,
        "encoding":"utf-8",
        "lindormTable":"test_table"
      }
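
    Before you submit a job, you can verify that the configuration files are valid JSON. The following is a minimal sketch; the file names reader.json and writer.json are assumptions for illustration and hold the read and write configurations shown above.

      # Print a parse error for invalid JSON; confirm the files otherwise.
      python -m json.tool reader.json > /dev/null && echo "reader.json is valid JSON"
      python -m json.tool writer.json > /dev/null && echo "writer.json is valid JSON"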

Task configuration methods and prerequisites

The bulkload feature provided by LTS supports two job configuration methods. Both methods require the same configuration files.

  • Create a job in the LTS console

    • Scenario: debugging.

    • Prerequisite: You have logged on to the web user interface (UI) of the LTS instance. For more information, see Create a synchronization task.

  • Create a job by calling the API

    • Scenario: production.

    • Prerequisite: The LTS instance can be accessed over the Internet. To enable Internet access for the LTS instance, submit a ticket.

Create a job in the LTS console

  1. In the left-side navigation pane of the LTS web UI, choose Data Import > Bulkload.

  2. Click create new job. On the page that appears, configure the following parameters.

    • Source: the source data source, such as aws_s3.

    • Target: the destination data source, such as lindorm.

    • Reader Config: the configuration of the reader plug-in, in the JSON format. For sample code, see the Before you begin section in this topic.

    • Writer Config: the configuration of the writer plug-in, in the JSON format. For sample code, see the Before you begin section in this topic.

  3. Click Submit.

Create a job by calling the API

Note

LTS does not provide job scheduling capabilities. You must use your client to call the create API operation to create a job that imports data.

  1. Obtain the API endpoint of the LTS instance. HTTP requests that are used to access LTS must be sent over the Internet. In the browser address bar, change the access address of the LTS web UI from https://bds-****/bds to https://bds-****/api and press Enter.

  2. Create a job.

    1. On the Headers tab, set the value of the Content-Type field to application/x-www-form-urlencoded.

    2. On the Body tab, configure the following parameters.

      • dataSource: set the value to odpsreader.

      • readerConfStr: the configuration of the reader plug-in, in the JSON format. For sample code, see the Before you begin section in this topic.

      • dataTarget: set the value to lindormbulkwriter.

      • writerConfStr: the configuration of the writer plug-in, in the JSON format. For sample code, see the Before you begin section in this topic.

    3. Execute the following sample code to send a POST request. The string in the response body is the ID of the job.

      curl -X POST \
        http://yourhost:12311/hi2/create \
        -H 'content-type:application/x-www-form-urlencoded' \
        -d 'your config string here'
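
      For example, a complete request can look like the following sketch. The file names reader.json and writer.json are assumptions for illustration and hold the read and write configurations from the Before you begin section; the --data-urlencode option form-encodes each field.

      # Submit a bulkload job with the four form fields described above.
      curl -X POST \
        http://yourhost:12311/hi2/create \
        -H 'content-type:application/x-www-form-urlencoded' \
        --data-urlencode 'dataSource=odpsreader' \
        --data-urlencode "readerConfStr=$(cat reader.json)" \
        --data-urlencode 'dataTarget=lindormbulkwriter' \
        --data-urlencode "writerConfStr=$(cat writer.json)"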
  3. Query the status of the job.

    1. On the Headers tab, set the value of the Content-Type field to application/json.

    2. Execute the following sample code to send a GET request.

      curl -X GET \
        http://yourhost:12311/hi2/<Job ID>/detail \
        -H 'content-type: application/json'

      The string in the response body describes the details of the job. Example:

      {
        "jobId":"0c84e8ec-6bf0-4e70-8b6a-f92beddb****",
        "status":{
          "status":"LOADED",
          "msg":"job 0c84e8ec-6bf0-4e70-8b6a-f92beddb**** is loaded",
          "progressRate":1.0,
          "version":1
        },
        "version":0
      }

      The following list describes the valid values of the status parameter in the response body. A polling sketch follows the list.

      • INIT: the job is created and is waiting to be scheduled.

      • PREPARING: the required data is being sorted in MaxCompute.

      • PREPARE_FAILED: MaxCompute failed to sort the required data because the job configuration is incorrect.

      • LOADING: the required data is being written to the destination Lindorm instance.

      • LOAD_FAILED: the system failed to write the data.

      • LOADED: the job is complete.

      • ABORTING: the job is being terminated.

      • ABORT_FAILED: the system failed to terminate the job.

      • ABORTED: the job is terminated.
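
      To wait for a job to finish, you can poll the detail endpoint until the status reaches a terminal state. The following is a minimal sketch, assuming that the jq command-line JSON processor is installed; replace <Job ID> with the ID returned by the create request.

      # Poll the job status every 30 seconds until it reaches a terminal state.
      JOB_ID="<Job ID>"
      while true; do
        STATUS=$(curl -s http://yourhost:12311/hi2/${JOB_ID}/detail \
          -H 'content-type: application/json' | jq -r '.status.status')
        echo "current status: ${STATUS}"
        case "${STATUS}" in
          LOADED|LOAD_FAILED|PREPARE_FAILED|ABORTED|ABORT_FAILED) break ;;
        esac
        sleep 30
      done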

  4. Query the statistics of the job. Execute the following sample code to send a request.

    curl -X GET \
      http://yourhost:12311/hi2/<Job ID>/statics \
      -H 'content-type: application/json'

    Note

    • This API operation queries data from the master node by using ZooKeeper. Call this API operation at a low frequency.

    • In the response body, the row parameter indicates the number of data rows, and the taskNumber parameter indicates the total number of jobs that are running.
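
    For example, the following sketch extracts both fields, again assuming that jq is installed and that the fields appear at the top level of the response body.

      # Print the row count and the number of running tasks.
      curl -s http://yourhost:12311/hi2/<Job ID>/statics \
        -H 'content-type: application/json' | jq '{row, taskNumber}'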

  5. Stop the job.

    1. On the Headers tab, set the value of the Content-Type field to application/json.

    2. Execute the following sample code to send a POST request.

      curl -X POST \
        http://yourhost:61031/hi2/<Job ID>/abort \
        -H 'content-type: application/json'

      The string in the response body is the request result. Example:

      {
        "message":"SUCCESS"
      }