Lindorm: Import data in batches

Last Updated: Aug 12, 2025

Lindorm provides the bulkload feature to import data quickly and reliably. This topic describes how to import data in batches.

Features

The bulkload feature loads data files in bypass mode. It does not use the data API write path or the computing resources of your instance. Compared to importing data using an API, the bulkload feature provides the following advantages:

  • Imports data more than 10 times faster than API-based import.

  • Ensures stable online services because it does not use online service resources.

  • Offers flexible resource usage by separating online and offline resources.

  • Supports importing data from various data sources, such as CSV, ORC, Parquet, and MaxCompute.

  • Easy to use. You do not need to write any code to load data in batches in bypass mode.

  • Cost-effective. Lindorm Tunnel Service (LTS) uses the cloud-native elastic capability of serverless Spark to provide computing resources for bulkloading. Resources are scaled as needed and are pay-as-you-go. You do not need to configure computing resources for long periods, which reduces costs.

Prerequisites

Supported data sources

Source data source             Destination data source
MaxCompute Table               LindormTable
HDFS CSV or OSS CSV            LindormTable
HDFS Parquet or OSS Parquet    LindormTable
HDFS ORC or OSS ORC            LindormTable

Submission methods

You can submit a job to quickly import data in one of the following ways:

Submit a job using the LTS console

  1. Log on to the LTS console. For more information, see Activate and log on to LTS.

  2. In the navigation pane on the left, choose Data Source Management > Add Data Source to add the source data source and the destination Lindorm wide table data source.

  3. In the navigation pane on the left, choose Import To Lindorm/HBase > Universal Import.

    Note
    • For LTS versions earlier than 3.8.12.4.3, choose Import To Lindorm/HBase > Bulkload.

    • To view the LTS version, go to the Configuration Information section on the Instance Details page in the Lindorm console.

  4. Click Create Job and configure the following parameters.

    Select Data Source

    • Source Data Source: Select the ODPS or HDFS data source that you added.

    • Destination Data Source: Select the Lindorm wide table data source that you added.

    Plugin Configuration

    Reader Configuration

    • If the source is an ODPS data source, configure the following reader parameters:

      • table: The name of the ODPS table.

      • column: The names of the ODPS columns to import.

      • partition: Leave this empty for a non-partitioned table. For a partitioned table, configure the partition information.

      • numPartitions: The degree of parallelism for reading data.

    • If the source is a CSV file in a Hadoop Distributed File System (HDFS), configure the following reader parameters:

      • filePath: The directory where the CSV file is located.

      • header: Specifies whether the CSV file contains a header row.

      • delimiter: The separator used in the CSV file.

      • column: The column names and their corresponding types in the CSV file.

    • If the source is a Parquet file in an HDFS, configure the following reader parameters:

      • filePath: The directory where the Parquet file is located.

      • column: The column names in the Parquet file.

    Note

    For configuration examples, see Configuration examples.

    Writer Configuration

    • namespace: The namespace of the Lindorm wide table.

    • lindormTable: The name of the Lindorm wide table.

    • compression: The compression algorithm. Currently, only zstd is supported. To disable compression, set this to none.

    • columns: Configure this parameter based on the destination table type.

      • If you import data to a Lindorm wide table, specify the column names of the Lindorm SQL wide table. The columns must correspond to the columns in the reader configuration.

      • If you import data to a Lindorm table that is compatible with HBase, specify the standard column names of the HBase table. The columns must correspond to the columns in the reader configuration.

    • timestamp: The timestamp of the data in the Lindorm wide table. The following types are supported:

      • A Long value with 13 digits (a Unix timestamp in milliseconds).

      • A String type in the yyyy-MM-dd HH:mm:ss or yyyy-MM-dd HH:mm:ss SSS format.

    Note

    For configuration examples, see Configuration examples.
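
    For example, for the timestamp parameter, the Long value 1656640800000 and the String value 2022-07-01 10:00:00 identify the same write time, assuming the UTC+8 time zone.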

    Job Running Parameter Configuration

    • Spark Driver Specification: Select the Spark Driver specification.

    • Spark Executor Specification: Select the Spark Executor specification.

    • Number Of Executors: Enter the number of executors.

    • Spark Configuration: Enter the Spark configuration. This parameter is optional.

  5. Click Create.

  6. On the Bulkload page, you can view the details of the job:

    • Click the Job Name to view the Spark UI of the job.

    • Click Details to view the execution logs of the job.


    Note

    If data is evenly distributed across partitions in the destination Lindorm wide table, it takes about 1 hour to import 100 GB of data with a 4:1 compression ratio. The actual time may vary.

Configuration examples

Reader plugin configuration examples

  • Example of a reader configuration for an ODPS data source.

    {
      "table": "test",        // The name of the ODPS table.
      "column": [             // The names of the ODPS columns to import.
        "id",
        "intcol",
        "doublecol",
        "stringcol",
        "string1col",
        "decimalcol"
      ],
      "partition": [
        "pt=1"                // The partition information. Leave this empty for a non-partitioned table.
      ],
      "numPartitions": 10     // The degree of parallelism for reading data.
    }
  • Example of a reader configuration for a CSV file in an HDFS data source.

    {
      "filePath":"csv/",
      "header": false,
      "delimiter": ",",
      "column": [
        "id|string",
        "intcol|int",
        "doublecol|double",
        "stringcol|string",
        "string1col|string",
        "decimalcol|decimal"
      ]
    }
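
    With header set to false and a comma delimiter, the files under the csv/ directory are expected to contain rows that match this column definition, for example (hypothetical sample values):

      row001,100,1.5,foo,bar,12.34
      row002,200,2.5,baz,qux,56.78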
  • Example of a reader configuration for a Parquet file in an HDFS data source.

    {
      "filePath":"parquet/",
      "column": [   // Column names in the Parquet file.
        "id",
        "intcol",
        "doublecol",
        "stringcol",
        "string1col",
        "decimalcol"
      ]
    }

Writer plugin configuration examples

  • Example of a writer configuration for importing data to a Lindorm SQL table.

    {
      "namespace": "default",
      "lindormTable": "xxx",
      "compression": "zstd",
      "timestamp": "2022-07-01 10:00:00",
      "columns": [
        "id",
        "intcol",
        "doublecol",
        "stringcol",
        "string1col",
        "decimalcol"
      ]
    }
  • Example of a writer configuration for importing data to a Lindorm table that is compatible with HBase.

    {
      "namespace": "default",
      "lindormTable": "xxx",
      "compression":"zstd",
      "timestamp":"2022-07-01 10:00:00",
      "columns": [
        "ROW||String",    // ROW represents the rowkey, and String represents the type.
        "f:intcol||Int",  // Format: column family:column name||column type.
        "f:doublecol||Double",
        "f:stringcol||String",
        "f:string1col||String",
        "f:decimalcol||Decimal"
      ]
    }

Submit a job using an API operation

Submit a job

  • API operation (POST): http://{LTSMaster}:12311/pro/proc/bulkload/create. Replace {LTSMaster} with the master hostname of your Lindorm instance. You can obtain the hostname from the Basic Information section on the Cluster Information page of the LTS console for your Lindorm instance.

  • Parameters:

    • src: The name of the source data source.

    • dst: The name of the destination data source.

    • readerConfig: The reader plugin configuration in JSON format. For configuration examples, see Configuration examples.

    • writerConfig: The writer plugin configuration in JSON format. For configuration examples, see Configuration examples.

    • driverSpec: The specification of the driver. Valid values: small, medium, large, and xlarge. We recommend that you set this parameter to large.

    • instances: The number of executors.

    • fileType: If the source data source is HDFS, set this parameter to CSV or Parquet.

    • sparkAdditionalParams: The extension parameters. This parameter is optional.

  • Example:

    curl -d "src=hdfs&dst=ld&readerConfig={\"filePath\":\"parquet/\",\"column\":[\"id\",\"intcol\",\"doublecol\",\"stringcol\",\"string1col\",\"decimalcol\"]}&writerConfig={\"columns\":[\"ROW||String\",\"f:intcol||Int\",\"f:doublecol||Double\",\"f:stringcol||String\",\"f:string1col||String\",\"f:decimalcol||Decimal\"],\"namespace\":\"default\",\"lindormTable\":\"bulkload_test\",\"compression\":\"zstd\"}&driverSpec=large&instances=5&fileType=Parquet" -H "Content-Type: application/x-www-form-urlencoded" -X POST http://{LTSMaster}:12311/pro/proc/bulkload/create

    The following content is returned. The value of the message parameter is the job ID.

    {"success":"true","message":"proc-91-ff383c616e5242888b398e51359c****"}

Get job information

  • API operation (GET): http://{LTSMaster}:12311/pro/proc/{procId}/info. Replace {LTSMaster} with the master hostname of your Lindorm instance. You can obtain the hostname from the Basic Information section on the Cluster Information page of the LTS console for your Lindorm instance.

  • Parameter: procId indicates the job ID.

  • Example:

    curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/info

    The following content is returned:

    {
        "data":{
            "checkJobs":Array,
            "procId":"proc-91-ff383c616e5242888b398e51359c****",  // Job ID
            "incrJobs":Array,
            "procConfig":Object,
            "stage":"WAIT_FOR_SUCCESS",
            "fullJobs":Array,
            "mergeJobs":Array,
            "srcDS":"hdfs",    // Source data source
            "sinkDS":"ld-uf6el41jkba96****",  // Destination data source
            "state":"RUNNING",   // Job status
            "schemaJob":Object,   
            "procType":"SPARK_BULKLOAD"   // Job type
        },
        "success":"true"
    }
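
The create and info operations can also be called from a script. The following is a minimal sketch, assuming Python 3 with the requests library installed, that submits a bulkload job with the hypothetical Parquet reader and HBase-compatible writer configurations from this topic and then polls the job state until the job is no longer RUNNING. The hostname placeholder and the polling interval are assumptions.

    import json
    import time

    import requests  # assumption: the requests library is installed

    # Assumption: replace with the master hostname of your Lindorm instance.
    LTS_MASTER = "lts-master-host"
    BASE_URL = f"http://{LTS_MASTER}:12311/pro/proc"

    # Reader and writer configurations, reusing the hypothetical Parquet and
    # HBase-compatible examples from this topic.
    reader_config = {
        "filePath": "parquet/",
        "column": ["id", "intcol", "doublecol", "stringcol", "string1col", "decimalcol"],
    }
    writer_config = {
        "namespace": "default",
        "lindormTable": "bulkload_test",
        "compression": "zstd",
        "columns": [
            "ROW||String",
            "f:intcol||Int",
            "f:doublecol||Double",
            "f:stringcol||String",
            "f:string1col||String",
            "f:decimalcol||Decimal",
        ],
    }

    # Submit the job (POST /pro/proc/bulkload/create) with the form fields
    # described in the "Submit a job" section.
    response = requests.post(
        f"{BASE_URL}/bulkload/create",
        data={
            "src": "hdfs",
            "dst": "ld",
            "readerConfig": json.dumps(reader_config),
            "writerConfig": json.dumps(writer_config),
            "driverSpec": "large",
            "instances": 5,
            "fileType": "Parquet",
        },
    )
    proc_id = response.json()["message"]  # the job ID is returned in the message field
    print("Submitted job:", proc_id)

    # Poll the job state (GET /pro/proc/{procId}/info) until the job leaves
    # the RUNNING state.
    while True:
        info = requests.get(f"{BASE_URL}/{proc_id}/info").json()
        state = info["data"]["state"]
        print("Job state:", state)
        if state != "RUNNING":
            break
        time.sleep(30)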

Stop a job

  • API operation (GET): http://{LTSMaster}:12311/pro/proc/{procId}/abort. Replace {LTSMaster} with the master hostname of your Lindorm instance. You can obtain the hostname from the Basic Information section on the Cluster Information page of the LTS console for your Lindorm instance.

  • Parameter: procId indicates the job ID.

  • Example:

    curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/abort

    The following content is returned:

    {"success":"true","message":"ok"}

Retry a job

  • API operation (GET): http://{LTSMaster}:12311/pro/proc/{procId}/retry. Replace {LTSMaster} with the master hostname of your Lindorm instance. You can obtain the hostname from the Basic Information section on the Cluster Information page of the LTS console for your Lindorm instance.

  • Parameter: procId indicates the job ID.

  • Example:

    curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/retry

    The following result is returned:

    {"success":"true","message":"ok"}

Delete a job

  • API operation (GET): http://{LTSMaster}:12311/pro/proc/{procId}/delete. Replace {LTSMaster} with the master hostname of your Lindorm instance. You can obtain the hostname from the Basic Information section on the Cluster Information page of the LTS console for your Lindorm instance.

  • Parameter: procId indicates the job ID.

  • Example:

    curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/delete

    The following result is returned:

    {"success":"true","message":"ok"}