Lindorm provides the bulkload feature to import data quickly and reliably. This topic describes how to import data in batches.
Features
The bulkload feature loads data files in bypass mode. It does not use the data API write path or the computing resources of your instance. Compared to importing data using an API, the bulkload feature provides the following advantages:
Imports data more than 10 times faster.
Ensures stable online services because it does not use online service resources.
Offers flexible resource usage by separating online and offline resources.
Supports importing data from various sources and formats, such as CSV, ORC, and Parquet files and MaxCompute tables.
Easy to use. You do not need to write any code to load data in batches in bypass mode.
Cost-effective. Lindorm Tunnel Service (LTS) uses the cloud-native elastic capability of serverless Spark to provide computing resources for bulkloading. Resources are scaled as needed and are pay-as-you-go. You do not need to configure computing resources for long periods, which reduces costs.
Prerequisites
LTS is activated and you are logged on to the LTS console. For more information, see Activate and log on to LTS.
The Lindorm compute engine is activated. For more information, see Activation and upgrade/downgrade.
A Spark data source is added. For more information, see Add a Spark data source.
Supported data sources
Source data source | Destination data source
MaxCompute table | LindormTable
HDFS CSV or OSS CSV | LindormTable
HDFS Parquet or OSS Parquet | LindormTable
HDFS ORC or OSS ORC | LindormTable
Submission methods
You can submit a job to quickly import data in one of the following ways:
Submit a job using the LTS console
Log on to the LTS console. For more information, see Activate and log on to LTS.
In the navigation pane on the left, choose Data Source Management > Add Data Source to add the following data sources.
Add an ODPS data source. For more information, see ODPS data source.
Add a Lindorm wide table data source. For more information, see Lindorm wide table data source.
Add an HDFS data source. For more information, see Add an HDFS data source.
In the navigation pane on the left, choose .
Note: For LTS versions earlier than 3.8.12.4.3, choose .
To view the LTS version, go to the Configuration Information section on the Instance Details page in the Lindorm console.
Click Create Job and configure the following parameters.
Select Data Source
Source Data Source: Select the ODPS or HDFS data source that you added.
Destination Data Source: Select the Lindorm wide table data source that you added.
Plugin Configuration
Reader Configuration:
If the source is an ODPS data source, configure the following reader parameters:
table: The name of the ODPS table.
column: The names of the ODPS columns to import.
partition: Leave this parameter empty for a non-partitioned table. For a partitioned table, specify the partition information.
numPartitions: The degree of parallelism for reading data.
If the source is a CSV file in Hadoop Distributed File System (HDFS), configure the following reader parameters:
filePath: The directory where the CSV file is located.
header: Specifies whether the CSV file contains a header row.
delimiter: The separator used in the CSV file.
column: The column names and their corresponding types in the CSV file.
If the source is a Parquet file in HDFS, configure the following reader parameters:
filePath: The directory where the Parquet file is located.
column: The column names in the Parquet file.
Note: For configuration examples, see Configuration examples.
Writer Configuration:
namespace: The namespace of the Lindorm wide table.
lindormTable: The name of the Lindorm wide table.
compression: The compression algorithm. Only zstd is supported. To disable compression, set this parameter to none.
columns: Configure this parameter based on the type of the destination table.
If you import data to a Lindorm wide table, specify the column names of the Lindorm SQL wide table. The columns must correspond to those in the reader configuration.
If you import data to a Lindorm table that is compatible with HBase, specify the standard column names of the HBase table. The columns must correspond to those in the reader configuration.
timestamp: The timestamp of the data in the Lindorm wide table. The following types are supported:
A 13-digit Long value (a millisecond epoch timestamp).
A String value in the yyyy-MM-dd HH:mm:ss or yyyy-MM-dd HH:mm:ss SSS format.
Note: For configuration examples, see Configuration examples.
Job Running Parameter Configuration
Spark Driver Specification: Select the Spark driver specification.
Spark Executor Specification: Select the Spark executor specification.
Number Of Executors: Enter the number of executors.
Spark Configuration: Optional. Enter additional Spark configuration properties; see the illustrative entries after this list.
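Where you need to tune the job, you can pass standard Spark properties in the Spark Configuration field. The entries below are a minimal illustration with assumed values, not recommendations from this topic:

spark.sql.shuffle.partitions=200       # Illustrative value; tune to your data volume.
spark.executor.memoryOverhead=2g       # Illustrative value; raise it if executors exceed memory limits.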
Click Create.
On the Bulkload page, click the job name to view the job details.
On the job details page, click the job name to open the Spark UI of the job.
Click Details to view the execution logs of the job.
Note: If data is evenly distributed across partitions in the destination Lindorm wide table, importing 100 GB of data with a 4:1 compression ratio takes about 1 hour. The actual time may vary.
Configuration examples
Reader plugin configuration examples
Example of a reader configuration for an ODPS data source:

{
    "table": "test",
    "column": ["id", "intcol", "doublecol", "stringcol", "string1col", "decimalcol"],
    "partition": ["pt=1"],
    "numPartitions": 10
}

Example of a reader configuration for a CSV file in an HDFS data source:

{
    "filePath": "csv/",
    "header": false,
    "delimiter": ",",
    "column": ["id|string", "intcol|int", "doublecol|double", "stringcol|string", "string1col|string", "decimalcol|decimal"]
}

Example of a reader configuration for a Parquet file in an HDFS data source:

{
    "filePath": "parquet/",
    "column": ["id", "intcol", "doublecol", "stringcol", "string1col", "decimalcol"]   // Column names in the Parquet file.
}
Writer plugin configuration examples
Example of a writer configuration for importing data to a Lindorm SQL table:

{
    "namespace": "default",
    "lindormTable": "xxx",
    "compression": "zstd",
    "timestamp": "2022-07-01 10:00:00",
    "columns": ["id", "intcol", "doublecol", "stringcol", "string1col", "decimalcol"]
}

Example of a writer configuration for importing data to a Lindorm table that is compatible with HBase:

{
    "namespace": "default",
    "lindormTable": "xxx",
    "compression": "zstd",
    "timestamp": "2022-07-01 10:00:00",
    "columns": [
        "ROW||String",           // ROW represents the rowkey, and String represents its type.
        "f:intcol||Int",         // Format: column family:column name||column type.
        "f:doublecol||Double",
        "f:stringcol||String",
        "f:string1col||String",
        "f:decimalcol||Decimal"
    ]
}
Submit a job using an API operation
Submit a job
API operation (POST):
http://{LTSMaster}:12311/pro/proc/bulkload/create. Replace {LTSMaster} with the master hostname of your Lindorm instance. You can obtain the hostname from the Basic Information section on the Cluster Information page of the LTS console for your Lindorm instance.
Parameters:
Parameter | Description
src | The name of the source data source.
dst | The name of the destination data source.
readerConfig | The reader plugin configuration in JSON format. For configuration examples, see Configuration examples.
writerConfig | The writer plugin configuration in JSON format. For configuration examples, see Configuration examples.
driverSpec | The specification of the driver. Valid values: small, medium, large, and xlarge. We recommend that you set this parameter to large.
instances | The number of executors.
fileType | If the source data source is HDFS, set this parameter to CSV or Parquet.
sparkAdditionalParams | Optional. Additional Spark parameters.
Example:
curl -d "src=hdfs&dst=ld&readerConfig={\"filePath\":\"parquet/\",\"column\":[\"id\",\"intcol\",\"doublecol\",\"stringcol\",\"string1col\",\"decimalcol\"]}&writerConfig={\"columns\":[\"ROW||String\",\"f:intcol||Int\",\"f:doublecol||Double\",\"f:stringcol||String\",\"f:string1col||String\",\"f:decimalcol||Decimal\"],\"namespace\":\"default\",\"lindormTable\":\"bulkload_test\",\"compression\":\"zstd\"}&driverSpec=large&instances=5&fileType=Parquet" -H "Content-Type: application/x-www-form-urlencoded" -X POST http://{LTSMaster}:12311/pro/proc/bulkload/createThe following content is returned. The value of the message parameter is the job ID.
{"success":"true","message":"proc-91-ff383c616e5242888b398e51359c****"}
Get job information
API operation (GET):
http://{LTSMaster}:12311/pro/proc/{procId}/info. Replace {LTSMaster} with the master hostname of your Lindorm instance. You can obtain the hostname from the Basic Information section on the Cluster Information page of the LTS console for your Lindorm instance.
Parameter: procId indicates the job ID.
Example:
curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/info

The following content is returned:

{
    "data": {
        "checkJobs": Array,
        "procId": "proc-91-ff383c616e5242888b398e51359c****",   // Job ID
        "incrJobs": Array,
        "procConfig": Object,
        "stage": "WAIT_FOR_SUCCESS",
        "fullJobs": Array,
        "mergeJobs": Array,
        "srcDS": "hdfs",                     // Source data source
        "sinkDS": "ld-uf6el41jkba96****",    // Destination data source
        "state": "RUNNING",                  // Job status
        "schemaJob": Object,
        "procType": "SPARK_BULKLOAD"         // Job type
    },
    "success": "true"
}
Stop a job
API operation (GET):
http://{LTSMaster}:12311/pro/proc/{procId}/abort. Replace {LTSMaster} with the master hostname of your Lindorm instance. You can obtain the hostname from the Basic Information section on the Cluster Information page of the LTS console for your Lindorm instance.
Parameter: procId indicates the job ID.
Example:
curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/abort

The following content is returned:
{"success":"true","message":"ok"}
Retry a job
API operation (GET):
http://{LTSMaster}:12311/pro/proc/{procId}/retry. Replace {LTSMaster} with the master hostname of your Lindorm instance. You can obtain the hostname from the Basic Information section on the Cluster Information page of the LTS console for your Lindorm instance.
Parameter: procId indicates the job ID.
Example:
curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/retry

The following result is returned:
{"success":"true","message":"ok"}
Delete a job
API operation (GET):
http://{LTSMaster}:12311/pro/proc/{procId}/delete. Replace {LTSMaster} with the master hostname of your Lindorm instance. You can obtain the hostname from the Basic Information section on the Cluster Information page of the LTS console for your Lindorm instance.
Parameter: procId indicates the job ID.
Example:
curl http://{LTSMaster}:12311/pro/proc/proc-91-ff383c616e5242888b398e51359c****/delete

The following result is returned:
{"success":"true","message":"ok"}