
E-MapReduce: Stream Load

Last Updated: Mar 26, 2026

Stream Load is a synchronous HTTP-based method for loading local files or data streams directly into StarRocks. Submit a PUT request, and the HTTP response tells you immediately whether the load succeeded. Stream Load supports CSV and JSON, with a single-load limit of 10 GB.

Submit a load task

Submit a Stream Load task by sending a PUT request to the FE node. The examples in this topic use curl, but any HTTP client works.

Syntax

curl --location-trusted -u <username>:<password> \
     -H "Expect:100-continue" \
     -XPUT <url> \
     (data_desc) \
     [opt_properties]
Include Expect:100-continue in the request header. This prevents unnecessary data transfer if the server rejects the task before the payload is sent.

URL

http://<fe_host>:<fe_http_port>/api/<database_name>/<table_name>/_stream_load
Parameter | Required | Description
<fe_host> | Yes | IP address of the FE node.
<fe_http_port> | Yes | HTTP port of the FE node. Default: 18030. Run SHOW FRONTENDS to look up the address and port, or check the http_port parameter on the Configuration tab of the Cluster Service page.
<database_name> | Yes | Name of the database that contains the target table.
<table_name> | Yes | Name of the target table.
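In scripts, the URL is conveniently assembled from these four parts. A minimal Bash sketch; the host, port, database, and table below are made-up placeholders, not values from this topic's cluster:

```shell
# All values here are hypothetical; substitute your own FE address and target table.
fe_host="172.17.0.10"
fe_http_port=18030
database_name="load_test"
table_name="example_table"
url="http://${fe_host}:${fe_http_port}/api/${database_name}/${table_name}/_stream_load"
echo "$url"
```

The resulting variable can then be passed directly to curl as -XPUT "$url".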

Source data parameters

The data_desc block describes the source file and controls how StarRocks maps and filters the data.

Common parameters

Parameter | Required | Description
-T <file_path> | Yes | Path to the source data file. The filename can include or omit a file extension.
format | No | File format. Valid values: CSV (default) and JSON.
partitions | No | Target partitions. If omitted, data is loaded into all partitions of the table.
temporary_partitions | No | Target temporary partitions.
columns | No | Column mapping between the source file and the StarRocks table. Required when the file schema does not match the table schema. See Configure column mapping.

CSV parameters

Parameter | Required | Description
column_separator | No | Column delimiter. Default: \t. For invisible characters, use hexadecimal with the \x prefix. For example, -H "column_separator:\x01" specifies the default Hive delimiter.
row_delimiter | No | Row delimiter. Default: \n. Note: curl cannot pass \n directly; use Bash escaped string syntax, such as -H $'row_delimiter:\n'.
skip_header | No | Number of header rows to skip at the start of the CSV file. Default: 0.
where | No | Filter condition. Only rows that match the condition are loaded. Example: -H "where: k1 = 20180601".
max_filter_ratio | No | Maximum ratio of rows that can be filtered out due to data quality issues. Default: 0 (zero tolerance). Rows filtered by where are not counted.
strict_mode | No | Controls type-conversion filtering. false (default): disabled. true: strict filtering is applied during column type conversions.
timeout | No | Load timeout in seconds. Default: 600. Valid range: 1–259200.
timezone | No | Time zone for the load. Default: UTC+8. Affects all time zone-related functions in the load.
exec_mem_limit | No | Memory limit for the load. Default: 2 GB.
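The quoting rule for row_delimiter can be verified locally without a cluster. This Bash sketch shows why $'...' (ANSI-C quoting) is required for \n, whereas the \x01 form for column_separator is passed through literally and decoded by StarRocks itself:

```shell
# A double-quoted "\n" stays as the two characters backslash + n;
# Bash ANSI-C quoting ($'...') expands it to a single newline byte.
plain="row_delimiter:\n"     # 14 chars + '\' + 'n'  = 16 bytes
ansi=$'row_delimiter:\n'     # 14 chars + newline    = 15 bytes
printf '%s' "$plain" | wc -c
printf '%s' "$ansi" | wc -c
```

This is why the header must be written as -H $'row_delimiter:\n' rather than -H "row_delimiter:\n".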

JSON parameters

Parameter | Required | Description
jsonpaths | No | Fields to load, in JSON format. Required only for JSON matching mode.
strip_outer_array | No | Whether to strip the outermost JSON array. false (default): the entire array is imported as a single value. true: each element of the array is imported as a separate row.
json_root | No | Root element of the JSON data. Required only for JSON matching mode. Must be a valid JsonPath string. Default: empty (the entire JSON file is imported).
ignore_json_size | No | Whether to skip the JSON body size check. By default, the JSON body cannot exceed 100 MB. Set to true to skip the check, but note that large payloads may cause high memory usage.
compression / Content-Encoding | No | Compression algorithm for data transmission. Supported: GZIP, BZIP2, LZ4_FRAME, and Zstandard.

Load task options

opt_properties are optional settings that apply to the entire load task.

-H "label: <label_name>"
-H "where: <condition>"
-H "max_filter_ratio: <num>"
-H "log_rejected_record_num: <num>"
-H "timeout: <num>"
-H "strict_mode: true | false"
-H "timezone: <string>"
-H "load_mem_limit: <num>"
-H "partial_update: true | false"
-H "partial_update_mode: row | column"
-H "merge_condition: <column_name>"
Parameter | Required | Description
label | No | Label for the load task. Prevents duplicate imports: StarRocks rejects any task that reuses a label from a successfully completed task in the last 30 minutes.
where | No | Filter condition applied to transformed data. Only rows matching the condition are loaded.
max_filter_ratio | No | Maximum ratio of rows that can be filtered out due to data quality issues. Default: 0. Rows filtered by where are not counted.
log_rejected_record_num | No | Maximum number of filtered rows to log per BE (or CN) node. Supported in StarRocks 3.1 and later. Valid values: 0 (default, no logging), -1 (log all), or a positive integer.
timeout | No | Load timeout in seconds. Default: 600. Valid range: 1–259200.
strict_mode | No | false (default): disabled. true: strict filtering is applied.
timezone | No | Time zone for the load. Default: UTC+8.
load_mem_limit | No | Memory limit for the load task. Default: 2 GB.
partial_update | No | Whether to use partial column updates. Default: false.
partial_update_mode | No | Mode for partial updates. row (default): suitable for real-time updates of many columns in small batches. column: suitable for batch updates of a few columns across many rows. For example, updating 10 out of 100 columns for all rows can improve performance by a factor of 10.
merge_condition | No | Column used as the update condition. The update takes effect only when the imported value is greater than or equal to the current value. Must be a non-primary-key column. Supported only for primary key tables.
In StarRocks SQL, reserved keywords must be enclosed in backticks. See Keywords.
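The interaction between max_filter_ratio and where can be sanity-checked with simple arithmetic. A sketch with made-up counts, assuming the ratio is computed as filtered rows over rows that survive the where filter, that is, filtered / (total - unselected):

```shell
# Hypothetical counts from a load result: 1000 rows read, 50 dropped by
# `where` (NumberUnselectedRows), 30 rejected for data quality
# (NumberFilteredRows). Rows dropped by `where` do not count against
# max_filter_ratio.
total=1000; unselected=50; filtered=30
ratio=$(awk -v f="$filtered" -v t="$total" -v u="$unselected" \
    'BEGIN { printf "%.4f", f / (t - u) }')
echo "filter ratio: $ratio"
# 0.0316 <= 0.2, so a task submitted with -H "max_filter_ratio:0.2" succeeds;
# with the default of 0 it would fail.
```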

Return value

After the load completes, StarRocks returns the result in JSON format:

{
    "TxnId": 9,
    "Label": "label2",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 4,
    "NumberLoadedRows": 4,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 45,
    "LoadTimeMs": 235,
    "BeginTxnTimeMs": 101,
    "StreamLoadPlanTimeMs": 102,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 11,
    "CommitAndPublishTimeMs": 19
}
Field | Description
TxnId | Transaction ID of the load.
Label | Label of the load task.
Status | Final status of the load. See the Load status table below.
ExistingJobStatus | Status of the existing task when Status is Label Already Exists. Valid values: RUNNING and FINISHED.
Message | Details about the load status. If the load fails, contains the failure reason.
NumberTotalRows | Total rows read from the data stream.
NumberLoadedRows | Rows successfully loaded. Valid only when Status is Success.
NumberFilteredRows | Rows filtered out due to data quality issues.
NumberUnselectedRows | Rows filtered out by the where condition.
LoadBytes | Size of the source data, in bytes.
LoadTimeMs | Total time for the load task, in milliseconds.
BeginTxnTimeMs | Time to start the transaction, in milliseconds.
StreamLoadPlanTimeMs | Time to generate the execution plan, in milliseconds.
ReadDataTimeMs | Time to read data, in milliseconds.
WriteDataTimeMs | Time to write data, in milliseconds.
CommitAndPublishTimeMs | Time to commit and publish the transaction, in milliseconds.
ErrorURL | URL for error details when the load fails. See Check error details.
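In automation, check the Status field rather than curl's exit code, because the HTTP request itself can succeed while the load fails. A minimal sketch in Bash plus python3 for JSON parsing; the response below is hard-coded for illustration rather than captured from a live cluster:

```shell
# In practice: response=$(curl --location-trusted -u "<username>:<password>" ... )
response='{"TxnId": 9, "Label": "label2", "Status": "Success", "Message": "OK", "NumberTotalRows": 4, "NumberLoadedRows": 4, "NumberFilteredRows": 0}'
status=$(printf '%s' "$response" | python3 -c 'import sys, json; print(json.load(sys.stdin)["Status"])')
case "$status" in
    "Success"|"Publish Timeout") echo "load ok: $status" ;;
    "Label Already Exists")      echo "retry with a new label" >&2 ;;
    *)                           echo "load failed: $status" >&2 ;;
esac
```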

Load status

Status | Meaning | Action
Success | Data is loaded and visible. | None.
Publish Timeout | The load was submitted successfully, but data visibility may be delayed. | No retry needed.
Label Already Exists | The label is already used by another task. | Change the label and resubmit.
Fail | The load failed. | Check ErrorURL for details, fix the issue, and retry.

Check error details

When Status is Fail, use ErrorURL to retrieve details about filtered rows:

# View error rows directly
curl "<ErrorURL>"

# Save error rows to a local file
wget "<ErrorURL>"

Example:

wget "http://172.17.**.**:18040/api/_load_error_log?file=error_log_b74dccdcf0ceb4de_e82b2709c6c013ad"

Cancel a load task

Stream Load tasks cannot be canceled manually. A task is automatically canceled if it times out or an import error occurs.

Examples

Load CSV data

This example loads data.csv into example_table in the load_test database:

curl --location-trusted -u "root:" \
     -H "Expect:100-continue" \
     -H "label:label2" \
     -H "column_separator:," \
     -T data.csv -XPUT \
     http://172.17.**.**:18030/api/load_test/example_table/_stream_load

Load JSON data

This example loads JSON records from json.data:

curl --location-trusted -u "root:" \
     -H "Expect:100-continue" \
     -H "label:label2" \
     -H "format:json" \
     -T json.data -XPUT \
     http://172.17.**.**:18030/api/load_test/example_table/_stream_load

Set error tolerance

To allow up to 20% of rows to be filtered out due to data quality issues:

curl --location-trusted -u "root:" \
     -H "Expect:100-continue" \
     -H "label:label3" \
     -H "column_separator:," \
     -H "max_filter_ratio:0.2" \
     -T data.csv -XPUT \
     http://172.17.**.**:18030/api/load_test/example_table/_stream_load

Filter rows by condition

To load only rows where k1 = 20180601:

curl --location-trusted -u "root:" \
     -H "Expect:100-continue" \
     -H "label:label4" \
     -H "column_separator:," \
     -H "where:k1 = 20180601" \
     -T data.csv -XPUT \
     http://172.17.**.**:18030/api/load_test/example_table/_stream_load

Configure column mapping

When the source file column order does not match the table schema, use the columns parameter to specify the mapping.

Example 1: The table has columns c1, c2, c3. The source file columns are in the order c3, c2, c1:

curl --location-trusted -u "root:" \
     -H "Expect:100-continue" \
     -H "label:label5" \
     -H "column_separator:," \
     -H "columns:c3, c2, c1" \
     -T data.csv -XPUT \
     http://172.17.**.**:18030/api/load_test/example_table/_stream_load

Example 2: The source file has an extra column that does not exist in the table. Use a placeholder for the extra column:

curl --location-trusted -u "root:" \
     -H "Expect:100-continue" \
     -H "label:label6" \
     -H "column_separator:," \
     -H "columns:c1, c2, c3, temp" \
     -T data.csv -XPUT \
     http://172.17.**.**:18030/api/load_test/example_table/_stream_load

Example 3: The table has columns year, month, day. The source file has a single timestamp column in 2018-06-01 01:02:03 format. Use derived column expressions:

curl --location-trusted -u "root:" \
     -H "Expect:100-continue" \
     -H "label:label7" \
     -H "column_separator:," \
     -H "columns:col, year=year(col), month=month(col), day=day(col)" \
     -T data.csv -XPUT \
     http://172.17.**.**:18030/api/load_test/example_table/_stream_load

Load compressed data

To load an LZ4-compressed JSON file:

curl --location-trusted -u "root:" \
     -H "Expect:100-continue" \
     -H "label:label8" \
     -H "format:json" \
     -H "compression:lz4_frame" \
     -T data.json.lz4 -XPUT \
     http://172.17.**.**:18030/api/load_test/example_table/_stream_load

End-to-end example

Step 1: Create the target table

  1. Log in to the master node of the StarRocks cluster using SSH. For details, see Log on to a cluster.

  2. Connect to the StarRocks cluster using a MySQL client:

    mysql -h127.0.0.1 -P 9030 -uroot
  3. Create the database and table:

    CREATE DATABASE IF NOT EXISTS load_test;
    USE load_test;
    
    CREATE TABLE IF NOT EXISTS example_table (
        id INT,
        name VARCHAR(50),
        age INT
    )
    DUPLICATE KEY(id)
    DISTRIBUTED BY HASH(id) BUCKETS 3
    PROPERTIES (
        "replication_num" = "1"  -- Set the number of replicas to 1.
    );

    Press Ctrl+D to exit the MySQL client.

Step 2: Prepare the source data

Prepare CSV data

Create a file named data.csv with the following content:

id,name,age
1,Alice,25
2,Bob,30
3,Charlie,35

Prepare JSON data

Create a file named json.data with the following content:

{"id":1,"name":"Emily","age":25}
{"id":2,"name":"Benjamin","age":35}
{"id":3,"name":"Olivia","age":28}
{"id":4,"name":"Alexander","age":60}
{"id":5,"name":"Ava","age":17}
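Because a malformed record counts against max_filter_ratio, it can be worth validating the newline-delimited JSON locally before submitting. A hypothetical pre-flight check, with the records inlined so the sketch is self-contained (point python3 at json.data in real use):

```shell
# Validate that every newline-delimited record parses as JSON.
records='{"id":1,"name":"Emily","age":25}
{"id":2,"name":"Benjamin","age":35}
{"id":3,"name":"Olivia","age":28}'
count=$(printf '%s\n' "$records" | python3 -c 'import json, sys; print(len([json.loads(l) for l in sys.stdin if l.strip()]))')
echo "ok: $count records"
```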

Step 3: Run the load task

Import CSV data

Load the CSV data. Because data.csv starts with a header row, skip_header:1 is added so that the header is not rejected as invalid data:

curl --location-trusted -u "root:" \
     -H "Expect:100-continue" \
     -H "label:label1" \
     -H "column_separator:," \
     -H "skip_header:1" \
     -T data.csv -XPUT \
     http://172.17.**.**:18030/api/load_test/example_table/_stream_load

Import JSON data

Load JSON data:

curl --location-trusted -u "root:" \
     -H "Expect:100-continue" \
     -H "label:label2" \
     -H "format:json" \
     -T json.data -XPUT \
     http://172.17.**.**:18030/api/load_test/example_table/_stream_load
