
DataWorks: StarRocks data source

Last Updated: Oct 28, 2025

The StarRocks data source provides a bidirectional channel to read data from and write data to StarRocks. This topic describes how to use DataWorks for StarRocks data synchronization.

Supported versions

  • All versions of EMR Serverless StarRocks.

  • EMR on ECS: StarRocks 2.1.

  • The community edition of StarRocks is supported.

    Note
    • Because DataWorks supports only connections to StarRocks over an internal network, you must deploy the community edition of StarRocks on EMR on ECS.

    • The community edition of StarRocks is open source. If you encounter compatibility issues when you use this data source, submit a ticket to provide feedback.

Supported field types

Only numeric, string, and date field types are supported.

Prerequisites for data synchronization (network connectivity)

EMR Serverless StarRocks

To ensure network connectivity, add the IP addresses of your DataWorks resource group to the internal network whitelist of the EMR Serverless StarRocks instance.

  • For the IP addresses of the DataWorks resource group that you need to add to the whitelist, see General configurations: Add a whitelist.

  • The following figure shows where to add IP addresses to the whitelist for an EMR Serverless StarRocks instance.

    [Figure: whitelist settings of an EMR Serverless StarRocks instance]

Self-managed StarRocks

Ensure that the DataWorks resource group can access the query port, FE port, and BE port of StarRocks. The default ports are 9030, 8030, and 8040, respectively.
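As a rough way to verify this prerequisite, the following Python sketch tries a TCP connection to each of the three default ports. It is only an illustration: the labels in DEFAULT_PORTS and the function names are hypothetical, and the result is meaningful only if the script runs on a machine in the same network as the DataWorks resource group.

```python
import socket

# Default StarRocks ports named in this topic. The labels are illustrative only.
DEFAULT_PORTS = {"query": 9030, "fe": 8030, "be": 8040}


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_starrocks_ports(host, ports=None):
    """Map each port label to whether the port is reachable from this machine."""
    ports = ports or DEFAULT_PORTS
    return {label: is_port_open(host, port) for label, port in ports.items()}


# Usage (replace the placeholder with the address of your FE or BE node):
# print(check_starrocks_ports("<starrocks-host>"))
```

A successful TCP connection shows only that the port is reachable. It does not confirm that the whitelist, account, or password settings are correct.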

Add a data source

Before you develop a synchronization task in DataWorks, add the required data source by following the instructions in Data Source Management. When you add the data source, the infotips in the DataWorks console describe each parameter.

Select a StarRocks connection mode based on your network environment:

Scenario 1: Internal network connection (Recommended)

Connections over an internal network have low latency and provide more secure data transmission. No additional public network permissions are required.

  • Scenario: Your StarRocks instance and the serverless resource group are in the same VPC.

  • Supported connection modes:

    • Select Alibaba Cloud Instance Mode. When you select the StarRocks instance in the same VPC, the system automatically retrieves the connection information.

    • Select Connection String Mode to manually enter the internal address or IP address, port, and Load URL of the instance.

Scenario 2: Internet connection

Transmitting data over the Internet poses security risks. Use security policies such as whitelists and IP address authentication to mitigate these risks.

  • Scenario: You need to access the StarRocks instance over the Internet, for purposes such as cross-region or on-premises access.

  • Supported connection mode: Connection String Mode. Ensure that public network access is enabled for the StarRocks instance.

    • Select Connection String Mode and manually enter the public address or IP address, port, and Load URL of the instance.

Note

By default, serverless resource groups cannot access the Internet. To connect to a StarRocks instance using a public IP address, you must configure an Internet NAT gateway and elastic IP addresses (EIPs) for the attached VPC. This enables the resource group to access the data source over the Internet. Also, ensure that the serverless resource group can access the query port, FE port, and BE port of StarRocks. The default ports are 9030, 8030, and 8040, respectively.

If you use EMR Serverless StarRocks, set Host Address/IP to the Internal Address or Public Address of the instance, and set the port to the Query Port.

  • FE: You can find this value on the instance details page.

  • Database: After you connect to the instance using EMR StarRocks Manager, you can access the database from the SQL Editor or Data Management.

    Note

    To create a database, you can execute SQL commands directly in the SQL Editor.

Develop a data synchronization task

For the entry point and the procedure for configuring a synchronization task, see the following configuration guide.

Guide to configuring a single-table offline synchronization task

Appendix: Script demo and parameter description

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Reader script demo

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_datasource",
        "column": [
            "id",
            "name"
        ],
        "where": "id>100",
        "table": "table1",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}
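If you generate scripts instead of writing them by hand, a reader step like the one above can be assembled with a small helper. The following Python sketch is an illustration only: build_reader_step and REQUIRED_READER_KEYS are hypothetical names, and the check covers just the three parameters that this topic marks as required (datasource, column, and table).

```python
import json

# Parameters that this topic marks as required for the reader.
REQUIRED_READER_KEYS = ("datasource", "column", "table")


def build_reader_step(datasource, table, columns, where=None, split_pk=None, database=None):
    """Assemble a StarRocks reader step that uses the same keys as the demo."""
    parameter = {"datasource": datasource, "table": table, "column": list(columns)}
    # Optional parameters are added only when a value is provided.
    if database:
        parameter["selectedDatabase"] = database
    if where:
        parameter["where"] = where
    if split_pk:
        parameter["splitPk"] = split_pk
    missing = [key for key in REQUIRED_READER_KEYS if not parameter.get(key)]
    if missing:
        raise ValueError(f"missing required reader parameters: {missing}")
    return {
        "stepType": "starrocks",
        "parameter": parameter,
        "name": "Reader",
        "category": "reader",
    }


step = build_reader_step(
    "starrocks_datasource", "table1", ["id", "name"],
    where="id>100", split_pk="id", database="didb1",
)
print(json.dumps(step, indent=4))
```

Leaving out where produces a step without a filter condition, which corresponds to a full synchronization.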

Reader script parameters

| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the StarRocks data source. | Yes | None |
| selectedDatabase | The name of the StarRocks database. | No | The database name configured in the StarRocks data source. |
| column | The columns to read from the source table. | Yes | None |
| where | The filter condition. In many scenarios, you might synchronize only the data for the current day. To do this, set the where condition to gmt_create>$bizdate. A where condition enables effective incremental synchronization. If you do not provide a where statement, or if you do not provide a key or value for where, a full data synchronization is performed. | No | None |
| table | The name of the source table. | Yes | None |
| splitPk | When StarRocks Reader extracts data, specifying splitPk tells the system to use this field for data sharding. This starts concurrent tasks and improves the efficiency of data synchronization. We recommend that you use the primary key of the table for splitPk. Primary keys are usually distributed evenly, which helps prevent data hot spots in the shards. | No | None |
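To make the idea behind splitPk more concrete, the following Python sketch shows one common way to shard on an integer key: divide the key range into contiguous intervals and give each concurrent task its own filter. This is an assumption about range-based sharding in general, not the algorithm that StarRocks Reader actually uses. The function names are hypothetical.

```python
def split_ranges(min_pk, max_pk, shards):
    """Split the closed interval [min_pk, max_pk] into up to `shards` contiguous ranges."""
    if shards < 1 or min_pk > max_pk:
        raise ValueError("need shards >= 1 and min_pk <= max_pk")
    total = max_pk - min_pk + 1
    shards = min(shards, total)          # never create empty shards
    base, extra = divmod(total, shards)  # the first `extra` shards get one more key
    ranges, start = [], min_pk
    for i in range(shards):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def shard_conditions(split_pk, ranges, where=""):
    """Turn each range into a filter, combined with an optional user-defined where."""
    prefix = f"({where}) AND " if where else ""
    return [f"{prefix}{split_pk} >= {lo} AND {split_pk} <= {hi}" for lo, hi in ranges]


# Assumed key range 101..200, split across four concurrent tasks.
for condition in shard_conditions("id", split_ranges(101, 200, 4), where="id>100"):
    print(condition)
```

When the key values are evenly distributed, each interval covers a similar number of rows. This is why a primary key is the recommended choice for splitPk.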

Writer script demo

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "loadProps": {
            "row_delimiter": "\\x02",
            "column_separator": "\\x01"
        },
        "datasource": "starrocks_public",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "1.1.X.X:8030"
        ],
        "table": "table1",
        "preSql": [
            "truncate table table1"
        ],
        "postSql": [
        ],
        "maxBatchRows": 500000,
        "maxBatchSize": 5242880,
        "strategyOnError": "exit"
    },
    "name": "Writer",
    "category": "writer"
}

Writer script parameters

| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the StarRocks data source. | Yes | None |
| selectedDatabase | The name of the StarRocks database. | No | The database name configured in the StarRocks data source. |

loadProps

The request parameters for StarRocks Stream Load. When you import data in CSV format by using Stream Load, you can configure the import parameters here. If you have no special configurations, use {}. Configurable parameters include the following:

  • column_separator: The column delimiter for CSV import. The default value is \t.

  • row_delimiter: The row delimiter for CSV import. The default value is \n.

If your data contains \t or \n, you must use other characters as delimiters. The following example shows how to use special characters:

{"column_separator":"\\x01","row_delimiter":"\\x02"}
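The double backslash in this example matters. After the script is parsed as JSON, each value is the four-character text \x01 or \x02, not a control character. The following Python sketch demonstrates this and then shows why such delimiters are safe for data that contains tabs or line breaks. It assumes that Stream Load interprets the \x prefix as a hexadecimal character code. The hex_to_char helper is hypothetical and handles a single code only.

```python
import json

# The loadProps fragment as it appears in the script (JSON text).
load_props = json.loads(r'{"column_separator":"\\x01","row_delimiter":"\\x02"}')

# After JSON decoding, the value is backslash, x, 0, 1: four characters.
print(len(load_props["column_separator"]))


def hex_to_char(notation):
    """Convert text such as \\x01 into the single character that it denotes."""
    if not notation.startswith("\\x"):
        return notation
    return chr(int(notation[2:], 16))


col_sep = hex_to_char(load_props["column_separator"])
row_sep = hex_to_char(load_props["row_delimiter"])

# Field values that contain \n and \t would break the default delimiters.
rows = [["1", "line one\nline two"], ["2", "tab\tinside"]]
payload = row_sep.join(col_sep.join(fields) for fields in rows)

# Splitting on the control characters recovers the original fields.
decoded = [record.split(col_sep) for record in payload.split(row_sep)]
print(decoded == rows)
```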

Stream Load also supports importing data in JSON format. To use the JSON format, configure loadProps as follows:

{
  "format": "json"
}

The parameters that can be configured for the JSON format are:

  • strip_outer_array: Specifies whether to strip the outermost array structure. Valid values: true and false. Default value: false.

    In a real-world scenario, the JSON data to import might be enclosed in a pair of square brackets []. In this case, set this parameter to true. StarRocks then strips the outer brackets [] and imports each inner object as a separate row. If you set this parameter to false, StarRocks parses the entire JSON file as a single array and imports it as one row.

    For example, the JSON data to be imported is as follows:

    [{"category":1,"author":2},{"category":3,"author":4}]
    
    • If you set this parameter to true, StarRocks parses {"category":1,"author":2} and {"category":3,"author":4} into two rows and imports them into the corresponding rows of the destination StarRocks table.

    • If you set this parameter to false, StarRocks parses the entire JSON array as one row and imports it into the destination StarRocks table.

  • ignore_json_size: Specifies whether to check the size of the JSON body in the HTTP request.

    Note

    By default, the size of the JSON body in an HTTP request cannot exceed 100 MB. If the size exceeds 100 MB, the following error is returned:

    The size of this batch exceed the max size [104857600] of json type data data [8617627793].Set ignore_json_size to skip check,although it may lead enormous memory consuming.

    To prevent this error, add ignore_json_size: true to the HTTP request header to skip the size check.

  • compression: Specifies the compression algorithm to use during Stream Load data transmission. Supported algorithms: GZIP, BZIP2, LZ4_FRAME, and ZSTD.

  • strict_mode: Specifies whether to enable strict mode.

    Valid values:

    • true: Enables strict mode. StarRocks filters out invalid data rows, imports only valid data rows, and returns details about the invalid data.

    • false: Disables strict mode. StarRocks converts fields that fail conversion to NULL and imports the rows that contain these NULL values together with the valid rows.

    Default value: false.

Required: Yes

Default value: None
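The effect of strip_outer_array in the loadProps description can be modeled in a few lines of Python. The sketch below only mirrors the behavior described in this topic. It is not the StarRocks parser, and rows_from_json is a hypothetical name.

```python
import json


def rows_from_json(body, strip_outer_array=False):
    """Return the rows that the strip_outer_array description implies for a JSON body."""
    data = json.loads(body)
    if strip_outer_array and isinstance(data, list):
        return data   # each element of the outer array becomes its own row
    return [data]     # the whole document is treated as a single row


body = '[{"category":1,"author":2},{"category":3,"author":4}]'
print(len(rows_from_json(body, strip_outer_array=True)))   # number of rows when true
print(len(rows_from_json(body, strip_outer_array=False)))  # number of rows when false
```

With the sample data from this topic, the first call yields two rows and the second call yields one.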

| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| column | The columns to write to the destination table. | Yes | None |
| loadUrl | The IP address and HTTP port of the StarRocks frontend (FE). The default port is 8030. If you have multiple FE nodes, you can enter all of them, separated by commas (,). | Yes | None |
| table | The name of the destination table. | Yes | None |
| preSql | An SQL statement that is executed before the data synchronization task runs. For example, use truncate table tablename to clear old data from a table before the task starts. | No | None |
| postSql | An SQL statement that is executed after the data synchronization task finishes. | No | None |
| maxBatchRows | The maximum number of rows to write per batch. | No | 500000 |
| maxBatchSize | The maximum number of bytes to write per batch. | No | 5242880 |
| strategyOnError | The policy for handling exceptions during batch writing to StarRocks. Valid values: exit (the synchronization task fails and exits when an exception occurs while writing to StarRocks) and batchDirtyData (when an exception occurs while writing to StarRocks, the current batch of data is recorded as dirty data). | No | exit |
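The following Python sketch shows how maxBatchRows, maxBatchSize, and strategyOnError might interact. It assumes that a batch is flushed as soon as either limit is reached and that batchDirtyData records the failed batch and continues. The actual writer may differ in these details. The write_in_batches function and its send callback are hypothetical.

```python
def write_in_batches(rows, send, max_batch_rows=500000, max_batch_size=5242880,
                     strategy_on_error="exit"):
    """Buffer encoded rows and flush when either batch limit is reached.

    rows: iterable of bytes objects. send: callable that loads one batch and may raise.
    Returns the rows that were recorded as dirty data.
    """
    dirty, batch, batch_bytes = [], [], 0

    def flush():
        nonlocal batch, batch_bytes
        if not batch:
            return
        try:
            send(batch)
        except Exception:
            if strategy_on_error == "exit":
                raise            # exit: the task fails on the first write error
            dirty.extend(batch)  # batchDirtyData: record the batch and continue
        batch, batch_bytes = [], 0

    for row in rows:
        batch.append(row)
        batch_bytes += len(row)
        if len(batch) >= max_batch_rows or batch_bytes >= max_batch_size:
            flush()
    flush()  # send whatever remains after the last row
    return dirty
```

For example, calling write_in_batches with max_batch_rows=2 on three rows results in two calls to send: one with two rows and one with the remaining row.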