
DataWorks:StarRocks data source

Last Updated:Apr 03, 2025

DataWorks provides StarRocks Reader and StarRocks Writer for you to read data from and write data to StarRocks data sources. This topic describes the capabilities of synchronizing data from or to StarRocks data sources.

Supported versions

  • E-MapReduce (EMR) Serverless StarRocks 2.5 and 3.1.

  • EMR on ECS: StarRocks 2.1.

  • Community Edition StarRocks.

    Note
    • Because DataWorks only supports internal network connections to StarRocks, Community Edition StarRocks must be deployed on EMR on ECS.

    • Community Edition StarRocks is highly open. If you encounter compatibility issues when you use this type of data source, submit a ticket for feedback.

Supported field types

Only numeric, string, and date field types are supported.

Preparations before data synchronization

To ensure network connectivity of resource groups, you must add the IP addresses of the DataWorks resource groups that you will use to the internal network whitelist of the EMR Serverless StarRocks instance in advance. You must also allow access from these IP addresses to port 9030 (the FE query port) and ports 8030 and 8040 (the FE and BE HTTP ports).

  • For the IP addresses of DataWorks resource groups that you must add to the whitelist, see Add a whitelist.

  • You can configure the IP address whitelist on the details page of the EMR Serverless StarRocks instance in the EMR console.
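Before you configure the data source, you can verify that the network where your task runs can reach those ports. The following is a minimal sketch; the host name is a placeholder for your instance's internal address:

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# "fe.example.internal" is a placeholder; substitute your instance's internal address.
for port in (9030, 8030, 8040):
    print(port, port_reachable("fe.example.internal", port, timeout=1.0))
```

If a port reports unreachable, re-check the whitelist entries and the security group rules before you continue.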

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.

Choose a StarRocks connection mode based on your network environment:

Scenario 1: Internal network connection (recommended)

Internal network connections provide lower latency and more secure data transmission, and they do not require additional public network access.

  • Applicable scenario: Your StarRocks instance and Serverless resource group are in the same VPC.

  • Supports both Alibaba Cloud instance mode and connection string mode:

    • Select Alibaba Cloud Instance Mode: Directly select the StarRocks instance in the same VPC, and the system automatically obtains the connection information without manual configuration.

    • Select Connection String Mode: Manually enter the internal address/IP, port, and Load URL of the instance.

Scenario 2: Public network connection

Public network transmission carries security risks. We recommend that you use it together with security policies such as whitelists and IP-based authentication.

  • Applicable scenario: You need to access the StarRocks instance through the public network (such as cross-region or local environment access).

  • Supports connection string mode (make sure the StarRocks instance has public network access enabled):

    • Select Connection String Mode: Manually enter the public address/IP, port, and Load URL of the instance.

Note

Serverless resource groups do not have public network access capabilities by default. To connect to a StarRocks instance over its public address, you must configure an Internet NAT gateway and an elastic IP address (EIP) for the VPC to which the resource group is bound, so that the resource group can access the data source over the public network.

If you use Alibaba Cloud EMR Serverless StarRocks, use the internal address or public address as the host address/IP, and use the query port as the port.
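For reference, the host address, query port, and database name combine into a standard MySQL-protocol JDBC URL, because StarRocks is compatible with the MySQL protocol. A small sketch with placeholder values:

```python
def jdbc_url(host, query_port, database):
    """Compose the MySQL-protocol JDBC URL for a StarRocks FE endpoint."""
    return f"jdbc:mysql://{host}:{query_port}/{database}"

# Placeholder internal address; the query port is typically 9030.
print(jdbc_url("fe-internal.example.com", 9030, "didb1"))
# jdbc:mysql://fe-internal.example.com:9030/didb1
```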

  • FE information: You can obtain this from the instance details page.


  • Database: After connecting to the instance using EMR StarRocks Manager, you can view the corresponding database in SQL Editor or Metadata Management.


    Note

    If you need to create a database, you can directly run SQL statements in the SQL editor.

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.

Configure a batch synchronization task to synchronize data of a single table

Appendix: Script demos and parameter descriptions

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Code for StarRocks Reader

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_datasource",
        "column": [
            "id",
            "name"
        ],
        "where": "id>100",
        "table": "table1",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}
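Conceptually, the reader combines column, table, and where into a SELECT statement. The following sketch (the function name is illustrative, not part of the DataWorks API) shows how the sample configuration above maps to SQL:

```python
def build_reader_query(params):
    """Illustrate how reader parameters map to the SELECT that is issued."""
    columns = ", ".join(params["column"])
    sql = f"SELECT {columns} FROM {params['table']}"
    where = params.get("where")
    if where:
        sql += f" WHERE {where}"
    return sql

config = {"column": ["id", "name"], "table": "table1", "where": "id>100"}
print(build_reader_query(config))  # SELECT id, name FROM table1 WHERE id>100
```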

Parameters in code for StarRocks Reader

  • datasource: The name of the StarRocks data source. Required: yes. Default value: none.

  • selectedDatabase: The name of the StarRocks database. Required: no. Default value: the database that is configured in the StarRocks data source.

  • column: The names of the columns from which you want to read data. Required: yes. Default value: none.

  • where: The filter condition. Required: no. Default value: none. In many business scenarios, you need to synchronize only the data of the current day; in this case, you can set the where condition to gmt_create > $bizdate.

    • The where condition is an effective way to implement incremental synchronization of business data.

    • If you do not specify the where parameter, or if you leave its value empty, all data is synchronized.

  • table: The name of the table from which you want to read data. Required: yes. Default value: none.

  • splitPk: The field that is used to shard the data for extraction. Required: no. Default value: none. If you specify splitPk, the data is split based on this field and synchronized by concurrent subtasks, which improves synchronization efficiency. We recommend that you set splitPk to the primary key column of the source table, so that data is evenly distributed across shards instead of being concentrated in only a few of them.
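The splitPk behavior can be illustrated by splitting a numeric primary-key range into near-equal shards, one per concurrent task. This is a conceptual sketch only; DataWorks performs the actual splitting:

```python
def split_ranges(min_pk, max_pk, num_shards):
    """Split the inclusive range [min_pk, max_pk] into num_shards near-equal ranges."""
    step, rem = divmod(max_pk - min_pk + 1, num_shards)
    ranges, lo = [], min_pk
    for i in range(num_shards):
        hi = lo + step + (1 if i < rem else 0) - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Four concurrent tasks would each read one of these id ranges.
print(split_ranges(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```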

Code for StarRocks Writer

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "loadProps": {
            "row_delimiter": "\\x02",
            "column_separator": "\\x01"
        },
        "datasource": "starrocks_public",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "1.1.X.X:8030"
        ],
        "table": "table1",
        "preSql": [
            "truncate table table1"
        ],
        "postSql": [
        ],
        "maxBatchRows": 500000,
        "maxBatchSize": 5242880
    },
    "name": "Writer",
    "category": "writer"
}
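The writer imports data through Stream Load, and the loadProps delimiters determine how each batch is serialized. The following sketch (illustrative, not the actual writer code) shows how rows are joined with the \x01 column separator and the \x02 row delimiter from the example above:

```python
def serialize_rows(rows, column_separator="\x01", row_delimiter="\x02"):
    """Join values with the column separator and rows with the row delimiter,
    mirroring how loadProps delimiters shape the Stream Load CSV payload."""
    return row_delimiter.join(
        column_separator.join(str(value) for value in row) for row in rows
    )

payload = serialize_rows([[1, "alice"], [2, "bob"]])
print(payload.encode())  # b'1\x01alice\x022\x01bob'
```

Choosing non-printable delimiters such as \x01 and \x02 avoids conflicts when the data itself contains tabs or newlines.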

Parameters in code for StarRocks Writer

  • datasource: The name of the StarRocks data source. Required: yes. Default value: none.

  • selectedDatabase: The name of the StarRocks database. Required: no. Default value: the database that is configured in the StarRocks data source.

  • loadProps: The request parameters for the StarRocks Stream Load import method. Required: yes. Default value: none. If you import data as CSV files by using Stream Load, you can configure the following request parameters. If you have no special requirements, set this parameter to {}.

    • column_separator: the column delimiter for CSV import. Default value: \t.

    • row_delimiter: the row delimiter for CSV import. Default value: \n.

    If the data that you want to write to StarRocks contains \t or \n, you must use other characters as delimiters. Example:

    {"column_separator":"\\x01","row_delimiter":"\\x02"}

    Stream Load can also import data in the JSON format. In this case, configure:

    {
      "format": "json"
    }

    The following parameters can be configured for the JSON format:

    • strip_outer_array: specifies whether to strip the outermost array structure. Valid values: true and false. Default value: false.

      In real business scenarios, the JSON data to be imported may be wrapped in a pair of brackets ([]) that represents an array structure at the outermost layer. In this case, we recommend that you set this parameter to true so that StarRocks strips the outer brackets and imports each element of the array as a separate row. If you set this parameter to false, StarRocks parses the entire JSON document as one array and imports it as a single row.

      For example, if the JSON data to be imported is:

      [{"category":1,"author":2},{"category":3,"author":4}]

      • If you set this parameter to true, StarRocks parses {"category":1,"author":2} and {"category":3,"author":4} as two rows and writes them to the destination StarRocks table as two rows.

      • If you set this parameter to false, StarRocks parses the entire JSON array as a single row and writes it to the destination StarRocks table as one row.

    • ignore_json_size: specifies whether to skip the check on the size of the JSON body of the HTTP request.

      Note

      By default, the JSON body of an HTTP request cannot exceed 100 MB. If it does, the following error is reported: The size of this batch exceed the max size [104857600] of json type data data [8617627793].Set ignore_json_size to skip check,although it may lead enormous memory consuming. To avoid this error, you can add ignore_json_size: true to the HTTP request header to skip the size check.

    • compression: specifies the compression algorithm that is used during Stream Load data transfer. Supported algorithms: GZIP, BZIP2, LZ4_FRAME, and ZSTD.

  • column: The names of the columns to which you want to write data. Required: yes. Default value: none.

  • loadUrl: The IP address and HTTP port (default: 8030) of the StarRocks frontend (FE). Required: yes. Default value: none. If there are multiple FE nodes, you can configure all of them and separate them with commas (,).

  • table: The name of the table to which you want to write data. Required: yes. Default value: none.

  • preSql: The SQL statement to execute before the synchronization task runs. For example, you can execute TRUNCATE TABLE tablename to delete outdated data before the task runs. Required: no. Default value: none.

  • postSql: The SQL statement to execute after the synchronization task runs. Required: no. Default value: none.

  • maxBatchRows: The maximum number of rows that can be written in each batch. Required: no. Default value: 500000.

  • maxBatchSize: The maximum number of bytes that can be written in each batch. Required: no. Default value: 5242880.
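The effect of the strip_outer_array setting described above can be illustrated with a short sketch. This is conceptual only; the actual parsing happens inside StarRocks:

```python
import json

doc = '[{"category":1,"author":2},{"category":3,"author":4}]'
parsed = json.loads(doc)

# strip_outer_array=true: each element of the outer array becomes one row.
rows_if_true = parsed
# strip_outer_array=false: the entire document is treated as a single row.
rows_if_false = [parsed]

print(len(rows_if_true), len(rows_if_false))  # 2 1
```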