DataWorks: StarRocks Data Source

Last Updated: Mar 22, 2025

The StarRocks data source offers a bidirectional channel for reading from and writing to StarRocks. This topic describes the capabilities of DataWorks for synchronizing data with StarRocks data sources.

Supported versions

  • E-MapReduce (EMR) Serverless StarRocks 2.5 and 3.1 are supported.

  • EMR on ECS: StarRocks 2.1 is supported.

  • The StarRocks Community Edition is supported. For details about the Community Edition, visit the StarRocks official website.

    Note
    • DataWorks only supports internal network connections to StarRocks; hence, the Community Edition must be deployed on EMR on ECS.

    • The Community Edition of StarRocks is highly open. If you experience compatibility issues with the data source, submit a ticket for feedback.

Supported field types

Most StarRocks data types, including numeric, string, and date types, are supported.

Preparations before data synchronization

To ensure network connectivity for a resource group, add the IP address or CIDR block of the resource group to the internal IP address whitelist of the desired EMR Serverless StarRocks instance beforehand. Additionally, allow the CIDR block to access ports 9030, 8030, and 8040.

  • For more information about whitelisting IP addresses of DataWorks resource groups, see add a whitelist.

  • The following figure shows how to access the IP address whitelists of an EMR Serverless StarRocks instance.

    [Figure: IP address whitelists of an EMR Serverless StarRocks instance]
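The connectivity requirement above can be spot-checked with a small script. The following is a minimal sketch, not a DataWorks tool: it only tests whether a TCP connection opens on ports 9030, 8030, and 8040 from the machine where it runs, so the result is meaningful only if that machine sits on the same network path as the resource group. The function name check_ports and the host in the usage comment are hypothetical.

```python
import socket

# Ports named in this topic. The host is something you supply yourself.
STARROCKS_PORTS = (9030, 8030, 8040)


def check_ports(host, ports=STARROCKS_PORTS, timeout=3.0):
    """Return {port: bool}, where True means a TCP connection could be opened."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the host and performs the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and name resolution failures.
            results[port] = False
    return results


# Usage (hypothetical internal address, replace with your own):
# for port, ok in check_ports("fe-internal.example.com").items():
#     print(port, "reachable" if ok else "unreachable")
```

A False result does not say why the connection failed. A missing whitelist entry, a blocked port, and a wrong address all look the same at this level.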

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. When you add a data source, the infotips in the DataWorks console explain the meaning of each parameter.

When creating a StarRocks data source, you can use either the Alibaba Cloud instance mode or the connection string mode:

  • Alibaba Cloud instance mode lets you directly select a StarRocks instance created in the same VPC on Alibaba Cloud.

  • Connection string mode lets you connect by using the host address or IP address, port, and Load URL of StarRocks. This mode works with internal network instances in the same VPC and with public network addresses.

    Note

    Serverless resource groups do not have public network access by default. To connect to a StarRocks instance using a public network address, configure a public NAT Gateway and EIP for the associated VPC to enable public network access to the data source.

    For Alibaba Cloud EMR Serverless StarRocks, use the Internal Address or Public Address for the Host Address/IP, and the Query Port for the port.

    • FE node information: Obtain FE node details on the instance's details tab.

      [Figure: FE node details on the instance details tab]

    • Database: After connecting to the instance using EMR StarRocks Manager, view the corresponding database in the SQL Editor or Metadata Management.

      [Figure: databases shown in the SQL Editor or Metadata Management of EMR StarRocks Manager]

      Note

      To create a database, run SQL statements directly in the SQL editor.

Develop a data synchronization task

For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guide.

Configure a batch synchronization task to synchronize data of a single table

Appendix: Code and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Code for StarRocks reader

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_datasource",
        "column": [
            "id",
            "name"
        ],
        "where": "id>100",
        "table": "table1",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}

Parameters in code for StarRocks reader

| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the StarRocks data source. | Yes | None |
| selectedDatabase | The name of the StarRocks database. | No | The name of the database that is configured in the StarRocks data source. |
| column | The names of the columns from which you want to read data. | Yes | None |
| where | The filter condition. In actual business scenarios, a common pattern is to synchronize only the data of the current day by setting the where condition to gmt_create>$bizdate. A where condition is an effective way to perform incremental synchronization of business data. If where is not specified, or its key or value is left empty, all data is synchronized. | No | None |
| table | The name of the table from which you want to read data. | Yes | None |
| splitPk | The field used for data sharding. If you specify splitPk, StarRocks Reader shards the data on that field and starts concurrent tasks, which improves synchronization efficiency. We recommend that you set splitPk to the primary key column of the source table, so that data is evenly distributed across shards instead of being concentrated in a few of them. | No | None |
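The effect of the optional reader parameters can be illustrated by generating the step from a script. The following is a minimal sketch that builds a dictionary in the same shape as the reader code in this topic. The helper name build_reader_step is hypothetical and is not part of DataWorks; the sample values come from the example in this topic.

```python
import json


def build_reader_step(datasource, table, columns, where=None,
                      split_pk=None, database=None):
    """Assemble a StarRocks reader step in the shape shown in this topic."""
    parameter = {
        "datasource": datasource,
        "column": list(columns),
        "table": table,
    }
    # Optional keys are added only when set. Leaving out "where" means
    # that the task synchronizes all data in the table.
    if database:
        parameter["selectedDatabase"] = database
    if where:
        parameter["where"] = where
    if split_pk:
        parameter["splitPk"] = split_pk
    return {
        "stepType": "starrocks",
        "parameter": parameter,
        "name": "Reader",
        "category": "reader",
    }


# Incremental synchronization: filter on the business date, shard on the key.
incremental = build_reader_step(
    "starrocks_datasource", "table1", ["id", "name"],
    where="gmt_create>$bizdate", split_pk="id", database="didb1",
)

# Full synchronization: no where condition at all.
full = build_reader_step("starrocks_datasource", "table1", ["id", "name"])

print(json.dumps(incremental, indent=4))
```

Omitting the where key entirely, rather than writing an empty string, keeps the generated script unambiguous about the intent to run a full synchronization.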

Code for StarRocks writer

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "loadProps": {
            "row_delimiter": "\\x02",
            "column_separator": "\\x01"
        },
        "datasource": "starrocks_public",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "1.1.X.X:8030"
        ],
        "table": "table1",
        "preSql": [
            "truncate table table1"
        ],
        "postSql": [
        ],
        "maxBatchRows": 500000,
        "maxBatchSize": 5242880
    },
    "name": "Writer",
    "category": "writer"
}
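Before you submit a writer step like the one above, a quick local sanity check can catch simple mistakes. The following is a minimal sketch under stated assumptions: the required keys are taken from the Required column of the writer parameter table in this topic, and the host:port and positive-integer checks are illustrative rules of this sketch, not validation that DataWorks itself is known to perform. The function name validate_writer_step is hypothetical.

```python
def validate_writer_step(step):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    params = step.get("parameter", {})

    # Keys marked as required in the writer parameter table of this topic.
    for key in ("datasource", "loadProps", "column", "loadUrl", "table"):
        if key not in params:
            problems.append(f"missing required parameter: {key}")

    # Each loadUrl entry is expected to be an address plus an HTTP port.
    for url in params.get("loadUrl", []):
        host, sep, port = url.rpartition(":")
        if not (sep and host and port.isdigit()):
            problems.append(f"loadUrl entry is not host:port: {url}")

    # Batch limits, when present, should be positive integers.
    for key in ("maxBatchRows", "maxBatchSize"):
        value = params.get(key)
        if value is not None and (not isinstance(value, int) or value <= 0):
            problems.append(f"{key} must be a positive integer")
    return problems


# The writer example from this topic, expressed as a Python dictionary.
sample = {
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "loadProps": {"row_delimiter": "\\x02", "column_separator": "\\x01"},
        "datasource": "starrocks_public",
        "column": ["id", "name"],
        "loadUrl": ["1.1.X.X:8030"],
        "table": "table1",
        "preSql": ["truncate table table1"],
        "postSql": [],
        "maxBatchRows": 500000,
        "maxBatchSize": 5242880,
    },
    "name": "Writer",
    "category": "writer",
}

print(validate_writer_step(sample))  # prints []
```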

Parameters in code for StarRocks writer

| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the StarRocks data source. | Yes | None |
| selectedDatabase | The name of the StarRocks database. | No | The name of the database that is configured in the StarRocks data source. |
| loadProps | The request parameters for the StarRocks Stream Load import method. Configure them if you import data as CSV files by using Stream Load. If you have no special requirements, set the parameter to {}. Request parameters that you can configure: column_separator, the column delimiter for CSV import (default value: \t); row_delimiter, the row delimiter for CSV import (default value: \n). If the data that you want to write to StarRocks contains \t or \n, you must use other characters as delimiters. Example: {"column_separator": "\\x01", "row_delimiter": "\\x02"} | Yes | None |
| column | The names of the columns to which you want to write data. | Yes | None |
| loadUrl | The URL of a StarRocks frontend node. The URL consists of the IP address of the frontend node and the HTTP port number. The default HTTP port number is 8030. If you specify URLs for multiple frontend nodes, separate them with commas (,). | Yes | None |
| table | The name of the table to which you want to write data. | Yes | None |
| preSql | The SQL statement that you want to execute before the synchronization task is run. For example, you can execute the TRUNCATE TABLE tablename statement to delete outdated data before the synchronization task is run. | No | None |
| postSql | The SQL statement that you want to execute after the synchronization task is run. | No | None |
| maxBatchRows | The maximum number of rows of data written each time. | No | 500000 |
| maxBatchSize | The maximum number of bytes written each time. | No | 5242880 |
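The delimiter rule for loadProps can be sketched as a small helper that inspects sample rows and returns either {} or the replacement delimiters from the example in this topic. This is an illustration only: the function name choose_load_props is hypothetical, and scanning the data in a script is an assumption of this sketch, not something DataWorks does for you.

```python
def choose_load_props(rows):
    """Pick Stream Load CSV delimiters that do not collide with the data."""
    values = [str(value) for row in rows for value in row]
    has_tab = any("\t" in value for value in values)
    has_newline = any("\n" in value for value in values)

    # The defaults (\t and \n) are safe, so no request parameters are needed.
    if not has_tab and not has_newline:
        return {}

    # The replacement delimiters must not occur in the data either.
    if any("\x01" in value or "\x02" in value for value in values):
        raise ValueError("data also contains \\x01 or \\x02; choose other delimiters")

    # Same escaped form as the loadProps example in this topic.
    return {"column_separator": "\\x01", "row_delimiter": "\\x02"}


print(choose_load_props([(1, "plain text")]))
print(choose_load_props([(2, "first line\nsecond line")]))
```

Checking only a sample of rows can miss a collision in the rest of the table, so treat the result as a hint rather than a guarantee.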