DataWorks:StarRocks

Last Updated:Mar 26, 2026

The StarRocks data source lets you read data from and write data to StarRocks in DataWorks Data Integration tasks.

Supported versions

Deployment type | Supported versions
EMR Serverless StarRocks | All versions
E-MapReduce on ECS | StarRocks 2.1
StarRocks Community Edition | Supported (must be deployed on E-MapReduce on ECS)
DataWorks connects to StarRocks only over an internal network. If you use StarRocks Community Edition, deploy it on E-MapReduce on ECS. If you encounter compatibility issues, submit a ticket.

Supported data types

Only numeric, string, and date data types are supported.

Limitations

  • For real-time synchronization of an entire database from MySQL to StarRocks, the destination StarRocks table must use a primary key model.

  • During real-time synchronization of an entire database from MySQL to StarRocks, DDL operations other than TRUNCATE are not supported. Configure the task to either ignore these DDL operations or report an error.

Network connectivity

Before adding StarRocks as a data source, make sure the DataWorks resource group can reach your StarRocks instance.

EMR Serverless StarRocks

Add the IP addresses of the DataWorks resource group to the IP allowlist of your EMR Serverless StarRocks instance.

Self-managed StarRocks

Make sure the DataWorks resource group can reach the query port, FE port, and BE port of your StarRocks instance. The default ports are 9030 (query port), 8030 (FE HTTP port), and 8040 (BE HTTP port).

Add a data source

Before developing a synchronization task, add StarRocks as a data source in DataWorks. For instructions, see Data source management. Parameter descriptions are available in the DataWorks console when you add the data source.

Select a connection mode based on your network environment:

Internal network connection (recommended)

Use this mode when your StarRocks instance and the Serverless resource group are in the same Virtual Private Cloud (VPC). An internal network connection provides low latency without exposing data to the public internet.

Both Alibaba Cloud instance mode and Connection string mode are supported:

  • Alibaba Cloud instance mode: Select a StarRocks instance in the same VPC directly. The system retrieves the connection information automatically.

  • Connection string mode: Enter the internal address or IP address, port, and Load URL of the instance.

Public network connection

Use this mode when you need to access StarRocks over the public network, for example, for cross-region access or from an on-premises environment. Only Connection string mode is supported.

  • Enable public network access on your StarRocks instance before connecting.

  • Connection string mode: Enter the public address or IP address, port, and Load URL of the instance.

Serverless resource groups cannot access the public network by default. To connect to StarRocks over a public endpoint, configure a NAT Gateway and an Elastic IP address (EIP) for the bound VPC. Make sure the Serverless resource group can reach the query port, FE port, and BE port (defaults: 9030, 8030, and 8040).

For EMR Serverless StarRocks, set Host Address/IP Address to the internal endpoint or public network address of the instance, and set the port to the query port.

  • FE: Get the FE information from the instance details page.

    [Figure: FE information on the instance details page]

  • Database: After connecting through EMR StarRocks Manager, view databases in SQL Editor or Metadata Management.

    To create a database, run SQL commands directly in the SQL editor.

    [Figure: Database view in SQL Editor]

Configure a synchronization task

Select a configuration guide based on your synchronization type.

Synchronization type | Supported sources | Configuration guide
Batch synchronization for a single table | All data source types supported by Data Integration | Codeless UI / Code Editor
Real-time synchronization for a single table | Kafka | Configure a full-database real-time synchronization task
Batch synchronization for a full database | MySQL | Configure a full-database real-time synchronization task
Real-time synchronization for a full database | MySQL, Oracle, and PolarDB | Configure a full-database real-time synchronization task

For batch synchronization using the Code Editor, see the parameter reference in Appendix: Code and parameters.

Appendix: Code and parameters

Configure a batch synchronization task using the Code Editor

When configuring a batch synchronization task in the Code Editor, set the parameters in the script according to the unified script format. For format requirements, see Use the Code Editor.

Reader

Code example

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_datasource",
        "column": [
            "id",
            "name"
        ],
        "where": "id>100",
        "table": "table1",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}

Parameters

Parameter | Required | Default | Description
datasource | Yes | None | The name of the StarRocks data source.
table | Yes | None | The source table.
column | Yes | None | The column names to synchronize. To add a SET_VAR hint, prepend it to the first column name. For example, to add SET_VAR(enable_spill = true) when reading the id column, configure column as ["/*+ SET_VAR(enable_spill = true)*/ id"].
selectedDatabase | No | Database from the data source configuration | The name of the StarRocks database.
where | No | None | A filter condition for incremental synchronization. For example, gmt_create>${bizdate} syncs records created on the current business date. Omit this parameter for a full sync.
splitPk | No | None | The field used to shard data for concurrent synchronization. Use the table's primary key for even data distribution and to avoid hot spots.
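
The optional reader parameters can be combined in one script. The following is a minimal sketch, not a definitive configuration: it reuses the data source and table names from the code example above, and it assumes that table1 has a gmt_create column (a hypothetical column used only for illustration). The SET_VAR hint is attached to the first column, where filters for an incremental read, and selectedDatabase is omitted so that the database from the data source configuration applies.

```json
{
    "stepType": "starrocks",
    "parameter": {
        "datasource": "starrocks_datasource",
        "table": "table1",
        "column": [
            "/*+ SET_VAR(enable_spill = true)*/ id",
            "name"
        ],
        "where": "gmt_create>${bizdate}",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}
```

Removing the where entry turns the same script into a full sync.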

Writer

Code example

{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "loadProps": {
            "row_delimiter": "",
            "column_separator": ""
        },
        "datasource": "starrocks_public",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "1.1.X.X:8030"
        ],
        "table": "table1",
        "preSql": [
            "truncate table table1"
        ],
        "postSql": [],
        "maxBatchRows": 500000,
        "maxBatchSize": 5242880,
        "strategyOnError": "exit"
    },
    "name": "Writer",
    "category": "writer"
}

Parameters

Parameter | Required | Default | Description
datasource | Yes | None | The name of the StarRocks data source.
table | Yes | None | The destination table.
column | Yes | None | The destination columns to write data to.
loadUrl | Yes | None | The FE IP address and HTTP port (default: 8030). For multiple FE nodes, separate entries with a comma. For example: ["192.168.1.1:8030","192.168.1.2:8030"].
loadProps | Yes | None | Request parameters for the StarRocks StreamLoad job. Set to {} if no special configuration is needed. See loadProps parameters for details.
selectedDatabase | No | Database from the data source configuration | The name of the StarRocks database.
preSql | No | None | An SQL statement to run before the synchronization task starts. For example: TRUNCATE TABLE table1.
postSql | No | None | An SQL statement to run after the synchronization task completes.
maxBatchRows | No | 500000 | The maximum number of rows per write batch.
maxBatchSize | No | 5242880 | The maximum data size per write batch, in bytes.
strategyOnError | No | exit | The error handling policy for batch writes. exit: fail and exit the task if an error occurs. batchDirtyData: record the failed batch as dirty data and continue.
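
As an illustration of the parameters above, the following sketch writes to a cluster with two FE nodes, passes an empty loadProps object, and records failed batches as dirty data instead of exiting. The data source name, table, columns, and FE addresses are placeholders taken from the examples in this section, and the maxBatchRows value of 100000 is an arbitrary illustrative choice, not a recommendation.

```json
{
    "stepType": "starrocks",
    "parameter": {
        "datasource": "starrocks_datasource",
        "table": "table1",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "192.168.1.1:8030",
            "192.168.1.2:8030"
        ],
        "loadProps": {},
        "maxBatchRows": 100000,
        "strategyOnError": "batchDirtyData"
    },
    "name": "Writer",
    "category": "writer"
}
```

Parameters that are left out, such as maxBatchSize and selectedDatabase, fall back to the defaults listed in the table.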

loadProps parameters

loadProps configures the underlying StreamLoad import job. The available parameters depend on the import format.

CSV format (default)

Parameter | Default | Description
column_separator | \t | The column separator.
row_delimiter | \n | The row delimiter.

If your data contains the default delimiter characters, specify alternative delimiters:

{"column_separator": "<your-separator>", "row_delimiter": "<your-delimiter>"}
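
One common choice is a pair of non-printable characters that are unlikely to appear in the data. The sketch below assumes that StreamLoad accepts hex-escaped separators of the form \x01 and \x02, as described in the StarRocks Stream Load documentation. Treat the exact values as an assumption and verify them against your StarRocks version. In JSON, the backslash itself must be escaped.

```json
{
    "column_separator": "\\x01",
    "row_delimiter": "\\x02"
}
```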

JSON format

Set "format": "json" to import data in JSON format:

{
    "format": "json"
}

Additional parameters for JSON imports:

Parameter | Default | Description
strip_outer_array | false | Whether to strip the outermost array [] and import each element as a separate row. Set to true when the JSON data is wrapped in an outer array.
ignore_json_size | | Whether to bypass the 100 MB JSON body size check. If the JSON body exceeds 100 MB, set this to true to skip the check.
compression | | The compression algorithm for data transmission. Supported values: GZIP, BZIP2, LZ4_FRAME, ZSTD.
strict_mode | false | Whether to enable strict mode. true: filter out rows with conversion errors and import only valid rows. false: convert fields that fail type conversion to NULL and import them along with valid rows.
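
These options go in the same loadProps object as format. The following is a hedged sketch of a loadProps value for a JSON import whose payload is wrapped in an outer array and whose rows with conversion errors should be filtered out. Writing the flags as JSON booleans is an assumption; if your environment expects strings, use "true" instead.

```json
{
    "format": "json",
    "strip_outer_array": true,
    "strict_mode": true
}
```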