The StarRocks data source provides a bidirectional channel to read data from and write data to StarRocks. This topic describes how to use DataWorks for StarRocks data synchronization.
Supported versions
All versions of EMR Serverless StarRocks.
EMR on ECS: StarRocks 2.1.
The community edition of StarRocks is supported.
Note: Because DataWorks supports only connections to StarRocks over an internal network, you must deploy the community edition of StarRocks on EMR on ECS.
The community edition of StarRocks is open. If you encounter compatibility issues when you use this data source, you can submit a ticket to provide feedback.
Supported field types
Only numeric, string, and date field types are supported.
Prerequisites for data synchronization (network connectivity)
EMR Serverless StarRocks
To ensure network connectivity, add the IP addresses of your DataWorks resource group to the internal network whitelist of the EMR Serverless StarRocks instance.
For the IP addresses of the DataWorks resource group that you need to add to the whitelist, see General configurations: Add a whitelist.
The following figure shows where to add IP addresses to the whitelist for an EMR Serverless StarRocks instance.

Self-managed StarRocks
Ensure that the DataWorks resource group can access the query port, FE port, and BE port of StarRocks. The default ports are 9030, 8030, and 8040, respectively.
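Before you add the data source, it can help to confirm that the three ports accept TCP connections from a machine in the same network as the resource group. The following is a minimal sketch that uses only the Python standard library. The function name check_ports and the port labels are illustrative, and the sketch only tests TCP reachability, not authentication or StarRocks behavior.

```python
import socket

# Default StarRocks ports named in this topic; adjust if your cluster differs.
DEFAULT_PORTS = {"query": 9030, "fe_http": 8030, "be_http": 8040}


def check_ports(host, ports=None, timeout=3.0):
    """Return {label: bool} showing whether a TCP connection to each port succeeds."""
    ports = DEFAULT_PORTS if ports is None else ports
    results = {}
    for label, port in ports.items():
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[label] = True
        except OSError:
            results[label] = False
    return results
```

For example, check_ports("192.168.0.10") returns a dictionary with one entry per port, where 192.168.0.10 stands in for the address of your StarRocks node. A False entry suggests that a security group, whitelist, or firewall rule is still blocking that port.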
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data Source Management. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.
Select a StarRocks connection mode based on your network environment:
Scenario 1: Internal network connection (Recommended)
Connections over an internal network have low latency and provide more secure data transmission. No additional public network permissions are required.
Scenario: Your StarRocks instance and the serverless resource group are in the same VPC.
Supported connection modes:
Select Alibaba Cloud Instance Mode. When you select the StarRocks instance in the same VPC, the system automatically retrieves the connection information.
Select Connection String Mode to manually enter the internal address or IP address, port, and Load URL of the instance.
Scenario 2: Internet connection
Transmitting data over the Internet poses security risks. Use security policies such as whitelists and IP address authentication to mitigate these risks.
Scenario: You need to access the StarRocks instance over the Internet, for purposes such as cross-region or on-premises access.
Supported connection mode: Connection String Mode. Ensure that public network access is enabled for the StarRocks instance.
Select Connection String Mode and manually enter the public address or IP address, port, and Load URL of the instance.
By default, serverless resource groups cannot access the Internet. To connect to a StarRocks instance using a public IP address, you must configure an Internet NAT gateway and elastic IP addresses (EIPs) for the attached VPC. This enables the resource group to access the data source over the Internet. Also, ensure that the serverless resource group can access the query port, FE port, and BE port of StarRocks. The default ports are 9030, 8030, and 8040, respectively.
If you use Alibaba Cloud EMR StarRocks Serverless, set Host Address/IP to the Internal Address or Public Address, and set the port to the Query Port.
FE: You can find this value on the instance details page.

Database: After you connect to the instance using EMR StarRocks Manager, you can access the database from the SQL Editor or Data Management.
Note: To create a database, you can execute SQL commands directly in the SQL Editor.
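The values that Connection String Mode asks for can be assembled from the host address and the two ports. The sketch below is illustrative only: the function name and the dictionary keys are hypothetical and are not DataWorks field names. It assumes that the query port speaks the MySQL protocol, so a MySQL-style JDBC URL is used, and that the Load URL is the FE host followed by the FE HTTP port.

```python
def build_connection_settings(host, database, query_port=9030, fe_http_port=8030):
    """Assemble illustrative connection values for Connection String Mode.

    The keys "jdbc_url" and "load_url" are hypothetical names for this sketch.
    """
    if not host or not database:
        raise ValueError("host and database are required")
    return {
        # The query port is used for SQL access over the MySQL protocol.
        "jdbc_url": f"jdbc:mysql://{host}:{query_port}/{database}",
        # The Load URL points at the FE HTTP port used for StreamLoad.
        "load_url": [f"{host}:{fe_http_port}"],
    }
```

Calling build_connection_settings("10.0.0.5", "didb1") yields a JDBC URL on port 9030 and a Load URL on port 8030. The address 10.0.0.5 is a placeholder for your own internal or public address.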
Develop a data synchronization task
For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.
Guide to configuring a single-table offline synchronization task
For the procedure, see Configure in codeless UI and Configure in code editor.
For all parameters and a script demo for configuring tasks in the code editor, see Appendix: Script demo and parameter description.
Appendix: Script demo and parameter description
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Reader script demo
{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_datasource",
        "column": [
            "id",
            "name"
        ],
        "where": "id>100",
        "table": "table1",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}
Reader script parameters
Parameter | Description | Required | Default value |
datasource | The name of the StarRocks data source. | Yes | None |
selectedDatabase | The name of the StarRocks database. | No | The database name configured in the StarRocks data source. |
column | The columns to read from the source table. | Yes | None |
where | The filter condition. In many scenarios, you might synchronize only the data for the current day. To do this, specify a date-based filter as the where condition. | No | None |
table | The name of the source table. | Yes | None |
splitPk | When StarRocks Reader extracts data, specifying splitPk tells the system to use this field for data sharding. This starts concurrent tasks and improves the efficiency of data synchronization. We recommend you use the primary key of the table for splitPk. Primary keys are usually distributed evenly, which helps prevent data hot spots in the shards. | No | None |
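To illustrate why an evenly distributed splitPk helps, the following sketch divides an integer key range into contiguous shards and produces one filter per shard. This is a simplified illustration, not the sharding algorithm that StarRocks Reader actually uses. The function name split_where_clauses and its parameters are hypothetical, and the sketch assumes an integer key whose minimum and maximum values are already known.

```python
def split_where_clauses(split_pk, min_value, max_value, shards, base_where=None):
    """Divide [min_value, max_value] into contiguous integer ranges.

    Returns one WHERE clause per shard, optionally combined with base_where.
    """
    if shards < 1:
        raise ValueError("shards must be at least 1")
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    total = max_value - min_value + 1
    shards = min(shards, total)  # never create empty shards
    base, extra = divmod(total, shards)
    clauses = []
    lower = min_value
    for i in range(shards):
        # The first `extra` shards take one additional key each.
        upper = lower + base + (1 if i < extra else 0) - 1
        clause = f"{split_pk} >= {lower} AND {split_pk} <= {upper}"
        if base_where:
            clause = f"({base_where}) AND {clause}"
        clauses.append(clause)
        lower = upper + 1
    return clauses
```

With the values from the script demo, split_where_clauses("id", 101, 110, 3, "id>100") produces three ranges covering 101 to 104, 105 to 107, and 108 to 110. If the key values were heavily skewed, ranges of equal width would hold very different row counts, which is the hot spot problem that a primary key usually avoids.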
Writer script demo
{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "loadProps": {
            "row_delimiter": "\\x02",
            "column_separator": "\\x01"
        },
        "datasource": "starrocks_public",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "1.1.X.X:8030"
        ],
        "table": "table1",
        "preSql": [
            "truncate table table1"
        ],
        "postSql": [
        ],
        "maxBatchRows": 500000,
        "maxBatchSize": 5242880,
        "strategyOnError": "exit"
    },
    "name": "Writer",
    "category": "writer"
}
Writer script parameters
Parameter | Description | Required | Default value |
datasource | The name of the StarRocks data source. | Yes | None |
selectedDatabase | The name of the StarRocks database. | No | The database name configured in the StarRocks data source. |
loadProps | The request parameters for StarRocks StreamLoad. When you import data in CSV format by using StreamLoad, configure the import parameters here, such as column_separator and row_delimiter. If you have no special configurations, use {}. If your data contains \t or \n, you must use other characters as delimiters, for example \\x01 as the column separator and \\x02 as the row delimiter, as in the Writer script demo. StreamLoad also supports importing data in JSON format, which has its own configurable parameters. | Yes | None |
column | The columns to write to the destination table. | Yes | None |
loadUrl | The IP address and HTTP port of the StarRocks frontend (FE), in the format IP:port. The default port is 8030. | Yes | None |
table | The name of the destination table. | Yes | None |
preSql | An SQL statement that is executed before the data synchronization task runs. For example, use truncate table tablename to clear old data from a table before the task starts. | No | None |
postSql | An SQL statement that is executed after the data synchronization task finishes. | No | None |
maxBatchRows | The maximum number of rows to write per batch. | No | 500000 |
maxBatchSize | The maximum number of bytes to write per batch. | No | 5242880 |
strategyOnError | The policy for handling exceptions during batch writing to StarRocks. Default value: exit. | No | exit |
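The interaction between maxBatchRows, maxBatchSize, and the delimiters in loadProps can be sketched as follows. This is an illustration under stated assumptions, not the implementation of StarRocks Writer: the function name iter_batches is hypothetical, a batch is assumed to be flushed when either limit would be exceeded, NULL handling is omitted, and the delimiters are passed as the actual control characters rather than the escaped "\\x01" and "\\x02" strings used in the script.

```python
def iter_batches(rows, max_batch_rows=500000, max_batch_size=5242880,
                 column_separator="\x01", row_delimiter="\x02"):
    """Yield CSV-style byte payloads that respect both batch limits.

    A single row larger than max_batch_size is still emitted on its own.
    """
    delimiter = row_delimiter.encode("utf-8")
    buffer = []  # encoded rows waiting to be flushed
    size = 0     # payload size in bytes, including delimiters
    for row in rows:
        encoded = column_separator.join(str(v) for v in row).encode("utf-8")
        added = len(encoded) + (len(delimiter) if buffer else 0)
        if buffer and (len(buffer) >= max_batch_rows or size + added > max_batch_size):
            yield delimiter.join(buffer)
            buffer, size = [], 0
            added = len(encoded)
        buffer.append(encoded)
        size += added
    if buffer:
        yield delimiter.join(buffer)
```

For example, list(iter_batches([(1, "a"), (2, "b"), (3, "c")], max_batch_rows=2)) yields two payloads, the first holding two rows and the second holding one. Because the delimiters are control characters, tabs and line breaks inside the data do not break the row boundaries.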