The StarRocks data source lets you read data from and write data to StarRocks. This topic describes the capabilities that DataWorks supports for StarRocks data synchronization.
Supported versions
All versions of EMR Serverless StarRocks are supported.
E-MapReduce on ECS: StarRocks 2.1.
StarRocks Community Edition is supported.
Note: DataWorks connects to StarRocks only over an internal network. Therefore, StarRocks Community Edition must be deployed on E-MapReduce on ECS.
If you encounter compatibility issues with this data source, submit a ticket.
Limitations
For real-time synchronization of an entire database from MySQL to StarRocks, the destination StarRocks table must use a primary key model.
When you perform real-time synchronization of an entire database from MySQL to StarRocks, DDL operations other than TRUNCATE are not supported. You can either ignore these DDL operations or configure the task to report an error.
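The two DDL-handling options described above can be sketched as a small policy function. This is an illustrative sketch only; the function name, policy values, and return values are hypothetical and are not DataWorks APIs.

```python
# Illustrative sketch of how a real-time sync task might dispatch DDL events
# from the MySQL source. Only TRUNCATE is applied to the StarRocks destination;
# for everything else the task either skips the event or reports an error.

SUPPORTED_DDL = {"TRUNCATE"}

def handle_ddl(ddl_type: str, policy: str = "ignore") -> str:
    """Return the action taken for a DDL event (hypothetical helper)."""
    if ddl_type.upper() in SUPPORTED_DDL:
        return "apply"   # forwarded to the destination table
    if policy == "ignore":
        return "skip"    # drop the event and keep synchronizing
    raise RuntimeError(f"unsupported DDL in real-time sync: {ddl_type}")
```

For example, `handle_ddl("TRUNCATE")` returns `"apply"`, `handle_ddl("ALTER")` returns `"skip"`, and `handle_ddl("ALTER", policy="error")` raises an error, matching the two configurable behaviors.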
Supported data types
Only numeric, string, and date data types are supported.
Network connectivity
EMR Serverless StarRocks
To ensure network connectivity, you must add the IP addresses of the DataWorks resource group to the IP address allowlist of the EMR Serverless StarRocks instance.
For the IP addresses of DataWorks resource groups, see General configurations: Add IP addresses to an allowlist.
You can add IP addresses to the allowlist of an EMR Serverless StarRocks instance in the EMR console.

Self-managed StarRocks
Ensure that the DataWorks resource group can access the query port, FE port, and BE port of your StarRocks instance. The default ports are 9030, 8030, and 8040.
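To verify from a machine in the same network that the three ports are reachable, a quick TCP check can help. This is a generic connectivity sketch, not a DataWorks tool, and the host name below is a placeholder for your instance's internal address.

```python
import socket

def check_ports(host: str, ports, timeout: float = 2.0) -> dict:
    """Return {port: True/False} based on whether a TCP connection succeeds."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:          # covers refused connections, timeouts, DNS errors
            results[port] = False
    return results

# Check the default query, FE, and BE ports of a StarRocks instance.
# Replace the placeholder host with your instance's internal address.
# print(check_ports("fe.example.internal", [9030, 8030, 8040]))
```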
Add a data source
Before you develop a synchronization task in DataWorks, add the required data source to DataWorks by following the instructions in Data source management. When you add the data source, you can view the parameter descriptions in the DataWorks console.
Select a connection mode for StarRocks based on your network environment:
Use case 1: Internal network connection (recommended)
An internal network connection provides low latency and high security without requiring public network access.
Use case: Your StarRocks instance and the Serverless resource group are in the same VPC.
Both Alibaba Cloud instance mode and Connection string mode are supported:
Alibaba Cloud instance mode: Select a StarRocks instance in the same VPC. The system automatically obtains the connection information.
Connection string mode: Manually enter the internal endpoint or IP address, port, and load URL of the instance.
Use case 2: Public network connection
Data transfer over the public network has security risks. Use security controls such as IP allowlists and IP-based authentication.
Use case: You need to access the StarRocks instance over the public network, for example, for cross-region access or from an on-premises environment.
Only Connection string mode is supported. Ensure that public network access is enabled for your StarRocks instance:
Connection string mode: Manually enter the public endpoint or IP address, port, and load URL of the instance.
By default, Serverless resource groups cannot access the public network. To connect to a StarRocks instance by using a public endpoint, you must configure a NAT Gateway and an Elastic IP address (EIP) for the bound VPC to enable public network access. You must also ensure that the Serverless resource group can access the query port, FE port, and BE port of your StarRocks instance. The default ports are 9030, 8030, and 8040.
If you use EMR Serverless StarRocks, set Host Address/IP Address to the internal or public endpoint, and set the port to the query port.
FE: You can obtain the FE information on the instance details page.

Database: After connecting to the instance by using EMR StarRocks Manager, you can view the corresponding databases in the SQL Editor or Metadata Management.
Note: To create a database, run SQL statements directly in the SQL Editor.
Data synchronization task development
For the entry points and procedures for configuring synchronization tasks, see the following configuration guides.
Batch synchronization for a single table
Supported data sources: All data source types supported by the Data Integration module.
Procedure: For more information, see Use the Codeless UI and Use the Code Editor.
For the full list of parameters and a code example for the Code Editor, see Appendix: Code and parameters.
Real-time synchronization for a single table
Supported data sources: Kafka
Guide: Configure a full-database real-time synchronization task
Batch synchronization for a full database
Supported data sources: MySQL
Guide: Configure a full-database real-time synchronization task
Real-time synchronization for a full database
Supported data sources: MySQL, Oracle, and PolarDB
Guide: Configure a full-database real-time synchronization task
Appendix: Code and parameters
Configure a batch synchronization task by using the code editor
If you configure a batch synchronization task by using the code editor, you must set the parameters in the script based on the unified script format requirements. For more information, see Use the code editor. The following sections describe the parameters that you must configure for this data source.
Reader code example
{
  "stepType": "starrocks",
  "parameter": {
    "selectedDatabase": "didb1",
    "datasource": "starrocks_datasource",
    "column": [
      "id",
      "name"
    ],
    "where": "id>100",
    "table": "table1",
    "splitPk": "id"
  },
  "name": "Reader",
  "category": "reader"
}

Reader parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| datasource | The name of the StarRocks data source. | Yes | None |
| selectedDatabase | The name of the StarRocks database. | No | The database name that you specified when you configured the StarRocks data source. |
| column | The columns in the source table to be synchronized. To add a SET_VAR hint when you read data from StarRocks, place the hint before the first column name in column. | Yes | None |
| where | The WHERE clause used to filter the data to be synchronized. In real-world scenarios, a common filter synchronizes only the data of the current day, for example by filtering on a date or time column. | No | None |
| table | The name of the source table. | Yes | None |
| splitPk | The field used to shard data. When StarRocks Reader extracts data, it shards the data based on this field and starts concurrent readers to improve synchronization efficiency. We recommend that you use the table's primary key as the split key because primary keys are typically evenly distributed, which helps prevent data hot spots in the shards. | No | None |
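Conceptually, splitPk-based sharding divides the value range of the split key into contiguous intervals, one per concurrent reader. The sketch below shows the idea for a numeric key; the actual reader's splitting logic may differ.

```python
def split_ranges(min_pk: int, max_pk: int, num_shards: int):
    """Divide [min_pk, max_pk] into num_shards contiguous ranges.

    Each range becomes a filter condition that one concurrent reader
    scans independently (illustrative only, not the reader's exact algorithm).
    """
    span = max_pk - min_pk + 1
    step, extra = divmod(span, num_shards)
    ranges, lo = [], min_pk
    for i in range(num_shards):
        hi = lo + step - 1 + (1 if i < extra else 0)  # spread the remainder
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# With splitPk "id" over ids 1..100 and 4 concurrent readers:
# split_ranges(1, 100, 4) -> [(1, 25), (26, 50), (51, 75), (76, 100)]
```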
Writer code example
{
  "stepType": "starrocks",
  "parameter": {
    "selectedDatabase": "didb1",
    "loadProps": {
      "row_delimiter": "",
      "column_separator": ""
    },
    "datasource": "starrocks_public",
    "column": [
      "id",
      "name"
    ],
    "loadUrl": [
      "1.1.X.X:8030"
    ],
    "table": "table1",
    "preSql": [
      "truncate table table1"
    ],
    "postSql": [],
    "maxBatchRows": 500000,
    "maxBatchSize": 5242880,
    "strategyOnError": "exit"
  },
  "name": "Writer",
  "category": "writer"
}

Writer parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| datasource | The name of the StarRocks data source. | Yes | None |
| selectedDatabase | The name of the StarRocks database. | No | The database name that you specified when you configured the StarRocks data source. |
| loadProps | The request parameters for the StarRocks Stream Load job, such as the row delimiter and column separator for imports in the CSV format. If your data contains the delimiter characters, specify other characters as delimiters. Stream Load also supports imports in the JSON format, which you can enable by setting the format parameter to json. | Yes | None |
| column | The destination columns to write data to. | Yes | None |
| loadUrl | The StarRocks frontend (FE) IP address and HTTP port (default 8030). | Yes | None |
| table | The name of the destination table. | Yes | None |
| preSql | The SQL statement to execute before the synchronization task starts. For example, TRUNCATE TABLE tablename clears the existing data in the table. | No | None |
| postSql | The SQL statement to execute after the synchronization task finishes. | No | None |
| maxBatchRows | The maximum number of rows per write batch. | No | 500000 |
| maxBatchSize | The maximum amount of data per write batch, in bytes. | No | 5242880 |
| strategyOnError | The policy for handling exceptions during batch writes. The default value exit causes the task to fail when an error occurs. | No | exit |
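The interaction between maxBatchRows and maxBatchSize can be sketched as a buffer that flushes whenever either threshold is reached. The class below is illustrative only; the flush target is a stub standing in for the actual Stream Load submission.

```python
class BatchBuffer:
    """Accumulate rows and flush when either the row-count or byte-size cap is hit.

    Mirrors the roles of maxBatchRows and maxBatchSize (illustrative sketch,
    not the writer's actual implementation).
    """
    def __init__(self, max_rows=500_000, max_bytes=5_242_880):
        self.max_rows, self.max_bytes = max_rows, max_bytes
        self.rows, self.size = [], 0
        self.flushes = []  # stands in for submitted Stream Load batches

    def add(self, row: bytes):
        self.rows.append(row)
        self.size += len(row)
        if len(self.rows) >= self.max_rows or self.size >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.rows:
            self.flushes.append(list(self.rows))  # submit one batch
            self.rows, self.size = [], 0

# Example: with a cap of 2 rows, 5 rows flush as batches of 2 + 2 + 1.
buf = BatchBuffer(max_rows=2, max_bytes=1_000)
for _ in range(5):
    buf.add(b"abcd")
buf.flush()  # flush the remainder when the task finishes
```

Whichever cap is hit first triggers the flush, so a smaller maxBatchSize produces more, smaller Stream Load jobs even if maxBatchRows is never reached.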