DataWorks provides StarRocks Reader and StarRocks Writer for you to read data from and write data to StarRocks data sources. This topic describes the capabilities of synchronizing data from or to StarRocks data sources.
Supported versions
E-MapReduce (EMR) Serverless StarRocks 2.5 and 3.1.
EMR on ECS: StarRocks 2.1.
Support for Community Edition StarRocks.
NoteBecause DataWorks only supports internal network connections to StarRocks, Community Edition StarRocks must be deployed on EMR on ECS.
Community Edition StarRocks has strong openness. If you encounter compatibility issues when using the data source, you are welcome to submit a ticket for feedback.
Supported field types
Only numeric, string, and date field types are supported.
Preparations before data synchronization
To ensure network connectivity of resource groups, you need to add the IP addresses of the DataWorks resource groups that you will use to the internal network whitelist of the EMR Serverless StarRocks instance in advance. You also need to allow access from this network segment to ports 9030
, 8030
, and 8040
.
For the whitelist IP addresses of DataWorks resource groups, see: Add whitelist.
The following figure shows the entry points for accessing the IP address whitelists of an EMR Serverless StarRocks instance.
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.
Choose a StarRocks connection mode based on your network environment:
Scenario 1: Internal network connection (recommended)
Internal network links have low latency and more secure data transmission, without requiring additional public network permissions.
Applicable scenario: Your StarRocks instance and Serverless resource group are in the same VPC.
Supports both Alibaba Cloud instance mode and connection string mode:
Select Alibaba Cloud Instance Mode: Directly select the StarRocks instance in the same VPC, and the system automatically obtains the connection information without manual configuration.
Select Connection String Mode: Manually enter the internal address/IP, port, and Load URL of the instance.
Scenario 2: Public network connection
Public network transmission has security risks. We recommend using it with security policies such as whitelists and IP authentication.
Applicable scenario: You need to access the StarRocks instance through the public network (such as cross-region or local environment access).
Supports connection string mode (make sure the StarRocks instance has public network access enabled):
Select Connection String Mode: Manually enter the public address/IP, port, and Load URL of the instance.
Serverless resource groups do not have public network access capabilities by default. When connecting to a StarRocks instance using a public address, you need to configure a public NAT Gateway and EIP for the bound VPC to support public network access to the data source.
If you are using Alibaba Cloud EMR StarRocks Serverless, use the Internal Address or Public Address for the Host Address/ip, and use the Query Port for the port.
FE information: You can obtain this from the instance details page.
Database: After connecting to the instance using EMR StarRocks Manager, you can view the corresponding database in SQL Editor or Metadata Management.
NoteIf you need to create a database, you can directly run SQL statements in the SQL editor.
Develop a data synchronization task
For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.
Configure a batch synchronization task to synchronize data of a single table
For the operation procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For all parameters and script demos in code editor mode, see Appendix: Script demos and parameter descriptions below.
Appendix: Script demos and parameter descriptions
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Code for StarRocks reader
{
"stepType": "starrocks",
"parameter": {
"selectedDatabase": "didb1",
"datasource": "starrocks_datasource",
"column": [
"id",
"name"
],
"where": "id>100",
"table": "table1",
"splitPk": "id"
},
"name": "Reader",
"category": "reader"
}
Parameters in code for StarRocks Reader
Parameter | Description | Required | Default value |
datasource | The name of the StarRocks data source. | Yes | None |
selectedDatabase | The name of the StarRocks database. | No | The database is the one that is configured in the StarRocks data source. |
column | The names of the columns from which you want to read data. | Yes | None |
where | The filter condition. In actual business scenarios, you often need to synchronize data of the current day. In this case, you can set the where condition to
| No | None |
table | The name of the table from which you want to read data. | Yes | None |
splitPk | When StarRocks Reader extracts data, if you specify splitPk, it indicates that you want to use the field represented by splitPk for data sharding. The data synchronization will then start concurrent tasks for data synchronization, improving the efficiency of data synchronization. We recommend that you set the splitPk parameter to the name of the primary key column of the source table. This way, data can be evenly distributed to different shards based on the primary key column, instead of being intensively distributed only to specific shards. | No | None |
Code for StarRocks writer
{
"stepType": "starrocks",
"parameter": {
"selectedDatabase": "didb1",
"loadProps": {
"row_delimiter": "\\x02",
"column_separator": "\\x01"
},
"datasource": "starrocks_public",
"column": [
"id",
"name"
],
"loadUrl": [
"1.1.X.X:8030"
],
"table": "table1",
"preSql": [
"truncate table table1"
],
"postSql": [
],
"maxBatchRows": 500000,
"maxBatchSize": 5242880
},
"name": "Writer",
"category": "writer"
}
Parameters in code for StarRocks writer
Parameter | Description | Required | Default value |
datasource | The name of the StarRocks data source. | Yes | None |
selectedDatabase | The name of the StarRocks database. | No | The name of the database that is configured in the StarRocks data source. |
loadProps | The request parameters for the StarRocks Stream Load import method. If you want to import data as CSV files by using the Stream Load import method, you can configure request parameters. If you have no special requirements, set the parameter to {}. Request parameters that you can configure for the Stream Load import method:
If the data you want to write to StarRocks contains \t or \n, you must use other characters as delimiters. Example:
Stream Load also supports importing data in JSON format. You can configure:
Parameters that can be configured in JSON format:
| Yes | None |
column | The names of the columns to which you want to write data. | Yes | None |
loadUrl | Enter the StarRocks FrontEnd IP and Http Port (usually default is | Yes | None |
table | The name of the table to which you want to write data. | Yes | None |
preSql | The SQL statement that you want to execute before the synchronization task is run. For example, you can execute the TRUNCATE TABLE tablename statement to delete outdated data before the synchronization task is run. | No | None |
postSql | The SQL statement that you want to execute after the data synchronization task is run. | No | None |
maxBatchRows | The maximum number of rows of data written each time. | No | 500000 |
maxBatchSize | The maximum number of bytes written each time. | No | 5242880 |