The StarRocks data source offers a bidirectional channel for reading from and writing to StarRocks. This topic describes the capabilities of DataWorks for synchronizing data with StarRocks data sources.
Supported versions
- E-MapReduce (EMR) Serverless StarRocks 2.5 and 3.1 are supported.
- EMR on ECS: StarRocks 2.1 is supported.
- The StarRocks Community Edition is supported. For details about the Community Edition, visit the StarRocks official website.
Note
- DataWorks supports only internal network connections to StarRocks. Therefore, the Community Edition must be deployed on EMR on ECS.
- The Community Edition of StarRocks is highly open. If you encounter compatibility issues with the data source, submit a ticket for feedback.
Supported field types
Most StarRocks data types, including numeric, string, and date types, are supported.
Preparations before data synchronization
To ensure network connectivity for a resource group, add the IP address or CIDR block of the resource group to the internal IP address whitelist of the desired EMR Serverless StarRocks instance beforehand. Additionally, allow the CIDR block to access ports 9030, 8030, and 8040.
- For more information about the IP addresses of DataWorks resource groups that you need to add to the whitelist, see Add a whitelist.
- The following figure shows how to access the IP address whitelist of an EMR Serverless StarRocks instance.
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.
When you create a StarRocks data source, you can use either the Alibaba Cloud instance mode or the connection string mode:
- Alibaba Cloud Instance Mode: Directly select a StarRocks instance that is created in the same VPC on Alibaba Cloud.
- Connection String Mode: Connect by using the host address or IP address, port, and load URL of StarRocks. This mode supports both internal network instances in the same VPC and connections over public network addresses.
Note: Serverless resource groups do not have public network access by default. To connect to a StarRocks instance over a public network address, configure a public NAT gateway and an EIP for the associated VPC to enable public network access to the data source.
For an Alibaba Cloud EMR Serverless StarRocks instance, use the Internal Address or Public Address as the host address or IP address, and the Query Port as the port. An illustrative mapping of these values is shown after this list.
- FE node information: Obtain the FE node details on the details tab of the instance.
- Database: After you connect to the instance by using EMR StarRocks Manager, view the corresponding database in the SQL Editor or on the Metadata Management page.
Note: To create a database, run SQL statements directly in the SQL Editor.
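The following snippet is only an illustrative sketch of how the connection values in connection string mode relate to one another; it is not the actual data source configuration form. The FE address and database name are placeholders, and the ports are assumptions based on the preparation section above: 9030 as the query port and 8030 as the HTTP port that serves the load URL.
{
    "hostAddress": "your-fe-internal-address",
    "port": 9030,
    "loadUrl": "your-fe-internal-address:8030",
    "database": "didb1"
}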
Develop a data synchronization task
For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure a batch synchronization task to synchronize data of a single table
- For the procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
- For the full list of parameters and sample code for the code editor, see Appendix: Code and parameters.
Appendix: Code and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Code for StarRocks Reader
{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_datasource",
        "column": [
            "id",
            "name"
        ],
        "where": "id>100",
        "table": "table1",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}
Parameters in code for StarRocks Reader
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the StarRocks data source. | Yes | None |
| selectedDatabase | The name of the StarRocks database. | No | The name of the database that is configured in the StarRocks data source. |
| column | The names of the columns from which you want to read data. | Yes | None |
| where | The filter condition. In actual business scenarios, data generated on the current day is often selected for synchronization, and the WHERE condition is specified based on a date or time column. For an example that references a scheduling parameter, see the snippet after this table. | No | None |
| table | The name of the table from which you want to read data. | Yes | None |
| splitPk | The field that is used to shard the data to be read. If you specify splitPk, the source data is sharded based on this field, and concurrent threads are started to synchronize the shards, which improves synchronization efficiency. We recommend that you set splitPk to the primary key column of the source table so that data is evenly distributed across shards instead of being concentrated in only a few shards. | No | None |
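For example, to synchronize only the data generated on the data timestamp of a task instance, you can reference a scheduling parameter in the where condition. The following reader configuration is a minimal sketch that assumes a date column named gmt_create in the source table and a scheduling parameter named bizdate defined for the task; both names are illustrative and are not taken from the sample above.
{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_datasource",
        "column": [
            "id",
            "name"
        ],
        "where": "gmt_create > '${bizdate}'",
        "table": "table1",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}
In this sketch, splitPk is set to the primary key column id so that the data can be sharded and read by concurrent threads, and ${bizdate} is replaced at run time with the value of the scheduling parameter that you configure for the task.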
Code for StarRocks Writer
{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "loadProps": {
            "row_delimiter": "\\x02",
            "column_separator": "\\x01"
        },
        "datasource": "starrocks_public",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "1.1.X.X:8030"
        ],
        "table": "table1",
        "preSql": [
            "truncate table table1"
        ],
        "postSql": [],
        "maxBatchRows": 500000,
        "maxBatchSize": 5242880
    },
    "name": "Writer",
    "category": "writer"
}
Parameters in code for StarRocks Writer
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the StarRocks data source. | Yes | None |
| selectedDatabase | The name of the StarRocks database. | No | The name of the database that is configured in the StarRocks data source. |
| loadProps | The request parameters of the StarRocks Stream Load import method. If you import data as CSV files by using Stream Load, you can configure request parameters such as column_separator and row_delimiter, which specify the column separator and row delimiter of the CSV data. If you have no special requirements, set this parameter to {}. | Yes | None |
| column | The names of the columns to which you want to write data. | Yes | None |
| loadUrl | The URL of a StarRocks frontend (FE) node, which consists of the IP address of the FE node and the HTTP port number. The default HTTP port number is 8030. If you specify the URLs of multiple FE nodes, separate them with commas (,). For an example, see the snippet after this table. | Yes | None |
| table | The name of the table to which you want to write data. | Yes | None |
| preSql | The SQL statement that you want to execute before the synchronization task is run. For example, you can execute the TRUNCATE TABLE tablename statement to delete outdated data before the task is run. | No | None |
| postSql | The SQL statement that you want to execute after the synchronization task is run. | No | None |
| maxBatchRows | The maximum number of rows that can be written at a time. | No | 500000 |
| maxBatchSize | The maximum number of bytes that can be written at a time. | No | 5242880 |
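For example, to distribute Stream Load requests across multiple FE nodes, you can list the HTTP addresses of all FE nodes in loadUrl. The following writer configuration is a minimal sketch that assumes two hypothetical FE nodes at 192.168.0.10 and 192.168.0.11; the other parameter values are reused from the sample code above.
{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_public",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "192.168.0.10:8030",
            "192.168.0.11:8030"
        ],
        "loadProps": {
            "column_separator": "\\x01",
            "row_delimiter": "\\x02"
        },
        "table": "table1",
        "preSql": [
            "truncate table table1"
        ],
        "postSql": [],
        "maxBatchRows": 500000,
        "maxBatchSize": 5242880
    },
    "name": "Writer",
    "category": "writer"
}
The column_separator and row_delimiter values \\x01 and \\x02 are non-printable characters, which makes them unlikely to conflict with actual column values in the CSV data that Stream Load transfers.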