Use the Data Integration service of DataWorks to run an offline sync task that reads data from ApsaraDB RDS for SQL Server and writes it to an Alibaba Cloud Elasticsearch instance.
Supported data sources and sync modes
DataWorks Data Integration supports the following source databases and sync modes:
Source databases
| Category | Data sources |
|---|---|
| Alibaba Cloud databases | ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, ApsaraDB for MongoDB, ApsaraDB for HBase, Alibaba Cloud PolarDB for Xscale (PolarDB-X) (formerly DRDS), MaxCompute, Object Storage Service (OSS), Tablestore |
| Self-managed databases | HDFS, Oracle, FTP, Db2, MySQL, PostgreSQL, SQL Server, MongoDB, HBase |
Sync modes
| Mode | Description |
|---|---|
| Offline batch sync | Synchronize a database or table to Elasticsearch in offline mode. See Create a batch synchronization task to synchronize all data in a database to Elasticsearch. |
| Real-time sync | Synchronize full and incremental data to Elasticsearch in real time. See Create a real-time synchronization task to synchronize data to Elasticsearch. |
Prerequisites
Before you begin, ensure that you have:
-
An ApsaraDB RDS for SQL Server instance. This topic uses a SQL Server 2019 Standard Edition instance as an example. See Quickly create and use an ApsaraDB RDS for SQL Server instance.
-
An Alibaba Cloud Elasticsearch instance with the automatic index creation feature enabled. See Create an Alibaba Cloud Elasticsearch instance and Configure YML parameters.
-
A DataWorks workspace. See Create a workspace.
Keep the following constraints in mind before you start:
-
Data can only be synchronized to Alibaba Cloud Elasticsearch. Self-managed Elasticsearch instances are not supported.
-
The ApsaraDB RDS for SQL Server instance, Elasticsearch instance, and DataWorks workspace must be in the same region.
-
All three resources must be in the same time zone. A time zone mismatch causes offsets in time-related data after synchronization.
Billing
-
For Elasticsearch billing details, see Elasticsearch billable items.
-
For exclusive resource group billing details, see Billing of exclusive resource groups for Data Integration (subscription).
Synchronize data from RDS SQL Server to Elasticsearch
The procedure consists of five steps:
-
Prepare source data in ApsaraDB RDS for SQL Server.
-
Purchase an exclusive resource group and connect it to your Virtual Private Cloud (VPC).
-
Add the SQL Server and Elasticsearch data sources in Data Integration.
-
Configure and run an offline sync task.
-
Verify the synchronized data in Kibana.
Step 1: Prepare the source data
Create the test table and insert a record in your ApsaraDB RDS for SQL Server instance.
The data in this topic is for testing only. Do not use it in a production environment.
Create the table:
CREATE TABLE students(
id INT,
name VARCHAR(20),
age INT
)
Insert a record:
INSERT INTO [students]( [id] , [name] , [age] ) values('1', 'Xiaoming','21');
Step 2: Purchase and configure an exclusive resource group
An exclusive resource group for Data Integration provides a dedicated, isolated runtime environment for sync tasks. It connects to your VPC so that Data Integration can reach both your RDS SQL Server and Elasticsearch instances over the internal network.
Network connectivity overview
To synchronize data over the VPC internal network, you must:
-
Purchase an exclusive resource group.
-
Attach the VPC, Zone, and vSwitch shared by both instances to the resource group.
-
Add the vSwitch CIDR block to the internal-facing whitelists of both instances.
-
Attach the workspace to the resource group.
Procedure
-
Log on to the DataWorks consoleDataWorks consoleDataWorks console.
-
In the top menu bar, select a region. In the left navigation pane, click Resource Group.
-
On the Exclusive Resource Groups tab, click Create Legacy Resource Group > Data Integration Resource Group.
-
On the DataWorks Exclusive Resource (Subscription) purchase page, set Exclusive Resource Type to Exclusive Resource for Data Integration, enter a name for the resource group, and then click Buy Now. For more configuration options, see Step 1: Purchase a resource group.
-
Attach a VPC to the exclusive resource group. In the Operation column for the resource group, click Network Settings. Attach the VPC, Zone, and vSwitch where both the RDS SQL Server and Elasticsearch instances reside. To look up the VPC information for each instance, see Switch the vSwitch for an ApsaraDB RDS for SQL Server instance and View the basic information of an Elasticsearch instance. For step-by-step instructions, see Attach a VPC.
ImportantAfter attaching the VPC, add the vSwitch CIDR block to the internal-facing whitelists of both the RDS SQL Server and Elasticsearch instances. See Configure a whitelist for an ApsaraDB RDS for SQL Server instance and Configure a public or internal-facing whitelist for an Elasticsearch instance.
-
Click the back icon in the upper-left corner to return to the Resource Group List page.
-
In the Operation column for the resource group, click Attach Workspace and select your workspace. For details, see Step 2: Attach a workspace.
Step 3: Add data sources
Add both the ApsaraDB RDS for SQL Server and the Elasticsearch data sources to Data Integration.
-
Go to the Data Integration page.
-
Log on to the DataWorks consoleDataWorks consoleDataWorks console.
-
In the left navigation pane, click Workspaces.
-
In the Operation column of your workspace, choose Quick Access > Data Integration.
-
-
In the navigation pane, click Data Source.
-
Add the SQL Server data source.
-
On the Data Source List page, click Add Data Source.
-
Search for and select SQL Server.
-
In the Add SQL Server Data Source dialog box, fill in the parameters in the Basic Information area. For parameter details, see Configure a SQL Server data source.
-
In the Connection Configuration area, click Test Connectivity. A status of Connected confirms a successful connection.
-
Click Finish.
-
-
Add the Elasticsearch data source the same way. For details, see Configure an Elasticsearch data source.
Step 4: Configure and run an offline sync task
The offline sync task runs on the exclusive resource group, which pulls data from the SQL Server data source and writes it to Elasticsearch.
You can configure an offline sync task using the codeless UI or the code editor. This topic uses the codeless UI. For the code editor method, see Configure an offline sync task using the code editor and Elasticsearch Writer.
-
Go to the Data Development page.
-
Log on to the DataWorks consoleDataWorks consoleDataWorks console.
-
In the left navigation pane, click Workspaces.
-
In the Operation column of your workspace, choose Quick Access > Data Development.
-
-
Create an offline sync task.
-
In the navigation pane, click the
icon and choose New > New Business Flow. -
Right-click the new business flow and choose New Node > Offline Synchronization.
-
In the New Node dialog box, enter a node name and click Confirm.
-
-
Configure the network and resources.
-
In the Data Source area, set Data Source to SQL Server and select your data source from Data Source Name.
-
In the My Resource Groups area, select the exclusive resource group.
-
In the Data Destination area, set Data Destination to Elasticsearch and select your data source from Data Source Name.
-
-
Click Next.
-
Configure the task details. For the full list of configuration options, see Configure an offline sync task using the codeless UI.
-
In the Data Source area, select the table to sync.
-
Configure the parameters in the Data Destination area.
-
In the Field Mapping area, map the Source Field to the Destination Field. For details, see Configure an offline sync task using the codeless UI.
-
In the Channel Control area, you can configure the channel parameters.
-
-
Run the task. If you configured a schedule, the task runs automatically at the specified intervals. To run the task immediately, click the run icon in the upper-right corner. A log message of
Shell run successfully!confirms that the task completed successfully.-
(Optional) In the right pane, click Scheduling Configuration to configure the scheduling parameters for the task. For parameter details, see Scheduling configuration.
-
In the upper-right corner of the node configuration tab, click the save icon.
-
Click the submit icon to submit the task.
-
Step 5: Verify the data synchronization result
-
Log on to the Kibana console of your Elasticsearch instance. See Log on to the Kibana console.
-
In the top-left corner of the Kibana page, click the menu icon and select Dev Tools.
-
In the Console, run the following query to view the synchronized data:
POST /dbo.students/_search?pretty { "query": { "match_all": {}} }NoteReplace
dbo.studentswith the index name you specified when configuring the sync task in DataWorks.The expected result is shown in the following figure.
