DataWorks Data Integration provides MongoDB Writer, which allows you to write data from other data sources to a MongoDB data source. This topic provides an example on how to use a batch synchronization node in Data Integration to synchronize data from a MaxCompute data source to a MongoDB data source.

Prerequisites

  • DataWorks is activated and a MaxCompute compute engine is associated with a workspace.
  • An exclusive resource group for Data Integration is purchased and configured. The resource group is used to run the batch synchronization node in this topic. For more information, see Create and use an exclusive resource group for Data Integration.

Make preparations

In this example, you must prepare a MongoDB data collection and a MaxCompute table for data synchronization.

  1. Prepare a MaxCompute table and construct data for the table.
    1. Create a partitioned table named test_write_mongo. The partition field is pt.
      CREATE TABLE IF NOT EXISTS test_write_mongo(
          id STRING,
          col_string STRING,
          col_int INT,
          col_bigint BIGINT,
          col_decimal DECIMAL,
          col_date DATETIME,
          col_boolean BOOLEAN,
          col_array STRING
      ) PARTITIONED BY (pt STRING) LIFECYCLE 10;
    2. Insert a data record into the partition in which pt is set to 20230215.
      insert into test_write_mongo partition (pt='20230215')
      values ('11','name11',1,111,1.22,cast('2023-02-15 15:01:01' as datetime),true,'1,2,3');
    3. Query the table to check whether the data is correctly inserted.
      SELECT  * FROM test_write_mongo
      WHERE   pt = '20230215';
  2. Prepare a MongoDB data collection to which you want to write the data read from the partitioned MaxCompute table.
    In this example, ApsaraDB for MongoDB is used and a data collection named test_write_mongo is created.
    db.createCollection('test_write_mongo')

Configure a batch synchronization node

Step 1: Add a MongoDB data source

Add a MongoDB data source and make sure that a network connection is established between the data source and the exclusive resource group for Data Integration. For more information, see Add a MongoDB data source.

Step 2: Create and configure a batch synchronization node

Create a batch synchronization node on the DataStudio page in the DataWorks console and configure settings such as the source and destination for the node. This step describes only the settings that you must configure. Retain the default values for the other settings. For more information, see Configure a batch synchronization node by using the codeless UI.
  1. Establish network connections between the data sources and the exclusive resource group for Data Integration.

    Select the MaxCompute data source that is automatically generated when you associate the MaxCompute compute engine with the workspace, the MongoDB data source that you added in Step 1, and the exclusive resource group for Data Integration. Then, test the network connectivity between the data sources and the resource group.

  2. Select the data sources.
    Select the partitioned MaxCompute table and the MongoDB data collection that you prepared in the preparation step. The following information describes the key parameters for the batch synchronization node.
    • WriteMode(overwrite or not): Specifies whether to overwrite existing data in the MongoDB data collection.
      • No: Data is inserted into the MongoDB data collection as new data entries. This is the default value.
      • Yes: You must also configure the ReplaceKey parameter. This setting ensures that an existing data entry is overwritten by the new data entry that has the same primary key value.
    • ReplaceKey: the primary key for each data record. Data is overwritten based on the primary key. You can specify only one primary key column. In most cases, the primary key of the MongoDB data collection is used.
    Note If you set the WriteMode(overwrite or not) parameter to Yes and specify a field other than the _id field as the primary key, an error similar to the following one may occur when the node is run: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2". This error occurs because the value of the _id field does not match the value of the ReplaceKey parameter for some of the data that is written to the destination MongoDB data collection. For more information, see Error: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2".
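    The overwrite behavior can be sketched as follows. This is a simplified Python illustration of the WriteMode and ReplaceKey semantics, not the writer's actual implementation:

```python
def write_record(docs, record, overwrite, replace_key=None):
    """Simplified sketch of MongoDB Writer semantics.

    docs models the destination collection as a list of documents.
    If overwrite is False (WriteMode = No), the record is inserted as a
    new entry. If overwrite is True (WriteMode = Yes), a document whose
    replace_key value matches the record is replaced in place.
    """
    if overwrite:
        for i, doc in enumerate(docs):
            if doc.get(replace_key) == record[replace_key]:
                docs[i] = record  # overwrite the existing entry
                return
    docs.append(record)  # no match (or WriteMode = No): insert as new

docs = [{"_id": "11", "col_string": "old_value"}]
write_record(docs, {"_id": "11", "col_string": "name11"}, overwrite=True, replace_key="_id")
write_record(docs, {"_id": "12", "col_string": "name12"}, overwrite=True, replace_key="_id")
print(docs)
# [{'_id': '11', 'col_string': 'name11'}, {'_id': '12', 'col_string': 'name12'}]
```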
    • Statement Run Before Writing: the statement that you want to execute before data synchronization. Configure the statement in the JSON format with the type and json properties.
      • type: Required. Valid values: remove and drop. The values must be in lowercase.
      • json:
        • If type is set to remove, this property is required. Configure the property based on the syntax of standard MongoDB query operations. For more information, see Query Documents.
        • If type is set to drop, this property is not required.
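    For example, assuming you want to remove previously synchronized documents before each run, the Statement Run Before Writing value might look like the following sketch. The filter in the json property is a hypothetical example and must match your own data:

```json
{"type": "remove", "json": {"col_string": "name11"}}
```

    To drop the entire destination collection before writing instead, set the value to {"type": "drop"}; the json property is not required in this case.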
  3. Configure field mappings.
    By default, if a MongoDB data source is used as the destination, each field in a row of the source is mapped to the field in the same row of the destination. You can also click the edit icon to manually edit the fields in the source table. The following sample code provides an example on how to edit the fields in the source table:
    {"name":"id","type":"string"}
    {"name":"col_string","type":"string"}
    {"name":"col_int","type":"long"}
    {"name":"col_bigint","type":"long"}
    {"name":"col_decimal","type":"double"}
    {"name":"col_date","type":"date"}
    {"name":"col_boolean","type":"bool"}
    {"name":"col_array","type":"array","splitter":","}
    After you edit the fields, the new mappings between the source fields and destination fields are displayed on the configuration tab of the node.
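    The field definitions above can be illustrated with a short Python sketch that converts one source row into a MongoDB document. This is a hypothetical illustration of the mapping semantics, not code that Data Integration actually runs:

```python
from datetime import datetime

# Hypothetical sketch: map one source row to a document according to
# the field definitions shown above.
FIELD_SPECS = [
    {"name": "id", "type": "string"},
    {"name": "col_string", "type": "string"},
    {"name": "col_int", "type": "long"},
    {"name": "col_bigint", "type": "long"},
    {"name": "col_decimal", "type": "double"},
    {"name": "col_date", "type": "date"},
    {"name": "col_boolean", "type": "bool"},
    {"name": "col_array", "type": "array", "splitter": ","},
]

CONVERTERS = {
    "string": str,
    "long": int,
    "double": float,
    "bool": bool,
    "date": lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S"),
}

def convert_row(row):
    """Convert one source row into a document, field by field."""
    doc = {}
    for spec, value in zip(FIELD_SPECS, row):
        if spec["type"] == "array":
            # ARRAY fields are split on the configured splitter
            doc[spec["name"]] = value.split(spec["splitter"])
        else:
            doc[spec["name"]] = CONVERTERS[spec["type"]](value)
    return doc

row = ["11", "name11", 1, 111, 1.22, "2023-02-15 15:01:01", True, "1,2,3"]
doc = convert_row(row)
print(doc["col_array"])  # ['1', '2', '3']
```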

Step 3: Commit and deploy the batch synchronization node

If you use a workspace in standard mode and want to periodically schedule the batch synchronization node in the production environment, you must commit the node and deploy it to the production environment. For more information, see Deploy nodes.

Step 4: Run the batch synchronization node and view the synchronization result

After you complete the preceding configurations, you can run the batch synchronization node. After the node finishes running, you can view the data that is synchronized to the MongoDB data collection.
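For example, assuming you can connect to the ApsaraDB for MongoDB instance from the MongoDB shell, you can query the destination collection to check the synchronized document. The filter value comes from the test data that you inserted in the preparation step:

```javascript
db.test_write_mongo.find({ "col_string": "name11" })
```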

Appendix: Data type conversion during data synchronization

Values of the type parameter

The following data types are supported for the type parameter: INT, LONG, DOUBLE, STRING, BOOL, DATE, and ARRAY.

Data written to the MongoDB data collection when type is set to ARRAY

If you set the type parameter to ARRAY, you must configure the splitter property. This way, data can be written to the MongoDB data collection as arrays. Example:
  • The source data is a string: a,b,c.
  • You set the type parameter to ARRAY and the splitter property to , for the batch synchronization node.
  • The data written to the destination is ["a","b","c"] when the node is run.
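The conversion in this example can be sketched in Python. This is a simplified illustration of the splitter behavior, not the writer's actual code:

```python
def split_to_array(value: str, splitter: str) -> list:
    """Mimic MongoDB Writer's ARRAY handling: split the source string
    on the configured splitter to produce an array value."""
    return value.split(splitter)

print(split_to_array("a,b,c", ","))  # ['a', 'b', 'c']
```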