DataWorks Data Integration provides MongoDB Writer that allows you to write the data from other data sources to a MongoDB data source. This topic provides an example on how to use a batch synchronization node in Data Integration to synchronize data from a MaxCompute data source to a MongoDB data source.
Prerequisites
- DataWorks is activated and a MaxCompute compute engine is associated with a workspace.
- An exclusive resource group for Data Integration is purchased and configured. The resource group is used to run the batch synchronization node in this topic. For more information, see Create and use an exclusive resource group for Data Integration.
Make preparations
In this example, you must prepare a MongoDB data collection and a MaxCompute table for data synchronization.
- Prepare a MaxCompute table and construct data for the table.
- Prepare a MongoDB data collection to which you want to write the data read from the partitioned MaxCompute table. In this example, ApsaraDB for MongoDB is used and a data collection named test_write_mongo is created:

  ```
  db.createCollection('test_write_mongo')
  ```
Configure a batch synchronization node
Step 1: Add a MongoDB data source
Add a MongoDB data source and make sure that a network connection is established between the data source and the exclusive resource group for Data Integration. For more information, see Add a MongoDB data source.
Step 2: Create and configure a batch synchronization node
- Establish network connections between the data sources and the exclusive resource group for Data Integration.
Select the MaxCompute data source that is automatically generated when you associate the MaxCompute compute engine with the workspace, the MongoDB data source that you added in Step 1, and the exclusive resource group for Data Integration. Then, test the network connectivity between the data sources and the resource group.
- Select the data sources. Select the partitioned MaxCompute table and the MongoDB data collection that you prepared in the preparation step. The following table describes the key parameters for the batch synchronization node.
| Parameter | Description |
| --- | --- |
| WriteMode(overwrite or not) | Specifies whether to overwrite existing data in the MongoDB data collection. Valid values: <br>- No (default): Data is inserted into the MongoDB data collection as new data entries. <br>- Yes: You must also configure the ReplaceKey parameter. This setting ensures that an existing data entry is overwritten by the new data entry that has the same primary key value. |
| ReplaceKey | The primary key for each data record. Data is overwritten based on this key. You can specify only one primary key column. In most cases, the primary key of the MongoDB data collection is used. |
| Statement Run Before Writing | The statement that you want to execute before data synchronization. Configure the statement in the JSON format by using the type and json properties: <br>- type: Required. Valid values: remove and drop. The values must be in lowercase letters. <br>- json: Required if type is set to remove. Configure this property based on the syntax of standard MongoDB query operations. For more information, see Query Documents. This property is not required if type is set to drop. |

Note: If you set the WriteMode(overwrite or not) parameter to Yes and specify a field other than the _id field as the primary key, an error similar to the following may occur when the node is run: `After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2"`. The error occurs because the value of the _id field does not match the value of the ReplaceKey parameter for some of the data that is written to the destination MongoDB data collection. For more information, see Error: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2".
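To make the format concrete, the following is a sketch of a Statement Run Before Writing value. The field name col_string and its value are assumptions for illustration only. The remove form deletes the documents that match the query before data is written, and the drop form drops the entire collection:

```
{"type": "remove", "json": {"col_string": "obsolete"}}
```

```
{"type": "drop"}
```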
- Configure field mappings. By default, after a MongoDB data source is added, the fields in a row of the source are mapped to the fields in the same row of the destination. You can also click the icon to manually edit the fields in the source table. The following sample code provides an example of edited source fields:

  ```
  {"name":"id","type":"string"}
  {"name":"col_string","type":"string"}
  {"name":"col_int","type":"long"}
  {"name":"col_bigint","type":"long"}
  {"name":"col_decimal","type":"double"}
  {"name":"col_date","type":"date"}
  {"name":"col_boolean","type":"bool"}
  {"name":"col_array","type":"array","splitter":","}
  ```

  After you edit the fields, the new mappings between the source fields and destination fields are displayed on the configuration tab of the node.
Step 3: Commit and deploy the batch synchronization node
If you use a workspace in standard mode and want to periodically schedule the batch synchronization node in the production environment, you can commit and deploy the node to the production environment. For more information, see Deploy nodes.
Step 4: Run the batch synchronization node and view the synchronization result
After the node finishes running, you can query the test_write_mongo collection in the destination MongoDB database to verify that the data is written as expected.
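For example, a quick check in the mongo shell (or mongosh) might look like the following; test_write_mongo is the collection created in the preparations, and the queries are only an illustrative sketch:

```
db.test_write_mongo.find().limit(10)
db.test_write_mongo.countDocuments()
```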
Appendix: Data type conversion during data synchronization
Values of the type parameter
The following data types are supported for the type parameter: INT, LONG, DOUBLE, STRING, BOOL, DATE, and ARRAY.
Data written to the MongoDB data collection when type is set to ARRAY
- The source data is a string: a,b,c.
- You set the type parameter to ARRAY and the splitter property to , (comma) for the batch synchronization node.
- The data written to the destination is ["a","b","c"] when the node is run.
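The documented ARRAY conversion can be illustrated with a minimal sketch in plain JavaScript. This is not the writer's actual implementation; it only shows the behavior described above: the source string is split on the configured splitter, and the resulting array is what gets written to MongoDB.

```javascript
// Sketch of how an ARRAY-typed field is derived from a source string.
// MongoDB Writer's real logic is internal; this illustrates the documented
// behavior: split the source value on the splitter property.
function toMongoArray(sourceValue, splitter) {
  return sourceValue.split(splitter);
}

console.log(JSON.stringify(toMongoArray("a,b,c", ","))); // ["a","b","c"]
```

A value without the splitter character, such as "x", would be written as a single-element array ["x"].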