MongoDB stores data as JSON-style (BSON) documents that can be deeply nested. When migrating this data to MaxCompute, you need to extract specific fields using dot-notation paths and map them to the correct column types. This topic shows you how to use the Data Integration service of DataWorks to configure a batch sync job that reads nested JSON fields from a MongoDB collection and loads them into a MaxCompute table.
The example uses an ApsaraDB for MongoDB instance in a virtual private cloud (VPC) and extracts the store.bicycle.color field from a nested JSON document.
Prerequisites
Before you begin, make sure you have:
- MaxCompute and DataWorks activated. For more information, see Activate MaxCompute and DataWorks.
- A MaxCompute data source added in DataWorks. For more information, see Add a MaxCompute data source.
- A workflow created in your workspace in basic mode. For more information, see Create a workflow.
Step 1: Prepare test data in MongoDB
This step creates a database user, uploads sample JSON data to MongoDB, and confirms the data is accessible before you configure the sync job.
Create a database user
Create a user in your MongoDB database so DataWorks can connect to it. Run the following command:
db.createUser({user:"bookuser",pwd:"123456",roles:["user1"]})
This creates a user named bookuser with password 123456, assigned to the user1 role, which has data access permissions.
Upload test data
If you use an ApsaraDB for MongoDB instance in a VPC, apply for a public endpoint for the instance first. DataWorks shared resource groups connect to MongoDB over the public network.
Upload the following JSON document to the userlog collection in the admin database:
{
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{
"category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99
},
{
"category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
],
"bicycle": {
"color": "red",
"price": 19.95
}
},
"expensive": 10
}
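The sync job you configure later extracts the nested field store.bicycle.color from this document using a dot-notation path. The lookup such a path performs can be sketched in Python. The document and helper below are illustrative only; this is not DataWorks code:

```python
# Minimal sketch of dot-notation field extraction from a nested document.
doc = {
    "store": {
        "book": [{"category": "reference", "price": 8.95}],
        "bicycle": {"color": "red", "price": 19.95},
    },
    "expensive": 10,
}

def extract(document, path):
    """Walk a nested dict following a dot-separated path like 'store.bicycle.color'."""
    value = document
    for key in path.split("."):
        value = value[key]
    return value

print(extract(doc, "store.bicycle.color"))  # -> red
print(extract(doc, "expensive"))            # -> 10
```

Each dot in the path descends one level into the document, which is why a nested field needs the full path while a top-level field such as expensive needs only its name.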
Verify the uploaded data
Log on to the MongoDB database in the Data Management (DMS) console and run the following command to confirm the data was uploaded:
db.userlog.find().limit(10)
Step 2: Configure the sync job in DataWorks
This step creates the destination table in MaxCompute, adds MongoDB as a data source, and configures a batch sync node that maps the JSON field path to the MaxCompute column.
Create the destination table in MaxCompute
- Log on to the DataWorks console.
- Right-click your workflow and choose New > MaxCompute > Table.
- On the Create a Table page, select the engine type and enter a table name.
- On the table editing page, click DDL Statement.
- In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:
  create table mqdata (mqdata string);
  Important: The table name in the DDL statement must match the Name value you set on the Create a Table page.
- Click Commit to Production Environment.
Add a MongoDB data source
Add your MongoDB instance as a data source in DataWorks. For more information, see Add a MongoDB data source.
Create a batch sync node
- On the Data Analytics page, right-click your workflow and choose New > Data Integration > Offline Synchronization.
- In the Create Node dialog box, enter a node name and click Submit.
- In the top toolbar, click the icon that switches the node to script mode.
- In script mode, click the import template icon.
- In the Import Template dialog box, select the source type, source data source, target type, and target data source, then click Confirm.
- Replace the template content with the following script:
{
  "type": "job",
  "steps": [
    {
      "stepType": "mongodb",
      "parameter": {
        "datasource": "mongodb_userlog",
        "column": [
          {
            "name": "store.bicycle.color",
            "type": "document.String"
          }
        ],
        "collectionName": "userlog"
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "odps",
      "parameter": {
        "partition": "",
        "isCompress": false,
        "truncate": true,
        "datasource": "odps_source",
        "column": [
          "mqdata"
        ],
        "emptyAsNull": false,
        "table": "mqdata"
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "version": "2.0",
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  },
  "setting": {
    "errorLimit": {
      "record": ""
    },
    "speed": {
      "concurrent": 2,
      "throttle": false
    }
  }
}
The following table describes the key parameters:
| Parameter | Value | Description |
|---|---|---|
| datasource (Reader) | mongodb_userlog | The name of your MongoDB data source |
| name (column) | store.bicycle.color | The dot-notation path to the JSON field to extract |
| type (column) | document.String | The field data type. Use document.String for nested fields. For top-level fields (such as expensive), use string |
| collectionName | userlog | The MongoDB collection to read from |
| datasource (Writer) | odps_source | The name of your MaxCompute data source |
| column (Writer) | mqdata | The column in the MaxCompute table to write to |
| table | mqdata | The MaxCompute table name |

The following example shows how the JSON document is mapped to the MaxCompute table after migration:
| MongoDB document field | Value | MaxCompute column | Stored value |
|---|---|---|---|
| store.bicycle.color | "red" | mqdata | red |
- Click the run icon to run the job.
- Check the operation log to confirm the job completed without errors.
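Conceptually, the Reader walks each configured column path in every source document, and the Writer emits one row per document with one value per destination column. The steps above can be sketched in Python; all names here are illustrative, not the DataWorks implementation:

```python
# Illustrative simulation of the Reader -> Writer column mapping.
reader_columns = [{"name": "store.bicycle.color", "type": "document.String"}]
writer_columns = ["mqdata"]

source_doc = {"store": {"bicycle": {"color": "red", "price": 19.95}}, "expensive": 10}

def read_row(document, columns):
    """Extract one value per configured column path from a single document."""
    row = []
    for col in columns:
        value = document
        for key in col["name"].split("."):
            value = value[key]
        row.append(value)
    return row

row = read_row(source_doc, reader_columns)
# Pair the extracted values with the destination column names, in order.
record = dict(zip(writer_columns, row))
print(record)  # -> {'mqdata': 'red'}
```

Because the pairing is positional, the Reader column list and the Writer column list must have the same length and order.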
Step 3: Verify the migration result
- Right-click your workflow and choose New > MaxCompute > ODPS SQL.
- In the Create Node dialog box, enter a node name and click Submit.
- On the node configuration tab, enter the following query:
  SELECT * FROM mqdata;
- Click the run icon to run the query.
- Check the operation log to confirm the query returns the migrated data. The result should contain the value red, extracted from store.bicycle.color in the source document.
JSON field type reference
The type parameter in the Reader column configuration controls how DataWorks interprets the extracted JSON value.
| Field location | Example field | type value | Notes |
|---|---|---|---|
| Nested (non-top-level) | store.bicycle.color | document.String | Use document. followed by the type of the extracted value |
| Top-level | expensive | string | Top-level fields use simple type names |
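As a rule of thumb, the form of the type string depends only on whether the field path is nested. The hypothetical helper below illustrates the convention from the table above; it is not a DataWorks API:

```python
# Hypothetical helper: build the Reader "type" value for a field path,
# following the convention in the table above (not a DataWorks API).
def column_type(path, value_type="String"):
    """Return the type string for a column path in the MongoDB Reader config."""
    if "." in path:
        # Nested field: prefix the value type with "document."
        return "document." + value_type
    # Top-level field: simple lowercase type name
    return value_type.lower()

print(column_type("store.bicycle.color"))  # -> document.String
print(column_type("expensive"))            # -> string
```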