MongoDB stores data as JSON-style (BSON) documents that can be deeply nested. When migrating this data to MaxCompute, you need to extract specific fields using dot-notation paths and map them to the correct column types. This topic shows you how to use the Data Integration service of DataWorks to configure a batch sync job that reads nested JSON fields from a MongoDB collection and loads them into a MaxCompute table.
The example uses an ApsaraDB for MongoDB instance in a virtual private cloud (VPC) and extracts the store.bicycle.color field from a nested JSON document.
Prerequisites
Before you begin, make sure you have:
- MaxCompute and DataWorks activated. For more information, see Activate MaxCompute and DataWorks.
- A MaxCompute data source added in DataWorks. For more information, see Add a MaxCompute data source.
- A workflow created in your workspace in basic mode. For more information, see Create a workflow.
Step 1: Prepare test data in MongoDB
This step creates a database user, uploads sample JSON data to MongoDB, and confirms the data is accessible before you configure the sync job.
Create a database user
Create a user in your MongoDB database so DataWorks can connect to it. Run the following command:
db.createUser({user:"bookuser",pwd:"123456",roles:["user1"]})
This creates a user named bookuser with password 123456, assigned to the user1 role, which has data access permissions.
Upload test data
If you use an ApsaraDB for MongoDB instance in a VPC, apply for a public endpoint for the instance first. DataWorks shared resource groups connect to MongoDB over the public network.
Upload the following JSON document to the userlog collection in the admin database:
{
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{
"category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99
},
{
"category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
],
"bicycle": {
"color": "red",
"price": 19.95
}
},
"expensive": 10
}
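The sync job you configure later extracts the nested field store.bicycle.color from this document using a dot-notation path. The lookup such a path performs can be sketched in Python. The document and helper below are illustrative only; this is not DataWorks code:

```python
# Minimal sketch of dot-notation field extraction from a nested document.
doc = {
    "store": {
        "book": [{"category": "reference", "price": 8.95}],
        "bicycle": {"color": "red", "price": 19.95},
    },
    "expensive": 10,
}

def extract(document, path):
    """Walk a nested dict following a dot-separated path like 'store.bicycle.color'."""
    value = document
    for key in path.split("."):
        value = value[key]
    return value

print(extract(doc, "store.bicycle.color"))  # -> red
print(extract(doc, "expensive"))            # -> 10
```

Each dot in the path descends one level into the document, which is why a nested field needs the full path while a top-level field such as expensive needs only its name.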
Verify the uploaded data
Log on to the MongoDB database in the Data Management (DMS) console and run the following command to confirm the data was uploaded:
db.userlog.find().limit(10)
Step 2: Configure the sync job in DataWorks
This step creates the destination table in MaxCompute, adds MongoDB as a data source, and configures a batch sync node that maps the JSON field path to the MaxCompute column.
Create the destination table in MaxCompute
- Log on to the DataWorks console.
- Right-click your workflow and choose New > MaxCompute > Table.
- On the Create a Table page, select the engine type and enter a table name.
- On the table editing page, click DDL Statement.
- In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:
  create table mqdata (mqdata string);
  Important: The table name in the DDL statement must match the Name value you set on the Create a Table page.
- Click Commit to Production Environment.
Add a MongoDB data source
Add your MongoDB instance as a data source in DataWorks. For more information, see Add a MongoDB data source.
Create a batch sync node
- On the Data Analytics page, right-click your workflow and choose New > Data Integration > Offline Synchronization.
- In the Create Node dialog box, enter a node name and click Submit.
- In the top toolbar, click the icon that switches the node to script mode.
- In script mode, click the import template icon.
- In the Import Template dialog box, select the source type, source data source, target type, and target data source, then click Confirm.
- Replace the template content with the following script:
{
  "type": "job",
  "steps": [
    {
      "stepType": "mongodb",
      "parameter": {
        "datasource": "mongodb_userlog",
        "column": [
          {
            "name": "store.bicycle.color",
            "type": "document.String"
          }
        ],
        "collectionName": "userlog"
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "odps",
      "parameter": {
        "partition": "",
        "isCompress": false,
        "truncate": true,
        "datasource": "odps_source",
        "column": [
          "mqdata"
        ],
        "emptyAsNull": false,
        "table": "mqdata"
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "version": "2.0",
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  },
  "setting": {
    "errorLimit": {
      "record": ""
    },
    "speed": {
      "concurrent": 2,
      "throttle": false
    }
  }
}
The following table describes the key parameters:
| Parameter | Value | Description |
|---|---|---|
| datasource (Reader) | mongodb_userlog | The name of your MongoDB data source |
| name (column) | store.bicycle.color | The dot-notation path to the JSON field to extract |
| type (column) | document.String | The field data type. Use document.String for nested fields. For top-level fields (such as expensive), use string |
| collectionName | userlog | The MongoDB collection to read from |
| datasource (Writer) | odps_source | The name of your MaxCompute data source |
| column (Writer) | mqdata | The column in the MaxCompute table to write to |
| table | mqdata | The MaxCompute table name |

The following example shows how the JSON document is mapped to the MaxCompute table after migration:
| MongoDB document field | Value | MaxCompute column | Stored value |
|---|---|---|---|
| store.bicycle.color | "red" | mqdata | red |
- Click the run icon to run the job.
- Check the operation log to confirm the job completed without errors.
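Conceptually, the Reader walks each configured column path in every source document, and the Writer emits one row per document with one value per destination column. The steps above can be sketched in Python; all names here are illustrative, not the DataWorks implementation:

```python
# Illustrative simulation of the Reader -> Writer column mapping.
reader_columns = [{"name": "store.bicycle.color", "type": "document.String"}]
writer_columns = ["mqdata"]

source_doc = {"store": {"bicycle": {"color": "red", "price": 19.95}}, "expensive": 10}

def read_row(document, columns):
    """Extract one value per configured column path from a single document."""
    row = []
    for col in columns:
        value = document
        for key in col["name"].split("."):
            value = value[key]
        row.append(value)
    return row

row = read_row(source_doc, reader_columns)
# Pair the extracted values with the destination column names, in order.
record = dict(zip(writer_columns, row))
print(record)  # -> {'mqdata': 'red'}
```

Because the pairing is positional, the Reader column list and the Writer column list must have the same length and order.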
Step 3: Verify the migration result
- Right-click your workflow and choose New > MaxCompute > ODPS SQL.
- In the Create Node dialog box, enter a node name and click Submit.
- On the node configuration tab, enter the following query:
  SELECT * FROM mqdata;
- Click the run icon to run the query.
- Check the operation log to confirm the query returns the migrated data. The result should contain the value red, extracted from store.bicycle.color in the source document.
JSON field type reference
The type parameter in the Reader column configuration controls how DataWorks interprets the extracted JSON value.
| Field location | Example field | type value | Notes |
|---|---|---|---|
| Nested (non-top-level) | store.bicycle.color | document.String | Use document. followed by the type of the extracted value |
| Top-level | expensive | string | Top-level fields use simple type names |
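As a rule of thumb, the form of the type string depends only on whether the field path is nested. The hypothetical helper below illustrates the convention from the table above; it is not a DataWorks API:

```python
# Hypothetical helper: build the Reader "type" value for a field path,
# following the convention in the table above (not a DataWorks API).
def column_type(path, value_type="String"):
    """Return the type string for a column path in the MongoDB Reader config."""
    if "." in path:
        # Nested field: prefix the value type with "document."
        return "document." + value_type
    # Top-level field: simple lowercase type name
    return value_type.lower()

print(column_type("store.bicycle.color"))  # -> document.String
print(column_type("expensive"))            # -> string
```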