Migrate MongoDB to LindormTable via DataWorks DataX - Lindorm

Use DataWorks to batch-migrate offline data from ApsaraDB for MongoDB to LindormTable. DataWorks is a platform as a service (PaaS) from Alibaba Cloud that supports multiple compute engines and storage engines. For more information, see What is DataWorks?

Prerequisites

Before you begin, make sure you have:

An ApsaraDB for MongoDB instance with the source data
A table in LindormTable created with the target schema
Access to the Data Integration service of DataWorks to configure a DataX task (see Use DataWorks to configure synchronization tasks in DataX)

Nested field mapping

MongoDB documents can contain nested JSON objects. LindormTable stores data in a flat, row-oriented structure, so you must unnest all nested fields before or during migration.

Use dot notation to reference nested fields in the MongoDB Reader configuration. For example, the field a nested in map is referenced as map.a with the reader type document.string, and is written to column a in LindormTable. The following table shows how nested fields translate:

MongoDB field	Dot notation in Reader	LindormTable column	Reader type
`map.a`	`map.a`	`a`	`document.string`
`map.b`	`map.b`	`b`	`document.string`

Top-level fields need no type conversion. Declare them with their plain reader types, such as string or int.
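To make the flattening concrete, the following Python sketch resolves dot-notation paths against a nested document and produces a flat row. It only illustrates the mapping in the table above and is not how MongoDB Reader is implemented; the names get_by_path, flatten, and FIELD_TO_COLUMN are hypothetical.

```python
from typing import Any, Dict, Mapping

# Hypothetical mapping for illustration: Reader dot notation -> LindormTable column.
FIELD_TO_COLUMN = {"map.a": "a", "map.b": "b"}


def get_by_path(document: Mapping[str, Any], dotted_path: str) -> Any:
    """Walk a dotted path such as "map.a" through nested dicts.

    Returns None if any key along the path is missing.
    """
    current: Any = document
    for key in dotted_path.split("."):
        if not isinstance(current, Mapping) or key not in current:
            return None
        current = current[key]
    return current


def flatten(document: Mapping[str, Any]) -> Dict[str, Any]:
    """Build a flat row containing one column per mapped nested field."""
    return {
        column: get_by_path(document, path)
        for path, column in FIELD_TO_COLUMN.items()
    }


doc = {"title": "ApsaraDB for MongoDB tutorial", "map": {"a": "mapa", "b": "mapb"}}
row = flatten(doc)
print(row)  # {'a': 'mapa', 'b': 'mapb'}
```

A missing nested key resolves to None in this sketch; how the actual reader and writer treat missing values depends on their own settings.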

If you need to transform data during migration (for example, apply MD5 hashing to the primary key), use the following three-step approach instead:

Migrate data from ApsaraDB for MongoDB to MaxCompute.
Run SQL statements in MaxCompute to process the data.
Migrate the processed data from MaxCompute to LindormTable.
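
As a rough illustration of the kind of transformation the second step performs, the following Python sketch replaces a primary key value with its MD5 hex digest. In the approach above the processing is done with SQL statements in MaxCompute, so this code only shows the effect on a single row. The function names md5_key and transform_row are hypothetical, and using title as the key follows the sample schema in this topic.

```python
import hashlib
from typing import Any, Dict


def md5_key(value: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a UTF-8 string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def transform_row(row: Dict[str, Any], key_column: str = "title") -> Dict[str, Any]:
    """Return a copy of the row with the key column replaced by its MD5 digest."""
    transformed = dict(row)
    transformed[key_column] = md5_key(str(row[key_column]))
    return transformed


original = {"title": "ApsaraDB for MongoDB tutorial", "likes": 100}
hashed = transform_row(original)
print(hashed["title"])   # a 32-character hex string
print(hashed["likes"])   # other columns are unchanged
```

Hashing the key spreads otherwise sequential key values across the key space, which is one common reason for this kind of preprocessing.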

Prepare the source and target data

Source document in ApsaraDB for MongoDB:

{
   "id" : ObjectId("624573dd7c0e2eea4cc8****"),
   "title" : "ApsaraDB for MongoDB tutorial",
   "description" : "ApsaraDB for MongoDB is a NoSQL database",
   "by" : "beginner tutorial",
   "url" : "http://www.runoob.com",
   "map" : {
        "a" : "mapa",
        "b" : "mapb"
    },
   "likes" : 100
}

Target schema in LindormTable:

CREATE TABLE t1(title varchar, desc varchar, by1 varchar, url varchar, a varchar, b varchar, likes int, primary key(title));

The nested fields map.a and map.b in MongoDB are flattened into columns a and b in LindormTable. The id field is omitted because title serves as the primary key.

Migrate data

Step 1: Add a MongoDB data source

In the DataWorks console, configure the source ApsaraDB for MongoDB instance as a data source. For detailed steps, see Add a MongoDB data source.

Step 2: Create a workflow

For more information about configuring a batch synchronization task using the code editor, see Configure a batch synchronization task by using the code editor.

Log on to the DataWorks console.
In the left-side navigation pane, click Workspace.
In the top navigation bar, select the region where your workspace resides. On the Workspaces page, find your workspace and choose Shortcuts > Data Development in the Actions column.
On the DataStudio page, hover over the icon and select Create Workflow.
In the Create Workflow dialog box, enter a Workflow Name and Description.
The name must be 1 to 128 characters and can contain letters, digits, underscores (_), and periods (.).
Click Create.

Step 3: Create a batch synchronization node

Click the new workflow, then right-click Data Integration.
Choose Create Node > Offline synchronization.
In the Create Node dialog box, enter the Name of the node.
The node name must be 1 to 128 characters and can contain letters, digits, underscores (_), and periods (.).
Click Submit.
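
The workflow name and the node name follow the same stated rule, so a name can be checked before it is entered in the console. The following Python sketch is a minimal check that assumes "letters" means ASCII letters only; the console may accept more than this pattern does. The function name is_valid_name is hypothetical.

```python
import re

# Assumption: "letters" is read as ASCII A-Z and a-z only.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_.]{1,128}")


def is_valid_name(name: str) -> bool:
    """Check the stated rule: 1 to 128 characters from letters, digits, _ and ."""
    return NAME_PATTERN.fullmatch(name) is not None


print(is_valid_name("mongo_to_lindorm.v1"))  # True
print(is_valid_name("mongo to lindorm"))     # False, spaces are not allowed
```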

Step 4: Configure the reader and writer

On the node configuration tab, click the Conversion script icon in the top toolbar.
In the Tips dialog, click OK to open the code editor.

Replace the generated code with the following configuration. The job uses MongoDB Reader as the source and Lindorm Writer as the destination.

For MongoDB Reader parameters, see MongoDB Reader.
For Lindorm Writer parameters, see Lindorm Writer.

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "mongodb",
            "parameter": {
                "datasource": "test_mongo",   // The name of the ApsaraDB for MongoDB data source.
                "column": [
                    {
                        "name": "title",
                        "type": "string"
                    },
                    {
                        "name": "description",
                        "type": "string"
                    },
                    {
                        "name": "by",
                        "type": "string"
                    },
                    {
                        "name": "url",
                        "type": "string"
                    },
                    {
                        "name": "map.a",
                        "type": "document.string"
                    },
                    {
                        "name": "map.b",
                        "type": "document.string"
                    },
                    {
                        "name": "likes",
                        "type": "int"
                    }
                ],
                "collectionName": "testdatax"
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "lindorm",
            "parameter": {
                "configuration": {
                    "lindorm.client.seedserver": "ld-xxxx-proxy-lindorm.lindorm.rds.aliyuncs.com:30020",
                    "lindorm.client.username": "root",
                    "lindorm.client.namespace": "test",
                    "lindorm.client.password": "root"
                },
                "nullMode": "skip",
                "datasource": "",
                "writeMode": "api",
                "envType": 1,
                "columns": [
                    "title",
                    "desc",
                    "by1",
                    "url",
                    "a",
                    "b",
                    "likes"
                ],
                "dynamicColumn": "false",
                "table": "t1",
                "encoding": "utf8"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "executeMode": null,
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "concurrent": 2,
            "throttle": false
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
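
Before running the job, it can help to check that the reader and writer column lists line up. The following Python sketch assumes that the two lists are paired by position, as the matching order in the configuration above suggests, and that every writer column must exist in the target table. It uses a shortened copy of the column lists, and the function names pair_columns and unknown_columns are hypothetical.

```python
from typing import Dict, List, Tuple


def pair_columns(
    reader_columns: List[Dict[str, str]], writer_columns: List[str]
) -> List[Tuple[str, str]]:
    """Pair reader fields with writer columns by position."""
    if len(reader_columns) != len(writer_columns):
        raise ValueError(
            f"reader has {len(reader_columns)} columns, "
            f"writer has {len(writer_columns)}"
        )
    return [(r["name"], w) for r, w in zip(reader_columns, writer_columns)]


def unknown_columns(writer_columns: List[str], table_columns: List[str]) -> List[str]:
    """Return writer columns that are not defined in the target table."""
    return [c for c in writer_columns if c not in table_columns]


# Shortened lists for illustration.
reader_columns = [
    {"name": "title", "type": "string"},
    {"name": "description", "type": "string"},
    {"name": "map.a", "type": "document.string"},
    {"name": "likes", "type": "int"},
]
writer_columns = ["title", "desc", "a", "likes"]
# Columns from the CREATE TABLE statement for t1.
table_columns = ["title", "desc", "by1", "url", "a", "b", "likes"]

pairs = pair_columns(reader_columns, writer_columns)
for source_field, target_column in pairs:
    print(f"{source_field} -> {target_column}")
print(unknown_columns(writer_columns, table_columns))  # []
```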

Save the node configuration, then click the icon to run the job. Monitor progress on the Runtime Log tab.

Verify the migration

After the job completes, confirm that the data was migrated correctly:

On the Runtime Log tab, verify that the job status shows no errors and that the record count matches the number of documents in the source collection.
Query LindormTable to spot-check the migrated data. Confirm that the flattened nested fields (a, b) are populated correctly and that numeric fields such as likes contain accurate integer values.
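
The spot check can be scripted once sample documents and the matching rows have been fetched. The following Python sketch compares in-memory lists only; fetching the data from ApsaraDB for MongoDB and LindormTable is out of scope here. The column mapping follows the sample document and the t1 schema in this topic, and the function names expected_row and find_mismatches are hypothetical.

```python
from typing import Any, Dict, List


def expected_row(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Build the row that a source document should produce in t1."""
    nested = doc.get("map") or {}
    return {
        "title": doc.get("title"),
        "desc": doc.get("description"),
        "by1": doc.get("by"),
        "url": doc.get("url"),
        "a": nested.get("a"),
        "b": nested.get("b"),
        "likes": doc.get("likes"),
    }


def find_mismatches(
    documents: List[Dict[str, Any]], rows: List[Dict[str, Any]]
) -> List[str]:
    """Compare source documents with migrated rows and describe each difference."""
    problems: List[str] = []
    if len(documents) != len(rows):
        problems.append(f"count mismatch: {len(documents)} documents, {len(rows)} rows")
    rows_by_title = {row["title"]: row for row in rows}
    for doc in documents:
        want = expected_row(doc)
        got = rows_by_title.get(want["title"])
        if got is None:
            problems.append(f"missing row for title {want['title']!r}")
            continue
        for column, value in want.items():
            if got.get(column) != value:
                problems.append(
                    f"{want['title']!r}: column {column} is "
                    f"{got.get(column)!r}, expected {value!r}"
                )
    return problems


documents = [{
    "title": "ApsaraDB for MongoDB tutorial",
    "description": "ApsaraDB for MongoDB is a NoSQL database",
    "by": "beginner tutorial",
    "url": "http://www.runoob.com",
    "map": {"a": "mapa", "b": "mapb"},
    "likes": 100,
}]
rows = [expected_row(documents[0])]
print(find_mismatches(documents, rows))  # []
```

Because title is the primary key of t1, source documents that share a title would end up in a single row, so a lower row count does not always mean that records were dropped.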