
DataWorks:Read data from a MongoDB data source via batch synchronization

Last Updated: Mar 26, 2026

DataWorks Data Integration provides the MongoDB Reader plugin to read data from MongoDB and synchronize it with other data sources. This tutorial walks you through syncing data from MongoDB to MaxCompute using an offline sync node.

In this tutorial, you will:

  1. Prepare a MongoDB collection and a MaxCompute partitioned table with sample data

  2. Add a MongoDB data source and configure network connectivity

  3. Create and configure an offline sync node in DataStudio

  4. Submit, publish, and run the node to verify the sync result

Prerequisites

Before you begin, make sure you have:

  • A DataWorks workspace with a MaxCompute data source attached

  • An ApsaraDB for MongoDB instance that contains the data to synchronize

  • An exclusive resource group for Data Integration that can connect to both data sources

Prepare sample data

Prepare a MongoDB collection and a MaxCompute partitioned table before configuring the sync task.

Step 1: Prepare a MongoDB collection

This tutorial uses ApsaraDB for MongoDB as an example.

  1. Create a collection named di_mongodb_conf_test.

    db.createCollection('di_mongodb_conf_test')
  2. Insert the sample data into the collection.

    db.di_mongodb_conf_test.insertOne({
        'col_string':'mock string value',
        'col_int32':NumberInt("1"),
        'col_int32_min':NumberInt("-2147483648"),
        'col_int32_max':NumberInt("2147483647"),
        'col_int64':NumberLong("1234567890123456"),
        'col_int64_min':NumberLong("-9223372036854775807"),
        'col_int64_max':NumberLong("9223372036854775807"),
        'col_decimal':NumberDecimal("9999999.4999999999"),
        'col_double':9999999.99,
        'col_boolean':true,
        'col_timestamp':ISODate(),
        'col_date':new Date(),
        'col_array_to_json':['a','b'],
        'col_array_to_join':['a','b'],
        'col_doc':{
            'key_string':'mock string value',
            'key_int32':NumberInt("1"),
            'key_int32_min':NumberInt("-2147483648"),
            'key_int32_max':NumberInt("2147483647"),
            'key_int64':NumberLong("1234567890123456"),
            'key_int64_min':NumberLong("-9223372036854775807"),
            'key_int64_max':NumberLong("9223372036854775807"),
            'key_decimal':NumberDecimal("9999999.4999999999"),
            'key_double':9999999.99,
            'key_boolean':true,
            'key_timestamp':ISODate(),
            'key_date':new Date(),
            'key_array_to_json':['a','b'],
            'key_array_to_join':['a','b'],
        },
        'col_extra_1':'this is extra 1',
        'col_extra_2':'this is extra 2',
    })
  3. Verify the inserted data.

    db.getCollection("di_mongodb_conf_test").find({})

    The query result is shown in the following figure.

Step 2: Prepare a MaxCompute table

  1. Create a partitioned table named di_mongodb_conf_test with the partition field pt.

    CREATE TABLE IF NOT EXISTS di_mongodb_conf_test
    (
      `id`                 STRING
      ,`col_string`        STRING
      ,`col_int32`         INT
      ,`col_int32_min`     INT
      ,`col_int32_max`     INT
      ,`col_int64`         BIGINT
      ,`col_int64_min`     BIGINT
      ,`col_int64_max`     BIGINT
      ,`col_decimal`       DECIMAL(38,18)
      ,`col_double`        DOUBLE
      ,`col_boolean`       BOOLEAN
      ,`col_timestamp`     TIMESTAMP
      ,`col_date`          DATE
      ,`col_array_to_json` STRING
      ,`col_array_to_join` STRING
      ,`key_string`        STRING
      ,`key_int32`         INT
      ,`key_int32_min`     INT
      ,`key_int32_max`     INT
      ,`key_int64`         BIGINT
      ,`key_int64_min`     BIGINT
      ,`key_int64_max`     BIGINT
      ,`key_decimal`       DECIMAL(38,18)
      ,`key_double`        DOUBLE
      ,`key_boolean`       BOOLEAN
      ,`key_timestamp`     TIMESTAMP
      ,`key_date`          DATE
      ,`key_array_to_json` STRING
      ,`key_array_to_join` STRING
      ,`col_doc`           STRING
      ,`col_combine`       STRING
    )
    PARTITIONED BY
    (
      pt                   STRING
    )
    LIFECYCLE 36500
    ;
  2. Add a partition with the value 20230202.

    ALTER TABLE di_mongodb_conf_test ADD IF NOT EXISTS PARTITION (pt='20230202');
  3. Verify the partition was created.

    SELECT * FROM di_mongodb_conf_test WHERE pt='20230202';

Configure the offline sync task

Step 1: Add a MongoDB data source

Add a MongoDB data source and verify network connectivity between the data source and the exclusive resource group for Data Integration. For more information, see Configure a MongoDB data source.

Step 2: Create and configure an offline sync node

In DataWorks DataStudio, create an offline sync node, then configure the source, destination, and field mapping. The steps below cover the key configuration points. For parameters not mentioned here, use the default values. For more information, see Configure a task in the codeless UI.

  1. Configure network connectivity. Select the MongoDB and MaxCompute data sources and the exclusive resource group, then test connectivity.

  2. Select data sources. Select the MongoDB collection and the MaxCompute partitioned table prepared in the previous steps.

  3. Configure field mapping. For MongoDB data sources, Same-line Mapping is used by default. To manually specify fields, click the edit icon and enter the field definitions. The example below covers all field types used in this tutorial.

    {"name":"_id","type":"string"}
    {"name":"col_string","type":"string"}
    {"name":"col_int32","type":"long"}
    {"name":"col_int32_min","type":"long"}
    {"name":"col_int32_max","type":"long"}
    {"name":"col_int64","type":"long"}
    {"name":"col_int64_min","type":"long"}
    {"name":"col_int64_max","type":"long"}
    {"name":"col_decimal","type":"double"}
    {"name":"col_double","type":"double"}
    {"name":"col_boolean","type":"boolean"}
    {"name":"col_timestamp","type":"date"}
    {"name":"col_date","type":"date"}
    {"name":"col_array_to_json","type":"string"}
    {"name":"col_array_to_join","type":"array","splitter":","}
    {"name":"col_doc.key_string","type":"document.string"}
    {"name":"col_doc.key_int32","type":"document.long"}
    {"name":"col_doc.key_int32_min","type":"document.long"}
    {"name":"col_doc.key_int32_max","type":"document.long"}
    {"name":"col_doc.key_int64","type":"document.long"}
    {"name":"col_doc.key_int64_min","type":"document.long"}
    {"name":"col_doc.key_int64_max","type":"document.long"}
    {"name":"col_doc.key_decimal","type":"document.double"}
    {"name":"col_doc.key_double","type":"document.double"}
    {"name":"col_doc.key_boolean","type":"document.boolean"}
    {"name":"col_doc.key_timestamp","type":"document.date"}
    {"name":"col_doc.key_date","type":"document.date"}
    {"name":"col_doc.key_array_to_json","type":"document"}
    {"name":"col_doc.key_array_to_join","type":"document.array","splitter":","}
    {"name":"col_doc","type":"string"}
    {"name":"col_combine","type":"combine"}

    After saving the manual edits, the interface displays the source-to-destination field mapping.
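The behavior of the array, splitter, and dot-notation type settings above can be sketched in Python. This is an illustrative model only, not the plugin's actual code, and the exact spacing and quoting of serialized JSON in real output may differ.

```python
import json

# Illustrative sketch of how the MongoDB Reader's field-mapping types
# treat arrays and nested documents (not the plugin's implementation).

def read_field(doc, mapping):
    name, ftype = mapping["name"], mapping["type"]
    # A "document.*" type with dot notation in the name reads a field
    # nested inside a sub-document, e.g. col_doc.key_string.
    if ftype.startswith("document"):
        value = doc
        for part in name.split("."):
            value = value[part]
    else:
        value = doc[name]
    # Type "array" joins the elements with the configured splitter.
    if ftype.endswith("array"):
        return mapping["splitter"].join(str(v) for v in value)
    # Type "string" on a non-scalar value serializes it to JSON.
    if ftype.endswith("string") and not isinstance(value, str):
        return json.dumps(value)
    return value

doc = {
    "col_array_to_json": ["a", "b"],
    "col_array_to_join": ["a", "b"],
    "col_doc": {"key_string": "mock string value"},
}

print(read_field(doc, {"name": "col_array_to_json", "type": "string"}))
print(read_field(doc, {"name": "col_array_to_join", "type": "array", "splitter": ","}))
print(read_field(doc, {"name": "col_doc.key_string", "type": "document.string"}))
```

With the sample document above, the three mappings yield a JSON string, the splitter-joined string a,b, and the nested string value, respectively.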

Step 3: Submit and publish the node

If you use a standard mode workspace and want the task to run on a schedule in the production environment, submit and publish the node. For more information, see Publish a task.

Step 4: Run the node and verify the result

Run the offline sync node. After it completes successfully, the synchronized data is available in the MaxCompute table.

  • The content of the col_doc field is shown in the following figure.

  • The content of the col_combine field is shown in the following figure.
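Based on the mapping rules described in Appendix 1, the expected contents of these two fields can be sketched in Python. This is an illustrative model under the assumption that col_extra_1 and col_extra_2 are the only fields not named in the field mapping; the plugin's exact JSON spacing may differ.

```python
import json

# col_doc (type "string") serializes the whole sub-document to JSON.
# col_combine (type "combine") serializes every source field that is NOT
# explicitly mapped -- in this tutorial, col_extra_1 and col_extra_2.

source_doc = {
    "col_string": "mock string value",
    "col_doc": {"key_string": "mock string value", "key_int32": 1},
    "col_extra_1": "this is extra 1",
    "col_extra_2": "this is extra 2",
}
mapped_fields = {"col_string", "col_doc"}  # fields named in the field mapping

col_doc = json.dumps(source_doc["col_doc"])
col_combine = json.dumps(
    {k: v for k, v in source_doc.items() if k not in mapped_fields}
)

print(col_doc)
print(col_combine)
```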

Note

For details on the Decimal type output, see Appendix 2: Output of the Decimal type in a document.

Appendix 1: Data format transformation reference

The MongoDB Reader plugin handles arrays, nested documents, and complex types through specific type settings in the field mapping configuration. The following scenarios summarize all supported transformations.

Scenario 1: Array to JSON string

  • Raw MongoDB data: {"col_array_to_json": ["a", "b"]}

  • Field mapping configuration: {"name":"col_array_to_json","type":"string"} (set type to string to serialize the array as JSON)

  • Output to MaxCompute: [a, b]

Scenario 2: Array to concatenated string

  • Raw MongoDB data: {"col_array_to_join": ["a", "b"]}

  • Field mapping configuration: {"name":"col_array_to_join","type":"array","splitter":","} (set type to array and specify a splitter; elements are joined using the splitter)

  • Output to MaxCompute: a,b

Scenario 3: Read a field from a nested document

  • Raw MongoDB data: {"col_doc": {"key_string": "mock string value"}}

  • Field mapping configuration: {"name":"col_doc.key_string","type":"document.string"} (use dot notation in name to specify the field path within the document)

  • Output to MaxCompute: mock string value

Scenario 4: Serialize an entire nested document to JSON

  • Raw MongoDB data: {"col_doc": {"key_string": "mock string value", "key_int32": 1}}

  • Field mapping configuration: {"name":"col_doc","type":"string"} (set type to string to serialize the entire document as JSON)

  • Output to MaxCompute: {"key_string":"mock string value","key_int32":1}

Scenario 5: Serialize remaining fields to JSON

  • Raw MongoDB data: {"col_1": "value1", "col_2": "value2", "col_3": "value3", "col_4": "value4"}

  • Field mapping configuration: {"name":"col_combine","type":"combine"} (all fields not explicitly mapped are serialized to JSON; here, col_1 and col_2 are assumed to be explicitly mapped)

  • Output to MaxCompute: {"col_3":"value3","col_4":"value4"}

Appendix 2: Output of the Decimal type in a document

When a document is serialized to JSON format, a Decimal128 field is output as a JSON object by default:

{
    "key_decimal":
    {
        "finite": true,
        "high": 3471149412795809792,
        "infinite": false,
        "low": 99999994999999999,
        "naN": false,
        "negative": false
    }
}
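The high and low fields in this JSON object appear to be the two 64-bit halves of the raw IEEE 754-2008 decimal128 encoding. As a sketch, the value can be recovered from them in Python; this is an assumption-based illustration that handles only simple finite values whose significand fits entirely in the low 64 bits (as in this example), not a general Decimal128 decoder.

```python
from decimal import Decimal

# Sketch: decode a simple finite decimal128 value from its high/low
# 64-bit halves (assumed raw IEEE 754-2008 bits; not a general decoder).

def decode_simple_decimal128(high, low):
    sign = -1 if (high >> 63) & 1 else 1
    # For values whose combination field does not start with binary 11,
    # the biased exponent occupies bits 49-62; the bias is 6176.
    exponent = ((high >> 49) & 0x3FFF) - 6176
    # Here the significand fits in the low 64 bits.
    return sign * Decimal(low).scaleb(exponent)

print(decode_simple_decimal128(3471149412795809792, 99999994999999999))
```

With the sample bits shown above, this reproduces the original value 9999999.4999999999.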

To output the value as a number type, follow these steps:

  1. In the offline sync task editor, click Convert to Script to switch to the code editor.

  2. In the Reader configuration, add the decimal128OutputType parameter and set its value to bigDecimal.

  3. Rerun the task. The Decimal128 field is now output as a numeric string:

    {
        "key_decimal": "9999999.4999999999"
    }