This topic describes the data types and parameters that are supported by MongoDB Reader and how to configure MongoDB Reader by using the codeless user interface (UI) and code editor.

MongoDB Reader connects to a remote MongoDB database by using the Java client MongoClient and reads data from the database. The locking feature in the latest version of MongoDB is improved from database-level locking to document-level locking. This enables MongoDB Reader to efficiently read data from MongoDB databases by using the powerful indexing capabilities in MongoDB.
Note
  • If you use ApsaraDB for MongoDB, the MongoDB database has a root account by default. For security purposes, Data Integration can access a MongoDB database only by using a MongoDB database account. When you add a MongoDB data source, do not use the root account for access.
  • The query parameter does not support the JavaScript syntax.

MongoDB Reader shards data in a MongoDB database based on specific rules, reads data from the database by using parallel threads, and then converts the data to a format that is readable to Data Integration.

Data types

MongoDB Reader supports most MongoDB data types. Make sure that the data types of your database are supported.

The following table lists the data types that are supported by MongoDB Reader.
Data Integration data type    MongoDB data type
LONG                          INT, LONG, document.INT, and document.LONG
DOUBLE                        DOUBLE and document.DOUBLE
STRING                        STRING, ARRAY, document.STRING, document.ARRAY, and COMBINE
DATE                          DATE and document.DATE
BOOLEAN                       BOOLEAN and document.BOOLEAN
BYTES                         BYTES and document.BYTES
Note The DOCUMENT data type is used to store embedded documents. It is also called the OBJECT data type.

When you use the COMBINE data type, take note of the following items:

When MongoDB Reader reads data from a MongoDB database, MongoDB Reader combines multiple fields in MongoDB documents into a JSON string.

For example, doc1, doc2, and doc3 are three MongoDB documents that contain different fields. The fields are represented by keys instead of key-value pairs. The keys a and b are common fields in all of the three documents. The key x_n represents a document-specific field.

doc1: a b x_1 x_2

doc2: a b x_2 x_3 x_4

doc3: a b x_5

To import the preceding three MongoDB documents to MaxCompute, specify the following items in the configuration file: the fields that you want to retain, a name for each JSON string that is obtained, and COMBINE as the data type of each obtained JSON string. Make sure that the name of each obtained JSON string is different from the names of the existing fields in the documents.
"column": [
    {
        "name": "a",
        "type": "string"
    },
    {
        "name": "b",
        "type": "string"
    },
    {
        "name": "doc",
        "type": "combine"
    }
]
The following table lists the output in MaxCompute.
odps_column1    odps_column2    odps_column3
a               b               {x_1,x_2}
a               b               {x_2,x_3,x_4}
a               b               {x_5}
Note

When you combine multiple fields in a MongoDB document and set the data type of each obtained JSON string to COMBINE, the result that is exported to MaxCompute contains only fields specific to the document. Common fields are automatically deleted.

In the preceding example, a and b are common fields in all three documents. Combining the fields of doc1 (a b x_1 x_2) into a JSON string of the COMBINE data type first yields {a,b,x_1,x_2}. When the result is exported to MaxCompute, the common fields a and b are deleted, which leaves {x_1,x_2}.
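
The COMBINE behavior described above can be sketched in a few lines of Python. This is only an illustration of the documented result, not the reader's implementation. It assumes that the fields removed from the combined JSON string are exactly the fields that other columns name, which matches the example but is not stated outright. The function name read_row and the field values are hypothetical.

```python
import json


def read_row(document, columns):
    """Emulate one output row for a column list that contains a combine column."""
    # Fields that are named by a non-combine column are emitted on their own.
    named = [c["name"] for c in columns if c["type"] != "combine"]
    row = []
    for col in columns:
        if col["type"] == "combine":
            # Assumption: every field not named elsewhere goes into one JSON string.
            rest = {k: v for k, v in document.items() if k not in named}
            row.append(json.dumps(rest, sort_keys=True))
        else:
            row.append(document.get(col["name"]))
    return row


columns = [
    {"name": "a", "type": "string"},
    {"name": "b", "type": "string"},
    {"name": "doc", "type": "combine"},
]

# Hypothetical values for doc1, doc2, and doc3 from the example.
docs = [
    {"a": "a1", "b": "b1", "x_1": 1, "x_2": 2},
    {"a": "a2", "b": "b2", "x_2": 2, "x_3": 3, "x_4": 4},
    {"a": "a3", "b": "b3", "x_5": 5},
]

for doc in docs:
    print(read_row(doc, columns))
```

Each printed row has three entries that correspond to odps_column1, odps_column2, and odps_column3. The third entry carries only the document-specific fields.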

Limits

  • A maximum of one parallel thread can be used to read data from the source or write data to the destination in a synchronization node that uses MongoDB Reader.
  • The shard key must be a field of an integer data type. Otherwise, non-consecutive shards may be generated, and data may be lost.
  • MongoDB Reader can read data from only MongoDB 4.X data sources.

Parameters

Parameter Description
datasource The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.
collectionName The name of the collection in the MongoDB database.
column The names of the document fields from which you want to read data. Specify the names in an array.
  • name: the name of a field.
  • type: the data type of a field. Valid values:
    • string: string.
    • long: integer.
    • double: floating point.
    • date: date.
    • bool: Boolean.
    • bytes: binary.
    • arrays: MongoDB Reader reads data from the MongoDB documents as a JSON array, such as ["a","b","c"].
    • array: MongoDB Reader reads data from the MongoDB documents as a common array, in which elements are separated by delimiters, such as a,b,c. We recommend that you set type to arrays.
    • combine: MongoDB Reader combines multiple fields in the MongoDB documents into a JSON string.
  • splitter: the delimiter. Configure this parameter only if you want to convert an array to a string. MongoDB supports arrays, but Data Integration does not. The array elements that are read by MongoDB Reader are joined into a string by using this delimiter.
batchSize The number of data records that are read at a time. This parameter is optional. Default value: 1000.
cursorTimeoutInMs The timeout period of the cursor. Unit: milliseconds. This parameter is optional. Default value: 600000. The default value 600000 is equivalent to 10 minutes. If you set this parameter to a negative number, the cursor never times out.
Note
  • We recommend that you do not set this parameter to a negative number. If you set this parameter to a negative number and the MongoDB client unexpectedly exits, the cursor that never times out persists in the MongoDB server until the MongoDB server is restarted.
  • If the cursor times out, you can perform one of the following operations to fix the issue:
    • Specify a small value for the batchSize parameter.
    • Specify a large value for the cursorTimeoutInMs parameter.
query The condition that is used to filter data from MongoDB. This parameter is optional. Only data of the time type can be filtered. For example, you can specify "query":"{'operationTime':{'$gte':ISODate('${last_day}T00:00:00.424+0800')}}" to obtain data in which the time that is specified by operationTime is not earlier than 00:00 on the day that is specified by ${last_day}. ${last_day} is a scheduling parameter of DataWorks. Specify last_day in the yyyy-mm-dd format. Based on your business requirements, you can use the following items that are supported by MongoDB: comparison operators such as $gt, $lt, $gte, and $lte, logical operators such as $and and $or, and functions such as max, min, sum, avg, and ISODate.
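
Two of the preceding parameters are easier to follow with concrete strings, so the following Python sketch shows them. The first function shows how the arrays and array column types differ for the same MongoDB array. The second shows the query string after the ${last_day} placeholder is filled in. DataWorks performs that substitution itself at scheduling time. The code only reproduces the resulting text. The function names, the tags field, and the sample date are hypothetical.

```python
import json
from datetime import date


def render_array(values, column):
    """Turn a MongoDB array into the string form that the column type asks for."""
    if column["type"] == "arrays":
        # JSON array, for example ["a","b","c"].
        return json.dumps(values, separators=(",", ":"))
    if column["type"] == "array":
        # Elements joined by the configured delimiter, for example a,b,c.
        return column["splitter"].join(str(v) for v in values)
    raise ValueError(f"not an array column type: {column['type']}")


QUERY = "{'operationTime':{'$gte':ISODate('${last_day}T00:00:00.424+0800')}}"


def expand_last_day(query, day):
    """Replace the ${last_day} placeholder with a date in the yyyy-mm-dd format."""
    return query.replace("${last_day}", day.strftime("%Y-%m-%d"))


tags = ["a", "b", "c"]
print(render_array(tags, {"name": "tags", "type": "arrays"}))
print(render_array(tags, {"name": "tags", "type": "array", "splitter": ","}))
print(expand_last_day(QUERY, date(2024, 1, 15)))
```

The sketch uses plain string replacement instead of string.Template because Template would also treat $gte as a placeholder.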

Configure MongoDB Reader by using the codeless UI

Create a synchronization node and configure the node. For more information, see Configure a synchronization node by using the codeless UI.

You must perform the following steps on the configuration tab of the synchronization node:
  1. Configure data sources.
    Configure Source and Target for the synchronization node.
    Parameter Description
    Connection The name of the data source from which you want to read data. This parameter is equivalent to the datasource parameter that is described in the preceding section.
    CollectionName The name of the collection in the MongoDB database. This parameter is equivalent to the collectionName parameter that is described in the preceding section.
    BatchSize The number of data records to read from the MongoDB database at a time. Default value: 1000.
    CursorTimeoutInMs The timeout period of the cursor. Default value: 600000. Unit: milliseconds. If you set this parameter to a negative number, the cursor never times out.
    Query Conditions This parameter is equivalent to the query parameter that is described in the preceding section. You can configure this parameter to filter data from MongoDB.
  2. Configure field mappings. This operation is equivalent to setting the column parameter that is described in the preceding section. By default, the system maps the field in a row of the source to the field in the same row of the destination. You can click the icon to manually edit fields in the MongoDB documents.
  3. Configure channel control policies.
    Parameter Description
    Expected Maximum Concurrency The maximum number of parallel threads that the synchronization node can use to read data from the source or write data to the destination. You can configure the parallelism for the synchronization node on the codeless UI.
    Note You can set this parameter only to 1.
    Bandwidth Throttling Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.
    Dirty Data Records Allowed The maximum number of dirty data records allowed.
    Distributed Execution Specifies whether to enable the distributed execution mode. This mode splits your node into slices and distributes them to multiple Elastic Compute Service (ECS) instances for parallel execution, which speeds up synchronization. If you use a large number of parallel threads to run your synchronization node in distributed execution mode, excessive access requests are sent to the data sources. Therefore, evaluate the access load on the data sources before you use this mode. You can enable this mode only if you use an exclusive resource group for Data Integration. For more information about exclusive resource groups for Data Integration, see Exclusive resource groups for Data Integration and Create and use an exclusive resource group for Data Integration.

Configure MongoDB Reader by using the code editor

For more information about how to configure a synchronization node by using the code editor, see Create a synchronization node by using the code editor.

In the following code, a synchronization node is configured to read data from a MongoDB database. For more information about the parameters, see the preceding parameter description.
Notice
  • Delete the comments from the following code before you run the code.
  • MongoDB Reader cannot read only some of the elements in an array. An array is read as a whole.
{
    "type":"job",
    "version":"2.0", // The version number. 
    "steps":[
        {
            "category": "reader",
            "name": "Reader",
            "parameter": {
                "datasource": "datasourceName", // The name of the data source. 
                "collectionName": "tag_data", // The name of the collection in the MongoDB database. 
                "query": "", // The condition that is used to filter data from MongoDB. 
                "column": [
                    {
                        "name": "unique_id", // The name of the field. 
                        "type": "string" // The data type of the field. 
                    },
                    {
                        "name": "sid",
                        "type": "string"
                    },
                    {
                        "name": "user_id",
                        "type": "string"
                    },
                    {
                        "name": "auction_id",
                        "type": "string"
                    },
                    {
                        "name": "content_type",
                        "type": "string"
                    },
                    {
                        "name": "pool_type",
                        "type": "string"
                    },
                    {
                        "name": "frontcat_id",
                        "type": "array",
                        "splitter": ""
                    },
                    {
                        "name": "categoryid",
                        "type": "array",
                        "splitter": ""
                    },
                    {
                        "name": "gmt_create",
                        "type": "string"
                    },
                    {
                        "name": "taglist",
                        "type": "array",
                        "splitter": " "
                    },
                    {
                        "name": "property",
                        "type": "string"
                    },
                    {
                        "name": "scorea",
                        "type": "int"
                    },
                    {
                        "name": "scoreb",
                        "type": "int"
                    },
                    {
                        "name": "scorec",
                        "type": "int"
                    },
                    {
                        "name": "a.b",
                        "type": "document.int"
                    },
                    {
                        "name": "a.b.c",
                        "type": "document.array",
                        "splitter": " "
                    }
                ]
            },
            "stepType": "mongodb"
        },
        { 
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true, // Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1, // The maximum number of parallel threads. 
            "mbps":"12" // The maximum transmission rate.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
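
The notice before the sample asks you to delete the comments before you run the code. The following Python sketch shows one possible way to do that locally and to confirm that the result is valid JSON before you paste it into the code editor. It is not a DataWorks tool. It assumes that the only comments are // line comments, and the function name strip_line_comments is hypothetical. The embedded snippet is a shortened copy of the setting block.

```python
import json


def strip_line_comments(text):
    """Remove // comments that appear outside of JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip the comment but keep the newline that ends it.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


annotated = """
{
    "setting": {
        "speed": {
            "throttle": true, // Specifies whether to enable bandwidth throttling.
            "concurrent": 1, // The maximum number of parallel threads.
            "mbps": "12" // The maximum transmission rate.
        }
    }
}
"""

# json.loads raises json.JSONDecodeError if the cleaned text is not valid JSON.
config = json.loads(strip_line_comments(annotated))
print(config["setting"]["speed"])
```

Because json.loads rejects malformed input, the same check also reports problems such as a missing or trailing comma.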