This topic describes the data types and parameters that are supported by MongoDB Reader and how to configure MongoDB Reader by using the codeless user interface (UI) and code editor.
- An ApsaraDB for MongoDB database has a root account by default. For security purposes, Data Integration can access a MongoDB database only by using a regular MongoDB database account. Do not use the root account when you add a MongoDB data source.
- The query parameter does not support the JavaScript syntax.
MongoDB Reader shards data in a MongoDB database based on specified rules, reads data from the database by using parallel threads, and then converts the data to a format that is readable to Data Integration.
Data types
MongoDB Reader supports most MongoDB data types. Make sure that the data types of your database are supported.
| Data Integration data type | MongoDB data type |
| --- | --- |
| LONG | INT, LONG, document.INT, and document.LONG |
| DOUBLE | DOUBLE and document.DOUBLE |
| STRING | STRING, ARRAY, document.STRING, document.ARRAY, and COMBINE |
| DATE | DATE and document.DATE |
| BOOLEAN | BOOLEAN and document.BOOLEAN |
| BYTES | BYTES and document.BYTES |
When you use the COMBINE data type, take note of the following items:
When MongoDB Reader reads data from a MongoDB database, MongoDB Reader combines multiple fields in MongoDB documents into a JSON string.
For example, doc1, doc2, and doc3 are three MongoDB documents that contain different fields. For brevity, the fields are represented by their keys instead of full key-value pairs. The keys a and b are common to all three documents, and each key x_n represents a field that appears in only some of the documents.
doc1: a b x_1 x_2
doc2: a b x_2 x_3 x_4
doc3: a b x_5
"column": [
{
"name": "a",
"type": "string"
},
{
"name": "b",
"type": "string"
},
{
"name": "doc",
"type": "combine"
}
]
| odps_column1 | odps_column2 | odps_column3 |
| --- | --- | --- |
| a | b | {x_1,x_2} |
| a | b | {x_2,x_3,x_4} |
| a | b | {x_5} |
When you combine multiple fields in a MongoDB document and set the data type of each obtained JSON string to COMBINE, the result that is exported to MaxCompute contains only fields specific to the document. Common fields are automatically deleted.
In the preceding example, a and b are common fields in all of the three documents.
After the fields in doc1 (a b x_1 x_2) are combined and the data type of the obtained JSON string is set to COMBINE, the intermediate result is {a,b,x_1,x_2}. When the result is exported to MaxCompute, the common fields a and b are deleted, and the final result is {x_1,x_2}.
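The COMBINE behavior described above can be sketched in plain Python. This is a hypothetical illustration of the semantics, not Data Integration's actual implementation; the function name and row layout are assumptions for the example.

```python
import json

# Hypothetical sketch of COMBINE semantics: declared fields become
# ordinary columns, and all remaining (document-specific) fields are
# packed into one JSON string column.
def combine_rows(docs, declared_fields):
    rows = []
    for doc in docs:
        # Fields not declared as their own columns go into the COMBINE column.
        extras = {k: v for k, v in doc.items() if k not in declared_fields}
        rows.append([doc.get(f) for f in declared_fields] + [json.dumps(extras)])
    return rows

docs = [
    {"a": 1, "b": 2, "x_1": 10, "x_2": 20},             # doc1
    {"a": 1, "b": 2, "x_2": 20, "x_3": 30, "x_4": 40},  # doc2
    {"a": 1, "b": 2, "x_5": 50},                        # doc3
]
rows = combine_rows(docs, ["a", "b"])
```

With the common fields a and b declared as columns, each row's last value holds only the document-specific fields, matching the {x_1,x_2}, {x_2,x_3,x_4}, {x_5} results in the table above.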
Limits
A maximum of one parallel thread can be used to read data from the source or write data to the destination in a synchronization node that uses MongoDB Reader.
Parameters
| Parameter | Description |
| --- | --- |
| datasource | The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. |
| collectionName | The name of the collection in the MongoDB database. |
| column | The names of the document fields from which you want to read data. Specify the names in an array. |
| batchSize | The number of data records that are read at a time. This parameter is optional. Default value: 1000. |
| cursorTimeoutInMs | The timeout period of the cursor. Unit: milliseconds. This parameter is optional. Default value: 600000, which is equivalent to 10 minutes. If you set this parameter to a negative number, the cursor never times out. |
| query | The condition that is used to filter data from MongoDB. Only filters on time-type fields are supported. For example, you can specify "query":"{'operationTime':{'$gte':ISODate('${last_day}T00:00:00.424+0800')}}" to obtain the data in which the time specified by operationTime is not earlier than 00:00 on the day specified by ${last_day}. ${last_day} is a scheduling parameter of DataWorks in the yyyy-mm-dd format. You can also use comparison operators such as $gt, $lt, $gte, and $lte, logical operators such as and and or, and MongoDB functions such as max, min, sum, avg, and ISODate based on your business requirements. This parameter is optional. |
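The ISODate filter in the query example corresponds to an ordinary MongoDB filter document. The following Python sketch builds an equivalent filter with the standard library; the variable names are illustrative, and ${last_day} is shown resolved to yesterday's date the way the DataWorks scheduler would resolve it.

```python
from datetime import datetime, timedelta, timezone

# ${last_day} is a DataWorks scheduling parameter; for illustration we
# resolve it manually to yesterday's date in yyyy-mm-dd format.
last_day = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%d")

# Equivalent of:
#   "query": "{'operationTime':{'$gte':ISODate('${last_day}T00:00:00.424+0800')}}"
tz_cn = timezone(timedelta(hours=8))  # the +0800 offset in the ISODate literal
cutoff = datetime.strptime(
    f"{last_day}T00:00:00.424", "%Y-%m-%dT%H:%M:%S.%f"
).replace(tzinfo=tz_cn)

mongo_filter = {"operationTime": {"$gte": cutoff}}
```

A driver such as PyMongo would accept this dictionary directly as a find() filter; MongoDB Reader expresses the same condition as the JSON string in the query parameter.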
Configure MongoDB Reader by using the codeless UI
Create a synchronization node and configure the node. For more information, see Configure a synchronization node by using the codeless UI.
- Configure data sources.
Configure Source and Target for the synchronization node.
| Parameter | Description |
| --- | --- |
| Connection | The name of the data source from which you want to read data. This parameter is equivalent to the datasource parameter that is described in the preceding section. |
| CollectionName | The name of the collection in the MongoDB database. This parameter is equivalent to the collectionName parameter that is described in the preceding section. |
| BatchSize | The number of data records to read at a time from the MongoDB database. Default value: 1000. |
| CursorTimeoutInMs | The timeout period of the cursor. Unit: milliseconds. Default value: 3600000. If you set this parameter to a negative number, the cursor never times out. |
| Query Conditions | This parameter is equivalent to the query parameter that is described in the preceding section. You can configure this parameter to filter data from MongoDB. |
- Configure field mappings. This operation is equivalent to setting the column parameter that is described in the preceding section. By default, the system maps the field in a row of the source to the field in the same row of the destination. You can click the edit icon to manually edit fields in the MongoDB documents.
- Configure channel control policies.
| Parameter | Description |
| --- | --- |
| Expected Maximum Concurrency | The maximum number of parallel threads that the synchronization node can use to read data from the source or write data to the destination. You can configure the parallelism for the synchronization node on the codeless UI. Note: You can set this parameter only to 1. |
| Bandwidth Throttling | Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source. |
| Dirty Data Records Allowed | The maximum number of dirty data records allowed. |
| Distributed Execution | The distributed execution mode allows you to split your node into pieces and distribute them to multiple Elastic Compute Service (ECS) instances for parallel execution, which speeds up synchronization. If you use a large number of parallel threads to run your synchronization node in distributed execution mode, excessive access requests are sent to the data sources. Therefore, before you use the distributed execution mode, you must evaluate the access load on the data sources. You can enable this mode only if you use an exclusive resource group for Data Integration. For more information, see Exclusive resource groups for Data Integration and Create and use an exclusive resource group for Data Integration. |
Configure MongoDB Reader by using the code editor
For more information about how to configure a synchronization node by using the code editor, see Create a synchronization node by using the code editor.
- Delete the comments from the following code before you run the code.
- MongoDB Reader cannot read some elements in arrays.
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"category": "reader",
"name": "Reader",
"parameter": {
"datasource": "datasourceName", // The name of the data source.
"collectionName": "tag_data", // The name of the collection in the MongoDB database.
"query": "", // The condition that is used to filter data from MongoDB.
"column": [
{
"name": "unique_id", // The name of the field.
"type": "string" // The data type of the field.
},
{
"name": "sid",
"type": "string"
},
{
"name": "user_id",
"type": "string"
},
{
"name": "auction_id",
"type": "string"
},
{
"name": "content_type",
"type": "string"
},
{
"name": "pool_type",
"type": "string"
},
{
"name": "frontcat_id",
"type": "array",
"splitter": ""
},
{
"name": "categoryid",
"type": "array",
"splitter": ""
},
{
"name": "gmt_create",
"type": "string"
},
{
"name": "taglist",
"type": "array",
"splitter": " "
},
{
"name": "property",
"type": "string"
},
{
"name": "scorea",
"type": "int"
},
{
"name": "scoreb",
"type": "int"
},
{
"name": "scorec",
"type": "int"
},
{
"name": "a.b",
"type": "document.int"
},
{
"name": "a.b.c",
"type": "document.array",
"splitter": " "
}
]
},
"stepType": "mongodb"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of dirty data records allowed.
},
"speed":{
"throttle":true, // Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":1, // The maximum number of parallel threads.
"mbps":"12" // The maximum transmission rate.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
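As the note before the sample says, the configuration above contains // comments and therefore will not parse as strict JSON until they are removed. A minimal helper for stripping them is sketched below; the regex is a simplification that assumes no // sequences occur inside string values, which holds for this sample.

```python
import json
import re

# Hypothetical helper, not part of Data Integration: strip "//" line
# comments so the annotated sample parses as strict JSON.
def strip_line_comments(text):
    # Simplification: assumes "//" never appears inside a string value.
    return re.sub(r"\s*//[^\n]*", "", text)

sample = """{
  "speed": {
    "throttle": true, // enable bandwidth throttling
    "concurrent": 1,  // maximum parallel threads
    "mbps": "12"      // maximum transmission rate
  }
}"""
config = json.loads(strip_line_comments(sample))
```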