Analysis of Alibaba Cloud Realtime Compute for Apache Flink: Deep Exploration into MongoDB Schema Inference

This article provides a deep exploration into MongoDB schema inference, focusing on the core features of MongoDB CDC Community Edition and its implementation in Realtime Compute for Apache Flink.

By Guiyuan

This article is compiled from the research: Principle Analysis and Application of Alibaba Cloud Realtime Compute for Apache Flink: Deep Exploration into MongoDB Schema Inference. The research is conducted by Guiyuan of the Alibaba Cloud Flink team. The content is mainly divided into the following four parts:

Introduction to MongoDB
Core Features of MongoDB CDC Community Edition
The Practice of MongoDB CDC in Realtime Compute for Apache Flink
Summary

1. Introduction to MongoDB

MongoDB is a document-oriented non-relational database that supports semi-structured data storage. It is also a distributed database that provides two cluster deployment modes: replica sets and sharded clusters. MongoDB is highly available and horizontally scalable, making it suitable for large-scale data storage.

MongoDB uses a weakly structured storage mode and supports flexible data structures and a wide range of data types. It is suitable for business scenarios such as JSON documents, tags, snapshots, geographic locations, and content storage. Its natively distributed architecture provides an out-of-the-box sharding mechanism and automatic rebalancing. MongoDB also provides GridFS, a distributed grid file storage feature that is suitable for storing large files such as images, audio, and videos.

2. Core Features of MongoDB CDC Community Edition

Flink CDC is a change data capture (CDC) technology based on database logs that reads full and incremental data in an integrated manner. Combined with Flink's pipeline capabilities and its rich upstream and downstream ecosystem, Flink CDC can capture and process changes from a variety of databases in real time and write them to downstream systems. MongoDB is one of the supported databases. The main supported features include:

Exactly-once semantics
Full and incremental subscriptions
Snapshot data filtering
Recovery from checkpoints and savepoints
Metadata extraction

MongoDB CDC Community Edition uses the Change Streams feature introduced in MongoDB 3.6 to implement the MongoDB CDC table source, converting change streams into Flink upsert changelogs. In MongoDB versions earlier than 6.0, change streams do not provide the pre-change version of a document or the content of a deleted document by default. With the information that is available, only upsert semantics can be implemented, as shown in the following figure. As a result, the Flink job needs a ChangelogNormalize node, which keeps the latest state of each document, to produce a complete changelog.

MongoDB 6.0 introduced the pre- and post-image feature, which provides a more efficient solution: when changeStreamPreAndPostImages is enabled for a collection, MongoDB records the complete state of the document before and after each change in a special collection. MongoDB CDC can read these records and generate a complete event stream, which eliminates the dependency on ChangelogNormalize nodes. Both the community edition and Realtime Compute for Apache Flink support this feature.
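The difference between the two modes can be illustrated with a minimal Python sketch. It is not the connector's implementation. It assumes change events shaped like MongoDB change stream events (operationType, documentKey, fullDocument, fullDocumentBeforeChange) and uses Flink's row kind abbreviations (+I, -U, +U, -D). The function name to_changelog is hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

# (row kind, document); the row kinds follow Flink's +I / -U / +U / -D notation.
Row = Tuple[str, Optional[Dict[str, Any]]]


def to_changelog(event: Dict[str, Any]) -> List[Row]:
    """Convert one change stream event into changelog rows (illustrative only)."""
    op = event["operationType"]
    after = event.get("fullDocument")
    # The pre-image is present only when pre- and post-images are enabled.
    before = event.get("fullDocumentBeforeChange")

    if op == "insert":
        return [("+I", after)]
    if op in ("update", "replace"):
        if before is not None:
            # Pre-image available: emit a complete before/after pair.
            return [("-U", before), ("+U", after)]
        # No pre-image: only the new state is known (upsert semantics).
        return [("+U", after)]
    if op == "delete":
        # Without a pre-image, only the document key is known.
        return [("-D", before if before is not None else event["documentKey"])]
    return []  # other event types are ignored in this sketch
```

Without a pre-image, an update yields only a +U row, so a downstream operator has to remember the previous state to derive the -U row. With a pre-image, the event itself carries both sides of the change.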

3. The Practice of MongoDB CDC in Realtime Compute for Apache Flink

MongoDB CDC Community Edition is very powerful as a pure engine. From the perspective of a commercial product, however, it has a shortcoming: it does not support schema changes.

As a NoSQL database, MongoDB does not enforce a fixed schema, and schema changes are common. However, MongoDB CDC Community Edition supports only fixed schemas. It also requires users to define table schemas manually, which is inconvenient.

To address the preceding deficiencies, Realtime Compute for Apache Flink provides MongoDB catalogs to support schema inference for MongoDB without the need to manually define schemas. In addition, you can use the CTAS or CDAS statement to synchronize schema changes of upstream tables to downstream tables while synchronizing MongoDB data in real time. This improves the efficiency of creating tables and maintaining schema changes.

3.1 Implementation of Schema Inference

MongoDB schema inference is implemented by using MongoDB catalogs. MongoDB catalogs infer the schema of collections and can be used as Flink source tables, dimension tables, or result tables without the need to manually specify DDL statements. Schema inference includes the following steps:

3.1.1 Data Sampling

MongoDB catalogs sample 100 documents from the collection by default. If the number of documents in the collection is less than this value, all data in the collection is obtained.

You can change the number of sampled documents with the max.fetch.records option of MongoDB catalogs.
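As a rough illustration of the sampling step, the following Python sketch caps the number of documents used for inference. The article does not say whether the catalog takes the first documents or a random sample, so taking the first ones is an assumption here. The function name sample_documents is hypothetical, and its max_fetch_records parameter only mirrors the option name.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def sample_documents(
    documents: Iterable[Dict[str, Any]], max_fetch_records: int = 100
) -> List[Dict[str, Any]]:
    """Return at most max_fetch_records documents.

    Assumption: the first documents are taken. If the collection holds fewer
    documents than the limit, all of them are returned.
    """
    return list(islice(documents, max_fetch_records))
```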

3.1.2 Schema Parsing

In MongoDB, each document is stored as BSON. The BSON type system is a superset of the JSON type system: in addition to the JSON types, it supports types such as DateTime and Binary. When the schema of a single BSON document is parsed, each BSON type maps to one Flink SQL type. A document nested in a BSON document is parsed as STRING by default.

To better resolve nested document types, MongoDB catalogs provide the scan.flatten-nested-columns.enabled option, which recursively parses the fields of nested documents. Assume that the initial BSON document is as follows:

{
  "unnested": "value",
  "nested": {
    "col1": 99,
    "col2": true
  }
}

If scan.flatten-nested-columns.enabled is set to false (the default), the schema contains two columns:

Column name	Flink SQL data type
unnested	STRING
nested	STRING

If scan.flatten-nested-columns.enabled is set to true, the schema contains three columns:

Column name	Flink SQL data type
unnested	STRING
nested.col1	INT
nested.col2	BOOLEAN

In addition, MongoDB catalogs provide the scan.primitive-as-string option, which maps all BSON primitive data types to STRING.
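The parsing behavior described in this section can be sketched in Python as follows. This is a simplified illustration, not the catalog's code. Plain Python values stand in for BSON values, and the type mapping is an assumption, in particular the mappings for datetime and bytes and the use of BIGINT for integers outside the 32-bit range. The names flink_type and parse_schema are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def flink_type(value: Any, primitive_as_string: bool = False) -> str:
    """Map a Python stand-in for a BSON value to a Flink SQL type name (assumed mapping)."""
    if isinstance(value, (dict, list)):
        return "STRING"  # unflattened documents and arrays are kept as strings
    if primitive_as_string:
        return "STRING"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INT" if INT32_MIN <= value <= INT32_MAX else "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, datetime):
        return "TIMESTAMP_LTZ(3)"
    if isinstance(value, bytes):
        return "BYTES"
    return "STRING"


def parse_schema(
    doc: Dict[str, Any],
    flatten_nested: bool = False,
    primitive_as_string: bool = False,
    prefix: str = "",
) -> Dict[str, str]:
    """Parse one document into a mapping of column name to Flink SQL type name."""
    schema: Dict[str, str] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if flatten_nested and isinstance(value, dict):
            # Recurse into the nested document and prefix its fields with "parent."
            schema.update(parse_schema(value, True, primitive_as_string, name + "."))
        else:
            schema[name] = flink_type(value, primitive_as_string)
    return schema
```

For the sample document above, parse_schema(doc) produces the two-column schema and parse_schema(doc, flatten_nested=True) produces the three-column schema with nested.col1 and nested.col2.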

3.1.3 Schema Merging

After sampling a set of BSON documents, the MongoDB catalog parses them one by one and merges the parsed physical columns into a result schema, which is then used as the schema of the entire collection. The merging rules are as follows:

If the physical columns parsed from the current BSON document contain fields that are not in the result schema, the MongoDB catalog automatically adds these fields to the result schema.
If a physical column parsed from the current BSON document has the same name as a column in the result schema but a different type, the nearest common parent node of the two types in the tree structure shown in the following figure is used as the type of the column.

For example, assume that a collection contains the following three documents:

{
  "_id": {
    "$oid": "100000000000000000000101"
  },
  "name": "Alice",
  "age": 10,
  "phone": {
    "mother": "111",
    "father": "222"
  }
}

{
  "_id": {
    "$oid": "100000000000000000000102"
  },
  "name": "Bob",
  "age": 20,
  "phone": {
    "mother": "333",
    "father": "444"
  },
  "address": ["Shanghai"],
  "desc": 1024
}

{
  "_id": {
    "$oid": "100000000000000000000103"
  },
  "name": "John",
  "age": 30,
  "phone": {
    "mother": "555",
    "father": "666"
  },
  "address": ["Shanghai"],
  "desc": "test value"
}

In these three BSON documents, the last two contain the address and desc fields, which the first one does not. Schema merging adds both fields to the final schema. The desc field has different types in the last two documents: when each document is parsed on its own, desc maps to the Flink SQL types INT and STRING, respectively. According to the type merging rule above, the desc field is eventually inferred as STRING.

Therefore, the final schema of the MongoDB catalog is as follows:

Column name	Flink SQL data type	Description
_id	STRING NOT NULL	The primary key field
name	STRING
age	INT
phone	STRING
address	STRING
desc	STRING	Types merged into STRING

In MongoDB, each document has a special field, _id, which uniquely identifies the document in a collection. If you do not provide a value, this field is automatically generated when the document is created.

MongoDB catalogs use the _id column as the primary key and add the default primary key constraint to ensure that data is not duplicated.
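The merging rules in this section can be sketched in Python as follows. The type tree from the figure is not reproduced in this text, so the PARENT map below is an assumed, simplified tree with STRING at the root. It is not the catalog's actual tree. The helper names infer_type, common_type, and merge_schemas are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional

# Assumed, simplified type tree: each type points to its parent, STRING is the root.
PARENT: Dict[str, Optional[str]] = {
    "INT": "BIGINT",
    "BIGINT": "DOUBLE",
    "DOUBLE": "STRING",
    "BOOLEAN": "STRING",
    "STRING": None,
}


def infer_type(value: Any) -> str:
    """Very small stand-in for single-document type parsing."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"  # strings, nested documents, and arrays


def ancestors(type_name: str) -> List[str]:
    """Return the type followed by all of its parents up to the root."""
    chain: List[str] = []
    current: Optional[str] = type_name
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


def common_type(a: str, b: str) -> str:
    """Return the nearest common parent of two types in the assumed tree."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):
        if candidate in seen:
            return candidate
    return "STRING"  # types outside the assumed tree fall back to the root


def merge_schemas(docs: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Merge the per-document schemas of all sampled documents."""
    result: Dict[str, str] = {}
    for doc in docs:
        for name, value in doc.items():
            parsed = infer_type(value)
            # New column: add it. Existing column: widen to the common parent.
            result[name] = common_type(result[name], parsed) if name in result else parsed
    return result
```

Applied to the three sample documents (with _id represented as a plain string), this sketch keeps age as INT, adds address and desc, and widens desc from INT to STRING, which matches the final schema in the table.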

3.2 Implementation of Schema Evolution

When a table in a MongoDB catalog is used as a CDC source, the data in the collection may undergo schema changes, such as new fields or changed field types. The connector must therefore handle schema evolution when it processes data.

In the MongoDB CDC connector, the schema inferred by the MongoDB catalog is used as the initial schema. When the connector reads an oplog record, it performs the following steps:

1. Parse the schema of the BSON document that corresponds to the current record. The procedure is the same as the single-document schema parsing described in Section 3.1.2.
2. Merge the schema parsed in Step 1 with the current schema.
3. Compare the merged schema from Step 2 with the current schema.

If they are the same, the current schema is used to parse the data.
If they are different, the current schema is updated and the schema change information is sent downstream.
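The parse, merge, and compare steps above can be sketched in Python as follows. This is a simplified illustration and not the connector's implementation. Conflicting types simply fall back to STRING instead of using the type tree, and the event names "schema_change" and "data" as well as the function names are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Schema = Dict[str, str]


def infer_type(value: Any) -> str:
    """Very small stand-in for single-document type parsing."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"


def merge(current: Schema, parsed: Schema) -> Schema:
    """Merge a parsed schema into the current one (simplified type widening)."""
    merged = dict(current)
    for name, type_name in parsed.items():
        if name not in merged:
            merged[name] = type_name  # new column
        elif merged[name] != type_name:
            merged[name] = "STRING"  # simplification: conflicting types become STRING
    return merged


def process(
    records: Iterable[Dict[str, Any]], initial_schema: Schema
) -> Iterator[Tuple[str, Any]]:
    """Yield data events, preceded by a schema change event whenever the schema evolves."""
    current = dict(initial_schema)
    for doc in records:
        parsed = {name: infer_type(value) for name, value in doc.items()}  # step 1
        merged = merge(current, parsed)  # step 2
        if merged != current:  # step 3
            current = merged
            yield ("schema_change", dict(current))
        yield ("data", doc)
```

Emitting the schema change before the record that triggered it means the downstream side already knows the new columns when that record arrives.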

3.3 Use CTAS or CDAS Statement to Synchronize Data and Schemas

The CTAS statement allows you to synchronize full and incremental data from a source table to a result table. When you synchronize data, you can also synchronize schema changes from the source table to the result table in real time. The CDAS statement supports real-time data synchronization at the database level and synchronization of schema changes.

Before you use the CTAS or CDAS statement to synchronize data, you must create a MongoDB catalog:

CREATE CATALOG <yourcatalogname> WITH(
  'type'='mongodb',
  'default-database'='<dbName>',
  'hosts'='<hosts>',
  'scheme'='<scheme>',
  'username'='<username>',
  'password'='<password>',
  'connection.options'='<connectionOptions>',
  'max.fetch.records'='100',
  'scan.flatten-nested-columns.enabled'='<flattenNestedColumns>',
  'scan.primitive-as-string'='<primitiveAsString>'
);

After you create a MongoDB catalog, you can use one of the following methods to synchronize data and schemas:

3.3.1 Use a CTAS Statement to Synchronize the Data and Schemas of a Single MongoDB Collection to the Downstream Storage

CREATE TABLE IF NOT EXISTS `${target_table_name}`
WITH(...)
AS TABLE `${mongodb_catalog}`.`${db_name}`.`${collection_name}`
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;

3.3.2 Use Multiple CTAS Statements to Synchronize Data and Schemas of Multiple MongoDB Collections to the Downstream Storage

BEGIN STATEMENT SET;

CREATE TABLE IF NOT EXISTS `some_catalog`.`some_database`.`some_table0`
AS TABLE `mongodb-catalog`.`database`.`collection0`
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;

CREATE TABLE IF NOT EXISTS `some_catalog`.`some_database`.`some_table1`
AS TABLE `mongodb-catalog`.`database`.`collection1`
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;

CREATE TABLE IF NOT EXISTS `some_catalog`.`some_database`.`some_table2`
AS TABLE `mongodb-catalog`.`database`.`collection2`
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;

END;

3.3.3 Use CDAS to Synchronize Data and Schemas of Certain Collections in a MongoDB Database to Downstream Storage

CREATE DATABASE IF NOT EXISTS `some_catalog`.`some_database`
AS DATABASE `mongodb-catalog`.`database` INCLUDING TABLE 'table-name'
/*+ OPTIONS('scan.incremental.snapshot.enabled'='true') */;

3.4 Usage Example

The following example uses Realtime Compute for Apache Flink:

Assume that we need to synchronize the data and schemas of all collections in a single database in MongoDB to Hologres. New fields may appear in the data in MongoDB.

The MongoDB database is named guiyuan_cdas_test and contains two collections, test_coll_0 and test_coll_1. We want to synchronize the data to a Hologres database with the same name, guiyuan_cdas_test.

The initial data of the two collections in MongoDB is as follows:

After you create MongoDB and Hologres catalogs, write a CDAS job on the SQL development page. MongoDB catalogs will infer the schema of the collection, so you do not need to manually define the DDL of the table.

After the deployment starts running, you can see that the guiyuan_cdas_test database has been automatically created in Hologres and that the initial data of the two tables has been synchronized:

Next, a document that contains the address field is inserted into test_coll_0, and a document that contains the phone field is inserted into test_coll_1.

Observe the Hologres table. You can see that both tables have synchronized the new data and schemas:

4. Summary

Flink CDC implements the MongoDB CDC source based on MongoDB change streams and supports integrated full and incremental data synchronization for MongoDB. Realtime Compute for Apache Flink uses MongoDB catalogs to infer MongoDB schemas. With CTAS or CDAS statements, Realtime Compute for Apache Flink can synchronize schema changes while synchronizing data. When a schema changes, there is no need to modify the Flink job to synchronize the schema to the downstream storage, which greatly improves the flexibility and convenience of data integration.
