Lindorm provides a compute engine service named Lindorm Distributed Processing System (LDPS). After LDPS is activated for a Lindorm instance, a Lindorm Change Data Capture (CDC) data source is assigned to the Lindorm instance. Changes in data stored in other engine services that are activated for the Lindorm instance are synchronized to the CDC data source. You can use Spark SQL to query these data changes from the CDC data source.

Prerequisites

  • Lindorm Tunnel Service (LTS) is activated for your Lindorm instance. For more information, see Activate and log on to LTS.
  • A subscription channel is created for LindormTable. For more information, see Create a Pull channel for data subscription.
    Note When you create a subscription channel, take note of the following points:
    • Do not select Ignore family prefix for column name in message.
    • Select json for the Serialize Type parameter.
    • One topic name corresponds to only one Lindorm table name.
  • Configure the LINDORM_HBASE_CATALOG attribute for your HBase table. For more information, see Access data in wide tables.
    Note The LINDORM_HBASE_CATALOG attribute specifies the mapping between a Spark SQL schema and the schema of the HBase table. The Lindorm CDC data source extracts the schema of the HBase table based on the value of this attribute. For an illustration of what such a mapping can look like, see the sketch after this list.
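The linked topic describes the exact format of the LINDORM_HBASE_CATALOG attribute. Purely as an illustration of the kind of mapping that the attribute expresses, the following Python sketch builds a catalog in the style of the open-source HBase-Spark connector. The table name, column names, column families, and types are hypothetical, and the format that Lindorm actually expects may differ from this sketch.

  # Hypothetical sketch of a schema-mapping catalog in the style of the
  # open-source HBase-Spark connector. See "Access data in wide tables" for
  # the format that LINDORM_HBASE_CATALOG actually requires.
  import json

  catalog = {
      "table": {"namespace": "default", "name": "test"},  # hypothetical HBase table
      "rowkey": "key",
      "columns": {
          # Spark SQL column -> HBase row key or column (family:qualifier)
          "id":   {"cf": "rowkey", "col": "key",  "type": "string"},
          "name": {"cf": "f",      "col": "name", "type": "string"},
          "age":  {"cf": "f",      "col": "age",  "type": "int"},
      },
  }

  # The mapping is stored as a JSON string so that the Lindorm CDC data source
  # can derive a Spark SQL schema for the table.
  print(json.dumps(catalog))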

Limits

  • Only HBase tables are supported. HBase tables are tables whose data is written to LindormTable by using HBase clients.
  • The real-time change tracking feature allows you to consume only change data in the JSON format.

How to submit a job

You can write and submit a Spark job for a Lindorm CDC data source as an SQL job, a JAR job, or a Python job. A minimal Python job sketch is provided after the following note.
Note For information about the syntax that is used to read data from and write data to a Lindorm CDC data source, see Configure a Lindorm CDC data source.
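The following minimal Python sketch shows what a self-written Spark job that queries the Lindorm CDC data source might look like. It assumes that the configuration items described in the Configure a Lindorm CDC data source section can be set when the SparkSession is created. The qualified table name lindorm_cdc.cdc_topic, the credentials, and the timestamp values are placeholders for illustration only.

  # Minimal sketch of a Python Spark job that reads changes from the Lindorm
  # CDC data source. The table name "cdc_topic", the credentials, and the
  # timestamp literals are hypothetical placeholders.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("lindorm-cdc-example")
      # Required for JAR and Python jobs; see the configuration items below.
      .config("spark.sql.catalog.lindorm_cdc.username", "root")
      .config("spark.sql.catalog.lindorm_cdc.password", "root")
      .getOrCreate()
  )

  # Read a bounded range of change records, as required for SELECT statements
  # on the Lindorm CDC data source.
  df = spark.sql(
      """
      SELECT *
      FROM lindorm_cdc.cdc_topic
      WHERE _cdc_timestamp_kafka > 1700000000000
        AND _cdc_timestamp_kafka < 1700003600000
      """
  )
  df.show()

  spark.stop()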

Configure a Lindorm CDC data source

Name and tables of the Lindorm CDC data source

  • The name of the Lindorm CDC data source provided by LDPS is lindorm_cdc.
  • You cannot manage namespaces in the Lindorm CDC data source. You can manage only tables in the Lindorm CDC data source. The tables in the Lindorm CDC data source use the same names as the topics that you specified when you created data subscription channels.

Schemas of the Lindorm CDC data source

The Lindorm CDC data source extracts the schemas of HBase tables based on the LINDORM_HBASE_CATALOG attribute and uses the extracted schemas as its own schemas. The Lindorm CDC data source reads data from Kafka, and each change is stored as an operation record. The following meta fields are supported in the schemas of the Lindorm CDC data source. An example query that uses these meta fields is provided after the list.
  • _cdc_timestamp_kafka
    • Type: long
    • Description: The timestamp when the operation record was written to Kafka. Unit: milliseconds.
    • Configuration: No configuration is required. The field is included in the schema by default.
  • _cdc_operation_type
    • Type: string
    • Description: The change type of the operation record. Valid values:
      • C: adds data.
      • U: updates data.
      • D: deletes data.
    • Configuration: No configuration is required. The field is included in the schema by default.
  • _cdc_timestamp_lindorm
    • Type: long
    • Description: The timestamp when the operation record was processed by a Lindorm engine service other than LDPS. Unit: milliseconds.
    • Configuration: spark.sql.catalog.lindorm_cdc.lindormTsEnabled
  • _cdc_timestamp_lts
    • Type: long
    • Description: The timestamp when the operation record was processed by LTS. Unit: milliseconds.
    • Configuration: spark.sql.catalog.lindorm_cdc.ltsTsEnabled
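As a quick illustration of how these meta fields can be queried, the following hedged sketch counts change records by type within a bounded time window. The qualified table name lindorm_cdc.cdc_topic and the timestamp literals are hypothetical, the connection configuration is omitted for brevity, and the _cdc_timestamp_lindorm and _cdc_timestamp_lts fields appear only if the corresponding configuration items in the next section are set to true.

  # Sketch: summarize change records by type by using the meta fields.
  # "lindorm_cdc.cdc_topic" and the timestamp literals are hypothetical.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("lindorm-cdc-meta-fields").getOrCreate()

  spark.sql(
      """
      SELECT _cdc_operation_type,
             COUNT(*)                  AS record_count,
             MAX(_cdc_timestamp_kafka) AS latest_kafka_ts
      FROM lindorm_cdc.cdc_topic
      WHERE _cdc_timestamp_kafka > 1700000000000
        AND _cdc_timestamp_kafka < 1700003600000
      GROUP BY _cdc_operation_type
      """
  ).show()

  spark.stop()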

Configuration items of the Lindorm CDC data source

The Lindorm CDC data source supports the following configuration items. A sketch that shows how to set them for a Python job is provided after the list.
  • spark.sql.catalog.lindorm_cdc.username
    • Required:
      • This parameter is required if you submit a JAR job or a Python job.
      • This parameter is optional if you submit an SQL job. In this case, the system automatically assigns a value to this parameter.
    • Description: The username that is used to connect to LindormTable.
    • Example: root (default username)
  • spark.sql.catalog.lindorm_cdc.password
    • Required:
      • This parameter is required if you submit a JAR job or a Python job.
      • This parameter is optional if you submit an SQL job. In this case, the system automatically assigns a value to this parameter.
    • Description: The password that is used to connect to LindormTable.
    • Example: root (default password)
  • spark.sql.catalog.lindorm_cdc.lindormTsEnabled
    • Required: No
    • Description: Specifies whether to include the timestamp at which Lindorm processed the operation record in the schema. Default value: false. If you set this parameter to true, the _cdc_timestamp_lindorm field is added to the schema of the Lindorm CDC data source.
    • Example: true
  • spark.sql.catalog.lindorm_cdc.ltsTsEnabled
    • Required: No
    • Description: Specifies whether to include the timestamp at which LTS processed the operation record in the schema. Default value: false. If you set this parameter to true, the _cdc_timestamp_lts field is added to the schema of the Lindorm CDC data source.
    • Example: true
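These configuration items are passed as Spark configuration properties. The following Python sketch shows one way to set them when a SparkSession is created for a Python job; whether you set them this way or through the Spark configuration of the submitted job depends on how you submit the job, and the placeholder credentials are assumptions.

  # Sketch: set the Lindorm CDC data source configuration items for a Python
  # job. Replace the placeholder username and password with the credentials of
  # your own Lindorm instance.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("lindorm-cdc-config")
      .config("spark.sql.catalog.lindorm_cdc.username", "root")          # required for JAR and Python jobs
      .config("spark.sql.catalog.lindorm_cdc.password", "root")          # required for JAR and Python jobs
      .config("spark.sql.catalog.lindorm_cdc.lindormTsEnabled", "true")  # adds _cdc_timestamp_lindorm
      .config("spark.sql.catalog.lindorm_cdc.ltsTsEnabled", "true")      # adds _cdc_timestamp_lts
      .getOrCreate()
  )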

Statements that are supported for the Lindorm CDC data source

The following statements can be executed on the Lindorm CDC data source. An example that applies the notes for the SELECT statement is provided after the list.
  • USE table_name
    • Description: Uses a specified table.
    • Example: USE test
  • SHOW TABLES
    • Description: Views all tables.
    • Example: SHOW TABLES
  • DESCRIBE table_name
    • Description: Views the details of a specified table.
    • Example: DESC test or DESCRIBE test
  • SELECT
    • Description: For more information about the SELECT statement, see Spark SQL.
      Note When you execute the SELECT statement, take note of the following items:
      • You must use _cdc_timestamp_kafka > $startTimestamp and _cdc_timestamp_kafka < $endTimestamp to specify the range of the data that you want to read.
      • If the value of the _cdc_operation_type field is D, only the value of the field that is specified as the row key is displayed. Empty strings are displayed for the other fields.
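The following hedged sketch applies the notes above: it reads a bounded range of change records and handles delete records separately, because a D record carries a value only for the row key field. The qualified table name lindorm_cdc.cdc_topic, the row key column id, and the timestamp literals are hypothetical, and the connection configuration is omitted for brevity.

  # Sketch: read one window of changes and separate deletes from inserts and
  # updates. "lindorm_cdc.cdc_topic", the row key column "id", and the
  # timestamp literals are hypothetical placeholders.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("lindorm-cdc-select").getOrCreate()

  changes = spark.sql(
      """
      SELECT *
      FROM lindorm_cdc.cdc_topic
      WHERE _cdc_timestamp_kafka > 1700000000000   -- $startTimestamp (required)
        AND _cdc_timestamp_kafka < 1700003600000   -- $endTimestamp (required)
      """
  )

  # For delete records (_cdc_operation_type = 'D'), only the row key field has
  # a value and the other fields are empty strings.
  deletes = changes.filter("_cdc_operation_type = 'D'").select("_cdc_operation_type", "id")
  upserts = changes.filter("_cdc_operation_type IN ('C', 'U')")

  deletes.show()
  upserts.show()
  spark.stop()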