Lindorm provides a compute engine service named Lindorm Distributed Processing System (LDPS). After LDPS is activated for a Lindorm instance, a Lindorm Change Data Capture (CDC) data source is assigned to the Lindorm instance. Changes in data stored in other engine services that are activated for the Lindorm instance are synchronized to the CDC data source. You can use Spark SQL to query these data changes from the CDC data source.

Prerequisites

  • Lindorm Tunnel Service (LTS) is activated for your Lindorm instance. For more information, see Activate and log on to LTS.
  • A subscription channel is created for LindormTable. For more information, see Create a Pull channel for data subscription.
    Note When you create a subscription channel, take note of the following points:
    • Do not select Ignore family prefix for column name in message.
    • Select json for the Serialize Type parameter.
    • One topic name corresponds to only one Lindorm table name.
  • Configure the LINDORM_HBASE_CATALOG attribute for your HBase table. For more information, see Access data in wide tables.
    Note The LINDORM_HBASE_CATALOG attribute specifies the mapping between a Spark SQL schema and the schema of the HBase table. The Lindorm CDC data source extracts the schema of the HBase table based on the value of this attribute. For an illustration of what such a mapping can look like, see the sketch after this list.
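The linked topic describes the exact format of the LINDORM_HBASE_CATALOG attribute. Purely as an illustration of the kind of mapping that the attribute expresses, the following Python sketch builds a catalog in the style of the open-source HBase-Spark connector. The table name, column names, column families, and types are hypothetical, and the format that Lindorm actually expects may differ from this sketch.

  # Hypothetical sketch of a schema-mapping catalog in the style of the
  # open-source HBase-Spark connector. See "Access data in wide tables" for
  # the format that LINDORM_HBASE_CATALOG actually requires.
  import json

  catalog = {
      "table": {"namespace": "default", "name": "test"},  # hypothetical HBase table
      "rowkey": "key",
      "columns": {
          # Spark SQL column -> HBase row key or column (family:qualifier)
          "id":   {"cf": "rowkey", "col": "key",  "type": "string"},
          "name": {"cf": "f",      "col": "name", "type": "string"},
          "age":  {"cf": "f",      "col": "age",  "type": "int"},
      },
  }

  # The mapping is stored as a JSON string so that the Lindorm CDC data source
  # can derive a Spark SQL schema for the table.
  print(json.dumps(catalog))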

Limits

  • Only HBase tables are supported. HBase tables are tables whose data is written to LindormTable by using HBase clients.
  • The real-time change tracking feature allows you to consume only change data in the JSON format.

How to submit a job

You can write and submit a Spark job for a Lindorm CDC data source as an SQL job, a JAR job, or a Python job. A minimal Python job sketch is provided after the following note.
Note For information about the syntax that is used to read data from and write data to a Lindorm CDC data source, see Configure a Lindorm CDC data source.
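The following minimal Python sketch shows what a self-written Spark job that queries the Lindorm CDC data source might look like. It assumes that the configuration items described in the Configure a Lindorm CDC data source section can be set when the SparkSession is created. The qualified table name lindorm_cdc.cdc_topic, the credentials, and the timestamp values are placeholders for illustration only.

  # Minimal sketch of a Python Spark job that reads changes from the Lindorm
  # CDC data source. The table name "cdc_topic", the credentials, and the
  # timestamp literals are hypothetical placeholders.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("lindorm-cdc-example")
      # Required for JAR and Python jobs; see the configuration items below.
      .config("spark.sql.catalog.lindorm_cdc.username", "root")
      .config("spark.sql.catalog.lindorm_cdc.password", "root")
      .getOrCreate()
  )

  # Read a bounded range of change records, as required for SELECT statements
  # on the Lindorm CDC data source.
  df = spark.sql(
      """
      SELECT *
      FROM lindorm_cdc.cdc_topic
      WHERE _cdc_timestamp_kafka > 1700000000000
        AND _cdc_timestamp_kafka < 1700003600000
      """
  )
  df.show()

  spark.stop()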

Configure a Lindorm CDC data source

Name and tables of the Lindorm CDC data source

  • The name of the Lindorm CDC data source provided by LDPS is lindorm_cdc.
  • You cannot manage namespaces in the Lindorm CDC data source. You can manage only tables in the Lindorm CDC data source. The tables in the Lindorm CDC data source use the same names as the topics that you specified when you created data subscription channels.

Schemas of the Lindorm CDC data source

The Lindorm CDC data source extracts the schemas of HBase tables based on the LINDORM_HBASE_CATALOG attribute and uses the extracted schemas as its own schemas. The Lindorm CDC data source reads data from Kafka, and each change is stored as an operation record. The following meta fields are supported in the schemas of the Lindorm CDC data source. An example query that uses these meta fields is provided after the list.
  • _cdc_timestamp_kafka
    • Type: long
    • Description: The timestamp when the operation record was written to Kafka. Unit: milliseconds.
    • Configuration: No configuration is required. The field is included in the schema by default.
  • _cdc_operation_type
    • Type: string
    • Description: The change type of the operation record. Valid values:
      • C: adds data.
      • U: updates data.
      • D: deletes data.
    • Configuration: No configuration is required. The field is included in the schema by default.
  • _cdc_timestamp_lindorm
    • Type: long
    • Description: The timestamp when the operation record was processed by a Lindorm engine service other than LDPS. Unit: milliseconds.
    • Configuration: spark.sql.catalog.lindorm_cdc.lindormTsEnabled
  • _cdc_timestamp_lts
    • Type: long
    • Description: The timestamp when the operation record was processed by LTS. Unit: milliseconds.
    • Configuration: spark.sql.catalog.lindorm_cdc.ltsTsEnabled
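As a quick illustration of how these meta fields can be queried, the following hedged sketch counts change records by type within a bounded time window. The qualified table name lindorm_cdc.cdc_topic and the timestamp literals are hypothetical, the connection configuration is omitted for brevity, and the _cdc_timestamp_lindorm and _cdc_timestamp_lts fields appear only if the corresponding configuration items in the next section are set to true.

  # Sketch: summarize change records by type by using the meta fields.
  # "lindorm_cdc.cdc_topic" and the timestamp literals are hypothetical.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("lindorm-cdc-meta-fields").getOrCreate()

  spark.sql(
      """
      SELECT _cdc_operation_type,
             COUNT(*)                  AS record_count,
             MAX(_cdc_timestamp_kafka) AS latest_kafka_ts
      FROM lindorm_cdc.cdc_topic
      WHERE _cdc_timestamp_kafka > 1700000000000
        AND _cdc_timestamp_kafka < 1700003600000
      GROUP BY _cdc_operation_type
      """
  ).show()

  spark.stop()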

Configuration items of the Lindorm CDC data source

The Lindorm CDC data source supports the following configuration items. A sketch that shows how to set them for a Python job is provided after the list.
  • spark.sql.catalog.lindorm_cdc.username
    • Required:
      • This parameter is required if you submit a JAR job or a Python job.
      • This parameter is optional if you submit an SQL job. In this case, the system automatically assigns a value to this parameter.
    • Description: The username that is used to connect to LindormTable.
    • Example: root (default username)
  • spark.sql.catalog.lindorm_cdc.password
    • Required:
      • This parameter is required if you submit a JAR job or a Python job.
      • This parameter is optional if you submit an SQL job. In this case, the system automatically assigns a value to this parameter.
    • Description: The password that is used to connect to LindormTable.
    • Example: root (default password)
  • spark.sql.catalog.lindorm_cdc.lindormTsEnabled
    • Required: No
    • Description: Specifies whether to include the timestamp at which Lindorm processed the operation record in the schema. Default value: false. If you set this parameter to true, the _cdc_timestamp_lindorm field is added to the schema of the Lindorm CDC data source.
    • Example: true
  • spark.sql.catalog.lindorm_cdc.ltsTsEnabled
    • Required: No
    • Description: Specifies whether to include the timestamp at which LTS processed the operation record in the schema. Default value: false. If you set this parameter to true, the _cdc_timestamp_lts field is added to the schema of the Lindorm CDC data source.
    • Example: true
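These configuration items are passed as Spark configuration properties. The following Python sketch shows one way to set them when a SparkSession is created for a Python job; whether you set them this way or through the Spark configuration of the submitted job depends on how you submit the job, and the placeholder credentials are assumptions.

  # Sketch: set the Lindorm CDC data source configuration items for a Python
  # job. Replace the placeholder username and password with the credentials of
  # your own Lindorm instance.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("lindorm-cdc-config")
      .config("spark.sql.catalog.lindorm_cdc.username", "root")          # required for JAR and Python jobs
      .config("spark.sql.catalog.lindorm_cdc.password", "root")          # required for JAR and Python jobs
      .config("spark.sql.catalog.lindorm_cdc.lindormTsEnabled", "true")  # adds _cdc_timestamp_lindorm
      .config("spark.sql.catalog.lindorm_cdc.ltsTsEnabled", "true")      # adds _cdc_timestamp_lts
      .getOrCreate()
  )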

Statements that are supported for the Lindorm CDC data source

The following statements can be executed on the Lindorm CDC data source. An example that applies the notes for the SELECT statement is provided after the list.
  • USE table_name
    • Description: Uses a specified table.
    • Example: USE test
  • SHOW TABLES
    • Description: Views all tables.
    • Example: SHOW TABLES
  • DESCRIBE table_name
    • Description: Views the details of a specified table.
    • Example: DESC test or DESCRIBE test
  • SELECT
    • Description: For more information about the SELECT statement, see Spark SQL.
      Note When you execute the SELECT statement, take note of the following items:
      • You must use _cdc_timestamp_kafka > $startTimestamp and _cdc_timestamp_kafka < $endTimestamp to specify the range of the data that you want to read.
      • If the value of the _cdc_operation_type field is D, only the value of the field that is specified as the row key is displayed. Empty strings are displayed for the other fields.
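The following hedged sketch applies the notes above: it reads a bounded range of change records and handles delete records separately, because a D record carries a value only for the row key field. The qualified table name lindorm_cdc.cdc_topic, the row key column id, and the timestamp literals are hypothetical, and the connection configuration is omitted for brevity.

  # Sketch: read one window of changes and separate deletes from inserts and
  # updates. "lindorm_cdc.cdc_topic", the row key column "id", and the
  # timestamp literals are hypothetical placeholders.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("lindorm-cdc-select").getOrCreate()

  changes = spark.sql(
      """
      SELECT *
      FROM lindorm_cdc.cdc_topic
      WHERE _cdc_timestamp_kafka > 1700000000000   -- $startTimestamp (required)
        AND _cdc_timestamp_kafka < 1700003600000   -- $endTimestamp (required)
      """
  )

  # For delete records (_cdc_operation_type = 'D'), only the row key field has
  # a value and the other fields are empty strings.
  deletes = changes.filter("_cdc_operation_type = 'D'").select("_cdc_operation_type", "id")
  upserts = changes.filter("_cdc_operation_type IN ('C', 'U')")

  deletes.show()
  upserts.show()
  spark.stop()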