This topic describes the features and limitations of synchronizing data from a Databricks data source in DataWorks.
Usage notes
Limitations
Resource groups: Databricks Reader supports only Serverless resource groups. You must also configure a public NAT gateway and an Elastic IP Address (EIP) for the VPC bound to the resource group.
Read mode: Data synchronization tasks can read data only in JDBC mode.
Catalog types and feature limitations: When you test data source connectivity or configure a data synchronization task in wizard mode, DataWorks uses databricks-sdk to call the Databricks REST API. This API supports only Unity Catalog. If you use a catalog other than Unity Catalog, such as hive_metastore, these features are unavailable. You can use one of the following solutions:
Solution 1 (Recommended): Migrate to Unity Catalog. As recommended by Databricks, migrate your data and metadata to Unity Catalog. This enables you to use all features that DataWorks provides. For more information, see Migrate to Unity Catalog.
Solution 2: Use script mode directly. After you configure the data source in DataWorks, skip the Test Connectivity step and configure the data synchronization task in script mode. For a sample script, see Appendix: Script sample and parameters.
Concurrent reads and data consistency
Databricks Reader uses the splitPk parameter for data partitioning, which initiates multiple concurrent tasks to improve synchronization throughput. Note the following:
The concurrent tasks do not run in the same database transaction, and time intervals exist between them.
If data is continuously written to the data source, concurrent reads can produce an incomplete or inconsistent data snapshot.
Because of these technical limitations, a perfectly consistent snapshot across multiple concurrent reads cannot be achieved. Depending on your business requirements, you can use one of the following solutions:
Solution 1: Use single-threaded synchronization and do not configure data partitioning by using the splitPk parameter. This guarantees strict data consistency but slows down the synchronization. A minimal configuration sketch follows this list.
Solution 2: Ensure that the data source is static during synchronization. For example, use table locking, temporarily stop application writes, or pause standby database synchronization. This method is faster but can affect online services.
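The following is a minimal sketch of Solution 1, based on the script sample in the appendix: splitPk is omitted and speed.concurrent is set to 1, so the table is read in a single thread. The datasource, schema, table, and column names are placeholders.
{
  "steps": [
    {
      "stepType": "databricks",
      "parameter": {
        "datasource": "databricks",
        "schema": "schema1",
        "table": "table1",
        "readMode": "jdbc",
        "column": ["c1", "c2"]
      },
      "name": "Reader",
      "category": "reader"
    }
  ],
  "setting": {
    "speed": {
      "concurrent": 1
    }
  }
}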
Encoding
Databricks Reader extracts data using Java Database Connectivity (JDBC). JDBC automatically detects and converts character encodings across different databases, so you do not need to manually configure the encoding.
Incremental data synchronization
Databricks Reader extracts data by executing SELECT ... WHERE ... statements. The key to implementing incremental data synchronization is to correctly construct the WHERE clause.
Recommended approach (based on a timestamp column):
Design a timestamp column, such as modify_time, in your source database table.
When your application adds or updates data (including logical deletes), ensure that this timestamp column is updated accordingly.
In the data synchronization task, use this timestamp column in the WHERE clause to pull only the data that has changed since the last synchronization point, as shown in the example after these steps.
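For example, the following Reader parameter sketch pulls only the rows that changed on or after the data timestamp of the current run. It assumes a modify_time column in the source table and a DataWorks scheduling parameter named bizdate (referenced as ${bizdate}); replace both with the column name and parameter that your task actually uses.
"parameter": {
  "datasource": "databricks",
  "schema": "schema1",
  "table": "table1",
  "readMode": "jdbc",
  "column": ["c1", "c2", "modify_time"],
  "where": "modify_time >= '${bizdate}'"
}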
Unsupported scenario: If a table does not have a column that can distinguish new from modified records, such as a timestamp or an auto-incrementing ID, Databricks Reader cannot perform incremental data synchronization. In this case, only full data synchronization is possible.
Supported data types
For offline reads, Databricks Reader supports most Databricks data types. Data types that are not listed in the following table are not supported. Verify your data types before you start.
The following table lists the Databricks data types that Databricks Reader supports.
Category | Databricks data type |
Integers | TINYINT, SMALLINT, INT, BIGINT |
Floating-point | FLOAT, DOUBLE, DECIMAL |
Strings | STRING |
Date/time | DATE, TIMESTAMP, TIMESTAMP_NTZ |
Booleans | BOOLEAN |
Complex types | ARRAY, MAP, STRUCT |
Other types | INTERVAL, BINARY, GEOGRAPHY(srid), GEOMETRY(srid) |
Create a data source
Before you develop a data synchronization task, create the corresponding data source in DataWorks. For more information, see Data source management. For descriptions of the configuration parameters, see the tooltips on the configuration page.
Develop a data synchronization task
This section describes how to configure a data synchronization task.
Single-table offline synchronization
For the procedure, see Configure an offline synchronization task in wizard mode and Configure an offline synchronization task in script mode.
For a full list of parameters and a sample script for script mode, see Appendix: Script sample and parameters.
FAQ
Q: I get the following error when reading data: [Databricks][JDBCDriver](500313)Error getting the data value from result set: Column13:[Databricks][JDBCDriver](500312)Error in fetching data rows: Timestamp Conversion has failed.
A: The value range of the Databricks TIMESTAMP data type exceeds the value range of a Java Timestamp. If a value is out of the Java range, the JDBC driver returns this error. To resolve the issue, modify the column parameter to cast the affected column to a string. Example: "column": ["CAST(col_timestamp AS STRING)"].
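In context, the cast expression replaces the plain column name in the column array. The following sketch assumes col_timestamp is the affected TIMESTAMP column and c1 and c2 are ordinary columns.
"column": [
  "c1",
  "c2",
  "CAST(col_timestamp AS STRING)"
]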
Appendix: Script sample and parameters
Script configuration for offline tasks
If you use script mode to configure an offline task, you must specify the parameters in the task script according to the required format. For more information, see Configure an offline synchronization task in script mode. The following sections describe the Reader parameters for script mode.
Reader script sample
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "databricks",
"parameter": {
"datasource": "databricks",
"schema": "schema1",
"table": "table1",
"readMode": "jdbc",
"where": "id>1",
"splitPk": "id",
"column": [
"c1",
"c2"
]
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "stream",
"parameter": {},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"concurrent": 1
}
}
}
Reader script parameters
Parameter | Description | Required | Default |
datasource | The name of your DataWorks data source. | Yes | N/A |
column | The columns to synchronize from the source table, specified as a JSON array of column names, for example "column": ["c1", "c2"]. You must explicitly specify the columns to synchronize; this parameter cannot be empty. | Yes | N/A |
splitPk | The column used for data partitioning. Specifying this parameter enables concurrent tasks, which can improve synchronization throughput. | No | N/A |
where | A filter condition for selecting data. DataWorks uses the specified condition as the WHERE clause of the generated SELECT statement, for example id>1. | No | N/A |
schema | The schema to be synchronized. | Yes | N/A |
table | The table to be synchronized. A single task can synchronize only one table. | Yes | N/A |
readMode | The data read mode. Only the jdbc mode is supported. | No | jdbc |