
DataWorks: Databricks

Last Updated: Mar 26, 2026

Databricks Reader reads data from a Databricks data source into DataWorks offline synchronization tasks using JDBC.

Capability summary

| Capability | Details |
| --- | --- |
| Read mode | JDBC only |
| Resource group | Serverless resource groups only |
| Catalog support | Unity Catalog only (wizard mode and connectivity test) |
| Concurrent reads | Supported via splitPk; no cross-task transaction guarantee |
| Incremental sync | Supported via a WHERE clause with a timestamp or auto-increment column |
| Encoding | Auto-detected by JDBC; no manual configuration needed |

Limitations

Resource group

Databricks Reader supports only Serverless resource groups. The virtual private cloud (VPC) bound to the resource group must have a public NAT gateway and an elastic IP address (EIP) configured.

Read mode

Data synchronization tasks read data in JDBC mode only.

Catalog type

When testing data source connectivity or configuring a task in wizard mode, DataWorks uses databricks-sdk to call the Databricks REST API. This API supports only Unity Catalog. Catalogs other than Unity Catalog—such as hive_metastore—cannot use these features.

To work around this limitation, choose one of the following:

  • Migrate to Unity Catalog (recommended). Migrate your data and metadata to Unity Catalog to use all DataWorks features. See Migrate to Unity Catalog.

  • Use script mode directly. After adding the data source, skip the Test Connectivity step and configure the task in script mode.

Concurrent reads and data consistency

Databricks Reader uses the splitPk parameter to partition data across multiple concurrent tasks, which improves synchronization throughput. Be aware of the following:

  • Concurrent tasks do not share a database transaction and have time intervals between them.

  • If data is continuously written to the source during synchronization, concurrent reads can produce an incomplete or inconsistent snapshot.

A perfectly consistent snapshot across concurrent reads is not possible. To manage this trade-off:

  • Prioritize consistency. Use single-threaded synchronization without splitPk. This guarantees strict data consistency but reduces throughput.

  • Prioritize speed. Keep the data source static during synchronization (for example, lock the table, pause application writes, or stop synchronization from the standby database) so that concurrent reads stay consistent. This preserves throughput but may affect online services.
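In script mode, this trade-off maps to two settings: splitPk in the reader parameter and speed.concurrent in the job setting. A minimal sketch of the speed-first configuration (table and column names are illustrative):

```json
{
  "steps": [
    {
      "stepType": "databricks",
      "parameter": {
        "datasource": "databricks",
        "schema": "schema1",
        "table": "table1",
        "readMode": "jdbc",
        "splitPk": "id",
        "column": ["c1", "c2"]
      },
      "name": "Reader",
      "category": "reader"
    }
  ],
  "setting": {
    "speed": {
      "concurrent": 4
    }
  }
}
```

For the consistency-first option, omit splitPk and set "concurrent": 1 so the task reads in a single thread.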

Encoding

Databricks Reader extracts data over JDBC, which automatically detects and converts character encodings. No manual encoding configuration is needed.

Incremental synchronization

Databricks Reader extracts data using SELECT ... WHERE ... statements. The key to incremental synchronization is constructing the WHERE clause correctly.

Recommended approach: Design a timestamp column (such as modify_time) in the source table. Update this column whenever a row is added, modified, or logically deleted. In the synchronization task, use this column in the WHERE clause to pull rows changed since the last synchronization point.

Not supported: Tables without a column that distinguishes new or modified rows—such as a timestamp or an auto-incrementing ID—cannot use incremental synchronization. Only full synchronization is possible for these tables.
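The recommended approach can be sketched as the following reader fragment, assuming the source table has a modify_time timestamp column and the task runs daily with the ${bizdate} scheduling parameter (adjust the date format in the comparison to match the column type):

```json
{
  "stepType": "databricks",
  "parameter": {
    "datasource": "databricks",
    "schema": "schema1",
    "table": "table1",
    "readMode": "jdbc",
    "where": "modify_time >= '${bizdate}'",
    "column": ["id", "modify_time", "c1"]
  },
  "name": "Reader",
  "category": "reader"
}
```

Each run then pulls only the rows changed since the last synchronization point instead of the full table.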

Supported data types

Databricks Reader supports most Databricks data types for offline reads. Verify that your column types are in the supported list before configuring the task.

| Category | Databricks data types |
| --- | --- |
| Integers | TINYINT, SMALLINT, INT, BIGINT |
| Floating-point | FLOAT, DOUBLE, DECIMAL |
| Strings | STRING |
| Date/time | DATE, TIMESTAMP, TIMESTAMP_NTZ |
| Booleans | BOOLEAN |
| Complex types | ARRAY, MAP, STRUCT |
| Other types | INTERVAL, BINARY, GEOGRAPHY(srid), GEOMETRY(srid) |

Create a data source

Create the Databricks data source in DataWorks before developing a synchronization task. See Data source management for the procedure. Refer to the tooltips on the configuration page for parameter descriptions.

Develop a data synchronization task

For the configuration procedure, see Single-table offline synchronization.

FAQ

I get the following JDBC error when reading data:

[Databricks][JDBCDriver](500313) Error getting the data value from result set: Column13:
[Databricks][JDBCDriver](500312) Error in fetching data rows: Timestamp Conversion has failed.

The Databricks TIMESTAMP type supports a wider value range than Java's Timestamp type. When a value falls outside the Java range, the JDBC driver throws this error. To fix it, cast the column to STRING in the column parameter:

"column": ["CAST(col_timestamp AS STRING)"]
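When a table mixes affected and unaffected columns, cast only the out-of-range timestamp column and list the remaining columns as usual (column names here are illustrative):

```json
"column": ["id", "name", "CAST(col_timestamp AS STRING)"]
```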

Appendix: Script sample and parameters

Reader script sample

Use the following JSON as a starting point for script mode. See Configure an offline synchronization task in script mode for the full script structure.

{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "databricks",
      "parameter": {
        "datasource": "databricks",
        "schema": "schema1",
        "table": "table1",
        "readMode": "jdbc",
        "where": "id>1",
        "splitPk": "id",
        "column": [
          "c1",
          "c2"
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0"
    },
    "speed": {
      "concurrent": 1
    }
  }
}

Reader parameters

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| datasource | The name of the DataWorks data source. | Yes | N/A |
| column | A JSON array of column names to synchronize. Supports column pruning, column reordering, and constant expressions, following Databricks SQL syntax. Example: ["id", "1", "'const name'", "null", "upper('abc_lower')", "2.3", "true"]. Cannot be empty. | Yes | N/A |
| schema | The schema to synchronize. | Yes | N/A |
| table | The table to synchronize. One table per task. | Yes | N/A |
| splitPk | The column used to partition data for concurrent reads, which improves throughput. Use the primary key when possible; primary keys are typically distributed evenly, which avoids data hotspots. Supports integer types only; floating-point, string, and date types are not supported. | No | N/A |
| where | A filter condition for the SELECT statement. Use this to implement incremental synchronization, for example gmt_create>${bizdate}. If omitted, the task synchronizes the entire table. | No | N/A |
| readMode | The data read mode. Only jdbc is supported. | No | jdbc |