
DataWorks: Databricks

Last Updated: Mar 26, 2026

Databricks Reader reads data from a Databricks data source into DataWorks offline synchronization tasks using JDBC.

Capability summary

| Capability | Details |
| --- | --- |
| Read mode | JDBC only |
| Resource group | Serverless resource groups only |
| Catalog support | Unity Catalog only (wizard mode and connectivity test) |
| Concurrent reads | Supported via splitPk; no cross-task transaction guarantee |
| Incremental sync | Supported via a WHERE clause with a timestamp or auto-increment column |
| Encoding | Auto-detected by JDBC; no manual configuration needed |

Limitations

Resource group

Databricks Reader supports only Serverless resource groups. The virtual private cloud (VPC) bound to the resource group must have a public NAT gateway and an elastic IP address (EIP) configured.

Read mode

Data synchronization tasks read data in JDBC mode only.

Catalog type

When testing data source connectivity or configuring a task in wizard mode, DataWorks uses databricks-sdk to call the Databricks REST API. This API supports only Unity Catalog. Catalogs other than Unity Catalog—such as hive_metastore—cannot use these features.

To work around this limitation, choose one of the following:

  • Migrate to Unity Catalog (recommended). Migrate your data and metadata to Unity Catalog to use all DataWorks features. See Migrate to Unity Catalog.

  • Use script mode directly. After adding the data source, skip the Test Connectivity step and configure the task in script mode.

Concurrent reads and data consistency

Databricks Reader uses the splitPk parameter to partition data across multiple concurrent tasks, which improves synchronization throughput. Be aware of the following:

  • Concurrent tasks do not share a database transaction and have time intervals between them.

  • If data is continuously written to the source during synchronization, concurrent reads can produce an incomplete or inconsistent snapshot.

A perfectly consistent snapshot across concurrent reads is not possible. To manage this trade-off:

  • Prioritize consistency. Use single-threaded synchronization without splitPk. This guarantees strict data consistency but reduces throughput.

  • Prioritize speed. Keep the data source static during synchronization (for example, lock the table, pause application writes, or stop synchronization from the standby database) so that concurrent reads stay consistent. This preserves throughput but may affect online services.
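In script mode, this trade-off maps to two settings: splitPk in the reader parameter and speed.concurrent in the job setting. A minimal sketch of the speed-first configuration (table and column names are illustrative):

```json
{
  "steps": [
    {
      "stepType": "databricks",
      "parameter": {
        "datasource": "databricks",
        "schema": "schema1",
        "table": "table1",
        "readMode": "jdbc",
        "splitPk": "id",
        "column": ["c1", "c2"]
      },
      "name": "Reader",
      "category": "reader"
    }
  ],
  "setting": {
    "speed": {
      "concurrent": 4
    }
  }
}
```

For the consistency-first option, omit splitPk and set "concurrent": 1 so the task reads in a single thread.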

Encoding

Databricks Reader extracts data over JDBC, which automatically detects and converts character encodings. No manual encoding configuration is needed.

Incremental synchronization

Databricks Reader extracts data using SELECT ... WHERE ... statements. The key to incremental synchronization is constructing the WHERE clause correctly.

Recommended approach: Design a timestamp column (such as modify_time) in the source table. Update this column whenever a row is added, modified, or logically deleted. In the synchronization task, use this column in the WHERE clause to pull rows changed since the last synchronization point.

Not supported: Tables without a column that distinguishes new or modified rows—such as a timestamp or an auto-incrementing ID—cannot use incremental synchronization. Only full synchronization is possible for these tables.
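The recommended approach can be sketched as the following reader fragment, assuming the source table has a modify_time timestamp column and the task runs daily with the ${bizdate} scheduling parameter (adjust the date format in the comparison to match the column type):

```json
{
  "stepType": "databricks",
  "parameter": {
    "datasource": "databricks",
    "schema": "schema1",
    "table": "table1",
    "readMode": "jdbc",
    "where": "modify_time >= '${bizdate}'",
    "column": ["id", "modify_time", "c1"]
  },
  "name": "Reader",
  "category": "reader"
}
```

Each run then pulls only the rows changed since the last synchronization point instead of the full table.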

Supported data types

Databricks Reader supports most Databricks data types for offline reads. Verify that your column types are in the supported list before configuring the task.

| Category | Databricks data types |
| --- | --- |
| Integers | TINYINT, SMALLINT, INT, BIGINT |
| Floating-point | FLOAT, DOUBLE, DECIMAL |
| Strings | STRING |
| Date/time | DATE, TIMESTAMP, TIMESTAMP_NTZ |
| Booleans | BOOLEAN |
| Complex types | ARRAY, MAP, STRUCT |
| Other types | INTERVAL, BINARY, GEOGRAPHY(srid), GEOMETRY(srid) |

Create a data source

Create the Databricks data source in DataWorks before developing a synchronization task. See Data source management for the procedure. Refer to the tooltips on the configuration page for parameter descriptions.

Develop a data synchronization task

For the configuration procedure, see Single-table offline synchronization.

FAQ

I get the following JDBC error when reading data:

[Databricks][JDBCDriver](500313) Error getting the data value from result set: Column13:
[Databricks][JDBCDriver](500312) Error in fetching data rows: Timestamp Conversion has failed.

The Databricks TIMESTAMP type supports a wider value range than Java's Timestamp type. When a value falls outside the Java range, the JDBC driver throws this error. To fix it, cast the column to STRING in the column parameter:

"column": ["CAST(col_timestamp AS STRING)"]
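When a table mixes affected and unaffected columns, cast only the out-of-range timestamp column and list the remaining columns as usual (column names here are illustrative):

```json
"column": ["id", "name", "CAST(col_timestamp AS STRING)"]
```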

Appendix: Script sample and parameters

Reader script sample

Use the following JSON as a starting point for script mode. See Configure an offline synchronization task in script mode for the full script structure.

{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "databricks",
      "parameter": {
        "datasource": "databricks",
        "schema": "schema1",
        "table": "table1",
        "readMode": "jdbc",
        "where": "id>1",
        "splitPk": "id",
        "column": [
          "c1",
          "c2"
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0"
    },
    "speed": {
      "concurrent": 1
    }
  }
}

Reader parameters

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| datasource | The name of the DataWorks data source. | Yes | N/A |
| column | A JSON array of column names to synchronize. Supports column pruning, column reordering, and constant expressions, following Databricks SQL syntax. Example: ["id", "1", "'const name'", "null", "upper('abc_lower')", "2.3", "true"]. Cannot be empty. | Yes | N/A |
| schema | The schema to synchronize. | Yes | N/A |
| table | The table to synchronize. One table per task. | Yes | N/A |
| splitPk | The column used to partition data for concurrent reads, which improves throughput. Use the primary key when possible; primary keys are typically distributed evenly, which avoids data hotspots. Supports integer types only; floating-point, string, and date types are not supported. | No | N/A |
| where | A filter condition for the SELECT statement. Use this to implement incremental synchronization, for example gmt_create>${bizdate}. If omitted, the task synchronizes the entire table. | No | N/A |
| readMode | The data read mode. Only jdbc is supported. | No | jdbc |