Databricks Reader reads data from a Databricks data source into DataWorks offline synchronization tasks using JDBC.
Capability summary
| Capability | Details |
| --- | --- |
| Read mode | JDBC only |
| Resource group | Serverless resource groups only |
| Catalog support | Unity Catalog only (wizard mode and connectivity test) |
| Concurrent reads | Supported via the `splitPk` parameter |
| Incremental sync | Supported via a `where` filter on a timestamp column |
| Encoding | Auto-detected by JDBC; no manual configuration needed |
Limitations
Resource group
Databricks Reader supports only Serverless resource groups. The virtual private cloud (VPC) bound to the resource group must have a public NAT gateway and an elastic IP address (EIP) configured.
Read mode
Data synchronization tasks read data in JDBC mode only.
Catalog type
When testing data source connectivity or configuring a task in wizard mode, DataWorks uses `databricks-sdk` to call the Databricks REST API. This API supports only Unity Catalog. Catalogs other than Unity Catalog, such as `hive_metastore`, cannot use these features.
To work around this limitation, choose one of the following:
- Migrate to Unity Catalog (recommended). Migrate your data and metadata to Unity Catalog to use all DataWorks features. See Migrate to Unity Catalog.
- Use script mode directly. After adding the data source, skip the Test Connectivity step and configure the task in script mode.
Concurrent reads and data consistency
Databricks Reader uses the splitPk parameter to partition data across multiple concurrent tasks, which improves synchronization throughput. Be aware of the following:
- Concurrent tasks do not share a database transaction and have time intervals between them.
- If data is continuously written to the source during synchronization, concurrent reads can produce an incomplete or inconsistent snapshot.
A perfectly consistent snapshot across concurrent reads is not possible. To manage this trade-off:
- Prioritize consistency. Use single-threaded synchronization without `splitPk`. This guarantees strict data consistency but reduces throughput.
- Prioritize speed. Keep the data source static during synchronization, for example by using table locking, pausing application writes, or stopping standby database synchronization. This is faster but may affect online services.
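If you prioritize speed, pair `splitPk` with the task-level concurrency setting. The following sketch reuses the parameter names from the appendix script sample; the schema, table, and column values are placeholders:

```json
{
    "steps": [
        {
            "stepType": "databricks",
            "parameter": {
                "datasource": "databricks",
                "schema": "schema1",
                "table": "table1",
                "readMode": "jdbc",
                "splitPk": "id",
                "column": ["c1", "c2"]
            },
            "name": "Reader",
            "category": "reader"
        }
    ],
    "setting": {
        "speed": {
            "concurrent": 4
        }
    }
}
```

Without `splitPk`, the `concurrent` setting has no effect on the reader, because the table cannot be partitioned into parallel read tasks.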
Encoding
Databricks Reader extracts data over JDBC, which automatically detects and converts character encodings. No manual encoding configuration is needed.
Incremental synchronization
Databricks Reader extracts data using SELECT ... WHERE ... statements. The key to incremental synchronization is constructing the WHERE clause correctly.
Recommended approach: Design a timestamp column (such as modify_time) in the source table. Update this column whenever a row is added, modified, or logically deleted. In the synchronization task, use this column in the WHERE clause to pull rows changed since the last synchronization point.
Not supported: Tables without a column that distinguishes new or modified rows—such as a timestamp or an auto-incrementing ID—cannot use incremental synchronization. Only full synchronization is possible for these tables.
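As a sketch, an incremental task can express this in the reader's `where` parameter. The `modify_time` column is the illustrative timestamp column from above, and `${bizdate}` is a placeholder for a DataWorks scheduling parameter that resolves to the last synchronization point; substitute your own column and variable:

```json
{
    "stepType": "databricks",
    "parameter": {
        "datasource": "databricks",
        "schema": "schema1",
        "table": "table1",
        "readMode": "jdbc",
        "where": "modify_time >= '${bizdate}'",
        "column": ["c1", "c2", "modify_time"]
    },
    "name": "Reader",
    "category": "reader"
}
```

Each run then reads only the rows changed since the previous synchronization point, rather than the full table.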
Supported data types
Databricks Reader supports most Databricks data types for offline reads. Verify that your column types are in the supported list before configuring the task.
| Category | Databricks data types |
| --- | --- |
| Integers | TINYINT, SMALLINT, INT, BIGINT |
| Floating-point | FLOAT, DOUBLE, DECIMAL |
| Strings | STRING |
| Date/time | DATE, TIMESTAMP, TIMESTAMP_NTZ |
| Booleans | BOOLEAN |
| Complex types | ARRAY, MAP, STRUCT |
| Other types | INTERVAL, BINARY, GEOGRAPHY(srid), GEOMETRY(srid) |
Create a data source
Create the Databricks data source in DataWorks before developing a synchronization task. See Data source management for the procedure. Refer to the tooltips on the configuration page for parameter descriptions.
Develop a data synchronization task
Single-table offline synchronization
- For the configuration procedure, see Configure an offline synchronization task in wizard mode and Configure an offline synchronization task in script mode.
- For the full parameter list and a script sample, see Appendix: Script sample and parameters.
FAQ
I get the following JDBC error when reading data:
```
[Databricks][JDBCDriver](500313) Error getting the data value from result set: Column13:
[Databricks][JDBCDriver](500312) Error in fetching data rows: Timestamp Conversion has failed.
```
The Databricks TIMESTAMP type supports a wider value range than Java's Timestamp type. When a value falls outside the Java range, the JDBC driver throws this error. To fix it, cast the column to STRING in the `column` parameter:

```json
"column": ["CAST(col_timestamp AS STRING)"]
```
Appendix: Script sample and parameters
Reader script sample
Use the following JSON as a starting point for script mode. See Configure an offline synchronization task in script mode for the full script structure.
```json
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "databricks",
            "parameter": {
                "datasource": "databricks",
                "schema": "schema1",
                "table": "table1",
                "readMode": "jdbc",
                "where": "id>1",
                "splitPk": "id",
                "column": [
                    "c1",
                    "c2"
                ]
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "0"
        },
        "speed": {
            "concurrent": 1
        }
    }
}
```
Reader parameters
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| `datasource` | The name of the DataWorks data source. | Yes | N/A |
| `column` | A JSON array of column names to synchronize. Supports column pruning, column reordering, and constant expressions. Follows Databricks SQL syntax. Example: `["c1", "c2"]`. | Yes | N/A |
| `schema` | The schema to synchronize. | Yes | N/A |
| `table` | The table to synchronize. One table per task. | Yes | N/A |
| `splitPk` | The column used for data partitioning. Enables concurrent tasks to improve throughput. Use the primary key when possible, because primary keys are typically distributed evenly, which avoids data hotspots. Supports integer types only; floating-point, string, and date types are not supported. | No | N/A |
| `where` | A filter condition for the read query, such as `id>1`. | No | N/A |
| `readMode` | The data read mode. Only `jdbc` is supported. | No | `jdbc` |