This topic describes the features and limitations of synchronizing data from a Databricks data source in DataWorks.
Usage notes
Limitations
Resource groups: Databricks Reader supports only Serverless resource groups. You must also configure a public NAT gateway and an Elastic IP Address (EIP) for the VPC bound to the resource group.
Read mode: Data synchronization tasks can read data only in JDBC mode.
Catalog types and feature limitations: When you test data source connectivity or configure a data synchronization task in wizard mode, DataWorks uses databricks-sdk to call the Databricks REST API. This API supports only Unity Catalog. If you use a catalog other than Unity Catalog, such as hive_metastore, these features are unavailable. You can use one of the following solutions:
Solution 1 (Recommended): Migrate to Unity Catalog. As recommended by Databricks, migrate your data and metadata to Unity Catalog. This enables you to use all features that DataWorks provides. For more information, see Migrate to Unity Catalog.
Solution 2: Use script mode directly. After you configure the data source in DataWorks, skip the Test Connectivity step and configure the data synchronization task in script mode. For a sample script, see Appendix: Script sample and parameters.
Concurrent reads and data consistency
Databricks Reader uses the splitPk parameter for data partitioning, which initiates multiple concurrent tasks to improve synchronization throughput. Note the following:
The concurrent tasks do not run in the same database transaction, and time intervals exist between them.
If data is continuously written to the data source, concurrent reads can produce an incomplete or inconsistent data snapshot.
Because of these technical limitations, a perfectly consistent snapshot across multiple concurrent reads cannot be achieved. Depending on your business requirements, you can use one of the following solutions:
Solution 1: Use single-threaded synchronization and do not configure data partitioning by using the splitPk parameter. This guarantees strict data consistency but slows down the synchronization. A minimal configuration sketch follows this list.
Solution 2: Ensure that the data source is static during synchronization. For example, use table locking, temporarily stop application writes, or pause standby database synchronization. This method is faster but can affect online services.
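The following is a minimal sketch of Solution 1, based on the script sample in the appendix: splitPk is omitted and speed.concurrent is set to 1, so the table is read in a single thread. The datasource, schema, table, and column names are placeholders.
{
  "steps": [
    {
      "stepType": "databricks",
      "parameter": {
        "datasource": "databricks",
        "schema": "schema1",
        "table": "table1",
        "readMode": "jdbc",
        "column": ["c1", "c2"]
      },
      "name": "Reader",
      "category": "reader"
    }
  ],
  "setting": {
    "speed": {
      "concurrent": 1
    }
  }
}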
Encoding
Databricks Reader extracts data using Java Database Connectivity (JDBC). JDBC automatically detects and converts character encodings across different databases, so you do not need to manually configure the encoding.
Incremental data synchronization
Databricks Reader extracts data by executing SELECT ... WHERE ... statements. The key to implementing incremental data synchronization is to correctly construct the WHERE clause.
Recommended approach (based on a timestamp column):
Design a timestamp column, such as modify_time, in your source database table.
When your application adds or updates data (including logical deletes), ensure that this timestamp column is updated accordingly.
In the data synchronization task, use this timestamp column in the WHERE clause to pull only the data that has changed since the last synchronization point, as shown in the example after these steps.
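For example, the following Reader parameter sketch pulls only the rows that changed on or after the data timestamp of the current run. It assumes a modify_time column in the source table and a DataWorks scheduling parameter named bizdate (referenced as ${bizdate}); replace both with the column name and parameter that your task actually uses.
"parameter": {
  "datasource": "databricks",
  "schema": "schema1",
  "table": "table1",
  "readMode": "jdbc",
  "column": ["c1", "c2", "modify_time"],
  "where": "modify_time >= '${bizdate}'"
}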
Unsupported scenario: If a table does not have a column that can distinguish new from modified records, such as a timestamp or an auto-incrementing ID, Databricks Reader cannot perform incremental data synchronization. In this case, only full data synchronization is possible.
Supported data types
For offline reads, Databricks Reader supports most Databricks data types. Data types that are not listed in the following table are not supported. Verify your data types before you start.
The following table lists the Databricks data types that Databricks Reader supports.
Category | Databricks data type |
Integers | TINYINT, SMALLINT, INT, BIGINT |
Floating-point | FLOAT, DOUBLE, DECIMAL |
Strings | STRING |
Date/time | DATE, TIMESTAMP, TIMESTAMP_NTZ |
Booleans | BOOLEAN |
Complex types | ARRAY, MAP, STRUCT |
Other types | INTERVAL, BINARY, GEOGRAPHY(srid), GEOMETRY(srid) |
Create a data source
Before you develop a data synchronization task, create the corresponding data source in DataWorks. For more information, see Data source management. For descriptions of the configuration parameters, see the tooltips on the configuration page.
Develop a data synchronization task
This section describes how to configure a data synchronization task.
Single-table offline synchronization
For the procedure, see Configure an offline synchronization task in wizard mode and Configure an offline synchronization task in script mode.
For a full list of parameters and a sample script for script mode, see Appendix: Script sample and parameters.
FAQ
Q: I get the following error when reading data: [Databricks][JDBCDriver](500313)Error getting the data value from result set: Column13:[Databricks][JDBCDriver](500312)Error in fetching data rows: Timestamp Conversion has failed.
A: The value range of the Databricks TIMESTAMP data type exceeds the value range of a Java Timestamp. If a value is out of the Java range, the JDBC driver returns this error. To resolve the issue, modify the column parameter to cast the affected column to a string. Example: "column": ["CAST(col_timestamp AS STRING)"].
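In context, the cast expression replaces the plain column name in the column array. The following sketch assumes col_timestamp is the affected TIMESTAMP column and c1 and c2 are ordinary columns.
"column": [
  "c1",
  "c2",
  "CAST(col_timestamp AS STRING)"
]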
Appendix: Script sample and parameters
Script configuration for offline tasks
If you use script mode to configure an offline task, you must specify the parameters in the task script according to the required format. For more information, see Configure an offline synchronization task in script mode. The following sections describe the Reader parameters for script mode.
Reader script sample
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "databricks",
"parameter": {
"datasource": "databricks",
"schema": "schema1",
"table": "table1",
"readMode": "jdbc",
"where": "id>1",
"splitPk": "id",
"column": [
"c1",
"c2"
]
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "stream",
"parameter": {},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"concurrent": 1
}
}
}
Reader script parameters
Parameter | Description | Required | Default |
datasource | The name of your DataWorks data source. | Yes | N/A |
column | The columns to synchronize from the source table, specified as a JSON array of column names, for example "column": ["c1", "c2"]. You must explicitly specify the columns to synchronize; this parameter cannot be empty. | Yes | N/A |
splitPk | The column used for data partitioning. Specifying this parameter enables concurrent tasks, which can improve synchronization throughput. | No | N/A |
where | A filter condition for selecting data. DataWorks uses the specified condition as the WHERE clause of the generated SELECT statement, for example id>1. | No | N/A |
schema | The schema to be synchronized. | Yes | N/A |
table | The table to be synchronized. A single task can synchronize only one table. | Yes | N/A |
readMode | The data read mode. Only the jdbc mode is supported. | No | jdbc |