The BigQuery data source connects DataWorks to Google BigQuery, letting you read BigQuery data as part of an offline synchronization task. Configure the task using the codeless UI or the code editor.
Supported features
| Feature | Supported |
|---|---|
| Offline (batch) sync | Yes |
| Codeless UI configuration | Yes |
| Code editor configuration | Yes |
| Incremental sync (via where filter) | Yes |
| Partition-based sync | Yes |
| Concurrent sync (via splitPk) | Yes |
Supported regions
BigQuery data sources are available in the following regions: China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), UK (London), US (Silicon Valley), and US (Virginia).
The BigQuery SDK version is google-cloud-bigquery 2.29.0. For SDK details, see the BigQuery documentation.
Supported field types
The following table lists the BigQuery field types that DataWorks supports and their corresponding Java types.
| BigQuery type | Java type |
|---|---|
| BOOL | Boolean |
| INT64 | Long |
| FLOAT64 | BigDecimal |
| NUMERIC | BigDecimal |
| BIGNUMERIC | BigDecimal |
| STRING | String |
| BYTES | Bytes |
| STRUCT | String |
| ARRAY | String |
| TIMESTAMP | Date |
| DATE | Date |
| TIME | Date |
| DATETIME | Date |
| GEOGRAPHY | String |
| JSON | String |
| INTERVAL | String |
For the full BigQuery type reference, see the BigQuery data types documentation.
Prerequisites
Before you begin, make sure you have:
- Network connectivity between your BigQuery data source and a serverless resource group or an exclusive resource group for Data Integration. Use a serverless resource group for most synchronization tasks. For setup instructions, see Network connectivity solutions.
- A service account authentication file (JSON key) downloaded from Google Cloud.
- The BigQuery Project ID for your Google Cloud project.
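For reference, a Google Cloud service account key file is a JSON document of roughly the following shape. All values below are placeholders, not real credentials:

```json
{
  "type": "service_account",
  "project_id": "my-gcp-project",
  "private_key_id": "<key-id>",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "dataworks-sync@my-gcp-project.iam.gserviceaccount.com",
  "client_id": "<numeric-client-id>",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token"
}
```

The `project_id` field in this file is the same value you supply as the BigQuery Project ID when adding the data source.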
Add the BigQuery data source
Add BigQuery as a data source in DataWorks before creating a synchronization task. Follow the instructions in Data source management.
The following parameters are specific to the BigQuery data source:
| Parameter | Description |
|---|---|
| BigQuery Project ID | The ID of your Google Cloud project. |
| BigQuery authorization information | Upload the JSON authentication file downloaded from Google Cloud. |
Tip: View parameter descriptions directly in the DataWorks console when adding the data source.
Configure a synchronization task
For step-by-step task configuration, see Configure a task in the codeless UI or Configure a task in the code editor.
Appendix: Script example and parameter descriptions
Use the following script example and parameter reference when configuring a BigQuery batch synchronization task in the code editor. For general code editor requirements, see Configure a task in the code editor.
Reader script example
```json
{
  "stepType": "bigquery",
  "parameter": {
    "datasource": "bq_test1",
    "dataSet": "database_0724",
    "table": "partition_1107",
    "column": [
      "id",
      "table_id",
      "table_no",
      "table_name",
      "table_status"
    ],
    "where": "xxx=3",
    "partition": [
      "_PARTITIONTIME='2023-11-07'"
    ]
  },
  "name": "Reader",
  "category": "reader"
}
```
Reader parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| datasource | Yes | None | The name of the data source. Must match the name of the data source added in DataWorks. |
| dataSet | Yes | None | The BigQuery dataset. |
| table | Yes | None | The name of the table to synchronize. |
| column | Yes | None | The columns to read. Example: ["id", "name", "age"]. |
| where | No | None | A filter condition applied to the query. BigQuery Reader combines column, table, and where into an SQL statement to extract data. For example, use LIMIT 10 for testing, or gmt_create>$bizdate for daily incremental sync. Leave blank to read all data. |
| partition | No | None | One or more partitions to synchronize. Example: ["_PARTITIONTIME='2023-11-07'"]. |
| splitPk | No | None | The field used to partition data for concurrent synchronization. Has no effect if partition is specified. |