The Databricks input component reads data from a Databricks data source. To synchronize data from a Databricks data source to another data source, you must first configure the Databricks input component to read from the source data source, and then configure the target data source for the synchronization. This topic describes how to configure the Databricks input component.
Prerequisites
You have created a Databricks data source. For more information, see Create a Databricks data source.
The account used to configure the properties of the Databricks input component has the read permission on the data source. If you do not have the permission, request it first. For more information, see Request data source permissions.
Procedure
In the top navigation bar of the Dataphin homepage, choose Develop > Data Integration.
In the top navigation bar of the integration page, select a project (in Dev-Prod mode, you must also select an environment).
In the left-side navigation pane, click Batch Pipeline. In the Batch Pipeline list, click the offline pipeline that you want to develop to open its configuration page.
Click Component Library in the upper-right corner of the page to open the Component Library panel.
In the left-side navigation pane of the Component Library panel, select Input. Find the Databricks component in the input component list on the right and drag it to the canvas.
Click the icon on the Databricks input component card to open the Databricks Input Configuration dialog box. In the dialog box, configure the following parameters.
Parameter
Description
Step Name
The name of the Databricks input component. Dataphin automatically generates a step name, which you can modify based on your business scenario. The name must meet the following requirements:
It can contain only Chinese characters, letters, underscores (_), and digits.
It can be up to 64 characters in length.
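As an illustration only, the naming rules above correspond to a check like the following; the validator and the Unicode range used to approximate Chinese characters are assumptions, not part of the product.

import re

# Hypothetical check for the step-name rules: Chinese characters (approximated
# by the CJK Unified Ideographs block), letters, digits, and underscores,
# at most 64 characters.
def is_valid_step_name(name: str) -> bool:
    return re.fullmatch(r"[0-9A-Za-z_\u4e00-\u9fff]{1,64}", name) is not None

print(is_valid_step_name("databricks_input_01"))  # True
print(is_valid_step_name("a" * 65))               # False: longer than 64 characters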
Datasource
The drop-down list displays all Databricks data sources and projects in the current Dataphin instance, including data sources for which you do not have the read permission. Click the copy icon to copy the name of the current data source.
For a data source for which you do not have the read permission, click Request next to the data source to request the read permission. For more information, see Request data source permissions.
If you do not have a Databricks data source, click Create Data Source to create one. For more information, see Create a Databricks data source.
Time Zone
The time zone used to process data in time formats. The default value is the time zone configured in the selected data source, and it cannot be modified.
Note: For tasks created before V5.1.2, you can select Data Source Default Configuration or Channel Configuration Time Zone. The default value is Channel Configuration Time Zone.
Data Source Default Configuration: the default time zone of the selected data source.
Channel Configuration Time Zone: the time zone configured in Properties > Channel Configuration of the current integration task.
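As a minimal illustration of why the time zone matters when parsing time-format data (the zone names are examples only):

from datetime import datetime
from zoneinfo import ZoneInfo

# The same wall-clock value, interpreted in two different time zones,
# denotes two different instants (8 hours apart in this example).
raw = datetime(2024, 1, 1, 0, 0, 0)
in_shanghai = raw.replace(tzinfo=ZoneInfo("Asia/Shanghai"))
in_utc = raw.replace(tzinfo=ZoneInfo("UTC"))
print(in_utc.timestamp() - in_shanghai.timestamp())  # 28800.0 seconds, i.e. 8 hours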
Schema (optional)
Select the schema in which the table resides; this lets you select tables across schemas. If you do not specify a schema, the schema configured in the data source is used by default.
If you select a project as the data source, you cannot configure the schema. The system automatically obtains the schema corresponding to the project.
Table
Enter a keyword to search for a table, or enter the exact table name and click Exact Match. After you select a table, the system automatically checks the table status. Click the copy icon to copy the name of the selected table.
Shard Key (optional)
The system shards data based on the configured shard key field. Used together with the concurrency configuration, this enables concurrent reading, as shown in the sketch after the note below. You can use a column in the source table as the shard key. We recommend that you use a primary key or an indexed column as the shard key to ensure transmission performance.
Important: If the shard key is a date or time type, the system identifies the minimum and maximum values and shards the data based on the total time range and the concurrency. An even distribution of data across shards is not guaranteed.
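A minimal sketch of what range-based sharding amounts to, assuming a numeric shard key. The table name (orders), column name (id), and splitting rule are illustrative; they are not Dataphin's internal implementation.

def shard_ranges(min_val: int, max_val: int, concurrency: int):
    """Split [min_val, max_val] into `concurrency` contiguous half-open ranges."""
    step = (max_val - min_val + 1) // concurrency
    bounds = [min_val + i * step for i in range(concurrency)] + [max_val + 1]
    return list(zip(bounds[:-1], bounds[1:]))

# With ids spanning 1..1,000,000 and a concurrency of 4, each concurrent
# reader is handed roughly 250,000 ids:
for lo, hi in shard_ranges(1, 1_000_000, 4):
    print(f"SELECT * FROM orders WHERE id >= {lo} AND id < {hi}")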
Batch Read Count (optional)
The number of records to read per request. When reading data from the source database, you can configure a specific batch read count (for example, 1,024 records) instead of reading records one by one. This reduces the number of interactions with the data source, improves I/O efficiency, and reduces network latency. The sketch below illustrates the idea.
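As an analogy for what this setting controls, and not Dataphin's actual reader code, the sketch below fetches rows in batches with the open-source databricks-sql-connector for Python; the connection parameters and table name are placeholders.

from databricks import sql

with sql.connect(server_hostname="<workspace-host>",
                 http_path="<http-path>",
                 access_token="<token>") as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM orders")
        while True:
            rows = cursor.fetchmany(1024)  # one round trip per 1,024 rows
            if not rows:
                break
            # hand the batch of up to 1,024 rows to the writer here

Reading 10,240,000 rows this way takes about 10,000 fetch round trips instead of over ten million single-row reads.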
Input Filter (optional)
Enter a condition expression supported by the Databricks database as the data filter condition.
Note: Enter only the content that follows the WHERE keyword; do not include the WHERE keyword itself.
You can use system global variables, such as the data timestamp ${bizdate}.
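For example, assuming the table has a string partition column named ds (the column name is hypothetical), the following condition reads only the data for the current data timestamp:

ds = '${bizdate}'

Compound conditions are also valid, for example ds = '${bizdate}' AND order_status = 'PAID' (order_status is likewise a hypothetical column).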
Output Fields
The Output Fields section displays all fields that match the selected table and filter conditions. If you do not want to output certain fields to downstream components, you can delete these fields.
Note: The data source table does not support hierarchical classification.
Delete a single field: To delete a small number of fields, click the delete icon in the Operation column of each field that you do not need.
Delete multiple fields in batches: To delete many fields, click Field Management. In the Field Management dialog box, select the fields, click the left arrow icon to move them from the selected input fields to the unselected input fields, and then click OK to complete the batch deletion.
Click OK to complete the property configuration of the Databricks input component.