The HDFS input component reads data from HDFS data sources. To synchronize data from an HDFS data source to another data source, first configure the HDFS input component that reads the source data, and then configure the target data source for the synchronization. This topic describes how to configure the HDFS input component.
Prerequisites
An HDFS data source has been created. For more information, see Create HDFS Data Source.
To configure the properties of the HDFS input component, the account must have read-through permission for the data source. If you lack the necessary permissions, you need to request access to the data source. For more information, see Request Data Source Permission.
Procedure
In the top menu bar on the Dataphin home page, select Development > Data Integration.
In the top menu bar on the integration page, select Project (Dev-Prod mode requires selecting an environment).
In the left-side navigation pane, click Batch Pipeline. In the Batch Pipeline list, click the offline pipeline that needs to be developed to open its configuration page.
Click Component Library in the upper-right corner of the page to open the Component Library panel.
In the left-side navigation pane of the Component Library panel, select Input, find the HDFS component in the input component list on the right, and drag the component to the canvas.
Click the icon on the HDFS input component card to open the HDFS Input Configuration dialog box.
In the HDFS Input Configuration dialog box, configure the following parameters.
Parameter
Description
Step Name
This is the name of the HDFS input component. Dataphin automatically generates the step name, and you can also modify it according to the business scenario. The naming convention is as follows:
Can only contain Chinese characters, letters, underscores (_), and numbers.
The name can be up to 64 characters in length.
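The naming rules above can be sketched as a simple check. This is a hypothetical validator, not part of Dataphin, and the Unicode range used for Chinese characters is an assumption:

```python
import re

# Hypothetical step-name validator mirroring the documented rules:
# only Chinese characters, letters, digits, and underscores (_),
# at most 64 characters. The CJK range \u4e00-\u9fff is an assumption.
STEP_NAME_RE = re.compile(r"[\u4e00-\u9fffA-Za-z0-9_]{1,64}")

def is_valid_step_name(name: str) -> bool:
    return STEP_NAME_RE.fullmatch(name) is not None

print(is_valid_step_name("hdfs_input_01"))  # True
print(is_valid_step_name("hdfs-input"))     # False: hyphen not allowed
```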
Datasource
The drop-down list displays all HDFS-type data sources in the current Dataphin environment, including data sources for which you have read-through permission and those for which you do not. Click the icon to copy the name of the current data source.
For data sources without read-through permission, you can click Request next to the data source to request read-through permission. For more information, see Request Data Source Permission.
If you do not have an HDFS-type data source, click Create Data Source to create a data source. For more information, see Create HDFS Data Source.
File Path
Enter the path where the file is located. Because the NameNode is already configured in the data source, you do not need to enter the hdfs://<namenode>:<port> prefix; enter only the absolute path, for example, /hadoop/input/file.txt. The actual path accessed by the system is: hdfs://<NameNode configured in the data source>:<IPC Port configured in the data source><file path entered>.
File Type
Select the file type. The system supports the following File Types: Text, ORC, RC, Sequence, CSV, Parquet.
When File Does Not Exist
When the file being read does not exist, you can choose to ignore it or set the task to fail.
Ignore: When the file being read does not exist, ignore the file and continue reading other files.
Set Task To Fail: When the file being read does not exist, terminate the task and set it to fail.
When File Is Empty
When the file being read is empty, you can choose to ignore it or set the task to fail.
Ignore: When the file being read is empty, ignore the file and continue reading other files.
Set Task To Fail: When the file being read is empty, terminate the task and set it to fail.
Data Content Starting Line
This parameter is required when the file type is Text or CSV. The default value is 1, meaning data content starts from the first line. To ignore the first N lines, set the data content starting line to N+1.
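The N+1 rule can be illustrated with a plain-Python sketch (not Dataphin internals): with two header lines to skip, the starting line is set to 3.

```python
# Hypothetical sketch of the "Data Content Starting Line" rule:
# a value of N+1 skips the first N lines of a Text/CSV file.
lines = [
    "name,score",     # header line 1
    "string,double",  # header line 2
    "alice,90.5",
    "bob,88.0",
]
starting_line = 3  # skip the first 2 lines
data = lines[starting_line - 1:]
print(data)  # ['alice,90.5', 'bob,88.0']
```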
File Encoding (optional)
Select the file encoding. The system supports the following File Encodings: UTF-8 and GBK.
Field Separator (optional)
This parameter is required when the file type is Text or CSV. Enter the separator used between fields in the file, based on how the file is actually stored. If left empty, a comma (,) is used by default.
Compression Format (optional)
Select the compression format of the file. The system supports the following compression formats:
zip
gzip
bzip2
Output Fields
Displays the output fields. You can manually add output fields:
Click Batch Add. JSON and TEXT formats are supported for batch configuration.
Batch configuration in JSON format, for example:
[{ "index": 0, "type": "double", "name": "HDFS1" }]
Note: index represents the index of the imported field, type represents the field type after import, and name represents the field name.
Batch configuration in TEXT format, for example:
0,HDFS1,Double
1,HDFS2,String
The row delimiter separates the entries for each field. The default is a line feed (\n); line feed (\n), semicolon (;), and period (.) are supported.
The column delimiter separates the field index, field name, and field type. The default is a comma (,).
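The two batch formats describe the same field list. The following is a hedged sketch in plain Python (not Dataphin code) that converts the TEXT form, using the default delimiters, into the JSON form:

```python
import json

# Hypothetical converter from the TEXT batch format
# (row delimiter \n, column delimiter ,) to the JSON batch format.
text = "0,HDFS1,Double\n1,HDFS2,String"

fields = []
for row in text.split("\n"):             # row delimiter: line feed
    index, name, ftype = row.split(",")  # column delimiter: comma
    fields.append({"index": int(index), "type": ftype.lower(), "name": name})

print(json.dumps(fields))
# [{"index": 0, "type": "double", "name": "HDFS1"}, {"index": 1, "type": "string", "name": "HDFS2"}]
```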
Click Create Output Field and fill in Column and select Type according to the page prompts.
You can also perform the following operations on the added fields:
Click the icon in the Actions column to edit an existing field.
Click the icon in the Actions column to delete an existing field.
Click Confirm to complete the property configuration of the HDFS input component.