How to configure OSS input component

The OSS input component reads data from OSS data sources. To synchronize data from an OSS data source to other data sources, configure the OSS input component as the source first, and then configure the destination data source.

Prerequisites

An OSS data source is created. For more information, see Create an OSS data source.
The account used to configure the OSS input component properties has the read-through permission on the data source. If you do not have the permission, request it. For more information, see Request permissions on a data source.

Procedure

In the top navigation bar of the Dataphin homepage, choose Develop > Data Integration.
In the top navigation bar of the integration page, select a project. In Dev-Prod mode, also select an environment.
In the left-side navigation pane, click Batch Pipeline. In the Batch Pipeline list, click the batch pipeline that you want to develop to open its configuration page.
Click Component Library in the upper-right corner of the page to open the Component Library panel.
In the left-side navigation pane of the Component Library panel, select Inputs. Find the OSS component in the input component list on the right and drag it to the canvas.
Click the icon in the OSS input component card to open the OSS Input Configuration dialog box.

In the OSS Input Configuration dialog box, configure the following parameters.

Parameter	Description
Step Name	The name of the OSS input component. Dataphin automatically generates a step name, which you can modify based on your business scenario. The name can contain only Chinese characters, letters, underscores (_), and digits, and cannot exceed 64 characters in length.
Data Source	Select an OSS data source configured in Dataphin. The data source type must be OSS Data Source, and the account used to configure the properties must have the read-through permission on the data source. If you do not have the permission, request it. For more information, see Request permissions on a data source. You can also click Create next to Data Source to go to the planning module and add a data source. For more information, see Create an OSS data source.
Object Prefix	The name of the OSS object from which to read data. You can specify multiple object names. For example, if a bucket contains a data folder with the phin.txt file, set the Object Prefix to `data/phin.txt` to synchronize a specific file. To synchronize all files in a folder, use a wildcard character, such as `data/*`.
File Type	The supported file types are Text, CSV, xls, and xlsx. The remaining parameters depend on the file type. For Text and CSV, see Text and CSV formats. For xls and xlsx, see xls and xlsx formats.
Output Fields	Displays the output fields. You can add output fields manually in either of the following ways. Batch Add: Click Batch Add and configure the fields in JSON or TEXT format. JSON format example: `[{"index": 0,"name": "user_id","type": "String"}, {"index": 1,"name": "user_name","type": "String"}]`. Here, index is the column number in the specified object, name is the field name after import, and type is the field type after import. For example, `"index":3,"name":"user_id","type":"String"` imports the fourth column of the file as a field named user_id of type String. TEXT format example: `1,user_name,String`. The row delimiter separates the information of each field. The default value is a line feed (\n), and line feeds (\n), semicolons (;), and periods (.) are supported. The column delimiter separates field names from field types. The default value is a comma (,). Create Output Field: Click Create Output Field, fill in Source Index and Column, and select Type as prompted. For Text and CSV file types, enter the numeric index of the field's column in Source Index. The index starts from 0. You can also perform the following operations on added fields: drag the icon next to a field to change its position, click the edit icon in the Actions column to edit an existing field, or click the delete icon in the Actions column to delete an existing field.
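
The two Batch Add formats carry the same information, so it can help to see one converted into the other. The following is a minimal sketch, not part of Dataphin: the helper name `text_to_fields` is hypothetical, and it assumes that each TEXT row has exactly three parts (index, field name, field type) as in the `1,user_name,String` example.

```python
import json


def text_to_fields(text, row_delim="\n", col_delim=","):
    """Convert TEXT-format rows ('index,name,type') into JSON-style field dicts.

    Hypothetical helper for illustration only; it assumes three parts per row.
    """
    fields = []
    for row in text.split(row_delim):
        row = row.strip()
        if not row:
            continue  # ignore blank rows, such as a trailing delimiter
        index, name, field_type = (part.strip() for part in row.split(col_delim))
        fields.append({"index": int(index), "name": name, "type": field_type})
    return fields


# Two fields separated by the default row delimiter (line feed).
fields = text_to_fields("0,user_id,String\n1,user_name,String")

# The same two fields separated by a semicolon row delimiter.
fields_semicolon = text_to_fields("0,user_id,String;1,user_name,String", row_delim=";")

# Serialize to the JSON shape shown in the Output Fields description.
print(json.dumps(fields))
```

Because index starts from 0, the first row above describes the first column of the file.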

Text and CSV formats

Parameter	Description
Column Delimiter	The column delimiter of the file. Defaults to a comma (,).
Row Delimiter	The row delimiter of the file. Defaults to a line feed (\n).
File Encoding	The encoding format of the source file. Supported values: UTF-8 and GBK.
Null Value	Enter the strings to be treated as null. If a value in the source matches one of these strings, it is converted to null.
Compression Format	The compression format of the files. Leave this parameter empty (default) if the files are not compressed. Supported formats: zip, gzip, bzip2, lzo, and lzo_deflate.
First Row Content Type	The content type of the first row. Supported values: Data Content or Column Name.
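
To see what these settings mean for a concrete file, the sketch below reads a small gzip-compressed, GBK-encoded CSV payload using Python's standard library. It only illustrates the meaning of each parameter and does not describe how Dataphin reads files internally. The function `read_delimited` and its argument names are hypothetical, and the row delimiter is assumed to be a line feed.

```python
import csv
import gzip
import io


def read_delimited(raw, column_delimiter=",", encoding="utf-8",
                   null_values=(), compressed=False, first_row_is_header=False):
    """Decode raw bytes and split them into rows according to the settings.

    Hypothetical helper: returns (header, rows); header is None when the
    first row is treated as data content.
    """
    if compressed:
        raw = gzip.decompress(raw)      # Compression Format: gzip
    text = raw.decode(encoding)         # File Encoding: UTF-8 or GBK
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=column_delimiter)
    rows = list(reader)
    header = None
    if first_row_is_header and rows:    # First Row Content Type: Column Name
        header, rows = rows[0], rows[1:]
    # Null Value: replace matching strings with None in the data rows.
    rows = [[None if cell in null_values else cell for cell in row] for row in rows]
    return header, rows


sample = "user_id,user_name\n1,alice\n2,NULL\n"
payload = gzip.compress(sample.encode("gbk"))

header, rows = read_delimited(
    payload,
    column_delimiter=",",
    encoding="gbk",
    null_values=("NULL",),
    compressed=True,
    first_row_is_header=True,
)
```

With First Row Content Type set to Data Content instead, the first row would stay in `rows` and `header` would be `None`.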

Xls and xlsx formats

Parameter	Description
Sheet Selection	Select the sheets to read by name or by index. To read multiple sheets, make sure that they have the same data format. By Name: Enter the name of the sheet that you want to read. By Index: Enter the index of the sheet that you want to read. The index starts from 0.
Data Content Start Row	Specify the starting row of the data content. The default value is 1, which means data starts from the first row. To skip the first N rows, set this value to N+1.
Data Content End Row	Specify the ending row of the data content. If not specified, the system reads to the last row that contains data.
Export Sheet Name	Select whether to export the source sheet name of the data. The exported content is `{sheet name}`.
File Encoding	The system supports UTF-8 and GBK encoding.
Compression Format	The system supports zip, gzip, bzip2, lzo, and lzo_deflate compression formats.
Null Value Conversion	You can specify any string to be converted to a Null value.
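
The start-row and end-row settings use 1-based row numbers, which is easy to confuse with the 0-based sheet index. The sketch below shows the arithmetic on an in-memory list of rows. The function `select_rows` is hypothetical, and treating Data Content End Row as inclusive is an assumption, because the description above does not state it.

```python
def select_rows(rows, start_row=1, end_row=None):
    """Return the rows between start_row and end_row.

    Both bounds are 1-based. end_row is assumed to be inclusive, and
    end_row=None means "read to the last row". Hypothetical helper.
    """
    if start_row < 1:
        raise ValueError("start_row must be 1 or greater")
    # Convert the 1-based start row to a 0-based slice offset.
    return rows[start_row - 1:end_row]


sheet = [
    ["Monthly export"],   # row 1: title line
    ["id", "name"],       # row 2: column names
    [1, "alice"],         # row 3: first data row
    [2, "bob"],           # row 4
    [3, "carol"],         # row 5
]

# Skip the first N = 2 rows by setting the start row to N + 1 = 3.
data_rows = select_rows(sheet, start_row=3)

# Read only rows 3 and 4 by also setting an end row.
first_two = select_rows(sheet, start_row=3, end_row=4)
```

With the default start row of 1 and no end row, every row of the sheet is returned.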

Click OK to complete the property configuration of the OSS input component.

What to do next

After you configure the input component, configure downstream components to complete data synchronization. For more information, see Development description of the integration component library.