After configuring the Amazon S3 input component, you can read data from the Amazon S3 data source into Dataphin for data integration and data development. This topic describes how to configure the Amazon S3 input component.
Prerequisites
The Amazon S3 data source has been created. For more information, see Create an Amazon S3 Data Source.
To configure the Amazon S3 input component properties, your account must have read-through permission on the data source. If you lack this permission, request access first. For more information, see Request Data Source Permission.
Procedure
On the Dataphin home page, select Development > Data Integration from the top menu bar.
In the top menu bar of the integration page, select the project (in Dev-Prod mode, you must also select an environment).
In the left-side navigation pane, click Batch Pipeline, and in the Batch Pipeline list, click the Offline Pipeline you need to develop to open its configuration page.
Click Component Library in the upper-right corner of the page to open the Component Library panel.
In the left-side navigation pane of the Component Library panel, select Input, find the Amazon S3 component in the input component list on the right, and drag the component to the canvas.
Click the icon on the Amazon S3 input component card to open the Amazon S3 Input Configuration dialog box.
In the Amazon S3 Input Configuration dialog box, configure the parameters.
Parameter
Description
Step Name
The name of the Amazon S3 input component. Dataphin automatically generates the step name, and you can also modify it according to the business scenario. The naming convention is as follows:
Can only contain Chinese characters, letters, underscores (_), and numbers.
Cannot exceed 64 characters.
Datasource
The data source drop-down list displays all Amazon S3 data sources in Dataphin, including those you have read-through permission for and those you do not. Click the copy icon to copy the name of the current data source.
For data sources without read-through permission, click Request next to the data source to request read-through permission. For more information, see Request Data Source Permission.
If you do not have an Amazon S3 type data source, click Create Data Source to create a data source. For more information, see Create an Amazon S3 Data Source.
Object Prefix
An object is the basic unit for storing data in Amazon S3 and is also known as a file. An object consists of metadata, user data, and a key (file name); the key uniquely identifies the object within a bucket. The input component supports multiple objects: click + Add Object Prefix to add more.
If a directory is configured in the data source, it is automatically displayed here. You can modify it, but confirm that you have permission for the new directory; otherwise, the task fails.
File Type
Supports Text, CSV, xls, and xlsx. Different file types require different configuration parameters.
For configuration parameters required for Text and CSV file types, see Text and CSV File Types.
For configuration parameters required for xls and xlsx file types, see xls and xlsx File Types.
File Encoding
Supports UTF-8 and GBK encoding.
Null Value Conversion
The default is empty. You can specify any string to be converted to a NULL value.
Compression Format
Supports zip, gzip, bzip2, lzo, and lzo_deflate compression formats.
Output Fields
Displays the output fields. You can manually add output fields:
Click Batch Add.
Configure fields in batches in JSON format. For example, {"index":3,"name":"user_id","type":"String"} imports the fourth column of the file as a field named user_id of type String.
Configure fields in batches in TEXT format:
The row delimiter separates the entries for each field. The default is a line feed (\n); line feed (\n), semicolon (;), and period (.) are supported.
The column delimiter separates the field name from the field type. The default is a comma (,).
Click Create Output Field, then fill in the Source Ordinal Number and Column and select the Type as prompted on the page. For Text and CSV file types, the source ordinal number is the 0-based numeric position of the column that the field comes from.
You can also perform the following operations on the added fields:
Drag the move icon next to a field to change its position.
Click the edit icon in the Actions column to modify an existing field.
Click the delete icon in the Actions column to delete an existing field.
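The two batch formats above can be illustrated with a short sketch. The field names are illustrative, and the JSON form assumes multiple fields are supplied as an array of such objects:

```python
import json

# JSON batch format: one object per output field.
# "index" is the 0-based position of the column in the source file.
json_batch = '''[
  {"index": 0, "name": "order_id", "type": "String"},
  {"index": 3, "name": "user_id", "type": "String"}
]'''
fields = json.loads(json_batch)
assert fields[1]["index"] == 3  # the fourth column of the file

# TEXT batch format: fields separated by the row delimiter (default \n),
# field name and type separated by the column delimiter (default ,).
text_batch = "order_id,String\nuser_id,String"
parsed = [line.split(",") for line in text_batch.split("\n")]
print(parsed)  # [['order_id', 'String'], ['user_id', 'String']]
```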
Text and CSV File Types
Parameter
Description
Column Delimiter
Fill in the delimiter between columns in the file according to the actual storage situation. If not filled in, the default is a comma (,).
Row Delimiter
Fill in the delimiter between rows in the file according to the actual storage situation. If not filled in, the default is a line feed (\n).
First Row Content Type
If the first row contains the field names, select field name; otherwise, select data content.
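To make the 0-based source ordinal, the delimiters, and the null value conversion concrete, here is a small sketch of how one Text-file row would be split. The sample data and the "\N" null token are illustrative:

```python
# Sample row using the default column delimiter (,) and row delimiter (\n).
# Source ordinals start at 0, so ordinal 3 is the fourth column.
raw = "1001,alice,2024-01-01,u_42\n"

row = raw.rstrip("\n").split(",")
user_id = row[3]  # source ordinal 3
print(user_id)    # u_42

# Null value conversion: a configured string is read as NULL (None here).
NULL_TOKEN = "\\N"   # hypothetical configured null token
value = "\\N"
converted = None if value == NULL_TOKEN else value
assert converted is None
```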
xls and xlsx File Types
Parameter
Description
Sheet Selection
You can select the sheet to read by name or index. If multiple sheets are read, the data format must be consistent.
Sheet Name
Separate the names of multiple sheets to be read with a comma (,). You can also enter * to read all sheets.
Important: * and comma (,) cannot be used together.
Data Content Start Row
The default is 1, starting from the first row as data content. If you need to ignore the first N rows, set the data content start row to N+1.
Data Content End Row
Optional. If not specified, the default is to read to the last row with data.
Important: The content end row must be greater than or equal to the start row; otherwise, the task reports an error.
Export Sheet Name
By default, the sheet name is not exported. If you select export, a source sheet field is added to the output fields.
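The start-row and end-row semantics above can be sketched as a simple slice over a sheet's rows. Row numbers are 1-based and both bounds are inclusive; the sample sheet data is illustrative:

```python
rows = [
    ["name", "age"],    # row 1 (e.g. a header row to skip)
    ["alice", "30"],    # row 2
    ["bob", "25"],      # row 3
    ["carol", "41"],    # row 4
]

def read_sheet(rows, start_row=1, end_row=None):
    """Return data content between 1-based start_row and end_row, inclusive."""
    if end_row is not None and end_row < start_row:
        raise ValueError("content end row must be >= start row")
    return rows[start_row - 1 : end_row]

# To ignore the first N rows (here N=1), set the start row to N+1 = 2.
data = read_sheet(rows, start_row=2)
print(len(data))  # 3
```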
Click Confirm to complete the configuration of the Amazon S3 input component properties.