The OSS input component reads data from OSS data sources. In scenarios where you need to synchronize data from an OSS data source to other data sources, you must first configure the source data source for the OSS input component, and then configure the destination data source for data synchronization. This topic describes how to configure an OSS input component.
Prerequisites
An OSS data source is created. For more information, see Create an OSS data source.
The account that configures the properties of the OSS input component has the read-through permission on the data source. If you do not have the permission, you must request the permission on the data source. For more information, see Request permissions on a data source.
Procedure
In the top navigation bar of the Dataphin homepage, choose Develop > Data Integration.
In the top navigation bar of the integration page, select a project. In Dev-Prod mode, you must also select an environment.
In the left-side navigation pane, click Batch Pipeline. In the Batch Pipeline list, click the offline pipeline that you want to develop to open its configuration page.
Click Component Library in the upper-right corner of the page to open the Component Library panel.
In the left-side navigation pane of the Component Library panel, select Inputs. Find the OSS component in the input component list on the right and drag it to the canvas.
Click the icon in the OSS input component card to open the OSS Input Configuration dialog box.
In the OSS Input Configuration dialog box, configure the following parameters.
Parameter
Description
Step Name
The name of the OSS input component. Dataphin automatically generates a step name. You can also modify the name based on your business scenario. The name must meet the following requirements:
The name can contain only Chinese characters, letters, underscores (_), and digits.
The name cannot exceed 64 characters in length.
Datasource
Select a data source that is configured in the Dataphin system and meets the following conditions:
The data source type is OSS Data Source.
The account that configures the properties has the read-through permission on the data source. If you do not have the permission, you must request the permission on the data source. For more information, see Request permissions on a data source.
You can also click Create next to Data Source to go to the planning module and add a data source. For more information, see Create an OSS data source.
Object Prefix
The name of the OSS object from which you want to read data. You can specify multiple object names. For example, if a bucket in OSS contains a data folder that includes the phin.txt file, you can set Object Prefix to data/phin.txt to synchronize that specific file. To synchronize all files in a folder, you need to use a wildcard character, such as data/*.
File Type
The system supports reading files in the Text, CSV, xls, and xlsx formats. Different formats require different configuration information.
Text and CSV formats: For configuration details, see Text and CSV formats.
xls and xlsx formats: For configuration details, see xls and xlsx formats.
Output Fields
Displays the output fields. You can manually add output fields:
Click Batch Add.
Configure in JSON format, for example:
// Example: [{"index": 0,"name": "user_id","type": "String"}, {"index": 1,"name": "user_name","type": "String"}]
Note
index indicates the column number of the specified object, name indicates the field name after import, and type indicates the field type after import. For example, "index":3,"name":"user_id","type":"String" indicates that the fourth column in the file is imported, the field name is user_id, and the field type is String.
Configure in TEXT format, for example:
1,user_name,String
The row delimiter is used to separate the information of each field. The default value is a line feed (\n). The system supports line feeds (\n), semicolons (;), and periods (.).
The column delimiter is used to separate field names from field types. The default value is a comma (,).
Click Create Output Field, then fill in Source Index and Column and select Type as prompted. For Text and CSV file types, Source Index must be the numeric index of the column in which the field is located. The index starts from 0.
You can also perform the following operations on added fields:
Click and drag the icon next to a field to change its position.
In the Actions column, click the corresponding icon to edit an existing field.
In the Actions column, click the corresponding icon to delete an existing field.
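The two Batch Add formats described above carry the same information, which the following minimal Python sketch illustrates. It is not Dataphin code: the function names parse_json_fields and parse_text_fields are hypothetical, and the TEXT layout index,name,type is an assumption inferred from the single example 1,user_name,String.

```python
import json
from typing import Dict, List

Field = Dict[str, object]


def parse_json_fields(spec: str) -> List[Field]:
    """Parse a JSON-format field spec and check that each entry has the three keys."""
    fields = json.loads(spec)
    for field in fields:
        missing = {"index", "name", "type"} - set(field)
        if missing:
            raise ValueError(f"field {field!r} is missing keys: {sorted(missing)}")
    return fields


def parse_text_fields(spec: str, row_delim: str = "\n", col_delim: str = ",") -> List[Field]:
    """Parse a TEXT-format field spec.

    Assumption: each row is laid out as index,name,type.
    """
    fields = []
    for row in spec.split(row_delim):
        row = row.strip()
        if not row:
            continue  # ignore blank rows, e.g. after a trailing delimiter
        index, name, type_ = (part.strip() for part in row.split(col_delim))
        fields.append({"index": int(index), "name": name, "type": type_})
    return fields


json_spec = (
    '[{"index": 0, "name": "user_id", "type": "String"}, '
    '{"index": 1, "name": "user_name", "type": "String"}]'
)
text_spec = "0,user_id,String\n1,user_name,String"

json_fields = parse_json_fields(json_spec)
text_fields = parse_text_fields(text_spec)
```

Under these assumptions both specs describe the same two output fields, with index 0 referring to the first column of the source file.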
Text and CSV formats
Parameter
Description
Column Delimiter
The column delimiter of the file. If you do not specify this parameter, the system uses a comma (,) as the default value.
Row Delimiter
The row delimiter of the file. If you do not specify this parameter, the system uses a line feed (\n) as the default value.
File Encoding
The encoding format of the file from which you want to read data. The system supports UTF-8 and GBK for File Encoding.
Null Value
Enter the strings that you want to be treated as null in the text box. If these strings appear in the source data, they are converted to null.
Compression Format
The format in which files are compressed. By default, this parameter is left empty, which indicates that files are not compressed. The system supports the following compression formats:
zip
gzip
bzip2
lzo
lzo_deflate
First Row Content Type
Select the content type of the first row in the text. The first row content type can be Data Content or Column Name.
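To make the effect of these parameters concrete, here is a minimal local sketch in Python of how a reader might apply them to raw file bytes. It is only an illustration under stated assumptions, not how Dataphin is implemented: read_delimited and its argument names are hypothetical, quoting and compression are not handled, and a first row of type Column Name is assumed to be skipped rather than emitted as data.

```python
from typing import List, Optional, Sequence


def read_delimited(
    raw: bytes,
    column_delimiter: str = ",",
    row_delimiter: str = "\n",
    encoding: str = "utf-8",
    null_values: Sequence[str] = (),
    first_row_is_column_names: bool = False,
) -> List[List[Optional[str]]]:
    """Split raw bytes into rows and cells, mapping configured strings to None."""
    text = raw.decode(encoding)
    # Drop empty rows, such as the one produced by a trailing row delimiter.
    lines = [line for line in text.split(row_delimiter) if line]
    if first_row_is_column_names:
        lines = lines[1:]  # assumption: a Column Name row is not data
    nulls = set(null_values)
    return [
        [None if cell in nulls else cell for cell in line.split(column_delimiter)]
        for line in lines
    ]


raw = "user_id|user_name\n1|alice\n2|NULL\n".encode("gbk")
rows = read_delimited(
    raw,
    column_delimiter="|",
    encoding="gbk",
    null_values=["NULL"],
    first_row_is_column_names=True,
)
```

In this example, rows holds two data rows, and the NULL cell in the second row is returned as None.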
xls and xlsx formats
Parameter
Description
Sheet Selection
You can select sheets to read by name or index. If you want to read multiple sheets, make sure that they have the same data format.
By Name: You need to fill in the Sheet Name that you want to read.
By Index: You need to fill in the Sheet Index that you want to read, starting from 0.
Data Content Start Row
Fill in the starting row of the data content. The default value is 1, which means that the data content starts from the first row. If you want to ignore the first N rows, set the data content start row to N+1.
Data Content End Row
Fill in the ending row of the data content. If you do not specify this parameter, the system reads data to the last row that contains data by default.
Export Sheet Name
Select whether to export the source sheet name of the data. The exported content is {sheet name}.
File Encoding
The system supports UTF-8 and GBK encoding.
Compression Format
The system supports zip, gzip, bzip2, lzo, and lzo_deflate compression formats.
Null Value Conversion
You can specify any string to be converted to a Null value.
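The start row and end row settings can be read as a 1-based row window over the sheet. The short Python sketch below shows that arithmetic on an in-memory list of rows. The function select_data_rows is hypothetical, and treating the end row as inclusive is an assumption that the text above does not state explicitly.

```python
from typing import List, Optional, Sequence


def select_data_rows(
    rows: Sequence[Sequence[object]],
    start_row: int = 1,
    end_row: Optional[int] = None,
) -> List[Sequence[object]]:
    """Return rows start_row..end_row using 1-based row numbers.

    Assumption: end_row is inclusive; None means read to the last row.
    """
    if start_row < 1:
        raise ValueError("start_row must be at least 1")
    if end_row is not None and end_row < start_row:
        raise ValueError("end_row must not be smaller than start_row")
    # A 1-based inclusive window [start_row, end_row] is the slice [start_row - 1:end_row].
    return list(rows[start_row - 1:end_row])


sheet = [
    ["Monthly report"],  # row 1: title
    ["id", "name"],      # row 2: column names
    [1, "a"],            # row 3
    [2, "b"],            # row 4
    [3, "c"],            # row 5
]

# Ignore the first N = 2 rows by setting the start row to N + 1 = 3.
all_data = select_data_rows(sheet, start_row=3)
first_two = select_data_rows(sheet, start_row=3, end_row=4)
```

With a start row of 3 and no end row, the window covers rows 3 through 5. Adding an end row of 4 limits it to rows 3 and 4.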
Click OK to complete the property configuration of the OSS input component.
What to do next
After you configure the input component, you can configure downstream components to implement data synchronization. For more information, see Development description of the integration component library.