Dataphin: Configure FTP Input Widget

Last Updated: Mar 05, 2025

The FTP input widget facilitates the transfer of data from an FTP server to the storage system associated with the big data platform, enabling data integration and further processing. This topic describes the steps to configure the FTP input widget.

Prerequisites

  • You have successfully created an FTP data source. For more information, see Create FTP Data Source.

  • To configure the FTP input widget's properties, your account must have read-through permission for the data source. If you lack the necessary permission, request it first. For more information, see Request, Renew, and Return Data Source Permissions.

Procedure

  1. On the Dataphin home page, navigate to the top menu bar and select Development > Data Integration.

  2. At the top of the integration page, select the Project (in Dev-Prod mode, you must also select the environment).

  3. In the left-side navigation pane, click Batch Pipeline, and in the Batch Pipeline list, click the Offline Pipeline you want to develop to access its configuration page.

  4. Click Component Library in the upper-right corner to open the Component Library panel.

  5. In the Component Library panel's left-side navigation pane, select Input. Locate the FTP component in the input widget list on the right and drag it onto the canvas.

  6. Click the icon on the FTP input widget card to open the FTP Input Configuration dialog box.

  7. In the FTP Input Configuration dialog box, set the necessary parameters.

    The FTP input widget supports the following File Types: Text, CSV, Xls, and Xlsx. Different file types require different configurations, as detailed below:

    Text and CSV Parameter Configuration Instructions


    Basic Configuration

    Step Name

    Fill in the name according to the current widget's usage scenario. The naming convention is as follows:

    • Can only contain Chinese characters, letters, underscores (_), and numbers.

    • Cannot exceed 64 characters.

    Datasource

    Select the data source. Choose a data source configured in the Dataphin system, and the data source must meet the following two conditions:

    • The data source type is FTP Data Source, SFTP Data Source, or FTPS Data Source.

    • The account performing the property configuration has read-through permission for the data source. If you do not have permission, request data source permission first. For more information, see Request, Renew, and Return Data Source Permissions.

    You can also click New after Datasource to enter the Management Center module to add a data source. For more information, see Create FTP Data Source.

    File Path

    Fill in the file path. Multiple file paths are supported, separated by semicolons (;), and wildcard characters are supported. For example, specifying /dataphin/* reads all files under the dataphin directory.
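
    For instance, a minimal sketch with hypothetical paths that reads all CSV files under /dataphin/2025-01 plus one fixed file:

    // Example (hypothetical paths):
    /dataphin/2025-01/*.csv;/upload/users.txt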

    File Type

    Select Text or CSV file type.

    Mark Complete File Check

    Enable the mark complete file check to verify that the file data is ready before it is read. The default is Off.

    1. After enabling, click Check Configuration.

    2. In the Mark Complete File Check Configuration dialog box, configure the check parameters.

      • Mark Complete File Path: Fill in the path of the mark complete file to be checked. System parameters, global parameters, and cross-node parameters are supported. For example, /${check}/dataphin.

      • Check Interval (seconds): Fill in the interval time for each file check. The default is 60 seconds.

      • Check Duration (minutes): Fill in the duration for each file check. The default is 60 minutes.

        Important
        • The check duration and the data transmission duration are counted together as the runtime of the integration task, so pay attention to the check duration and the runtime timeout configuration. Resources are occupied during the check period, so configure these values carefully.

        • If the check time exceeds the task timeout, the task will be forcibly terminated.

      • Check Failure Handling Policy: If the file check fails, data extraction and writing are not performed. Two handling policies are supported: Set Task Failed and Set Task Successful.

        • Set Task Failed: After the check fails, the system sets the check task to a failed status, and the integration task will not be executed.

        • Set Task Successful: After the check fails, the system sets the check task to a successful status and continues to execute subsequent integration tasks.

    3. Click Confirm to complete the mark complete file check configuration.
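
    As a worked illustration (all values hypothetical), the default settings poll for the mark complete file roughly every 60 seconds for up to 60 minutes, that is, about 60 checks before the check is considered failed:

    // Example (hypothetical values; ${bizdate} stands for a scheduling parameter):
    Mark Complete File Path: /upload/${bizdate}/data.done
    Check Interval (seconds): 60
    Check Duration (minutes): 60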

    When File Does Not Exist

    Supports the Ignore and Set Task Failed policies. If the mark complete file check is enabled, this parameter cannot be configured.

    • Ignore: When the file being read does not exist, ignore the file and continue reading other files.

    • Set Task Failed: When the file being read does not exist, terminate the task and set it to failed.

    Data Content Start Row

    Set the start row for the input widget to read data. The default is 1, starting from the first row as data content. If you need to ignore the first N rows, set the data content start row to N+1.

    Advanced Configuration

    Splitting Method

    Text supports Separator Splitting and Field Length Splitting, while CSV supports Separator Splitting.

    • Separator Splitting: Rows and fields will be split based on Field Separator and Row Separator.

    • Field Length Splitting: Each line of the file will be treated as a long string, and fields will be extracted based on the start and end character positions.
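
    For example, a minimal sketch of Field Length Splitting on a hypothetical fixed-width file, where the first 10 characters of each line are a user ID and the next 5 are a user name:

    // Input lines (hypothetical):
    0000001001Alice
    0000001002Bobby
    // Extracted by character position: user_id = 0000001001, user_name = Alice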

    Field Separator

    When the splitting method is Separator Splitting, fill in the field separator used in the file. If you leave it empty, the system defaults to a comma (,).

    Row Separator

    When the splitting method is Separator Splitting, fill in the row separator used in the file; if you leave it empty, the system defaults to a line feed (\n). When the splitting method is Field Length Splitting, the row separator cannot be configured. When the file type is Text, you cannot configure both the row separator and textReaderConfig in More Configuration.
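
    For instance, a minimal sketch assuming a Text file that stores fields separated by a vertical bar (|):

    // File content (hypothetical):
    1001|Alice|2025-01-01
    1002|Bob|2025-01-02
    // With Field Separator set to |, each line splits into three fields.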

    File Encoding

    Select the file encoding. The system supports file encodings such as UTF-8 and GBK.

    Null Value Conversion

    Configure the string that represents NULL; occurrences of this string in the source data are replaced with NULL. If this parameter is not configured, the source data is not specially processed.

    Compression Format

    If the file is compressed, select the corresponding compression format for Dataphin to decompress. Supported compression formats include zip, gzip, bzip2, lzo, lzo-deflate, hadoop-snappy, and framing-snappy.

    More Configuration

    Enter other control configuration items for reading data. For example, use textReaderConfig to control the reading of Text files. The configuration example is as follows.

    {
      "textReaderConfig": {
        "useTextQualifier": false,  // Whether there is a qualifier
        "textQualifier": "\"",      // Configure the qualifier
        "caseSensitive": true,      // Whether the qualifier is case-sensitive
        "trimWhitespace": false     // Whether to remove whitespace before and after each column's content
      }
    }
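
    To illustrate the effect of the qualifier settings (the input line is hypothetical): with useTextQualifier set to true and textQualifier set to a double quote, the comma inside the quotes does not split the line.

    // Input line (hypothetical):
    "Shanghai,China",1001
    // useTextQualifier=true:  two fields  -> [Shanghai,China] and [1001]
    // useTextQualifier=false: three fields -> ["Shanghai], [China"], and [1001]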

    Output Fields

    Displays the output fields. You can manually add output fields in the following ways:

    • Batch Add Output Fields.

      • Format: Click Batch Add. Batch configuration in JSON format and TEXT format is supported.

        • JSON Format:

          // Example:
          [{
            "startPos": 0,
            "endPos": 10,
            "name": "user_id",
            "type": "String"
          },
          {
            "startPos": 11,
            "endPos": 15,
            "name": "user_name",
            "type": "String"
          }]
        • TEXT Format:

          // Example:
          0,10,user_id,String
          11,15,user_name,String
      • Splitting Method: When the file type is Text and the splitting method is Field Length Splitting, the splitting method for batch addition can be configured, including By Field Start Position and By Field Length.

        • By Field Start Position: The first number indicates the start character position of the field, the second number indicates the end position, and the last two indicate the field name and field type. For example, the Text format 0,10,user_id,String indicates that the first to the eleventh character of each line of the file is introduced as a field, with the field name user_id and field type String.

        • By Field Length: The first number indicates the field length, and the last two indicate the field name and field type. For example, the Text format 11,user_id,String introduces a field with a length of 11, named user_id, of type String. The next field's length is calculated from the first character after the previous field.

      • Row Separator, Column Separator: When the batch addition Format is TEXT, you can configure the row separator and column separator. The row separator separates the definitions of individual fields; the default is a line feed (\n), and \n, semicolon (;), and period (.) are supported. The column separator separates the field name and field type; the default is a comma (,). For an example with a custom row separator, see the sketch after this list.

    • Preview Splitting Effect.

      When the file type is Text and the splitting method is Field Length Splitting, previewing the splitting effect is supported.

      1. Click Preview Splitting Effect.

      2. In the preview splitting effect dialog box, enter the test string and click Test to view the splitting effect.

    • Create New Output Field.

      Click Create New Output Field, and fill in the Source Ordinal Number, Field, and select Type according to the page prompts. For Text and CSV file types, the source ordinal number must be filled in with the numeric ordinal number of the column where the field is located, starting from 0.

    • Manage Output Fields.

      For added fields, you can perform the following operations:

      • Click the edit icon in the Actions column to edit an existing field.

      • Click the delete icon in the Actions column to delete an existing field.
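
    As referenced above, a minimal sketch of a TEXT-format batch add that uses a semicolon (;) as the row separator and the default comma as the column separator (the field definitions are hypothetical):

    // Example:
    0,10,user_id,String;11,15,user_name,String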

    Xls and Xlsx Parameter Configuration Instructions


    Step Name

    Fill in the name according to the current widget's usage scenario. The naming convention is as follows:

    • Can only contain Chinese characters, letters, underscores (_), and numbers.

    • Cannot exceed 64 characters.

    Datasource

    Select the data source. Choose a data source configured in the Dataphin system, and the data source must meet the following two conditions:

    • The data source type is FTP Data Source, SFTP Data Source, or FTPS Data Source.

    • The account performing the property configuration has read-through permission for the data source. If you do not have permission, request data source permission first. For more information, see Request, Renew, and Return Data Source Permissions.

    You can also click New after Datasource to enter the Management Center module to add a data source. For more information, see Create FTP Data Source.

    File Path

    Fill in the file path. Multiple file paths are supported, separated by semicolons (;). Wildcard characters are supported. For example, specifying /dataphin/* reads all files under the dataphin directory.

    File Type

    Select xls or xlsx file type.

    Mark Complete File Check

    Enable the mark complete file check to verify that the file data is ready before it is read. The default is Off.

    1. After enabling, click Check Configuration.

    2. In the Mark Complete File Check Configuration dialog box, configure the check parameters.

      • Mark Complete File Path: Fill in the path of the mark complete file to be checked. System parameters, global parameters, and cross-node parameters are supported. For example, /${check}/dataphin.

      • Check Interval (seconds): Fill in the interval time for each file check. The default is 60 seconds.

      • Check Duration (minutes): Fill in the duration for each file check. The default is 60 minutes.

        Important
        • The check duration and the data transmission duration are counted together as the runtime of the integration task, so pay attention to the check duration and the runtime timeout configuration. Resources are occupied during the check period, so configure these values carefully.

        • If the check time exceeds the task timeout, the task will be forcibly terminated.

      • Check Failure Handling Policy: If the file check fails, data extraction and writing are not performed. Two handling policies are supported: Set Task Failed and Set Task Successful.

        • Set Task Failed: After the check fails, the system sets the check task to a failed status, and the integration task will not be executed.

        • Set Task Successful: After the check fails, the system sets the check task to a successful status and continues to execute subsequent integration tasks.

    3. Click Confirm to complete the mark complete file check configuration.

    When File Does Not Exist

    Supports the Ignore and Set Task Failed policies. If the mark complete file check is enabled, this parameter cannot be configured.

    • Ignore: When the file being read does not exist, ignore the file and continue reading other files.

    • Set Task Failed: When the file being read does not exist, terminate the task and set it to failed.

    Sheet Selection

    Supports By Name and By Index methods. If multiple sheets are read, the data format must be consistent.

    • Sheet Name: Multiple sheets can be read, separated by commas (,), or you can enter * to read all sheets. * and commas cannot be mixed. For example, sheet1,sheet2.

    • Sheet Index: Multiple sheets can be read, separated by commas (,), or you can enter * to read all sheets. * and commas cannot be mixed. For example, 0,3,7-9 specifies both individual and consecutive sheets.

    Data Content Start Row

    Set the start row for the input widget to read data. The default is 1, starting from the first row as data content. If you need to ignore the first N rows, set the data content start row to N+1.

    Data Content End Row

    Set the end row for the input widget to read data. If no end row is specified, data is read to the last row. The Data Content End Row must not be less than the Data Content Start Row.

    Export Sheet Name

    You can choose Export or Do Not Export. Selecting Export adds an output field whose content is the source sheet name of each row.

    File Encoding

    Select the file encoding. The system supports file encodings such as UTF-8 and GBK.

    Null Value Conversion

    Configure the string that represents NULL; occurrences of this string in the source data are replaced with NULL. If this parameter is not configured, the source data is not specially processed.

    Compression Format

    If the file is compressed, select the corresponding compression format for Dataphin to decompress. Supported compression formats include zip, gzip, bzip2, lzo, lzo-deflate, hadoop-snappy, and framing-snappy.

    Output Fields

    Displays the output fields. You can manually add output fields in the following ways:

    • Batch Add Output Fields.

      • Click Batch Add. Batch configuration in JSON format and TEXT format is supported.

        • JSON Format:

          // Example:
          [{
            "startPos": 0,
            "endPos": 10,
            "name": "user_id",
            "type": "String"
          },
          {
            "startPos": 11,
            "endPos": 15,
            "name": "user_name",
            "type": "String"
          }]
        • TEXT Format:

          Row Separator, Column Separator: When the batch addition Format is TEXT, you can configure the row separator and column separator. The row separator separates the definitions of individual fields; the default is a line feed (\n), and \n, semicolon (;), and period (.) are supported. The column separator separates the field name and field type; the default is a comma (,).

          // Example:
          0,10,user_id,String
          11,15,user_name,String
    • Create New Output Field.

      Click Create New Output Field, fill in the Source Ordinal Number and Field, and select the Type according to the page prompts. For xls and xlsx file types, fill in the source ordinal number with the uppercase letter of the column or the numeric ordinal of the column, counting from 0. If you enter a lowercase letter, the system automatically converts it to uppercase. If Export Sheet Name is set to Export, the source ordinal number of that field is (-) and cannot be modified. For an illustration of equivalent ordinal numbers, see the sketch after this list.

    • Manage Output Fields.

      You can also perform the following operations on added fields:

      • Click the edit icon in the Actions column to edit an existing field.

      • Click the delete icon in the Actions column to delete an existing field.
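
      As referenced above, a minimal illustration of equivalent Source Ordinal Number entries for xls and xlsx (the field names are hypothetical):

      // Source Ordinal Number   Field      Type
      // A (or 0)                user_id    String
      // C (or 2)                city       String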

  8. Click Confirm to finalize the configuration of the FTP input widget.