This topic describes the data types and parameters that FTP Reader supports and how to configure it by using the codeless user interface (UI) and code editor.

Background information

FTP Reader allows you to read data from a remote FTP server. FTP Reader connects to an FTP server, reads data from the server, converts the data to a format that is readable by Data Integration, and then sends the converted data to a writer.

FTP Reader can read only FTP files that store logical two-dimensional tables, for example, text information in the CSV format.

FTP servers store unstructured data only. FTP Reader supports the following features:
  • Reads TXT files that store logical two-dimensional tables. FTP Reader can read only TXT files.
  • Reads data that is stored in formats similar to CSV with custom delimiters.
  • Reads data of various types as strings, and supports constants and column pruning.
  • Supports recursive reading and file name-based filtering.
  • Supports the following file compression formats: GZIP, BZIP2, ZIP, LZO, and LZO_DEFLATE.
  • Concurrently reads multiple files.
FTP Reader does not support the following features:
  • Using concurrent threads to read a single uncompressed file.
  • Using concurrent threads to read a single compressed file.
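
The concurrency model described above (several files read in parallel, each file handled by a single thread) can be sketched in Python as follows. This is only an illustration under assumptions, not FTP Reader's implementation: `read_files_concurrently`, `read_one`, and `channels` are hypothetical names, and the callback stands in for an actual FTP download.

```python
from concurrent.futures import ThreadPoolExecutor


def read_files_concurrently(paths, read_one, channels=1):
    """Read each path with exactly one worker; at most `channels` run at once.

    `read_one` is a hypothetical callback that returns the rows of one file.
    """
    if not paths:
        # Mirrors the documented behavior: no readable files is an error.
        raise FileNotFoundError("no readable files in the specified path")
    workers = max(1, min(channels, len(paths)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with `paths`.
        return dict(zip(paths, pool.map(read_one, paths)))


if __name__ == "__main__":
    # In-memory stand-ins for two remote files (hypothetical paths).
    fake_files = {"/data/a.csv": ["1,x"], "/data/b.csv": ["2,y", "3,z"]}
    rows = read_files_concurrently(list(fake_files), fake_files.get, channels=2)
    print(rows)
```

A single file is never split across workers in this sketch, which matches the limitation listed above.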

Data types

The data types of remote FTP files are defined by FTP Reader.
Data Integration data type | FTP file data type
LONG | LONG
DOUBLE | DOUBLE
STRING | STRING
BOOLEAN | BOOLEAN
DATE | DATE
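
Because file contents arrive as text, each field has to be converted to the type declared for its column. The following Python sketch shows one way such a conversion could look. The `CASTERS` table, the `cast_field` helper, the accepted boolean spelling, and the date pattern are all assumptions made for illustration; they are not taken from FTP Reader.

```python
from datetime import datetime

# Hypothetical converters keyed by the type names in the table above.
CASTERS = {
    "long": int,
    "double": float,
    "string": str,
    # Assumption: only the text "true" (any case) maps to True.
    "boolean": lambda s: s.strip().lower() == "true",
    # Assumption: dates use the yyyy-MM-dd pattern; real files may differ.
    "date": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}


def cast_field(raw, type_name):
    """Convert one raw text field to the Python value for `type_name`."""
    try:
        caster = CASTERS[type_name.lower()]
    except KeyError:
        raise ValueError(f"unsupported type: {type_name}") from None
    return caster(raw)
```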

Parameters

Parameter Description Required Default value
datasource The connection name. It must be the same as the name of the created connection. You can create connections in the code editor. Yes N/A
path The path of the FTP file to read. You can specify multiple FTP file paths.
  • If you specify a single FTP file, FTP Reader uses only one thread to read the file. Concurrent multi-thread reading of a single uncompressed file is not yet supported.
  • If you specify multiple FTP files, FTP Reader uses multiple threads to read these files. The actual number of threads is determined by the number of channels.
  • When a path contains a wildcard, FTP Reader attempts to read all files that match the path. If the path ends with a slash (/), FTP Reader reads all files in the specified directory. For example, if you specify the path as /bazhen/, FTP Reader reads all files in the bazhen directory. FTP Reader supports only asterisks (*) as file name wildcards. FTP Reader can flexibly generate node names based on Custom parameters.
Note
  • We recommend that you do not use asterisks (*) because they may cause an out-of-memory (OOM) error in the Java virtual machine (JVM).
  • Data Integration treats all the files on a sync node as a single table. Make sure that all the files on each sync node fit the same schema and that Data Integration has the permission to read all these files.
  • Make sure that the data format is similar to CSV.
  • An error occurs if no readable files exist in the specified path.
Yes N/A
column The columns to read. The type parameter specifies the source data type. The index parameter specifies the ID of the column in the source table, starting from 0. The value parameter specifies the column value if the column is a constant column.
By default, FTP Reader reads all data as strings. To use this default, specify the parameter as "column":["*"]. You can also specify the column parameter in the following format:
"column": [
  {
    "type": "long",
    "index": 0 // The first column of the source file, read as a LONG value.
  },
  {
    "type": "string",
    "value": "alibaba" // A constant column whose value is always "alibaba".
  }
]

For each entry of the column parameter, you must specify the type parameter and either the index parameter or the value parameter.

Yes By default, FTP Reader reads all data as strings.
fieldDelimiter The column delimiter.
Note You must specify the column delimiter for FTP Reader. The default delimiter is a comma (,), which is also the default setting on the codeless UI.
Yes ,
skipHeader Specifies whether to skip the header, if one exists, of a CSV-like file. The skipHeader parameter is not supported for compressed files. No false
encoding The encoding format of the file to read. No utf-8
nullFormat The string that represents null. Text files have no standard string that represents null, so Data Integration provides the nullFormat parameter to define which string is treated as a null value.

For example, if you specify nullFormat:"null", Data Integration treats the string null in the source as a null value.

No N/A
markDoneFileName The name of the file that is used to indicate that the sync node can start. Data Integration checks whether the file exists before data synchronization. If the file does not exist, Data Integration checks again later. Data Integration starts the sync node only after the file is detected. No N/A
maxRetryTime The maximum number of checks for the file that is used to indicate that the sync node can start. By default, 60 checks are allowed. Data Integration checks for the file once every minute, so with the default value the whole process lasts at most 60 minutes. No 60
csvReaderConfig The configurations for reading CSV files. The parameter value must match the MAP type. A specific CSV reader is used to read data from CSV files. The CSV reader supports many configurations. No N/A
fileFormat The format of the file that is saved by FTP Reader. By default, FTP Reader converts the data to a two-dimensional table and stores the table in a CSV file. If you specify binary as the file format, Data Integration converts data to the binary format for replication and transmission.

Specify this parameter only when you want to replicate the complete directory structure between storage systems such as FTP and Object Storage Service (OSS).

No N/A
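
To make the interplay of column, fieldDelimiter, skipHeader, and nullFormat more concrete, here is a minimal Python sketch of how a CSV-like text could be projected into output rows. It is an illustration only: `project_rows` is a hypothetical helper, values are kept as strings, and the sketch assumes each column entry carries exactly one of index or value.

```python
import csv
import io


def project_rows(text, columns, field_delimiter=",", skip_header=False, null_format=None):
    """Yield one list per record, built from the `columns` entries.

    Each entry needs "type" plus exactly one of "index" (0-based source
    column) or "value" (constant). Type conversion is left out here.
    """
    for col in columns:
        if "type" not in col or ("index" in col) == ("value" in col):
            raise ValueError(f"invalid column entry: {col}")
    reader = csv.reader(io.StringIO(text), delimiter=field_delimiter)
    if skip_header:
        next(reader, None)  # Drop the first record if there is one.
    for record in reader:
        row = []
        for col in columns:
            if "index" in col:
                cell = record[col["index"]]
                # Source cells equal to null_format become None.
                if null_format is not None and cell == null_format:
                    cell = None
            else:
                cell = col["value"]  # Constant column.
            row.append(cell)
        yield row
```

For example, with a pipe delimiter, a header line, and nullFormat set to "null", the record `1|a|null` projected through index 0, the constant "alibaba", and index 2 yields `["1", "alibaba", None]`.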

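The markDoneFileName and maxRetryTime parameters describe a bounded polling loop. A rough Python sketch of that loop is shown below. `wait_for_done_file` and the `exists` callback are hypothetical names; the callback stands in for a check against the FTP server, and what happens after the last failed check is not specified in this topic, so the sketch simply returns False.

```python
import time


def wait_for_done_file(exists, max_retry_time=60, interval_seconds=60, sleep=time.sleep):
    """Return True once `exists()` reports the marker file, else False.

    `exists` is a hypothetical callable that checks the FTP server for the
    file named by markDoneFileName. At most `max_retry_time` checks are made.
    """
    for attempt in range(max_retry_time):
        if exists():
            return True
        if attempt < max_retry_time - 1:
            sleep(interval_seconds)  # Wait before the next check.
    return False
```

Passing `sleep` as a parameter keeps the sketch easy to test without waiting for real time to pass.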
Configure FTP Reader by using the codeless UI

  1. Configure the connections.
    Configure the connections to the source and destination data stores for the sync node.
    GUI element | Description
    Connection | The datasource parameter in the preceding parameter description. Select a connection type and select the name of a connection that you have configured in DataWorks.
    File Path | The path parameter in the preceding parameter description.
    File Type | The format of the files to be read. The default format is CSV.
    Field Delimiter | The fieldDelimiter parameter in the preceding parameter description. The default delimiter is a comma (,).
    Encoding | The encoding parameter in the preceding parameter description. The default encoding format is UTF-8.
    Null String | The nullFormat parameter in the preceding parameter description, which defines a string that represents the null value.
    Compression Format | The compression format. By default, files are not compressed.
    Include Header | The skipHeader parameter in the preceding parameter description. The default value is No.
  2. Configure field mapping. It is equivalent to setting the column parameter in the preceding parameter description.
    Fields in the source table on the left have a one-to-one mapping with fields in the destination table on the right. You can click Add to add a field. To delete a field, move the pointer over the field and click the Delete icon.
    GUI element | Description
    Map Fields with the Same Name | Click Map Fields with the Same Name to establish a mapping between fields with the same name. The data types of the fields must match.
    Map Fields in the Same Line | Click Map Fields in the Same Line to establish a mapping between fields in the same row. The data types of the fields must match.
    Delete All Mappings | Click Delete All Mappings to remove mappings that have been established.
  3. Configure channel control policies.
    GUI element | Description
    Expected Maximum Concurrency | The maximum number of concurrent threads that the sync node uses to read data from or write data to data stores. You can configure the concurrency for the node on the codeless UI.
    Bandwidth Throttling | Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and set a maximum transmission rate to avoid heavy read workload of the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to a proper value.
    Dirty Data Records Allowed | The maximum number of dirty data records allowed.

Configure FTP Reader by using the code editor

The following example shows how to configure a sync node to read data from an FTP server. For more information, see Create a sync node by using the code editor.
{
    "type":"job",
    "version":"2.0", // The version number.
    "steps":[
        {
            "stepType":"ftp",// The reader type.
            "parameter":{
                "path":[], // The file path.
                "nullFormat":"", // The string that represents null.
                "compress":"", // The compression format.
                "datasource":"", // The connection name.
                "column":[ // The columns to be synchronized from the source table.
                    {
                        "index":0, // The ID of the column in the source table.
                        "type":"" // The data type.
                    }
                ],
                "skipHeader":"", // Specifies whether to skip the file header.
                "fieldDelimiter":",",// The column delimiter.
                "encoding":"UTF-8", // The encoding format.
                "fileFormat":"csv" // The format of the file that is saved by FTP Reader.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0" // The maximum number of dirty data records allowed.
        },
        "speed":{
            "throttle":false, // Specifies whether to enable bandwidth throttling. A value of false indicates that the bandwidth is not throttled. A value of true indicates that the bandwidth is throttled. The maximum transmission rate takes effect only if you set this parameter to true.
            "concurrent":1 // The maximum number of concurrent threads.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
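
For readers who generate such configurations from scripts, the following Python sketch assembles the same structure and serializes it to strict JSON (the `//` comments in the example above are explanatory and are not valid in strict JSON). `build_ftp_job` is a hypothetical helper, the connection name in the usage lines is a placeholder, and writing skipHeader as a boolean is an assumption based on its documented default of false.

```python
import json


def build_ftp_job(datasource, paths, columns, concurrent=1):
    """Assemble a job dict that mirrors the example configuration above."""
    reader = {
        "stepType": "ftp",
        "parameter": {
            "datasource": datasource,
            "path": list(paths),
            "column": list(columns),
            "fieldDelimiter": ",",
            "encoding": "UTF-8",
            "skipHeader": False,  # Assumption: boolean form of the default.
            "fileFormat": "csv",
        },
        "name": "Reader",
        "category": "reader",
    }
    writer = {"stepType": "stream", "parameter": {}, "name": "Writer", "category": "writer"}
    return {
        "type": "job",
        "version": "2.0",
        "steps": [reader, writer],
        "setting": {
            "errorLimit": {"record": "0"},
            "speed": {"throttle": False, "concurrent": concurrent},
        },
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


if __name__ == "__main__":
    # "my_ftp_connection" is a placeholder connection name.
    job = build_ftp_job("my_ftp_connection", ["/bazhen/"], [{"index": 0, "type": "string"}])
    print(json.dumps(job, indent=4))
```

Building the configuration as a dictionary avoids hand-written punctuation errors such as missing or trailing commas.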