FTP Reader provides the ability to read data from a remote FTP file system. At the implementation level, FTP Reader retrieves the remote FTP file data, converts it to the data synchronization transmission protocol, and passes it to Writer.
The content of a file to be read is, in a logical sense, a two-dimensional table, for example, text in CSV format.
FTP Reader provides the ability to read data from a remote FTP file and convert the data to the data synchronization protocol. A remote FTP file is itself unstructured data storage. For data synchronization, FTP Reader currently supports the following features:
Supports reading TXT files only, and the schema in the TXT file must be a two-dimensional table.
Supports CSV‑like format files with custom delimiters.
Supports reading multiple types of data (represented as strings) and supports column pruning and column constants.
Supports recursive reading and filtering by file name.
Supports text compression. The available compression formats include gzip, bzip2, zip, lzo, and lzo_deflate.
Supports concurrent reading of multiple files.
The following two features are not supported currently:
Multi-threaded concurrent reading of a single file. This requires an internal algorithm for splitting a single file, which is being planned.
Multi-threaded concurrent reading of a single compressed file. This is not technically feasible.
The remote FTP file itself does not provide data types, which are defined by DataX FtpReader:
| Internal DataX type | Data type of a remote FTP file |
Description: The name of the data source. It must be identical to the name of a data source that has already been added. Data sources can be added in script mode.
Default value: None
Description: The path of the remote FTP file system. Multiple paths can be specified.
If a single remote FTP file is specified, FTP Reader only supports single-threaded data extraction. We are planning to provide the ability to concurrently read a single non‑compressed file with multiple threads.
If multiple remote FTP files are specified, FTP Reader can extract data with multiple threads. The number of concurrent threads is determined by the number of channels.
If a wildcard is specified, FTP Reader attempts to traverse multiple files. For example, specify /* to read all files under the / directory, and specify /bazhen/* to read all files under the bazhen directory. Currently, FTP Reader supports only * as the file wildcard.
The data synchronization system treats all text files synchronized in a job as a single data table. You must make sure that all files conform to the same schema.
You must make sure that the files to be read are in a CSV-like format and that the data synchronization system has been granted read permission on them.
If no matching file exists in the location specified by path, an error may occur in the synchronization task.
Default value: None
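The wildcard rule above can be illustrated with a minimal Python sketch. It assumes shell-style matching of * through the standard fnmatch module. The helper name match_remote_paths and the sample listing are hypothetical, and FTP Reader's actual traversal logic may differ, for example in how deeply * descends into subdirectories.

```python
from fnmatch import fnmatchcase


def match_remote_paths(pattern, remote_files):
    """Return the files from a listing that a path pattern selects.

    Hypothetical helper, not FTP Reader's implementation. Only '*' is
    described as a supported wildcard, so other glob characters are rejected.
    """
    if any(ch in pattern for ch in "?["):
        raise ValueError("only '*' is supported as a wildcard")
    if "*" not in pattern:
        # A plain path selects at most that one file.
        return [f for f in remote_files if f == pattern]
    # Assumption: '*' matches any run of characters, including '/'.
    return sorted(f for f in remote_files if fnmatchcase(f, pattern))


# Sample listing (made up for illustration).
listing = ["/bazhen/a.csv", "/bazhen/b.csv", "/other/c.csv"]
selected = match_remote_paths("/bazhen/*", listing)
print(selected)
```

An empty result corresponds to the case in which no matching file exists, where the synchronization task may report an error.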
Description: The list of fields to read. type indicates the data type of the source data. index indicates which column of the source text the current column is read from (starting from 0). value indicates that the current column is a constant: it is not read from the source file but is generated automatically from the given value.
By default, you can read all data as the String type. The configuration is as follows:
You can configure the column field as follows:
"index": 0 //Read the int field from the first column of the remote FTP file text
"value": "alibaba" //FtpReader internally generates the alibaba string field as the current field
For each column you specify, type is required, and you must specify exactly one of index or value.
Default value: All data is read as the String type
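To make the index/value rule concrete, here is a minimal Python sketch of how a list of column specifications could be applied to the fields of one text line. The function name apply_columns and the sample data are hypothetical. Type conversion is left out, so every value stays a string, and the sketch is not FTP Reader's actual code.

```python
def apply_columns(fields, columns):
    """Build one output record from the split fields of a text line.

    Each spec must carry 'type' and exactly one of 'index' or 'value',
    mirroring the rule in the text. Values are kept as strings here.
    """
    record = []
    for spec in columns:
        if "type" not in spec:
            raise ValueError("type is required")
        if ("index" in spec) == ("value" in spec):
            raise ValueError("specify exactly one of index or value")
        if "index" in spec:
            # Read from the source column (0-based).
            record.append(fields[spec["index"]])
        else:
            # Constant column: generated, not read from the file.
            record.append(spec["value"])
    return record


# Illustrative specs: keep only source column 1, then append a constant.
columns = [
    {"type": "string", "index": 1},
    {"type": "string", "value": "alibaba"},
]
record = apply_columns(["1", "hangzhou", "2020"], columns)
print(record)
```

Listing only some index values is what the feature list calls column pruning: source columns that no specification refers to are simply dropped.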
Description: The delimiter used to separate the read fields.
Note: A field delimiter must be specified when FTP Reader reads data. If none is specified, a comma (,) is used by default, and the comma is also filled in by default in the interface configuration.
Default value: comma (,)
Description: Whether to skip the header of a CSV-like file when the header is a title row. Headers are not skipped by default. skipHeader is not supported for compressed files.
Default value: False
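The combined effect of fieldDelimiter and skipHeader can be sketched in a few lines of Python using the standard csv module. The function read_rows and the sample text are hypothetical. The csv module also applies quoting rules that FTP Reader may or may not share, so treat this as an approximation of the behavior.

```python
import csv
import io


def read_rows(text, field_delimiter=",", skip_header=False):
    """Split delimited text into rows of string fields.

    Hypothetical helper: the defaults follow the documented ones
    (comma delimiter, header not skipped).
    """
    reader = csv.reader(io.StringIO(text), delimiter=field_delimiter)
    rows = [row for row in reader if row]  # ignore empty lines
    return rows[1:] if skip_header else rows


# Sample text with a custom delimiter and a title row.
sample = "id|name\n1|alice\n2|bob\n"
rows = read_rows(sample, field_delimiter="|", skip_header=True)
print(rows)
```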
Description: The encoding of the files to be read.
Default value: utf-8
Description: Text files have no standard string for representing null (a null pointer). Data synchronization provides nullFormat to define which strings are to be interpreted as null.
For example, when nullFormat: "null" is configured, a source value of "null" is treated as a null field in data synchronization.
Default value: None
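A minimal Python sketch of the nullFormat rule follows. The function name apply_null_format is hypothetical, and exact, case-sensitive comparison is an assumption, since the text does not say how the match is performed.

```python
def apply_null_format(fields, null_format=None):
    """Replace fields equal to null_format with None.

    Assumption: the comparison is an exact, case-sensitive string match.
    With no null_format configured, fields are returned unchanged.
    """
    if null_format is None:
        return list(fields)
    return [None if field == null_format else field for field in fields]


converted = apply_null_format(["1", "null", "hangzhou"], null_format="null")
print(converted)
```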
Description: The name of the file that marks the data as “done”. Before data synchronization starts, the system checks whether MarkDoneFile exists. If it does not exist, the system waits for a while and checks again. If it exists, the data synchronization task starts.
Default value: None
Description: The maximum number of times to check for MarkDoneFile. The default value is 60: one check per minute, for 60 minutes in total.
Default value: 60
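The check-and-wait behavior described for MarkDoneFile amounts to a bounded polling loop, sketched below in Python. The function wait_for_done_file and its parameters are hypothetical. The exists callable stands in for whatever check is performed against the FTP server. What the task does when the file never appears is not stated here, so the sketch simply returns False.

```python
import time


def wait_for_done_file(exists, max_retries=60, interval_seconds=60,
                       sleep=time.sleep):
    """Poll until exists() returns True or the retry budget is used up.

    Hypothetical sketch. The defaults mirror the documented ones: up to
    60 checks, one per minute. Returns True if the done file was found.
    """
    for attempt in range(max_retries):
        if exists():
            return True
        if attempt < max_retries - 1:
            sleep(interval_seconds)  # wait before the next check
    return False


# Illustration with a fake check that succeeds on the third attempt
# and a no-op sleep, so the example finishes immediately.
state = {"checks": 0}

def fake_exists():
    state["checks"] += 1
    return state["checks"] >= 3

found = wait_for_done_file(fake_exists, max_retries=5,
                           interval_seconds=1, sleep=lambda seconds: None)
print(found)
```

Passing the sleep function as a parameter keeps the loop testable without real waiting.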
Data Sources: datasource in the preceding parameter description. Select the FTP data source.
File Path: path in the preceding parameter description.
version of the type: The type version of added file.
Column Delimiter: fieldDelimiter in the preceding parameter description, which defaults to “,”.
Encoding Format: encoding in the preceding parameter description, which defaults to utf-8.
Null Value: nullFormat in the preceding parameter description, which defines a string that represents the null value.
Compression Format: compress in the preceding parameter description, which defaults to “no compression”.
Whether the Header: skipHeader in the preceding parameter description, which defaults to “No”.
In Field Mapping, the column field information can be specified.
The following is a script configuration sample. For relevant parameters, see Parameter description.
"traceId": "ftp to stream job test",