This topic describes the data types and parameters supported by File Transfer Protocol (FTP) Reader and how to configure it by using the codeless user interface (UI) and code editor.

FTP Reader allows you to read data from a remote FTP server. FTP Reader connects to an FTP server, reads data from the server, converts the data to a format that is readable by Data Integration, and then sends the converted data to a writer.

FTP Reader can read only FTP files that store logical two-dimensional tables, for example, text information in CSV format.

FTP servers store unstructured data only. Currently, FTP Reader supports the following features:
  • Reads TXT files that store logical two-dimensional tables. FTP Reader can read only TXT files.
  • Reads data stored in formats similar to CSV with custom delimiters.
  • Reads data of various types as strings. Supports constants and column pruning.
  • Supports recursive reading and file name-based filtering.
  • Supports the following file compression formats: GZIP, BZIP2, ZIP, LZO, and LZO_DEFLATE.
  • Reads multiple files concurrently.
Currently, FTP Reader does not support the following features:
  • Using multiple concurrent threads to read a single uncompressed file.
  • Using multiple concurrent threads to read a single compressed file.
The data types of remote FTP files are defined by FTP Reader.
Data Integration data type | FTP file data type
LONG | LONG
DOUBLE | DOUBLE
STRING | STRING
BOOLEAN | BOOLEAN
DATE | DATE
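
As a minimal sketch of how these types are used, each entry of the column parameter (described under Parameters) names one of them in its type field. The lowercase spellings "double", "boolean", and "date" and the index values are assumptions for illustration; this topic itself shows only "long" and "string".

```json
"column": [
    { "type": "long",    "index": 0 }, // Column 0 of the file, read as LONG.
    { "type": "double",  "index": 1 }, // Assumed spelling of DOUBLE.
    { "type": "string",  "index": 2 }, // Column 2 of the file, read as STRING.
    { "type": "boolean", "index": 3 }, // Assumed spelling of BOOLEAN.
    { "type": "date",    "index": 4 }  // Assumed spelling of DATE.
]
```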

Parameters

Parameter Description Required Default value
datasource The connection name. It must be identical to the name of the added connection. You can add connections in the code editor. Yes None
path The path of the FTP file to read. You can specify multiple FTP file paths.
  • If you specify a single FTP file, FTP Reader uses only one thread to read the file. Concurrent multi-thread reading of a single uncompressed file is coming soon.
  • If you specify multiple FTP files, FTP Reader uses multiple threads to read these files. The actual number of threads is determined by the number of channels.
  • When a path contains a wildcard, FTP Reader attempts to read all files that match the path. If the path ends with a slash (/), FTP Reader reads all files in the specified directory. For example, if you specify the path as /bazhen/, FTP Reader reads all files in the bazhen directory. Currently, FTP Reader supports only asterisks (*) as file name wildcards. FTP Reader can flexibly generate node names based on custom parameters.
Note
  • We recommend that you do not use asterisks (*) because a wildcard that matches many files may cause an out of memory (OOM) error in the Java virtual machine (JVM).
  • Data Integration considers all the files on a sync node as a single table. Make sure that all the files on each sync node conform to the same schema and that Data Integration has permission to read all of them.
  • Make sure that the data format is similar to CSV.
  • An error occurs if no readable files exist in the specified path.
Yes None
column The columns to read. In each column entry, type specifies the data type of the source data, index specifies the position of the column in the source file, starting from 0, and value specifies the column value if the column is a constant column.
To read all data as strings, which is the default behavior, specify this parameter as "column":["*"]. You can also specify each column explicitly in the following way:
"column": [
  {
    "type": "long",
    "index": 0 // The first column of the source file, read as LONG.
  },
  {
    "type": "string",
    "value": "alibaba" // A constant column whose value is always "alibaba".
  }
]

For each column entry, you must specify type and either index or value.

Yes By default, FTP Reader reads all data as strings.
fieldDelimiter The column delimiter.
Note FTP Reader requires a column delimiter. The default delimiter is a comma (,), which is also the default setting on the codeless UI.
Yes ,
skipHeader Specifies whether to skip the header of a CSV-like file, if one exists. The skipHeader parameter is not supported for compressed files. No false
encoding The encoding format of the file to read. No utf-8
nullFormat The string that represents a null value. Text files have no standard string that represents null. Therefore, Data Integration provides the nullFormat parameter to define which string is interpreted as null.

For example, if you specify nullFormat:"null", Data Integration treats the string null in the source as a null value.

No None
markDoneFileName The name of the file used to indicate that the sync node can start. Data Integration checks whether the file exists before data synchronization. If the file does not exist, Data Integration checks again later. Data Integration starts the sync node only after the file is detected. No None
maxRetryTime The maximum number of checks for the file used to indicate that the sync node can start. By default, 60 checks are allowed. Data Integration checks for the file once every minute, so with the default value the checks last at most 60 minutes. No 60
csvReaderConfig The configurations for reading CSV files. The value must be of the MAP type. CSV files are read by a dedicated CSV reader that supports many configuration items. No None
fileFormat The format of the file saved by FTP Reader. By default, FTP Reader converts the data to a two-dimensional table and stores the table in a CSV file. If you specify binary as the file format, Data Integration converts data to the binary format for replication and transmission.

Generally, you need to specify this parameter only when you want to replicate the complete directory structure between storage systems such as FTP and Object Storage Service (OSS).

No None
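
The following fragment is a sketch of how several of the preceding parameters might be combined in one reader configuration. The connection name, the second path, and the marker file name are hypothetical. Writing maxRetryTime as a number is also an assumption, and this topic does not state whether markDoneFileName expects a bare file name or a full path.

```json
"parameter": {
    "datasource": "my_ftp_source",         // Hypothetical connection name; must match an added connection.
    "path": [
        "/bazhen/",                         // Trailing slash: read all files in the bazhen directory.
        "/exports/orders_*.txt"             // Hypothetical path that uses the asterisk wildcard.
    ],
    "column": [
        { "type": "long",   "index": 0 },        // First column of each file, read as LONG.
        { "type": "string", "index": 1 },        // Second column, read as STRING.
        { "type": "string", "value": "alibaba" } // Constant column.
    ],
    "fieldDelimiter": ",",                  // Column delimiter (the default).
    "encoding": "utf-8",                    // Encoding of the files (the default).
    "nullFormat": "null",                   // The string null in a file is treated as a null value.
    "markDoneFileName": "/exports/_DONE",   // Hypothetical marker file; the node starts only after it is detected.
    "maxRetryTime": 60                      // Up to 60 checks, one per minute (the default).
}
```

Because multiple files are matched here, the number of reading threads depends on the number of channels configured for the node, and all matched files must share the same schema.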

Configure FTP Reader by using the codeless UI

  1. Configure the connections.
    Configure the source and destination connections for the sync node.
    Parameter Description
    Connection The datasource parameter in the preceding parameter description. Select a connection type, and enter the name of a connection that has been configured in DataWorks.
    File Path The path parameter in the preceding parameter description.
    File Type The format of the file saved by FTP Reader. Default value: CSV.
    Field Delimiter The fieldDelimiter parameter in the preceding parameter description. The default delimiter is comma (,).
    Encoding The encoding parameter in the preceding parameter description. Default value: UTF-8.
    Null String The nullFormat parameter in the preceding parameter description, which defines a string that represents the null value.
    Compression Format The compression format. By default, files are not compressed.
    Include Header The skipHeader parameter in the preceding parameter description. Default value: No.
  2. Configure field mapping, that is, the column parameter in the preceding parameter description.
    Fields in the source table on the left have a one-to-one mapping with fields in the destination table on the right. You can click Add to add a field, or move the pointer over a field and click the Delete icon to delete the field.
    Parameter Description
    Map Fields with the Same Name Click Map Fields with the Same Name to establish a mapping between fields with the same name. Note that the data types of the fields must match.
    Map Fields in the Same Line Click Map Fields in the Same Line to establish a mapping between fields in the same row. Note that the data types of the fields must match.
    Delete All Mappings Click Delete All Mappings to remove mappings that have been established.
  3. Configure channel control policies.
    Parameter Description
    Expected Maximum Concurrency The maximum number of concurrent threads that the sync node uses to read data from the source and write data to the destination. You can configure the concurrency for a node on the codeless UI.
    Bandwidth Throttling Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and set a maximum transmission rate to avoid placing a heavy read load on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to a proper value.
    Dirty Data Records Allowed The maximum number of dirty data records allowed.
    Resource Group The resource group used for running the sync node. If a large number of nodes including this sync node are deployed on the default resource group, the sync node may need to wait for resources. We recommend that you purchase an exclusive resource group for data integration or add a custom resource group. For more information, see DataWorks exclusive resources and Add a custom resource group.
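
In the code editor, the channel control settings in step 3 correspond to the setting block of the job. The following sketch mirrors the setting block of the code editor example in this topic, with throttling enabled. The mbps key for the maximum transmission rate and all values shown are assumptions and are not confirmed by this topic.

```json
"setting": {
    "errorLimit": {
        "record": "0"        // Dirty Data Records Allowed: no dirty data records are allowed.
    },
    "speed": {
        "throttle": true,    // Bandwidth Throttling: enabled.
        "mbps": "1",         // Assumed key for the maximum transmission rate; the value is hypothetical.
        "concurrent": 2      // Expected Maximum Concurrency: at most two concurrent threads.
    }
}
```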

Configure FTP Reader by using the code editor

In the following code, a node is configured to read data from an FTP server.
{
    "type":"job",
    "version":"2.0",// The version number.
    "steps":[
        {
            "stepType":"ftp",// The reader type.
            "parameter":{
                "path":[], // The file path.
                "nullFormat":"", // The string that represents null.
                "compress":"", // The compression format.
                "datasource":"",// The connection name.
                "column":[ // The columns to be synchronized.
                    {
                        "index":0, // The ID of the column in the source table.
                        "type":"" // The data type.
                    }
                ],
                "skipHeader":"", // Specifies whether to skip the file header.
                "fieldDelimiter":",", // The column delimiter.
                "encoding":"UTF-8", // The encoding format.
                "fileFormat":"csv" // The format of the file saved by FTP Reader.
            },
            "name":"Reader",
            "category":"reader"
        },
        {// The following template is used to configure Stream Writer. For more information, see the corresponding topic.
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed.
        },
        "speed":{
            "throttle":false,// Specifies whether to enable bandwidth throttling. A value of false indicates that the bandwidth is not throttled. A value of true indicates that the bandwidth is throttled. The maximum transmission rate takes effect only if you set this parameter to true.
            "concurrent":1// The maximum number of concurrent threads.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
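
For a compressed source file, only the reader parameters change. The following fragment is a sketch that assumes the compress value for the GZIP format is written as "gzip" and uses a hypothetical connection name and path. The skipHeader parameter is left out because it is not supported for compressed files.

```json
"parameter": {
    "datasource": "my_ftp_source",      // Hypothetical connection name.
    "path": ["/bazhen/data.txt.gz"],     // Hypothetical path to a single compressed file, read by one thread.
    "compress": "gzip",                  // Assumed spelling of the GZIP compression format.
    "column": ["*"],                     // Read all columns as strings.
    "fieldDelimiter": ",",               // The column delimiter.
    "encoding": "UTF-8",                 // The encoding format.
    "fileFormat": "csv"                  // Same file format as in the preceding template.
}
```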