This topic describes the data types and parameters that are supported by FTP Reader and how to configure FTP Reader by using the codeless user interface (UI) and code editor.

Background information

FTP Reader connects to a remote FTP server, reads data from the server, converts the data to a format that is readable by Data Integration, and then sends the data to a writer.

FTP Reader can read only FTP files that store logical two-dimensional tables, such as CSV files that store text data.

The files on the FTP server store only unstructured data. FTP Reader provides the following features:
  • Reads data from TXT files. The data in the files must be logical two-dimensional tables.
  • Reads data from CSV-like files with custom delimiters.
  • Reads data of various types as strings and supports constants and column pruning.
  • Supports recursive read and file name-based filtering.
  • Supports file compression. The following compression formats are supported: GZIP, BZIP2, ZIP, LZO, and LZO_DEFLATE.
  • Uses parallel threads to read data from multiple files.
FTP Reader does not support the following features:
  • Using parallel threads to read data from a single file.
  • Using parallel threads to read data from a compressed file.

Data types

A remote FTP file does not distinguish between data types. The data types are defined by FTP Reader.
Data type in Data Integration | Data type in an FTP file
LONG | LONG
DOUBLE | DOUBLE
STRING | STRING
BOOLEAN | BOOLEAN
DATE | DATE

Parameters

Each entry lists the parameter name, its description, whether the parameter is required, and its default value.
datasource: The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. Required: Yes. Default value: N/A.
path: The path on the FTP server from which you want to read data. The path is a full path that contains the directory path and the file name with its suffix. You can specify multiple paths.
  • If you specify only one path, FTP Reader uses only one thread to read the file. Support for using parallel threads to read data from a single uncompressed file will be available in the future.
  • If you specify multiple paths, FTP Reader uses parallel threads to read the files. The actual number of threads is determined by the number of channels.
  • If a path contains a wildcard, FTP Reader attempts to read all files that match the path. Only the asterisk (*) is supported as a wildcard. If a path ends with a forward slash (/), FTP Reader reads data from all files in that directory. For example, if you specify the /bazhen/ path, FTP Reader reads data from all the files in the /bazhen directory. FTP Reader can flexibly generate node names based on custom parameters.
Note
  • We recommend that you do not use asterisks (*) because an out of memory (OOM) error may occur on a Java Virtual Machine (JVM).
  • Data Integration considers all text files in a sync node as a single table. Make sure that all files in a sync node use the same schema.
  • Make sure that the data format is similar to CSV and readable by Data Integration.
  • If no readable files exist in the specified path, FTP Reader reports an error.
Required: Yes. Default value: N/A.
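As an illustration of the path rules above, the following fragment is a minimal sketch of a path setting that reads two explicitly named files and every file in one directory. The /data/orders file paths are hypothetical, and /bazhen/ is the directory from the preceding example. Because more than one path is specified, FTP Reader can read the files in parallel, up to the number of channels.

```json
"parameter": {
    "path": [
        "/data/orders/orders_01.csv", // Hypothetical file path.
        "/data/orders/orders_02.csv", // Hypothetical file path.
        "/bazhen/"                    // All files in the /bazhen directory.
    ]
}
```

All files that match these paths are treated as a single table, so they need to use the same schema.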
column: The columns from which you want to read data. The type parameter specifies the data type of a column. The index parameter specifies the position of the column in the source file, starting from 0. The value parameter specifies the column value if the column is a constant column.
By default, FTP Reader reads all data as strings. In this case, set this parameter to an asterisk (*), such as "column":["*"]. You can also set the column parameter in the following format:
"column": [
    {
        "type": "long",
        "index": 0    // Reads the first column of the source file as a LONG value.
    },
    {
        "type": "string",
        "value": "alibaba"  // A constant column. In this code, the value is the constant "alibaba".
    }
]

In the column parameter, you must specify the type parameter and specify either the index or value parameter.

Required: Yes. Default value: *
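The following sketch shows how column pruning and a constant column might be combined. It assumes a hypothetical source file whose lines look like 1001,Alice,3.5. The first and third fields are read, the field at index 1 is not listed and is therefore skipped, and a constant column is appended. The sample data and the use of the double type for the third field are assumptions for illustration.

```json
"column": [
    { "type": "long",   "index": 0 },         // First field, for example 1001.
    { "type": "double", "index": 2 },         // Third field, for example 3.5.
    { "type": "string", "value": "alibaba" }  // Constant column.
]
```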
fieldDelimiter: The column delimiter that is used in the file from which you want to read data.
Note FTP Reader requires a column delimiter. If you do not specify one, the default delimiter, a comma (,), is used.
Required: Yes. Default value: , (comma)
skipHeader: Specifies whether to skip the headers in a CSV-like file if the file contains headers. This parameter is unavailable for compressed files. A value of false indicates that FTP Reader does not skip the headers. Required: No. Default value: false.
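For example, a pipe-delimited file that contains a header row might be configured as in the following sketch. The pipe delimiter is a hypothetical choice, and writing the skipHeader value as a Boolean is an assumption based on the default value false.

```json
"parameter": {
    "fieldDelimiter": "|", // Hypothetical custom delimiter.
    "skipHeader": true     // Skips the header row. Assumed to be a Boolean value.
}
```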
encoding: The encoding format of the files from which you want to read data. Required: No. Default value: utf-8.
nullFormat: The string that represents a null pointer. TXT files have no standard string that represents a null pointer, so you can use this parameter to define one.

For example, if you specify nullFormat:"null", FTP Reader treats the string null in the source file as a null pointer.

Required: No. Default value: N/A.
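A minimal sketch of this setting follows. The sample line in the comment is hypothetical.

```json
"parameter": {
    "nullFormat": "null" // In a source line such as 1001,null,3.5, the second field is treated as a null pointer.
}
```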
markDoneFileName: The name of the file that is used to indicate that the sync node can start. Data Integration checks whether the file exists before data synchronization. If the file does not exist, Data Integration checks again later. Data Integration starts the sync node only after the file is detected. Required: No. Default value: N/A.
maxRetryTime: The maximum number of times that Data Integration checks for the file if the file is not detected. Data Integration checks for the file once every minute. With the default value of 60, the checks last up to 60 minutes. Required: No. Default value: 60.
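The following sketch combines the markDoneFileName and maxRetryTime parameters. The file name is hypothetical, and this topic does not state whether the value needs a directory prefix. Writing maxRetryTime as a number is also an assumption. Because the file is checked once every minute, a value of 30 limits the wait to about 30 minutes.

```json
"parameter": {
    "markDoneFileName": "orders.done", // Hypothetical marker file name.
    "maxRetryTime": 30                 // Up to 30 checks, one check per minute.
}
```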
csvReaderConfig: The configurations that are used to read CSV files. The parameter value must be of the MAP type. FTP Reader uses a CSV file reader to read data from CSV files, and the CSV file reader supports multiple configuration items. If you do not configure this parameter, the default settings are used. Required: No. Default value: N/A.
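This topic does not list the supported configuration items. As a hedged illustration, the following sketch uses item names from the CSV reader settings that are documented for the open-source DataX file readers. Treat the names, values, and comments as assumptions and verify them in your environment before you use them.

```json
"parameter": {
    "csvReaderConfig": {
        "safetySwitch": false,     // Assumed item: whether to enforce a length limit on a single column.
        "skipEmptyRecords": false, // Assumed item: whether to skip empty records.
        "useTextQualifier": false  // Assumed item: whether to treat quotation marks as text qualifiers.
    }
}
```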
fileFormat: The format of the file. By default, FTP Reader reads data from CSV files. The data in CSV files must be logical two-dimensional tables. If you specify binary as the file format, data is converted to the binary format for replication and transmission.

You can specify binary only when you want to replicate the complete directory structure between storage systems such as FTP and Object Storage Service (OSS).

Required: No. Default value: N/A.

Configure FTP Reader by using the codeless UI

  1. Configure the source and destination.
    Set parameters in the Source and Target sections for the sync node.
    Parameter | Description
    Connection | The name of the data source from which you want to read data. This parameter is equivalent to the datasource parameter that is described in the preceding section.
    File Path | The path on the FTP server from which you want to read data. The path is a full path that contains the directory path and the file name with its suffix. This parameter is equivalent to the path parameter that is described in the preceding section.
    File Type | The format of the file that you want to read from the FTP server. The default format is CSV.
    Field Delimiter | The column delimiter. This parameter is equivalent to the fieldDelimiter parameter that is described in the preceding section. By default, a comma (,) is used as the column delimiter.
    Encoding | The encoding format. This parameter is equivalent to the encoding parameter that is described in the preceding section. Default value: UTF-8.
    Null String | The string that represents a null pointer. This parameter is equivalent to the nullFormat parameter that is described in the preceding section.
    Compression Format | The format in which files are compressed. By default, files are not compressed.
    Skip Header | Specifies whether to skip the headers in the file. This parameter is equivalent to the skipHeader parameter that is described in the preceding section. Default value: No.
  2. Configure field mappings. This operation is equivalent to setting the column parameter that is described in the preceding section.
    Fields in the source on the left have a one-to-one mapping with fields in the destination on the right. You can click Add to add a field. To remove an added field, move the pointer over the field and click the Remove icon.
    Operation | Description
    Map Fields with the Same Name | Click Map Fields with the Same Name to establish mappings between fields with the same name. The data types of the fields must match.
    Map Fields in the Same Line | Click Map Fields in the Same Line to establish mappings between fields in the same row. The data types of the fields must match.
    Delete All Mappings | Click Delete All Mappings to remove the mappings that are established.
  3. Configure channel control policies.
    Parameter | Description
    Expected Maximum Concurrency | The maximum number of parallel threads that the sync node uses to read data from the source or write data to the destination. You can configure the parallelism for the sync node on the codeless UI.
    Bandwidth Throttling | Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.
    Dirty Data Records Allowed | The maximum number of dirty data records allowed.
    Distributed Execution | Specifies whether to enable the distributed execution mode. In this mode, your node is split into pieces that are distributed to multiple Elastic Compute Service (ECS) instances for parallel execution, which speeds up synchronization. If you use a large number of parallel threads to run your synchronization node in distributed execution mode, excessive access requests are sent to the data sources. Therefore, before you use the distributed execution mode, you must evaluate the access load on the data sources. You can enable this mode only if you use an exclusive resource group for Data Integration. For more information about exclusive resource groups for Data Integration, see Exclusive resource groups for Data Integration and Create and use an exclusive resource group for Data Integration.

Configure FTP Reader by using the code editor

In the following code, a sync node is configured to read data from an FTP server. For more information about how to configure a sync node by using the code editor, see Create a sync node by using the code editor.
{
    "type":"job",
    "version":"2.0", // The version number. 
    "steps":[
        {
            "stepType":"ftp",// The reader type. 
            "parameter":{
                "path":[],// The file path. 
                "nullFormat":"",// The string that represents a null pointer. 
                "compress":"",// The format in which files are compressed. 
                "datasource":"", // The name of the data source. 
                "column":[// The columns from which you want to read data. 
                    {
                        "index":0,// The ID of the column from which you want to read data. 
                        "type":""// The data type. 
                    }
                ],
                "skipHeader":"",// Specifies whether to skip the headers in the file. 
                "fieldDelimiter":",", // The column delimiter. 
                "encoding":"UTF-8", // The encoding format. 
                "fileFormat":"csv"// The format of the file. 
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0" // The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true, // Specifies whether to enable bandwidth throttling. A value of false indicates that bandwidth throttling is disabled, and a value of true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
            "concurrent":1, // The maximum number of parallel threads.
            "mbps":"12" // The maximum transmission rate.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
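
The preceding code leaves several reader values empty. As a hedged illustration, the following sketch shows how the reader step might look after the values are filled in. The data source name and file path are hypothetical, and the Boolean form of the skipHeader value is an assumption based on its default value false.

```json
{
    "stepType": "ftp",
    "parameter": {
        "datasource": "my_ftp_source",       // Hypothetical data source name.
        "path": ["/data/orders/orders.csv"], // Hypothetical file path.
        "column": [
            { "index": 0, "type": "long" },
            { "index": 1, "type": "string" },
            { "type": "string", "value": "alibaba" } // Constant column.
        ],
        "fieldDelimiter": ",",
        "encoding": "UTF-8",
        "skipHeader": false, // Assumed to be a Boolean value.
        "nullFormat": "null",
        "fileFormat": "csv"
    },
    "name": "Reader",
    "category": "reader"
}
```

The name value Reader must stay consistent with the from value in the hops section so that the reader remains linked to the writer.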