The FTP data source provides a bidirectional channel for reading data from and writing data to FTP servers. This topic describes the data synchronization capabilities of the FTP data source in DataWorks.
Limitations
FTP Reader
FTP Reader reads data from remote FTP files. Because remote FTP files are inherently unstructured, the following capabilities and limitations apply.
| Capability | Supported |
|---|---|
| Read TXT files (two-dimensional table schema) | Yes |
| CSV-like files with custom separators | Yes |
| Multiple data types (represented as STRING), column pruning, and column constants | Yes |
| Recursive reads and file name filtering | Yes |
| Text compression (gzip, bzip2, zip, lzo, lzo_deflate) | Yes |
| Concurrent reads from multiple files | Yes |
| Multi-threaded concurrent reads from a single file | No |
| Multi-threaded concurrent reads from a single compressed file | No |
FTP Writer
FTP Writer converts data and writes it to FTP files. Because FTP files are inherently unstructured, the following capabilities and limitations apply.
| Capability | Supported |
|---|---|
| Write text files with two-dimensional table schema | Yes |
| CSV-like and TEXT file formats with custom separators | Yes |
| Multi-threaded writes (each thread writes to a different sub-file) | Yes |
| Concurrent writes to a single file | No |
| Native data types (all data is written as STRING) | No |
| Text compression when writing | No |
Supported field types
Remote FTP files have no native data types. The following field types are defined by DataX FtpReader.
| DataX internal type | Remote FTP file data type |
|---|---|
| LONG | LONG |
| DOUBLE | DOUBLE |
| STRING | STRING |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
Add a data source
Before developing a synchronization task in DataWorks, add the FTP data source by following the instructions in Data source management. Parameter descriptions are available in the DataWorks console when you add a data source.
Develop a data synchronization task
For configuration entry points and procedures, see the following guides.
Single-table offline synchronization task
For a script demo and parameter descriptions for the code editor, see Appendix: Script demo and parameter description.
Appendix: Script demo and parameter description
Configure a batch synchronization task in the code editor
To configure a batch synchronization task in the code editor, set the parameters in the script according to the unified script format. For more information, see Configure a task in the code editor.
Reader script demo
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "ftp",
"parameter": {
"path": [],
"nullFormat": "",
"compress": "",
"datasource": "",
"column": [
{
"index": 0,
"type": ""
}
],
"skipHeader": "",
"fieldDelimiter": ",",
"encoding": "UTF-8",
"fileFormat": "csv"
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "stream",
"parameter": {},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle": true,
"concurrent": 1,
"mbps": "12"
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Reader script parameters
| Parameter | Description | Required | Default value |
|---|---|---|---|
| datasource | The data source name. Must match the name of the added data source. | Yes | None |
| path | The path and name of the file in the remote FTP file system. Specify the full path including the file name and extension. Supports multiple paths. See Path parameter behavior. | Yes | None |
| column | The list of fields to read. Each entry requires type and either index or value. See Column parameter format. | Yes | None |
| fieldDelimiter | The field separator for reading data. | Yes | , |
| skipHeader | Specifies whether to skip the header row in CSV-like files. Not supported for compressed files. | No | false |
| encoding | The encoding format of the files to read. | No | utf-8 |
| nullFormat | The string representation of null values. If not set, source data is written as-is without conversion. | No | None |
| markDoneFileName | The name of the mark file. The synchronization task waits for this file to appear before starting. | No | None |
| maxRetryTime | The number of retries when checking for the mark file. Retry interval is 1 minute; total wait time is 60 minutes. | No | 60 |
| csvReaderConfig | Configuration options for reading CSV files (Map type). Uses default values if not set. | No | None |
| fileFormat | The file type to read. By default, files are read as CSV files and parsed into a two-dimensional table. Set to binary for peer-to-peer binary copy between storage systems such as FTP and Object Storage Service (OSS). | No | None |
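The way markDoneFileName and maxRetryTime work together can be pictured as a bounded polling loop. The following Python sketch is an illustration only and is not the DataWorks implementation: wait_for_mark_file, exists, and sleep are hypothetical names, and the 60-second interval and 60 retries are taken from the table above.

```python
import time
from typing import Callable


def wait_for_mark_file(
    exists: Callable[[], bool],
    max_retry_time: int = 60,
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the mark file appears or the retries are used up.

    `exists` is a hypothetical callback that returns True once the mark
    file is visible on the FTP server. Returns True if the file was found.
    """
    for _ in range(max_retry_time):
        if exists():
            return True
        # Assumption: one fixed interval is waited after each failed check,
        # so 60 retries at 60 seconds add up to the documented 60 minutes.
        sleep(interval_seconds)
    return False
```

Passing sleep as a parameter keeps the sketch easy to try out without actually waiting, for example by supplying a function that only records the requested delays.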
Path parameter behavior
The behavior of path depends on what you specify:
| Path type | Behavior |
|---|---|
| Single file | FTP Reader uses a single thread to extract data. |
| Multiple files | FTP Reader uses multi-threaded extraction. The number of threads equals the number of channels. |
| Wildcard (*) | FTP Reader traverses the directory and reads all matching files. For example, /* reads all files in the root directory; /bazhen/* reads all files in the bazhen directory. FTP Reader supports only the asterisk (*) as a file wildcard character. |
Avoid using the * wildcard, as it may cause a Java Virtual Machine (JVM) memory overflow error.
Use scheduling parameters to configure file names and paths dynamically.
Additional constraints:
All text files in a job are treated as a single data table. Make sure all files conform to the same schema.
All files to read must be in CSV-like format and readable by the data synchronization system.
If no files match the specified path, the synchronization task reports an error.
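To make the path rules above concrete, the following Python sketch selects files from a listing by using the standard fnmatch module. It is an approximation under stated assumptions, not FTP Reader's own matching code: resolve_paths and the sample file names are hypothetical, and fnmatch also treats ? and [] as special characters and lets * match across /, which may differ from FTP Reader.

```python
from fnmatch import fnmatchcase
from typing import List


def resolve_paths(patterns: List[str], remote_files: List[str]) -> List[str]:
    """Return the remote files selected by the configured path patterns.

    Raises ValueError when nothing matches, mirroring the documented
    behavior that the task reports an error if no files match the path.
    """
    matched = [
        name
        for name in remote_files
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]
    if not matched:
        raise ValueError(f"no files match {patterns!r}")
    return matched


# Hypothetical directory listing used only for illustration.
listing = ["/bazhen/a.csv", "/bazhen/b.csv", "/other/c.csv"]
print(resolve_paths(["/bazhen/*"], listing))
```

Every file that is returned is read into the same logical table, so a quick check like this can help confirm that a pattern does not pull in files with a different schema.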
Column parameter format
To read all columns as STRING, set "column": ["*"].
To specify individual fields:
{
"type": "long",
"index": 0
},
{
"type": "string",
"value": "alibaba"
}
type is required for each entry.
Specify either index (column position, starting from 0) or value (a constant injected by FTP Reader without reading from the source file).
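The effect of index and value entries can be sketched in Python as follows. This is a simplified model rather than FTP Reader's actual conversion logic: project_row and CONVERTERS are hypothetical names, and only a few of the field types are covered.

```python
from typing import Any, Dict, List

# Hypothetical converters for a few of the supported field types.
CONVERTERS = {
    "long": int,
    "double": float,
    "string": str,
    "boolean": lambda text: text.strip().lower() == "true",
}


def project_row(fields: List[str], columns: List[Dict[str, Any]]) -> List[Any]:
    """Build one output record from a split source line and a column spec."""
    record = []
    for column in columns:
        convert = CONVERTERS[column["type"].lower()]
        if "index" in column:
            # Read the field at the given zero-based position.
            raw = fields[column["index"]]
        else:
            # Constant column: the value never comes from the source file.
            raw = column["value"]
        record.append(convert(raw))
    return record


columns = [
    {"type": "long", "index": 0},
    {"type": "string", "value": "alibaba"},
]
print(project_row(["42", "ignored", "also ignored"], columns))  # prints [42, 'alibaba']
```

Source fields that no entry references are simply dropped, which is how column pruning falls out of the same configuration.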
Writer script demo
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "stream",
"parameter": {},
"name": "Reader",
"category": "reader"
},
{
"stepType": "ftp",
"parameter": {
"path": "",
"fileName": "",
"nullFormat": "null",
"dateFormat": "yyyy-MM-dd HH:mm:ss",
"datasource": "",
"writeMode": "",
"fieldDelimiter": ",",
"encoding": "",
"fileFormat": ""
},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle": true,
"concurrent": 1,
"mbps": "12"
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Writer script parameters
| Parameter | Description | Required | Default value |
|---|---|---|---|
| datasource | The data source name. Must match the name of the added data source. | Yes | None |
| path | The directory path in the FTP file system. FTP Writer writes multiple sub-files to this directory. | Yes | None |
| fileName | The base file name. A random suffix is appended per write thread by default to avoid conflicts. | Yes | None |
| singleFileOutput | Set to true to suppress the random suffix and write to the exact file name specified in fileName. | No | false |
| writeMode | The cleanup behavior before writing. Valid values: truncate, append, nonConflict. See WriteMode values. | Yes | None |
| fieldDelimiter | The field separator for writing data. Must be a single character. | Yes | None |
| timeout | The connection timeout for connecting to the FTP server. Unit: milliseconds. | No | 60000 (1 minute) |
| skipHeader | Specifies whether to skip the header row. Not supported for compressed files. | No | false |
| compress | The compression format for writing. Valid values: gzip, bzip2. | No | No compression |
| encoding | The encoding format for writing. | No | utf-8 |
| nullFormat | The string representation for null values. For example, setting nullFormat="null" serializes null pointers as the literal string null. | No | None |
| dateFormat | The format for serializing DATE-type data. Example: "yyyy-MM-dd". | No | None |
| fileFormat | The format for writing files. Valid values: CSV, TEXT. CSV is a strict CSV format that escapes the column delimiter using double quotation marks ("). TEXT uses simple delimiter separation without escaping. | No | TEXT |
| header | The header row to write as the first row of the output file. Example: ["id", "name", "age"]. | No | None |
| markDoneFileName | The absolute path of the mark file generated after the synchronization task completes. In auto-triggered tasks, include scheduling parameters in the file name. Example: /user/ftp/markDone_${bizdate}.txt. | No | None |
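The difference between the CSV and TEXT values of fileFormat, together with nullFormat, is easiest to see on a record whose field contains the delimiter. The following Python sketch models the described behavior with the standard csv module. It is not FTP Writer's serializer: serialize_row is a hypothetical helper, and the default null_format of "null" is borrowed from the writer script demo.

```python
import csv
import io
from typing import List, Optional


def serialize_row(
    row: List[Optional[str]],
    file_format: str = "text",
    field_delimiter: str = ",",
    null_format: str = "null",
) -> str:
    """Serialize one record in the spirit of the two fileFormat values."""
    # Replace missing values with the configured null representation.
    fields = [null_format if value is None else value for value in row]
    if file_format.lower() == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, delimiter=field_delimiter, lineterminator="\n")
        writer.writerow(fields)
        # Drop the line terminator that csv.writer appends.
        return buffer.getvalue()[:-1]
    # TEXT: plain join, no quoting or escaping.
    return field_delimiter.join(fields)


row = ["1", "a,b", None]
print(serialize_row(row, "csv"))   # prints 1,"a,b",null
print(serialize_row(row, "text"))  # prints 1,a,b,null
```

In the TEXT output the embedded comma is indistinguishable from a separator, so a delimiter that cannot occur in the data is the safer choice when TEXT is used.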
WriteMode values
| Value | Behavior |
|---|---|
| truncate | Clears existing files before writing. If singleFileOutput is true, clears files with the same name. If false, clears all files with the fileName prefix. |
| append | Writes without pre-processing. Data Integration FTP Writer ensures file names do not conflict. |
| nonConflict | Reports an error if a file with the fileName prefix already exists. |
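The three writeMode values can be compared by simulating the cleanup step on a directory listing. The following Python sketch follows the table above literally and is an assumption-laden model, not FTP Writer's code: files_after_cleanup and the sample file names are hypothetical.

```python
from typing import List


def files_after_cleanup(
    existing: List[str],
    file_name: str,
    write_mode: str,
    single_file_output: bool = False,
) -> List[str]:
    """Return the files left in the target directory before writing starts."""
    prefix_matches = [name for name in existing if name.startswith(file_name)]
    exact_matches = [name for name in existing if name == file_name]

    if write_mode == "truncate":
        # Exact name only when singleFileOutput is true, otherwise the prefix.
        doomed = exact_matches if single_file_output else prefix_matches
        return [name for name in existing if name not in doomed]
    if write_mode == "append":
        # No pre-processing: existing files are left untouched.
        return list(existing)
    if write_mode == "nonConflict":
        if prefix_matches:
            raise FileExistsError(f"conflicting files: {prefix_matches}")
        return list(existing)
    raise ValueError(f"unknown writeMode: {write_mode!r}")


# Hypothetical directory contents used only for illustration.
listing = ["out", "out_ab12", "other.txt"]
print(files_after_cleanup(listing, "out", "truncate"))        # prints ['other.txt']
print(files_after_cleanup(listing, "out", "truncate", True))  # prints ['out_ab12', 'other.txt']
```

Because truncate matches by prefix when singleFileOutput is false, a short fileName can remove unrelated files that happen to share the same leading characters.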