DataWorks:FTP data source

Last Updated: Dec 29, 2023

DataWorks provides FTP Reader and FTP Writer for you to read data from and write data to FTP data sources. This topic describes the capabilities of synchronizing data from or to FTP data sources.

Limits

FTP Reader connects to a remote FTP server, reads data from the server, and converts the data into a format that Data Integration can read. The files on the FTP server store only unstructured data. The following lists show the features that FTP Reader supports and does not support:

Supported:

  • Reads data from TXT files. The data in the files must be logical two-dimensional tables.

  • Reads data from CSV-like files with custom delimiters.

  • Reads data of various types as strings and supports constants and column pruning.

  • Supports recursive data read and file name-based filtering.

  • Supports file compression. The following compression formats are supported: GZIP, BZIP2, ZIP, LZO, and LZO_DEFLATE.

  • Uses parallel threads to read data from multiple files.

Not supported:

  • Uses parallel threads to read a single file.

  • Uses parallel threads to read a compressed file.

FTP Writer converts the data that is obtained from a reader to files and writes the files to an FTP server. The files on the FTP server store only unstructured data. The following lists show the features that FTP Writer supports and does not support:

Supported:

  • Writes only text files to an FTP server. The data in the files must be organized as logical two-dimensional tables. FTP Writer cannot write files that store binary large object (BLOB) data, such as video data, to an FTP server.

  • Writes TXT and CSV-like files that contain custom delimiters to an FTP server.

  • Writes uncompressed files to an FTP server.

  • Uses parallel threads to write files to an FTP server. Each thread writes a file.

Not supported:

  • Uses parallel threads to write a single file to an FTP server.

  • Distinguishes between data types. FTP does not distinguish between data types. Therefore, FTP Writer writes all data as strings to files on an FTP server.

Data type mappings

A remote FTP file does not distinguish between data types. The data types are defined by FTP Reader.

Data Integration data type    Data type in an FTP file

LONG                          LONG
DOUBLE                        DOUBLE
STRING                        STRING
BOOLEAN                       BOOLEAN
DATE                          DATE

Develop a data synchronization task

The following sections describe where to start and how to configure a data synchronization task. For details about a parameter, view its infotip on the configuration tab of the task.

Add a data source

Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.

Configure a batch synchronization task to synchronize data of a single table

Appendix: Code and parameters

Appendix: Configure a batch synchronization task by using the code editor

If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.

Code for FTP Reader

{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        {
            "stepType":"ftp",// The plug-in name. 
            "parameter":{
                "path":[],// The file path. 
                "nullFormat":"",// The string that represents a null pointer. 
                "compress":"",// The format in which files are compressed. 
                "datasource":"",// The name of the data source. 
                "column":[// The names of the columns. 
                    {
                        "index":0,// The ID of the column. 
                        "type":""// The data type. 
                    }
                ],
                "skipHeader":"",// Specifies whether to skip the headers in the file if the file has headers. 
                "fieldDelimiter":",",// The column delimiter. 
                "encoding":"UTF-8",// The encoding format. 
                "fileFormat":"csv"// The format of the file. 
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1, // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate. Unit: MB/s. 
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Parameters in code for FTP Reader

Parameter

Description

Required

Default value

datasource

The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.

Yes

No default value

path

The path on the FTP server from which you want to read data. Specify a full path that includes the directory that stores the file and the file name with its suffix. You can specify multiple paths.

  • If you specify only one path, FTP Reader uses only one thread to read the related file. The feature of using parallel threads to read data from a single uncompressed file will be available in the future.

  • If you specify multiple paths, FTP Reader uses parallel threads to read the related files. The actual number of threads is determined by the number of channels.

  • If a path contains a wildcard, FTP Reader attempts to read all files that match the path. For example, if you specify the path /bazhen/*, FTP Reader reads all files in the bazhen directory. FTP Reader supports only asterisks (*) as wildcards. You can also use scheduling parameters in the file name and file path.

Note
  • We recommend that you do not use asterisks (*) because an out of memory (OOM) error may occur on a Java Virtual Machine (JVM).

  • Data Integration considers all text files in a synchronization task as a single table. Make sure that all files in a synchronization task use the same schema.

  • Make sure that the data format is similar to CSV and readable to Data Integration.

  • If no readable files exist in the specified path, FTP Reader reports an error.
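
As a rough illustration of the path rules above, the following fragment lists two explicit files and one wildcard path. The directory and file names are hypothetical, and ${bizdate} is assumed to be a scheduling parameter that is defined for the task.

```json
"path":[
    "/data/orders/part-0.csv",   // Hypothetical file. 
    "/data/orders/part-1.csv",   // A second path, so parallel threads can be used, limited by the number of channels. 
    "/data/logs/${bizdate}/*"    // Hypothetical wildcard path. Matches all files in the dated directory. 
]
```

All matched files are treated as a single table, so they are expected to share the same schema. Because asterisks are not recommended, an explicit list of files is the safer choice when the file names are known in advance.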

Yes

No default value

column

The columns from which you want to read data. The type parameter specifies the data type of a column. The index parameter specifies the ID of a column in the source table, starting from 0. The value parameter specifies the column value if the column is a constant column.

By default, FTP Reader reads all data as strings. In this case, set this parameter to "column":["*"]. You can also configure the column parameter in the following format:

"column": [
    {
        "type": "long",
        "index": 0    // The first column of the file from which you want to read data, read as an integer. 
    },
    {
        "type": "string",
        "value": "alibaba"  // The value of the current column. In this code, the value is the constant alibaba. 
    }
]

In the column parameter, you must configure the type parameter and configure one of the index and value parameters.

Yes

No default value

fieldDelimiter

The column delimiter that is used in the file from which you want to read data.

Note

FTP Reader requires a column delimiter. If you do not specify one, the default delimiter, which is a comma (,), is used.

Yes

,

skipHeader

Specifies whether to skip the headers in a CSV-like file if the file has headers. The skipHeader parameter is not supported for compressed files. The default value of this parameter is false, which indicates that FTP Reader does not skip the headers in CSV-like files.

No

false

encoding

The encoding format of the files from which you want to read data.

No

utf-8

nullFormat

The string that represents a null value. TXT files have no standard string for null, so you can use this parameter to define one. Examples:

  • If you specify nullFormat:"null", which is a printable string, and a source field contains the string null, FTP Reader treats the field as null.

  • If you specify nullFormat:"\u0001", which is a non-printable character, and a source field contains the string \u0001, FTP Reader treats the field as null.

  • If you do not configure the nullFormat parameter, FTP Reader does not convert source data.

No

No default value

markDoneFileName

The name of the file that is used to indicate that the synchronization task can start. Data Integration checks whether the file exists before data synchronization. If the file does not exist, Data Integration checks again later. Data Integration starts the synchronization task only after the file is detected.

No

No default value

maxRetryTime

The maximum number of times that Data Integration checks for the file if the file is not detected. By default, a maximum of 60 retries are allowed. Data Integration checks for the file once every minute, so the whole process lasts for up to 60 minutes.
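
A minimal sketch of how the two parameters might be combined is shown below. The marker file path is hypothetical. This topic does not state whether FTP Reader expects a bare file name or a full path, or whether maxRetryTime is written as a number, so treat both as assumptions.

```json
"markDoneFileName":"/data/orders/_DONE",  // Hypothetical marker file. The task starts only after this file is detected. 
"maxRetryTime":30                         // Assumed numeric value. One check per minute, so the wait lasts up to about 30 minutes. 
```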

No

60

csvReaderConfig

The configurations required to read CSV files. The parameter value must match the MAP type. You can use a CSV file reader to read data from CSV files. The CSV file reader supports multiple configurations.
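
This topic does not list the supported keys. As a hedged sketch, the following fragment uses the three keys that the open source DataX TxtFileReader documents for its CSV reader. Whether FTP Reader accepts exactly the same keys is an assumption that you should verify before use.

```json
"csvReaderConfig":{
    "safetySwitch":false,      // Assumed key: turns off the length limit on a single field. 
    "skipEmptyRecords":false,  // Assumed key: keeps empty lines instead of skipping them. 
    "useTextQualifier":false   // Assumed key: does not treat quotation marks as text qualifiers. 
}
```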

No

No default value

fileFormat

The format of the file. By default, FTP Reader reads data from CSV files. The data in CSV files must be logical two-dimensional tables. If you specify binary as the file format, data is converted into the binary format for replication and transmission.

You need to configure this parameter only when you want to replicate the complete directory structure between storage systems such as FTP and Object Storage Service (OSS).

No

No default value
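
Putting the reader parameters together, a filled-in parameter block might look like the following sketch. The data source name, file path, and constant value are hypothetical. Writing skipHeader as a JSON boolean is an assumption, not something that this topic states.

```json
"parameter":{
    "datasource":"my_ftp_source",         // Hypothetical name. Must match a data source that is added to DataWorks. 
    "path":["/data/users/users.csv"],     // Hypothetical file. A single path is read by one thread. 
    "column":[
        {"index":0,"type":"long"},        // The first column of the file. 
        {"index":1,"type":"string"},      // The second column of the file. 
        {"type":"string","value":"ftp"}   // A constant column. Every record gets the value ftp. 
    ],
    "fieldDelimiter":",",
    "skipHeader":true,                    // Assumes that the file has a header row and is not compressed. 
    "encoding":"UTF-8",
    "nullFormat":"null",                  // Assumed behavior: a field that contains the string null is read as null. 
    "fileFormat":"csv"
}
```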

Code for FTP Writer

{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        { 
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"ftp",// The plug-in name. 
            "parameter":{
                "path":"",// The directory on the FTP server to which you want to write files. 
                "fileName":"",// The name prefix of the files that you want to write to the FTP server. 
                "nullFormat":"null",// The string that represents a null pointer. 
                "dateFormat":"yyyy-MM-dd HH:mm:ss",// The time format. 
                "datasource":"",// The name of the data source. 
                "writeMode":"",// The write mode. 
                "fieldDelimiter":",",// The column delimiter. 
                "encoding":"",// The encoding format. 
                "fileFormat":""// The format in which FTP Writer writes files. 
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1, // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate. Unit: MB/s. 
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Parameters in code for FTP Writer

Parameter

Description

Required

Default value

datasource

The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.

Yes

No default value

timeout

The timeout period of the connection to an FTP server. Unit: milliseconds.

No

60,000

path

The directory on the FTP server to which you want to write files. FTP Writer uses parallel threads to write multiple files to the directory based on the parallelism setting.

Yes

No default value

fileName

The name prefix of the files that you want to write to the FTP server. A random suffix is appended to the prefix to form the actual file name that is used by each thread.

Yes

No default value

singleFileOutput

Specifies whether to write files without a random suffix. By default, FTP Writer appends a random suffix to the value of the fileName parameter to form the names of the files that it writes to the FTP server. If you do not need the random suffix, set the singleFileOutput parameter to true. In this case, FTP Writer uses the value of the fileName parameter as the file name.

No

false

writeMode

The mode in which FTP Writer writes files. Valid values:

  • truncate: If the singleFileOutput parameter is set to true, FTP Writer deletes files with the same names in the destination directory before it writes files to the directory. If the singleFileOutput parameter is set to false, FTP Writer deletes all existing files whose names contain the prefix specified by fileName in the destination directory before it writes files to the directory.

  • append: FTP Writer directly writes files based on the file name prefix specified by fileName and ensures that the actual file names do not conflict with the names of existing files.

  • nonConflict: FTP Writer returns an error if the destination directory contains a file whose name contains the prefix specified by fileName.
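
The interaction between fileName, singleFileOutput, and writeMode can be sketched as follows. The directory and prefix are hypothetical, and writing singleFileOutput as a JSON boolean is an assumption.

```json
"path":"/export/daily",     // Hypothetical destination directory. 
"fileName":"orders",        // The file name prefix. 
"singleFileOutput":true,    // No random suffix, so the output is named orders. 
"writeMode":"truncate"      // An existing file named orders in /export/daily is deleted before the write. 
```

With singleFileOutput left at false, the same truncate setting would instead delete all existing files in /export/daily whose names contain the prefix orders.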

Yes

No default value

fieldDelimiter

The column delimiter that is used in the files that you want to write to the FTP server. The delimiter must be a single character.

Yes

No default value

skipHeader

Specifies whether to skip the headers in a CSV-like file if the file has headers. By default, the headers are not skipped. The skipHeader parameter is unavailable for compressed files.

No

false

compress

The compression format of the files that you want to write to the FTP server. Valid values: gzip and bzip2.

No

No default value

encoding

The encoding format of the files that you want to write to the FTP server.

No

utf-8

nullFormat

The string that represents a null value. Text files have no standard string for null, so you can use this parameter to define which string represents null.

For example, if you set nullFormat to null, Data Integration treats the 4-character string null as a null value.

No

No default value

dateFormat

The format in which the data of the DATE type is serialized in a file, such as "dateFormat":"yyyy-MM-dd".

No

No default value

fileFormat

The format in which files are written to the FTP server. Valid values: CSV and TEXT. If you set this parameter to CSV, the file must follow CSV specifications. If the data in the file contains column delimiters, the column delimiters are escaped by double quotation marks ("). If you set this parameter to TEXT, the data in the file is separated by column delimiters. In this case, the column delimiters are not escaped.

No

TEXT

header

The header row that is written to TXT or CSV files, such as ["id","name","age"]. In this example, id, name, and age are written as the first row of the file.
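
As a short illustration of the formatting parameters, the following fragment writes a CSV file with a header row. The lowercase value csv follows the reader sample in this topic. Whether the value is case-sensitive is not stated here, so treat the spelling as an assumption.

```json
"fileFormat":"csv",             // Values that contain the column delimiter are enclosed in double quotation marks. 
"fieldDelimiter":",",           // Must be a single character. 
"header":["id","name","age"],   // Written as the first row of the file. 
"dateFormat":"yyyy-MM-dd"       // DATE values are serialized in this pattern. 
```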

No

No default value

markDoneFileName

  • The name of the file that is used to indicate that the synchronization task is successfully run. Data Integration checks whether the file exists after data synchronization. Set this parameter to the absolute path of the file.

  • For batch synchronization tasks that are periodically scheduled, we recommend that you add a scheduling parameter to the name of the file. For example, you can set the name of a file to /user/ftp/markDone_${bizdate}.txt, in which ${bizdate} indicates a scheduling parameter.

No

No default value