DataWorks: Amazon S3 data source

Last Updated: Apr 29, 2026

Amazon Simple Storage Service (Amazon S3) is an object storage service built to store and retrieve any amount of data from anywhere. DataWorks Data Integration lets you read data from and write data to Amazon S3. This topic describes the capabilities of the Amazon S3 data source in DataWorks.

Limitations

Batch read

Amazon S3 stores unstructured data. In Data Integration, the Amazon S3 reader supports the following features.

Supported

  • Reads TXT files. The content of each file must be organized as a two-dimensional table.

  • Reads data from CSV-like objects with custom delimiters.

  • Reads data in ORC and PARQUET formats.

  • Reads various data types as strings and supports column pruning and constant columns.

  • Supports recursive reading and object name filtering.

  • Supports object compression. The supported compression formats are gzip, bzip2, and zip.

    Note

    You cannot compress multiple objects into a single package.

  • Supports concurrent reading of multiple objects.

Unsupported

  • Does not support multi-threaded reads of a single Object (File).

  • Does not support multi-threaded reads of a single compressed Object.

  • Does not support reading a single Object (File) that is larger than 100 GB.

Batch write

The Amazon S3 writer converts data from the data synchronization protocol to text files in Amazon S3. Amazon S3 itself is an unstructured data store. The Amazon S3 writer supports the following features.

Supported

  • Only text-type files can be written (BLOB types such as videos and images are not supported), and the schema in the text file must be a two-dimensional table.

  • Writes data to CSV-like files with custom delimiters.

  • Writes data in ORC and PARQUET formats.

    Note

    SNAPPY compression is supported in script mode.

  • Supports multi-threaded writing. Each thread writes to a different sub-file.

  • Supports file rolling. When a file exceeds a specified size, the system switches to a new file.

Unsupported

  • Does not support concurrent writes to a single file.

  • Amazon S3 does not provide data types. The Amazon S3 writer writes all data to Amazon S3 objects as STRING type.

  • If the storage class of the Amazon S3 bucket is Deep Archive, write operations are not supported.

  • A single Object (File) cannot exceed 100 GB.
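
The file rolling behavior listed above can be pictured with a minimal sketch. It is not the writer's implementation: the function name roll_records, the max_bytes threshold, and the way sub-files are represented are all hypothetical, and the sketch only assumes that a new file is started once the current one exceeds the configured size.

```python
def roll_records(records, max_bytes):
    """Group text records into files, switching to a new file once the
    current file has grown past max_bytes (a hypothetical threshold)."""
    files = [[]]
    size = 0
    for record in records:
        line = record + "\n"
        files[-1].append(line)
        size += len(line.encode("utf-8"))
        if size > max_bytes:
            # The current file exceeded the limit: start a new one.
            files.append([])
            size = 0
    # Drop an empty trailing file left over after the last switch.
    return [f for f in files if f]


rolled = roll_records(["aaaa", "bbbb", "cccc"], max_bytes=6)
```

In this sketch the record that crosses the threshold stays in the file it was written to, so a file can end up slightly larger than max_bytes.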

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.

Configure a single-table batch synchronization task

Appendix: Script demo and parameter description

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Use the Code Editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Reader script demo

{
    "type":"job",
    "version":"2.0",// The version number.
    "steps":[
        {
            "stepType":"s3",// The plug-in name.
            "parameter":{
                "nullFormat":"",// The string that represents a null value.
                "compress":"",// The compression type.
                "datasource":"",// The data source name.
                "column":[// The columns.
                    {
                        "index":0,// The column index.
                        "type":"string"// The data type.
                    },
                    {
                        "index":1,
                        "type":"long"
                    },
                    {
                        "index":2,
                        "type":"double"
                    },
                    {
                        "index":3,
                        "type":"boolean"
                    },
                    {
                        "format":"yyyy-MM-dd HH:mm:ss", // The time format.
                        "index":4,
                        "type":"date"
                    }
                ],
                "skipHeader":"",// Specifies whether to skip the header row of a CSV-like file.
                "encoding":"",// The encoding format.
                "fieldDelimiter":",",// The column delimiter.
                "fileFormat": "",// The file format.
                "object":[]// The object prefix.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":""// The error count.
        },
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. A value of false indicates that throttling is disabled and the mbps parameter does not take effect. A value of true indicates that throttling is enabled.
            "concurrent":1,// The concurrency.
            "mbps":"12"// The throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
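
The demo is annotated with // comments, which strict JSON parsers such as Python's json module reject. If you generate the script programmatically, a sketch like the following produces the same structure as plain JSON. The function name build_reader_job and the data source name my_s3_source are hypothetical, fileFormat and skipHeader are omitted because their accepted values are not spelled out here, and the remaining values are taken from the demo or from the parameter defaults described in this topic.

```python
import json


def build_reader_job(datasource, objects, columns, field_delimiter=","):
    """Return a job dict shaped like the reader script demo."""
    reader = {
        "stepType": "s3",
        "parameter": {
            "datasource": datasource,
            "object": list(objects),
            "column": list(columns),
            "fieldDelimiter": field_delimiter,
            "encoding": "utf-8",
            "nullFormat": "null",
            "compress": "",  # empty means no compression
        },
        "name": "Reader",
        "category": "reader",
    }
    writer = {
        "stepType": "stream",
        "parameter": {},
        "name": "Writer",
        "category": "writer",
    }
    return {
        "type": "job",
        "version": "2.0",
        "steps": [reader, writer],
        "setting": {
            "errorLimit": {"record": ""},  # left empty, as in the demo
            "speed": {"throttle": True, "concurrent": 1, "mbps": "12"},
        },
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


job = build_reader_job(
    "my_s3_source",  # hypothetical data source name
    ["test/ll.txt"],
    [{"index": 0, "type": "string"}, {"index": 1, "type": "long"}],
)
script = json.dumps(job, indent=4)
```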

Reader script parameters

Parameter

Description

Required

Default value

datasource

The data source name. Script mode allows you to add data sources. The value of this parameter must be the same as the name of the data source that you add.

Yes

N/A

object

The object information in Amazon S3. You can specify multiple objects. For example, if the bucket contains a test folder, and the folder contains a file named ll.txt, set object to test/ll.txt.

  • When you specify a single S3 object, the Amazon S3 reader supports only single-threaded data extraction.

  • When you specify multiple S3 objects, the Amazon S3 reader supports multi-threaded data extraction. The number of concurrent threads is specified by the number of channels.

  • When you specify wildcards, the Amazon S3 reader attempts to list multiple objects. For example, abc*[0-9] matches abc0, abc1, abc2, abc3, and so on. Using wildcards may cause out-of-memory errors. We recommend that you do not use wildcards.

Note
  • The data synchronization system treats all objects synchronized by a single job as one data table. Make sure that all objects conform to the same schema.

  • Control the number of files in a single directory. Otherwise, an OutOfMemoryError error may be triggered. If this occurs, split the files into different directories and try again.

Yes

N/A
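
To get a feel for what a wildcard such as abc*[0-9] selects before you run a job, you can try the pattern against a list of object names locally. The sketch below assumes that the reader's wildcards behave like shell-style globbing, which this topic does not confirm, and uses Python's fnmatch module as a stand-in. The helper match_objects and the object names are hypothetical.

```python
from fnmatch import fnmatchcase


def match_objects(object_keys, pattern):
    """Return the keys that match a shell-style pattern (case-sensitive)."""
    return [key for key in object_keys if fnmatchcase(key, pattern)]


keys = ["abc0", "abc1", "abc_part9", "abc", "abd1"]
matched = match_objects(keys, "abc*[0-9]")
```

Here abc is not matched because the pattern requires a trailing digit, and abc_part9 is matched because * can span any run of characters. A broad pattern can therefore list far more objects than intended, which is one reason to prefer explicit object names.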

column

The list of columns to read. The type parameter specifies the data type of the source data. The index parameter specifies the column number in the text file (starting from 0). The value parameter specifies that the current column is a constant. Instead of reading data from the source file, the system generates the column based on the specified value.

By default, you can read all data as String type. Example configuration:

"column": ["*"]

You can also specify column information. Example configuration:

"column": [
    {
        "type": "long",
        "index": 0 // Read the first column of the S3 text as a long field.
    },
    {
        "type": "string",
        "value": "alibaba" // Generate the constant string "alibaba" in the S3 reader as the value of this field.
    }
]
Note

For the column information that you specify, type is required, and you must specify either index or value.

Yes

All data is read as STRING type.
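
Column pruning and constant columns, as described for the column parameter, can be sketched as follows. This is an illustration of the semantics only, not the reader's code: the helper project_row is hypothetical, and it handles just the string, long, and double types.

```python
def project_row(fields, columns):
    """Apply a column list to the fields of one split line of text."""
    if columns == ["*"]:
        return list(fields)  # every field, kept as a string
    casts = {"string": str, "long": int, "double": float}
    row = []
    for col in columns:
        cast = casts[col["type"]]
        if "value" in col:
            # Constant column: generated from the configured value.
            row.append(cast(col["value"]))
        else:
            # Regular column: read from the source by zero-based index.
            row.append(cast(fields[col["index"]]))
    return row


line = "42,ignored,3.5"
columns = [
    {"type": "long", "index": 0},
    {"type": "string", "value": "alibaba"},
    {"type": "double", "index": 2},
]
row = project_row(line.split(","), columns)
```

The second source field is never referenced by an index, so it is pruned from the output, while the constant column appears in every row.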

fieldDelimiter

The column delimiter for reading data.

Note

A column delimiter is required when the Amazon S3 reader reads data. If you do not specify one, the default delimiter, a comma (,), is used. The codeless UI also uses a comma (,) by default.

If the delimiter is an invisible character, specify it as a Unicode escape sequence. For example, \u001b or \u007c.

Yes

Default value: (,)
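
The following sketch shows how a Unicode escape such as \u001b resolves to a single character when the script is parsed as JSON, and how that character splits a line. It uses Python's json module for the parsing step. The sample line is made up for illustration.

```python
import json

# The raw string keeps the backslash, exactly as it would appear in the script.
config = json.loads(r'{"fieldDelimiter": "\u001b"}')
delimiter = config["fieldDelimiter"]

# A sample line whose fields are separated by the ESC character (U+001B).
line = "1\x1balice\x1b30"
fields = line.split(delimiter)
```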

compress

The compression type. By default, this parameter is left empty, which indicates that no compression is applied. The supported compression types are gzip, bzip2, and zip.

No

No compression
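
As a rough illustration of the three compress values, the sketch below decompresses the bytes of one object with Python's standard library. It is not the reader's implementation. The helper decompress_object is hypothetical, and the check for a single archive member reflects the note that multiple objects cannot be compressed into one package.

```python
import bz2
import gzip
import io
import zipfile


def decompress_object(data, compress):
    """Return the decompressed bytes of one object."""
    if not compress:
        return data  # empty value: no compression
    if compress == "gzip":
        return gzip.decompress(data)
    if compress == "bzip2":
        return bz2.decompress(data)
    if compress == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
            if len(names) != 1:
                raise ValueError("expected exactly one file in the archive")
            return archive.read(names[0])
    raise ValueError(f"unsupported compress value: {compress}")
```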

encoding

The encoding of the files to read.

No

utf-8

nullFormat

Standard strings in text files cannot represent null (null pointer). The data synchronization system uses nullFormat to define which strings can represent null. For example, if you set nullFormat="null" and the source data is "null", the data synchronization system treats it as a null field.

No

N/A
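
The effect of nullFormat on the fields of one record can be sketched as below. The helper apply_null_format is hypothetical. It only shows that a field equal to the configured string is treated as null, while other values, including empty strings, are left unchanged.

```python
def apply_null_format(fields, null_format):
    """Replace fields equal to null_format with None (null)."""
    if null_format is None:
        return list(fields)  # nullFormat not configured
    return [None if field == null_format else field for field in fields]


record = apply_null_format(["1", "null", ""], "null")
```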

skipHeader

For CSV-like files, skipHeader specifies whether to skip the header row.

  • true: The header row is skipped and is not synchronized as data.

  • false: The header row is not skipped and is read as a regular data row.

Note

skipHeader is not supported for compressed files.

No

false

csvReaderConfig

The configuration for reading CSV files. This parameter is of the Map type. The CsvReader is used to read CSV files and provides various configurations. If you do not configure this parameter, default values are used.

No

N/A

Writer script demo

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "s3",
            "category": "writer",
            "name": "Writer",
            "parameter": {
                "datasource": "datasource1",
                "object": "test/csv_file.csv",
                "fileFormat": "csv",
                "encoding": "utf8/gbk/...",
                "fieldDelimiter": ",",
                "lineDelimiter": "\n",
                "column": [
                    "0",
                    "1"
                ],
                "header": [
                    "col_bigint",
                    "col_tinyint"
                ],
                "writeMode": "truncate",
                "writeSingleObject": true
            }
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "" // The error count.
        },
        "speed": {
            "throttle": true, // Specifies whether to enable throttling. A value of false indicates that throttling is disabled and the mbps parameter does not take effect. A value of true indicates that throttling is enabled.
            "concurrent": 1, // The concurrency.
            "mbps": "12" // The throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Writer script parameters

Parameter

Description

Required

Default value

datasource

The data source name. Script mode allows you to add data sources. The value of this parameter must be the same as the name of the data source that you add.

Yes

N/A

object

The name of the destination object.

Yes

N/A

fileFormat

The following file formats are supported:

  • csv: Only strict CSV format is supported. If the data to write contains column delimiters, the data is escaped based on CSV escape syntax. The escape character is a double quotation mark (").

  • text: Uses column delimiters to simply separate data. Data that contains column delimiters is not escaped.

  • parquet

  • orc

Yes

text
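
The difference between csv and text shows up when a value contains the column delimiter or a double quotation mark. The sketch below uses Python's csv module as a stand-in for strict CSV escaping and a plain join for the text format. It illustrates the escaping rules rather than the writer's own code, and the helper names are hypothetical.

```python
import csv
import io


def format_text(fields, delimiter=","):
    """text: join the fields with the delimiter, without escaping."""
    return delimiter.join(fields)


def format_csv(fields, delimiter=","):
    """csv: quote fields that contain the delimiter or a double quote."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerow(fields)
    return buffer.getvalue().rstrip("\n")


fields = ["1", "Doe, Jane", 'say "hi"']
text_line = format_text(fields)  # 1,Doe, Jane,say "hi"
csv_line = format_csv(fields)    # 1,"Doe, Jane","say ""hi"""
```

The text line can no longer be split back into three fields, because the comma inside the second value is indistinguishable from a delimiter. The csv line can.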

writeMode

  • truncate: Before writing, all objects whose names match the specified object name prefix are deleted. For example, if you set "object":"abc", all objects whose names start with abc are deleted.

  • append: No processing is performed before writing. The Data Integration S3 writer directly writes data by using the specified object name with a random UUID suffix to avoid file name conflicts. For example, if you set the object name to DI, the actual file written is DI_xxxx_xxxx_xxxx.

  • nonConflict: An error is reported if any object with a matching prefix exists in the specified path. For example, if you set "object":"abc" and an object named abc123 exists, an error is reported.

Yes

append
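
The three write modes can be compared with a small sketch that works on an in-memory set of object names instead of a real bucket. The helper plan_write is hypothetical, and the exact suffix format used in append mode is an assumption. The sketch only mirrors the prefix rules described above.

```python
import uuid


def plan_write(existing, object_name, write_mode):
    """Return (objects_to_delete, target_name) for one write."""
    matches = sorted(k for k in existing if k.startswith(object_name))
    if write_mode == "truncate":
        # Every object that shares the prefix is deleted before writing.
        return matches, object_name
    if write_mode == "append":
        # Nothing is deleted; a random suffix avoids name conflicts.
        return [], f"{object_name}_{uuid.uuid4().hex}"
    if write_mode == "nonConflict":
        if matches:
            raise FileExistsError(
                f"objects already match prefix {object_name!r}: {matches}"
            )
        return [], object_name
    raise ValueError(f"unknown writeMode: {write_mode}")


existing = {"abc123", "abcd", "xyz"}
to_delete, target = plan_write(existing, "abc", "truncate")
```

Because truncate works on a prefix, a short object name such as abc also removes unrelated objects like abcd. Choosing a distinctive object name limits what gets deleted.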

fieldDelimiter

The column delimiter for writing data.

No

Default value: (,)

lineDelimiter

The line delimiter for writing data.

No

Default value: (\n)

compress

The compression type. By default, this parameter is left empty, which indicates that no compression is applied.

  • When fileFormat is set to text or csv, GZIP and BZIP2 are supported.

  • When fileFormat is set to parquet or orc, SNAPPY compression is supported.

No

No compression
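
The pairing of fileFormat and compress described above can be checked before a job is submitted. The sketch below is a hypothetical pre-flight check, not part of DataWorks. It compares values case-insensitively because the exact spelling that the writer accepts is not stated here.

```python
# Compression types per file format, as listed for the compress parameter.
SUPPORTED_COMPRESS = {
    "text": {"gzip", "bzip2"},
    "csv": {"gzip", "bzip2"},
    "parquet": {"snappy"},
    "orc": {"snappy"},
}


def check_compress(file_format, compress):
    """Return True if the compress value is allowed for the file format."""
    if not compress:
        return True  # empty value: no compression
    allowed = SUPPORTED_COMPRESS.get(file_format.lower(), set())
    return compress.lower() in allowed
```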

nullFormat

Standard strings in text files cannot represent null (null pointer). The data synchronization system uses nullFormat to define which strings can represent null. For example, if you set nullFormat="null" and the source data is null, the data synchronization system treats it as a null field.

No

N/A

header

The header to write. Example: ["id", "name", "age"].

No

N/A

writeSingleObject

true: Writes data to a single file. false: Writes data to multiple files.

Note
  • When you write data in ORC or Parquet format, the writeSingleObject parameter does not take effect. Even with this parameter, you cannot write data to a single ORC or Parquet file in multi-concurrency scenarios. To write data to a single file, set the concurrency to 1. However, a random suffix is added to the file name, and setting the concurrency to 1 affects the synchronization speed.

  • In some scenarios, for example, when the source is Hologres, data is read based on shard partitions. Even with single concurrency, multiple files may be generated.

No

false

encoding

The encoding of the files to write.

No

utf-8

column

The column configuration for writing data.

  • When fileFormat is set to csv or text, configure the column parameter with numeric placeholders. Example:

    "column":[
     "0",
     "1"
     ]
  • When fileFormat is set to Parquet or ORC, configure the column parameter with name and type combinations. Example:

    "column": [
      {
        "name": "col1",
        "type": "BIGINT"
      },
      {
        "name": "col2",
        "type": "DOUBLE"
      }
    ]

Yes

N/A
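
The two shapes of the writer's column parameter can be generated from one schema definition. The sketch below is a hypothetical helper, not a DataWorks API. It assumes the schema is given as a list of (name, type) pairs and returns numeric placeholders for csv and text, or name and type combinations for parquet and orc.

```python
def writer_columns(file_format, schema):
    """Build the column parameter for the given file format.

    schema is a list of (name, type) pairs, for example
    [("col1", "BIGINT"), ("col2", "DOUBLE")].
    """
    fmt = file_format.lower()
    if fmt in ("csv", "text"):
        # Numeric placeholders: "0", "1", ... in column order.
        return [str(i) for i in range(len(schema))]
    if fmt in ("parquet", "orc"):
        return [{"name": name, "type": col_type} for name, col_type in schema]
    raise ValueError(f"unknown fileFormat: {file_format}")


schema = [("col1", "BIGINT"), ("col2", "DOUBLE")]
csv_columns = writer_columns("csv", schema)
parquet_columns = writer_columns("parquet", schema)
```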