
DataWorks:Amazon S3

Last Updated: Mar 26, 2026

Amazon S3 (Simple Storage Service) is an object storage service for storing and retrieving any amount of data from anywhere. DataWorks Data Integration supports reading data from and writing data to Amazon S3. This topic describes the features and limitations of the Amazon S3 data source and provides script examples with parameter references.

Limitations

Batch read

The Amazon S3 reader reads files from S3 objects. Because Amazon S3 is an unstructured data storage service, the following capabilities apply.

Feature | Supported
TXT files (must contain a schema for a two-dimensional table) | Yes
CSV-like files with custom delimiters | Yes
ORC format | Yes
Parquet format | Yes
Reading multiple data types as strings | Yes
Column pruning and constant columns | Yes
Recursive reading and filename filtering | Yes
Text compression (gzip, bzip2, zip) | Yes
Compressed archives containing multiple files | No
Concurrent reading of multiple objects | Yes
Multi-threading for a single object | No
Multi-threading for a single compressed object | No
Objects larger than 100 GB | No

Batch write

The Amazon S3 writer converts data from the Data Synchronization protocol into text files stored as S3 objects. Because Amazon S3 is an unstructured data storage service, the following capabilities apply.

Feature | Supported
Text files (must contain a schema for a two-dimensional table) | Yes
BLOB data (videos, images) | No
CSV-like files with custom delimiters | Yes
ORC format | Yes
Parquet format | Yes
Snappy compression (Script Mode only, for ORC and Parquet) | Yes
Multi-threaded writing (each thread writes to a separate sub-file) | Yes
Automatic file splitting when a file exceeds a specified size | Yes
Concurrent writing to a single file | No
Native data types | No (all data is written as the STRING type)
Writing to buckets that use the Glacier Deep Archive storage class | No
Objects larger than 100 GB | No

Add a data source

Add the Amazon S3 data source to DataWorks before developing a synchronization task. Follow the instructions in Data source management. Parameter descriptions are available in the DataWorks console when you add the data source.

Develop a data synchronization task

Configure a single-table batch synchronization task

You can configure the task in the codeless UI or in the code editor. For the script parameters and examples used in the code editor, see Appendix: Script examples and parameter descriptions below.

Appendix: Script examples and parameter descriptions

The following sections describe the parameters to configure when using the code editor. For general script format requirements, see Configure a task in the code editor.

Reader script example

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "s3",
            "parameter": {
                "nullFormat": "",
                "compress": "",
                "datasource": "",
                "column": [
                    {
                        "index": 0,
                        "type": "string"
                    },
                    {
                        "index": 1,
                        "type": "long"
                    },
                    {
                        "index": 2,
                        "type": "double"
                    },
                    {
                        "index": 3,
                        "type": "boolean"
                    },
                    {
                        "index": 4,
                        "type": "date",
                        "format": "yyyy-MM-dd HH:mm:ss"
                    }
                ],
                "skipHeader": "",
                "encoding": "",
                "fieldDelimiter": ",",
                "fileFormat": "",
                "object": []
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle": true,
            "concurrent": 1,
            "mbps": "12"
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Reader parameters

Parameter | Description | Required | Default
datasource | Name of the data source. Must match the data source name you added in DataWorks. | Yes | None
object | S3 object path or paths to read. Accepts a single path, multiple paths, or wildcard patterns. See Object path patterns below. | Yes | None
column | Column configuration. type sets the data type; index sets the column position (0-based); value sets a constant value generated at runtime instead of read from the source. Specify either index or value; type is always required. To read all columns as strings, set "column": ["*"]. | Yes | All columns as STRING
fieldDelimiter | Delimiter used to separate fields. For non-printable characters, use Unicode escapes such as \u001b or \u007c. | Yes | , (comma)
compress | Text compression type. Supported values: gzip, bzip2, zip. | No | None
encoding | Encoding of the source files. | No | UTF-8
nullFormat | String to treat as a null value. For example, setting nullFormat="null" causes the source string "null" to be read as a null field. | No | None
skipHeader | Whether to skip the header row in CSV-like files. Set to true to skip the header; set to false to read it as a data row. Cannot be used with compressed files. | No | false
csvReaderConfig | Advanced settings for reading CSV files (map type). If not set, CsvReader defaults apply. | No | None
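As a sketch of what the csvReaderConfig map can look like, the fragment below mirrors key names exposed by comparable DataWorks readers that wrap the underlying CsvReader library. Treat the specific keys (safetySwitch, skipEmptyRecords, useTextQualifier) as assumptions and verify them against your DataWorks version:

```json
"csvReaderConfig": {
    "safetySwitch": false,
    "skipEmptyRecords": false,
    "useTextQualifier": false
}
```

In comparable readers, safetySwitch: false lifts the per-column length safety limit, skipEmptyRecords: false keeps empty rows as records, and useTextQualifier: false disables quote-character handling; unset keys fall back to CsvReader defaults.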

Object path patterns

The object parameter accepts a single path, multiple paths, or wildcard patterns.

Pattern | Behavior
Single object | The reader uses a single thread.
Multiple objects | The reader uses multiple threads. The thread count is controlled by the concurrent setting.
Wildcard (for example, abc*[0-9] matches abc0, abc1, abc2, abc3) | The reader traverses all matching objects.
Warning

Avoid wildcard patterns that match a large number of objects, because traversing them can cause an OutOfMemoryError. If this error occurs, split the files across multiple directories and read each directory in a separate task.

All objects in a single job are treated as one data table and must share the same schema.
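For illustration, the object setting in a reader configuration might combine these patterns as follows (the bucket directory and file names are hypothetical placeholders):

```json
"object": [
    "bucket_dir/one_file.csv",
    "bucket_dir/2024/part_*[0-9].csv"
]
```

The first entry names a single object; the second uses a wildcard that the reader expands by traversal. All objects matched by either entry must share the same schema, because the job treats them as one table.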

Writer script example

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "s3",
            "category": "writer",
            "name": "Writer",
            "parameter": {
                "datasource": "datasource1",
                "object": "test/csv_file.csv",
                "fileFormat": "csv",
                "encoding": "utf8/gbk/...",
                "fieldDelimiter": ",",
                "lineDelimiter": "\n",
                "column": [
                    "0",
                    "1"
                ],
                "header": [
                    "col_bigint",
                    "col_tinyint"
                ],
                "writeMode": "truncate",
                "writeSingleObject": true
            }
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle": true,
            "concurrent": 1,
            "mbps": "12"
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Writer parameters

Parameter | Description | Required | Default
datasource | Name of the data source. Must match the data source name you added in DataWorks. | Yes | None
object | Destination object name or prefix. | Yes | None
fileFormat | Output file format. See File formats and compression below. | Yes | text
writeMode | How to handle existing objects before writing. See Write modes below. | Yes | append
column | Column configuration of the output file. For csv or text format, use numeric placeholders: ["0", "1"]. For PARQUET or ORC format, specify name and type for each column. | Yes | None
fieldDelimiter | Delimiter used to separate fields in the output file. | No | , (comma)
lineDelimiter | Delimiter used to separate lines in the output file. | No | \n
compress | Compression type. For text or csv: gzip and bzip2. For PARQUET or ORC: Snappy. | No | None
nullFormat | String to write when a value is null. For example, if nullFormat="null", null values are written as the string "null". | No | None
header | Header row written at the top of the output file. Example: ["id", "name", "age"]. | No | None
writeSingleObject | Whether to write all data to a single file. For ORC or Parquet output, this parameter has no effect in high-concurrency scenarios: the writer always appends a random suffix to each file name. To approximate single-file output, set concurrent to 1, which reduces synchronization speed; even then, the writer still appends a random suffix. If the source is Hologres, which reads by shard, multiple output files may be produced even with concurrent set to 1. | No | false
encoding | Encoding of the output file. | No | UTF-8
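To make the name/type column form for PARQUET and ORC concrete, the fragment below sketches a writer parameter block that pairs typed columns with Snappy compression. The column names and types here are hypothetical, and the exact set of accepted type names may vary by DataWorks version:

```json
"fileFormat": "parquet",
"compress": "snappy",
"column": [
    { "name": "col_id", "type": "bigint" },
    { "name": "col_name", "type": "string" }
]
```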

File formats and compression

Format | Description | Supported compression
text | Columns separated by the specified delimiter. No escape character is used when data contains the delimiter. | gzip, bzip2
csv | Standard CSV. If data contains the column delimiter, the writer escapes it with double quotation marks ("). | gzip, bzip2
PARQUET | Apache Parquet columnar format. | Snappy
ORC | Optimized Row Columnar format. | Snappy

Write modes

Mode | Behavior | Use when
truncate | Deletes all existing objects matching the specified prefix before writing. For example, "object": "abc" deletes all objects whose names start with abc. | Overwriting a full dataset each run
append | Writes to a new object with a random universally unique identifier (UUID) suffix, leaving existing objects untouched. The resulting file name follows the pattern DI_xxxx_xxxx_xxxx. | Incrementally adding data without modifying existing files
nonConflict | Fails with an error if any object matching the specified prefix already exists. For example, "object": "abc" fails if an object named abc123 exists. | Ensuring no accidental overwrites
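As a sketch of the prefix semantics described above (the path is a hypothetical placeholder), the following writer settings delete every object whose name begins with reports/daily before each run:

```json
"object": "reports/daily",
"writeMode": "truncate"
```

Switching writeMode to nonConflict would instead abort the task if any object with the reports/daily prefix already exists, and append would leave existing objects in place and write a new DI_xxxx_xxxx_xxxx file.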