Amazon S3 (Simple Storage Service) is an object storage service for storing and retrieving any amount of data from anywhere. DataWorks Data Integration supports reading data from and writing data to Amazon S3. This topic describes the features and limitations of the Amazon S3 data source and provides script examples with parameter references.
Limitations
Batch read
The Amazon S3 reader reads files stored as S3 objects. Because Amazon S3 is an unstructured data storage service, reader support is as follows.
| Feature | Supported |
|---|---|
| TXT files (data must form a logical two-dimensional table) | Yes |
| CSV-like files with custom delimiters | Yes |
| ORC format | Yes |
| Parquet format | Yes |
| Reading multiple data types as strings | Yes |
| Column pruning and constant columns | Yes |
| Recursive reading and filename filtering | Yes |
| Text compression: gzip, bzip2, zip | Yes |
| Compressed archives containing multiple files | No |
| Concurrent reading of multiple objects | Yes |
| Multi-threading for a single object | No |
| Multi-threading for a single compressed object | No |
| Objects larger than 100 GB | No |
Batch write
The Amazon S3 writer converts data from the Data Synchronization protocol into text files stored as S3 objects. Because Amazon S3 is an unstructured data storage service, writer support is as follows.
| Feature | Supported |
|---|---|
| Text files (data must form a logical two-dimensional table) | Yes |
| BLOB data (videos, images) | No |
| CSV-like files with custom delimiters | Yes |
| ORC format | Yes |
| Parquet format | Yes |
| Snappy compression (Script Mode only, for ORC and Parquet) | Yes |
| Multi-threaded writing (each thread writes to a separate sub-file) | Yes |
| Automatic file splitting when a file exceeds a specified size | Yes |
| Concurrent writing to a single file | No |
| Native data types | No — all data is written as the STRING type |
| Writing to buckets using the Glacier Deep Archive storage class | No |
| Objects larger than 100 GB | No |
Add a data source
Add the Amazon S3 data source to DataWorks before developing a synchronization task. Follow the instructions in Data source management. Parameter descriptions are available in the DataWorks console when you add the data source.
Develop a data synchronization task
Configure a single-table batch synchronization task
Use the codeless UI or the code editor to configure the task:
- Codeless UI: Configure a task in the codeless UI
- Code editor: Configure a task in the code editor
For script parameters and examples when using the code editor, see Appendix: Script examples and parameter descriptions below.
Appendix: Script examples and parameter descriptions
The following sections describe the parameters to configure when using the code editor. For general script format requirements, see Configure a task in the code editor.
Reader script example
```json
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "s3",
      "parameter": {
        "nullFormat": "",
        "compress": "",
        "datasource": "",
        "column": [
          {
            "index": 0,
            "type": "string"
          },
          {
            "index": 1,
            "type": "long"
          },
          {
            "index": 2,
            "type": "double"
          },
          {
            "index": 3,
            "type": "boolean"
          },
          {
            "index": 4,
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss"
          }
        ],
        "skipHeader": "",
        "encoding": "",
        "fieldDelimiter": ",",
        "fileFormat": "",
        "object": []
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": ""
    },
    "speed": {
      "throttle": true,
      "concurrent": 1,
      "mbps": "12"
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
```
Reader parameters
| Parameter | Description | Required | Default |
|---|---|---|---|
| datasource | Name of the data source. Must match the data source name you added in DataWorks. | Yes | None |
| object | S3 object path or paths to read. Accepts a single path, multiple paths, or wildcard patterns. See Object path patterns below. | Yes | None |
| column | Column configuration. `type` sets the data type; `index` sets the column position (0-based); `value` sets a constant value generated at runtime instead of read from the source. Specify either `index` or `value`; `type` is always required. To read all columns as strings, set `"column": ["*"]`. | Yes | All columns as STRING |
| fieldDelimiter | Delimiter used to separate fields. For non-printable characters, use Unicode escapes such as `\u001b` or `\u007c`. | Yes | `,` (comma) |
| compress | Text compression type. Supported values: gzip, bzip2, zip. | No | None |
| encoding | Encoding of the source files. | No | UTF-8 |
| nullFormat | String to treat as a null value. For example, setting `nullFormat="null"` causes the source string "null" to be read as a null field. | No | None |
| skipHeader | Whether to skip the header row in CSV-like files. Set to true to skip the header; set to false to read it as a data row. Cannot be used with compressed files. | No | false |
| csvReaderConfig | Advanced settings for reading CSV files (map type). If not set, CsvReader defaults apply. | No | None |
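For example, a column configuration can mix positional columns with a runtime constant. The following sketch is illustrative (the constant value and types are hypothetical, not from a real job); the third entry has no index, so every record receives the constant string in that position:

```json
"column": [
  { "index": 0, "type": "string" },
  { "index": 1, "type": "long" },
  { "type": "string", "value": "static_tag" }
]
```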
Object path patterns
The object parameter accepts a single path, multiple paths, or wildcard patterns.
| Pattern | Behavior |
|---|---|
| Single object | The reader uses a single thread. |
| Multiple objects | The reader uses multiple threads. Thread count is controlled by `concurrent`. |
| Wildcard (for example, `abc*[0-9]` matches abc0, abc1, abc2, and abc3) | The reader traverses all matching objects. |
Avoid wildcard patterns that match large numbers of objects, which can cause an OutOfMemoryError. If this error occurs, split the files across multiple directories and read each directory separately.
All objects in a single job are treated as one data table and must share the same schema.
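To make the three patterns concrete, the object parameter could be written in any of the following forms (bucket-relative paths here are hypothetical). The first reads one object with a single thread, the second reads two objects with multiple threads, and the third traverses every object matching the wildcard:

```json
{ "object": ["dir/part-0.csv"] }
{ "object": ["dir_a/part-0.csv", "dir_b/part-1.csv"] }
{ "object": ["logs/2024*"] }
```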
Writer script example
```json
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "s3",
      "category": "writer",
      "name": "Writer",
      "parameter": {
        "datasource": "datasource1",
        "object": "test/csv_file.csv",
        "fileFormat": "csv",
        "encoding": "utf8/gbk/...",
        "fieldDelimiter": ",",
        "lineDelimiter": "\n",
        "column": [
          "0",
          "1"
        ],
        "header": [
          "col_bigint",
          "col_tinyint"
        ],
        "writeMode": "truncate",
        "writeSingleObject": true
      }
    }
  ],
  "setting": {
    "errorLimit": {
      "record": ""
    },
    "speed": {
      "throttle": true,
      "concurrent": 1,
      "mbps": "12"
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
```
Writer parameters
| Parameter | Description | Required | Default |
|---|---|---|---|
| datasource | Name of the data source. Must match the data source name you added in DataWorks. | Yes | None |
| object | Destination object name or prefix. | Yes | None |
| fileFormat | Output file format. See File formats and compression below. | Yes | text |
| writeMode | How to handle existing objects before writing. See Write modes below. | Yes | append |
| column | Column configuration of the output file. For csv or text format, use numeric placeholders such as `["0", "1"]`. For PARQUET or ORC format, specify a name and type for each column. | Yes | None |
| fieldDelimiter | Delimiter used to separate fields in the output file. | No | `,` (comma) |
| lineDelimiter | Delimiter used to separate lines in the output file. | No | `\n` |
| compress | Compression type. For text or csv: gzip and bzip2 are supported. For PARQUET or ORC: Snappy is supported. | No | None |
| nullFormat | String to write when a value is null. For example, if `nullFormat="null"`, null values are written as the string "null". | No | None |
| header | Header row written at the top of the output file. Example: `["id", "name", "age"]`. | No | None |
| writeSingleObject | Whether to write all data to a single file. For ORC or Parquet output, this parameter has no effect in high-concurrency scenarios: the writer always appends a random suffix to the file name. To approximate a single file, set the number of concurrent threads to 1; the writer still adds a random suffix, and the lower concurrency reduces synchronization speed. If the source is Hologres (which reads by shard), multiple output files may still be produced even with `concurrent: 1`. | No | false |
| encoding | Encoding of the output file. | No | UTF-8 |
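A minimal sketch of the single-file setup described for writeSingleObject, assuming the job structure from the writer script example (only the relevant keys are shown, out of context):

```json
{
  "parameter": {
    "writeSingleObject": true
  },
  "setting": {
    "speed": {
      "throttle": true,
      "concurrent": 1,
      "mbps": "12"
    }
  }
}
```

Even with this setting, the output file name still carries a random suffix, and a Hologres source may still produce multiple files.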
File formats and compression
| Format | Description | Supported compression |
|---|---|---|
| text | Columns separated by the specified delimiter. No escape character is used when data contains the delimiter. | gzip, bzip2 |
| csv | Standard CSV. If data contains the column delimiter, the writer escapes it with double quotation marks (`"`). | gzip, bzip2 |
| PARQUET | Apache Parquet columnar format. | Snappy |
| ORC | Optimized Row Columnar format. | Snappy |
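For PARQUET output with Snappy compression, the writer parameter block might look like the following sketch. The column names and type strings are hypothetical; check the type names supported by your DataWorks version:

```json
"fileFormat": "parquet",
"compress": "snappy",
"column": [
  { "name": "id", "type": "bigint" },
  { "name": "name", "type": "string" }
]
```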
Write modes
| Mode | Behavior | Use when |
|---|---|---|
| truncate | Deletes all existing objects matching the specified prefix before writing. For example, `"object": "abc"` deletes all objects whose names start with abc. | Overwriting a full dataset each run |
| append | Writes to a new object with a random universally unique identifier (UUID) suffix, leaving existing objects untouched. The resulting file name follows the pattern `DI_xxxx_xxxx_xxxx`. | Incrementally adding data without modifying existing files |
| nonConflict | Fails with an error if any object matching the specified prefix already exists. For example, `"object": "abc"` fails if an object named abc123 exists. | Ensuring no accidental overwrites |
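For instance, a nightly job that rebuilds a full export under one prefix could pair truncate with a prefix-style object name (the names here are illustrative):

```json
"object": "daily_export/part",
"writeMode": "truncate"
```

Every run first deletes all objects whose names start with daily_export/part, then writes fresh output.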