Amazon Simple Storage Service (Amazon S3) is an object storage service built to store and retrieve any amount of data from anywhere. DataWorks Data Integration lets you read data from and write data to Amazon S3. This topic describes the capabilities of the Amazon S3 data source in DataWorks.
Limitations
Batch read
Amazon S3 stores unstructured data. In Data Integration, the Amazon S3 reader supports the following features.
Supported | Unsupported |
|
|
Batch write
The Amazon S3 writer converts data from the data synchronization protocol to text files in Amazon S3. Amazon S3 itself is an unstructured data store. The Amazon S3 writer supports the following features.
Supported | Unsupported |
|
|
Add a data source
Before you develop a synchronization task in DataWorks, add the required data source to DataWorks by following the instructions in Data source management. When you add the data source, the DataWorks console shows a description of each parameter.
Develop a data synchronization task
For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure a single-table batch synchronization task
For the procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For the full parameter list and a script demo for script mode, see Appendix: Script demo and parameter description.
Appendix: Script demo and parameter description
Configure a batch synchronization task by using the code editor
To configure a batch synchronization task by using the code editor, set the related parameters in the script according to the unified script format requirements. For more information, see Use the Code Editor. The following sections describe the data source parameters that you configure in the code editor.
Reader script demo
{
    "type":"job",
    "version":"2.0",// The version number.
    "steps":[
        {
            "stepType":"s3",// The plug-in name.
            "parameter":{
                "nullFormat":"",// The string that represents a null value.
                "compress":"",// The compression type.
                "datasource":"",// The data source name.
                "column":[// The columns.
                    {
                        "index":0,// The column index.
                        "type":"string"// The data type.
                    },
                    {
                        "index":1,
                        "type":"long"
                    },
                    {
                        "index":2,
                        "type":"double"
                    },
                    {
                        "index":3,
                        "type":"boolean"
                    },
                    {
                        "format":"yyyy-MM-dd HH:mm:ss", // The time format.
                        "index":4,
                        "type":"date"
                    }
                ],
                "skipHeader":"",// Specifies whether to skip the header row of a CSV-like file.
                "encoding":"",// The encoding format.
                "fieldDelimiter":",",// The column delimiter.
                "fileFormat": "",// The file format.
                "object":[]// The object prefix.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":""// The error count.
        },
        "speed":{
            "throttle":true,// Specifies whether to enable throttling. A value of false indicates that throttling is disabled and the mbps parameter does not take effect. A value of true indicates that throttling is enabled.
            "concurrent":1, // The concurrency.
            "mbps":"12"// The throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
Reader script parameters
Parameter | Description | Required | Default value |
datasource | The name of the data source. In script mode, the value must match the name of a data source that you added. | Yes | N/A |
object | The object or objects to read from Amazon S3. You can specify multiple objects. For example, if the bucket contains a folder named test, and the folder contains a file named ll.txt, set object to test/ll.txt. | Yes | N/A |
column | The list of columns to read. The type parameter specifies the data type of the source data. The index parameter specifies the position of the column in the text file, starting from 0. The value parameter marks the column as a constant: instead of reading data from the source file, the system generates the column from the specified value. By default, all data is read as the STRING type. You can also specify the column information explicitly. Note: For each column that you specify, type is required, and you must specify either index or value. | Yes | All data is read as STRING type. |
fieldDelimiter | The column delimiter for reading data. Note: The Amazon S3 reader needs a column delimiter to read data. If you do not specify one, the default delimiter, a comma (,), is used. The codeless UI uses the same default. If the delimiter is a non-printable character, specify its Unicode escape, for example \u001b or \u007c. | Yes | , (comma) |
compress | The compression type. By default, this parameter is left empty, which indicates that no compression is applied. The supported compression types are gzip, bzip2, and zip. | No | No compression |
encoding | The encoding of the files to read. | No | utf-8 |
nullFormat | Standard strings in text files cannot represent null (null pointer). The data synchronization system uses nullFormat to define which strings can represent null. For example, if you set | No | N/A |
skipHeader | For CSV files, use skipHeader to specify whether to read the header row. Note: skipHeader is not supported for compressed files. | No | false |
csvReaderConfig | The configuration for reading CSV files. This parameter is of the Map type. The CsvReader is used to read CSV files and provides various configurations. If you do not configure this parameter, default values are used. | No | N/A |
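The rules for object, column, and fieldDelimiter in the table above can be illustrated with a fragment of the reader's parameter block. This is a minimal sketch under stated assumptions, not an official example: the data source name my_s3_source, the second object path test/mm.txt, and the constant value s3_import are hypothetical placeholders, and the source files are assumed to be gzip-compressed text whose columns are separated by the non-printable character \u001b.

```json
"parameter": {
    "datasource": "my_s3_source",  // Hypothetical name; must match a data source that you added.
    "object": [                    // Multiple objects can be listed.
        "test/ll.txt",
        "test/mm.txt"              // Hypothetical second object.
    ],
    "column": [
        { "index": 0, "type": "long" },             // Read from column 0 of the file.
        { "index": 1, "type": "string" },           // Read from column 1 of the file.
        { "value": "s3_import", "type": "string" }  // Constant column (hypothetical value); not read from the file.
    ],
    "fieldDelimiter": "\u001b",    // Non-printable delimiter written as a Unicode escape.
    "compress": "gzip",            // One of the supported types: gzip, bzip2, zip.
    "encoding": "utf-8"
}
```

Every column entry carries type plus exactly one of index or value, as the note for column requires. skipHeader is left out of the sketch because it is not supported for compressed files.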
Writer script demo
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "s3",
            "category": "writer",
            "name": "Writer",
            "parameter": {
                "datasource": "datasource1",
                "object": "test/csv_file.csv",
                "fileFormat": "csv",
                "encoding": "utf8/gbk/...",
                "fieldDelimiter": ",",
                "lineDelimiter": "\n",
                "column": [
                    "0",
                    "1"
                ],
                "header": [
                    "col_bigint",
                    "col_tinyint"
                ],
                "writeMode": "truncate",
                "writeSingleObject": true
            }
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "" // The error count.
        },
        "speed": {
            "throttle": true, // Specifies whether to enable throttling. A value of false indicates that throttling is disabled and the mbps parameter does not take effect. A value of true indicates that throttling is enabled.
            "concurrent": 1, // The concurrency.
            "mbps": "12" // The throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
Writer script parameters
Parameter | Description | Required | Default value |
datasource | The name of the data source. In script mode, the value must match the name of a data source that you added. | Yes | N/A |
object | The name of the destination object. | Yes | N/A |
fileFormat | The file format of the object to write, for example csv (as used in the writer script demo) or text (the default). | Yes | text |
writeMode |
| Yes | append |
fieldDelimiter | The column delimiter for writing data. | No | Default value: (,) |
lineDelimiter | The line delimiter for writing data. | No | Default value: (\n) |
compress | The compression type. By default, this parameter is left empty, which indicates that no compression is applied. | No | No compression |
nullFormat | Standard strings in text files cannot represent null (null pointer). The data synchronization system uses | No | N/A |
header | The header to write, given as a list of column names. Example: ["col_bigint", "col_tinyint"]. | No | N/A |
writeSingleObject | Specifies whether to write data to a single file. true: writes data to a single file. false: writes data to multiple files. | No | false |
encoding | The encoding of the files to write. | No | utf-8 |
column | The column configuration for writing data. | Yes | N/A |
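To show how the writer parameters above combine, the following fragment sketches a parameter block that writes one tab-delimited text object with a header row. It is an illustration under stated assumptions, not an official example: the data source name my_s3_source, the object path test/export.txt, and the header names are hypothetical, and the header entries are assumed to line up one-to-one with the column entries, as they do in the writer script demo.

```json
"parameter": {
    "datasource": "my_s3_source",   // Hypothetical name; must match a data source that you added.
    "object": "test/export.txt",    // Hypothetical destination object.
    "fileFormat": "text",           // The documented default format.
    "encoding": "utf-8",
    "fieldDelimiter": "\t",         // Tab instead of the default comma.
    "lineDelimiter": "\n",
    "column": [                     // Same string-index style as the writer script demo.
        "0",
        "1",
        "2"
    ],
    "header": [                     // Hypothetical names; assumed to match column order.
        "user_id",
        "user_name",
        "created_at"
    ],
    "writeMode": "truncate",        // Value taken from the writer script demo.
    "writeSingleObject": true       // Write a single file rather than multiple files.
}
```

compress and nullFormat are omitted, so the writer falls back to its documented defaults of no compression and no null placeholder.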