Amazon Simple Storage Service (Amazon S3) Reader is used to read data from Amazon S3 buckets. This topic describes the data types and parameters that are supported by Amazon S3 Reader and how to configure Amazon S3 Reader by using the codeless user interface (UI) and the code editor.

Background information

Amazon S3 Reader reads data stored in Amazon S3 buckets. It uses the Amazon S3 SDK for Java provided by Amazon to read data from Amazon S3, converts the data to a format that Data Integration can process, and sends the converted data to a writer.

Amazon S3 stores unstructured data. Amazon S3 Reader provides the following features:
  • Reads data from TXT objects. The data in the TXT objects must be logical two-dimensional tables.
  • Reads data from CSV-like objects with custom delimiters.
  • Reads data of various types as strings and supports constants and column pruning.
  • Supports recursive data read and object name-based filtering.
  • Supports object compression. The following compression formats are supported: GZIP, BZIP2, and ZIP.
  • Uses parallel threads to read data from multiple objects at the same time.
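
For example, a TXT object that Amazon S3 Reader can process is a plain-text file in which each line is a record and the columns are separated by a delimiter. The following sketch shows sample object content and a minimal reader configuration that reads it; the object path test/sample.txt, the data source name, and the column values are illustrative assumptions, not values from this topic:

    // Sample content of test/sample.txt (comma-delimited, one record per line):
    // 1,alice,20.5
    // 2,bob,30.0
    "parameter": {
        "datasource": "my_s3_source",       // Assumed data source name.
        "object": ["test/sample.txt"],      // Assumed object path.
        "fieldDelimiter": ",",              // Columns in the object are separated by commas.
        "column": [
            {"index": 0, "type": "long"},   // First column, read as LONG.
            {"index": 1, "type": "string"}, // Second column, read as STRING.
            {"index": 2, "type": "double"}  // Third column, read as DOUBLE.
        ]
    }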

Limits

  • Amazon S3 data sources in the Chinese mainland and Hong Kong (China) are not supported.
  • Amazon S3 Reader does not support the following features:
    • Using parallel threads to read data from a single object.
    • Using parallel threads to read data from a compressed object.
    • Reading data from an object that exceeds 100 GB in size.

Data types

Category       | Data Integration data type | Amazon S3 data type
Integer        | LONG                       | LONG
Floating point | DOUBLE                     | DOUBLE
String         | STRING                     | STRING
Date and time  | DATE                       | DATE
Boolean        | BOOL                       | BOOL

Configure Amazon S3 Reader by using the code editor

  • Parameters
    Parameter Description Required Default value
    datasource The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. Yes N/A
    Object The name of the Amazon S3 object. You can specify multiple objects from which Amazon S3 Reader reads data. For example, if a bucket contains the test folder in which the ll.txt object resides, the name of this object is test/ll.txt.
    • If you specify a single Amazon S3 object, Amazon S3 Reader uses a single thread to read data.
    • If you specify multiple Amazon S3 objects, Amazon S3 Reader uses parallel threads to read data. The number of threads is determined by the number of channels.
    • If you specify a name that contains a wildcard, Amazon S3 Reader reads data from all objects that match the name. For example, if you set this parameter to abc[0-9], Amazon S3 Reader reads data from objects abc0 to abc9. We recommend that you do not use wildcards because an out of memory (OOM) error may occur.
    Note
    • Data Integration considers all objects that are read in a synchronization node as a single table. Make sure that all objects that are read in a synchronization node use the same schema.
    • Control the number of objects stored in a folder. If a folder contains excessive objects, an OOM error may occur. In this case, store the objects in different folders before you synchronize data.
    Yes N/A
    column The columns from which you want to read data. The type parameter specifies the source data type. The index parameter specifies the ID of the column in the source object, starting from 0. The value parameter specifies the column value if the column is a constant column. Amazon S3 Reader does not read a constant column from the source. Instead, Amazon S3 Reader generates a constant column based on the value that you specify.
    You can specify the column parameter in the following format. In this case, Amazon S3 Reader reads all data as strings.
    column": ["*"]
    You can also specify a column to read and a constant column in the following format:
    "column":    
    {       
    "type": "long",       
    "index": 0 // The first INT-type column in the object from which you want to read data. 
    },    
    {       
    "type": "string",       
    "value": "alibaba" // The value of the current column. In this code, the value is the constant alibaba.     
    }
    Note In the column parameter, you must specify the type parameter and either the index or value parameter.
    Yes *, which indicates that Amazon S3 Reader reads all data as strings.
    fieldDelimiter The column delimiter that is used in the Amazon S3 object from which you want to read data.
    Note
    • Amazon S3 Reader uses the column delimiter to parse each row into columns. If you do not specify this parameter, the default column delimiter, a comma (,), is used.
    • If the delimiter is a non-printable character, enter its Unicode-encoded value, such as \u001b or \u007c.
    Yes Comma (,)
    compress The format in which objects are compressed. By default, this parameter is left empty, which means that objects are not compressed. Amazon S3 Reader supports the following compression formats: GZIP, BZIP2, and ZIP. No Empty
    encoding The encoding format of the objects from which you want to read data. No utf-8
    nullFormat The string that represents a null value. TXT objects have no standard string that represents a null value, so you can use this parameter to define one. For example, if you set the nullFormat parameter to null, Amazon S3 Reader considers the string null as a null value. You can escape empty strings in the following format: \N=\\N. No N/A
    skipHeader Specifies whether to skip the headers in a CSV-like object. Valid values:
    • true: Amazon S3 Reader skips the headers in a CSV-like object.
    • false: Amazon S3 Reader does not skip the headers and reads them as data.
    Note The skipHeader parameter is unavailable for compressed objects.
    No false
    csvReaderConfig The configurations required to read CSV-like objects. The parameter value must be of the MAP type. A CSV-like object reader is used to read data from CSV-like objects, and this reader supports multiple configurations. If no configuration is specified, the default settings are used. A sample configuration is shown in the sketch that follows the sample script below. No N/A
  • In the following sample code, a synchronization node is configured to read data from an Amazon S3 bucket. For more information about how to configure a synchronization node by using the code editor, see Create a synchronization node by using the code editor. The following code provides a sample script:
    {
        "type":"job",
        "version":"2.0",// The version number. 
        "steps":[
            {
                "stepType":"s3",// The reader type. 
                "parameter":{
                    "nullFormat":"",// The string that represents a null pointer. 
                    "compress":"",// The format in which objects are compressed. 
                    "datasource":"",// The name of the data source. 
                    "column":[// The columns from which you want to read data. 
                        {
                            "index":0,// The ID of a column in the source object. 
                            "type":"string"// The data type of the column. 
                        },
                        {
                            "index":1,
                            "type":"long"
                        },
                        {
                            "index":2,
                            "type":"double"
                        },
                        {
                            "index":3,
                            "type":"boolean"
                        },
                        {
                            "format":"yyyy-MM-dd HH:mm:ss", // The time format. 
                            "index":4,
                            "type":"date"
                        }
                    ],
                    "skipHeader":"",// Specifies whether to skip the headers in a CSV-like object. 
                    "encoding":"",// The encoding format. 
                    "fieldDelimiter":",",// The column delimiter. 
                    "fileFormat": "",// The format of the object. 
                    "object":[]// The name of the object from which you want to read data. 
                },
                "name":"Reader",
                "category":"reader"
            },
            {
                "stepType":"stream",
                "parameter":{},
                "name":"Writer",
                "category":"writer"
            }
        ],
        "setting":{
            "errorLimit":{
                "record":""// The maximum number of dirty data records allowed. 
            },
            "speed":{
                "throttle":true,// Specifies whether to enable bandwidth throttling. A value of false indicates that bandwidth throttling is disabled, and a value of true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
                "concurrent":1 // The maximum number of parallel threads. 
                "mbps":"12",// The maximum transmission rate.
            }
        },
        "order":{
            "hops":[
                {
                    "from":"Reader",
                    "to":"Writer"
                }
            ]
        }
    }
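
The csvReaderConfig parameter accepts a MAP of settings that are passed to the underlying CSV reader. The following sketch shows one possible configuration; the option names and values are assumptions based on common CSV reader options, so verify them against your Data Integration version before use:

    "csvReaderConfig": {
        "safetySwitch": false,       // Assumed option: do not limit the length of a single field.
        "skipEmptyRecords": false,   // Assumed option: do not skip empty lines.
        "useTextQualifier": false    // Assumed option: do not treat quotation marks as text qualifiers.
    }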

Configure Amazon S3 Reader by using the codeless UI

  1. Configure data sources.
    Set parameters in the Source and Target section for the synchronization node.
    Parameter Description
    Connection The name of the data source from which you want to read data. This parameter is equivalent to the datasource parameter that you set when you use the code editor.
    Object Name (Path Included) The name of the object from which you want to read data. This parameter is equivalent to the Object parameter that you set when you use the code editor.
    Note If an Amazon S3 object is named based on the date, such as aaa/20171024abc.txt, you can set this parameter to aaa/${bdp.system.bizdate}abc.txt.
    Field Delimiter The column delimiter. This parameter is equivalent to the fieldDelimiter parameter that you set when you use the code editor. By default, a comma (,) is used as the column delimiter.
    Encoding The encoding format. This parameter is equivalent to the encoding parameter that you set when you use the code editor. Default value: UTF-8.
    Null String The string that represents a null value. This parameter is equivalent to the nullFormat parameter that you set when you use the code editor. If the source data contains the specified string, the string is replaced with null.
    Compression Format The format in which objects are compressed. This parameter is equivalent to the compress parameter that you set when you use the code editor. By default, objects are not compressed.
    Include Header Specifies whether to skip the headers in the object. This parameter is equivalent to the skipHeader parameter that you set when you use the code editor. Default value: No.
  2. Configure field mappings. This operation is equivalent to setting the column parameter when you use the code editor.
    Fields in the source on the left have a one-to-one mapping with fields in the destination on the right. You can click Add to add a field. To remove an added field, move the pointer over the field and click the Remove icon.
    Parameter Description
    Map Fields with the Same Name Click Map Fields with the Same Name to establish mappings between fields with the same name. The data types of the fields must match.
    Map Fields in the Same Line Click Map Fields in the Same Line to establish mappings between fields in the same row. The data types of the fields must match.
    Delete All Mappings Click Delete All Mappings to remove the mappings that have been established.
  3. Configure channel control policies.
    Parameter Description
    Expected Maximum Concurrency The maximum number of parallel threads that the synchronization node can use to read data from the source or write data to the destination. You can configure the parallelism for the synchronization node on the codeless UI.
    Bandwidth Throttling Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.
    Dirty Data Records Allowed The maximum number of dirty data records allowed.
    Distributed Execution Not supported.
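
The settings in step 1 map directly to the code editor parameters described earlier in this topic. As a reference, the following sketch shows one possible code-editor equivalent of those settings; the data source name, object path, and other values are illustrative assumptions:

    "parameter": {
        "datasource": "my_s3_source",                   // Connection (assumed data source name).
        "object": ["aaa/${bdp.system.bizdate}abc.txt"], // Object Name (Path Included), using the scheduling parameter from the note in step 1.
        "fieldDelimiter": ",",                          // Field Delimiter.
        "encoding": "utf-8",                            // Encoding.
        "nullFormat": "null",                           // Null String (assumed value).
        "compress": "",                                 // Compression Format (empty: objects are not compressed).
        "skipHeader": "false"                           // Include Header (default value).
    }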