
Configure OSS reader

Last Updated: Apr 13, 2018

The OSS Reader plug-in provides the ability to read data from OSS data storage. At the underlying implementation level, OSS Reader acquires the OSS data using the official OSS Java SDK, converts the data to the data synchronization protocol, and passes it to the Writer.

OSS Reader provides the ability to read data from a remote OSS file and convert it to the Data Integration/datax protocol. An OSS file itself is unstructured data storage. For Data Integration/datax, OSS Reader currently supports the following features:

  • Only supports reading TXT files, and the schema in the TXT file must be a two‑dimensional table.

  • Supports CSV‑like format files with custom delimiters.

  • Supports reading multiple types of data (represented by Strings) and supports column pruning and column constants.

  • Supports recursive reading and filtering by File Name.

  • Supports text compression. The available compression formats include gzip, bzip2, and zip. Note: Multiple files cannot be compressed into one package.

  • Supports concurrent reading of multiple objects.

The following features are not supported currently:

  • Multi‑threaded concurrent reading of a single object (file).
  • Multi‑threaded concurrent reading of a single compressed object, which is not technically feasible.

OSS Reader supports the following OSS data types: BIGINT, DOUBLE, STRING, DATETIME, and BOOLEAN.
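As a rough sketch, these data types correspond to type declarations in the column configuration described below. The keywords long, string, and date appear in the samples in this article; double and boolean are assumed by analogy and should be verified against your environment:

    "column": [
        { "type": "long", "index": 0 },                          // BIGINT
        { "type": "double", "index": 1 },                        // DOUBLE (assumed keyword)
        { "type": "string", "index": 2 },                        // STRING
        { "type": "date", "index": 3, "format": "yyyy-MM-dd" },  // DATETIME
        { "type": "boolean", "index": 4 }                        // BOOLEAN (assumed keyword)
    ]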

Parameter description

  • datasource

    • Description: Data source name. It must be identical to the name of the added data source. Adding a data source is supported in script mode.

    • Required: Yes

    • Default value: None

  • Object

    • Description: The object information of the OSS. Multiple objects can be specified. For example, if the bucket xxx contains a yunshi folder, and the folder contains the file ll.txt, the object can be specified directly as yunshi/ll.txt. A configuration sketch follows this parameter description.

      • If a single OSS object is specified, OSS Reader only supports single-threaded data extraction. The ability to concurrently read a single uncompressed object with multiple threads is planned.
      • If multiple OSS objects are specified, OSS Reader can extract data with multiple threads. The number of concurrent threads is specified based on the number of channels.
      • If a wildcard is specified, OSS Reader attempts to traverse multiple objects. For more information, see OSS Product Overview.

        Note:

        The data synchronization system identifies all objects synchronized in one job as the same data table. You must make sure that all objects conform to the same schema information.

    • Required: Yes

    • Default value: None
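      For example, a minimal object configuration might list a single file, or use a wildcard to traverse multiple objects. The paths reuse the illustrative yunshi example above:

          "object": [
              "yunshi/ll.txt",  // a single object, extracted by one thread
              "yunshi/*"        // a wildcard that traverses multiple objects
          ]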

  • column

    • Description: The list of fields to be read, where the type indicates the type of the source data, the index indicates in which column the current field is located (starting from 0), and the value indicates that the current type is a constant: the data is not read from the source file, but the corresponding column is generated automatically according to the value.

      By default, you can read data by taking String as the only type. The configuration is as follows:

          "column": ["*"]

      You can configure the column field as follows:

          "column": [
              {
                  "type": "long",
                  "index": 0            // Read the int field from the first column of the OSS text
              },
              {
                  "type": "string",
                  "value": "alibaba"    // OSS Reader internally generates the string "alibaba" as the current field
              }
          ]

      Note:

      For the specified column information, you must enter type, and choose either index or value.

    • Required: Yes

    • Default value: Read data by taking string as the only type

  • fieldDelimiter

    • Description: The delimiter used to separate the fields that are read. Note: A field delimiter is required when OSS Reader reads data. If none is specified, the default "," from the interface configuration is used. An example follows this parameter description.

    • Required: Yes

    • Default value: ,
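      For example, for a tab-separated source file (consistent with the script sample at the end of this article):

          "fieldDelimiter": "\t"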

  • compress

    • Description: Compression type of the files. It is left empty by default, which means no compression is performed. The following compression types are supported: gzip, bzip2, and zip. See the example below.

    • Required: No

    • Default value: No compression
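      For example, to read gzip-compressed objects (matching the script sample below):

          "compress": "gzip"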

  • encoding

    • Description: Encoding of the read files.

    • Required: No

    • Default value: utf-8

  • nullFormat

    • Description: A text file cannot define null (a null pointer) with a standard string, so the data synchronization system provides nullFormat to define which strings can be interpreted as null. For example, when nullFormat="null" is configured, if the source data is "null", it is treated as a null field by the data synchronization system. See the example below.

    • Required: No

    • Default value: \N
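      For example, to interpret the literal string "null" in the source data as a null field (the value is illustrative):

          "nullFormat": "null"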

  • skipHeader

    • Description: Whether to skip the header of a CSV‑like file when the first row is a title. Headers are not skipped by default. skipHeader is not supported for compressed files. See the example below.
    • Required: No

    • Default value: False
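      For example, a sketch that skips the title row (assuming a boolean value, consistent with the default of False above):

          "skipHeader": true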

Development in wizard mode

(Figure: wizard mode configuration for OSS Reader)

  • Data Sources: datasource in the preceding parameter description. Choose oss.

  • Object Prefix: Object in the preceding parameter description.

  • Version of the Type: The type version of the current data source.

  • Column Delimiter: fieldDelimiter in the preceding parameter description, which defaults to “,”.

  • Encoding Format: encoding in the preceding parameter description, which defaults to utf-8.

  • null Value: nullFormat in the preceding parameter description. Enter the string to be interpreted as null into the text box. If the string exists in the source data, the corresponding field is converted to null.

  • Compression Format: compress in the preceding parameter description, which defaults to “no compression”.

  • Whether Header: skipHeader in the preceding parameter description, which defaults to "No".

  • Field Mapping: column in the preceding parameter description. Column field information can be specified.

Development in script mode

The following is a script configuration sample. For more information about parameters, see Parameter description.

    {
        "type": "job",
        "version": "1.0",
        "configuration": {
            "setting": {
                "key": "value"
            },
            "reader": {
                "plugin": "oss",
                "parameter": {
                    "datasource": "datasourceName",
                    "object": [
                        "yunshi/*"
                    ],
                    "column": [
                        {
                            "type": "int",
                            "index": 0
                        },
                        {
                            "type": "string",
                            "value": "alibaba"
                        },
                        {
                            "type": "date",
                            "index": 1,
                            "format": "yyyy-MM-dd"
                        }
                    ],
                    "encoding": "UTF-8",
                    "fieldDelimiter": "\t",
                    "compress": "gzip"
                }
            },
            "writer": {
            }
        }
    }