
DataWorks:TOS

Last Updated: Mar 26, 2026

The TOS data source connector reads files from Tinder Object Storage (TOS), parses them, and syncs the data to a destination in DataWorks.

Role: Source only. TOS cannot be used as a sync destination.

Task type: Offline sync, single-table mode.

Supported capabilities

Capability Supported
Source (read) Yes
Destination (write) No
Offline sync Yes
Real-time sync No
Single-table mode Yes
Multi-table mode No

Supported file formats

Format Notes
csv Supports header skipping, custom delimiters, and null value mapping
text Supports custom row delimiters
parquet Requires parquetSchema
orc No additional configuration required

Supported compression formats

Format Supported
gzip Yes
bzip2 Yes
zip Yes
Note: Compression cannot be combined with skipHeader. If skipHeader is configured, the source files must be uncompressed.

Supported field types

Field type Description
STRING Text
LONG Integer
BYTES Byte array. Read text is converted to a byte array using UTF-8 encoding.
BOOL Boolean
DOUBLE Floating-point number
DATE Date and time. Supported formats: yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, HH:mm:ss

Add a TOS data source

Add TOS as a data source in DataWorks before creating a sync task. Follow the instructions in Data source management. Parameter descriptions are available in the DataWorks console when you add the data source.

Configure a data sync task

Configure TOS as the Reader in an offline sync task. Two configuration methods are available: the codeless UI in the DataWorks console, and the code editor (script mode). The following sections describe the script-mode configuration.

Script sample and parameter descriptions

Script sample

The following script configures TOS as the Reader in a batch synchronization task. All parameters are set in the parameter block under the tos step.

{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "tos",
      "parameter": {
        "datasource": "",
        "object": ["f/z/1.csv"],
        "fileFormat": "csv",
        "encoding": "utf8/gbk/...",
        "fieldDelimiter": ",",
        "useMultiCharDelimiter": true,
        "skipHeader": true,
        "compress": "zip/gzip",
        "column": [
          {
            "index": 0,
            "type": "long"
          },
          {
            "index": 1,
            "type": "boolean"
          },
          {
            "index": 2,
            "type": "double"
          },
          {
            "index": 3,
            "type": "string"
          },
          {
            "index": 4,
            "type": "date"
          }
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0"
    },
    "speed": {
      "concurrent": 1
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}

Common parameters

Parameter Type Required Default Description
datasource String Yes None Data source name. Must match the name added in the DataWorks console.
fileFormat String Yes None File format. Valid values: csv, text, parquet, orc.
object String or array Yes None Path to the file or files to read. Supports the * wildcard and arrays. To read a/b/1.csv and a/b/2.csv, set this to a/b/*.csv.
column Array Yes All columns as STRING Columns to read. See Column configuration.
fieldDelimiter String Yes , Field delimiter. For non-printable characters, use Unicode encoding, for example \u001b or \u007c.
lineDelimiter String No None Row delimiter. Valid only when fileFormat is text.
compress String No None (uncompressed) Compression format. Valid values: gzip, bzip2, zip.
encoding String No utf-8 File encoding.
nullFormat String No None String in the source file that represents a null value. Set "nullFormat": "null" to treat the literal string "null" as null, or "nullFormat": "\u0001" to treat the non-printable character \u0001 as null. If not set, no conversion is applied and the source value is written as-is.
skipHeader Boolean No false For CSV files, whether to skip the header row. true: the header row is skipped and not synced as data. false: the header row is read as a data row. Not supported for compressed files.
csvReaderConfig Map No None Advanced configuration for reading CSV files. Default values are used if not set.
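
As an illustration, several of the common parameters above can be combined in one parameter block. The data source name, object path, and nullFormat value below are placeholders, not values from your environment:

```json
{
  "datasource": "my_tos_source",
  "fileFormat": "csv",
  "object": ["a/b/*.csv"],
  "fieldDelimiter": ",",
  "encoding": "utf-8",
  "compress": "gzip",
  "nullFormat": "\\N",
  "column": ["*"]
}
```

Because skipHeader is not supported for compressed files, it is omitted here; remove compress before enabling skipHeader if the source files contain a header row.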

Column configuration

The column parameter controls which columns to read and how to map their types.

By default, all columns are read as STRING:

"column": ["*"]

To specify column types explicitly, provide an array of column definitions. Each definition requires type and either index or value:

  • index: Zero-based column position in the source file.

  • value: A constant. Creates a column with a fixed value instead of reading from the file.

Example:

"column": [
  { "type": "long", "index": 0 },
  { "type": "string", "value": "alibaba" }
]

The second entry produces a column with the constant value "alibaba" for every row, regardless of the source data.

Use explicit column definitions when you need to:

  • Read only a subset of columns.

  • Enforce specific data types instead of relying on the default STRING type.

  • Add constant-value columns to the output.

Format-specific configuration

CSV

CSV files use fieldDelimiter and skipHeader for parsing control. Use csvReaderConfig for advanced options such as quote characters and multi-line records.

For files with non-standard delimiters, specify the delimiter using Unicode encoding. For example, use \u007c for the pipe character (|).
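
For example, a sketch of a CSV configuration that reads pipe-delimited files using the Unicode escape for the delimiter (the object path is illustrative):

```json
{
  "fileFormat": "csv",
  "object": ["logs/2024/*.csv"],
  "fieldDelimiter": "\u007c",
  "skipHeader": true,
  "column": ["*"]
}
```

Note that a JSON parser resolves \u007c to the literal | character, so both spellings configure the same delimiter.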

Parquet

Use the parquetSchema parameter when fileFormat is parquet. This parameter is ignored for other formats.

parquetSchema defines the schema of the Parquet file:

message MessageTypeName {
  Rule DataType FieldName;
  ...;
}
  • MessageTypeName: Name of the message type.

  • Rule: Use required for non-null fields, optional for nullable fields. Set all fields to optional unless you have a specific reason not to.

  • DataType: Valid values are boolean, int32, int64, int96, float, double, binary, and fixed_len_byte_array. Use binary for string fields.

  • Each field definition must end with a semicolon (;), including the last field.

Example:

"parquetSchema": "message m { optional int32 minute_id; optional int32 dsp_id; optional int32 adx_pid; optional int64 req; optional int64 res; optional int64 suc; optional int64 imp; optional double revenue; }"

Make sure the full configuration remains valid JSON after adding parquetSchema.
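
Putting this together, a hypothetical Parquet reader parameter block might look like the following (the data source name, object path, and field names are placeholders):

```json
{
  "datasource": "my_tos_source",
  "fileFormat": "parquet",
  "object": ["warehouse/ads/part-*.parquet"],
  "parquetSchema": "message m { optional int32 minute_id; optional int64 req; optional double revenue; }",
  "column": [
    { "index": 0, "type": "long" },
    { "index": 1, "type": "long" },
    { "index": 2, "type": "double" }
  ]
}
```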

Text

Set lineDelimiter to define the row separator when reading plain text files.
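
A minimal text-format sketch, assuming tab-separated fields and newline-terminated rows (the object path is illustrative):

```json
{
  "fileFormat": "text",
  "object": ["raw/events.txt"],
  "fieldDelimiter": "\t",
  "lineDelimiter": "\n",
  "column": [
    { "index": 0, "type": "string" },
    { "index": 1, "type": "long" }
  ]
}
```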

ORC

No format-specific configuration is required for ORC files. Use the column parameter to select and type-map columns as needed.
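
For ORC, a sketch that reads only the first and third columns and maps their types (the object path is a placeholder):

```json
{
  "fileFormat": "orc",
  "object": ["orc/table1/*"],
  "column": [
    { "index": 0, "type": "string" },
    { "index": 2, "type": "double" }
  ]
}
```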

What's next