COS (Tencent Cloud Object Storage) is a read-only data source in DataWorks. Connect it to read files from a COS bucket and sync the data to any supported destination.
## Supported capabilities
| Capability | Supported |
|---|---|
| Offline sync — read (source) | Yes |
| Offline sync — write (destination) | No |
| Real-time sync | No |
## Prerequisites
Before you begin, ensure that you have:
- A Tencent Cloud account with a COS bucket
- A SecretId and SecretKey with read access to the bucket — get them from API Key Management in the Tencent Cloud console
- The region ID and endpoint of the bucket — see Regions and Endpoints
- A DataWorks workspace
## Supported data types
| Data type | Description |
|---|---|
| STRING | Text |
| LONG | Integer |
| DOUBLE | Floating-point number |
| BOOL | Boolean |
| DATE | Date and time. Supported format: yyyy-MM-dd HH:mm:ss |
| BYTES | Byte array. Text content is converted to a UTF-8 encoded byte array. |
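The type conversions in the table can be illustrated with a short sketch. This is not DataWorks code; the `convert` helper and its behavior are assumptions for illustration only:

```python
from datetime import datetime

# Illustrative mapping of the reader's type names to Python conversions.
# The convert() helper is a sketch, not a DataWorks API.
def convert(value: str, col_type: str):
    if col_type == "long":
        return int(value)
    if col_type == "double":
        return float(value)
    if col_type == "bool":
        return value.strip().lower() == "true"
    if col_type == "date":
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    if col_type == "bytes":
        return value.encode("utf-8")  # text becomes a UTF-8 byte array
    return value  # string: passed through as-is

print(convert("42", "long"))    # 42
print(convert("true", "bool"))  # True
print(convert("abc", "bytes"))  # b'abc'
```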
## Create a data source
Create a COS data source before configuring any sync task. For the full procedure, see Data Source Management.
The key parameters are described below.
| Parameter | Description |
|---|---|
| Data Source Name | A unique name within the workspace. Use letters, digits, and underscores (_). The name cannot start with a digit or an underscore. |
| Region | The region where the bucket is located. Enter the region ID. See Regions and Endpoints. |
| Bucket | The name of the COS bucket. |
| Endpoint | The endpoint of COS. See Regions and Endpoints. |
| AccessKey ID | Corresponds to SecretId on Tencent Cloud. Get it from API Key Management. |
| AccessKey Secret | Corresponds to SecretKey on Tencent Cloud. Get it from API Key Management. |
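The naming rule for Data Source Name can be checked locally before you create the source. A minimal sketch, assuming the rule is exactly as stated above (letters, digits, underscores; first character must be a letter):

```python
import re

# Letters, digits, and underscores only; must not start with a digit
# or an underscore, so the first character must be a letter.
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def is_valid_name(name: str) -> bool:
    return bool(NAME_RE.match(name))

print(is_valid_name("cos_source_1"))  # True
print(is_valid_name("1_cos"))         # False (starts with a digit)
print(is_valid_name("_cos"))          # False (starts with an underscore)
```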
## Configure an offline sync task for a single table
To configure a COS offline sync task, use the codeless UI or the code editor.
- Codeless UI: See Configure a task in the codeless UI.
- Code editor: See Configure a task in the code editor. For the full parameter reference and a script example, see Appendix: Script demo and parameters.
## Appendix: Script demo and parameters
### Reader script demo
```json
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "cos",
            "parameter": {
                "datasource": "",
                "object": ["f/z/1.csv"],
                "fileFormat": "csv",
                "encoding": "utf8/gbk/...",
                "fieldDelimiter": ",",
                "useMultiCharDelimiter": true,
                "lineDelimiter": "\n",
                "skipHeader": true,
                "compress": "zip/gzip",
                "column": [
                    { "index": 0, "type": "long" },
                    { "index": 1, "type": "boolean" },
                    { "index": 2, "type": "double" },
                    { "index": 3, "type": "string" },
                    { "index": 4, "type": "date" }
                ]
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "0"
        },
        "speed": {
            "concurrent": 1
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
```
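Before submitting a script like the one above, it can help to sanity-check the JSON structure locally. A minimal sketch, not an official DataWorks validator; the checks are assumptions about the job shape shown in the demo:

```python
import json

# Trimmed version of the job script above.
job_text = '''
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {"stepType": "cos",
     "parameter": {"datasource": "", "object": ["f/z/1.csv"],
                   "fileFormat": "csv",
                   "column": [{"index": 0, "type": "long"}]},
     "name": "Reader", "category": "reader"},
    {"stepType": "stream", "parameter": {},
     "name": "Writer", "category": "writer"}
  ],
  "order": {"hops": [{"from": "Reader", "to": "Writer"}]}
}
'''

job = json.loads(job_text)
names = {step["name"] for step in job["steps"]}

# Every hop must connect two declared steps.
for hop in job["order"]["hops"]:
    assert hop["from"] in names and hop["to"] in names

# A single-table sync has exactly one reader and one writer.
categories = [step["category"] for step in job["steps"]]
print(categories.count("reader"), categories.count("writer"))  # 1 1
```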
### Reader script parameters
#### Connection
| Parameter | Description | Required | Default |
|---|---|---|---|
| datasource | The data source name. Must match the name of the data source you added. | Yes | None |
#### Locate files
| Parameter | Description | Required | Default |
|---|---|---|---|
| object | The file path. Supports the asterisk (`*`) wildcard and can be configured as an array. For example, to sync `a/b/1.csv` and `a/b/2.csv`, set this to `a/b/*.csv`. | Yes | None |
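The `object` wildcard behaves like shell-style globbing. As a rough local illustration only (Python's `fnmatch`, not the reader's actual matcher):

```python
from fnmatch import fnmatch

# Hypothetical object keys in a bucket, used only for this sketch.
objects = ["a/b/1.csv", "a/b/2.csv", "a/b/readme.txt", "a/c/3.csv"]

# Keep only the keys that match the pattern from the example above.
matched = [o for o in objects if fnmatch(o, "a/b/*.csv")]
print(matched)  # ['a/b/1.csv', 'a/b/2.csv']
```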
#### Parse files
| Parameter | Description | Required | Default |
|---|---|---|---|
| fileFormat | The format of the source file. Valid values: `csv`, `text`, `parquet`, `orc`. | Yes | None |
| column | The fields to read. Each entry specifies:<br>- `type`: the data type in the source<br>- `index`: the column position (0-based)<br>- `value`: a constant value; no data is read from the source, and a column is generated with this value instead<br><br>To read all columns as STRING, use `"column": ["*"]`.<br><br>To specify individual columns:<br>`"column": [ { "type": "long", "index": 0 }, { "type": "string", "value": "alibaba" } ]`<br><br>**Important**: Each entry must include `type`, and either `index` or `value`, but not both. | Yes | All columns as STRING |
| fieldDelimiter | The field separator. For invisible characters, use Unicode encoding (for example, `\u001b` or `\u007c`). | Yes | `,` |
| lineDelimiter | The row separator. Valid only when `fileFormat` is `text`. | No | None |
| encoding | The file encoding. | No | utf-8 |
| compress | The compression format. Supported values: `gzip`, `bzip2`, `zip`. Leave blank if the file is not compressed. | No | Uncompressed |
| skipHeader | For a CSV file, specifies whether to read the table header.<br>- `true`: The table header is read during data synchronization.<br>- `false`: The table header is not read during data synchronization.<br><br>**Note**: Not supported for compressed files. | No | false |
| nullFormat | The string to treat as a null value. For example, if set to `"null"`, any field containing the text `null` is written to the destination as null. If not set, source data is written as-is without conversion. | No | None |
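The `skipHeader` and `nullFormat` semantics can be sketched locally with the standard `csv` module. This is an illustrative model of the behavior described above, not the DataWorks reader itself:

```python
import csv
import io

# Sample CSV content used only for this sketch.
raw = "id,name\n1,null\n2,alice\n"
rows = list(csv.reader(io.StringIO(raw)))

skip_header = True    # corresponds to skipHeader
null_format = "null"  # corresponds to nullFormat

if skip_header:
    rows = rows[1:]  # drop the header row
# Replace fields that exactly match nullFormat with a real null.
rows = [[None if field == null_format else field for field in row]
        for row in rows]
print(rows)  # [['1', None], ['2', 'alice']]
```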
#### Advanced options
| Parameter | Description | Required | Default |
|---|---|---|---|
| parquetSchema | Required when `fileFormat` is `parquet`. Defines the data structure in the following format:<br>`message MessageTypeName { required\|optional DataType ColumnName; ... }`<br><br>Supported data types: `boolean`, `int32`, `int64`, `int96`, `float`, `double`, `binary` (use for string types), `fixed_len_byte_array`. Set all fields to `optional` unless null values are not allowed. Each field definition must end with a semicolon (;), including the last one.<br><br>Example:<br>`{"parquetSchema": "message UserProfile { optional int32 minute_id; optional int32 dsp_id; optional int32 adx_pid; optional int64 req; optional int64 res; optional int64 suc; optional int64 imp; optional double revenue; }"}` | No (required for Parquet) | None |
| csvReaderConfig | Additional parameters for reading CSV files, passed as a map. Uses defaults if not specified. | No | None |
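As an illustration of the map shape that `csvReaderConfig` accepts, the fragment below shows how it might sit inside the reader's `parameter` block. The key names here are assumptions borrowed from CsvReader-style options seen in similar reader documentation; verify the supported keys against your DataWorks version before using them:

```json
{
    "csvReaderConfig": {
        "safetySwitch": false,
        "skipEmptyRecords": true,
        "useTextQualifier": true
    }
}
```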