COS (Tencent Cloud Object Storage) is a read-only data source in DataWorks. Connect it to read files from a COS bucket and sync the data to any supported destination.
## Supported capabilities
| Capability | Supported |
|---|---|
| Offline sync — read (source) | Yes |
| Offline sync — write (destination) | No |
| Real-time sync | No |
## Prerequisites
Before you begin, ensure that you have:
- A Tencent Cloud account with a COS bucket
- A SecretId and SecretKey with read access to the bucket — get them from API Key Management in the Tencent Cloud console
- The region ID and endpoint of the bucket — see Regions and Endpoints
- A DataWorks workspace
## Supported data types
| Data type | Description |
|---|---|
| STRING | Text |
| LONG | Integer |
| DOUBLE | Floating-point number |
| BOOL | Boolean |
| DATE | Date and time. Supported format: yyyy-MM-dd HH:mm:ss |
| BYTES | Byte array. Text content is converted to a UTF-8 encoded byte array. |
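The type conversions in the table can be illustrated with a short sketch. This is not DataWorks code; the `convert` helper and its behavior are assumptions for illustration only:

```python
from datetime import datetime

# Illustrative mapping of the reader's type names to Python conversions.
# The convert() helper is a sketch, not a DataWorks API.
def convert(value: str, col_type: str):
    if col_type == "long":
        return int(value)
    if col_type == "double":
        return float(value)
    if col_type == "bool":
        return value.strip().lower() == "true"
    if col_type == "date":
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    if col_type == "bytes":
        return value.encode("utf-8")  # text becomes a UTF-8 byte array
    return value  # string: passed through as-is

print(convert("42", "long"))    # 42
print(convert("true", "bool"))  # True
print(convert("abc", "bytes"))  # b'abc'
```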
## Create a data source
Create a COS data source before configuring any sync task. For the full procedure, see Data Source Management.
The key parameters are described below.
| Parameter | Description |
|---|---|
| Data Source Name | A unique name within the workspace. Use letters, digits, and underscores (_). The name cannot start with a digit or an underscore. |
| Region | The region where the bucket is located. Enter the region ID. See Regions and Endpoints. |
| Bucket | The name of the COS bucket. |
| Endpoint | The endpoint of COS. See Regions and Endpoints. |
| AccessKey ID | Corresponds to SecretId on Tencent Cloud. Get it from API Key Management. |
| AccessKey Secret | Corresponds to SecretKey on Tencent Cloud. Get it from API Key Management. |
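The naming rule for Data Source Name can be checked locally before you create the source. A minimal sketch, assuming the rule is exactly as stated above (letters, digits, underscores; first character must be a letter):

```python
import re

# Letters, digits, and underscores only; must not start with a digit
# or an underscore, so the first character must be a letter.
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def is_valid_name(name: str) -> bool:
    return bool(NAME_RE.match(name))

print(is_valid_name("cos_source_1"))  # True
print(is_valid_name("1_cos"))         # False (starts with a digit)
print(is_valid_name("_cos"))          # False (starts with an underscore)
```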
## Configure an offline sync task for a single table
To configure a COS offline sync task, use the codeless UI or the code editor.
- Codeless UI: See Configure a task in the codeless UI.
- Code editor: See Configure a task in the code editor. For the full parameter reference and a script example, see Appendix: Script demo and parameters.
## Appendix: Script demo and parameters
### Reader script demo
```json
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "cos",
            "parameter": {
                "datasource": "",
                "object": ["f/z/1.csv"],
                "fileFormat": "csv",
                "encoding": "utf8/gbk/...",
                "fieldDelimiter": ",",
                "useMultiCharDelimiter": true,
                "lineDelimiter": "\n",
                "skipHeader": true,
                "compress": "zip/gzip",
                "column": [
                    { "index": 0, "type": "long" },
                    { "index": 1, "type": "boolean" },
                    { "index": 2, "type": "double" },
                    { "index": 3, "type": "string" },
                    { "index": 4, "type": "date" }
                ]
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "0"
        },
        "speed": {
            "concurrent": 1
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
```
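Before submitting a script like the one above, it can help to sanity-check the JSON structure locally. A minimal sketch, not an official DataWorks validator; the checks are assumptions about the job shape shown in the demo:

```python
import json

# Trimmed version of the job script above.
job_text = '''
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {"stepType": "cos",
     "parameter": {"datasource": "", "object": ["f/z/1.csv"],
                   "fileFormat": "csv",
                   "column": [{"index": 0, "type": "long"}]},
     "name": "Reader", "category": "reader"},
    {"stepType": "stream", "parameter": {},
     "name": "Writer", "category": "writer"}
  ],
  "order": {"hops": [{"from": "Reader", "to": "Writer"}]}
}
'''

job = json.loads(job_text)
names = {step["name"] for step in job["steps"]}

# Every hop must connect two declared steps.
for hop in job["order"]["hops"]:
    assert hop["from"] in names and hop["to"] in names

# A single-table sync has exactly one reader and one writer.
categories = [step["category"] for step in job["steps"]]
print(categories.count("reader"), categories.count("writer"))  # 1 1
```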
### Reader script parameters
#### Connection
| Parameter | Description | Required | Default |
|---|---|---|---|
| datasource | The data source name. Must match the name of the data source you added. | Yes | None |
#### Locate files
| Parameter | Description | Required | Default |
|---|---|---|---|
| object | The file path. Supports the asterisk (`*`) wildcard and can be configured as an array. For example, to sync `a/b/1.csv` and `a/b/2.csv`, set this to `a/b/*.csv`. | Yes | None |
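The `object` wildcard behaves like shell-style globbing. As a rough local illustration only (Python's `fnmatch`, not the reader's actual matcher):

```python
from fnmatch import fnmatch

# Hypothetical object keys in a bucket, used only for this sketch.
objects = ["a/b/1.csv", "a/b/2.csv", "a/b/readme.txt", "a/c/3.csv"]

# Keep only the keys that match the pattern from the example above.
matched = [o for o in objects if fnmatch(o, "a/b/*.csv")]
print(matched)  # ['a/b/1.csv', 'a/b/2.csv']
```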
#### Parse files
| Parameter | Description | Required | Default |
|---|---|---|---|
| fileFormat | The format of the source file. Valid values: `csv`, `text`, `parquet`, `orc`. | Yes | None |
| column | The fields to read. Each entry specifies:<br>- `type`: the data type in the source<br>- `index`: the column position (0-based)<br>- `value`: a constant value; no data is read from the source, and a column is generated with this value instead<br><br>To read all columns as STRING, use `"column": ["*"]`.<br><br>To specify individual columns:<br>`"column": [ { "type": "long", "index": 0 }, { "type": "string", "value": "alibaba" } ]`<br><br>**Important**: Each entry must include `type`, and either `index` or `value`, but not both. | Yes | All columns as STRING |
| fieldDelimiter | The field separator. For invisible characters, use Unicode encoding (for example, `\u001b` or `\u007c`). | Yes | `,` |
| lineDelimiter | The row separator. Valid only when `fileFormat` is `text`. | No | None |
| encoding | The file encoding. | No | utf-8 |
| compress | The compression format. Supported values: `gzip`, `bzip2`, `zip`. Leave blank if the file is not compressed. | No | Uncompressed |
| skipHeader | For a CSV file, specifies whether to read the table header.<br>- `true`: The table header is read during data synchronization.<br>- `false`: The table header is not read during data synchronization.<br><br>**Note**: Not supported for compressed files. | No | false |
| nullFormat | The string to treat as a null value. For example, if set to `"null"`, any field containing the text `null` is written to the destination as null. If not set, source data is written as-is without conversion. | No | None |
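The `skipHeader` and `nullFormat` semantics can be sketched locally with the standard `csv` module. This is an illustrative model of the behavior described above, not the DataWorks reader itself:

```python
import csv
import io

# Sample CSV content used only for this sketch.
raw = "id,name\n1,null\n2,alice\n"
rows = list(csv.reader(io.StringIO(raw)))

skip_header = True    # corresponds to skipHeader
null_format = "null"  # corresponds to nullFormat

if skip_header:
    rows = rows[1:]  # drop the header row
# Replace fields that exactly match nullFormat with a real null.
rows = [[None if field == null_format else field for field in row]
        for row in rows]
print(rows)  # [['1', None], ['2', 'alice']]
```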
#### Advanced options
| Parameter | Description | Required | Default |
|---|---|---|---|
| parquetSchema | Required when `fileFormat` is `parquet`. Defines the data structure in the following format:<br>`message MessageTypeName { required\|optional DataType ColumnName; ... }`<br><br>Supported data types: `boolean`, `int32`, `int64`, `int96`, `float`, `double`, `binary` (use for string types), `fixed_len_byte_array`. Set all fields to `optional` unless null values are not allowed. Each field definition must end with a semicolon (;), including the last one.<br><br>Example:<br>`{"parquetSchema": "message UserProfile { optional int32 minute_id; optional int32 dsp_id; optional int32 adx_pid; optional int64 req; optional int64 res; optional int64 suc; optional int64 imp; optional double revenue; }"}` | No (required for Parquet) | None |
| csvReaderConfig | Additional parameters for reading CSV files, passed as a map. Uses defaults if not specified. | No | None |
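As an illustration of the map shape that `csvReaderConfig` accepts, the fragment below shows how it might sit inside the reader's `parameter` block. The key names here are assumptions borrowed from CsvReader-style options seen in similar reader documentation; verify the supported keys against your DataWorks version before using them:

```json
{
    "csvReaderConfig": {
        "safetySwitch": false,
        "skipEmptyRecords": true,
        "useTextQualifier": true
    }
}
```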