The OSS data source connects DataWorks to Object Storage Service (OSS) for both reading and writing data. This topic covers supported file formats, limits, and script parameters for the OSS data source.
Supported formats and limits
Offline read
OSS Reader reads objects from OSS and converts them to the DataWorks data integration protocol. OSS is an unstructured data store — it has no native schema, so all field structure must be confirmed in the task configuration.
| Supported | Not supported |
|---|---|
| TXT files (schema must be a two-dimensional table) | Multi-threaded concurrent reads for a single object |
| CSV-like files with custom delimiters | Multi-threaded concurrent reads for a single compressed object |
| ORC and Parquet formats | |
| Multiple data types (read as strings), column pruning, and column constants | |
| Recursive reads and file name filtering | |
| Concurrent reads for multiple objects | |
Compression support for text files (TXT and CSV)
| Format | Supported |
|---|---|
| gzip | Yes |
| bzip2 | Yes |
| zip | Yes |
A compressed package cannot contain multiple files.
Usage notes for CSV files
- CSV files must be in standard CSV format. If a column contains a double quotation mark ("), escape it as two double quotation marks (""). Otherwise, the file splits incorrectly.
- If a file uses multiple delimiters, use the TXT file type instead.
- OSS is an unstructured data source. Confirm the field structure before syncing data. If the structure of data in the source changes, reconfirm the field structure in the task configuration to prevent garbled data during synchronization.
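As an illustration of the quote-doubling rule, Python's standard csv module applies the same escaping convention. This is a sketch for checking your files locally, not part of the DataWorks configuration:

```python
import csv
import io

# Write a row whose second field contains a double quotation mark.
# csv.writer quotes the field and doubles the embedded quote,
# which is the format OSS Reader expects for standard CSV.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerow(['1', 'He said "hello"', 'ok'])
print(buf.getvalue().strip())  # → 1,"He said ""hello""",ok
```

A file escaped this way splits back into the original three columns; an unescaped quote would break the field boundaries.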
Offline write
OSS Writer converts data from the DataWorks data synchronization protocol into text files written to OSS.
| Supported | Not supported |
|---|---|
| Text files only — not BLOBs such as videos and images (schema must be a two-dimensional table) | Concurrent writes to a single file |
| CSV-like files with custom delimiters | Writing to Cold Archive storage class buckets |
| ORC and Parquet formats (Snappy compression available in the code editor) | Single objects exceeding 100 GB |
| Multi-threaded writes (each thread writes to a different sub-file) | — |
| File rollover (switches to a new file when the current file exceeds a specified size) | — |
OSS does not provide data types. OSS Writer writes all data as the STRING type to OSS objects.
Data integration column types for offline write
| Type classification | Column configuration type |
|---|---|
| Integer types | LONG |
| String types | STRING |
| Floating-point types | DOUBLE |
| Boolean types | BOOLEAN |
| Date and time types | DATE |
Real-time write
- Supports real-time writes.
- Supports real-time writes from a single table to data lakes, including Hudi (0.12.x), Paimon, and Iceberg.
Create a data source
Add the OSS data source to DataWorks before developing a synchronization task. For instructions, see Data source management.
Cross-account, RAM role, and cross-region configurations
- Cross-account: Grant authorization to the corresponding account via a bucket policy. See Grant cross-account access to OSS using a bucket policy.
- RAM role authorization: See Configure a data source using the RAM role authorization mode.
- Cross-region: Use a public endpoint. See Overview of endpoints and network connectivity.
Develop a data synchronization task
Single-table offline sync
- Codeless UI: Codeless UI configuration
- Code editor: Code editor configuration
For all parameters and script demos, see Appendix: Script demos and parameter descriptions.
Single-table real-time sync
Whole-database synchronization
- Offline: Whole-database offline sync task
- Real-time: Whole-database real-time sync task
FAQ
Is there a limit on the number of OSS files that can be read?
How do I handle dirty data when reading a CSV file with multiple delimiters?
Appendix: Script demos and parameter descriptions
Batch synchronization via the code editor
The following script examples and parameter tables apply to batch synchronization tasks configured in the code editor. For the general code editor procedure, see Configure a task in the code editor.
Reader script demo: General example
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "oss",
"parameter": {
"nullFormat": "", // String that represents null
"compress": "", // Text compression type
"datasource": "", // Data source name
"column": [ // Fields to read
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "long"
},
{
"index": 2,
"type": "double"
},
{
"index": 3,
"type": "boolean"
},
{
"format": "yyyy-MM-dd HH:mm:ss",
"index": 4,
"type": "date"
}
],
"skipHeader": "", // Skip header row in CSV-like files
"encoding": "", // Encoding format
"fieldDelimiter": ",", // Column delimiter
"fileFormat": "", // Text file format
"object": [] // Object prefix or path
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "stream",
"parameter": {},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": "" // Maximum number of error records
},
"speed": {
"throttle": true, // true = rate limited; false = no rate limit
"concurrent": 1, // Number of concurrent jobs
"mbps": "12" // Rate limit (1 mbps = 1 MB/s)
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Reader script demo: Read ORC or Parquet files
Read ORC or Parquet files from OSS by reusing the HDFS Reader. In addition to the standard OSS Reader parameters, use the extended path and fileFormat parameters, and parquetSchema when reading Parquet files.
Read an ORC file
{
"stepType": "oss",
"parameter": {
"datasource": "",
"fileFormat": "orc",
"path": "/tests/case61/orc__691b6815_9260_4037_9899_****",
"column": [
{ "index": 0, "type": "long" },
{ "index": "1", "type": "string" },
{ "index": "2", "type": "string" }
]
}
}
Read a Parquet file
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "oss",
"parameter": {
"nullFormat": "",
"compress": "",
"fileFormat": "parquet",
"path": "/*",
"parquetSchema": "message m { optional BINARY registration_dttm (UTF8); optional Int64 id; optional BINARY first_name (UTF8); optional BINARY last_name (UTF8); optional BINARY email (UTF8); optional BINARY gender (UTF8); optional BINARY ip_address (UTF8); optional BINARY cc (UTF8); optional BINARY country (UTF8); optional BINARY birthdate (UTF8); optional DOUBLE salary; optional BINARY title (UTF8); optional BINARY comments (UTF8); }",
"column": [
{ "index": "0", "type": "string" },
{ "index": "1", "type": "long" },
{ "index": "2", "type": "string" },
{ "index": "3", "type": "string" },
{ "index": "4", "type": "string" },
{ "index": "5", "type": "string" },
{ "index": "6", "type": "string" },
{ "index": "7", "type": "string" },
{ "index": "8", "type": "string" },
{ "index": "9", "type": "string" },
{ "index": "10", "type": "double" },
{ "index": "11", "type": "string" },
{ "index": "12", "type": "string" }
],
"skipHeader": "false",
"encoding": "UTF-8",
"fieldDelimiter": ",",
"fieldDelimiterOrigin": ",",
"datasource": "wpw_demotest_oss",
"envType": 0,
"object": ["wpw_demo/userdata1.parquet"]
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "dt=${bizdate}",
"truncate": true,
"datasource": "0_odps_wpw_demotest",
"envType": 0,
"column": ["id"],
"emptyAsNull": false,
"table": "wpw_0827"
},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": { "record": "" },
"locale": "zh_CN",
"speed": {
"throttle": false,
"concurrent": 2
}
},
"order": {
"hops": [{ "from": "Reader", "to": "Writer" }]
}
}
Reader script parameters
| Parameter | Description | Required | Default |
|---|---|---|---|
| datasource | The data source name. Must match the name configured in the code editor. | Yes | None |
| object | The path to the objects to read. See Configuring the object path below. | Yes | None |
| column | The list of fields to read. type specifies the data type. index specifies the column number (starting from 0). value generates a constant field whose value is not read from the source. To read all data as strings, set "column": ["*"]. When type is specified, either index or value is also required. | Yes | All data read as STRING |
| fileFormat | The file format of the source object. Valid values: csv and text. Both support custom delimiters. | Yes | csv |
| fieldDelimiter | The column delimiter. For non-printable characters, use Unicode encoding (for example, \u001b). | Yes | , |
| parquetSchema | Describes the data types in a Parquet file. Valid only when fileFormat is parquet. See parquetSchema format below. | No (required for Parquet) | None |
| lineDelimiter | The row delimiter. Valid only when fileFormat is text. | No | None |
| compress | The compression format. Valid values: gzip, bzip2, and zip. Leave empty for no compression. | No | No compression |
| encoding | The encoding format of the source file. | No | utf-8 |
| nullFormat | The string to treat as null. For example, "nullFormat": "null" treats the string null as a null field, and "nullFormat": "\u0001" treats that invisible character as null. If not set, source data is written as-is without conversion. | No | None |
| skipHeader | Whether to skip the header row in a CSV-like file. Not supported for compressed files. | No | false |
| csvReaderConfig | Advanced CSV reading parameters. Uses default values if not configured. | No | None |
Configuring the object path
The object parameter accepts three path formats:
Option 1: Static path
Specify exact file paths. The path starts from the root of the bucket — do not include the bucket name.
- Single file: my_folder/my_file.txt
- Multiple files: folder_a/file1.txt,folder_a/file2.txt (comma-separated)
Option 2: Wildcard path
Use wildcards to match multiple files by pattern:
- * matches zero or more characters
- ? matches exactly one character
Examples:
- abc*[0-9].txt matches abc0.txt, abc10.txt, and abc_test_9.txt
- abc?.txt matches abc1.txt and abcX.txt
Wildcards, especially *, trigger a full OSS path scan. With many files, this scan can consume significant memory and time, and may cause the task to fail due to memory overflow. In production environments, organize files into separate folders and use more specific prefixes rather than broad wildcards.
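The wildcard semantics can be previewed locally. Python's fnmatch implements comparable shell-style matching; this is an illustrative sketch, not the OSS Reader's actual matcher:

```python
from fnmatch import fnmatch

# Candidate object names, mirroring the examples above.
objects = ["abc0.txt", "abc10.txt", "abc_test_9.txt", "abcX.txt", "abc.txt"]

# * matches zero or more characters; the pattern still requires
# a trailing digit before .txt, so abcX.txt and abc.txt are excluded.
print([o for o in objects if fnmatch(o, "abc*[0-9].txt")])
# → ['abc0.txt', 'abc10.txt', 'abc_test_9.txt']

# ? matches exactly one character, so abc10.txt (two characters) is excluded.
print([o for o in objects if fnmatch(o, "abc?.txt")])
# → ['abc0.txt', 'abcX.txt']
```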
Option 3: Dynamic parameter path
Embed scheduling parameters in the path to automate date-based synchronization. When the task runs, parameters are replaced with their actual values.
Example: raw_data/${bizdate}/abc.txt syncs the folder for the corresponding data timestamp each day.
For available scheduling parameters, see Sources and expressions of scheduling parameters.
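The substitution can be sketched as follows, assuming a daily task where bizdate resolves to the day before the run date (the snippet below is an illustration, not the DataWorks scheduler):

```python
from datetime import date, timedelta
from string import Template

# Hypothetical run date; for a daily task the data timestamp (bizdate)
# is typically the previous day, formatted as yyyymmdd.
run_date = date(2024, 8, 28)
bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")

# At run time, ${bizdate} in the object path is replaced with this value.
path = Template("raw_data/${bizdate}/abc.txt").substitute(bizdate=bizdate)
print(path)  # → raw_data/20240827/abc.txt
```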
Concurrency and performance
The path configuration determines how many threads are used:
| Path type | Read behavior |
|---|---|
| Single uncompressed file | Single-threaded |
| Multiple files or wildcard matching multiple files | Multi-threaded concurrent reads |
Configure the number of concurrent threads in the Channel Control section.
All objects in a single sync job are treated as one data table. All objects must use the same schema.
parquetSchema format
message MessageTypeName {
Required/Optional DataType ColumnName;
...;
}
- MessageTypeName: Any name.
- Required/Optional: required means the field cannot be null; optional means the field can be null. Set all fields to optional unless you have specific constraints.
- Data type: Supported types are BOOLEAN, Int32, Int64, Int96, FLOAT, DOUBLE, BINARY (for strings), and fixed_len_byte_array.
- Each row must end with a semicolon, including the last row.
Example:
"parquetSchema": "message m { optional int32 minute_id; optional int32 dsp_id; optional int32 adx_pid; optional int64 req; optional int64 res; optional int64 suc; optional int64 imp; optional double revenue; }"
Writer script demo: General example
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "stream",
"parameter": {},
"name": "Reader",
"category": "reader"
},
{
"stepType": "oss",
"parameter": {
"nullFormat": "", // String that represents null
"dateFormat": "", // Date format
"datasource": "", // Data source name
"writeMode": "", // Write mode
"writeSingleObject": "false", // Write to a single OSS file
"encoding": "", // Encoding format
"fieldDelimiter": ",", // Column delimiter
"fileFormat": "", // File format
"object": "" // Object prefix
},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": "0" // Maximum number of error records
},
"speed": {
"throttle": true, // true = rate limited; false = no rate limit
"concurrent": 1, // Number of concurrent jobs
"mbps": "12" // Rate limit (1 mbps = 1 MB/s)
}
},
"order": {
"hops": [{ "from": "Reader", "to": "Writer" }]
}
}
Writer script demo: Write ORC or Parquet files
Write ORC or Parquet files to OSS by reusing the HDFS Writer. The path and fileFormat extended parameters are used in addition to the standard OSS Writer parameters. For full parameter details, see HDFS Writer.
The following examples are for reference only. Modify the column names and data types to match your actual data before using.
Write an ORC file
Switch to the code editor, set fileFormat to orc, set path to the target path, and configure column in the format {"name": "column_name", "type": "column_type"}.
Supported ORC field types for writing:
| Field type | Supported |
|---|---|
| TINYINT | Yes |
| SMALLINT | Yes |
| INT | Yes |
| BIGINT | Yes |
| FLOAT | Yes |
| DOUBLE | Yes |
| TIMESTAMP | Yes |
| DATE | Yes |
| VARCHAR | Yes |
| STRING | Yes |
| CHAR | Yes |
| BOOLEAN | Yes |
| DECIMAL | Yes |
| BINARY | Yes |
{
"stepType": "oss",
"parameter": {
"datasource": "",
"fileFormat": "orc",
"path": "/tests/case61",
"fileName": "orc",
"writeMode": "append",
"column": [
{ "name": "col1", "type": "BIGINT" },
{ "name": "col2", "type": "DOUBLE" },
{ "name": "col3", "type": "STRING" }
],
"fieldDelimiter": "\t",
"compress": "NONE",
"encoding": "UTF-8"
}
}
Write a Parquet file
{
"stepType": "oss",
"parameter": {
"datasource": "",
"fileFormat": "parquet",
"path": "/tests/case61",
"fileName": "test",
"writeMode": "append",
"fieldDelimiter": "\t",
"compress": "SNAPPY",
"encoding": "UTF-8",
"parquetSchema": "message test { required int64 int64_col;\n required binary str_col (UTF8);\nrequired group params (MAP) {\nrepeated group key_value {\nrequired binary key (UTF8);\nrequired binary value (UTF8);\n}\n}\nrequired group params_arr (LIST) {\nrepeated group list {\nrequired binary element (UTF8);\n}\n}\nrequired group params_struct {\nrequired int64 id;\n required binary name (UTF8);\n }\nrequired group params_arr_complex (LIST) {\nrepeated group list {\nrequired group element {\n required int64 id;\n required binary name (UTF8);\n}\n}\n}\nrequired group params_complex (MAP) {\nrepeated group key_value {\nrequired binary key (UTF8);\nrequired group value {\nrequired int64 id;\n required binary name (UTF8);\n}\n}\n}\nrequired group params_struct_complex {\nrequired int64 id;\n required group detail {\nrequired int64 id;\n required binary name (UTF8);\n}\n}\n}",
"dataxParquetMode": "fields"
}
}
Writer script parameters
| Parameter | Description | Required | Default |
|---|---|---|---|
| datasource | The data source name. Must match the name configured in the code editor. | Yes | None |
| object | The name prefix for files written to OSS. OSS uses the forward slash (/) as a directory separator. Examples: "object": "datax" writes objects named datax followed by a random string; "object": "cdo/datax" writes objects under /cdo/ named datax with a random string appended. To suppress the random UUID suffix, set writeSingleObject to true. | Yes | None |
| writeMode | How existing objects are handled before writing. truncate: clears all objects matching the object prefix. append: writes directly and appends a random UUID to the file name (for example, DI_**__**). nonConflict: reports an error if any object with a matching prefix already exists. | Yes | None |
| fileFormat | The output file format. csv: strict CSV format; column delimiters that appear in the data are escaped with double quotation marks. text: splits data by the delimiter only, with no escaping. parquet: requires the parquetSchema parameter and must be configured in the code editor. orc: must be configured in the code editor. | No | text |
| writeSingleObject | Whether to write all data to a single file. true: writes to one file; no empty file is created if there is no data. false: writes to multiple files; creates an empty file (with a header if configured) when there is no data. Note: this parameter does not take effect for ORC or Parquet formats. To write a single ORC or Parquet file, set the concurrency to 1, but a random suffix is still added and single concurrency reduces sync speed. In some scenarios (for example, when the source is Hologres), data is read by shard and multiple files may be generated even with single concurrency. | No | false |
| compress | The compression format for the output file. CSV and text formats do not support compression. Parquet and ORC support SNAPPY only. Must be configured in the code editor. | No | None |
| fieldDelimiter | The column delimiter. | No | , |
| encoding | The file encoding. | No | utf-8 |
| parquetSchema | Describes the output file structure. Valid only when fileFormat is parquet. Supported types: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY (for strings), and FIXED_LEN_BYTE_ARRAY. Each row must end with a semicolon. If not configured, DataWorks converts data types automatically; see Appendix: Conversion policy for Parquet data types. | No | None |
| nullFormat | The string to write for null values. For example, "nullFormat": "null" writes the string null for null fields. | No | None |
| header | The file header. Example: ["id", "name", "age"]. | No | None |
| ossBlockSize | The size of each data block, in MB. Applies only to Parquet and ORC formats. Configure this parameter at the same level as the object parameter. Because multipart upload supports a maximum of 10,000 blocks, the default block size of 16 MB limits a single file to 160 GB. Increase the block size to support larger files. | No | 16 |
| maxFileSize | The maximum size of a single output file, in MB. Applies only to CSV and text formats. Calculated at the memory level, so the actual file size may be slightly larger due to data expansion. Each block is 10 MB (the minimum granularity); a maxFileSize value below 10 MB is treated as 10 MB. When the limit is reached, the file rolls over and a numeric suffix (_1, _2, and so on) is appended to the file name prefix. | No | 100,000 |
| suffix | The suffix appended to generated file names. For example, ".csv" produces file names such as fileName****.csv. | No | None |
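The ossBlockSize ceiling follows from simple arithmetic (a sketch using the 10,000-part multipart upload limit and the decimal convention 1 GB = 1,000 MB):

```python
# Multipart upload allows at most 10,000 parts per object,
# so the maximum file size is block size × 10,000.
MAX_PARTS = 10_000

def max_file_size_gb(block_size_mb):
    return block_size_mb * MAX_PARTS / 1000  # 1 GB = 1,000 MB

print(max_file_size_gb(16))  # → 160.0 (the documented ceiling at the 16 MB default)
print(max_file_size_gb(64))  # → 640.0 (raising the block size raises the ceiling)
```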
Appendix: Conversion policy for Parquet data types
If parquetSchema is not configured, DataWorks converts source field types to Parquet types according to the following policy.
| Source data type | Parquet type | Parquet logical type |
|---|---|---|
| CHAR / VARCHAR / STRING | BINARY | UTF8 |
| BOOLEAN | BOOLEAN | Not applicable |
| BINARY / VARBINARY | BINARY | Not applicable |
| DECIMAL | FIXED_LEN_BYTE_ARRAY | DECIMAL |
| TINYINT | INT32 | INT_8 |
| SMALLINT | INT32 | INT_16 |
| INT / INTEGER | INT32 | Not applicable |
| BIGINT | INT64 | Not applicable |
| FLOAT | FLOAT | Not applicable |
| DOUBLE | DOUBLE | Not applicable |
| DATE | INT32 | DATE |
| TIME | INT32 | TIME_MILLIS |
| TIMESTAMP / DATETIME | INT96 | Not applicable |