DataWorks: OSS

Last Updated: Mar 27, 2026

The OSS data source connects DataWorks to Object Storage Service (OSS) for both reading and writing data. This topic covers supported file formats, limits, and script parameters for the OSS data source.

Supported formats and limits

Offline read

OSS Reader reads objects from OSS and converts them to the DataWorks data integration protocol. OSS is an unstructured data store — it has no native schema, so all field structure must be confirmed in the task configuration.

Supported:

  • TXT files (the schema must be a two-dimensional table)

  • CSV-like files with custom delimiters

  • ORC and Parquet formats

  • Multiple data types (read as strings), column pruning, and column constants

  • Recursive reads and file name filtering

  • Concurrent reads for multiple objects

Not supported:

  • Multi-threaded concurrent reads for a single object

  • Multi-threaded concurrent reads for a single compressed object

Compression support for text files (TXT and CSV)

Format Supported
gzip Yes
bzip2 Yes
zip Yes

Important

A compressed package cannot contain multiple files.

Usage notes for CSV files

  • CSV files must be in standard CSV format. If a column value contains a double quotation mark ("), escape it as two double quotation marks (""); otherwise, the columns are split incorrectly (see the sample row after this list).

  • If a file uses multiple delimiters, use the TXT file type instead.

  • OSS is an unstructured data source. Confirm the field structure before syncing data. If the structure of data in the source changes, reconfirm the field structure in the task configuration to prevent garbled data during synchronization.
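
As a minimal illustration of the escaping rule, the following hypothetical CSV row doubles the embedded quotation marks and wraps the whole value in quotation marks:

id,name,comment
1,Widget,"Customer said ""great stuff"" on arrival"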

Offline write

OSS Writer converts data from the DataWorks data synchronization protocol into text files written to OSS.

Supported:

  • Text files only, not BLOBs such as videos and images (the schema must be a two-dimensional table)

  • CSV-like files with custom delimiters

  • ORC and Parquet formats (Snappy compression available in the code editor)

  • Multi-threaded writes (each thread writes to a different sub-file)

  • File rollover (switches to a new file when the current file exceeds a specified size)

Not supported:

  • Concurrent writes to a single file

  • Writing to Cold Archive storage class buckets

  • Single objects exceeding 100 GB

Note

OSS does not provide data types. OSS Writer writes all data as the STRING type to OSS objects.

Data integration column types for offline write

Type classification Column configuration type
Integer types LONG
String types STRING
Floating-point types DOUBLE
Boolean types BOOLEAN
Date and time types DATE

Real-time write

  • Supports real-time writes.

  • Supports real-time writes from a single table to data lakes, including Hudi (0.12.x), Paimon, and Iceberg.

Create a data source

Add the OSS data source to DataWorks before developing a synchronization task. For instructions, see Data source management.

Cross-account, RAM role, and cross-region configurations

Develop a data synchronization task

Single-table offline sync

For all parameters and script demos, see Appendix: Script demos and parameter descriptions.

Single-table real-time sync

Whole-database synchronization

FAQ

Is there a limit on the number of OSS files that can be read?

How do I handle dirty data when reading a CSV file with multiple delimiters?

Appendix: Script demos and parameter descriptions

Batch synchronization via the code editor

The following script examples and parameter tables apply to batch synchronization tasks configured in the code editor. For the general code editor procedure, see Configure a task in the code editor.

Reader script demo: General example

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "oss",
            "parameter": {
                "nullFormat": "",          // String that represents null
                "compress": "",            // Text compression type
                "datasource": "",          // Data source name
                "column": [               // Fields to read
                    {
                        "index": 0,
                        "type": "string"
                    },
                    {
                        "index": 1,
                        "type": "long"
                    },
                    {
                        "index": 2,
                        "type": "double"
                    },
                    {
                        "index": 3,
                        "type": "boolean"
                    },
                    {
                        "format": "yyyy-MM-dd HH:mm:ss",
                        "index": 4,
                        "type": "date"
                    }
                ],
                "skipHeader": "",          // Skip header row in CSV-like files
                "encoding": "",            // Encoding format
                "fieldDelimiter": ",",     // Column delimiter
                "fileFormat": "",          // Text file format
                "object": []              // Object prefix or path
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""               // Maximum number of error records
        },
        "speed": {
            "throttle": true,          // true = rate limited; false = no rate limit
            "concurrent": 1,           // Number of concurrent jobs
            "mbps": "12"               // Rate limit (1 mbps = 1 MB/s)
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Reader script demo: Read ORC or Parquet files

Read ORC or Parquet files from OSS by reusing the HDFS Reader. In addition to the standard OSS Reader parameters, set the extended path and fileFormat parameters; reading a Parquet file also requires the parquetSchema parameter.

Read an ORC file

{
    "stepType": "oss",
    "parameter": {
        "datasource": "",
        "fileFormat": "orc",
        "path": "/tests/case61/orc__691b6815_9260_4037_9899_****",
        "column": [
            { "index": 0, "type": "long" },
            { "index": "1", "type": "string" },
            { "index": "2", "type": "string" }
        ]
    }
}

Read a Parquet file

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "oss",
            "parameter": {
                "nullFormat": "",
                "compress": "",
                "fileFormat": "parquet",
                "path": "/*",
                "parquetSchema": "message m { optional BINARY registration_dttm (UTF8); optional Int64 id; optional BINARY first_name (UTF8); optional BINARY last_name (UTF8); optional BINARY email (UTF8); optional BINARY gender (UTF8); optional BINARY ip_address (UTF8); optional BINARY cc (UTF8); optional BINARY country (UTF8); optional BINARY birthdate (UTF8); optional DOUBLE salary; optional BINARY title (UTF8); optional BINARY comments (UTF8); }",
                "column": [
                    { "index": "0", "type": "string" },
                    { "index": "1", "type": "long" },
                    { "index": "2", "type": "string" },
                    { "index": "3", "type": "string" },
                    { "index": "4", "type": "string" },
                    { "index": "5", "type": "string" },
                    { "index": "6", "type": "string" },
                    { "index": "7", "type": "string" },
                    { "index": "8", "type": "string" },
                    { "index": "9", "type": "string" },
                    { "index": "10", "type": "double" },
                    { "index": "11", "type": "string" },
                    { "index": "12", "type": "string" }
                ],
                "skipHeader": "false",
                "encoding": "UTF-8",
                "fieldDelimiter": ",",
                "fieldDelimiterOrigin": ",",
                "datasource": "wpw_demotest_oss",
                "envType": 0,
                "object": ["wpw_demo/userdata1.parquet"]
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "odps",
            "parameter": {
                "partition": "dt=${bizdate}",
                "truncate": true,
                "datasource": "0_odps_wpw_demotest",
                "envType": 0,
                "column": ["id"],
                "emptyAsNull": false,
                "table": "wpw_0827"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": { "record": "" },
        "locale": "zh_CN",
        "speed": {
            "throttle": false,
            "concurrent": 2
        }
    },
    "order": {
        "hops": [{ "from": "Reader", "to": "Writer" }]
    }
}

Reader script parameters

Parameter Description Required Default
datasource The data source name. Must match the name configured in the code editor. Yes None
object The path to the objects to read. See Configuring the object path below. Yes None
column The list of fields to read. type specifies the data type. index specifies the column number (starting from 0). value generates a constant field whose value is not read from the source. To read all data as strings, set "column": ["*"]. Specify type together with either index or value (see the example after this table). Yes All data read as STRING
fileFormat The file format of the source object. Valid values: csv, text. Both support custom delimiters. Yes csv
fieldDelimiter The column delimiter. Defaults to a comma (,). For non-visible characters, use Unicode encoding (for example, \u001b). Yes ,
parquetSchema Required when reading Parquet files (fileFormat: parquet). Describes the data types in the Parquet file. See parquetSchema format below. No (required for Parquet) None
lineDelimiter The row delimiter. Valid only when fileFormat is text. No None
compress The compression format. Valid values: gzip, bzip2, zip. Leave blank for no compression. No No compression
encoding The encoding format of the source file. No utf-8
nullFormat The string to treat as null. For example: "nullFormat": "null" treats the string null as a null field. "nullFormat": "\u0001" treats the invisible character as null. If not set, source data is written as-is without conversion. No None
skipHeader Skips the header row in a CSV-like file. Not supported for compressed files. No false
csvReaderConfig Advanced CSV reading parameters. Uses default values if not configured. No None
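
For reference, the following column snippet (a sketch; the indexes and the constant value are placeholders) combines indexed fields with a constant field generated by value:

"column": [
    { "index": 0, "type": "long" },           // first source column, read as LONG
    { "index": 1, "type": "string" },         // second source column, read as STRING
    { "type": "string", "value": "alibaba" }  // constant column; not read from the source
]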

Configuring the object path

The object parameter accepts three path formats; a combined code-editor sketch follows Option 3:

Option 1: Static path

Specify exact file paths. The path starts from the root of the bucket — do not include the bucket name.

  • Single file: my_folder/my_file.txt

  • Multiple files: folder_a/file1.txt, folder_a/file2.txt (comma-separated)

Option 2: Wildcard path

Use wildcards to match multiple files by pattern:

  • * matches zero or more characters

  • ? matches exactly one character

Examples:

  • abc*[0-9].txt matches abc0.txt, abc10.txt, abc_test_9.txt

  • abc?.txt matches abc1.txt, abcX.txt

Important

Wildcards, especially *, trigger a full OSS path scan. With many files, this scan can consume significant memory and time, and may cause the task to fail due to memory overflow. In production environments, organize files into separate folders and use more specific prefixes rather than broad wildcards.

Option 3: Dynamic parameter path

Embed scheduling parameters in the path to automate date-based synchronization. When the task runs, parameters are replaced with their actual values.

Example: raw_data/${bizdate}/abc.txt syncs the folder for the corresponding data timestamp each day.

For available scheduling parameters, see Sources and expressions of scheduling parameters.
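
The following sketch shows the three path styles as they appear in the code editor, where object is an array; the bucket name is omitted and all folder and file names are hypothetical:

// Option 1: static paths
"object": ["folder_a/file1.txt", "folder_a/file2.txt"]

// Option 2: wildcard path (triggers a scan of all matching objects)
"object": ["raw_data/abc*.txt"]

// Option 3: dynamic parameter path, resolved from scheduling parameters at run time
"object": ["raw_data/${bizdate}/abc.txt"]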

Concurrency and performance

The path configuration determines how many threads are used:

Path type Read behavior
Single uncompressed file Single-threaded
Multiple files or wildcard matching multiple files Multi-threaded concurrent reads

Configure the number of concurrent threads in the Channel Control section.
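
In the code editor, the equivalent of the Channel Control setting is the setting.speed block; the values below are illustrative only:

"setting": {
    "speed": {
        "throttle": false,  // no rate limit
        "concurrent": 4     // up to 4 threads when multiple objects are matched
    }
}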

Important

All objects in a single sync job are treated as one data table. All objects must use the same schema.

parquetSchema format

message MessageTypeName {
    Required/Optional DataType ColumnName;
    ...;
}

  • MessageTypeName: Any name.

  • Required/Optional: required means the field cannot be null; optional means the field can be null. Set all fields to optional unless you have specific constraints.

  • Data type: Supported types are BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY (for strings), and FIXED_LEN_BYTE_ARRAY.

  • Each row must end with a semicolon, including the last row.

Example:

"parquetSchema": "message m { optional int32 minute_id; optional int32 dsp_id; optional int32 adx_pid; optional int64 req; optional int64 res; optional int64 suc; optional int64 imp; optional double revenue; }"

Writer script demo: General example

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "oss",
            "parameter": {
                "nullFormat": "",           // String that represents null
                "dateFormat": "",           // Date format
                "datasource": "",           // Data source name
                "writeMode": "",            // Write mode
                "writeSingleObject": "false", // Write to a single OSS file
                "encoding": "",             // Encoding format
                "fieldDelimiter": ",",      // Column delimiter
                "fileFormat": "",           // File format
                "object": ""               // Object prefix
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "0"              // Maximum number of error records
        },
        "speed": {
            "throttle": true,          // true = rate limited; false = no rate limit
            "concurrent": 1,           // Number of concurrent jobs
            "mbps": "12"               // Rate limit (1 mbps = 1 MB/s)
        }
    },
    "order": {
        "hops": [{ "from": "Reader", "to": "Writer" }]
    }
}

Writer script demo: Write ORC or Parquet files

Write ORC or Parquet files to OSS by reusing the HDFS Writer. The path and fileFormat extended parameters are used in addition to the standard OSS Writer parameters. For full parameter details, see HDFS Writer.

Important

The following examples are for reference only. Modify the column names and data types to match your actual data before using.

Write an ORC file

Switch to the code editor, set fileFormat to orc, set path to the target path, and configure column in the format {"name": "column_name", "type": "column_type"}.

Supported ORC field types for writing:

Field type Supported
TINYINT Yes
SMALLINT Yes
INT Yes
BIGINT Yes
FLOAT Yes
DOUBLE Yes
TIMESTAMP Yes
DATE Yes
VARCHAR Yes
STRING Yes
CHAR Yes
BOOLEAN Yes
DECIMAL Yes
BINARY Yes

{
    "stepType": "oss",
    "parameter": {
        "datasource": "",
        "fileFormat": "orc",
        "path": "/tests/case61",
        "fileName": "orc",
        "writeMode": "append",
        "column": [
            { "name": "col1", "type": "BIGINT" },
            { "name": "col2", "type": "DOUBLE" },
            { "name": "col3", "type": "STRING" }
        ],
        "fieldDelimiter": "\t",
        "compress": "NONE",
        "encoding": "UTF-8"
    }
}

Write a Parquet file

{
    "stepType": "oss",
    "parameter": {
        "datasource": "",
        "fileFormat": "parquet",
        "path": "/tests/case61",
        "fileName": "test",
        "writeMode": "append",
        "fieldDelimiter": "\t",
        "compress": "SNAPPY",
        "encoding": "UTF-8",
        "parquetSchema": "message test { required int64 int64_col;\n required binary str_col (UTF8);\nrequired group params (MAP) {\nrepeated group key_value {\nrequired binary key (UTF8);\nrequired binary value (UTF8);\n}\n}\nrequired group params_arr (LIST) {\nrepeated group list {\nrequired binary element (UTF8);\n}\n}\nrequired group params_struct {\nrequired int64 id;\n required binary name (UTF8);\n }\nrequired group params_arr_complex (LIST) {\nrepeated group list {\nrequired group element {\n required int64 id;\n required binary name (UTF8);\n}\n}\n}\nrequired group params_complex (MAP) {\nrepeated group key_value {\nrequired binary key (UTF8);\nrequired group value {\nrequired int64 id;\n required binary name (UTF8);\n}\n}\n}\nrequired group params_struct_complex {\nrequired int64 id;\n required group detail {\nrequired int64 id;\n required binary name (UTF8);\n}\n}\n}",
        "dataxParquetMode": "fields"
    }
}

Writer script parameters

Parameter Description Required Default
datasource The data source name. Must match the name configured in the code editor. Yes None
object The name prefix for files written to OSS. OSS uses the forward slash (/) as a directory separator. Examples: "object": "datax" writes objects starting with datax followed by a random string. "object": "cdo/datax" writes to /cdo/datax with a random string appended. To suppress the random UUID suffix, set writeSingleObject to true. Yes None
writeMode How existing objects are handled before writing. truncate: clears all objects matching the object prefix. append: writes directly and appends a random UUID to the file name (for example, DI_**__**). nonConflict: reports an error if any object with a matching prefix already exists. Yes None
fileFormat The output file format. csv: strict CSV format — column delimiters in data are escaped with double quotation marks. text: splits data by delimiter only, no escaping. parquet: requires the parquetSchema parameter; must use the code editor. ORC: must use the code editor. No text
writeSingleObject Whether to write all data to a single file. true: writes to one file; no empty file is created if there is no data. false: writes to multiple files; creates an empty file (with a header if configured) when there is no data. No false
Note

This parameter does not take effect for ORC or Parquet formats. To write a single ORC or Parquet file, set concurrency to 1; note that a random suffix is still added and single concurrency reduces sync speed. In some scenarios (for example, when the source is Hologres), data is read by shard and multiple files may be generated even with single concurrency.

compress The compression format for the output file. CSV and TEXT formats do not support compression. Parquet and ORC support SNAPPY only. Must be configured in the code editor. No None
fieldDelimiter The column delimiter. No ,
encoding The file encoding. No utf-8
parquetSchema Required when writing Parquet files (fileFormat: parquet). Describes the output file structure. Supported types: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY (for strings), FIXED_LEN_BYTE_ARRAY. Each row must end with a semicolon. If not configured, DataWorks converts data types automatically — see Appendix: Conversion policy for Parquet data types. For a configuration example, see Appendix: Script demos and parameter descriptions. No None
nullFormat The string to write for null values. For example, nullFormat="null" writes null for null fields. No None
header The file header. Example: ["id", "name", "age"]. No None
ossBlockSize The size of each data block in MB. Applies only to Parquet and ORC formats. Configure this parameter at the same level as the object parameter. Because multipart upload supports a maximum of 10,000 blocks, the default block size of 16 MB limits a single file to 160 GB. Increase the block size to support larger files. No 16
maxFileSize The maximum size of a single output file in MB. Applies only to CSV and TEXT formats. Calculated at the memory level — actual file size may be slightly larger due to data expansion. Each block is 10 MB (the minimum granularity); a maxFileSize value below 10 MB is treated as 10 MB. When the limit is reached, file rollover occurs and the new file name has a suffix appended (_1, _2, etc.) to the original prefix. No 100,000
suffix The suffix appended to generated file names. For example, ".csv" produces fileName****.csv. No None
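
A minimal writer parameter sketch that combines several of the options above for CSV output; the data source name, object prefix, and header columns are placeholders:

"parameter": {
    "datasource": "my_oss_datasource",   // placeholder data source name
    "fileFormat": "csv",
    "object": "export/daily_report",     // name prefix under the bucket root
    "writeMode": "truncate",             // clear objects matching the prefix before writing
    "writeSingleObject": "true",         // write all data to a single file
    "fieldDelimiter": ",",
    "encoding": "UTF-8",
    "nullFormat": "null",                // write the string null for null fields
    "header": ["id", "name", "age"],
    "suffix": ".csv"                     // appended to the generated file name
}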

Appendix: Conversion policy for Parquet data types

If parquetSchema is not configured, DataWorks converts source field types to Parquet types according to the following policy.

Source data type Parquet type Parquet logical type
CHAR / VARCHAR / STRING BINARY UTF8
BOOLEAN BOOLEAN Not applicable
BINARY / VARBINARY BINARY Not applicable
DECIMAL FIXED_LEN_BYTE_ARRAY DECIMAL
TINYINT INT32 INT_8
SMALLINT INT32 INT_16
INT / INTEGER INT32 Not applicable
BIGINT INT64 Not applicable
FLOAT FLOAT Not applicable
DOUBLE DOUBLE Not applicable
DATE INT32 DATE
TIME INT32 TIME_MILLIS
TIMESTAMP / DATETIME INT96 Not applicable