
DataWorks:OSS data source

Last Updated: Dec 27, 2025

The OSS data source provides a bidirectional channel to read data from and write data to OSS. This topic describes the data synchronization capabilities of the OSS data source in DataWorks.

Supported field types and limits

Offline read

OSS Reader reads data from OSS and converts the data to conform to the Data Integration protocol. OSS is a service for storing unstructured data. OSS Reader supports the following features for data integration.

Supported:

  • Supports reading TXT files. The schema in the TXT file must be a two-dimensional table.

  • Supports CSV-like files with custom delimiters.

  • Supports ORC and PARQUET formats.

  • Supports reading multiple data types, which are represented as strings. Supports column pruning and column constants.

  • Supports recursive reads and file name filtering.

  • Supports text compression. The available compression formats are gzip, bzip2, and zip.

    Note

    A compressed package cannot contain multiple compressed files.

  • Supports concurrent reads for multiple objects.

Not supported:

  • Multi-threaded concurrent reads of a single object or file.

  • Multi-threaded concurrent reads of a single compressed object.

Important
  • When you prepare data in OSS, if the data is in a CSV file, the file must be in standard CSV format. For example, if a column contains a double quotation mark ("), you must replace it with two double quotation marks (""). Otherwise, the file may be split incorrectly. If the file contains multiple delimiters, use the text type.

  • OSS is an unstructured data source that stores file-type data. Therefore, before you run a sync task, confirm that the field structure meets your expectations. Similarly, if the data structure in the unstructured data source changes, you must update the field structure in the task configuration. Otherwise, the synchronized data may be corrupted.

Offline write

OSS Writer converts data based on the data synchronization protocol and writes the data to text files in OSS. OSS is a service for storing unstructured data. OSS Writer currently supports the following features.

Supported:

  • Supports writing only text files. BLOB files such as videos and images are not supported. The schema in the text file must be a two-dimensional table.

  • Supports CSV-like files with custom delimiters.

  • Supports ORC and PARQUET formats.

  • Supports multi-threaded writes. Each thread writes to a different sub-file.

  • Supports file rotation. When a file exceeds a specific size, the system switches to a new file.

Not supported:

  • Concurrent writes to a single file.

  • OSS itself does not provide data types. OSS Writer writes all data to OSS objects as the STRING type.

  • Writing to OSS buckets with the Cold Archive storage class.

  • A single object or file cannot exceed 100 GB.

Type classification    Data Integration column configuration type
Integer                LONG
String                 STRING
Floating-point         DOUBLE
Boolean                BOOLEAN
Date and time          DATE

Real-time write

  • Supports real-time writes.

  • Supports real-time writes from a single table to data lakes such as Hudi (0.12.x), Paimon, and Iceberg.

Create a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data Source Management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.


Develop a data synchronization task

For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guides.

Configuration guide for single-table offline sync tasks

Configuration guide for single-table real-time sync tasks

For more information about the procedure, see Configure a real-time sync task in Data Integration and Configure a real-time sync task in DataStudio.

Configuration guide for full-database synchronization

For more information about the procedure, see Full-database offline sync task and Full-database real-time sync task.

FAQ

Is there a limit on the number of files that can be read from OSS?

How do I handle dirty data when reading a CSV file with multiple delimiters?

Appendix: Script demos and parameter descriptions

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a task in the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Reader script demo: General example

{
    "type":"job",
    "version":"2.0",// The version number.
    "steps":[
        {
            "stepType":"oss",// The plugin name.
            "parameter":{
                "nullFormat":"",// Defines the string that can be interpreted as null.
                "compress":"",// The text compression type.
                "datasource":"",// The data source.
                "column":[// The fields.
                    {
                        "index":0,// The column index.
                        "type":"string"// The data type.
                    },
                    {
                        "index":1,
                        "type":"long"
                    },
                    {
                        "index":2,
                        "type":"double"
                    },
                    {
                        "index":3,
                        "type":"boolean"
                    },
                    {
                        "format":"yyyy-MM-dd HH:mm:ss", // The time format.
                        "index":4,
                        "type":"date"
                    }
                ],
                "skipHeader":"",// If a CSV-like file has a header row, skip it.
                "encoding":"",// The encoding format.
                "fieldDelimiter":",",// The column delimiter.
                "fileFormat": "",// The text type.
                "object":[]// The object prefix.
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":""// The number of error records.
        },
        "speed":{
            "throttle":true,// If throttle is set to false, the mbps parameter does not take effect, which means that the data rate is not limited. If throttle is set to true, the data rate is limited.
            "concurrent":1, // The number of concurrent jobs.
            "mbps":"12"// The maximum data rate. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Reader script demo: Read ORC or Parquet files from OSS

You can read ORC and Parquet files from OSS because OSS Reader reuses the capabilities of HDFS Reader. In addition to the existing OSS Reader parameters, this method adds extended configuration parameters, such as path (for ORC) and fileFormat (for ORC and Parquet).

  • The following example shows how to read an ORC file from OSS.

    {
        "stepType": "oss",
        "parameter": {
            "datasource": "",
            "fileFormat": "orc",
            "path": "/tests/case61/orc__691b6815_9260_4037_9899_****",
            "column": [
                {
                    "index": 0,
                    "type": "long"
                },
                {
                    "index": "1",
                    "type": "string"
                },
                {
                    "index": "2",
                    "type": "string"
                }
            ]
        }
    }
  • The following example shows how to read a Parquet file from OSS.

    {
      "type":"job",
        "version":"2.0",
        "steps":[
        {
          "stepType":"oss",
          "parameter":{
            "nullFormat":"",
            "compress":"",
            "fileFormat":"parquet",
            "path":"/*",
            "parquetSchema":"message m { optional BINARY registration_dttm (UTF8); optional Int64 id; optional BINARY first_name (UTF8); optional BINARY last_name (UTF8); optional BINARY email (UTF8); optional BINARY gender (UTF8); optional BINARY ip_address (UTF8); optional BINARY cc (UTF8); optional BINARY country (UTF8); optional BINARY birthdate (UTF8); optional DOUBLE salary; optional BINARY title (UTF8); optional BINARY comments (UTF8); }",
            "column":[
              {
                "index":"0",
                "type":"string"
              },
              {
                "index":"1",
                "type":"long"
              },
              {
                "index":"2",
                "type":"string"
              },
              {
                "index":"3",
                "type":"string"
              },
              {
                "index":"4",
                "type":"string"
              },
              {
                "index":"5",
                "type":"string"
              },
              {
                "index":"6",
                "type":"string"
              },
              {
                "index":"7",
                "type":"string"
              },
              {
                "index":"8",
                "type":"string"
              },
              {
                "index":"9",
                "type":"string"
              },
              {
                "index":"10",
                "type":"double"
              },
              {
                "index":"11",
                "type":"string"
              },
              {
                "index":"12",
                "type":"string"
              }
            ],
            "skipHeader":"false",
            "encoding":"UTF-8",
            "fieldDelimiter":",",
            "fieldDelimiterOrigin":",",
            "datasource":"wpw_demotest_oss",
            "envType":0,
            "object":[
              "wpw_demo/userdata1.parquet"
            ]
          },
          "name":"Reader",
          "category":"reader"
        },
        {
          "stepType":"odps",
          "parameter":{
            "partition":"dt=${bizdate}",
            "truncate":true,
            "datasource":"0_odps_wpw_demotest",
            "envType":0,
            "column":[
              "id"
            ],
            "emptyAsNull":false,
            "table":"wpw_0827"
          },
          "name":"Writer",
          "category":"writer"
        }
      ],
        "setting":{
        "errorLimit":{
          "record":""
        },
        "locale":"zh_CN",
          "speed":{
          "throttle":false,
            "concurrent":2
        }
      },
      "order":{
        "hops":[
          {
            "from":"Reader",
            "to":"Writer"
          }
        ]
      }
    }

Reader script parameters

Each of the following parameter descriptions also notes whether the parameter is required and lists its default value.

datasource

The name of the data source. In the code editor, you can add a data source. The value of this parameter must be the same as the name of the added data source.

Required: Yes

Default value: None

object

Specifies one or more objects to synchronize from OSS. You can configure this parameter in three ways: explicit path, wildcard path, and dynamic parameter path.

1. Configuration methods

  • Explicit path

    • Basic rule: The path starts from the root directory of the bucket and does not need to include the bucket name.

    • Specify a single file: Enter the full path of the file. Example: my_folder/my_file.txt.

    • Specify multiple objects: Use a comma (,) to separate the paths of multiple files or folders. Example: folder_a/file1.txt, folder_a/file2.txt.

  • Wildcard path

    • Use wildcards to match multiple files that follow a specific pattern.

    • *: Matches zero or more of any character.

    • ?: Matches one of any character.

    • Examples:

      • abc*[0-9].txt matches abc0.txt, abc10.txt, abc_test_9.txt, and so on.

      • abc?.txt matches abc1.txt, abcX.txt, and so on.

  • Dynamic parameter path

    • Embed scheduling parameters in the path to automate synchronization. When the task runs, the parameters are replaced with their actual values.

    • Example: If you set the path to raw_data/${bizdate}/abc.txt, the task can dynamically synchronize the folder for the corresponding data timestamp each day.

    • For more information about how to use scheduling parameters, see Sources and expressions of scheduling parameters.

Important
  • Use wildcards with caution. Using a wildcard, especially *, triggers a traversal scan of the OSS path. If there are many files, the scan can consume a lot of memory and time, and may even cause the task to fail due to memory overflow. We do not recommend using broad wildcards in a production environment. If you encounter this issue, split the files into different directories and try to synchronize them again.

  • The data synchronization system treats all objects synchronized under a single job as one data table. You must ensure that all objects can adapt to the same schema information.

2. Concurrent read mechanism and performance

The configuration method directly affects the concurrent performance of data extraction:

  • Single-threaded mode: When you specify only a single, non-compressed file, the task extracts data in single-threaded mode.

  • Multi-threaded mode: When you specify multiple specific files or use a wildcard character to match multiple files, the task automatically enables multi-threaded concurrent reading to significantly improve extraction efficiency. You can configure the concurrency in Channel Control.
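The following fragment sketches the three path styles inside a reader step. The data source name and paths are placeholders for illustration, not values from the examples above.

"parameter": {
    "datasource": "my_oss_datasource",// A placeholder data source name.
    "fileFormat": "csv",
    "fieldDelimiter": ",",
    "object": [
        "folder_a/file1.txt",// Explicit path: a single file under the bucket root.
        "log_data/abc*.txt",// Wildcard path: matches objects such as log_data/abc1.txt. Triggers a scan of the path.
        "raw_data/${bizdate}/abc.txt"// Dynamic parameter path: ${bizdate} is replaced with the data timestamp at run time.
    ]
}

Because more than one entry is configured, the task reads in multi-threaded mode as described above.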

Required: Yes

Default value: None

parquetSchema

This parameter is configured when reading a Parquet file from OSS. It takes effect only when fileFormat is set to parquet. It specifies the type description of the Parquet storage. After you fill in parquetSchema, ensure that the overall configuration conforms to JSON syntax. The format is as follows:

message MessageTypeName {
Required/Optional, Data Type, Column Name;
......................;
}

The configuration items are described as follows:

  • MessageType name: Enter a name.

  • Required/Optional: `required` means not null, `optional` means nullable. We recommend that you set all to `optional`.

  • Data type: Parquet files support BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY (use BINARY for string types), and FIXED_LEN_BYTE_ARRAY types.

  • Each row setting must end with a semicolon. The last row must also have a semicolon.

The following is a configuration example.

"parquetSchema": "message m { optional int32 minute_id; optional int32 dsp_id; optional int32 adx_pid; optional int64 req; optional int64 res; optional int64 suc; optional int64 imp; optional double revenue; }"

Required: No

Default value: None

column

The list of fields to read. `type` specifies the type of the source data. `index` specifies which column of the text the current column comes from (starting from 0). `value` specifies that the current type is a constant. The data is not read from the source file, but the corresponding column is automatically generated based on the `value`.

By default, you can read all data as the String type with the following configuration.

"column": ["*"]

You can specify the column field information with the following configuration.

"column":
    {
       "type": "long",
       "index": 0    // Get the int field from the first column of the OSS text.
    },
    {
       "type": "string",
       "value": "alibaba"  // Generate a string field "alibaba" from within OSS Reader as the current field.
    }
Note

For the column information you specify, `type` is required, and you must choose either `index` or `value`.

Required: Yes

Default value: All data is read as the STRING type.

fileFormat

The file type of the source OSS object, for example, csv or text. Both formats support custom delimiters.

Required: Yes

Default value: csv

fieldDelimiter

The column delimiter for reading.

Note

When OSS Reader reads data, you must specify a column delimiter. If you do not specify one, a comma (,) is used by default. The UI configuration also defaults to a comma (,).

If the delimiter is a non-printable character, enter its Unicode escape, for example, \u001b or \u007c.

Required: Yes

Default value: a comma (,)

lineDelimiter

The row delimiter for reading.

Note

This parameter is valid only when fileFormat is set to text.
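A brief sketch that combines the two delimiters for a text-format read; the delimiter values are placeholders:

"parameter": {
    "fileFormat": "text",// lineDelimiter takes effect only for the text type.
    "fieldDelimiter": "\u0001",// A non-printable column delimiter, specified as a Unicode escape.
    "lineDelimiter": "\n"// The row delimiter.
}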

Required: No

Default value: None

compress

The text compression type. By default, this is not filled in, which means no compression. Supported compression types are gzip, bzip2, and zip.

Required: No

Default value: No compression

encoding

The encoding configuration for reading files.

Required: No

Default value: utf-8

nullFormat

In text files, null (a null pointer) cannot be defined with a standard string. Data synchronization provides nullFormat to define which strings can represent null. For example:

  • If you configure nullFormat:"null", which is a visible character, and the source data is "null", data synchronization treats it as a null field.

  • If you configure nullFormat:"\u0001", which is an invisible character, and the source data is the string "\u0001", data synchronization treats it as a null field.

  • If you do not include the "nullFormat" parameter, which means it is not configured, the source data is written to the destination as is, without any conversion.

Required: No

Default value: None

skipHeader

CSV-like files may have a header row that needs to be skipped. By default, it is not skipped. skipHeader is not supported in compressed file mode.

Required: No

Default value: false

csvReaderConfig

Parameter configuration for reading CSV files. This is a Map type. CSV files are read using CsvReader, which has many configurations. If not configured, default values are used.
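A hedged illustration of the Map structure. The option names below are common CsvReader settings used only for illustration; confirm the exact options against the CsvReader documentation before you rely on them.

"csvReaderConfig": {
    "safetySwitch": false,// Assumed option: whether to limit the length of a single column.
    "skipEmptyRecords": false,// Assumed option: whether to skip empty rows.
    "useTextQualifier": false// Assumed option: whether to enable the text qualifier (double quotation marks).
}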

Required: No

Default value: None

Writer script demo: General example

{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"oss",// The plugin name.
            "parameter":{
                "nullFormat":"",// The data synchronization system provides nullFormat to define which strings can represent null.
                "dateFormat":"",// The date format.
                "datasource":"",// The data source.
                "writeMode":"",// The write mode.
                "writeSingleObject":"false", // Specifies whether to write synchronized data to a single OSS file.
                "encoding":"",// The encoding format.
                "fieldDelimiter":",",// The column delimiter.
                "fileFormat":"",// The text type.
                "object":""// The object prefix.
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The number of error records.
        },
        "speed":{
            "throttle":true,// If throttle is set to false, the mbps parameter does not take effect, which means that the data rate is not limited. If throttle is set to true, the data rate is limited.
            "concurrent":1, // The number of concurrent jobs.
            "mbps":"12"// The maximum data rate. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}

Writer script demo: Write ORC or Parquet files to OSS

You can write ORC or Parquet files to OSS because OSS Writer reuses the capabilities of HDFS Writer. In addition to the existing OSS Writer parameters, extended configuration parameters such as path and fileFormat are added. For more information about these parameters, see HDFS Writer.

The following are examples of writing ORC or Parquet files to OSS:

Important

The following are examples only. Modify the parameters based on your specific column names and types. Do not copy and use them directly.

  • Write to OSS in ORC file format

    To write an ORC file, you must use the code editor. In the code editor, set the fileFormat parameter to orc and the path parameter to the destination file path. The format for the column parameter is {"name":"your column name","type": "your column type"}.

    The following ORC types are currently supported for writing:

    TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, TIMESTAMP, DATE, VARCHAR, STRING, CHAR, BOOLEAN, DECIMAL, and BINARY.

    {
        "stepType": "oss",
        "parameter": {
            "datasource": "",
            "fileFormat": "orc",
            "path": "/tests/case61",
            "fileName": "orc",
            "writeMode": "append",
            "column": [
                {
                    "name": "col1",
                    "type": "BIGINT"
                },
                {
                    "name": "col2",
                    "type": "DOUBLE"
                },
                {
                    "name": "col3",
                    "type": "STRING"
                }
            ],
            "fieldDelimiter": "\t",
            "compress": "NONE",
            "encoding": "UTF-8"
        }
    }
  • Write to OSS in Parquet file format

    {
        "stepType": "oss",
        "parameter": {
            "datasource": "",
            "fileFormat": "parquet",
            "path": "/tests/case61",
            "fileName": "test",
            "writeMode": "append",
            "fieldDelimiter": "\t",
            "compress": "SNAPPY",
            "encoding": "UTF-8",
            "parquetSchema": "message test { required int64 int64_col;\n required binary str_col (UTF8);\nrequired group params (MAP) {\nrepeated group key_value {\nrequired binary key (UTF8);\nrequired binary value (UTF8);\n}\n}\nrequired group params_arr (LIST) {\nrepeated group list {\nrequired binary element (UTF8);\n}\n}\nrequired group params_struct {\nrequired int64 id;\n required binary name (UTF8);\n }\nrequired group params_arr_complex (LIST) {\nrepeated group list {\nrequired group element {\n required int64 id;\n required binary name (UTF8);\n}\n}\n}\nrequired group params_complex (MAP) {\nrepeated group key_value {\nrequired binary key (UTF8);\nrequired group value {\nrequired int64 id;\n required binary name (UTF8);\n}\n}\n}\nrequired group params_struct_complex {\nrequired int64 id;\n required group detail {\nrequired int64 id;\n required binary name (UTF8);\n}\n}\n}",
            "dataxParquetMode": "fields"
        }
    }

Writer script parameters

Each of the following parameter descriptions also notes whether the parameter is required and lists its default value.

datasource

The name of the data source. In the code editor, you can add a data source. The value of this parameter must be the same as the name of the added data source.

Required: Yes

Default value: None

object

The name of the file written by OSS Writer. OSS uses file names to simulate directories. OSS has the following restrictions on object names:

  • If you use "object": "datax", the written object starts with "datax" and a random string is added as a suffix.

  • If you use "object": "cdo/datax", the written object starts with /cdo/datax and a random string is added as a suffix. The delimiter for simulating directories in OSS is a forward slash (/).

If you do not need a random UUID suffix, we recommend that you configure "writeSingleObject" : "true". For more information, see the description of writeSingleObject.
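For reference, a minimal sketch of how object and writeSingleObject combine in a writer step; the data source name and prefix are placeholders:

"parameter": {
    "datasource": "my_oss_datasource",// A placeholder data source name.
    "object": "cdo/datax",// Objects are written under the simulated directory cdo/.
    "writeSingleObject": "true",// Write all data to a single object without a random UUID suffix.
    "fileFormat": "csv",
    "fieldDelimiter": ","
}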

Required: Yes

Default value: None

ossBlockSize

The OSS block size. The default block size is 16 MB. When the file is written in parquet or ORC format, you can add this parameter at the same level as the object parameter.

Because OSS multipart upload supports a maximum of 10,000 blocks, the default single file size is limited to 160 GB. If the number of blocks exceeds the limit, you can increase the block size to support larger file uploads.
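For example, assuming the value is specified in MB (as the default of 16 suggests), raising the block size to 64 lifts the single-file ceiling from about 10,000 × 16 MB ≈ 160 GB to about 10,000 × 64 MB ≈ 640 GB. A minimal sketch with placeholder values:

"parameter": {
    "object": "cdo/datax",// A placeholder object prefix.
    "fileFormat": "parquet",
    "ossBlockSize": 64// Assumed to be in MB; 10,000 blocks x 64 MB allows files up to about 640 GB.
}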

Required: No

Default value: 16 (MB)

writeMode

The data processing method before OSS Writer writes data:

  • truncate: Clears all objects that match the object name prefix before writing. For example, if "object":"abc", all objects starting with "abc" will be cleared.

  • append: No processing is done before writing. Data Integration OSS Writer directly writes using the object name and adds a random UUID suffix to ensure that file names do not conflict. For example, if the object name you specify is "DI", the actual written name will be DI_****_****_****.

  • nonConflict: If an object with a matching prefix is found at the specified path, an error is reported directly. For example, if "object":"abc" and an object named "abc123" exists, an error will be reported.
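A brief sketch of a truncate-mode writer configuration; the object prefix is a placeholder:

"parameter": {
    "object": "abc",// All existing objects whose names start with abc are cleared before each write.
    "writeMode": "truncate",
    "fileFormat": "csv"
}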

Required: Yes

Default value: None

writeSingleObject

Specifies whether to write data to a single file when writing to OSS:

  • true: Writes to a single file. When no data can be read, no empty file is generated.

  • false: Writes to multiple files. When no data can be read, if a file header is configured, an empty file containing only the file header is output. Otherwise, only an empty file is output.

Note
  • When writing ORC or Parquet data, the writeSingleObject parameter does not take effect. This means you cannot use this parameter to write to a single ORC or Parquet file in a multi-concurrency scenario. To write to a single file, you can set the concurrency to 1. However, a random suffix will be added to the file name, and setting the concurrency to 1 will affect the speed of the sync task.

  • In some scenarios, such as when the source is Hologres, data will be read by shard. Even with a single concurrency, multiple files may still be generated.

Required: No

Default value: false

fileFormat

The format of the output file. The following formats are supported:

  • csv: Only strict csv format is supported. If the data to be written includes a column delimiter, it will be escaped according to the csv escape syntax. The escape character is a double quotation mark (").

  • text: Simply splits the data to be written using the column delimiter. No escaping is performed if the data to be written includes the column delimiter.

  • parquet: If you use this file type, you must add the parquetSchema parameter to define the data type.

  • orc: If you use this format, you must switch to the code editor.

Required: No

Default value: text

compress

The compression format of the data file written to OSS. This must be configured in a script task.

Important

CSV and TEXT file types do not support compression. Parquet/ORC files only support SNAPPY compression.

Required: No

Default value: None

fieldDelimiter

The column delimiter for writing.

Required: No

Default value: a comma (,)

encoding

The encoding configuration of the output file.

Required: No

Default value: utf-8

parquetSchema

This is a required parameter for writing to OSS in Parquet file format. It describes the structure of the object file. Therefore, this parameter is effective only when fileFormat is set to parquet. The format is as follows.

message MessageTypeName {
Required/Optional, Data Type, Column Name;
......................;
}

The configuration items are described as follows:

  • MessageType name: Enter a name.

  • Required/Optional: `required` means not null, `optional` means nullable. We recommend that you set all to `optional`.

  • Data type: Parquet files support BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY (use BINARY for string types), and FIXED_LEN_BYTE_ARRAY types.

Note

Each row setting must end with a semicolon. The last row must also have a semicolon.

The following is an example.

message m {
optional int64 id;
optional int64 date_id;
optional binary datetimestring;
optional int32 dspId;
optional int32 advertiserId;
optional int32 status;
optional int64 bidding_req_num;
optional int64 imp;
optional int64 click_num;
}

Required: No

Default value: None

nullFormat

In text files, null (a null pointer) cannot be defined with a standard string. The data synchronization system provides nullFormat to define which strings can represent null. For example, if you configure nullFormat="null" and the source data is null, the data synchronization system will treat it as a null field.

Required: No

Default value: None

header

The header of the output file when writing to OSS. For example, ["id", "name", "age"].

Required: No

Default value: None

maxFileSize (Advanced configuration, not supported in codeless UI)

The maximum size of a single object file when writing to OSS. The default is 10,000 × 10 MB. This is similar to controlling the size of a log file when printing log4j logs. When uploading in parts to OSS, each part is 10 MB (which is also the minimum granularity for log rotation files, meaning a maxFileSize less than 10 MB will be treated as 10 MB). Each OSS InitiateMultipartUploadRequest supports a maximum of 10,000 parts.

When rotation occurs, the object name rule is to append suffixes like _1, _2, _3 to the original object prefix with a random UUID.

Note
  • The default unit is MB.

  • Configuration example: "maxFileSize":300 sets the single file size to 300 MB.

  • maxFileSize only takes effect for csv and text formats. It is calculated at the memory level of the sync task process and cannot precisely control the actual size of the destination file. The actual file size written to the destination may exceed expectations due to data bloat.

Required: No

Default value: 100,000 (MB)

suffix (Advanced configuration, not supported in codeless UI)

The suffix of the file name generated when data synchronization writes data. For example, if you configure suffix as .csv, the final written file name will be fileName****.csv.

Required: No

Default value: None

Appendix: Conversion policy for Parquet data types

If you do not configure the parquetSchema parameter, DataWorks converts the data types based on the source field types. The following table describes the conversion policy.

Converted data type        Parquet type            Parquet logical type
CHAR / VARCHAR / STRING    BINARY                  UTF8
BOOLEAN                    BOOLEAN                 Not applicable
BINARY / VARBINARY         BINARY                  Not applicable
DECIMAL                    FIXED_LEN_BYTE_ARRAY    DECIMAL
TINYINT                    INT32                   INT_8
SMALLINT                   INT32                   INT_16
INT / INTEGER              INT32                   Not applicable
BIGINT                     INT64                   Not applicable
FLOAT                      FLOAT                   Not applicable
DOUBLE                     DOUBLE                  Not applicable
DATE                       INT32                   DATE
TIME                       INT32                   TIME_MILLIS
TIMESTAMP / DATETIME       INT96                   Not applicable