
DataWorks: OSS-HDFS

Last Updated: Mar 26, 2026

OSS-HDFS Service (JindoFS Service) is a cloud-native data lake storage product. An OSS-HDFS data source provides a bidirectional channel to read from and write to OSS-HDFS. This topic describes the data synchronization capabilities that DataWorks provides for OSS-HDFS.

Supported capabilities

Capability | Supported
Offline read | Yes
Offline write | Yes
Real-time write | Yes

Limitations

Offline read

The network connection from a resource group to OSS-HDFS can be complex. To run data synchronization tasks, use a Serverless resource group (recommended) or an exclusive resource group for Data Integration. Ensure that your resource group can access OSS-HDFS over the network.

OSS-HDFS Reader supports the following:

  • Files in text, CSV, ORC, and Parquet formats. The file content must be a logical two-dimensional table.

  • Reading multiple data types and column constants.

  • Recursive reads and the wildcard characters * and ?.

  • Concurrent reads from multiple files. The actual number of concurrent threads is the smaller value between the number of files to read and the concurrent setting.

Important

OSS-HDFS Reader does not support multi-threaded concurrent reads from a single file due to the internal chunking algorithm for single files.
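The concurrency rule above can be sketched in a few lines. This is a minimal illustration, not the reader's actual implementation: `match_files` and `effective_concurrency` are hypothetical helpers that mimic the documented behavior (wildcard matching with `*` and `?`, and a thread count equal to the smaller of the file count and the concurrent setting).

```python
from fnmatch import fnmatch

def match_files(all_files, pattern):
    """Select files whose paths match a glob-style pattern.
    Mirrors the documented wildcard support (* and ? only)."""
    return [f for f in all_files if fnmatch(f, pattern)]

def effective_concurrency(matched_files, concurrent):
    """Actual reader thread count: the smaller of the number of files
    and the configured concurrent setting. A single file is always
    read by one thread, since single-file reads are not parallelized."""
    return min(len(matched_files), concurrent)

files = [
    "/hadoop/data_20170401.txt",
    "/hadoop/data_20170402.txt",
    "/hadoop/other.txt",
]
matched = match_files(files, "/hadoop/data_201704*")
print(matched)                            # the two data_201704 files
print(effective_concurrency(matched, 3))  # min(2, 3) -> 2
```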

Offline write

  • OSS-HDFS Writer supports only text, ORC, and Parquet formats. The file content must be a logical two-dimensional table.

  • For text files, ensure that the field delimiter used for writing matches the delimiter used when creating the Hive table. This ensures that the data written to OSS-HDFS maps correctly to Hive table fields.
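The delimiter requirement can be seen with a small round-trip sketch. The helpers below are hypothetical and only illustrate the failure mode: a line written with one delimiter only splits back into the intended columns when Hive's table DDL (its `FIELDS TERMINATED BY` clause) declares the same delimiter.

```python
def serialize_row(fields, delimiter=","):
    """Serialize one record the way a text-format writer would:
    field values joined by the configured fieldDelimiter."""
    return delimiter.join(str(f) for f in fields)

def hive_parse_row(line, table_delimiter):
    """Sketch of how Hive splits a text line back into columns, using
    the delimiter declared in the table DDL."""
    return line.split(table_delimiter)

row = ["1001", "alice", "3.14"]
line = serialize_row(row, delimiter=",")

# Delimiters match: the row round-trips into the same three columns.
print(hive_parse_row(line, ","))   # ['1001', 'alice', '3.14']

# Delimiters differ: Hive sees one unsplit column, misaligning every field.
print(hive_parse_row(line, "\t"))  # ['1001,alice,3.14']
```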

Real-time write

  • Supports real-time writes.

  • For the Hudi format, only version 0.14.x is supported.

Supported field types

Offline read

OSS-HDFS Reader converts data types from ParquetFile, ORCFile, TextFile, and CsvFile to the internal types that Data Integration supports.

Type category | OSS-HDFS data types
Integer | TINYINT, SMALLINT, INT, BIGINT
Floating-point | FLOAT, DOUBLE, DECIMAL
String | STRING, CHAR, VARCHAR
Date and time | DATE, TIMESTAMP
Boolean | BOOLEAN
Note

The following examples illustrate internal type representations:

  • LONG: Integer data in an OSS-HDFS file, such as 123456789.

  • DOUBLE: Floating-point data in an OSS-HDFS file, such as 3.1415.

  • BOOLEAN: Boolean data in an OSS-HDFS file, such as true or false. Values are not case-sensitive.

  • DATE: Date and time data in an OSS-HDFS file, such as 2014-12-31 00:00:00.
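The conversions listed above can be sketched as a small dispatch function. This is a hypothetical illustration of the documented mappings, not the actual Data Integration code; `to_internal` and its `date_format` parameter are assumptions made for the example.

```python
from datetime import datetime

def to_internal(value, type_name, date_format="%Y-%m-%d %H:%M:%S"):
    """Sketch of the internal type conversions: LONG for integers,
    DOUBLE for floating point, BOOLEAN parsed case-insensitively,
    and DATE parsed with the column's configured format string."""
    if type_name == "long":
        return int(value)
    if type_name == "double":
        return float(value)
    if type_name == "boolean":
        # 'true', 'True', and 'TRUE' are all accepted (not case-sensitive)
        return value.strip().lower() == "true"
    if type_name == "date":
        return datetime.strptime(value, date_format)
    return value  # strings pass through unchanged

print(to_internal("123456789", "long"))           # 123456789
print(to_internal("3.1415", "double"))            # 3.1415
print(to_internal("TRUE", "boolean"))             # True
print(to_internal("2014-12-31 00:00:00", "date"))
```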

Offline write

OSS-HDFS Writer writes files in TextFile, ORCFile, and ParquetFile formats to a specified path in the OSS-HDFS file system.

Type category | OSS-HDFS data types
Integer | TINYINT, SMALLINT, INT, BIGINT
Floating-point | FLOAT, DOUBLE
String | CHAR, VARCHAR, STRING
Boolean | BOOLEAN
Date and time | DATE, TIMESTAMP

Add a data source

Before developing a synchronization task in DataWorks, add the required data source by following the instructions in Data source management. Parameter descriptions are available in the DataWorks console when you add a data source.

Develop a data synchronization task

Configure an offline synchronization task for a single table

Configure a real-time synchronization task for a single table

See Configure real-time incremental synchronization for a single table and Configure a real-time synchronization task in DataStudio.

Configure a full and incremental real-time synchronization task for an entire database

See Configure a real-time synchronization task for an entire database.

Appendix: Script demos and parameter descriptions

Reader script demo

All parameters follow the unified script format required by the code editor. For format details, see Configure a task in the code editor.

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "oss_hdfs",
            "parameter": {
                "path": "",
                "datasource": "",
                "column": [
                    {
                        "index": 0,
                        "type": "string"
                    },
                    {
                        "index": 1,
                        "type": "long"
                    },
                    {
                        "index": 2,
                        "type": "double"
                    },
                    {
                        "index": 3,
                        "type": "boolean"
                    },
                    {
                        "format": "yyyy-MM-dd HH:mm:ss",
                        "index": 4,
                        "type": "date"
                    }
                ],
                "fieldDelimiter": ",",
                "encoding": "UTF-8",
                "fileFormat": ""
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "concurrent": 3,
            "throttle": true,
            "mbps": "12"
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Reader script parameters

  • path — Required; no default. The path of the file or directory to read. Three input styles are supported:

    • Single file: OSS-HDFS Reader uses a single thread to read the file.

    • Multiple files: OSS-HDFS Reader reads the files concurrently. The actual thread count is the smaller of the number of files and the concurrent setting. Use simple patterns such as /hadoop/data_201704*, or use scheduling parameters for time-based file names.

    • Wildcard path: OSS-HDFS Reader traverses all matching files. Specifying / reads every file in the root directory. Only * and ? are supported as wildcard characters.

    All files in a single synchronization job are treated as one data table; ensure that all files share the same schema. The AccessKey pair configured in the data source must have read permission on the corresponding OSS-HDFS path.

  • fileFormat — Required; no default. The file type. Valid values: text, orc, csv, parquet. OSS-HDFS Reader automatically detects the file type and applies the corresponding read policy. Before synchronization starts, it verifies that all files in the specified path match the value of fileFormat; the task fails if any file does not.

  • column — Required; no default. The list of fields to read. Set to ["*"] to read all columns as STRING. To specify individual columns, provide type together with either index (the 0-based position of the field in the data file) or value (a constant column that is generated instead of read from the file). Exactly one of index or value can be set per column entry.

  • fieldDelimiter — Optional; default ",". The field delimiter for reading TextFile data. Not required for ORC or Parquet files.

  • encoding — Optional; default utf-8. The file encoding.

  • nullFormat — Optional; no default. The string to treat as a null value. For example, setting nullFormat to "null" causes Data Integration to treat the source string null as a null field.

  • compress — Optional; no default. The compression format. Valid values: gzip, bzip2, snappy.
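The index-or-value rule for the column parameter can be sketched as follows. `project_row` is a hypothetical helper written for this illustration; it is not part of the actual reader.

```python
def project_row(raw_fields, column_config):
    """Sketch of the column parameter: each entry either reads a
    positional field from the file (index, 0-based) or emits a
    constant (value). Exactly one of the two must be set per entry."""
    out = []
    for col in column_config:
        has_index, has_value = "index" in col, "value" in col
        if has_index == has_value:
            raise ValueError("set exactly one of index or value per column")
        out.append(raw_fields[col["index"]] if has_index else col["value"])
    return out

raw = ["alice", "42", "3.14"]           # one parsed line from the source file
config = [
    {"index": 0, "type": "string"},
    {"index": 1, "type": "long"},
    {"value": "2024-01-01", "type": "string"},  # constant column, not read from file
]
print(project_row(raw, config))  # ['alice', '42', '2024-01-01']
```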

Writer script demo

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "oss_hdfs",
            "parameter": {
                "path": "",
                "fileName": "",
                "compress": "",
                "datasource": "",
                "column": [
                    {
                        "name": "col1",
                        "type": "string"
                    },
                    {
                        "name": "col2",
                        "type": "int"
                    },
                    {
                        "name": "col3",
                        "type": "double"
                    },
                    {
                        "name": "col4",
                        "type": "boolean"
                    },
                    {
                        "name": "col5",
                        "type": "date"
                    }
                ],
                "writeMode": "",
                "fieldDelimiter": ",",
                "encoding": "",
                "fileFormat": "text"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "concurrent": 3,
            "throttle": false
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Writer script parameters

  • fileFormat — Required; no default. The file type. Valid values: text, orc, parquet.

  • path — Required; no default. The path in the OSS-HDFS file system where data is stored. OSS-HDFS Writer writes multiple files to this directory based on the concurrency configuration. When associating with a Hive table, specify the Hive table's storage path on OSS-HDFS.

  • fileName — Required; no default. The base name for output files. A random suffix is appended for each concurrent thread to form the actual file names.

  • column — Required unless fileFormat is parquet; no default. The fields to write. Writing only a subset of columns is not supported. When associating with a Hive table, specify all field names and types. Use name for the field name and type for the field type.

  • writeMode — Required; no default. How OSS-HDFS Writer handles existing files before writing. The writer uses a write-then-rename strategy: data is first written to a temporary directory named by the path_random rule, then moved to the destination path with guaranteed-unique file names after all writes complete; the temporary directory is then deleted automatically. If the connection is interrupted, the temporary directory and any partially written files are not cleaned up automatically; delete them manually before retrying. Valid values:

    • append: writes directly without preprocessing; file name conflicts are avoided by construction.

    • nonConflict: reports an error if a file with the fileName prefix already exists in the directory.

    • truncate: deletes all files matching the fileName prefix before writing. For example, if fileName is abc, all files in the directory whose names start with abc are deleted first.

  • fieldDelimiter — Required unless fileFormat is parquet; no default. The field delimiter for output files. Only single-character delimiters are supported; multi-character delimiters cause a runtime error.

  • compress — Optional; no default. The compression format for text files. Valid values: gzip, bzip2. Leave empty to write uncompressed.

  • encoding — Optional; default utf-8. The file encoding.

  • parquetSchema — Optional; no default. The schema definition for Parquet output files; takes effect only when fileFormat is parquet. Use the following format: message <MessageName> { <required|optional> <DataType> <FieldName>; ... }. Supported data types: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY (use BINARY for string types), FIXED_LEN_BYTE_ARRAY. Use optional for nullable fields and required for non-null fields; setting all fields to optional is recommended. Every field definition must end with a semicolon, including the last one. Example: message m { optional int64 id; optional binary username; optional int32 status; }
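The three writeMode behaviors can be illustrated against a local directory. This is a hypothetical sketch of the documented semantics only; `check_write_mode` is not the writer's real code, and the prefix check on local files stands in for the OSS-HDFS destination path.

```python
import os
import tempfile

def check_write_mode(dest_dir, file_name, write_mode):
    """Sketch of writeMode preprocessing: existing files are matched by
    the fileName prefix, then handled per mode. Returns the list of
    files deleted (non-empty only for truncate)."""
    conflicts = [f for f in os.listdir(dest_dir) if f.startswith(file_name)]
    if write_mode == "append":
        return []                      # write alongside existing files
    if write_mode == "nonConflict":
        if conflicts:
            raise FileExistsError(f"files with prefix {file_name!r} already exist")
        return []
    if write_mode == "truncate":
        for f in conflicts:            # delete every fileName-prefixed file first
            os.remove(os.path.join(dest_dir, f))
        return conflicts
    raise ValueError(f"unknown writeMode: {write_mode}")

# Demo in a throwaway directory with two pre-existing "abc" files.
with tempfile.TemporaryDirectory() as d:
    for name in ("abc_1.txt", "abc_2.txt", "other.txt"):
        open(os.path.join(d, name), "w").close()
    removed = check_write_mode(d, "abc", "truncate")
    print(sorted(removed))            # both abc-prefixed files were deleted
    print(sorted(os.listdir(d)))      # only other.txt remains
```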