OSS-HDFS Service (JindoFS Service) is a cloud-native data lake storage product. An OSS-HDFS data source provides a bidirectional channel to read from and write to OSS-HDFS. This topic describes the data synchronization capabilities that DataWorks provides for OSS-HDFS.
Supported capabilities
| Capability | Supported |
|---|---|
| Offline read | Yes |
| Offline write | Yes |
| Real-time write | Yes |
Limitations
Offline read
The network connection from a resource group to OSS-HDFS can be complex. To run data synchronization tasks, use a Serverless resource group (recommended) or an exclusive resource group for Data Integration. Ensure that your resource group can access OSS-HDFS over the network.
OSS-HDFS Reader supports the following:
Files in text, CSV, ORC, and Parquet formats. The file content must be a logical two-dimensional table.
Reading multiple data types and column constants.
Recursive reads and the wildcard characters * and ?.
Concurrent reads from multiple files. The actual number of concurrent threads is the smaller value between the number of files to read and the concurrent setting.
OSS-HDFS Reader does not support multi-threaded concurrent reads from a single file due to the internal chunking algorithm for single files.
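As an illustration of how the path and concurrency settings interact, the following is a minimal reader-step sketch; the data source name and path are hypothetical:

```json
{
  "stepType": "oss_hdfs",
  "parameter": {
    "datasource": "my_oss_hdfs_source",
    "path": "/hadoop/data_201704*",
    "fileFormat": "text",
    "fieldDelimiter": ",",
    "encoding": "UTF-8",
    "column": [{ "index": 0, "type": "string" }]
  }
}
```

If this wildcard matches 10 files and setting.speed.concurrent is 3, at most 3 files are read in parallel; if it matches only 2 files, only 2 threads are used.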
Offline write
OSS-HDFS Writer supports only text, ORC, and Parquet formats. The file content must be a logical two-dimensional table.
For text files, ensure that the field delimiter used for writing matches the delimiter used when creating the Hive table. This ensures that the data written to OSS-HDFS maps correctly to Hive table fields.
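For example, if the target Hive table was created with ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t', the writer should use a tab as its fieldDelimiter as well. The following is a sketch only; the path, file name, and column names are hypothetical:

```json
{
  "stepType": "oss_hdfs",
  "parameter": {
    "datasource": "my_oss_hdfs_source",
    "path": "/user/hive/warehouse/ods_orders",
    "fileName": "orders",
    "fileFormat": "text",
    "fieldDelimiter": "\t",
    "writeMode": "append",
    "column": [
      { "name": "order_id", "type": "bigint" },
      { "name": "amount", "type": "double" }
    ]
  }
}
```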
Real-time write
Real-time writes are supported, including writes in Hudi format version 0.14.x.
Supported field types
Offline read
OSS-HDFS Reader converts data types from ParquetFile, ORCFile, TextFile, and CsvFile to the internal types that Data Integration supports.
| Type category | OSS-HDFS data types |
|---|---|
| Integer | TINYINT, SMALLINT, INT, BIGINT |
| Floating-point | FLOAT, DOUBLE, DECIMAL |
| String | STRING, CHAR, VARCHAR |
| Date and time | DATE, TIMESTAMP |
| Boolean | BOOLEAN |
The following examples illustrate internal type representations:
LONG: Integer data in an OSS-HDFS file, such as 123456789.
DOUBLE: Floating-point data in an OSS-HDFS file, such as 3.1415.
BOOLEAN: Boolean data in an OSS-HDFS file, such as true or false. Values are not case-sensitive.
DATE: Date and time data in an OSS-HDFS file, such as 2014-12-31 00:00:00.
Offline write
OSS-HDFS Writer writes files in TextFile, ORCFile, and ParquetFile formats to a specified path in the OSS-HDFS file system.
| Type category | OSS-HDFS data types |
|---|---|
| Integer | TINYINT, SMALLINT, INT, BIGINT |
| Floating-point | FLOAT, DOUBLE |
| String | CHAR, VARCHAR, STRING |
| Boolean | BOOLEAN |
| Date and time | DATE, TIMESTAMP |
Add a data source
Before developing a synchronization task in DataWorks, add the required data source by following the instructions in Data source management. Parameter descriptions are available in the DataWorks console when you add a data source.
Develop a data synchronization task
Configure an offline synchronization task for a single table
For step-by-step instructions, see Configure a task in the codeless UI and Configure a task in the code editor.
For a complete parameter list and script example, see Appendix: Script demos and parameter descriptions.
Configure a real-time synchronization task for a single table
See Configure real-time incremental synchronization for a single table and Configure a real-time synchronization task in DataStudio.
Configure a full and incremental real-time synchronization task for an entire database
See Configure a real-time synchronization task for an entire database.
Appendix: Script demos and parameter descriptions
Reader script demo
All parameters follow the unified script format required by the code editor. For format details, see Configure a task in the code editor.
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "oss_hdfs",
"parameter": {
"path": "",
"datasource": "",
"column": [
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "long"
},
{
"index": 2,
"type": "double"
},
{
"index": 3,
"type": "boolean"
},
{
"format": "yyyy-MM-dd HH:mm:ss",
"index": 4,
"type": "date"
}
],
"fieldDelimiter": ",",
"encoding": "UTF-8",
"fileFormat": ""
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "stream",
"parameter": {},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"concurrent": 3,
"throttle": true,
"mbps": "12"
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Reader script parameters
| Parameter | Description | Required | Default value |
|---|---|---|---|
| path | The path of the file or directory to read. Three input styles are supported. Option 1: a single file. OSS-HDFS Reader uses a single thread to read the file. Option 2: multiple files. OSS-HDFS Reader reads the files concurrently; the actual thread count is the smaller value between the number of files and the concurrent setting. Use wildcard patterns such as /hadoop/data_201704*, or use scheduling parameters for time-based file names. Option 3: a wildcard path. OSS-HDFS Reader traverses all matching files. Specifying / reads all files in the root directory. Only * and ? are supported as wildcard characters. All files in a single synchronization job are treated as one data table; ensure that all files share the same schema. The AccessKey pair configured in the data source must have read permissions on the corresponding OSS-HDFS path. | Yes | None |
| fileFormat | The file type. Valid values: text, orc, csv, parquet. OSS-HDFS Reader auto-detects the file type and applies the corresponding read policy. Before synchronization starts, it verifies that all files in the specified path match the value of fileFormat. The task fails if there is a mismatch. | Yes | None |
| column | The list of fields to read. Set to ["*"] to read all columns as STRING. To specify individual columns, provide type and either index (reads the column at that position in the data file, starting at 0) or value (generates a constant column without reading from the file). Only one of index and value can be set per column entry. | Yes | None |
| fieldDelimiter | The field delimiter for reading TextFile data. Not required for ORC or Parquet files. | No | , |
| encoding | The file encoding. | No | utf-8 |
| nullFormat | The string to treat as a null value. For example, setting nullFormat to "null" causes Data Integration to treat the source string null as a null field. | No | None |
| compress | The compression format. Valid values: gzip, bzip2, snappy. | No | None |
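The column semantics above can be combined. The following parameter fragment (path, data source name, and constant value are hypothetical) reads two columns by index, appends a constant column, and treats the literal string \N as a null field:

```json
"parameter": {
  "datasource": "my_oss_hdfs_source",
  "path": "/data/logs/2024-01-??/log_*.csv",
  "fileFormat": "csv",
  "encoding": "UTF-8",
  "nullFormat": "\\N",
  "column": [
    { "index": 0, "type": "string" },
    { "index": 1, "type": "long" },
    { "value": "batch_20240101", "type": "string" }
  ]
}
```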
Writer script demo
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "stream",
"parameter": {},
"name": "Reader",
"category": "reader"
},
{
"stepType": "oss_hdfs",
"parameter": {
"path": "",
"fileName": "",
"compress": "",
"datasource": "",
"column": [
{
"name": "col1",
"type": "string"
},
{
"name": "col2",
"type": "int"
},
{
"name": "col3",
"type": "double"
},
{
"name": "col4",
"type": "boolean"
},
{
"name": "col5",
"type": "date"
}
],
"writeMode": "",
"fieldDelimiter": ",",
"encoding": "",
"fileFormat": "text"
},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"concurrent": 3,
"throttle": false
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Writer script parameters
| Parameter | Description | Required | Default value |
|---|---|---|---|
| fileFormat | The file type. Valid values: text, orc, parquet. | Yes | None |
| path | The path in the OSS-HDFS file system where the data is stored. OSS-HDFS Writer writes multiple files to this directory based on the concurrency configuration. When associating with a Hive table, specify the Hive table's storage path on OSS-HDFS. | Yes | None |
| fileName | The base name for output files. A random suffix is appended to this name for each concurrent thread to form the actual file names. | Yes | None |
| column | The fields to write. Writing to a subset of columns is not supported. When associating with a Hive table, specify all field names and types. Use name for the field name and type for the field type. Not required when fileFormat is parquet. | Yes (not required if fileFormat is parquet) | None |
| writeMode | How OSS-HDFS Writer handles existing files before writing. OSS-HDFS Writer uses a write-then-rename strategy: data is first written to a temporary directory named using the path_random rule, then moved to the destination path with unique file names after all writes complete. After the move, the temporary directory is deleted automatically. If the connection is interrupted, the temporary directory and any partially written files are not cleaned up automatically; delete them manually before retrying. Valid values: append (writes directly without pre-processing, ensuring no file name conflicts), nonConflict (reports an error if a file with the fileName prefix already exists in the directory), and truncate (deletes all files matching the fileName prefix before writing; for example, if fileName is abc, all files starting with abc in the directory are deleted first). | Yes | None |
| fieldDelimiter | The field delimiter for output files. Only single-character delimiters are supported; multiple characters cause a runtime error. Not required when fileFormat is parquet. | Yes (not required if fileFormat is parquet) | None |
| compress | The compression format for text files. Valid values: gzip, bzip2. Leave blank to write without compression. | No | None |
| encoding | The file encoding. | No | utf-8 |
| parquetSchema | The schema definition for Parquet output files. Takes effect only when fileFormat is parquet. Format: message <MessageName> { <required or optional> <DataType> <FieldName>; ... }. Supported data types: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY (use BINARY for string types), FIXED_LEN_BYTE_ARRAY. Use optional for nullable fields and required for non-null fields; setting all fields to optional is recommended. Each field definition must end with a semicolon, including the last one. Example: message m { optional int64 id; optional binary username; optional int32 status; } | No | None |
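Putting the Parquet-related parameters together, a writer parameter fragment could look like the following sketch (the data source name, path, and field names are hypothetical; column and fieldDelimiter are omitted because they are not required for Parquet):

```json
"parameter": {
  "datasource": "my_oss_hdfs_source",
  "path": "/warehouse/dwd_user",
  "fileName": "user",
  "fileFormat": "parquet",
  "writeMode": "truncate",
  "parquetSchema": "message m { optional int64 id; optional binary username; optional int32 status; }"
}
```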