
DataWorks:OSS-HDFS data source

Last Updated:Nov 13, 2023

OSS-HDFS (JindoFS) is a cloud-native data lake storage service. DataWorks provides OSS-HDFS Reader and OSS-HDFS Writer for you to read data from and write data to OSS-HDFS data sources. This topic describes the capabilities of synchronizing data from or to OSS-HDFS data sources.

Limits

Batch data read

  • When you use OSS-HDFS Reader, take note of the following items:

    Network connections between a resource group and OSS-HDFS are complex. We recommend that you use an exclusive resource group for Data Integration to run your data synchronization task to which the OSS-HDFS data source is added. Make sure that your exclusive resource group for Data Integration can access the network where the OSS-HDFS data source resides.

  • OSS-HDFS Reader supports the following features:

    • Supports the text, CSV, ORC, and Parquet file formats. Data stored in the files in these formats must be organized as logical two-dimensional tables.

    • Reads data of various types and supports constants.

    • Supports recursive reading and the asterisk (*) and question mark (?) wildcards.

    • Uses parallel threads to read data from multiple files.

Important

Currently, OSS-HDFS Reader cannot read data from a single file by using parallel threads due to the internal sharding method.
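The wildcard support mentioned above (asterisks and question marks) follows shell-style file name matching. The following sketch uses Python's fnmatch module as an illustrative analogy; the file names are hypothetical, and this is not the reader's actual matching code.

```python
from fnmatch import fnmatch

# Hypothetical file listing. Like shell wildcards, * matches any run of
# characters and ? matches exactly one character.
files = [
    "/hadoop/data_20170401.csv",
    "/hadoop/data_20170402.csv",
    "/hadoop/data_2017040a.csv",
    "/hadoop/other_20170401.csv",
]

def match_paths(paths, pattern):
    """Return the paths that match a shell-style wildcard pattern."""
    return [p for p in paths if fnmatch(p, pattern)]

# * matches any suffix: selects the three data_201704... files.
matched_star = match_paths(files, "/hadoop/data_201704*")

# ? matches exactly one character before ".csv".
matched_question = match_paths(files, "/hadoop/data_2017040?.csv")
```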

Batch data write

Take note of the following items when you use OSS-HDFS Writer:

  • OSS-HDFS Writer can write only text, ORC, and Parquet files that store logical two-dimensional tables to OSS-HDFS.

  • To write a text file to OSS-HDFS, make sure that the delimiter in the file is the same as that in the Hive table that you want to associate with the file. This way, you can associate the columns in the file that is written to OSS-HDFS with those in the Hive table.

How it works

OSS-HDFS Writer writes files to OSS-HDFS in the following way:

  1. Creates a temporary directory in OSS-HDFS based on the path parameter that you specified. The directory name does not conflict with any existing directory.

    The name of the temporary directory is in the format path_<random suffix>.

  2. Writes files that are obtained from a reader to the temporary directory.

  3. Moves the files from the temporary directory to the specified directory after all the files are written. The names of the files that you want to write to OSS-HDFS must be different from those of existing files in OSS-HDFS.

  4. Deletes the temporary directory. If OSS-HDFS Writer fails to connect to OSS-HDFS due to a network interruption, you must manually delete the temporary directory and all the files in the temporary directory.
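The staged-write flow above can be sketched as follows. This is an illustrative local-filesystem analogy, not the actual OSS-HDFS Writer implementation; the function name and the use of a UUID as the random suffix are assumptions.

```python
import os
import shutil
import uuid

def write_with_staging(target_dir, files):
    """Sketch of the staged-write flow: write to a temporary directory named
    <path>_<random suffix>, move the files into place, then clean up."""
    tmp_dir = f"{target_dir.rstrip('/')}_{uuid.uuid4().hex}"   # step 1
    os.makedirs(tmp_dir)
    try:
        for name, data in files.items():                        # step 2
            with open(os.path.join(tmp_dir, name), "w") as f:
                f.write(data)
        os.makedirs(target_dir, exist_ok=True)
        for name in files:                                      # step 3
            dest = os.path.join(target_dir, name)
            if os.path.exists(dest):
                # New file names must not collide with existing files.
                raise FileExistsError(dest)
            shutil.move(os.path.join(tmp_dir, name), dest)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)              # step 4
```

Note that if the process is interrupted between steps, the temporary directory may be left behind, which mirrors the manual-cleanup caveat above.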

Data type mappings

Batch data read

The following table lists the data type mappings based on which OSS-HDFS Reader converts data types in Parquet, ORC, text, and CSV files.

Category       | OSS-HDFS data type
---------------|-----------------------------------
Integer        | TINYINT, SMALLINT, INT, and BIGINT
Floating point | FLOAT, DOUBLE, and DECIMAL
String         | STRING, CHAR, and VARCHAR
Date and time  | DATE and TIMESTAMP
Boolean        | BOOLEAN

Note
  • LONG: data of the integer type in OSS-HDFS files, such as 123456789.

  • DOUBLE: data of the floating point type in OSS-HDFS files, such as 3.1415.

  • BOOLEAN: data of the Boolean type in OSS-HDFS files, such as true or false. The values are not case-sensitive.

  • DATE: data of the date and time type in OSS-HDFS files, such as 2014-12-31 00:00:00.
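The note above maps file values to internal types such as LONG, DOUBLE, BOOLEAN, and DATE. A minimal sketch of such a conversion follows; parse_field is a hypothetical helper for illustration, not part of DataWorks.

```python
from datetime import datetime

def parse_field(text, type_name):
    """Convert a raw text value to a typed value, mirroring the note above:
    LONG for integers, DOUBLE for floating point, BOOLEAN (case-insensitive),
    and DATE for 'yyyy-MM-dd HH:mm:ss' values. Other types pass through."""
    t = type_name.upper()
    if t == "LONG":
        return int(text)
    if t == "DOUBLE":
        return float(text)
    if t == "BOOLEAN":
        return text.lower() == "true"   # values are not case-sensitive
    if t == "DATE":
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return text  # STRING, CHAR, VARCHAR: keep the raw string
```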

Batch data write

OSS-HDFS Writer can write text, ORC, or Parquet files to a specified directory in OSS-HDFS.

The following table lists the data type mappings based on which OSS-HDFS Writer converts data types.

Category       | OSS-HDFS data type
---------------|-----------------------------------
Integer        | TINYINT, SMALLINT, INT, and BIGINT
Floating point | FLOAT and DOUBLE
String         | CHAR, VARCHAR, and STRING
Boolean        | BOOLEAN
Date and time  | DATE and TIMESTAMP

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a data synchronization task, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the task.

Add a data source

Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.

Configure a batch synchronization task to synchronize data of a single table

Appendix: Code and parameters

Appendix: Configure a batch synchronization task by using the code editor

If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.

Code for OSS-HDFS Reader

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "oss_hdfs",// The plug-in name.
            "parameter": {
                "path": "",// The path of the file from which you want to read data.
                "datasource": "",// The name of the data source.
                "column": [
                    {
                        "index": 0,// The index of the column in the source file. The index starts from 0, which indicates that OSS-HDFS Reader reads data from the first column of the source file. 
                        "type": "string"// The field type.
                    },
                    {
                        "index": 1,
                        "type": "long"
                    },
                    {
                        "index": 2,
                        "type": "double"
                    },
                    {
                        "index": 3,
                        "type": "boolean"
                    },
                    {
                        "format": "yyyy-MM-dd HH:mm:ss", // The time format.
                        "index": 4,
                        "type": "date"
                    }
                ],
                "fieldDelimiter": ",",// The column delimiter.
                "encoding": "UTF-8",// The encoding format.
                "fileFormat": ""// The file format.
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""// The maximum number of dirty data records allowed.
        },
        "speed": {
            "concurrent": 3,// The maximum number of parallel threads.
            "throttle": true // Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "mbps":"12"// The maximum transmission rate. Unit: MB/s. 
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Parameters in code for OSS-HDFS Reader

The following list describes each parameter, whether it is required, and its default value.

path

The path of the file from which you want to read data.

  • If you specify a single OSS-HDFS file, OSS-HDFS Reader uses only one thread to read data from the file.

  • If you specify multiple OSS-HDFS files, OSS-HDFS Reader uses parallel threads to read data from the files. The number of threads is determined by the concurrent parameter. To read data from multiple files, you can specify a path that contains wildcards, such as /hadoop/data_201704*. If the file names contain time information that follows a regular pattern, you can use scheduling parameters together with such a path. The values of the scheduling parameters are replaced based on the data timestamp of the task. For more information about scheduling parameters, see Supported formats of scheduling parameters.

    Note

    The number of threads that are actually started is the smaller of the number of OSS-HDFS files to read and the configured number of parallel threads.

  • If a path contains a wildcard, OSS-HDFS Reader attempts to read data from all files that match the path. For example, if you specify the path as /oss-hdfs/, OSS-HDFS Reader reads all files in the oss-hdfs directory. OSS-HDFS Reader supports only asterisks (*) and question marks (?) as wildcards. The syntax is similar to the syntax of file name wildcards used in the Linux command line.

Important
  • Data Integration considers all the files to read in a data synchronization task as a single table. Make sure that all the files can adapt to the same schema and Data Integration has the permissions to read all these files.

  • Make sure that the AccessKey pair you specified when you add the OSS-HDFS data source has the read permissions on the OSS-HDFS data source.

Required: Yes. Default value: none.
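As noted for the path parameter, the number of read threads actually started is the smaller of the number of matched files and the configured concurrency. A one-line sketch (the helper name is hypothetical):

```python
def actual_concurrency(num_files, concurrent):
    """The number of read threads actually started is the smaller of the
    number of matched files and the configured 'concurrent' value."""
    return min(num_files, concurrent)
```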

fileFormat

The format of the file from which you want to read data.

OSS-HDFS Reader automatically identifies the file format and uses the related read policies. Before OSS-HDFS Reader reads data, it checks whether all the files in the specified path match the format specified by the fileFormat parameter. If the format of a file does not match the format specified by the fileFormat parameter, the data synchronization task fails.

Valid values of the fileFormat parameter:

  • TEXT: the text format.

  • ORC: the ORC format.

  • CSV: the CSV format, which is a common OSS-HDFS file format. The data in a CSV file is organized as a logical two-dimensional table.

  • PARQUET: the Parquet format.

Required: Yes. Default value: none.

column

The columns from which you want to read data. By default, OSS-HDFS Reader reads all data as strings. To read all columns as strings, set this parameter to "column": ["*"].

You can also specify the column parameter in the following format:

  • type: Specifies the data type of a source column.

  • index: Specifies the ID of a source column, starting from 0.

  • value: Specifies a constant. If you specify the value field, OSS-HDFS Reader reads the value of this field.

Note

For the column parameter, you must configure the type field and either the index field or the value field.

{
  "type": "long",
  "index": 0
  // The first LONG-type column of the source file. The index starts from 0. The index field indicates the IDs of the columns from which you want to read data in the file. 
},
{
  "type": "string",
  "value": "alibaba"
  // The value of the current column, which is a constant column alibaba. It is internally generated by OSS-HDFS Reader. 
}

Required: Yes. Default value: none.
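The mix of index-based columns and constant (value) columns shown above can be illustrated with a small sketch. project_row is a hypothetical helper, not part of DataWorks; it shows how one parsed source row is projected through a column configuration.

```python
def project_row(row, columns):
    """Apply a 'column' configuration to one parsed source row:
    entries with 'index' pick a source field by position (starting from 0);
    entries with 'value' emit a constant generated by the reader."""
    out = []
    for col in columns:
        if "index" in col:
            out.append(row[col["index"]])
        else:
            out.append(col["value"])  # constant column
    return out
```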

fieldDelimiter

The delimiter of the columns from which you want to read data. If the source files are text files, you must specify a column delimiter. If you do not specify a column delimiter, OSS-HDFS Reader uses commas (,) as column delimiters by default. If the source files are ORC or Parquet files, you do not need to specify a column delimiter.

Required: No. Default value: comma (,).
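Splitting a line of a text or CSV source file by the configured delimiter can be sketched as follows; split_record is a hypothetical helper whose default matches OSS-HDFS Reader's default comma delimiter.

```python
def split_record(line, field_delimiter=","):
    """Split one line of a text source file into columns.
    The default delimiter is a comma, matching the reader's default."""
    return line.rstrip("\n").split(field_delimiter)
```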

encoding

The encoding format of the file from which you want to read data.

Required: No. Default value: utf-8.

nullFormat

The string that represents a null value. Text files have no standard string that represents a null value, so you can use this parameter to define which string is interpreted as null.

For example, if you set this parameter to null, Data Integration considers the string null as a null value.

Required: No. Default value: none.
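The nullFormat behavior can be sketched in one line; apply_null_format is a hypothetical helper for illustration.

```python
def apply_null_format(value, null_format):
    """Return None when the raw text equals the configured nullFormat string;
    otherwise return the raw text unchanged."""
    return None if value == null_format else value
```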

compress

The compression format. The following compression formats are supported: GZIP, BZIP2, and Snappy.

Required: No. Default value: none.

Code for OSS-HDFS Writer

{
    "type": "job",
    "version": "2.0",// The version number. 
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "oss_hdfs",// The plug-in name. 
            "parameter": {
                "path": "",// The directory in OSS-HDFS to which the files are written. 
                "fileName": "",// The name prefix of the files that you want to write to OSS-HDFS. 
                "compress": "",// The compression format of the files that you want to write to OSS-HDFS. 
                "datasource": "",// The name of the data source. 
                "column": [
                    {
                        "name": "col1",// The name of a column. 
                        "type": "string"// The data type of a column. 
                    },
                    {
                        "name": "col2",
                        "type": "int"
                    },
                    {
                        "name": "col3",
                        "type": "double"
                    },
                    {
                        "name": "col4",
                        "type": "boolean"
                    },
                    {
                        "name": "col5",
                        "type": "date"
                    }
                ],
                "writeMode": "",// The write mode. 
                "fieldDelimiter": ",",// The column delimiter. 
                "encoding": "",// The encoding format. 
                "fileFormat": "text"// The format of the files that you want to write to OSS-HDFS. 
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""// The maximum number of dirty data records allowed. 
        },
        "speed": {
            "concurrent": 3,// The maximum number of parallel threads. 
            "throttle": false // Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Parameters in code for OSS-HDFS Writer

The following list describes each parameter, whether it is required, and its default value.

fileFormat

The format of the files that you want to write to OSS-HDFS. Valid values:

  • text: writes text files.

  • orc: writes ORC files.

  • parquet: writes Parquet files.

Required: Yes. Default value: none.

path

The directory in OSS-HDFS to which you want to write files. OSS-HDFS Writer writes multiple files to the directory based on the configuration of parallel threads.

To associate the columns in a file with those in a Hive table, set the path parameter to the storage path of the Hive table in OSS-HDFS. For example, you can specify a storage path for the Hive data warehouse.

Required: Yes. Default value: none.

fileName

The name prefix of the files that you want to write to OSS-HDFS. A random suffix is appended to the specified prefix to form the actual file name that is used by each thread.

Required: Yes. Default value: none.

column

The columns to which you want to write data. You must write data to all columns in the Hive table; writing to only a subset of the columns is not supported.

To associate the columns in a file with those in a Hive table, configure the name and type parameters for each column. The name parameter specifies the name of the column, and the type parameter specifies the data type of the column.

You can specify the column parameter in the following format:

{
    "column":
    [
        {
            "name": "userName",
            "type": "string"
        },
        {
            "name": "age",
            "type": "long"
        }
    ]
}

Required: Yes, if the fileFormat parameter is set to text or orc. Default value: none.

writeMode

The write mode. Valid values:

  • append: OSS-HDFS Writer writes the files based on the specified file name prefix and ensures that the actual file names do not conflict with the names of existing files.

  • nonConflict: OSS-HDFS Writer returns an error if a file with the specified file name prefix exists in the destination directory.

  • truncate: OSS-HDFS Writer deletes all existing files whose names start with the specified file name prefix from the destination directory before files are written to the directory. For example, if you set fileName to abc, all existing files whose names start with abc are deleted from the destination directory.

Required: Yes. Default value: none.
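The three write modes described above can be sketched against a local directory. This is an illustrative analogy, not the writer's actual implementation; resolve_write_mode is a hypothetical helper.

```python
import os

def resolve_write_mode(target_dir, file_prefix, write_mode):
    """Sketch of the writeMode semantics, applied to files in the
    destination directory whose names start with the configured prefix:
    - append: keep existing files (new names get a unique suffix).
    - nonConflict: raise an error if any matching file exists.
    - truncate: delete all matching files before writing."""
    existing = [f for f in os.listdir(target_dir) if f.startswith(file_prefix)]
    if write_mode == "nonConflict" and existing:
        raise FileExistsError(f"files with prefix {file_prefix!r} already exist")
    if write_mode == "truncate":
        for f in existing:
            os.remove(os.path.join(target_dir, f))
    # 'append' intentionally leaves existing files untouched.
```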

fieldDelimiter

The column delimiter that is used in the files you want to write to OSS-HDFS.

Note

Only single-character delimiters are supported. If you specify multi-character delimiters, an error is reported.

Required: Yes, if the fileFormat parameter is set to text or orc. Default value: none.

compress

The compression format of the files that you want to write to OSS-HDFS. By default, this parameter is left empty, which indicates that the files are not compressed.

For a text file, the GZIP and BZIP2 compression formats are supported.

Required: No. Default value: none.

encoding

The encoding format of the files that you want to write to OSS-HDFS.

Required: No. Default value: utf-8.

parquetSchema

The schema of the Parquet files that you want to write to OSS-HDFS. This parameter is available only if the fileFormat parameter is set to parquet. Format:

message MessageName {
required dataType columnName;
.....................;
}

Parameters:

  • MessageName: the name of the message.

  • required: indicates that the column cannot be left empty. You can also specify optional based on your business requirements. We recommend that you specify optional for all columns.

  • dataType: Parquet files support various data types, such as BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY, and FIXED_LEN_BYTE_ARRAY. Set this parameter to BINARY if the column stores strings.

Note

Each line, including the last line, must end with a semicolon (;).

Example:

message m {
optional int64 id;
optional int64 date_id;
optional binary datetimestring;
optional int32 dspId;
optional int32 advertiserId;
optional int32 status;
optional int64 bidding_req_num;
optional int64 imp;
optional int64 click_num;
}

Required: No. Default value: none.
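The parquetSchema format above (one "repetition type name;" declaration per line, each ending with a semicolon) can be sanity-checked with a small parser. parse_parquet_schema is a hypothetical helper for illustration; it is not part of DataWorks and does not validate every rule of the Parquet schema language.

```python
import re

def parse_parquet_schema(schema_text):
    """Parse a parquetSchema string into (repetition, type, name) tuples,
    one per column declaration. Each declaration must end with a semicolon."""
    body = re.search(r"message\s+\w+\s*\{(.*)\}", schema_text, re.S)
    if body is None:
        raise ValueError("schema must look like: message <name> { ... }")
    columns = []
    for decl in body.group(1).split(";"):
        decl = decl.strip()
        if not decl:
            continue
        rep, dtype, name = decl.split()
        if rep not in ("required", "optional", "repeated"):
            raise ValueError(f"bad repetition: {rep}")
        columns.append((rep, dtype, name))
    return columns
```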