This topic describes the parameters that are supported by DataHub Writer and how to configure DataHub Writer by using the codeless user interface (UI) and code editor.

DataHub is a real-time data distribution platform that is designed to process streaming data. You can publish and subscribe to streaming data in DataHub and distribute the data to other platforms. This allows you to analyze streaming data and build applications based on the streaming data.

DataHub is built on top of the Apsara distributed operating system, and features high availability, low latency, high scalability, and high throughput. DataHub is seamlessly integrated with Realtime Compute for Apache Flink, and allows you to use SQL statements to analyze streaming data. DataHub can also distribute streaming data to Alibaba Cloud services, such as MaxCompute and Object Storage Service (OSS).
Notice: Strings must be encoded in the UTF-8 format, and the size of each string cannot exceed 1 MB.

Channel types

The source is connected to the sink through a single channel. Therefore, the channel type that is configured for the writer must be the same as the channel type that is configured for the reader. Channels are categorized into two types: memory and file. In the following configuration, the channel type is set to file:

"agent.sinks.dataXSinkWrapper.channel": "file"

Parameters

The following parameters are supported. Unless otherwise noted, a parameter has no default value.

accessId (required): The AccessKey ID of the account that you use to connect to DataHub.
accessKey (required): The AccessKey secret of the account that you use to connect to DataHub.
endPoint (required): The endpoint of DataHub.
maxRetryCount (optional): The maximum number of retries if the synchronization node fails.
mode (required): The mode for writing strings.
parseContent (required): The data to be parsed.
project (required): The basic organizational unit of data in DataHub. Each project has one or more topics.
Note: DataHub projects are independent of MaxCompute projects. You cannot use MaxCompute projects as DataHub projects.
topic (required): The minimum unit for data subscription and publishing. You can use topics to distinguish different types of streaming data.
maxCommitSize (optional, default value: 1048576): The maximum amount of buffered data, in bytes, that Data Integration can accumulate before it commits the data to the destination. You can specify this parameter to improve write efficiency. The default value of 1048576 bytes equals 1 MB. DataHub allows a maximum of 10,000 data records to be written in a single write request. If the number of records exceeds 10,000, the synchronization node fails. To stay within this limit, keep the value below the average size of a single data record multiplied by 10,000.
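
Taken together, the connection-related parameters from this table can appear in a writer step as shown in the following sketch. This is for illustration only: every value is a placeholder, and the mode and parseContent parameters are omitted because their values depend on how your strings are written and parsed.

{
    "stepType": "datahub",
    "parameter": {
        "accessId": "<yourAccessKeyId>",// Placeholder: the AccessKey ID of your account. 
        "accessKey": "<yourAccessKeySecret>",// Placeholder: the AccessKey secret of your account. 
        "endPoint": "<yourDatahubEndpoint>",// Placeholder: the endpoint of DataHub. 
        "project": "<yourDatahubProject>",// Placeholder: the name of the DataHub project. 
        "topic": "<yourDatahubTopic>",// Placeholder: the name of the DataHub topic. 
        "maxRetryCount": 500,// The maximum number of retries if the node fails. 
        "maxCommitSize": 1048576// Commit the buffered data once it reaches 1 MB. 
    },
    "name": "Writer",
    "category": "writer"
}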

Configure DataHub Writer by using the codeless UI

This method is not supported.

Configure DataHub Writer by using the code editor

In the following code, a synchronization node is configured to write data from memory to DataHub by using the code editor. For more information, see Create a sync node by using the code editor.
{
    "type": "job",
    "version": "2.0",// The version number. 
    "steps": [
        { 
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "datahub",// The writer type. 
            "parameter": {
                "datasource": "",// The name of the data source. 
                "topic": "",// The minimum unit for data subscription and publishing. You can use topics to distinguish different types of streaming data. 
                "maxRetryCount": 500,// The maximum number of retries if the synchronization node fails. 
                "maxCommitSize": 1048576// The maximum amount of the buffered data that Data Integration can accumulate before it commits the data to the destination. 
                 // DataHub allows for a maximum of 10,000 data records to be written in a single write request. If the number of data records exceeds 10,000, the synchronization node fails. You can control the number of data records based on the total amount of data that is calculated by using the following formula: Average amount of data in a single data record × 10000. For example, if the data size of a single data record is 10 KB, the value of this parameter must be less than the result of 10 multiplied by 10000. 
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""// The maximum number of dirty data records allowed. 
        },
        "speed": {
            "throttle":true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":20, // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}