This topic describes the parameters of DataHub Writer and how to configure it by using the code editor.

DataHub is a real-time data distribution platform that is designed to process streaming data. You can publish and subscribe to streaming data in DataHub and distribute the data to other platforms. This allows you to analyze streaming data and build applications based on the streaming data.

Based on the Apsara system of Alibaba Cloud, DataHub features high availability, low latency, high scalability, and high throughput. Seamlessly integrated with Realtime Compute, DataHub allows you to use SQL to analyze streaming data. DataHub can also distribute streaming data to Alibaba Cloud services such as MaxCompute and Object Storage Service (OSS).
Notice: Strings must be encoded in the UTF-8 format, and each string must not exceed 1 MB in size.

Channel types

A sync node connects the source to the sink over a single channel. Therefore, the channel type configured for the writer must match the channel type configured for the reader. Channels are categorized into two types: memory and file. In the following configuration, the channel type is set to file:
"agent.sinks.dataXSinkWrapper.channel": "file"

Parameters

| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| accessId | The AccessKey ID of the account that you use to connect to DataHub. | Yes | N/A |
| accessKey | The AccessKey secret of the account that you use to connect to DataHub. | Yes | N/A |
| endpoint | The endpoint of DataHub. | Yes | N/A |
| maxRetryCount | The maximum number of retries if the sync node fails. | No | N/A |
| mode | The mode for writing strings. | Yes | N/A |
| parseContent | The data to be parsed. | Yes | N/A |
| project | The basic organizational unit in DataHub. Each project contains one or more topics. Note: DataHub projects are independent of MaxCompute projects. You cannot use a MaxCompute project as a DataHub project. | Yes | N/A |
| topic | The minimum unit for data subscription and publishing. You can use topics to distinguish different types of streaming data. | Yes | N/A |
| maxCommitSize | The amount of data, in bytes, that DataHub Writer buffers before sending it to the sink, to improve writing efficiency. | No | 1048576 (1 MB) |
| batchSize | The number of data records that DataHub Writer buffers before sending them to the sink, to improve writing efficiency. | No | 1024 |
| maxCommitInterval | The maximum interval, in milliseconds, at which DataHub Writer sends data to the sink. When the interval elapses, DataHub Writer sends the buffered data even if neither of the preceding two thresholds is reached. | No | 30000 (30 seconds) |
| parseMode | The mode for parsing log entries. Valid values: default and csv. A value of default indicates that no log parsing is required. A value of csv indicates that a delimiter separates the fields of each log entry. | No | default |
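As a hedged illustration of the table above, the following sketch combines the required parameters into a single writer step. All values are placeholders, not documented defaults; the valid values of mode and the exact format of parseContent are not specified above, so both are shown as placeholders only:

```json
{
    "stepType": "datahub",
    "parameter": {
        "accessId": "<yourAccessKeyId>",         // placeholder: AccessKey ID used to connect to DataHub
        "accessKey": "<yourAccessKeySecret>",    // placeholder: AccessKey secret
        "endpoint": "<yourDatahubEndpoint>",     // placeholder: DataHub endpoint
        "project": "<yourProjectName>",          // placeholder: DataHub project (not a MaxCompute project)
        "topic": "<yourTopicName>",              // placeholder: DataHub topic
        "mode": "<writeMode>",                   // placeholder: valid values are not documented here
        "parseContent": "<contentToParse>",      // placeholder: format depends on parseMode
        "batchSize": 1024,                       // number of buffered records per send
        "maxCommitInterval": 30000               // maximum buffering interval, in milliseconds
    },
    "name": "Writer",
    "category": "writer"
}
```

Note that the full sample later in this topic uses a datasource connection name instead of inline accessId, accessKey, and endpoint values; either way of supplying the connection information must identify the same DataHub endpoint and credentials.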

Configure DataHub Writer by using the codeless UI

The codeless user interface (UI) is not supported for DataHub Writer.

Configure DataHub Writer by using the code editor

The following example shows how to configure a sync node to write data to a DataHub project. For more information, see Create a sync node by using the code editor.
{
    "type": "job",
    "version": "2.0",// The version number.
    "steps": [
        { 
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "datahub",// The writer type.
            "parameter": {
                "datasource": "",// The connection name.
                "topic": "",// The minimum unit for data subscription and publishing. You can use topics to distinguish different types of streaming data.
                "maxRetryCount":500,// The maximum number of retries if a task fails.
                "maxCommitSize": 1048576// The amount of data that DataHub Writer buffers before sending it to the sink.
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""// The maximum number of dirty data records allowed.
        },
        "speed": {
            "concurrent": 20,// The maximum number of concurrent threads.
            "throttle": false // Specifies whether to enable bandwidth throttling. A value of false indicates that the bandwidth is not throttled. A value of true indicates that the bandwidth is throttled. The maximum transmission rate takes effect only if you set this parameter to true.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
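For comparison, the following hedged sketch shows the speed setting with throttling enabled. The field that names the rate cap (shown here as mbps) is an assumption and is not documented above; the values are placeholders:

```json
"setting": {
    "speed": {
        "concurrent": 10, // The maximum number of concurrent threads.
        "throttle": true, // Enable bandwidth throttling.
        "mbps": 1         // Assumed rate-cap field; takes effect only when throttle is true.
    }
}
```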