The logstash-output-datahub plug-in allows you to transfer data to DataHub. This topic describes how to use the logstash-output-datahub plug-in.

Prerequisites

  • The logstash-output-datahub plug-in is installed.

    For more information, see Install a Logstash plug-in.

  • DataHub is activated, a project is created, and a topic is created for the project.

    For more information, see Get started with DataHub.

  • The data source from which you want to read data is prepared.

    Available data sources are determined by the input plug-ins supported by Logstash. For more information, see Input plugins. In this example, an Alibaba Cloud Elasticsearch cluster is used.

Use logstash-output-datahub

Create a pipeline by following the instructions provided in Use configuration files to manage pipelines. When you create the pipeline, configure the pipeline parameters based on the descriptions in the Parameters section. After you configure the parameters, save the settings and deploy the pipeline. Logstash then reads data from the data source and transfers the data to DataHub.

The following code provides a pipeline configuration example. For more information about the parameters, see Parameters.

input {
    # Read data from an Alibaba Cloud Elasticsearch cluster.
    elasticsearch {
        hosts => ["http://es-cn-mp91cbxsm000c****.elasticsearch.aliyuncs.com:9200"]
        user => "elastic"
        index => "test"
        password => "your_password"
        docinfo => true
    }
}
filter {
    # No filters are required in this example. Add filter plug-ins here to transform the data before transfer.
}
output {
    datahub {
        access_id => "Your accessId"
        access_key => "Your accessKey"
        endpoint => "Endpoint"
        project_name => "project"
        topic_name => "topic"
        #shard_ids => ["0"]
        #shard_keys => ["thread_id"]
        dirty_data_continue => true
        dirty_data_file => "/ssd/1/<Logstash cluster ID>/logstash/data/File name"
        dirty_data_file_max_size => 1000
    }
}
Important By default, Alibaba Cloud Logstash supports data transmission only within the same virtual private cloud (VPC). If the source data is accessed over the Internet, configure a Network Address Translation (NAT) gateway for your Logstash cluster so that the cluster can access the Internet. For more information, see Configure a NAT gateway for data transmission over the Internet.
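
In the preceding example, the filter block is empty. If the field names in your source data do not match the schema of the DataHub topic, you can add filter plug-ins to reshape the data before it is transferred. The following minimal sketch uses the Logstash mutate filter; the field names message_content and content are hypothetical and only illustrate the technique.

filter {
    mutate {
        # Rename a source field to match the topic schema (hypothetical field names).
        rename => { "message_content" => "content" }
        # Remove a field that is not defined in the topic schema.
        remove_field => ["@version"]
    }
}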

Parameters

The following list describes the parameters supported by logstash-output-datahub.

  • endpoint (string, required): The endpoint that is used to access DataHub. For more information, see Endpoints.

  • access_id (string, required): The AccessKey ID of your Alibaba Cloud account.

  • access_key (string, required): The AccessKey secret of your Alibaba Cloud account.

  • project_name (string, required): The name of the DataHub project.

  • topic_name (string, required): The name of the DataHub topic.

  • retry_times (number, optional): The number of retries allowed. The value -1 indicates no limit, the value 0 indicates that retries are not allowed, and a value greater than 0 indicates that the specified number of retries is allowed. Default value: -1.

  • retry_interval (number, optional): The interval between retries. Unit: seconds. Default value: 5.

  • skip_after_retry (boolean, optional): Specifies whether to skip the upload of the current batch of data if the number of retries caused by a DataHub exception exceeds the value of retry_times. Default value: false.

  • approximate_request_bytes (number, optional): The approximate maximum number of bytes that can be sent in each request. This parameter is used to prevent a request from being rejected because the request body is excessively large. Default value: 2048576, which is about 2 MB.

  • shard_keys (array, optional): The names of the fields whose values the plug-in hashes to determine the shard to which data is written.

    Note: If neither shard_keys nor shard_ids is specified, the plug-in polls the shards to determine the shard to which data is written.

  • shard_ids (array, optional): The IDs of the shards to which the plug-in writes data.

    Note: If neither shard_keys nor shard_ids is specified, the plug-in polls the shards to determine the shard to which data is written.

  • dirty_data_continue (string, optional): Specifies whether to skip dirty data during data processing. The value true indicates that dirty data is skipped. If you set this parameter to true, you must also specify dirty_data_file. Default value: false.

  • dirty_data_file (string, optional): The name of the dirty data file. You must configure this parameter if you set dirty_data_continue to true.

    Note: During data processing, the dirty data file is divided into part 1 and part 2. Raw dirty data is stored in part 1, and updated dirty data is stored in part 2.

  • dirty_data_file_max_size (number, optional): The maximum size of the dirty data file. Unit: KB.

  • enable_pb (boolean, optional): Specifies whether to use Protocol Buffers (Protobuf) for data transfer. If Protobuf is not supported for data transfer, set the value to false. Default value: true.
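
As a reference for the optional parameters, the following sketch combines the retry, shard, and transfer settings in one output block. The credential, endpoint, project, and topic values are placeholders, and the retry values and shard IDs are assumptions chosen only for illustration.

output {
    datahub {
        access_id => "Your accessId"
        access_key => "Your accessKey"
        endpoint => "Endpoint"
        project_name => "project"
        topic_name => "topic"
        retry_times => 3             # Allow up to three retries (assumed value).
        retry_interval => 10         # Wait 10 seconds between retries (assumed value).
        skip_after_retry => true     # Skip the current batch after retries are exhausted.
        shard_ids => ["0", "1"]      # Write data only to shards 0 and 1 (assumed IDs).
        enable_pb => false           # Disable Protobuf if it is not supported for data transfer.
    }
}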