The logstash-output-datahub plug-in allows you to transfer data to DataHub. This topic describes how to use the logstash-output-datahub plug-in.

Prerequisites

  • The logstash-output-datahub plug-in is installed.

    For more information, see Install a Logstash plug-in.

  • DataHub is activated, a project is created, and a topic is created for the project.

    For more information, see Get started with DataHub.

  • The data source from which you want to read data is prepared.

    Available data sources are determined by the input plug-ins supported by Logstash. For more information, see Input plugins. In this example, an Alibaba Cloud Elasticsearch cluster is used.

Use logstash-output-datahub

Create a pipeline by following the instructions provided in Use configuration files to manage pipelines. When you create the pipeline, configure the pipeline parameters based on the descriptions in the Parameters section. After you configure the parameters, save the settings and deploy the pipeline. Logstash then reads data from the data source and transfers the data to DataHub.

The following code provides a pipeline configuration example. For more information about the parameters, see Parameters.

input {
    # Read data from an Alibaba Cloud Elasticsearch cluster.
    elasticsearch {
        hosts => ["http://es-cn-mp91cbxsm000c****.elasticsearch.aliyuncs.com:9200"]
        user => "elastic"
        index => "test"
        password => "your_password"
        docinfo => true
    }
}
filter {
    # No filters are required in this example. Add filter plug-ins here to transform the data before transfer.
}
output {
    datahub {
        access_id => "Your accessId"
        access_key => "Your accessKey"
        endpoint => "Endpoint"
        project_name => "project"
        topic_name => "topic"
        #shard_ids => ["0"]
        #shard_keys => ["thread_id"]
        dirty_data_continue => true
        dirty_data_file => "/ssd/1/<Logstash cluster ID>/logstash/data/File name"
        dirty_data_file_max_size => 1000
    }
}
Important By default, Alibaba Cloud Logstash supports data transmission only within the same virtual private cloud (VPC). If the source data is accessed over the Internet, configure a Network Address Translation (NAT) gateway for your Logstash cluster so that the cluster can access the Internet. For more information, see Configure a NAT gateway for data transmission over the Internet.
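
In the preceding example, the filter block is empty. If the field names in your source data do not match the schema of the DataHub topic, you can add filter plug-ins to reshape the data before it is transferred. The following minimal sketch uses the Logstash mutate filter; the field names message_content and content are hypothetical and only illustrate the technique.

filter {
    mutate {
        # Rename a source field to match the topic schema (hypothetical field names).
        rename => { "message_content" => "content" }
        # Remove a field that is not defined in the topic schema.
        remove_field => ["@version"]
    }
}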

Parameters

The following list describes the parameters supported by logstash-output-datahub.

  • endpoint (string, required): The endpoint that is used to access DataHub. For more information, see Endpoints.

  • access_id (string, required): The AccessKey ID of your Alibaba Cloud account.

  • access_key (string, required): The AccessKey secret of your Alibaba Cloud account.

  • project_name (string, required): The name of the DataHub project.

  • topic_name (string, required): The name of the DataHub topic.

  • retry_times (number, optional): The number of retries allowed. The value -1 indicates no limit, the value 0 indicates that retries are not allowed, and a value greater than 0 indicates that the specified number of retries is allowed. Default value: -1.

  • retry_interval (number, optional): The interval between retries. Unit: seconds. Default value: 5.

  • skip_after_retry (boolean, optional): Specifies whether to skip the upload of the current batch of data if the number of retries caused by a DataHub exception exceeds the value of retry_times. Default value: false.

  • approximate_request_bytes (number, optional): The approximate maximum number of bytes that can be sent in each request. This parameter is used to prevent a request from being rejected because the request body is excessively large. Default value: 2048576, which is about 2 MB.

  • shard_keys (array, optional): The names of the fields whose values the plug-in hashes to determine the shard to which data is written.

    Note: If neither shard_keys nor shard_ids is specified, the plug-in polls the shards to determine the shard to which data is written.

  • shard_ids (array, optional): The IDs of the shards to which the plug-in writes data.

    Note: If neither shard_keys nor shard_ids is specified, the plug-in polls the shards to determine the shard to which data is written.

  • dirty_data_continue (string, optional): Specifies whether to skip dirty data during data processing. The value true indicates that dirty data is skipped. If you set this parameter to true, you must also specify dirty_data_file. Default value: false.

  • dirty_data_file (string, optional): The name of the dirty data file. You must configure this parameter if you set dirty_data_continue to true.

    Note: During data processing, the dirty data file is divided into part 1 and part 2. Raw dirty data is stored in part 1, and updated dirty data is stored in part 2.

  • dirty_data_file_max_size (number, optional): The maximum size of the dirty data file. Unit: KB.

  • enable_pb (boolean, optional): Specifies whether to use Protocol Buffers (Protobuf) for data transfer. If Protobuf is not supported for data transfer, set the value to false. Default value: true.
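
As a reference for the optional parameters, the following sketch combines the retry, shard, and transfer settings in one output block. The credential, endpoint, project, and topic values are placeholders, and the retry values and shard IDs are assumptions chosen only for illustration.

output {
    datahub {
        access_id => "Your accessId"
        access_key => "Your accessKey"
        endpoint => "Endpoint"
        project_name => "project"
        topic_name => "topic"
        retry_times => 3             # Allow up to three retries (assumed value).
        retry_interval => 10         # Wait 10 seconds between retries (assumed value).
        skip_after_retry => true     # Skip the current batch after retries are exhausted.
        shard_ids => ["0", "1"]      # Write data only to shards 0 and 1 (assumed IDs).
        enable_pb => false           # Disable Protobuf if it is not supported for data transfer.
    }
}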