This topic describes the methods that are used to collect and send data from a variety of data sources to an Alibaba Cloud Elasticsearch cluster.

Background information

Elasticsearch is widely used for data search and analytics. Developers and communities use Elasticsearch in a variety of scenarios. The scenarios include application search, website information search, logging, infrastructure monitoring, application performance monitoring (APM), and security analytics. Solutions for these scenarios are provided free of charge. However, to use these solutions, developers must import the required data into an Alibaba Cloud Elasticsearch cluster.

This topic describes the most common methods to collect data.

Elasticsearch provides a flexible RESTful API for communicating with client applications. You can call the RESTful API to collect, search, and analyze data. You can also use the API to manage Alibaba Cloud Elasticsearch clusters and indexes on the clusters.

Elastic Beats

Elastic Beats is a set of lightweight data shippers that transfer data to Elasticsearch. Because these shippers incur little runtime overhead, Beats can run and collect data on devices that have limited hardware resources, such as IoT devices, edge devices, and embedded devices. If you want to collect data but do not have sufficient resources to run a resource-intensive data shipper, we recommend that you use Beats. Based on the data that Beats collects from all Internet-connected devices, you can quickly identify exceptions, such as system errors and security issues, and then take measures to handle them.

Beats can also be used in systems that have sufficient hardware resources.

You can use Beats to collect various types of data.
  • Filebeat

    Filebeat reads, preprocesses, and transfers data from files. In most cases, Filebeat is used to read log files, but it can read any non-binary file. Filebeat can also read data from other sources, such as TCP, UDP, containers, Redis, and syslog. Filebeat modules provide an easy way to collect and parse the logs of common applications, such as Apache, MySQL, and Kafka.

  • Metricbeat

    Metricbeat collects and preprocesses system and service metrics. System metrics include information about running processes and about CPU, memory, disk, and network usage. Metricbeat modules collect data from various services, such as Kafka, Palo Alto Networks, and Redis.

  • Packetbeat

    Packetbeat can be used to collect and preprocess real-time network data. You can use Packetbeat for security analytics, application monitoring, and performance analytics. Packetbeat supports the following protocols: DHCP, DNS, HTTP, MongoDB, NFS, and TLS.

  • Winlogbeat

    Winlogbeat can be used to capture event logs from Windows operating systems. The event logs include application events, hardware events, and security and system events.

  • Auditbeat

    Auditbeat can be used to detect changes to critical files and collect audit events from the Linux audit framework. In most cases, Auditbeat is used for security analytics.

  • Heartbeat

    Heartbeat can be used to check the availability of your system and services by probing. Heartbeat applies to many scenarios, such as infrastructure monitoring and security analytics. Heartbeat supports ICMP, TCP, and HTTP.

  • Functionbeat

    Functionbeat can be used to collect logs and metrics from serverless environments such as AWS Lambda.

For more information about how to use Metricbeat, see Use Beats together with Alibaba Cloud Elasticsearch to create a management dashboard. The other shippers are used in a similar way.

Logstash

Logstash is a powerful and flexible tool that reads, processes, and transfers all types of data. It provides features that Beats either does not support or can perform only at a high cost, such as enriching documents by looking up data in external data sources. However, Logstash requires more hardware resources than Beats and cannot be deployed on devices that do not meet its minimum requirements. If Beats cannot meet the requirements of a specific scenario, use Logstash instead.

In most cases, Beats and Logstash work collaboratively. Specifically, use Beats to collect data and Logstash to process data.

Alibaba Cloud Elasticsearch integrates the Logstash service. Alibaba Cloud Logstash is a server-side data processing pipeline. It is compatible with all the capabilities of open-source Logstash. Alibaba Cloud Logstash can be used to dynamically collect data from multiple data sources and store the data to a specified location. Alibaba Cloud Logstash can be used to process and transform all types of events by using input, filter, and output plug-ins.

Logstash runs tasks in data processing pipelines. Each pipeline consists of input, filter, and output plug-ins. At least one input plug-in and one output plug-in are required, and filter plug-ins are optional.
  • Input plug-ins

    Input plug-ins can be used to read data from different data sources. The supported sources include files, HTTP, IMAP, JDBC, Kafka, syslog, TCP, and UDP.

  • Filter plug-ins

    Filter plug-ins can be used to process and enrich data in various ways. In most cases, filter plug-ins first parse unstructured log data and transform it into structured data. Logstash provides filter plug-ins for many formats: the grok filter plug-in, which uses regular expressions to parse complex unstructured data, and plug-ins for CSV data, JSON data, key-value pairs, and delimited unstructured data. Logstash also provides filter plug-ins that enrich data, for example, by querying DNS records, adding the locations of IP addresses, or looking up values in custom directories or Elasticsearch indexes. Additional filter plug-ins, such as the mutate filter plug-in, perform general data transformations, such as renaming, deleting, and copying fields and values.

  • Output plug-ins

    Output plug-ins can be used to write the parsed and enriched data into data sinks. These plug-ins are used in the final stage of the Logstash data processing pipeline. Multiple types of output plug-ins are available. However, this topic focuses on the Elasticsearch output plug-in. This plug-in can be used to collect and send data from a variety of data sources to an Alibaba Cloud Elasticsearch cluster.
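The parsing step that filter plug-ins perform can be illustrated outside Logstash. The following Python sketch is not Logstash code. It only mimics, under stated assumptions, how a grok-style pattern turns one unstructured log line into a structured event. The access log layout, the field names, the function name parse_access_log, and the _parsefailure tag are all hypothetical choices made for this illustration.

```python
import re

# Hypothetical pattern for an Apache-style access log line. The named groups
# play the role that field names play in a grok pattern.
ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_log(line):
    """Return a structured event, or a tagged raw event if the line does not match."""
    match = ACCESS_LOG.match(line)
    if match is None:
        # Keep the raw line and tag it so that failures can be found later.
        return {"message": line, "tags": ["_parsefailure"]}
    event = match.groupdict()
    # Convert numeric fields so that they can be indexed as numbers.
    event["status"] = int(event["status"])
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    return event


line = '203.0.113.7 - - [15/Aug/2019:14:12:12 +0000] "GET /blog/feed HTTP/1.1" 200 5120'
print(parse_access_log(line))
```

Tagging lines that do not match, instead of dropping them, keeps the raw data available for troubleshooting the pattern.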

The following section describes a sample Logstash pipeline. It can be used to complete the following operations:
  • Read the Elastic Blogs RSS feed.
  • Preprocess the data by copying or renaming fields and removing special characters and HTML tags.
  • Send the resulting documents to Elasticsearch.
  1. Configure an Alibaba Cloud Logstash pipeline.
    input { 
      rss { 
        url => "/blog/feed" 
        interval => 120 
      } 
    } 
    filter { 
      mutate { 
        rename => [ "message", "blog_html" ] 
        copy => { "blog_html" => "blog_text" } 
        copy => { "published" => "@timestamp" } 
      } 
      mutate { 
        gsub => [
          "blog_text", "<.*?>", "",
          "blog_text", "[\n\t]", " " 
        ] 
        remove_field => [ "published", "author" ] 
      } 
    } 
    output { 
      stdout { 
        codec => dots 
      } 
      elasticsearch { 
        hosts => [ "https://<your-elasticsearch-url>" ] 
        index => "elastic_blog" 
        user => "elastic" 
        password => "<your-elasticsearch-password>" 
      } 
    }

    Set hosts to a value in the format of <Internal endpoint of the Alibaba Cloud Elasticsearch cluster>:9200. Set password to the password that is used to access the Alibaba Cloud Elasticsearch cluster.

  2. In the Kibana console, query the index to view the collected data.
    POST elastic_blog/_search
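To make the filter stage of the pipeline in step 1 easier to follow, the sketch below reproduces its effect in Python on a single event. It is an approximation for illustration only: it assumes that an event can be modeled as a plain dict, the function name preprocess and the sample event are hypothetical, and it does not reproduce every detail of how the mutate filter plug-in handles missing fields or types.

```python
import re


def preprocess(event):
    """Mimic the two mutate filters of the sample pipeline on one event dict."""
    event = dict(event)  # Work on a copy so that the caller's dict is unchanged.

    # First mutate: rename message to blog_html, then copy fields.
    if "message" in event:
        event["blog_html"] = event.pop("message")
    if "blog_html" in event:
        event["blog_text"] = event["blog_html"]
    if "published" in event:
        event["@timestamp"] = event["published"]

    # Second mutate: gsub removes HTML tags, then replaces newlines and tabs.
    if "blog_text" in event:
        text = re.sub(r"<.*?>", "", event["blog_text"])
        event["blog_text"] = re.sub(r"[\n\t]", " ", text)

    # remove_field drops the fields that are no longer needed.
    for field in ("published", "author"):
        event.pop(field, None)
    return event


sample = {
    "message": "<p>Hello\n<b>Elastic</b>\tStack</p>",
    "published": "2019-08-15T14:12:12",
    "author": "someone",
}
print(preprocess(sample))
```

The non-greedy pattern <.*?> matches one tag at a time. A greedy pattern would remove everything between the first "<" and the last ">" on a line, including the text between tags.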

Clients

You can integrate data collection into custom application code by using the clients provided by Elasticsearch. These clients are libraries that abstract the low-level details of data collection so that you can focus on the operations that are specific to your application. Elasticsearch provides clients for multiple programming languages, such as Java, JavaScript, Go, .NET, PHP, Perl, Python, and Ruby. For details and sample code for your selected language, see Elasticsearch Clients.

If the programming language of your application is not in the preceding list, see Community Contributed Clients.
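As an illustration of the low-level details that a client library handles for you, the following Python sketch serializes documents into the newline-delimited JSON body that the Elasticsearch _bulk API expects. It uses only the standard library and is not the API of any official client. The function name build_bulk_body and the sample documents are hypothetical.

```python
import json


def build_bulk_body(index, docs):
    """Serialize (doc_id, document) pairs into a newline-delimited _bulk body."""
    lines = []
    for doc_id, doc in docs:
        # Action line: names the target index and document ID for the next line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The _bulk API requires the body to end with a newline character.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "my_first_index",
    [("1", {"title": "first"}), ("2", {"title": "second"})],
)
print(body)
```

In a real application, a supported client would normally build and send such requests, and would also handle concerns such as retries and connection management.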

Kibana

We recommend that you use the Kibana console to develop and debug Elasticsearch requests. Kibana exposes all features of the Elasticsearch RESTful API and abstracts the technical details of the underlying HTTP requests. For example, you can use Kibana to add a raw JSON document to an Alibaba Cloud Elasticsearch cluster:
PUT my_first_index/_doc/1 
{
    "title": "How to Ingest Into Elasticsearch Service",
    "date": "2019-08-15T14:12:12",
    "description": "This is an overview article about the various ways to ingest into Elasticsearch Service"
}
Note: In addition to Kibana, you can use other tools to call the RESTful API of the Alibaba Cloud Elasticsearch cluster and index documents. For example, you can use cURL to develop and debug Elasticsearch requests or to integrate them into custom scripts.
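The same request can also be built from a script. The following Python sketch constructs, but does not send, an HTTP request that is equivalent to the PUT my_first_index/_doc/1 example. It uses only the standard library and assumes basic authentication. The endpoint es.example.internal:9200 and the function name build_index_request are placeholders introduced for this illustration.

```python
import base64
import json
import urllib.request


def build_index_request(endpoint, index, doc_id, document, user, password):
    """Build (but do not send) a request equivalent to PUT <index>/_doc/<id>."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{endpoint}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


request = build_index_request(
    "http://es.example.internal:9200",  # Placeholder, not a real endpoint.
    "my_first_index",
    "1",
    {"title": "How to Ingest Into Elasticsearch Service", "date": "2019-08-15T14:12:12"},
    "elastic",
    "<your-elasticsearch-password>",
)
print(request.get_method(), request.full_url)
# To send the request, pass it to urllib.request.urlopen(request).
```

Building the request separately from sending it makes the method, URL, headers, and body easy to inspect while you debug.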

Summary

Multiple methods are provided to collect and send data from a variety of data sources to an Alibaba Cloud Elasticsearch cluster. You must select the most suitable data collection method based on your business scenarios, requirements, and systems.
  • Beats shippers are convenient and lightweight, and they work out of the box. They can be used to collect data from various data sources. Modules that are packaged with Beats provide the configurations for data acquisition, parsing, indexing, and visualization for many common databases, operating systems, containers, web servers, and caches. These modules allow you to create a dashboard for your data within five minutes. Beats shippers are best suited to resource-constrained devices, such as IoT devices, embedded devices, and firewalls.
  • Logstash is a flexible tool to read, transform, and transfer data. It provides various input, filter, and output plug-ins. If Beats cannot meet the requirements for certain scenarios, you can use Beats to collect data, use Logstash to process data, and then transfer the processed data to an Alibaba Cloud Elasticsearch cluster.
  • To collect data from applications, we recommend that you use clients that are supported by open-source Elasticsearch.
  • To develop or debug Elasticsearch requests, we recommend that you use Kibana.
