Mastering Interview Questions: A Comprehensive Guide to Elasticsearch

By Yijia

Elasticsearch can return search results in under a second. Because it runs as a distributed cluster, it scales out easily and can handle petabytes of data. Search results are sorted by relevance score, so the most relevant matches come first.

1. Overview

Characteristics

Easy installation: Elasticsearch has no external dependencies to download or install, and a cluster can be set up by changing a few parameters.
JSON: Input and output use JSON, and schemas do not have to be defined in advance, which makes it quick and convenient to use.
RESTful: Almost all operations (indexing, querying, and even configuration) can be done through the HTTP interface.
Distributed: All nodes behave the same from the outside, so any node can serve as an entry point. When nodes are added, the load is rebalanced automatically.
Multi-tenant: Different indexes can serve different purposes, and multiple indexes can be operated on at the same time.
Support for very large data sets: Supports near real-time processing of massive amounts of structured and unstructured data, scaling to petabytes.

Features

Distributed search engine

Distributed architecture: Elasticsearch automatically distributes large amounts of data to multiple servers.

Full-text search

Provides highly automated query methods such as fuzzy search, and features such as relevance ranking and highlighting.

Data analysis engine (grouping and aggregation)

On a community website, for example, data such as user logins over the past week and the usage of each feature over the past month can be analyzed.

Near real-time processing of massive amounts of data, within seconds

Due to its distributed architecture, a large number of servers can be utilized for storing and retrieving data.

Scenarios

Search scenarios

Examples include personnel retrieval, equipment retrieval, in-app search, and order search.

Log analysis scenarios

The classic combination of ELK (Elasticsearch/Logstash/Kibana) can achieve log collection, log storage, and log analysis.

Data alert platform and data analysis scenarios

For instance, a community group-buying platform can automatically notify users when the price of an item falls below a set threshold.

It can also rank a competitor's top 10 best-selling products for operational analysis.

Business Intelligence (BI) system

For example, in a community setting, it is necessary to analyze user consumption amounts and commodity categories in a certain area, output the corresponding report data, and predict the top-selling commodities based on regional and population characteristics. Elasticsearch handles the data analysis and mining, while Kibana provides data visualization.

Competitiveness Analysis

Lucene

Lucene is an information retrieval toolkit written in Java and distributed as a JAR package. It is only a library, and using it well is complex.

Solr

A query server with an HTTP interface, built on Lucene. It is a search engine system that encapsulates many of Lucene's details.

Elasticsearch

A distributed, near real-time search engine built on Lucene for massive data sets. Its strategy is to index every field so that each one can be searched.

Comparison

(1) Solr uses Zookeeper for distributed management, while Elasticsearch itself has distributed coordination management capabilities.

(2) Solr ships with a more comprehensive feature set, while Elasticsearch focuses on core features and leaves most advanced features to third-party plug-ins.

(3) Solr performs better than Elasticsearch in traditional search applications, while Elasticsearch performs better than Solr in real-time search applications.

At the time of writing, Elasticsearch 7.x is still the mainstream version, and the latest release is 7.8.

Optimizations in 7.x: a JDK is bundled by default, the upgrade to Lucene 8 significantly improves Top-K query performance, and circuit breakers were introduced to avoid out-of-memory (OOM) errors.

2. Basic Concepts

IK Analyzer

IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Starting with version 3.0, IK Analyzer became a general-purpose word segmentation component for Java that is independent of the Lucene project, while still providing a default implementation optimized for Lucene.

IK analyzer 3.0 has the following features:

A distinctive "forward-iteration, finest-granularity segmentation" algorithm, with a processing speed of 600,000 words per second.
A multi-subprocessor analysis mode that handles letters (IP addresses, email addresses, and URLs), numbers (dates, common Chinese quantifiers, Roman numerals, and scientific notation), and Chinese words (personal names and place names).
Dictionary storage optimized for individual entries, which reduces memory usage.
A query analyzer, IKQueryParser, optimized for Lucene full-text search. It uses an ambiguity analysis algorithm to optimize the permutation and combination of query keywords, which can greatly improve the hit rate of Lucene retrieval.

Extended dictionary: ext_dict
Stop-word dictionary: stop_dict
Synonym dictionary: same_dict

Index (Database-like)

Settings: define index-level configuration, such as the number of shards and the number of replicas of the index.

Mapping (Table-like design)

The data type of the field
The type of the analyzer
Whether the field is stored and whether it is indexed
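
As a rough illustration of how settings and a mapping fit together, the sketch below builds the JSON body that could be sent when creating an index (for example with PUT /book). The index name book, the field names, and the shard and replica counts are assumptions made for this example. The ik_max_word analyzer is only available when the IK plug-in is installed.

```python
import json

# Hypothetical body for creating an index; all field names are illustrative.
index_body = {
    "settings": {
        "number_of_shards": 3,     # assumed value
        "number_of_replicas": 1,   # assumed value
    },
    "mappings": {
        "properties": {
            # Data type plus analyzer (requires the IK plug-in).
            "title": {"type": "text", "analyzer": "ik_max_word"},
            # Exact-value field, not analyzed.
            "isbn": {"type": "keyword"},
            # Indexed for range queries.
            "price": {"type": "float"},
            # Kept in _source but not searchable.
            "internal_note": {"type": "text", "index": False},
        }
    },
}

print(json.dumps(index_body, indent=2))
```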

Document (Data)

Full update: PUT (replaces the whole document)
Partial update: POST (changes only the submitted fields)
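
The difference between the two update styles can be sketched locally without a cluster. The functions below are a simplified model, not Elasticsearch's implementation: full_update mimics PUT /<index>/_doc/<id>, where the new body replaces the stored document, and partial_update mimics POST /<index>/_update/<id> with a {"doc": ...} body, where the submitted fields are merged into the stored document. The function names and sample document are assumptions for this example.

```python
import copy

def full_update(stored, new_doc):
    """Model of a full update: the new body replaces the stored document."""
    return copy.deepcopy(new_doc)

def partial_update(stored, partial_doc):
    """Model of a partial update: merge submitted fields into the stored document."""
    merged = copy.deepcopy(stored)
    for key, value in partial_doc.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged recursively.
            merged[key] = partial_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

stored = {"title": "Elasticsearch basics", "price": 10}
replaced = full_update(stored, {"title": "Elasticsearch basics, 2nd edition"})
patched = partial_update(stored, {"price": 12})
```

After a full update the price field is gone because it was not resent, while the partial update keeps the title and changes only the price.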

3. Advanced Features

Advanced Mapping

Geographic Coordinate Point Data Type

A geographic coordinate point refers to a point on the Earth's surface that can be described using latitude and longitude. Geographic coordinate points are used for calculating the distance between two coordinates and determining if a coordinate is within a specific area. To create a geographic coordinate point, you need to explicitly declare the field type as geo_point.
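
The following sketch shows what declaring and using such a field might look like. The field name location and the sample coordinates are assumptions. The haversine function is included only to illustrate the kind of distance calculation involved; it is not a claim about how Elasticsearch computes distances internally.

```python
import math

# Explicit mapping; "location" is a hypothetical field name.
geo_mapping = {"mappings": {"properties": {"location": {"type": "geo_point"}}}}

def geo_distance_filter(lat, lon, distance):
    """Build a search body that keeps documents within `distance` of a point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,  # for example "5km"
                        "location": {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, assuming a spherical Earth."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

body = geo_distance_filter(30.27, 120.15, "5km")
```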

Dynamic Mapping

Dynamic mapping is used to determine the data type of a field and automatically add new fields to the type mapping.
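
To make the idea concrete, here is a simplified approximation of how a field type could be inferred from a JSON value. It loosely follows Elasticsearch's documented defaults (booleans become boolean, integers become long, floating-point numbers become float, date-like strings become date, and other strings become text with a keyword sub-field), but the real rules cover more cases, and the ISO-format date check used here is an assumption of this sketch.

```python
from datetime import datetime

def infer_field_mapping(value):
    """Guess a mapping for one JSON value (simplified sketch)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, dict):
        return {"properties": infer_document_mapping(value)}
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # stand-in for date detection
            return {"type": "date"}
        except ValueError:
            return {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
    raise TypeError(f"unsupported value in this sketch: {value!r}")

def infer_document_mapping(doc):
    """Infer mappings for every non-null field of a document."""
    return {name: infer_field_mapping(v) for name, v in doc.items() if v is not None}

mapping = infer_document_mapping({
    "title": "Search engines explained",
    "pages": 300,
    "price": 29.9,
    "in_stock": True,
    "published": "2020-06-18",
    "subtitle": None,  # null values add no field
})
```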

DSL Advanced

Match-all query
Full-text query
- Match query
- Match phrase query
- Query string
- Multi-match query
Term-level query
- Exact term search (term)
- ID set search (ids)
- Range search (range)
- Prefix search (prefix)
- Wildcard search (wildcard)
- Regular expression search (regexp)
- Fuzzy search
Compound search
Sort & size & highlight & bulk
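
A few of the query types listed above can be combined into a single search body. The sketch below assembles one as a Python dictionary, ready to be serialized to JSON and sent to the _search endpoint. The field names (title, category, price) and the helper function names are assumptions for illustration.

```python
import json

def match_query(field, text):
    """Full-text query: the text is analyzed before matching."""
    return {"match": {field: text}}

def term_query(field, value):
    """Term-level query: the value is matched exactly, without analysis."""
    return {"term": {field: value}}

def range_query(field, gte=None, lte=None):
    """Range query with optional lower and upper bounds."""
    bounds = {}
    if gte is not None:
        bounds["gte"] = gte
    if lte is not None:
        bounds["lte"] = lte
    return {"range": {field: bounds}}

# Compound (bool) search with sorting, paging, and highlighting.
search_body = {
    "query": {
        "bool": {
            "must": [match_query("title", "elasticsearch guide")],
            "filter": [
                term_query("category", "tech"),
                range_query("price", gte=10, lte=50),
            ],
        }
    },
    "sort": [{"price": {"order": "asc"}}],
    "from": 0,
    "size": 10,
    "highlight": {"fields": {"title": {}}},
}

print(json.dumps(search_body, indent=2))
```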

Aggregation Analysis

Aggregation analysis is an important database feature. It performs aggregate calculations over a queried data set, such as finding the maximum and minimum values or calculating the sum and average of a field (or of a computed expression).

Computing metrics such as the maximum, minimum, sum, and average over a data set is called metric aggregation in Elasticsearch.
Grouping the queried data into buckets, much like SQL GROUP BY, and then computing metrics for each bucket is called bucket aggregation.
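
The two kinds of aggregation can be illustrated with a request body and a local equivalent. The body below nests a metric aggregation (avg) inside a bucket aggregation (terms). The in-memory function shows the same grouping logic on a small list and is only a model of the idea. The field names category and price and the sample data are assumptions.

```python
from collections import defaultdict

# Bucket by category, then compute the average price per bucket.
agg_body = {
    "size": 0,  # return only aggregation results, no hits
    "aggs": {
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        }
    },
}

def average_by_bucket(docs, bucket_field, metric_field):
    """Local model of a terms bucket with an avg metric inside it."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc[metric_field])
    return {key: sum(values) / len(values) for key, values in buckets.items()}

sample = [
    {"category": "tech", "price": 10},
    {"category": "tech", "price": 30},
    {"category": "art", "price": 20},
]
averages = average_by_bucket(sample, "category", "price")
```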

Intelligent Search

Term Suggester
Phrase Suggester
Completion Suggester
Context Suggester

If the Completion Suggester returns no matches, the user has probably made an input error, so try the Phrase Suggester next. If there is still no match, try the Term Suggester.

In terms of precision, Completion > Phrase > Term, while in terms of recall, the opposite is true.

In terms of performance, the Completion Suggester is the fastest. If prefix matching with the Completion Suggester alone meets business requirements, that is the ideal choice. The Phrase and Term Suggesters search the inverted index, so they are slower by comparison. The amount of data used by a suggester should be kept as small as possible. Ideally, after a warm-up period, the index can be fully mapped into memory.
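
The fallback order described above (Completion, then Phrase, then Term) can be sketched as a small helper. The suggester callables here are stand-ins: in a real system each would send a suggest request to Elasticsearch, and the stub data is invented for the example.

```python
def suggest_with_fallback(text, suggesters):
    """Try each (name, suggester) pair in order and return the first non-empty result."""
    for name, suggest in suggesters:
        options = suggest(text)
        if options:
            return name, options
    return None, []

# Stub suggesters standing in for real suggest requests (hypothetical data).
def completion(text):
    titles = ["elasticsearch guide", "elastic stack"]
    return [t for t in titles if t.startswith(text)]

def phrase(text):
    corrections = {"elastik search": ["elastic search"]}
    return corrections.get(text, [])

def term(text):
    corrections = {"serch": ["search"]}
    return corrections.get(text, [])

chain = [("completion", completion), ("phrase", phrase), ("term", term)]

prefix_hit = suggest_with_fallback("elast", chain)
typo_hit = suggest_with_fallback("serch", chain)
```

Ordering the chain from most precise to highest recall means the cheaper, more precise suggester answers whenever it can.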

4. Best Practices

Write Optimization

Set the number of replicas to 0

When loading data for the first time, set the number of replicas to 0 and change it back after the write completes. This avoids indexing every document on the replicas during the load.
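
A minimal sketch of the two settings bodies involved, which could be sent with PUT /<index>/_settings before and after the initial load. The helper name and the final replica count of 1 are assumptions for this example.

```python
def replica_settings(count):
    """Body for updating the replica count of an existing index."""
    return {"index": {"number_of_replicas": count}}

before_load = replica_settings(0)  # no replicas while bulk loading
after_load = replica_settings(1)   # assumed target; restore your usual count
```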

Automatically generate ID

Letting Elasticsearch generate document IDs automatically avoids checking whether a document with the same ID already exists before each write.

Use the analyzer appropriately

Fields of the binary type are not suitable for analysis. Using different analyzers for the title and the body text can speed up indexing.

Disable scoring where it is not needed, and extend the index refresh interval.
Put multiple index operations into one batch (bulk request) for processing.
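
Batching and the refresh interval can be sketched as follows. build_bulk_body produces the newline-delimited JSON expected by the _bulk endpoint. Because the action lines carry no _id, Elasticsearch generates the IDs itself. The index name book, the 30s interval, and the function name are assumptions for this example.

```python
import json

# Body for PUT /<index>/_settings; "30s" is an assumed value ("-1" disables refresh).
refresh_settings = {"index": {"refresh_interval": "30s"}}

def build_bulk_body(index_name, docs):
    """Build an NDJSON bulk body: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # no _id: auto-generated
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

bulk_body = build_bulk_body("book", [{"title": "A"}, {"title": "B"}])
```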

Read Optimization

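Use filter context instead of query context wherever relevance scoring is not needed, because filters skip scoring and their results can be cached. Use a bool query to combine query and filter clauses.
Split data into indexes by day, month, or year, so that queries can be directed at only the relevant indexes.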
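
Both read optimizations can be sketched with plain dictionaries and date arithmetic. The first function puts the non-scoring condition in the filter clause of a bool query. The second generates the names of daily indexes so that a search can target only the days it needs, for example by joining the names with commas in the request path. The logs- prefix, the date format, and the field names are assumptions.

```python
from datetime import date, timedelta

def filtered_search(text, category):
    """Score only the full-text part; the category condition runs as a filter."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [{"term": {"category": category}}],
            }
        }
    }

def daily_index_names(prefix, start, end):
    """List one index name per day between start and end, inclusive."""
    days = (end - start).days
    return [
        f"{prefix}-{(start + timedelta(days=i)).strftime('%Y.%m.%d')}"
        for i in range(days + 1)
    ]

names = daily_index_names("logs", date(2020, 6, 29), date(2020, 7, 1))
search_path = "/" + ",".join(names) + "/_search"
```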

Zero-downtime Index Reconstruction Solution

External data import
- Send a specified MQ message through the MQ web console or CLI command line.
- MQ messages are consumed by consumers of the microservice module, triggering the Elasticsearch data re-import feature.
- The microservice module queries the database for the total record count and paging information, and sends them to MQ.
- The microservice reads each page of data from the database according to the paging information in MQ, assembles it into the JSON format that Elasticsearch expects based on the index definition, and sends it to the Elasticsearch cluster with the bulk command to rebuild the index.
The solution based on scroll + bulk + index alias
- Create a new index book_new and define the mapping and settings according to the new requirements.
- Use the scroll API to query data in batches and specify the scroll query duration.
- Use the bulk API to batch-write the data found by the scroll into a new index.
- Repeat: query one batch, then import it. Note that each scroll request must use the scroll_id returned by the previous request.
- Switch book_alias to the new index book_new. The Java client still accesses the data through the alias, so no code has to change and there is no downtime.
- Verify that queries through the alias return data stored in the new index.
Reindex API solution
- Elasticsearch v6.3.1 already supports the Reindex API, which encapsulates scroll and bulk and can reindex documents without any plug-ins or external tools.
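
The scroll + bulk + alias steps and the Reindex alternative can be sketched as follows. copy_with_scroll shows only the control flow, with the scroll and bulk calls passed in as functions so that the sketch runs without a cluster. alias_switch_body and reindex_body build the request bodies for POST /_aliases and POST /_reindex. The names book_new and book_alias come from the steps above, while the old index name book_old and all function names are assumptions.

```python
def copy_with_scroll(start_scroll, continue_scroll, bulk_write):
    """Copy all hits batch by batch, always passing on the latest scroll_id.

    start_scroll() and continue_scroll(scroll_id) each return (scroll_id, hits).
    """
    copied = 0
    scroll_id, hits = start_scroll()
    while hits:
        bulk_write(hits)
        copied += len(hits)
        scroll_id, hits = continue_scroll(scroll_id)
    return copied

def alias_switch_body(alias, old_index, new_index):
    """Move an alias from the old index to the new one in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

def reindex_body(source_index, dest_index):
    """Body for the Reindex API, which wraps scroll and bulk."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}

# In-memory stand-ins for the scroll and bulk calls (hypothetical data).
pages = [[{"id": 1}, {"id": 2}], [{"id": 3}], []]
written = []

def fake_start():
    return "0", pages[0]

def fake_continue(scroll_id):
    next_page = int(scroll_id) + 1
    return str(next_page), pages[next_page]

total = copy_with_scroll(fake_start, fake_continue, written.extend)
switch = alias_switch_body("book_alias", "book_old", "book_new")
```

Putting the remove and add actions in one request means the alias never points at nothing, which is what keeps the switch free of downtime.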

Developer involvement and flexibility: self-developed > scroll + bulk > reindex

Stability and reliability: self-developed < scroll + bulk < reindex

Deep Paging Performance Solution

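For example, suppose a super administrator wants to send an announcement or advertisement to all users in a province. The simplest approach is to page through the users with from + size, but this is impractical for deep pages: each shard has to collect and sort from + size documents for every request, and by default Elasticsearch rejects requests where from + size exceeds 10,000 (the index.max_result_window setting).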

Paging method	Performance	Advantages	Disadvantages	Scenarios
from + size	Low	Flexible and simple to implement.	Suffers from the deep paging problem.	Small data volumes where the deep paging problem can be tolerated.
scroll	Medium	Solves the deep paging problem.	Works on a snapshot, so it does not reflect real-time changes to the data. Maintenance cost is high because a scroll_id must be maintained.	Exporting large amounts of data or querying very large result sets.
search_after	High	Best performance, no deep paging problem, and reflects real-time changes to the data.	Continuous paging is more complicated to implement because each query depends on the results of the previous one.	Paging through large amounts of data.
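
To show why search_after needs the previous page's results, the sketch below builds the request body and simulates the paging logic on an in-memory list. The sort fields timestamp and id, the function names, and the sample data are assumptions. The simulation is a model of the idea, not of Elasticsearch's internals.

```python
def search_after_body(size, search_after=None):
    """Search body sorted on a unique combination of fields."""
    body = {"size": size, "sort": [{"timestamp": "asc"}, {"id": "asc"}]}
    if search_after is not None:
        body["search_after"] = search_after  # sort values of the previous page's last hit
    return body

def search_after_page(docs, size, search_after=None):
    """Local model: return one page plus the cursor for the next page."""
    ordered = sorted(docs, key=lambda d: (d["timestamp"], d["id"]))
    if search_after is not None:
        cursor = tuple(search_after)
        ordered = [d for d in ordered if (d["timestamp"], d["id"]) > cursor]
    page = ordered[:size]
    next_cursor = [page[-1]["timestamp"], page[-1]["id"]] if page else None
    return page, next_cursor

docs = [{"id": i, "timestamp": 100 + i // 2} for i in range(5)]

# Walk through all pages, feeding each cursor into the next request.
seen_ids = []
cursor = None
while True:
    page, cursor = search_after_page(docs, size=2, search_after=cursor)
    if not page:
        break
    seen_ids.extend(d["id"] for d in page)
```

Including id as a tiebreaker in the sort keeps the cursor unambiguous when several documents share the same timestamp.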

Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
