Some Important Concepts in Elasticsearch Include Clusters, Nodes, Indexes, Documents, Shards, and Replicas

When we start using Elasticsearch, we need to understand a few important concepts. They are essential for working with the Elastic Stack later on. In this article, let's introduce the most important of them.
First, let's take a look at the following picture:

Cluster

An Elasticsearch cluster consists of one or more nodes and is identified by its cluster name. The cluster name is usually set in the Elasticsearch configuration file. By default, a running Elasticsearch instance forms a cluster called "elasticsearch". We can customize the cluster name with the cluster.name setting in config/elasticsearch.yml.
We can call:
GET _cluster/state

to get the state of the entire cluster. This state can only be changed by the master node. The result returned by the above request is:

{
  "cluster_name": "elasticsearch",
  "compressed_size_in_bytes": 1920,
  "version": 10,
  "state_uuid": "rPiUZXbURICvkPl8GxQXUA",
  "master_node": "O4cNlHDuTyWdDhq7vhJE7g",
  "blocks": {},
  "nodes": {...},
  "metadata": {...},
  "routing_table": {...},
  "routing_nodes": {...},
  "snapshots": {...},
  "restore": {...},
  "snapshot_deletions": {...}
}
Node
A node is a single Elasticsearch instance. In most environments, each node runs on a separate box or virtual machine. A cluster consists of one or more nodes. In a test environment, we can run multiple nodes on one server. In production, in most cases each node should still run on its own server.

Nodes can be divided into the following types according to their role:

master-eligible: can be elected as the master node. Once elected, it manages the settings and changes of the entire cluster: creating, updating, and deleting indexes; adding or removing nodes; assigning shards to nodes

data: stores data and executes data-related operations

ingest: pre-processes documents before indexing (for example, with ingest pipelines)

machine learning (Gold/Platinum license)

In general, a node can have one or more of the above roles. We can define them on the command line or in the Elasticsearch configuration file (elasticsearch.yml):
Node type          Configuration parameter   Default value
master-eligible    node.master               true
data               node.data                 true
ingest             node.ingest               true
machine learning   node.ml                   true (except OSS releases)
A node can also be dedicated to a single role. If all of the above parameters are set to false, the node acts as a coordinating-only node: it accepts external requests and forwards them to the appropriate nodes for processing. For a dedicated master node, we sometimes also need to set cluster.remote.connect: false.
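The way the role flags combine can be sketched in Python. This is only an illustration of the table above, not Elasticsearch code: the helper names `effective_roles` and `is_coordinating_only` are hypothetical, and the sketch assumes that a node with every role flag set to false is a coordinating-only node.

```python
# Defaults taken from the table above (settings used before 7.9).
# node.ml defaults to true except on OSS releases.
DEFAULT_ROLES = {
    "node.master": True,
    "node.data": True,
    "node.ingest": True,
    "node.ml": True,
}


def effective_roles(settings: dict) -> list:
    """Return the roles a node has after applying its explicit settings.

    `settings` holds only the flags set explicitly in elasticsearch.yml;
    every missing flag falls back to its default.
    """
    merged = {**DEFAULT_ROLES, **settings}
    return [name.split(".", 1)[1] for name in DEFAULT_ROLES if merged[name]]


def is_coordinating_only(settings: dict) -> bool:
    """A node with no role left only coordinates requests (assumption of this sketch)."""
    return not effective_roles(settings)


# A node without any explicit setting keeps all default roles.
print(effective_roles({}))

# A dedicated master node: every other role is switched off.
dedicated_master = {"node.data": False, "node.ingest": False, "node.ml": False}
print(effective_roles(dedicated_master))

# A coordinating-only node: every role is switched off.
all_off = {name: False for name in DEFAULT_ROLES}
print(is_coordinating_only(all_off))
```

Note that an empty configuration gives a node all roles, because every flag defaults to true.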

In practice, we should send requests to data nodes rather than to the master node.

We can define the role of a node in the cluster by configuring in the config/elasticsearch.yml file:

In some cases, we can set node.voting_only to true on a node whose node.master is true, so that the node only votes in master elections and is never elected as the master node itself. This helps to avoid split-brain conditions. Such a node can usually run on a machine with lower CPU performance.

In a cluster, we can use the following command to get all master-eligible nodes that can currently vote:
GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
You might get results like the following list:
{
  "metadata" : {
    "cluster_coordination" : {
      "last_committed_config" : [
        "Xe6KFUYCTA6AWRpbw84qaQ",
        "OvD79L1lQme1hi06Ouiu7Q",
        "e6KF9L1lQUYbw84CTAemQl"
      ]
    }
  }
}

In the entire Elastic architecture, the relationship between Data Node and Cluster is expressed as follows:

The definitions above were valid prior to the Elastic Stack 7.9 release. Node roles changed in 7.9 and later; please read the article "Elasticsearch: Introduction to Node roles - Post 7.9".
Document
Elasticsearch is document-oriented, which means that the smallest unit of data you index or search is a document. Documents have some important properties in Elasticsearch:

It is self-contained. A document contains fields (names) and their values.
It can be hierarchical. Think of it as documents within a document. A field's value can be simple, such as a string for the location field. It can also contain other fields and values. For example, a location field might contain a city and a street address.
It has a flexible structure. Your documents do not depend on a predefined schema. For example, not all events require a description value, so this field can be omitted entirely. An event may, however, need new fields such as the latitude and longitude of the location.

Documents are usually JSON representations of data. JSON over HTTP is the most widely used way to communicate with Elasticsearch, and it's the method we use in this article. For example, an event on your meetup website could be represented by the following document:

{
"name": "Elasticsearch Denver",
"organizer": "Lee",
"location": "Denver, Colorado, USA"
}

Many people compare a document to a record (row) in a relational database.
Type
Types are logical containers for documents, just as tables are containers for rows. You put documents with different structures (schemas) in different types. For example, you can use one type for meetup groups and another type for the events where people gather.
The definition of the fields in each type is called a mapping. For example, name will be mapped to a string, but the geolocation field under location will be mapped to the special geo_point type. Each field is handled differently. For example, you search for a word in the name field, and you search for groups by location to find those located near where you live.
Many people think that Elasticsearch is schema-less, and some even believe that data in Elasticsearch needs no mapping at all. This is a misconception. Schema-less in Elasticsearch means that, unlike with a relational database, we do not have to define a table in advance before using it. We can skip defining a mapping and write directly to the index we specify; the mapping of this index is then generated dynamically (we can also disable this behavior). The data type of each field is detected dynamically, for example dates and strings, although some types, such as geo_point and other geographic data, still need to be defined manually. It also means that new fields appearing in later documents are added to the mapping dynamically.
Elasticsearch is schema-less, which means that documents can be indexed without explicitly specifying how to handle each of the different fields that may appear in the document. When dynamic mapping is enabled, Elasticsearch automatically detects and adds new fields to the index. This default behavior makes indexing and browsing data easy - just start indexing documents and Elasticsearch will detect and map boolean, float and integer values, dates and strings to the appropriate Elasticsearch data type.
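The idea behind dynamic mapping can be sketched in Python. This is a simplified illustration under stated assumptions, not the real detection logic: the function `guess_field_type` is hypothetical, date detection is approximated with `datetime.fromisoformat`, and the sample document extends the meetup example with made-up fields.

```python
from datetime import datetime


def guess_field_type(value) -> str:
    """Guess an Elasticsearch field type from a JSON value (simplified)."""
    # bool must be checked before int: in Python, True is also an int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            # Approximation only; Elasticsearch uses its own date formats.
            datetime.fromisoformat(value)
            return "date"
        except ValueError:
            # Elasticsearch maps such strings to text with a keyword sub-field.
            return "text"
    raise TypeError(f"unsupported value in this sketch: {value!r}")


# Hypothetical document based on the meetup example.
doc = {
    "name": "Elasticsearch Denver",
    "attendees": 42,
    "rating": 4.5,
    "public": True,
    "created": "2019-08-12",
    "location": {"city": "Denver"},
}

mapping = {field: guess_field_type(value) for field, value in doc.items()}
for field, field_type in mapping.items():
    print(f"{field}: {field_type}")
```

A type such as geo_point cannot be guessed this way, which is why it has to be declared in the mapping explicitly.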
Starting with Elasticsearch 6.0, an index can contain only one type. The reason is that fields with the same name in different mapping types of the same index are backed by the same Lucene field, so they cannot be defined independently. By default, the type is _doc. In version 8.0, types are removed completely.
Index
In Elasticsearch, an index is a collection of documents.

Elasticsearch is built on Apache Lucene. An Elasticsearch index is distributed among one or more shards, and each shard corresponds to an Apache Lucene index. Each index consists of one or many documents, and these documents can be distributed among different shards.

Many people think that an index is similar to a database in a relational database system. There is some truth to this, but the two are not the same. One important reason is that documents in Elasticsearch can have object and nested structures. An index is a logical namespace that maps to one or more primary shards and can have zero or more replica shards.

Whenever a document is indexed, a hash is computed from the id of the document, and the document is stored in the shard that the hash selects. This spreads documents evenly across the shards, so that no single shard becomes much busier than the others.

shard_num = hash(_routing) % num_primary_shards

By default, the _routing value above is the _id of the document. If custom routing is used, related documents can be stored in one specific shard. The advantage is that, in some cases, we can assemble the results we need quickly without sending requests across nodes, for example with the join data type.

From the formula above, we can also see that the number of primary shards cannot be changed dynamically; otherwise, existing documents could no longer be found in the shard the formula points to. The number of replicas, however, can be modified dynamically.
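The routing formula can be illustrated with a small Python sketch. It is only an illustration: Elasticsearch uses its own Murmur3-based hash internally, while this sketch substitutes `zlib.crc32` so that the result is deterministic. The function name `shard_for` and the document ids are hypothetical.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Return the shard number for a routing value.

    Mirrors shard_num = hash(_routing) % num_primary_shards.
    zlib.crc32 is a stand-in hash, NOT the hash Elasticsearch uses.
    """
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# Hypothetical document ids; by default _routing is the document _id.
doc_ids = [f"doc-{i}" for i in range(1000)]

# Shard placement with 5 primary shards, then with 6.
with_5 = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
with_6 = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}

# Documents whose computed shard changes would be looked up in a shard
# that does not hold them.
moved = [doc_id for doc_id in doc_ids if with_5[doc_id] != with_6[doc_id]]
print(f"{len(moved)} of {len(doc_ids)} documents map to a different shard")
```

Because the shard number depends on the modulus, changing the number of primary shards changes where most documents are expected to be, which is why this setting is fixed at index creation.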
Shard
Since Elasticsearch is a distributed search engine, indexes are usually split into parts called shards that are distributed across multiple nodes. Elasticsearch automatically manages the placement of these shards. It also rebalances shards as needed, so users don't need to worry about the details.

An index may store more data than the hardware of a single node can hold. For example, an index with 1 billion documents may occupy 1 TB of disk space that no single node has, or a single node may be too slow to serve search requests on its own.

To solve this problem, Elasticsearch provides the ability to divide the index into multiple parts, which are called shards. When you create an index, you can specify the number of shards you want. Each shard is itself a fully functional and independent "index" that can be placed on any node in the cluster.

Sharding is important for two main reasons:

It allows you to horizontally split/scale your content volume
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes), improving performance/throughput

There are two types of shards: primary shards and replica shards.

Primary shard: Each document is stored in a primary shard. When a document is indexed, it is first indexed on the primary shard and then on all replicas of that shard. An index can contain one or more primary shards. This number determines how far the index can scale with the size of its data. After the index is created, the number of primary shards cannot be changed.
Replica shard: Each primary shard can have zero or more replicas. A replica is a copy of the primary shard and serves two purposes:

Failover: if a primary shard fails, a replica shard can be promoted to primary. Even if you lose a node, the replica shard still holds all the data
Improved performance: get and search requests can be handled by either primary or replica shards

By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index, as follows:
PUT my_index/_settings
{
"number_of_replicas": 2
}

A replica shard is never allocated on the same node as its primary shard. In recent Elasticsearch versions, we can use the index.auto_expand_replicas setting to let Elasticsearch adjust the number of replicas automatically based on the number of data nodes. With this setting and a single node, we may get 0 replicas, which keeps the whole cluster healthy.
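The effect of this placement rule can be sketched in Python. This is a simplified model under stated assumptions, not the real allocator: both function names are hypothetical, the auto-expand model covers only ranges that start at 0 (such as 0-1 or 0-all), and other allocation rules are ignored.

```python
from typing import Optional


def assignable_replicas(data_nodes: int, number_of_replicas: int) -> int:
    """How many replicas of one primary shard can actually be assigned.

    A replica never shares a node with its primary, so at most
    data_nodes - 1 replicas of the same shard can be placed.
    """
    if data_nodes < 1:
        raise ValueError("a cluster needs at least one data node")
    return min(number_of_replicas, data_nodes - 1)


def auto_expanded_replicas(data_nodes: int, upper: Optional[int]) -> int:
    """Simplified model of an auto-expand range starting at 0.

    upper=1 models the range 0-1; upper=None models 0-all.
    """
    if data_nodes < 1:
        raise ValueError("a cluster needs at least one data node")
    wanted = data_nodes - 1
    return wanted if upper is None else min(wanted, upper)


# One node, number_of_replicas = 1: the replica stays unassigned.
print(assignable_replicas(data_nodes=1, number_of_replicas=1))

# One node with a 0-1 range: the replica count drops to 0.
print(auto_expanded_replicas(data_nodes=1, upper=1))

# Three nodes with a 0-all range: one replica per additional node.
print(auto_expanded_replicas(data_nodes=3, upper=None))
```

With a fixed replica count on a single node, the unassigned replica leaves the index short of a copy, while the auto-expanded count of 0 leaves nothing unassigned.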


Usually, a shard can store many documents. In practice, increasing the number of replica shards can improve search speed, because more shard copies can serve searches at the same time. However, more replica shards also slow down writes. In many cases, we can even set the number of replicas to 0 while writing large batches of data. For details, please refer to the article "Elasticsearch: How to improve the speed of data ingestion in Elasticsearch". Increasing the number of primary shards can speed up writes, because more shards can accept data at the same time.

Some developers may think that the more shards, the better. In fact, oversharding is a common problem that Elasticsearch users encounter. Many small shards consume a lot of resources because each shard corresponds to a Lucene index. A shard can usually store dozens of gigabytes of data. If you need more shards, you can:

Create more indexes, which makes it easy to scale. For example, for time-series data, we can create a new index every day or every week
Use the Split API to increase the number of shards of a large index. See the article "Elasticsearch: Split index API - splitting a large index into more shards"

A shard's performance changes as it grows:

As shown above, we recommend 50 GB as the index size for best performance. In Beats, the default ILM policy also uses an index size of 50 GB.

The figure below shows an index with 5 shards and 1 replica

These shards are distributed on different physical machines:

We can set the number of shards and replicas for each index:
curl -XPUT 'http://localhost:9200/another_user?pretty' -H 'Content-Type: application/json' -d '
{
  "settings" : {
    "index.number_of_shards" : 2,
    "index.number_of_replicas" : 1
  }
}'

In the REST call above, we set 2 primary shards and 1 replica for the another_user index. Once the number of primary shards is set, we cannot modify it. This is because Elasticsearch assigns each document to a shard based on the id of the document and the number of primary shards. If this number were changed later, a search might no longer find a document in the shard it is expected to be in.

We can view the settings of an index with the following request:
curl -XGET 'http://localhost:9200/twitter/_settings?pretty'
This returns the settings of the twitter index:
{
  "twitter" : {
    "settings" : {
      "index" : {
        "creation_date" : "1565618906830",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "rwgT8ppWR3aiXKsMHaSx-w",
        "version" : {
          "created" : "7030099"
        },
        "provided_name" : "twitter"
      }
    }
  }
}

Replica
By default, Elasticsearch creates one primary shard and one replica for each index. This means that each index will contain one primary shard and each shard will have one replica.

Allocating multiple shards and replicas is the essence of the design of the distributed search function, providing high availability and fast access to documents in the index. The main difference between primary and replica shards is that only primary shards can accept indexing requests. Both replica and primary shards can serve query requests.

In the diagram above, we have an Elasticsearch cluster consisting of two nodes in a default sharding configuration. Elasticsearch automatically distributes the primary shards across the two nodes. There is a replica shard for each primary shard, but the placement of these replica shards differs from the placement of the primary shards.

To clarify: remember that the number_of_shards value applies to an index, not to the entire cluster. This value specifies the number of primary shards per index (not the total number of primary shards in the cluster).
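The arithmetic behind this can be sketched in Python. The helper name `total_shard_copies` is hypothetical; the per-index values are the ones used in the examples of this article (another_user with 2 shards and 1 replica, twitter with 1 shard and 1 replica).

```python
def total_shard_copies(number_of_shards: int, number_of_replicas: int) -> int:
    """Primary shards plus all their replicas for ONE index."""
    return number_of_shards * (1 + number_of_replicas)


# Per-index settings taken from the examples in this article.
indices = {
    "another_user": {"number_of_shards": 2, "number_of_replicas": 1},
    "twitter": {"number_of_shards": 1, "number_of_replicas": 1},
}

per_index = {
    name: total_shard_copies(**settings) for name, settings in indices.items()
}
cluster_total = sum(per_index.values())

print(per_index)      # {'another_user': 4, 'twitter': 2}
print(cluster_total)  # 6
```

By the same calculation, an index with 5 shards and 1 replica has 10 shard copies in total.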

We can get the health of an index through the following interface:
http://localhost:9200/_cat/indices/twitter

The above interface can return the following information:

If an index is shown in red, at least one primary shard of this index has not been allocated, and that shard and its replicas cannot be accessed normally. If it is green, every primary shard of the index and all of its replicas are allocated. If one node fails, the replica on another node takes over, so that no data is lost.

We can check the health status of the cluster with the following command:
GET _cluster/health

Shard Health

Red: At least one primary shard is not allocated in the cluster
Yellow: All primary shards are allocated, but at least one replica is not
Green: All shards are allocated
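The three colors can be expressed as a small decision rule. The following Python sketch is only a model of the list above: the `ShardCopy` class and the `health` function are hypothetical and do not correspond to any Elasticsearch API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShardCopy:
    index: str      # index name
    shard: int      # shard number within the index
    primary: bool   # True for a primary, False for a replica
    assigned: bool  # True if the copy is allocated to a node


def health(copies: List[ShardCopy]) -> str:
    """Classify health from the allocation state of all shard copies."""
    if any(c.primary and not c.assigned for c in copies):
        return "red"     # at least one primary is not allocated
    if any(not c.primary and not c.assigned for c in copies):
        return "yellow"  # all primaries allocated, a replica is missing
    return "green"       # every copy is allocated


# Single-node cluster: the replica of the twitter shard cannot be placed.
single_node = [
    ShardCopy("twitter", 0, primary=True, assigned=True),
    ShardCopy("twitter", 0, primary=False, assigned=False),
]
print(health(single_node))  # yellow
```

The same rule applies to a single index or to the whole cluster, depending on which shard copies are passed in.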

Next step: If you want to learn more about Lucene's data storage in shards, read "Elasticsearch: inverted index, doc_values and source".
