Li Meng (ynuosoft) is a long-time user of Elastic Stack products and an Elasticsearch Certified Engineer. He first encountered Elasticsearch in 2012 and has in-depth experience in Elastic Stack development, architecture, and operations and maintenance (O&M). He has worked on a variety of Elasticsearch projects, from the most demanding big data analysis applications to the most complex business systems. In his spare time, he provides Elastic Stack consulting, training, and optimization services for enterprises.
Indigo blue is extracted from the indigo plant, yet it is bluer than the plant it comes from.
I have been working with the Elastic Stack for a long time. To avoid the narrow vision that comes from knowing only one technology, it is worth looking outward and broadening my perspective. This article analyzes and discusses Elasticsearch from the perspective of its competing products.
• Which application scenarios is Elasticsearch best suited for?
• Which application scenarios is Elasticsearch not best suited for?
This article represents only my personal views, not those of the community or any technical camp. It is not intended to start an argument, and it is limited by my own experience and knowledge, so it may differ from readers' views.
Elasticsearch started as a search engine, has gradually evolved into an all-round data product, and now focuses on big data analysis. Among its many excellent features, more and more overlap and compete with other data products. Some of these features are unique strengths, while others are only add-ons. Understanding the characteristics of each product helps in matching it to business needs.
Lucene is a core search library, and Elasticsearch is built on top of it. The competitive relationship between the two comes from the limitations of Lucene itself.
In the Web 2.0 era, the simplest way to test an internet company's technical capability was to see how well its search worked. At that time almost everyone did the same thing: they built a search engine on the Lucene core library, and the rest depended on the skill of each company's developers. Before 2012 I had the chance to work on a vertical search engine based on Lucene. It had many problems worth mentioning:
• The project wrapped Lucene directly. Business code was built and released together with the core library, so coupling was very high. Every change to a data field required recompiling, repackaging, and re-releasing, a process that was complicated and quite risky.
• Re-releasing the program required shutting down the running one, which involved process switching.
• Periodically rebuilding the full index also involved switching between the old and new indexes and refreshing the index in real time, so a complex mechanism had to be designed.
• Each independent business line needed its own Lucene index process. With more business lines, management became troublesome.
• When the data in a single Lucene index exceeded the limit of a single instance, it had to be distributed, which Lucene itself cannot do. The conventional workaround was to split the data into multiple index processes by category: the client sent the category with each query, and the backend routed the query to the matching index.
• Lucene itself is hard to master. Less experienced engineers have too many factors to consider, and a small mistake can cause a serious problem.
Compared with the Lucene core library, the advantages of Elasticsearch are:
• It encapsulates the Lucene core library well and exposes a friendly RESTful API. Developers do not need to pay much attention to the underlying mechanisms and can use it out of the box.
• Its sharding and replica mechanisms directly address performance and high availability in clusters.
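To make the sharding point concrete, here is a minimal sketch of hash-based routing, the idea that replaces the hand-written category routing described earlier. It is an illustration only: the function names are hypothetical, and `zlib.crc32` stands in for the Murmur3 hash that Elasticsearch actually uses. The simplified rule is shard = hash(routing key) mod number of primary shards.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (by default the document _id) to a primary shard.

    zlib.crc32 is only a stand-in hash for this sketch; Elasticsearch
    itself uses Murmur3 and a slightly more elaborate formula.
    """
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def group_by_shard(doc_ids, num_primary_shards):
    """Group document ids by the shard they would be routed to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    ids = [f"order-{i}" for i in range(10)]
    for shard, members in group_by_shard(ids, 3).items():
        print(shard, members)
```

Because the shard is derived from the key alone, the client never needs to know a category-to-index mapping, which is the part that had to be built by hand on raw Lucene.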
With Elastic's rapid development in recent years, few projects on the market still build search engines directly on Lucene. Almost all of them choose Elasticsearch as the underlying data service. Because it is open source, many cloud vendors have also built customized versions on top of it and integrated them deeply with their own cloud platforms, but have not developed independent branches.
In this competition, Elasticsearch wins completely.
Solr was the first full-featured search engine product built on the Lucene core library, and it was born much earlier than Elasticsearch. In the early days Solr had a large advantage in full-text search and almost completely overshadowed Elasticsearch. In the recent era of big data, however, Elasticsearch has met many big data processing requirements thanks to its distributed design, and the popularity of ELK in particular has made people almost forget that Solr exists. Although the distributed SolrCloud has since been released, it offers essentially no advantage.
I have worked with several data companies whose full-text search was based on Solr running in single-node mode. When problems occasionally occurred, they had to bring in a consultant to solve them, and it was difficult to hire people with Solr skills. They later migrated to Elasticsearch.
At present, almost all companies on the market, large and small, use Elasticsearch. Apart from legacy systems, some of which are based on Solr, practically all new projects choose Elasticsearch.
I personally think there are several reasons:
• ES is friendlier and more concise than Solr, with a lower barrier to entry.
• ES has more features than Solr, including the sharding mechanism and data analysis capabilities.
• As the ES ecosystem has developed, the Elastic Stack has become a complete technology stack that is easy to integrate with various data systems.
• The ES community is more active, while Solr hardly has any dedicated technical conferences.
In this competition, Elasticsearch wins completely.
Compared with Elasticsearch, relational databases have one irreplaceable advantage, their transaction isolation mechanism, but their limitations are also obvious:
• Query performance of relational databases drops significantly once the data volume exceeds millions of rows. The root cause, in my view, is the index algorithm: the B+ tree is not as efficient as the inverted index.
• Indexes in relational databases are bound by the leftmost-prefix rule. Query condition fields cannot be combined arbitrarily, or the index is not used. In Elasticsearch, by contrast, fields can be combined freely. The difference is especially visible for queries that join tables: Elasticsearch can handle them with a wide, denormalized table, whereas a relational database cannot.
• Multi-condition queries are hard to implement once databases and tables have been sharded. Elasticsearch is distributed by design, so multiple indexes and multiple shards can be queried together.
• Aggregation performance in relational databases is low. It degrades quickly once the data volume grows a little or the cardinality of the queried columns rises. Elasticsearch aggregates on columnar storage, which is highly efficient.
• Relational databases aim for balance, while Elasticsearch aims for query speed.
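The point about freely combining query fields can be illustrated with a toy inverted index. This is a sketch under simple assumptions, not how Lucene is implemented: the class name `TinyInvertedIndex` and the sample documents are hypothetical, every field is indexed independently, and a multi-field query is the intersection of the matching posting sets.

```python
from collections import defaultdict


class TinyInvertedIndex:
    """Toy inverted index: each (field, value) pair maps to a set of doc ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, doc):
        # Every field is indexed on its own, so no field order is baked in.
        for field, value in doc.items():
            self._postings[(field, value)].add(doc_id)

    def search(self, **conditions):
        """Return sorted ids of documents matching ALL field=value conditions."""
        if not conditions:
            return []
        posting_sets = [self._postings.get(item, set()) for item in conditions.items()]
        posting_sets.sort(key=len)  # intersect the smallest set first
        result = set(posting_sets[0])
        for postings in posting_sets[1:]:
            result &= postings
        return sorted(result)


if __name__ == "__main__":
    index = TinyInvertedIndex()
    index.add(1, {"city": "beijing", "status": "paid", "channel": "app"})
    index.add(2, {"city": "shanghai", "status": "paid", "channel": "web"})
    index.add(3, {"city": "beijing", "status": "refunded", "channel": "web"})
    index.add(4, {"city": "beijing", "status": "paid", "channel": "web"})
    print(index.search(status="paid", channel="web"))
    print(index.search(channel="web", city="beijing"))
```

Any subset of fields can be queried in any order, whereas a composite B+ tree index on (city, status, channel) only helps queries that constrain a leading prefix of those columns.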
If the data does not need strict transaction isolation, I think Elasticsearch can be used instead. If the data needs both transaction isolation and query performance, a combination of a database and Elasticsearch can be used. For more information, see the author's blog article "Real-time data synchronization of DB and ES hybrid applications".
OpenTSDB is a time series database implemented on top of HBase. It optimizes its data structures for data with time characteristics, which makes it suitable for storing such data, for example monitoring data and temperature readings. Xiaomi's open-source monitoring system Open-Falcon is implemented on OpenTSDB.
Elasticsearch was not originally aimed at the time series field. With the popularity of ELK, however, many companies use it to build monitoring systems. Although it does not give numeric types the special treatment that a time series database does, its ease of use and the strength of its ecosystem have led people to accept this.
Building time series on Elasticsearch is simple and performs well:
• Index creation rules: indexes can be created by year, month, week, day, or hour.
• Data ingestion: define a time field for sorting. No other field is required.
• Data query: in addition to querying by time sequence, many other search criteria are available.
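The index creation rule above can be sketched as a small naming helper. This is only an illustration: the function `index_name`, the `metrics` prefix, and the exact name patterns are assumptions for the example, not a convention required by Elasticsearch.

```python
from datetime import datetime

# strftime patterns per granularity (hypothetical naming scheme)
_PATTERNS = {
    "year": "%Y",
    "month": "%Y.%m",
    "day": "%Y.%m.%d",
    "hour": "%Y.%m.%d.%H",
}


def index_name(prefix: str, ts: datetime, granularity: str = "day") -> str:
    """Build a time-based index name such as 'metrics-2020.05.17'."""
    if granularity == "week":
        # ISO week numbering: the year may differ from ts.year around New Year.
        iso_year, iso_week, _ = ts.isocalendar()
        return f"{prefix}-{iso_year}.w{iso_week:02d}"
    if granularity not in _PATTERNS:
        raise ValueError(f"unsupported granularity: {granularity}")
    return f"{prefix}-{ts.strftime(_PATTERNS[granularity])}"


if __name__ == "__main__":
    ts = datetime(2020, 5, 17, 8, 30)
    for g in ("year", "month", "week", "day", "hour"):
        print(g, index_name("metrics", ts, g))
```

Writing each document to the index named after its own timestamp keeps every index bounded in size, and old data can be dropped by deleting whole indexes.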
Unless there are very strict requirements for monitoring time series data, Elasticsearch is the more appropriate choice.
HBase is a representative columnar database. Some fundamental limitations of its internal design greatly restrict its scope of application:
• Data in HBase can only be accessed by rowkey. The rowkey design directly determines how HBase can be used.
• It does not support secondary indexes. Implementing them requires introducing a third-party component.
I will not go into its technical principles, but will describe one case from practice.
The company is in the logistics and express delivery industry, and the project concerned vehicles. It recorded all vehicle driving tracks, with on-board devices reporting track information at regular intervals. The backend storage was HBase, and the data volume was more than several dozen TB. The business side needed to calculate fuel consumption per kilometer and related costs from the track information, so it had to query data in batches by query conditions. These conditions included non-rowkey fields such as time range, ticket number, and city number, which was almost impossible to support. A brute-force approach had been tried, and its performance was worrying. The first problem in this project was that a rowkey could hardly be designed to satisfy the query requirements. The second was the lack of secondary indexes, which the many query conditions required.
If a columnar database is used only for rowkey access, Elasticsearch can be used as well. As long as the _id is well designed, the same effect as HBase can be achieved.
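As a sketch of what designing the _id could look like for the vehicle track case, the snippet below composes a deterministic key from the vehicle id and the report time, in the same way a rowkey would be composed. The helper names, the `#` separator, and the field names are hypothetical, and how the resulting document is sent to the cluster is deliberately left out.

```python
from datetime import datetime


def make_doc_id(vehicle_id: str, reported_at: datetime) -> str:
    """Compose a deterministic _id from vehicle id and report time (rowkey style)."""
    return f"{vehicle_id}#{reported_at.strftime('%Y%m%d%H%M%S')}"


def make_track_doc(vehicle_id: str, reported_at: datetime, ticket_no: str, city_no: str) -> dict:
    """Build a document whose key parts are ALSO stored as ordinary fields.

    Keeping ticket_no, city_no and the timestamp as fields is what lets them
    be used as query conditions, unlike non-rowkey columns in HBase.
    """
    return {
        "_id": make_doc_id(vehicle_id, reported_at),
        "_source": {
            "vehicle_id": vehicle_id,
            "reported_at": reported_at.isoformat(),
            "ticket_no": ticket_no,
            "city_no": city_no,
        },
    }


if __name__ == "__main__":
    doc = make_track_doc("V1001", datetime(2020, 5, 17, 8, 30, 0), "T778", "010")
    print(doc["_id"])
    print(doc["_source"])
```

The same track point always produces the same _id, so a repeated report overwrites rather than duplicates, and a direct lookup by _id plays the role of a rowkey get.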
If you need to introduce a third-party component in order to query data in a columnar database, it is better to build directly on Elasticsearch.
Unless you have a strict requirement to use a columnar database, Elasticsearch is more general-purpose and fits more business scenarios.
MongoDB is a representative document database. Its data model is based on BSON, while the document data model of Elasticsearch is JSON. BSON is essentially an extension of JSON, and the two can be converted to each other directly. The schemas of both can be extended freely with almost no restrictions. MongoDB positions itself as a competitor to relational databases and supports a strict transaction isolation mechanism, which makes it a different kind of product from Elasticsearch. In practice, however, few companies put core business data in MongoDB, and relational databases remain the first choice. Outside that positioning, Elasticsearch has the following advantages over MongoDB:
• Document query performance: the inverted index and BKD tree perform better than the B+ tree.
• Data aggregation and analysis: ES provides columnar doc_values, which makes aggregation much faster than in MongoDB.
• Cluster sharding: the sharding mechanism gives ES a better architecture design.
• ES provides more features than MongoDB and covers a wider range of scenarios.
• Document data sample: the ObjectId is generated automatically by MongoDB.
The company happened to have such a project. Its data layer was originally designed and built on MongoDB and had many query problems. It was later migrated successfully to Elasticsearch: the number of servers dropped from 15 to 3, and query performance improved tenfold. For more information, see the author's other article "After migrating from MongoDB to ES, we reduced 80% of servers".
Setting data transaction isolation aside, Elasticsearch can completely replace MongoDB.
ClickHouse is an MPP query and analysis database that has been very active in recent years, and many leading companies have adopted it. Why did we adopt it? Our reasons may differ from those of other leading companies:
• I have worked in big data for a long time and often meet requirements for real-time aggregation queries. In the early days we used relational databases such as MySQL or PostgreSQL for aggregation queries, and ran into performance bottlenecks unless we were careful.
• We later introduced Elasticsearch, which is based on columnar storage and a sharded architecture. Its performance was clearly better than that of a single-node relational database in every respect.
• The limitations of Elasticsearch are also obvious. First, once the data volume exceeds tens or hundreds of millions of records, performance hits a bottleneck if too many columns are aggregated. Second, deep secondary aggregation is not supported, so some complex aggregation requirements must be implemented externally in hand-written code, which increases the development workload.
• We later introduced ClickHouse to replace Elasticsearch for deep aggregation. Performance is good at data volumes from tens of millions to hundreds of millions of records, and resource consumption is much lower than before, so the same servers can support more business needs.
Like Elasticsearch, ClickHouse uses a columnar storage structure and supports replicas and sharding. The difference is that ClickHouse has some unique low-level implementations:
• MergeTree: the merge tree table engine, which provides data partitions, primary indexes, and secondary indexes.
• Vector engine: data is not only stored by column but also processed in vectors (chunks of a column), which uses the CPU more efficiently.
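The idea of storing data by column and processing it in chunks can be sketched in plain Python. This only shows the data layout and the block-at-a-time loop. It does not reproduce the SIMD-level efficiency of a real vectorized engine, and the function names, the schema, and the block size are assumptions made for the example.

```python
from array import array


def rows_to_columns(rows, schema):
    """Convert row dicts into one typed array per column.

    schema maps a column name to an array typecode, e.g. 'q' (int64), 'd' (float64).
    """
    columns = {name: array(code) for name, code in schema.items()}
    for row in rows:
        for name in schema:
            columns[name].append(row[name])
    return columns


def sum_column_in_blocks(column, block_size=4):
    """Aggregate one column block by block instead of row by row."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    total = 0
    for start in range(0, len(column), block_size):
        block = column[start:start + block_size]  # contiguous chunk of one column
        total += sum(block)
    return total


if __name__ == "__main__":
    rows = [{"fee": i * 10, "km": i * 1.5} for i in range(10)]
    cols = rows_to_columns(rows, {"fee": "q", "km": "d"})
    print(sum_column_in_blocks(cols["fee"]))
```

An aggregation over `fee` touches only the `fee` array and never reads `km`, which is the basic reason columnar engines do well on aggregation-heavy queries.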
Druid is a big data MPP query product whose core feature is Rollup. All raw data to be rolled up must contain a time field. Elasticsearch released this feature from version 6.3.X onward, at which point the two products became competitors. Which one is superior depends on the requirements of the application scenario.
Druid sample data must contain a time field.
I was previously responsible for all data projects in the company that involved the Elasticsearch technology stack. At that time I also met requirements for real-time aggregation queries, but our situation was different: the index data was updated offline, and every day all indexes were deleted and recreated before the data was inserted. The Elastic version in use was 6.8.X, which supported only offline data Rollup, so we did not use the feature. Elastic released real-time Rollup only in version 7.2.X and later.
• Druid is more focused: its product design revolves around Rollup, whereas in Elastic Rollup is an add-on.
• Druid supports a variety of external data sources and can connect directly to Kafka data streams or to data inside the platform. Elastic supports only internal index data, so external data must first be imported into an index with a third-party tool.
• Druid discards the raw data after Rollup. Elastic keeps the original index and generates new index data from the Rollup.
• Druid's technical architecture is similar to Elastic's. It supports separation of node responsibilities and horizontal scaling.
• Druid and Elastic both support inverted indexes in their data models and use them for search and filtering.
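The Rollup operation that both products implement can be sketched as a pre-aggregation over time buckets. This is a minimal illustration rather than either product's actual implementation: the `rollup` function, the record layout (`ts` in epoch seconds, a `host` dimension, a `latency_ms` metric), and the sample values are all hypothetical.

```python
def rollup(records, interval_seconds, dimension, metric):
    """Pre-aggregate raw records into one summary row per (time bucket, dimension).

    Each record must carry an integer 'ts' field (epoch seconds), which is why
    rollup input always needs a time field.
    """
    buckets = {}
    for rec in records:
        bucket_start = rec["ts"] - rec["ts"] % interval_seconds
        key = (bucket_start, rec[dimension])
        value = rec[metric]
        agg = buckets.get(key)
        if agg is None:
            buckets[key] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return [
        {"ts": ts, dimension: dim, **agg}
        for (ts, dim), agg in sorted(buckets.items())
    ]


if __name__ == "__main__":
    raw = [
        {"ts": 0, "host": "a", "latency_ms": 10},
        {"ts": 30, "host": "a", "latency_ms": 20},
        {"ts": 59, "host": "a", "latency_ms": 30},
        {"ts": 60, "host": "a", "latency_ms": 5},
        {"ts": 10, "host": "b", "latency_ms": 7},
    ]
    for row in rollup(raw, 60, "host", "latency_ms"):
        print(row)
```

Five raw records collapse into three summary rows here. Whether the raw records are then discarded, as in Druid, or kept alongside the new rollup index, as in Elastic, is the design difference noted in the list above.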
• Elasticsearch has comprehensive features, a wide range of applications, and good performance. It is the first choice for general-purpose use.
• Elasticsearch outperforms almost all competing products in search and query. In my opinion, relational databases solve the data transaction problem, and Elasticsearch solves almost all search and query problems.
• In data analysis, Elasticsearch's capabilities are weaker. It can be used at scale for simple, common scenarios, but for specific business scenarios such as complex aggregation, large-scale Rollup, and large-scale key-value storage, a more specialized data product should be chosen.
• Elasticsearch is more and more an all-round data product rather than just a search engine. It is used in almost every industry and is very popular.
• Use Elasticsearch well and get off work early.
1. The content comes from the author's actual work with a variety of technology stacks to meet scenario requirements. It distills practical experience and reflections as a reference for those who come later.
2. This article focuses on comparing Elastic with competing products. It is only a general analysis, coarse in granularity and limited in depth. More specialized and in-depth competitive analysis articles will follow, so please stay tuned.
Statement: This article is reprinted with the authorization of the original author, Li Meng. The right to pursue legal liability for unauthorized reproduction is reserved.