Optimization of CK, ES and RediSearch solutions in tens of millions of data queries

References



ES page turning optimization and performance optimization | Nuggets
ES billion-level data retrieval optimization, three-second return to break through the performance bottleneck | InfoQ
RedisJson was born, and its performance crushed ES and Mongo | Nuggets

foreword


When encountering a business requirement during development, it is necessary to filter out data of no more than 10W from the tens of millions of pot data, and sort and break it up according to the configured weighting rules (similar to the fact that the commodity data under one category cannot be continuous. appears 3 times).
The realization of the business requirements, design ideas and scheme optimization are introduced below, and the following scheme is designed for "querying 10W-level data in tens of millions of data"

Multithreading + CK page turning scheme
ES scroll scan deep page solution
ES + Hbase combination solution
RediSearch + RedisJSON combination scheme

initial design
The overall scheme is designed as

First, filter out the "target data" from the pot table according to the configured "filtering rules"
According to the configured "sorting rule", sort the "target data" to get the "result data"

The technical solution is as follows

Run the derivative task every day, import the existing tens of millions of pot data (Hive table) into Clickhouse, and then use the CK table for data filtering.
Build the filtering rules and sorting rules of the business configuration into a "Filter + Sort" object SelectionQueryCondition.
When fetching "target data" from the CK pool table, enable multithreading, perform paging filtering, and store the obtained "target data" in the result list.

//page size default 5000
int pageSize = this.getPageSize();
//page number
int pageCnt = totalNum / this.getPageSize() + 1;

List> result = Lists.newArrayList();
List>>> futureList = new ArrayList<>(pageCnt);

//Enable multi-threaded call
for (int i = 1; i <= pageCnt; i++) {
//Construct the filtering rules and sorting rules of the business configuration as SelectionQueryCondition objects
SelectionQueryCondition selectionQueryCondition = buildSelectionQueryCondition(selectionQueryRuleData);
selectionQueryCondition.setPageSize(pageSize);
selectionQueryCondition.setPage(i);
futureList.add(selectionQueryEventPool.submit(new QuerySelectionDataThread(selectionQueryCondition)));
}


for (Future>> future : futureList) {
//RPC call
List> queryRes = future.get(20, TimeUnit.SECONDS);
if (CollectionUtils.isNotEmpty(queryRes)) {
// store the target data in result
result.addAll(queryRes);
}
}
copy code

Sort the target data result to get the final "result data".

CK paging query
In the third step of the "Preliminary Design Scheme" chapter, it is mentioned that "when fetching target data from the CK pool table, enable multithreading and perform paging filtering". The CK pagination query is introduced here.

It encapsulates the queryPoolSkuList method, which is responsible for obtaining target data from the CK table. This method internally calls the sqlSession.selectList method.

public List> queryPoolSkuList( Map params ) {
List> resultMaps = new ArrayList<>();

QueryCondition queryCondition = parseQueryCondition(params);
List> mapList = lianNuDao.queryPoolSkuList(getCkDt(),queryCondition);
if (CollectionUtils.isNotEmpty(mapList)) {
for (Map data : mapList) {
resultMaps.add(camelKey(data));
}
}
return resultMaps;
}
copy code
//lianNuDao.queryPoolSkuList

@Autowired
@Qualifier("ckSqlNewSession")
private SqlSession sqlSession;

public List> queryPoolSkuList( String dt, QueryCondition queryCondition ) {
queryCondition.setDt(dt);
queryCondition.checkMultiQueryItems();
return sqlSession.selectList("LianNu.queryPoolSkuList",queryCondition);
}
copy code

The queryPoolSkuList query method that interacts with CK is called in the sqlSession.selectList method. Part of the code is as follows.


copy code

It can be seen that in the CK paging query, the paging is realized by limit #{limitStart},#{limitEnd}.

The limit paging scheme has performance problems when "deep page turning". After the first version of the plan is launched, 10W of data is screened in the 1000W-level pot data, and the worst time will be about 10s~18s.
Optimize deep page turning with ES Scroll Scan
For the performance problem of CK deep page turning, it has been optimized and optimized by using Elasticsearch's scroll scan page turning solution.
ES's page turning solution

Regarding the "ES page turning solution", please refer to ES page turning optimization and performance optimization for details | Nuggets

ES page turning, there are the following options

from + size turn pages
scroll page
scroll scan
search after page turn

Performance Advantages and Disadvantages of Page Turning Scenarios Low from + size, good flexibility, simple deep paging problem, small amount of data, can tolerate deep paging problem Real-time; sortable, but unable to skip pages to query and query massive data. Scroll scan is based on the scroll scheme, which further improves the performance of massive data query. It cannot be sorted. The other disadvantages are the same as scroll querying massive data. The problem is that it is complex to reflect real-time changes in data and requires a globally unique field. The implementation of continuous paging is more complicated, because each query requires the results of the previous query. It is not suitable for large-scale page jump queries, but is suitable for paging of massive data
For the above several page-turning schemes, query different numbers of data, and the time-consuming data is as follows.

ES page turning mode 1-1049000-4901099000-99010from + size8ms30ms117msscroll7ms66ms36mssearch_after5ms8ms7ms
time-consuming data
Here, we use Elasticsearch's scroll scan page-turning solution and the CK page-turning solution in the first version to query the data, and compare the time-consuming data.


As shown in the above test data, it can be found that, taking the pots of 100,000, 1,000,000, 10,000,000 as an example

The larger the magnitude of the pot, the more time consuming to query the same amount of data
When the query result is below 3W, the ES performance is excellent; when the query result is above 5W, the CK multi-threading performance is excellent

ES+Hbase combined query solution

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00