Characteristics of searching for test questions in online education scenarios
A question library can hold a very large number of test questions, and the number keeps growing. This puts heavy load on the databases.
Most searches occur during peak hours, when many searches run concurrently. In this case, search results may be returned with high latency, which degrades the user experience.
Different stages of learning are covered, so more and more user scenarios are involved.
Subject disciplines are classified into various categories and the data becomes more and more complex. Therefore, a search query may return test questions from the wrong discipline.
Powerful algorithms are required to improve search accuracy.
Multimodal searching capabilities are required to search images and text.
Multilingual processing capabilities are required to process search queries in multiple languages such as English.
Best practices of OpenSearch in the education industry
Exclusive analyzer for queries for test questions
Query processing flowchart
2. Understanding of query semantics
An analyzer is the most basic module that affects search quality. OpenSearch provides an exclusive analyzer for test question queries. You can also upload your own query terms to create a custom analyzer.
Examples
Query
What is the area of the following triangle in square centimeters?
Spelling correction
What is the area of the following triangle in square centimeters?
Discipline category prediction
Mathematics
Tokenization
What is the area of the following triangle in square centimeters?
Term weight analysis
1 7 1 7 1 4 7 7 1
Synonym rewriting
square centimeters -> (cm ^ 2)
Text vectorization
-0.100582,-0.0540699,-0.0417337,0.0602,...
3. Category prediction
What is category prediction?
After you enter a search query, multiple commodities are found. The system calculates the relevance between the search query and the category of each commodity. If the sort expression references this relevance, a commodity whose category is more relevant to the search query receives a higher sort score and therefore ranks higher.
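The effect of category relevance on ranking can be sketched in a few lines of Python. This is a minimal illustration, not the OpenSearch sort expression syntax: the `Doc` type, the `rank_with_category` function, the `boost` parameter, and all scores are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    category: str
    text_score: float  # hypothetical text relevance score from retrieval


def rank_with_category(docs, category_relevance, boost=1.0):
    """Sort docs by text score plus a boost proportional to the
    predicted relevance of each doc's category to the query."""
    def sort_score(doc):
        return doc.text_score + boost * category_relevance.get(doc.category, 0.0)

    return sorted(docs, key=sort_score, reverse=True)


docs = [
    Doc("q1", "Chinese", 0.62),
    Doc("q2", "Mathematics", 0.60),
    Doc("q3", "Physics", 0.55),
]

# Hypothetical predicted relevance of each category to a triangle-area query.
category_relevance = {"Mathematics": 0.9, "Physics": 0.3, "Chinese": 0.0}

ranked = rank_with_category(docs, category_relevance, boost=0.5)
print([d.doc_id for d in ranked])
```

In this sketch, the mathematics question moves to the top even though its text score alone is not the highest, because its category is the most relevant to the query.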
Application of category prediction in online education scenarios
Predict the discipline and question type to which a test question belongs based on the image information in the query and the result of optical character recognition (OCR).
Predict the types of fields such as the question description and options.
4. Term weight analysis
Description: The term weight analysis feature evaluates the importance of each term in a search query and quantifies that importance as a weight. If a query contains low-importance terms and these terms take part in document retrieval, only a small number of documents may match the query. OpenSearch can therefore leave low-importance terms out of retrieval, which increases the number of documents that are retrieved.
Purposes: Remove low-importance terms from a query, rewrite a query, and analyze text relevance.
(1) Generate training data based on user behavior.
(2) Train models for term weight analysis.
Model: a sequence labeling model.
Prediction labels: 7, 4, and 1. A higher score indicates that a term is more important and that retrieving documents by that term yields more accurate results.
Examples
Query | The factors of 35 are () and the multiples of 24 within 100 are () |
Corresponding term weight scores | 4 1 7 1 1 1 1 1 1 4 1 7 1 1 1 |
In this question, "factors" and "multiples" have the highest weight score, 7 points, so OpenSearch preferentially uses them to retrieve documents. The weight scores of "35" and "24" are 4 points. All other elements in the question score 1 point, and OpenSearch does not use them to retrieve documents.
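The selection of retrieval terms by weight can be sketched as follows. This is only an illustration under stated assumptions: the token split of the query is a hypothetical alignment made up for the example, and the `select_retrieval_terms` function and its `min_weight` threshold are not OpenSearch APIs.

```python
HIGH, MEDIUM, LOW = 7, 4, 1  # the three prediction labels


def select_retrieval_terms(tokens, weights, min_weight=MEDIUM):
    """Keep tokens whose weight is at least min_weight,
    ordered from most to least important."""
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    kept = [(t, w) for t, w in zip(tokens, weights) if w >= min_weight]
    # list.sort is stable, so tokens with equal weight keep their query order.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [t for t, _ in kept]


# Hypothetical token alignment for the factors-and-multiples query.
tokens = ["35", "of", "factors", "are", "(", ")", ",", "100",
          "within", "24", "of", "multiples", "are", "(", ")"]
weights = [4, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, 7, 1, 1, 1]

terms = select_retrieval_terms(tokens, weights)
print(terms)  # ['factors', 'multiples', '35', '24']
```

Tokens with a weight of 1, such as punctuation and function words, are dropped, so they cannot narrow the set of retrieved documents.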
5. Query rewriting
To meet different business requirements, OpenSearch lets you combine multiple interventions, such as intervention dictionaries, spelling correction, synonyms, and term weight analysis.
Examples
(1) The OCR feature may identify some non-question elements, which interfere with the results of query analysis. In this case, you can use term weight analysis to ensure that non-question elements are labeled with a low weight. This improves both retrieval and sorting.
(2) You can create an intervention dictionary for synonym configuration to expand the retrieval scope. For example, if a query contains cubic meters, you can add tons as the synonym of cubic meters.
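Synonym-based expansion of a query can be sketched as below. This is a simplified illustration, not the OpenSearch intervention dictionary format: the `expand_with_synonyms` function and the dictionary layout are assumptions, and the sample query is made up.

```python
def expand_with_synonyms(query, synonyms):
    """Return the normalized query plus one variant per synonym
    of every dictionary phrase that the query contains."""
    normalized = query.lower()
    variants = [normalized]
    for phrase, alternatives in synonyms.items():
        if phrase in normalized:
            for alt in alternatives:
                variants.append(normalized.replace(phrase, alt))
    return variants


# Hypothetical intervention dictionary: phrase -> list of synonyms.
synonyms = {
    "cubic meters": ["tons"],
    "square centimeters": ["cm^2"],
}

variants = expand_with_synonyms("How many cubic meters of water are used?", synonyms)
for v in variants:
    print(v)
```

Retrieving with every variant, rather than with the original query alone, widens the retrieval scope to questions that use the synonym.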
Custom sort
OpenSearch supports rough sort and fine sort. In the rough sort process, the top N high-quality documents are selected from all documents that are retrieved. Then, the top N high-quality documents are scored and sorted in the fine sort process. This way, you can obtain the documents that best match your requirements. To achieve a finer-grained sorting effect, you can write sort expressions and use them for applications to control the sorting of search results.
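The two-phase process can be sketched in Python. This is a minimal sketch of the general technique, not the OpenSearch sort expression syntax: the `two_phase_sort` function, the scoring functions, and the document fields are hypothetical.

```python
import heapq


def two_phase_sort(docs, rough_score, fine_score, top_n):
    """Select the top_n docs with a cheap rough score,
    then order only those docs with the costlier fine score."""
    candidates = heapq.nlargest(top_n, docs, key=rough_score)
    return sorted(candidates, key=fine_score, reverse=True)


def rough(doc):
    # Cheap signal: number of query terms the document matches.
    return doc["term_hits"]


def fine(doc):
    # Costlier signal: combines term hits with a quality feature.
    return 0.5 * doc["term_hits"] + doc["quality"]


docs = [
    {"id": "a", "term_hits": 3, "quality": 0.2},
    {"id": "b", "term_hits": 2, "quality": 0.9},
    {"id": "c", "term_hits": 1, "quality": 1.0},
    {"id": "d", "term_hits": 3, "quality": 0.7},
]

result = two_phase_sort(docs, rough, fine, top_n=3)
print([d["id"] for d in result])
```

Document `c` never reaches the fine sort because the rough sort filters it out, which is the trade-off of this design: the expensive scoring runs on only N documents instead of on every retrieved document.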
Effect comparison
An online education platform offers K12 education solutions. The platform has tens of millions of users. Its question libraries include about 80 million test questions, and the number keeps growing. The question libraries consist of two parts: the platform's own question libraries and third-party question libraries. Before adopting OpenSearch, the platform implemented its photo search feature by using OCR and its own Elasticsearch-based search service. However, the platform faced problems such as low accuracy of search results and high search latency.
After the platform adopted OpenSearch to implement its search feature:
Search accuracy increased by 5 percentage points in absolute terms.
Search latency dropped to 50 ms. The original latency ranged from 100 ms to 300 ms.
Data can be synchronized from an offline application to OpenSearch at a throughput greater than 4,000 transactions per second (TPS).
Sample query: "Zhang Huiyan says that the style of Song poetry in the Song dynasty is probably similar to Yuefu."
Level | Results that are retrieved before OpenSearch is used | Results that are retrieved after OpenSearch is used |
Top1 | Zhang Hui is a solo singer of a song and dance troupe. Her wage is CNY 5,800 per month. In June 2006, Zhang Hui participated in three performances of the troupe in Shanghai and received a reward of CNY 3,800... | Zhang Huiyan says that the style of Song poetry in the Song dynasty is probably similar to Yuefu. |
Top2 | Zhang Huiyan's love for music comes from... | Zhang Huiyan says that the style of Song poetry in the Song dynasty is probably similar to Yuefu. () |
Top3 | Among the following documents, which one is the document that is cited in an article published by Ms. Zhang Hui in music periodicals of China? | Among the following options, which one is probably similar to the style of Song poetry in the Song dynasty that Zhang Huiyan said? |
Sample query: "A geometrical body consists of some identical small cubes. ___ identical small cubes are required to build the geometrical body."