
DashVector: Advanced use

Last Updated: Dec 02, 2024

Background

BM25 overview

Best Match 25 (BM25) is a ranking function that is widely used in information retrieval to score and rank documents for a given query. In essence, BM25 calculates the relevance of the document to each term in the query and sums the relevance values weighted by the term weights. BM25 calculates the relevance score by using the following formula:

score(q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)

where q and d indicate the query and the document, respectively; q_i indicates the i-th term in q; R(q_i, d) indicates the relevance of term q_i to document d; W_i indicates the weight of term q_i; and score(q, d) indicates the relevance score of q to d. The greater the score, the higher the relevance. W_i and R(q_i, d) can be expressed by using the following formulas:

W_i = \ln\left(\frac{N - N(q_i) + 0.5}{N(q_i) + 0.5} + 1\right)

R(q_i, d) = \frac{tf(q_i, d) \cdot (k_1 + 1)}{tf(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{L_d}{L_{avg}}\right)}

where N indicates the total number of documents; N(q_i) indicates the number of documents that contain term q_i; tf(q_i, d) indicates the frequency of term q_i in document d; L_d indicates the length of document d; L_{avg} indicates the average length of all documents; and k_1 and b are hyperparameters that control the impact of tf(q_i, d) and L_d on the score, respectively.
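To make the formulas concrete, the following minimal Python sketch computes BM25 scores for a toy corpus. The corpus, query, and hyperparameter values are illustrative assumptions, not DashText defaults.

import math
from collections import Counter

# Toy corpus and query; all values here are illustrative only.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "quick foxes jump over lazy dogs".split(),
]
query = "quick fox".split()

k1, b = 1.5, 0.75                                  # hyperparameters
N = len(corpus)                                    # total number of documents
L_avg = sum(len(d) for d in corpus) / N            # average document length

def weight(term):
    # W_i: weight of the term, based on how many documents contain it.
    n = sum(1 for d in corpus if term in d)        # N(q_i)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def score(query, doc):
    tf = Counter(doc)                              # term frequencies in the document
    L_d = len(doc)
    s = 0.0
    for term in query:
        r = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * L_d / L_avg))
        s += weight(term) * r                      # sum of W_i * R(q_i, d)
    return s

for doc in corpus:
    print(" ".join(doc), "->", round(score(query, doc), 4))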

Generate sparse vectors

To facilitate BM25 score calculation, the document and the query are encoded into sparse vectors separately, and their relevance is calculated as the dot product of the two vectors. In other words, the BM25 score is split into a document part and a query part, which can be generated by using the encode_documents and encode_queries APIs of DashText, respectively. The following figure shows the generation logic.

(Figure: generation logic of the document and query sparse vectors)

The generated sparse vectors can be expressed as follows.

d = \left\{\, t : \frac{tf(t, d) \cdot (k_1 + 1)}{tf(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{L_d}{L_{avg}}\right)} \;\middle|\; t \in d \,\right\}

q = \left\{\, t : W_t \;\middle|\; t \in q \,\right\}

where each term t of the document is mapped to the term-frequency part of the BM25 formula, and each term t of the query is mapped to its weight W_t.
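For example, the following sketch (SDK for Python) trains an encoder on a tiny corpus and prints the two sparse vectors. The keys are hash values of terms and the values are the weights described above; the corpus here is a placeholder.

from dashtext import SparseVectorEncoder

encoder = SparseVectorEncoder()
# Train on a tiny corpus so that the encoder has corpus statistics to work with.
encoder.train([
    "The quick brown fox",
    "Never jump over the lazy dog quickly",
])

# encode_documents produces the document-side sparse vector (term hash -> weight);
# encode_queries produces the query-side sparse vector (term hash -> term weight W).
doc_vector = encoder.encode_documents("The quick brown fox")
query_vector = encoder.encode_queries("quick fox")
print(doc_vector)
print(query_vector)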

Calculate the score or distance

After the sparse vectors d and q are generated, their distance is calculated as their dot product, that is, the sum of the products of the values of the terms that appear in both vectors:

score(q, d) = q \cdot d = \sum_{t \in q \cap d} q_t \cdot d_t

The greater the score, the higher the relevance. If both dense and sparse vectors are used, only the dot product metric is supported in distance calculation.
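As a minimal sketch, the dot product over the shared terms of two sparse vectors (represented as dictionaries) can be computed as follows; the example vectors are made-up values.

def dot_product(query_vector: dict, document_vector: dict) -> float:
    # Sum the products of the values of terms that appear in both vectors.
    return sum(value * document_vector[key]
               for key, value in query_vector.items()
               if key in document_vector)

# Made-up sparse vectors; the score is 1.2 * 0.8 = 0.96.
print(dot_product({101: 0.5, 202: 1.2}, {202: 0.8, 303: 0.3}))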

Train a custom model

The built-in BM25 model is trained on a general-purpose corpus. For better results in domain-specific scenarios, we recommend that you train a custom BM25 model. You can perform the following steps to train a custom model by using DashText:

Step 1. Determine the scenario

To search for documents by using sparse vectors, first determine where the queries and documents in your scenario come from. In general, you need to prepare a certain number of documents for your corpus, and the documents must be related to your business scenario.

Step 2. Prepare the corpus

The corpus determines the parameters used in the BM25 model. We recommend that you prepare a corpus based on the following principles:

  • The source of the corpus should reflect the characteristics of the scenario, so that N(q_i) reflects the real-world document frequency of each term.

  • Adjust the size and number of chunks in the corpus. The corpus must contain a considerable number of long texts, so that L_{avg} reflects realistic document lengths.

In general, you can directly use the documents prepared in Step 1 to build a corpus. If the documents are long, you can split them into chunks first, as shown in the following sketch.
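The following is a minimal chunking sketch, assuming plain-text documents and a naive fixed-length split; in practice, choose chunk boundaries (sentences, paragraphs) and sizes that match your scenario.

from typing import List

def chunk(text: str, max_len: int = 500) -> List[str]:
    # Naive fixed-length chunking; adjust max_len and boundaries to your scenario.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

documents = ["<long business document 1>", "<long business document 2>"]  # placeholders
corpus: List[str] = [piece for doc in documents for piece in chunk(doc)]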

Step 3. Prepare a tokenizer

The tokenizer determines how text is segmented into terms, which directly affects sparse vector generation. You can use a custom tokenizer to get better results in your specific scenario. DashText allows you to customize the tokenizer by using the following two methods:

  • Method 1: Pass a custom vocabulary to the Jieba tokenizer built into DashText. This method is not supported by the SDK for Java.

from dashtext import TextTokenizer, SparseVectorEncoder

# Load the built-in Jieba tokenizer with a custom vocabulary file (dict.txt).
my_tokenizer = TextTokenizer.from_pretrained(model_name='Jieba', dict='dict.txt')
# Use the custom tokenizer when creating the sparse vector encoder.
my_encoder = SparseVectorEncoder(tokenize_function=my_tokenizer.tokenize)
  • Method 2: Provide your own tokenizer: a function of the Callable[[str], List[str]] type in the SDK for Python, or an implementation of the BaseTokenizer interface in the SDK for Java.

from dashtext import SparseVectorEncoder
from transformers import BertTokenizer

# Any callable that maps a string to a list of tokens works; this example uses a BERT tokenizer.
my_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
my_encoder = SparseVectorEncoder(tokenize_function=my_tokenizer.tokenize)

The following sample code is for the SDK for Java:

import com.aliyun.dashtext.common.DashTextException;
import com.aliyun.dashtext.common.ErrorCode;
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
import com.aliyun.dashtext.tokenizer.BaseTokenizer;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Main {
    public static class MyTokenizer implements BaseTokenizer {
        @Override
        public List<String> tokenize(String s) throws DashTextException {
            if (s == null) {
                throw new DashTextException(ErrorCode.INVALID_ARGUMENT);
            }

            // Use regular expressions to split text by white spaces and punctuation marks, and convert it to lowercase.
            return Arrays.stream(s.split("\\s+|(?<!\\d)[.,](?!\\d)"))
                    .map(String::toLowerCase)
                    .filter(token -> !token.isEmpty())   // Filter out empty strings.
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        SparseVectorEncoder encoder = new SparseVectorEncoder(new MyTokenizer());
    }
}

Step 4. Train the model

In essence, model training is a process of collecting parameters from the corpus. Training a custom model involves a large number of tokenization and hashing operations, which may be time-consuming. DashText provides the SparseVectorEncoder.train API for model training.
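A minimal sketch of a training run in the SDK for Python; the corpus here is a placeholder, and the full sample code appears later in this topic.

from dashtext import SparseVectorEncoder

encoder = SparseVectorEncoder()
# train() tokenizes and hashes the corpus and collects the statistics
# (document frequencies, average document length) used by the BM25 model.
encoder.train([
    "A fox is quick and jumps over dogs",
    "Dogs are domestic animals",
])
encoder.dump("./model.json")   # persist the trained model for later loading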

Step 5. (Optional) Fine-tune the parameters

After the model is trained, prepare a validation dataset and fine-tune k1 and b for best recall. We recommend that you fine-tune k1 and b based on the following principles:

  • k1 (typically 1.2 < k1 < 2) controls the impact of the term frequency tf(q_i, d) on the score. The larger the k1 value, the more slowly the contribution of term frequency saturates, and the greater its impact on the score.

  • b (0 < b < 1) controls the impact of document length on the score. The larger the b, the greater the impact.

In general, you do not need to fine-tune k1 or b.
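If you do adjust them, k1 and b are attributes of the encoder in the SDK for Python (setK1 and setB in the SDK for Java), as shown in the sample code below. The following is a minimal sketch of a parameter sweep over a hypothetical validation query; the corpus and candidate values are placeholders.

from dashtext import SparseVectorEncoder

encoder = SparseVectorEncoder()
encoder.train(["A fox is quick and jumps over dogs", "Dogs are domestic animals"])

validation_query = "quick brown fox"                   # hypothetical validation query
for k1, b in [(1.2, 0.5), (1.5, 0.75), (2.0, 0.9)]:    # candidate hyperparameters
    encoder.k1 = k1
    encoder.b = b
    query_vector = encoder.encode_queries(validation_query)
    # Score your validation documents with query_vector and measure recall here.
    print(k1, b, query_vector)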

Step 6. (Optional) Fine-tune the model

In real-world cases, you may need to add data to the corpus and incrementally update the parameters of the BM25 model. The SparseVectorEncoder.train API of DashText natively supports incremental updates. Note that after an update, sparse vectors encoded by the previous model become invalid and must be regenerated.
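A minimal sketch of an incremental update in the SDK for Python, assuming a previously saved model at ./model.json; the extra corpus and existing documents are placeholders.

from dashtext import SparseVectorEncoder

encoder = SparseVectorEncoder()
encoder.load("./model.json")                         # previously trained model

extra_corpus = ["new domain document 1", "new domain document 2"]     # placeholder data
encoder.train(extra_corpus)                          # incrementally update the statistics

# The statistics changed, so sparse vectors encoded by the old model are stale:
# re-encode (and re-upsert) your existing documents.
existing_documents = ["A fox is quick and jumps over dogs"]           # placeholder data
refreshed_vectors = [encoder.encode_documents(doc) for doc in existing_documents]
encoder.dump("./new_model.json")                     # save the updated model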

Sample code

Here is an example of training a custom model:

from dashtext import SparseVectorEncoder
from pydantic import BaseModel
from typing import Dict, List


class Result(BaseModel):
    doc: str
    score: float


def calculate_score(query_vector: Dict[int, float], document_vector: Dict[int, float]) -> float:
    score = 0.0
    for key, value in query_vector.items():
        if key in document_vector:
            score += value * document_vector[key]
    return score


# Create an empty SparseVectorEncoder. You can optionally pass in a custom tokenizer.
encoder = SparseVectorEncoder()


# Step 1. Prepare the corpus and documents
corpus_document: List[str] = [
    "The quick brown fox rapidly and agilely leaps over the lazy dog that lies idly by the roadside.",
    "Never jump over the lazy dog quickly",
    "A fox is quick and jumps over dogs",
    "The quick brown fox",
    "Dogs are domestic animals",
    "Some dog breeds are quick and jump high",
    "Foxes are wild animals and often have a brown coat",
]


# Step 2. Train the BM25 model
encoder.train(corpus_document)


# Step 3. Fine-tune the parameters
query: str = "quick brown fox"
print(f"query: {query}")
k1s = [1.0, 1.5]
bs = [0.5, 0.75]
for k1, b in zip(k1s, bs):
    print(f"current k1: {k1}, b: {b}")
    encoder.b = b
    encoder.k1 = k1
    query_vector = encoder.encode_queries(query)
    results: List[Result] = []
    for idx, doc in enumerate(corpus_document):
        doc_vector = encoder.encode_documents(doc)
        score = calculate_score(query_vector, doc_vector)
        results.append(Result(doc=doc, score=score))
    results.sort(key=lambda r: r.score, reverse=True)

    for result in results:
        print(result)


# Step 4. Select the optimal parameters and save the model
encoder.b = 0.75
encoder.k1 = 1.5
encoder.dump("./model.json")


# Step 5. Load the model for subsequent use
new_encoder = SparseVectorEncoder()
bm25_model_path = "./model.json"
new_encoder.load(bm25_model_path)


# Step 6. Fine-tune and save the model
extra_corpus: List[str] = [
    "The fast fox jumps over the lazy, chubby dog",
    "A swift fox hops over a napping old dog",
    "The quick fox leaps over the sleepy, plump dog",
    "The agile fox jumps over the dozing, heavy-set dog",
    "A speedy fox vaults over a lazy, old dog lying in the sun"
]

new_encoder.train(extra_corpus)
new_bm25_model_path = "new_model.json"
new_encoder.dump(new_bm25_model_path)

The following sample code is for the SDK for Java:

import com.aliyun.dashtext.encoder.SparseVectorEncoder;

import java.io.*;
import java.util.*;

public class Main {

    public static class Result {
        public String doc;
        public float score;

        public Result(String doc, float score) {
            this.doc = doc;
            this.score = score;
        }

        @Override
        public String toString() {
            return String.format("Result(doc=%s, score=%f)", doc, score);
        }
    }

    public static float calculateScore(Map<Long, Float> queryVector, Map<Long, Float> documentVector) {
        float score = 0.0f;
        for (Map.Entry<Long, Float> entry : queryVector.entrySet()) {
            if (documentVector.containsKey(entry.getKey())) {
                score += entry.getValue() * documentVector.get(entry.getKey());
            }
        }
        return score;
    }

    public static void main(String[] args) throws IOException {
        // Create an empty SparseVectorEncoder. You can optionally pass in a custom tokenizer.
        SparseVectorEncoder encoder = new SparseVectorEncoder();

        // Step 1. Prepare the corpus and documents
        List<String> corpusDocument = Arrays.asList(
                "The quick brown fox rapidly and agilely leaps over the lazy dog that lies idly by the roadside.",
                "Never jump over the lazy dog quickly",
                "A fox is quick and jumps over dogs",
                "The quick brown fox",
                "Dogs are domestic animals",
                "Some dog breeds are quick and jump high",
                "Foxes are wild animals and often have a brown coat"
        );

        // Step 2. Train the BM25 model
        encoder.train(corpusDocument);

        // Step 3. Fine-tune the parameters
        String query = "quick brown fox";
        System.out.println("query: " + query);
        float[] k1s = {1.0f, 1.5f};
        float[] bs = {0.5f, 0.75f};
        for (int i = 0; i < k1s.length; i++) {
            float k1 = k1s[i];
            float b = bs[i];
            System.out.println("current k1: " + k1 + ", b: " + b);
            encoder.setB(b);
            encoder.setK1(k1);

            Map<Long, Float> queryVector = encoder.encodeQueries(query);
            List<Result> results = new ArrayList<>();
            for (String doc : corpusDocument) {
                Map<Long, Float> docVector = encoder.encodeDocuments(doc);
                float score = calculateScore(queryVector, docVector);
                results.add(new Result(doc, score));
            }

            results.sort((r1, r2) -> Float.compare(r2.score, r1.score));

            for (Result result : results) {
                System.out.println(result);
            }
        }

        // Step 4. Select the optimal parameters and save the model
        encoder.setB(0.75f);
        encoder.setK1(1.5f);
        encoder.dump("./model.json");

        // Step 5. Load the model for subsequent use
        SparseVectorEncoder newEncoder = new SparseVectorEncoder();
        newEncoder.load("./model.json");

        // Step 6. Fine-tune and save the model
        List<String> extraCorpus = Arrays.asList(
                "The fast fox jumps over the lazy, chubby dog",
                "A swift fox hops over a napping old dog",
                "The quick fox leaps over the sleepy, plump dog",
                "The agile fox jumps over the dozing, heavy-set dog",
                "A speedy fox vaults over a lazy, old dog lying in the sun"
        );
        newEncoder.train(extraCorpus);
        newEncoder.dump("./new_model.json");
    }
}

API reference

For more information about the DashText API, visit https://pypi.org/project/dashtext/.