Term Frequency (abbreviated TF), which measures how frequently an expression occurs in a document or other body of text, and Inverse Document Frequency (IDF), which measures whether a term is common or rare in a given document or corpus, are common concepts in the text analysis and text mining fields. The two are often used together.
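As a quick illustration of how the two combine, here is a minimal Python sketch using one common TF-IDF variant (raw term count multiplied by the log of the inverse document frequency); note that this toy corpus is invented for illustration, and many other normalization schemes exist:

```python
import math

# A toy corpus: each document is a list of terms (lexemes).
docs = [
    ["postgres", "full", "text", "search", "postgres"],
    ["postgres", "madlib", "analytics"],
    ["image", "search", "service"],
]

def tf(term, doc):
    # Raw term frequency: number of occurrences in the document.
    return doc.count(term)

def idf(term, docs):
    # Inverse document frequency: log of the total document count
    # over the number of documents containing the term.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

# "postgres" occurs twice in doc 0 and in 2 of the 3 documents.
print(tf("postgres", docs[0]))           # 2
print(round(idf("postgres", docs), 4))   # 0.4055
print(tf("postgres", docs[0]) * idf("postgres", docs))  # combined TF-IDF score
```

A term that appears in every document gets an IDF of log(1) = 0, which is exactly why stop words score poorly under TF-IDF.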
Previously, I wrote about how PostgreSQL, combined with cosine similarity and other linear correlation algorithms frequently used in the text, image, and array similarity fields, can serve as a basis for text and keyword analysis, and more specifically for term frequency and inverse document frequency.
PostgreSQL supports full-text search and provides the text vector type, tsvector.
Now you may be asking: how do you find terms or analyze term frequency in a pile of texts using PostgreSQL? In this article, we will look at how you can use PostgreSQL's built-in capabilities and MADlib for term frequency and inverse document frequency analysis.
The first method that we will cover in this blog uses the built-in PostgreSQL function ts_stat, which generates lexeme statistics. For example, in a Q&A knowledge base, you could find which terms appear most often and in how many texts each one appears.
ts_stat is described in detail in the PostgreSQL full-text search documentation.
Now that we're all on the same page, let's move on to gathering some document statistics. The ts_stat function is useful for checking your configuration and for finding stop-word candidates, which are words that are extremely frequent in a text or in the language in general (think 'the', 'of', and 'one', for example).
ts_stat(sqlquery text, [ weights text, ] OUT word text, OUT ndoc integer, OUT nentry integer) returns setof record
In the above code, sqlquery is a text value containing an SQL query that must return a single tsvector column. ts_stat executes the query and returns statistics about each distinct lexeme (or word) contained in the tsvector data. The columns returned are:

word text: the value of a lexeme (that is, a term or word).
ndoc integer: the number of documents (tsvectors) the word occurred in.
nentry integer: the total number of occurrences of the word.
If weights are supplied, only occurrences having one of those weights are counted. For example, to find the ten most frequent words (excluding your stop-word candidates) in a document collection, you can use this query:
SELECT * FROM ts_stat('SELECT vector FROM apod') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
This next one is the same, but counts only word occurrences with weight A or B:
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
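To make the ndoc/nentry semantics concrete without a running database, here is a small Python model (not PostgreSQL itself) where each tsvector is approximated by a mapping from lexeme to occurrence count; the sample words are invented:

```python
from collections import Counter

# Each "tsvector" is modeled as a Counter mapping lexeme -> occurrence count.
vectors = [
    Counter({"the": 3, "star": 1}),
    Counter({"the": 1, "nebula": 2}),
    Counter({"star": 2}),
]

ndoc = Counter()    # in how many documents each word occurs
nentry = Counter()  # total occurrences of each word
for vec in vectors:
    for word, count in vec.items():
        ndoc[word] += 1
        nentry[word] += count

# Order like the SQL example: nentry DESC, ndoc DESC, word ASC.
stats = sorted(ndoc, key=lambda w: (-nentry[w], -ndoc[w], w))
print([(w, ndoc[w], nentry[w]) for w in stats])
# [('the', 2, 4), ('star', 2, 3), ('nebula', 1, 2)]
```

A word that appears many times in a single document can have a high nentry but a low ndoc, which is why the SQL example sorts on both.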
To test out this method, you'll want to first create a function to generate random strings:
create or replace function gen_rand_str(int) returns text as $$
  select substring(md5(random()::text), 4, $1);
$$ language sql strict stable;
Then, you'll want to create a function to generate a tsvector of several random terms:
create or replace function gen_rand_tsvector(int,int) returns tsvector as $$
  select array_to_tsvector(array_agg(gen_rand_str($1))) from generate_series(1,$2);
$$ language sql strict;

postgres=# select gen_rand_tsvector(4,10);
                           gen_rand_tsvector
-----------------------------------------------------------------------
 '21eb' '2c9c' '4406' '5d9c' '9ac4' 'a27b' 'ab13' 'ba77' 'e3f2' 'f198'
(1 row)
Next, create a test table and write test data.
postgres=# create table ts_test(id int, info tsvector);
CREATE TABLE
postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(4,10);
INSERT 0 100000
Following this, check the term frequency to see how many times each term has appeared and in how many texts (records) it has appeared.
postgres=# SELECT * FROM ts_stat('SELECT info FROM ts_test') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
 word | ndoc | nentry
------+------+--------
 e4e6 |   39 |     39
 9596 |   36 |     36
 a84c |   35 |     35
 2b44 |   32 |     32
 5146 |   32 |     32
 92f6 |   32 |     32
 cd56 |   32 |     32
 fd00 |   32 |     32
 4258 |   31 |     31
 5f18 |   31 |     31
(10 rows)
Next, write another batch of test data, and again check the term frequency to see how many times each term has appeared and in how many texts (records) it has appeared.
postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(2,10);
INSERT 0 100000
postgres=# SELECT * FROM ts_stat('SELECT info FROM ts_test') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
 word | ndoc | nentry
------+------+--------
 30   | 4020 |   4020
 a7   | 4005 |   4005
 20   | 3985 |   3985
 c5   | 3980 |   3980
 e6   | 3970 |   3970
 f1   | 3965 |   3965
 70   | 3948 |   3948
 5e   | 3943 |   3943
 e4   | 3937 |   3937
 2b   | 3934 |   3934
(10 rows)
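One thing worth noting about these outputs: ndoc and nentry are always equal here, because array_to_tsvector stores only distinct lexemes without position information, so a repeated lexeme inside one generated document collapses into a single entry. A rough Python model of that deduplication (not the actual PostgreSQL internals):

```python
# array_to_tsvector keeps each distinct lexeme once, sorted, with no
# positions -- so a string generated twice within one "document"
# becomes a single entry, and ts_stat counts it once per document.
words = ["2b", "a7", "2b"]       # a duplicate inside one document
tsvector = sorted(set(words))    # distinct, sorted lexemes
print(tsvector)  # ['2b', 'a7']
```

With real documents parsed via to_tsvector, positions are retained and nentry can exceed ndoc.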
Next, we will look at how you can use MADlib. MADlib also provides training functions for term frequency statistics, which you can find in the MADlib documentation. You'll need to understand these functions to be able to use this method.
When it comes to term frequency, tf(t,d) is the raw frequency of a word/term in a document, that is, the number of times that word or term t occurs in document d. Note that for this function, 'word' and 'term' are used interchangeably, as both can be understood as lexemes. Also note that the term frequency is not normalized by the document length.
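To make the "raw frequency" point concrete, tf(t,d) is just a count, which a one-line Python sketch shows (the sample sentence is invented):

```python
# tf(t, d) as a raw, unnormalized count: the score grows with the
# number of occurrences of t and ignores the document's length.
doc = "the quick brown fox jumps over the lazy dog the end".split()
print(doc.count("the"))  # 3
```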
This is the main function that we need to use from MADlib:
term_frequency(input_table, doc_id_col, word_col, output_table, compute_vocab)
To understand the code above, let's look at the arguments in detail.
input_table: The name of the table storing the documents. Each row is in the form <doc_id, word_vector>, where doc_id is an id unique to each document, and word_vector is a text array containing the words in the document. The word_vector should contain multiple entries of a word if the document contains multiple occurrences of that word.
doc_id_col: The name of the column containing the document id.
word_col: The name of the column containing the vector of words or terms (lexemes) in the document. This column should be of a type that can be cast to text[].
output_table: The name of the table to store the term frequency output. The output table contains the following columns:
id_col: This is the document id column (same as the one provided as input).
word: A word or term (that is, a lexeme) present in a document. This is either the original word present in word_col or an id representing the word (depending on the value of compute_vocab).
count: The number of times this word is found in the document.
compute_vocab: This argument is optional and is set to FALSE by default. Set this flag to indicate whether a vocabulary is to be created. If TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word. The table is called output_table_vocabulary (the suffix _vocabulary is added to the output_table name) and contains the following columns:
wordid: An id assigned to each word.
word: The word/term
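Since running term_frequency requires a MADlib installation, here is a small Python sketch (not MADlib itself) of the shape of the output table it produces with compute_vocab=FALSE; the documents and ids are invented for illustration:

```python
from collections import Counter

# Input table rows: (doc_id, word_vector), mirroring input_table.
documents = [
    (1, ["this", "is", "one", "document", "in", "the", "corpus"]),
    (2, ["i", "am", "the", "second", "document", "in", "the", "corpus"]),
]

# Emulated output table: one (doc_id, word, count) row per distinct
# word per document, which is what term_frequency materializes.
output = []
for doc_id, words in documents:
    for word, count in sorted(Counter(words).items()):
        output.append((doc_id, word, count))

counts = {(d, w): c for d, w, c in output}
print(counts[(2, "the")])  # 2: "the" occurs twice in document 2
```

With compute_vocab=TRUE, the word column would instead hold an integer wordid referencing the generated vocabulary table.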
digoal - September 12, 2019