
# Term Frequency Statistics in PostgreSQL Full-Text Searches

This article discusses how you can use PostgreSQL along with `ts_stat` or MADlib for term frequency and inverse document frequency analysis.

By Digoal.

Term frequency (abbreviated TF), that is, how frequently an expression occurs in a document or other body of text, and inverse document frequency (IDF), a measure of whether a term is common or rare in a given document or corpus, are common concepts in the text analysis and text mining fields. The two are often used together.
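To make the combination concrete, here is a small worked calculation with hypothetical numbers: suppose a term occurs 3 times in one document (tf = 3) and appears in 10 out of 1,000 documents in the corpus, so idf = ln(1000/10). PostgreSQL can evaluate the classic tf * idf product directly:

```sql
-- Hypothetical numbers: tf = 3, term present in 10 of 1,000 documents
SELECT 3 * ln(1000 / 10.0) AS tf_idf;   -- ≈ 13.8
```

Rare terms (small `ndoc`) get a large idf, so a term that is frequent in one document but rare across the corpus scores highest.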

Previously, I wrote about how PostgreSQL, combined with cosine similarity and other linear correlation algorithms frequently used in the text, image, and array similarity fields, can serve as a basis for text and keyword analysis, and more specifically for term frequency and inverse document frequency.

PostgreSQL supports full-text search and provides a text vector type, `tsvector`.

Now you may be asking: how do you find terms or analyze term frequency in a pile of text using PostgreSQL? In this article, we will look at how you can use `ts_stat` and MADlib for term frequency and inverse document frequency analysis.

## Method 1: Using ts_stat

The first method that we will cover in this blog uses PostgreSQL's built-in function `ts_stat`, which generates lexeme statistics. For example, in a Q&A knowledge base, you could find out which terms appear most often and in how many texts each of them appears.

The function `ts_stat` is described in detail in the PostgreSQL full-text search documentation.

Now that we're all on the same page, let's move on to gathering some document statistics. The function `ts_stat` is useful for checking your configuration and for finding stop-word candidates, that is, words that are extremely frequent in a particular text or in the language in general (think 'the', 'of', and 'one', for example).

```
ts_stat(sqlquery text, [ weights text, ]
        OUT word text, OUT ndoc integer,
        OUT nentry integer) returns setof record
```

In the above code, `sqlquery` is a text value containing an SQL query that must return a single `tsvector` column. `ts_stat` executes the query and returns statistics about each distinct lexeme (or word) contained in the `tsvector` data. The columns returned are:

• `word text`: the value of a lexeme (that is, a term or word).
• `ndoc integer`: the number of documents (`tsvectors`) the word occurred in.
• `nentry integer`: the total number of occurrences of the word.

If weights are supplied, only occurrences having one of those weights are counted. For example, to find the ten most frequent words (excluding your stop-word candidates) in a document collection, you can run:

```sql
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
```

This next query is the same, but it counts only word occurrences with weight A or B:

```sql
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
```

### Testing This Method

To test out this method, you'll want to first create a function that generates random strings:

```sql
create or replace function gen_rand_str(int) returns text as
$$

select substring(md5(random()::text), 4, $1);

$$
language sql strict volatile;
```

(Note that because the function calls `random()`, it should be declared `volatile` rather than `stable`, so each call produces a fresh value.)

Then, you'll want to create a function to generate several random terms.

```sql
create or replace function gen_rand_tsvector(int,int) returns tsvector as
$$

select array_to_tsvector(array_agg(gen_rand_str($1))) from generate_series(1,$2);

$$
language sql strict;
```

```
postgres=# select gen_rand_tsvector(4,10);
                           gen_rand_tsvector
-----------------------------------------------------------------------
 '21eb' '2c9c' '4406' '5d9c' '9ac4' 'a27b' 'ab13' 'ba77' 'e3f2' 'f198'
(1 row)
```

Next, create a test table and write test data.

```
postgres=# create table ts_test(id int, info tsvector);
CREATE TABLE
postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(4,10);
INSERT 0 100000
```

Following this, check the term frequency to see how many times the term has appeared and in how many texts (records) it has appeared.

```
postgres=# SELECT * FROM ts_stat('SELECT info FROM ts_test')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
 word | ndoc | nentry
------+------+--------
 e4e6 |   39 |     39
 9596 |   36 |     36
 a84c |   35 |     35
 2b44 |   32 |     32
 5146 |   32 |     32
 92f6 |   32 |     32
 cd56 |   32 |     32
 fd00 |   32 |     32
 4258 |   31 |     31
 5f18 |   31 |     31
(10 rows)
```

Next, write another batch of test data, and check the term frequency to see how many times the term has appeared and in how many texts (records) it has appeared.

```
postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(2,10);
INSERT 0 100000
postgres=# SELECT * FROM ts_stat('SELECT info FROM ts_test')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
 word | ndoc | nentry
------+------+--------
 30   | 4020 |   4020
 a7   | 4005 |   4005
 20   | 3985 |   3985
 c5   | 3980 |   3980
 e6   | 3970 |   3970
 f1   | 3965 |   3965
 70   | 3948 |   3948
 5e   | 3943 |   3943
 e4   | 3937 |   3937
 2b   | 3934 |   3934
(10 rows)
```

Next, we will look at how you can use MADlib. MADlib also provides training functions for term frequency statistics, which you can find in the MADlib documentation. You'll need to understand these functions to be able to use this method.

When it comes to term frequency, `tf(t,d)` is the raw frequency of a word/term in a document, such as the number of times that word or term `t` occurs in document `d`. Note that for this function, 'word' and 'term' are used interchangeably, as both can be understood as lexemes. Also, note that the term frequency is not normalized by the document length.

This is the main function we need to use from MADlib:

```
term_frequency(input_table,
               doc_id_col,
               word_col,
               output_table,
               compute_vocab)
```

To understand the code above, let's look at the arguments in detail.

• `input_table`: The name of the table storing the documents. Each row is in the form `<doc_id, word_vector>`, where `doc_id` is an id unique to each document and `word_vector` is a text array containing the words in the document. The `word_vector` should contain multiple entries of a word if the document contains multiple occurrences of that word.
• `doc_id_col`: The name of the column containing the document id.
• `word_col`: The name of the column containing the vector of words or terms (lexemes) in the document. This column should be of a type that can be cast to `TEXT[]`.
• `output_table`: The name of the table to store the term frequency output. The output table contains the following columns:

• `id_col`: This is the document id column (same as the one provided as input).
• `word`: A word or term (that is, a lexeme) present in a document. This is either the original word present in `word_col` or an id representing the word (depending on the value of `compute_vocab` below).
• `count`: The number of times this word is found in the document.
• `compute_vocab`: This argument is optional and defaults to `FALSE`. It is a flag indicating whether a vocabulary is to be created. If `TRUE`, an additional output table is created containing the vocabulary of all words, with an id assigned to each word. The table is called `output_table_vocabulary` (a suffix added to the `output_table` name) and contains the following columns:

• `wordid`: An id assignment for each word
• `word`: The word/term
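Putting the arguments together, a call might look like the following sketch. The `documents` table, its contents, and the `doc_tf` output name are hypothetical; the sketch assumes MADlib is installed in the `madlib` schema and that `content` is a `text[]` column of words:

```sql
-- Hypothetical input: one row per document, words as a text[] array
CREATE TABLE documents (doc_id int, content text[]);
INSERT INTO documents VALUES
    (1, '{postgres,supports,full,text,search,postgres}'),
    (2, '{madlib,adds,analytics,to,postgres}');

SELECT madlib.term_frequency('documents',   -- input_table
                             'doc_id',      -- doc_id_col
                             'content',     -- word_col
                             'doc_tf',      -- output_table
                             TRUE);         -- compute_vocab

-- With compute_vocab = TRUE, doc_tf stores word ids and counts,
-- and doc_tf_vocabulary maps each wordid back to its word.
SELECT * FROM doc_tf ORDER BY doc_id;
```

Joining `doc_tf` against `doc_tf_vocabulary` on `wordid` recovers the readable per-document term frequencies.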
