Term Frequency (abbreviated TF)-how frequently an expression occurs in a document or other body of text-and Inverse Document Frequency (IDF)-a measure for determining whether a term is common or rare in a given document or corpus-are common terms in the text analysis and text mining fields. The two of these are often used in cooperation with each other.
Previously, I wrote about how the capabilities of PostgreSQL when combined with the cosine and other linear correlation algorithms frequently used in text, image, and array similarity fields can serve as a basis of text and keyword analysis, and more specifically term frequency and inverse document frequency.
PostgreSQL supports the full-text search, and the text vector type, tsvector.
Now you may be asking how do you find terms or analyze term frequency in a pile of texts using PostgreSQL? Well, in this article, we will look at how you can use
MADlib for term frequency and Inverse document frequency analysis.
The first method that we will cover in this blog comes from the built-in function of PostgreSQL, ts_stat, which is used to generate lexeme statistics. For example, in a Q&A knowledge base, you could look into which particular term appears most and in how many texts does it appear?
ts_stat is described in detail in these following documents:
Now that we're all on the same page, let's move on to gathering some document statistics. The function
ts_stat is useful for checking your configuration and for finding stop-word candidates, which are words that extremely frequent in a text or just in general-think 'the', 'of' and 'one', for example.
ts_stat(sqlquery text, [ weights text, ] OUT word text, OUT ndoc integer, OUT nentry integer) returns setof record
In the above code,
sqlquery is a text value containing an SQL query that must return a single
ts_stat executes the query and returns statistics about each distinct lexeme (or word) contained in the
tsvector data. The columns returned are：
word text: the value of a lexeme (that is, a term or word).
ndoc integer: the number of documents (
tsvectors) the word occurred in.
nentry integer: the total number of occurrences of the word.
If weights are supplied, only occurrences having one of those weights are counted. For example, to find the ten most frequent words (excluding your stop-word candidates) in a document collection, you can do this method:
SELECT * FROM ts_stat('SELECT vector FROM apod') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
This next one is the same, but with counting only word occurrences with a weight A or B:
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
To test out this method, you'll want to first create a function to generate random strings;
create or replace function gen_rand_str(int) returns text as $$ select substring(md5(random()::text), 4, $1); $$ language sql strict stable;
Then, you'll want to create a function to generate several random terms.
create or replace function gen_rand_tsvector(int,int) returns tsvector as $$ select array_to_tsvector(array_agg(gen_rand_str($1))) from generate_series(1,$2); $$ language sql strict; postgres=# select gen_rand_tsvector(4,10); gen_rand_tsvector ----------------------------------------------------------------------- '21eb' '2c9c' '4406' '5d9c' '9ac4' 'a27b' 'ab13' 'ba77' 'e3f2' 'f198' (1 row)
Next, create a test table and write test data.
postgres=# create table ts_test(id int, info tsvector); CREATE TABLE postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(4,10); INSERT 0 100000
Following this, check the term frequency to see how many times the term has appeared and in how many texts (records) it has appeared.
postgres=# SELECT * FROM ts_stat('SELECT info FROM ts_test') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10; word | ndoc | nentry ------+------+-------- e4e6 | 39 | 39 9596 | 36 | 36 a84c | 35 | 35 2b44 | 32 | 32 5146 | 32 | 32 92f6 | 32 | 32 cd56 | 32 | 32 fd00 | 32 | 32 4258 | 31 | 31 5f18 | 31 | 31 (10 rows)
Next, write another batch of test data, and check the term frequency to see how many times the term has appeared and in how many texts (records) it has appeared.
postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(2,10); INSERT 0 100000 postgres=# SELECT * FROM ts_stat('SELECT info FROM ts_test') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10; word | ndoc | nentry ------+------+-------- 30 | 4020 | 4020 a7 | 4005 | 4005 20 | 3985 | 3985 c5 | 3980 | 3980 e6 | 3970 | 3970 f1 | 3965 | 3965 70 | 3948 | 3948 5e | 3943 | 3943 e4 | 3937 | 3937 2b | 3934 | 3934 (10 rows)
Next, we will look how you can use MADlib. MADlib, in reality, also provides training functions for term frequency statistics, which you can find here. You'll need to understand these functions to be able to use this method.
Next, when it comes to Term frequency,
tf(t,d) is to the raw frequency of a word/term in a document, such as the number of times that word or term
t occurs in document
d. Note that for this function, 'word' and 'term' are used interchangeably-as they can both be understood as lexemes. Also, note that the term frequency is not normalized by the document length.
This is the main part of what we need tro use MADlib.
term_frequency(input_table, doc_id_col, word_col, output_table, compute_vocab)
To understand the code above, let's look at the arguments in detail.
input_table: The name of the table storing the documents. Each row is in the form
id, unique to each document, and
word_vectoris a text array containing the words in the document. The
word_vectorshould contain multiple entries of a word if the document contains multiple occurrence of that word.
id_col: The name of the column containing the document id.
word_col: The name of the column containing the vector of words or terms (lexeme) in the document. This column should of type that can be cast to.
output_table: The name of the table to store the term frequency output. The output table contains the following columns:
id_col: This the document id column (same as the one provided as input).
word: A word or term (that is, a lexeme) present in a document. This is either the original word present in
word_color an id representing the word (depending on the value of
count: The number of times this word is found in the document.
compute_vocab: This argument is Optional, and by default it is set to
FALSE. Flag this argument to indicate if a term is to be created. If
TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word. The table is called
output_table_vocabulary (suffix added to the
output_table name) and contains the following columns:
wordid: An id assignment for each word
word: The word/term
digoal - September 12, 2019
digoal - February 5, 2020
digoal - September 12, 2019
digoal - February 5, 2020
Alibaba Clouder - December 11, 2017
Alibaba Clouder - January 9, 2018
Fully managed and less trouble database servicesLearn More
An intelligent image search service with product search and generic search features to help users resolve image search requests.Learn More
An online MPP warehousing service based on the Greenplum Database open source programLearn More
Migrating to fully managed cloud databases brings a host of benefits including scalability, reliability, and cost efficiency.Learn More
More Posts by digoal