This topic explains how to use the pgml extension in AnalyticDB for PostgreSQL V7.0 to perform inference tasks, such as text embedding, sentiment analysis, and entity recognition, directly on your data within the database. This approach eliminates the need to move data out of the database, simplifying your data pipeline and enabling real-time analysis with high efficiency.
Core concepts: The two main pgml functions
The pgml extension revolves around two powerful functions. You will choose one based on the task you need to perform:
pgml.embed(): Used exclusively for text embedding tasks.
It takes a model name and text as input and returns an array of floating-point numbers (the text vector).
pgml.transform(): Used for all other types of inference tasks, such as text classification and named entity recognition.
It uses a flexible JSONB parameter to specify the task and model, making it a versatile inference interface.
If you need a model that is not on the supported list, submit a ticket to contact the development team.
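As a quick orientation, the two functions are called as follows. This is a minimal sketch, assuming the pgml extension is installed and using model names from the supported-model tables later in this topic:

```sql
-- Text embedding: returns a real[] vector.
SELECT pgml.embed('thenlper/gte-small', 'hello world');

-- All other inference tasks: returns a JSONB result.
SELECT pgml.transform(
    task   => '{"task": "text-classification",
                "model": "lxyuan/distilbert-base-multilingual-cased-sentiments-student"}'::JSONB,
    inputs => ARRAY['hello world']
);
```

The sections below describe each function and task in detail.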
Task 1: Text embedding
Text embedding converts natural language text into numerical vectors. This is fundamental for text analysis, similarity computing, machine learning tasks, and Retrieval-Augmented Generation (RAG).
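Because pgml.embed returns a real[] array, the vectors can be stored and compared directly in SQL. The following sketch ranks documents by cosine similarity to a query; the documents table is hypothetical, and the manual cosine computation can be replaced by the distance operators of a vector index extension if you use one:

```sql
-- Hypothetical table holding precomputed embeddings.
CREATE TABLE documents (id int, content text, embedding real[]);

-- Rank documents by cosine similarity to the query embedding.
WITH query AS (
    SELECT pgml.embed('thenlper/gte-small', 'What is self-attention?') AS q
)
SELECT d.id, d.content,
       (SELECT SUM(a * b) / (SQRT(SUM(a * a)) * SQRT(SUM(b * b)))
        FROM unnest(d.embedding, query.q) AS t(a, b)) AS cosine_similarity
FROM documents d, query
ORDER BY cosine_similarity DESC
LIMIT 5;
```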
Supported models
Model name | Supported language | Maximum number of tokens | Number of dimensions | Model size |
thenlper/gte-large-zh | Chinese | 512 | 1024 | 1.25 GB |
thenlper/gte-small-zh | Chinese | 512 | 512 | 0.12 GB |
thenlper/gte-large | English | 512 | 1024 | 1.25 GB |
thenlper/gte-small | English | 512 | 512 | 0.21 GB |
Alibaba-NLP/gte-Qwen2-7B-instruct | Multiple languages | 32000 | 3584 | 26.45 GB |
Alibaba-NLP/gte-Qwen2-1.5B-instruct | Multiple languages | 32000 | 1536 | 6.62 GB |
Syntax
To obtain the embedding vector for a given text, call the pgml.embed function with the model name and your text input.
pgml.embed(transformer TEXT, inputs TEXT or TEXT[])

Single inference:

CREATE FUNCTION pgml."embed"( "transformer" TEXT, "inputs" TEXT, "kwargs" jsonb DEFAULT '{}' ) RETURNS real[] IMMUTABLE STRICT PARALLEL SAFE LANGUAGE c /* Rust */;

Batch inference:

CREATE FUNCTION pgml."embed"( "transformer" TEXT, "inputs" TEXT[], "kwargs" jsonb DEFAULT '{}' ) RETURNS real[][] IMMUTABLE STRICT PARALLEL SAFE LANGUAGE c /* Rust */;
Parameters
Parameter | Description | Example |
transformer | The name of the pre-trained model that you want to use. See the Model name column in the supported models table. | 'thenlper/gte-large' |
inputs | The text (or array of texts) to perform inference on. | 'The self-attention mechanism is widely used in neural networks.' |
kwargs | Additional JSONB parameters, such as the inference device. For more information, see the open source Hugging Face models. Note: Only CPU inference is supported in the current version; GPU inference is not available. | '{"device": "cpu"}' |
Examples
Embed a single piece of text
SELECT pgml.embed('thenlper/gte-small-zh', 'The self-attention mechanism is widely used in neural networks.');

The following result is returned:
embed
------------------------------------------------------------
{0.011534364,-0.029397607, ... -0.00056651415,-0.05465962}

Embed multiple pieces of text at a time
SELECT pgml.embed('thenlper/gte-small-zh', ARRAY['The self-attention mechanism is widely used in neural networks.', 'The core idea is to allow models to consider other elements in the sequence when processing a single element']);

The following result is returned:
embed
------------------------------------------------------------
{0.011534364,-0.029397607, ... -0.00056651415,-0.05465962}
{0.024123996,0.0360483154, ... -0.029659372,-0.0198373856}

Embed existing text data in a table
For example, import the inference text data to a table named dump_table.
CREATE TABLE dump_table(id int, content text);
-- Data in the content column:
/*
id | content
----+--------------------------------------------------------
2 | Attention weights are generated by calculating the dot product of queries and keys and invoking the softmax() function. The weights indicate the attention given by the current element to other elements.
1 | The self-attention mechanism generates attention scores by calculating the similarity between elements in the sequence. The scores represent the importance of each element to other elements.
*/

Call the embed function on the content column of the dump_table table and save the results in the result_table table.
CREATE TABLE result_table AS SELECT id, content, pgml.embed('thenlper/gte-small-zh', content) AS embed FROM dump_table;
SELECT id, content, embed[1:2] FROM result_table;

The following result is returned:
id | content | embed
----+------------------------------+-------------------
2 | Attention weights are generated by calculating the dot product of queries and keys and invoking the softmax() function. The weights indicate the attention given by the current element to other elements.| {-0.036033697,..., -0.030451842}
1 | The self-attention mechanism generates attention scores by calculating the similarity between elements in the sequence. The scores represent the importance of each element to other elements. | {-0.020080239,..., -0.017561916}
(2 rows)

Task 2: Text classification
Text classification automatically assigns text to predefined categories, such as determining whether a customer review is "positive" or "negative". Pre-trained models can support a variety of data processing tasks, such as sentiment analysis, automatic data labeling, and content review.
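Because pgml.transform returns JSONB, the predicted label and its confidence score can be extracted and filtered with PostgreSQL's JSONB operators. A minimal sketch, assuming the dump_table table and model used later in this topic (the 0.9 threshold is illustrative):

```sql
-- Keep only rows classified as positive with high confidence.
SELECT id, content,
       r->0->>'label'          AS label,
       (r->0->>'score')::float AS score
FROM dump_table,
LATERAL pgml.transform(
    task   => '{"task": "text-classification",
                "model": "lxyuan/distilbert-base-multilingual-cased-sentiments-student"}'::JSONB,
    inputs => ARRAY[content]) AS r
WHERE r->0->>'label' = 'positive'
  AND (r->0->>'score')::float > 0.9;
```

The LATERAL join evaluates the model once per row and exposes the JSONB result as the column r, so the same result can be reused in the SELECT list and the WHERE clause.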
Supported models
Model name | Description | Language | Maximum number of tokens | Model size |
Alibaba-NLP/gte-multilingual-reranker-base | Scores the relevance of text pairs (reranking). | Multiple languages | 8192 | 306 MB |
lxyuan/distilbert-base-multilingual-cased-sentiments-student | Classifies the sentiment of text as positive or negative. | Multiple languages | 512 | 541 MB |
Syntax
For this task, call the pgml.transform function. The key is to construct a task JSONB object that specifies the task type as text-classification and provides the model name.
CREATE FUNCTION pgml."transform"(
"task" jsonb,
"args" jsonb DEFAULT '{}',
"inputs" TEXT[] DEFAULT ARRAY[]::TEXT[],
"cache" bool DEFAULT false
) RETURNS jsonb
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c;

Parameters
Parameter | Description | Example |
task | The pre-trained model that you want to use and the task type. Set the task type to text-classification. | { "task": "text-classification", "model": "lxyuan/distilbert-base-multilingual-cased-sentiments-student" } |
args | The additional parameters. For more information, see the open source Hugging Face models. | {"max_new_tokens": 32} |
inputs | The text (or array of texts) to perform inference on. | ARRAY['The product quality is good'] |
cache | Specifies whether to cache the loaded model so that subsequent calls can reuse it. | true |
Examples
Classify a piece of text
SELECT pgml.transform(
task => '{"task": "text-classification",
"model": "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
}'::JSONB,
inputs => ARRAY[
'The product quality is good'
]
) AS positivity;
The returned JSON shows a label of "positive" with a confidence score of 0.9924, indicating the model is highly certain this is a positive review.
positivity
-----------------------------------------------------
[{"label": "positive", "score": 0.992497742176056}]
(1 row)

Classify existing text data in a table
Read data from the dump_table table and classify each row based on the content column. If the classification result is positive, the pos column is set to true. The results are saved in the result_table table.
CREATE TABLE result_table AS
SELECT
id,
content,
pgml.transform(
task => '{
"task": "text-classification",
"model": "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
}'::JSONB,
inputs => ARRAY[content]::text[])->0->>'label' = 'positive' AS pos
FROM dump_table;

Task 3: Named entity recognition
Named Entity Recognition (NER) identifies and categorizes named entities in text, such as people (PER), locations (LOC), and organizations (ORG). This helps subsequent in-depth data processing and analysis.
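The NER result is a nested JSONB array (one entity list per input text), so it is often convenient to flatten it into rows for analysis. A minimal sketch using jsonb_array_elements, assuming the model introduced below:

```sql
-- Flatten the recognized entities of one text into rows.
SELECT e->>'word'           AS word,
       e->>'entity'         AS tag,
       (e->>'score')::float AS score
FROM jsonb_array_elements(
       pgml.transform(
           task   => '{"task": "token-classification",
                       "model": "Babelscape/wikineural-multilingual-ner"}'::JSONB,
           inputs => ARRAY['My name is Wolfgang and I live in Berlin.']
       )->0
     ) AS e;
```

The ->0 operator selects the entity list for the first (and here only) input text before jsonb_array_elements expands it.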
Supported models
Model name | Language | Description | Model size |
Babelscape/wikineural-multilingual-ner | Multiple languages | Labels entities, such as organizations, person names, and places, in the text. | 709 MB |
Syntax
You can call the transform function provided by the pgml extension to extract named entities.
CREATE FUNCTION pgml."transform"(
"task" jsonb,
"args" jsonb DEFAULT '{}',
"inputs" TEXT[] DEFAULT ARRAY[]::TEXT[],
"cache" bool DEFAULT false
) RETURNS jsonb
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c;

Parameters
Parameter | Description | Example |
task | The pre-trained model that you want to use and the task type. Set the task type to token-classification. | { "task": "token-classification", "model": "Babelscape/wikineural-multilingual-ner" } |
args | The additional parameters. For more information, see the open source Hugging Face models. | {"max_new_tokens": 32} |
inputs | The text (or array of texts) to perform inference on. | ARRAY['Li Ming arrived in Hangzhou yesterday'] |
cache | Specifies whether to cache the loaded model so that subsequent calls can reuse it. | true |
Examples
Extract entities from a piece of text
SELECT pgml.transform(
task => '{"task": "token-classification",
"model": "Babelscape/wikineural-multilingual-ner"
}'::JSONB,
inputs => ARRAY[
'Li Ming arrived in a city yesterday'
]
);

The returned JSON contains a list of entities. In this example:

"Li" is tagged as B-PER (beginning of a person entity) and "Ming" as I-PER (inside a person entity).
"Some" is tagged as B-LOC (beginning of a location entity) and "City" as I-LOC (inside a location entity).
transform
------------------------------------------------------------
[[{"end": 1, "word": "Li", "index": 1, "score": 0.9668680429458618, "start": 0, "entity": "B-PER"}, {"end": 2, "word": "Ming", "index": 2, "score": 0.9899099469184875, "start": 1, "entity": "I-PER"}, {"end": 8, "word": "Some", "index": 8, "score": 0.9998119473457336, "start": 7, "entity": "B-LOC"}, {"end": 9, "word": "City", "index": 9, "score": 0.9989301562309264, "start": 8, "entity": "I-LOC"}]]
(1 row)

Extract entities from existing text data in a table
CREATE TABLE result_table AS SELECT id, content, pgml.transform(
task => '{"task": "token-classification",
"model": "Babelscape/wikineural-multilingual-ner"
}'::JSONB,
inputs => ARRAY[
content
]
) AS RESULT FROM dump_table;
SELECT * FROM result_table;

The following result is returned:
------------------------------------------------------
1 | My name is Wolfgang and I live in Berlin. | [[{"end": 19, "word": "Wolfgang", "index": 4, "score": 0.9645113348960876, "start": 11, "entity": "B-PER"}, {"end": 40, "word": "Berlin", "index": 9, "score": 0.9998326301574708, "start": 34, "entity": "B-LOC"}]]
2 | Li Ming will go to a certain city today. | [[{"end": 1, "word": "Li", "index": 1, "score": 0.9602842330932616, "start": 0, "entity": "B-PER"}, {"end": 2, "word": "Ming", "index": 2, "score": 0.9939024448394777, "start": 1, "entity": "I-PER"}, {"end": 8, "word": "certain", "index": 8, "score": 0.9995771050453186, "start": 7, "entity": "B-LOC"}, {"end": 9, "word": "city", "index": 9, "score": 0.997289776802063, "start": 8, "entity": "I-LOC"}]]
(2 rows)
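Once the NER results are stored, they can be queried like any other JSONB column. A minimal sketch that keeps only the rows whose text mentions a location, assuming the result_table created above (the transform output was saved in a column named result):

```sql
-- Keep rows whose entity list contains at least one location entity.
SELECT id, content
FROM result_table
WHERE EXISTS (
    SELECT 1
    FROM jsonb_array_elements(result->0) AS e
    WHERE e->>'entity' = 'B-LOC'
);
```

The same pattern works for person (B-PER) or organization (B-ORG) entities by changing the tag in the WHERE clause.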