Vector Search for Unstructured Data Overview - AnalyticDB for PostgreSQL

Traditional databases handle exact matches and range queries well — finding all orders placed between two dates, or all users in a given region. But they break down when the question is "find me images similar to this one" or "retrieve documents with the same meaning as this sentence." That's where vector analysis comes in.

AnalyticDB for PostgreSQL provides vector analysis built on a massively parallel processing (MPP) architecture. AI algorithms convert unstructured data — images, audio, text — into feature vectors, and vector distance measures how similar two items are. Because vector analysis runs inside the same MPP engine that handles structured queries, you can combine similarity search with standard SQL filters in a single statement.

Vector analysis in AnalyticDB for PostgreSQL is used across Alibaba's businesses, including the data mid-end, e-commerce, new retail, and urban intelligence.

How it works

A web application uses a feature extraction service to convert unstructured data into feature vectors, then writes those vectors to the vector library in AnalyticDB for PostgreSQL. At query time, the application extracts a feature vector from the input data and calls the retrieval analysis interface to find the closest matches.

Unstructured data (image / audio / text)
        │
        ▼
Feature extraction service
        │
        ├── Write path ──► Vector library in AnalyticDB for PostgreSQL
        │
        └── Query path ──► Retrieval analysis interface ──► Ranked results

The MPP architecture distributes both storage and computation across nodes. This lets similarity searches scale horizontally as your vector dataset grows, without requiring a separate vector store outside your SQL environment.
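The write and query paths, together with the idea of spreading work across nodes, can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not a description of AnalyticDB for PostgreSQL internals: the class ShardedVectorStore, its write and search methods, and the modulo routing rule are all hypothetical names and choices made for the example.

```python
import heapq
import math


class ShardedVectorStore:
    """Hypothetical stand-in for a vector library spread across several nodes."""

    def __init__(self, num_shards):
        # Each shard holds (item_id, vector) pairs, like one node's slice of the data.
        self.shards = [[] for _ in range(num_shards)]

    def write(self, item_id, vector):
        # Write path: route each row to a shard by its integer ID (assumed rule).
        self.shards[item_id % len(self.shards)].append((item_id, vector))

    def search(self, query, k):
        # Query path, step 1: every shard computes its own local top-k.
        partial = []
        for shard in self.shards:
            scored = ((math.dist(query, vec), item_id) for item_id, vec in shard)
            partial.extend(heapq.nsmallest(k, scored))
        # Step 2: a coordinator merges the partial results into a global top-k.
        return [item_id for _, item_id in heapq.nsmallest(k, partial)]


store = ShardedVectorStore(num_shards=3)
vectors = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [0.1, 0.0], 4: [5.0, 5.0], 5: [0.0, 0.3]}
for item_id, vec in vectors.items():
    store.write(item_id, vec)

nearest = store.search([0.0, 0.0], k=3)
print(nearest)  # [1, 3, 5]
```

Because each shard only needs to return its own k best candidates, the merge step stays small no matter how many vectors each shard holds.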

Key concepts

Concept	Description
Feature vector	A numerical representation of an item's characteristics, extracted by an AI model. Items with similar features produce vectors that are close together in high-dimensional space.
Vector distance	A measure of similarity between two vectors. Smaller distance means higher similarity.
kNN join	A k-nearest neighbor join: for each vector in one set, it finds the k closest vectors in another set, similar to kNN join in Apache Spark. Used for product deduplication and face clustering.
FP32 / FP16	Floating-point precision formats. A 512-dimensional FP32 vector uses 2 KB of storage; converting to FP16 halves that to 1 KB.
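To make "vector distance" concrete, the sketch below computes two common distance measures, Euclidean distance and cosine distance, over small hand-written vectors. The source does not state which metrics AnalyticDB for PostgreSQL uses, so treat the choice of metrics as an assumption; the 3-dimensional vectors and their names are hypothetical (real feature vectors usually have hundreds of dimensions).

```python
import math


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical feature vectors for three catalog items.
red_dress = [0.9, 0.1, 0.3]
crimson_dress = [0.8, 0.2, 0.3]
blue_jeans = [0.1, 0.9, 0.7]

d_similar = euclidean_distance(red_dress, crimson_dress)
d_different = euclidean_distance(red_dress, blue_jeans)
c_similar = cosine_distance(red_dress, crimson_dress)
c_different = cosine_distance(red_dress, blue_jeans)

# Under both measures, the two dresses are closer to each other than to the jeans.
print(d_similar < d_different, c_similar < c_different)  # True True
```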

Use cases

Vector analysis is suited for scenarios where you need to find items by content similarity rather than exact attribute values:

Image similarity search: Find products that look like a given photo — for example, visually similar dresses in an e-commerce catalog.
Audio matching: Identify audio files that match a given recording using voiceprint recognition.
Semantic text search: Retrieve documents or passages that share the same meaning as a query, even when the exact words differ.
Duplicate detection: Remove duplicate files by comparing their fingerprints across a large dataset.
Product grouping: Cluster images that contain the same product across a large image library.

When vector analysis is a good fit:

Your data is unstructured (images, audio, text) and exact-match queries are insufficient.
You need to filter similarity results by structured attributes such as price, date, or category in the same query.
Your dataset requires real-time updates — new items must be searchable immediately after ingestion.

Benefits

Hybrid queries across structured and unstructured data

Vector analysis is built into the same SQL engine as structured queries, so you can combine similarity search with standard filters in one statement. For example: find dresses visually similar to a given image, with a price between USD 100 and USD 200, published within the last month.
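The dress example above can be sketched in Python to show the shape of a hybrid query: a structured filter followed by a similarity ranking. This is not AnalyticDB for PostgreSQL syntax. The catalog rows, field names, the hybrid_search function, and the fixed "today" date are all hypothetical, and a real system would express the same logic in one SQL statement.

```python
import math
from datetime import date, timedelta

# Hypothetical catalog rows: structured attributes plus a 2-dimensional feature vector.
catalog = [
    {"id": 1, "price": 150, "published": date(2024, 5, 20), "vec": [0.9, 0.1]},
    {"id": 2, "price": 120, "published": date(2024, 5, 25), "vec": [0.2, 0.8]},
    {"id": 3, "price": 300, "published": date(2024, 5, 28), "vec": [1.0, 0.0]},
    {"id": 4, "price": 180, "published": date(2024, 1, 10), "vec": [0.95, 0.05]},
    {"id": 5, "price": 199, "published": date(2024, 5, 10), "vec": [0.7, 0.3]},
]


def hybrid_search(rows, query_vec, min_price, max_price, published_since, k):
    # Structured part: keep rows that satisfy the price and date conditions.
    candidates = [
        r for r in rows
        if min_price <= r["price"] <= max_price and r["published"] >= published_since
    ]
    # Unstructured part: rank the survivors by distance to the query vector.
    candidates.sort(key=lambda r: math.dist(r["vec"], query_vec))
    return [r["id"] for r in candidates[:k]]


today = date(2024, 6, 1)  # fixed so the example is deterministic
result = hybrid_search(
    catalog,
    query_vec=[1.0, 0.0],
    min_price=100,
    max_price=200,
    published_since=today - timedelta(days=30),
    k=2,
)
print(result)  # [1, 5]
```

Row 3 is the closest vector but fails the price filter, and row 4 is close but too old, so neither appears in the result.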

Real-time ingestion and querying

Vectors written to AnalyticDB for PostgreSQL are immediately available for query. Standalone vector systems that apply updates in batches leave newly ingested data unsearchable until the next processing cycle, often the following day.

Large-scale vector comparison with kNN join

The k-nearest neighbor (kNN) join finds, for each vector in one set, the k most similar vectors in another set. This is computationally intensive, comparable to kNN join in Spark, because a naive comparison grows with the product of the two set sizes. AnalyticDB for PostgreSQL is optimized to handle it at scale. Typical applications include product deduplication (detecting new products similar to existing catalog entries) and face clustering (grouping images of the same person from a face database).
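A brute-force version of a kNN join, sketched here in Python, shows what the operation computes. It makes no claim about how AnalyticDB for PostgreSQL optimizes the join; the function knn_join and the product IDs and vectors are hypothetical.

```python
import heapq
import math


def knn_join(left, right, k):
    """For each (id, vector) in left, return the IDs of the k nearest vectors in right."""
    result = {}
    for left_id, left_vec in left:
        # Naive approach: score every right-side vector against this left-side vector.
        scored = ((math.dist(left_vec, right_vec), right_id) for right_id, right_vec in right)
        result[left_id] = [right_id for _, right_id in heapq.nsmallest(k, scored)]
    return result


# Hypothetical deduplication scenario: match new products against catalog entries.
new_products = [("new-1", [0.9, 0.1]), ("new-2", [0.1, 0.9])]
catalog = [("cat-a", [1.0, 0.0]), ("cat-b", [0.0, 1.0]), ("cat-c", [0.5, 0.5])]

matches = knn_join(new_products, catalog, k=2)
print(matches)  # {'new-1': ['cat-a', 'cat-c'], 'new-2': ['cat-b', 'cat-c']}
```

The nested loop performs len(left) × len(right) distance computations, which is why the operation becomes expensive as both sets grow.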

Standard SQL interface

Use SQL statements to query your vector data. No proprietary query language or separate client SDK is required, which reduces integration effort for teams already working with PostgreSQL-compatible tools.

Reduced storage costs with FP16 compression

One 512-dimensional FP32 vector consumes 2 KB (512 values × 4 bytes). Converting FP32 to FP16 stores each value in 2 bytes, which halves vector storage to 1 KB per vector without changing the query interface.
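The storage arithmetic can be checked with Python's standard struct module, which packs floats as 4-byte single precision ("f") or 2-byte half precision ("e"). The vector values below are arbitrary placeholders, and the sketch only illustrates the size difference and rounding effect; it says nothing about how AnalyticDB for PostgreSQL stores vectors internally.

```python
import struct

DIMENSIONS = 512
# Arbitrary placeholder values in the range [0, 0.512).
vector = [i / 1000.0 for i in range(DIMENSIONS)]

fp32_bytes = struct.pack(f"<{DIMENSIONS}f", *vector)  # 4 bytes per value
fp16_bytes = struct.pack(f"<{DIMENSIONS}e", *vector)  # 2 bytes per value

print(len(fp32_bytes))  # 2048 bytes = 2 KB
print(len(fp16_bytes))  # 1024 bytes = 1 KB

# FP16 keeps fewer significant bits, so values are rounded slightly on conversion.
restored = struct.unpack(f"<{DIMENSIONS}e", fp16_bytes)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(max_error < 1e-3)  # True
```

For values in this range the rounding error stays well below 0.001, but whether that precision is acceptable depends on the feature extraction model and the distance metric in use.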