An External Collection is a feature of Alibaba Cloud Milvus that lets you query data directly from Data Lake Formation (DLF) lake tables. Because only metadata is synchronized, Milvus can query vector data in DLF without copying or storing the raw data. This document explains how to create and manage External Collections using Milvus Manager.
Limitations
This feature is only supported in Milvus 2.6 and later. The minor version of your instance must be 2.6.3-0.4.12_3.9.0 or later. To upgrade your instance, see Upgrade Version.
Currently, only Paimon tables are supported.
Key concepts
Concept | Description |
External Collection | Maps the schema of a DLF lake table to a Milvus collection for querying, without requiring data migration. |
lake table | A table in a data lake. Currently, only Paimon tables are supported. |
snapshot | A point-in-time snapshot of a lake table. |
tag | A label applied to a specific snapshot of a lake table. |
Associate a RAM identity
Before using the External Collection feature, you must associate a RAM identity with a Milvus user. Once associated, Milvus uses that RAM identity to authorize access to DLF resources.
The following rules apply to RAM account association:
A Milvus administrator can bind or unbind a RAM account for any Milvus user within their management scope.
A non-admin Milvus user can only bind or unbind a RAM account for their own account.
When logged in with an Alibaba Cloud account, you can bind any RAM user under that account to the target Milvus account.
When logged in as a RAM user, you can bind only that RAM user.
Associating an Alibaba Cloud account is not supported.
Create an external collection
Follow the three-step wizard to create an External Collection:
Step 1: Link a lake table
Select a DLF instance, database, and target table. The page displays the schema (field name and type) of the lake table. Unsupported types are highlighted in red. For detailed mapping rules, see Schema mapping rules below.
Step 2: Configure the external collection
Configure the following information:
Name and Description: The name and description of the collection.
Primary key configuration: The primary key supports INT64 and VarChar types. Auto ID is not supported.
Vector field: Supports Float, Binary, Float16, and BFloat16 vector types. You must specify the dimension.
Scalar field configuration: Configure the scalar field mapping based on the fields of the lake table.
You must specify the vector dimension based on the actual data in your DLF lake table.
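The Step 2 constraints can be checked before you open the wizard. The following is a minimal sketch in plain Python, not part of any Milvus SDK: the function name `validate_external_collection_config` and its parameters are hypothetical, and the type labels are simply the ones used on this page.

```python
# Hypothetical pre-flight check for the Step 2 constraints described above.
# The type labels are copied from this page; nothing here calls Milvus.
SUPPORTED_PRIMARY_KEY_TYPES = {"INT64", "VarChar"}
SUPPORTED_VECTOR_TYPES = {"Float", "Binary", "Float16", "BFloat16"}


def validate_external_collection_config(pk_type, auto_id, vector_type, dim):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if pk_type not in SUPPORTED_PRIMARY_KEY_TYPES:
        problems.append(f"primary key type {pk_type!r} is not INT64 or VarChar")
    if auto_id:
        problems.append("Auto ID is not supported for External Collections")
    if vector_type not in SUPPORTED_VECTOR_TYPES:
        problems.append(f"vector type {vector_type!r} is not supported")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(dim, bool) or not isinstance(dim, int) or dim <= 0:
        problems.append("vector dimension must be a positive integer")
    return problems
```

For example, `validate_external_collection_config("INT64", False, "Float", 768)` returns an empty list. The check cannot confirm that the dimension matches the data in the lake table; that still has to be verified against the source.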
Step 3: Confirm data mapping
Review the mapping between the DLF lake table fields and the Milvus collection fields, then click Create to create the corresponding Milvus collection.
Snapshot sync
After creating an External Collection, you must synchronize a snapshot before you can query data. You can perform a quick sync using the post-creation wizard, or configure it later on the Snapshot sync page.
The Snapshot sync page displays the tag information, tag status, synchronization time, and row count for the DLF lake table synchronized with the current External Collection.
Sync operations
Auto mode: Automatically fetches the latest snapshot of the DLF lake table (prioritizing COMPACT commit types), creates a tag, and synchronizes it to the Milvus External Collection. You can configure a snapshot prefix for filtering.
Manual mode: Synchronizes a specific, existing tag by its name.
Scheduled sync
You can enable scheduled sync and configure a policy for automatic synchronization:
Simple cycle mode: Set the sync interval by minute, hour, or day.
Cron expression mode: Use a Cron expression to define a custom sync schedule. You can also configure a snapshot prefix.
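The two modes can describe the same schedule. As an illustration, the sketch below translates a simple cycle into a Cron expression. It assumes the standard five-field syntax (minute, hour, day of month, month, day of week); this page does not state which Cron dialect the console accepts, so check that before reusing the output. The function name `simple_cycle_to_cron` is hypothetical.

```python
def simple_cycle_to_cron(every, unit):
    """Translate 'every N minutes/hours/days' into a five-field cron string.

    Assumes the field order: minute hour day-of-month month day-of-week.
    """
    if isinstance(every, bool) or not isinstance(every, int) or every < 1:
        raise ValueError("interval must be a positive integer")
    if unit == "minute":
        return f"*/{every} * * * *"
    if unit == "hour":
        return f"0 */{every} * * *"
    if unit == "day":
        return f"0 0 */{every} * *"
    raise ValueError("unit must be 'minute', 'hour', or 'day'")
```

Under that assumption, `simple_cycle_to_cron(15, "minute")` yields `*/15 * * * *`. Step values restart at each field boundary, so the result is an exact fixed interval only when the step divides the field range evenly (60 for minutes, 24 for hours); day steps restart at the beginning of every month.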
Sync task list
You can filter tasks by status or tag and page through the results.
Unsupported features
Compared to regular collections, External Collections have the following limitations:
Schema limitations
Unsupported field types:
The auto_id=True parameter is not supported. External Collections must use an external primary key.
TIMESTAMPTZ and GEOMETRY types are not supported.
SparseFloatVector type is not supported.
Nested complex types, such as Array<Array<T>>, are not supported.
A dynamic field is not supported.
A partition key is not supported.
Field naming limitations:
Field names are case-sensitive.
Field names must exactly match the corresponding Paimon table field names.
Field name mapping is not supported. For example, you cannot map a Milvus field user_id to a Paimon field userId.
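Because names must match exactly, it can help to compare the two field lists before creating the collection. The sketch below is a plain-Python illustration, not a Milvus or Paimon API. The function name `find_field_name_mismatches` is hypothetical, and the "near match" heuristic (ignoring case and underscores) is an assumption chosen only to surface likely rename attempts.

```python
def find_field_name_mismatches(milvus_fields, paimon_fields):
    """Return Milvus field names without an exact, case-sensitive Paimon match.

    Each mismatched name maps to a Paimon field that differs only by case or
    underscores (a likely attempted rename), or to None if there is none.
    """
    paimon_set = set(paimon_fields)
    # Index Paimon names by a normalized form to suggest near matches.
    normalized = {name.replace("_", "").lower(): name for name in paimon_fields}
    mismatches = {}
    for name in milvus_fields:
        if name in paimon_set:
            continue
        mismatches[name] = normalized.get(name.replace("_", "").lower())
    return mismatches
```

With Milvus fields `["id", "user_id", "embedding"]` and Paimon fields `["id", "userId", "embedding"]`, the function reports `user_id` as a mismatch and suggests `userId` as the Paimon name to use instead.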
Other limitations:
Schema evolution is not supported.
Field name remapping is not supported.
The is_function_output=True parameter for function output fields is not supported.
Data operation limitations
Unsupported operations:
insert(), upsert(), delete(), and flush() operations are not supported.
Data cannot be modified directly; it must be changed in the source Paimon table.
Supported operations:
search() for vector search.
query() for scalar queries.
create_index() to create an index.
load() and release() to load or release a collection.
bulk_import() to trigger a refresh (incremental sync).
Index limitations
Supported index types:
Vector indexes: HNSW, HNSW_SQ, HNSW_PQ, IVF_FLAT, IVF_SQ8, IVF_PQ, IVF_RABITQ, and SCANN.
Scalar indexes: INVERTED, BITMAP, and STL_SORT.
Index behavior:
Indexes are built by Milvus and stored in Milvus object storage.
Indexing does not affect the data in the Paimon table.
Deleting the collection also deletes its indexes.
Performance limitations
Cold-read performance: Without an index, queries must scan the corresponding columns in the Paimon table. To improve query performance, create an index for each mapped field.
Synchronization performance: Synchronization speed depends on the size of the Paimon table's snapshot, network bandwidth, and Catalog response time.
Consistency limitations
Tag consistency:
External Collections provide consistency guarantees based on DLF tags. You are responsible for managing the lifecycle of these tags.
Query results reflect the data state associated with the tag from the last successful synchronization. While a new synchronization is in progress, you can continue to query the data associated with the previous tag.
To query the latest data, you must manually trigger a refresh.
Concurrency limitations:
Only one synchronization task can run at a time for the same External Collection.
You can perform queries during synchronization. Queries run against the data from the previous synchronization.
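The tag and concurrency rules above can be summarized as a small state model. The class below is a toy illustration in plain Python, not Milvus code; the class and method names are hypothetical and exist only to make the visibility rules concrete.

```python
class ExternalCollectionSyncState:
    """Toy model of the tag visibility rules described above."""

    def __init__(self):
        self.visible_tag = None   # tag that queries currently read from
        self.syncing_tag = None   # tag being synchronized, if any

    def start_sync(self, tag):
        # Only one synchronization task may run at a time per collection.
        if self.syncing_tag is not None:
            raise RuntimeError("a synchronization task is already running")
        self.syncing_tag = tag

    def finish_sync(self, succeeded):
        if self.syncing_tag is None:
            raise RuntimeError("no synchronization task is running")
        # Queries switch to the new tag only after a successful sync.
        if succeeded:
            self.visible_tag = self.syncing_tag
        self.syncing_tag = None

    def query_tag(self):
        # Queries always see the last successfully synchronized tag.
        return self.visible_tag
```

In this model, a query issued while a second tag is syncing still reads the first tag, and a failed synchronization leaves the visible tag unchanged.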
Schema mapping rules
External Collections support mapping the following Milvus field types to Paimon table fields:
Scalar type mapping
Milvus type | Paimon type | Description |
Bool | BOOLEAN | Boolean type |
Int8 | TINYINT | 8-bit integer |
Int16 | SMALLINT | 16-bit integer |
Int32 | INT | 32-bit integer |
Int64 | BIGINT | 64-bit integer |
Float | FLOAT | Single-precision floating-point |
Double | DOUBLE | Double-precision floating-point |
VarChar | STRING / VARCHAR / CHAR | String type |
Vector type mapping
Milvus type | Paimon type | Description |
FloatVector | ARRAY<FLOAT> / ARRAY<DOUBLE> | Floating-point vector |
Float16Vector | ARRAY<FLOAT> / ARRAY<DOUBLE> | Half-precision floating-point vector |
BinaryVector | ARRAY<BOOLEAN> | Boolean vector |
Array type mapping
Milvus type | Paimon type | Description |
Array<Bool> | ARRAY<BOOLEAN> | Boolean array |
Array<Int8> | ARRAY<TINYINT> | Int8 array |
Array<Int16> | ARRAY<SMALLINT> | Int16 array |
Array<Int32> | ARRAY<INT> | Int32 array |
Array<Int64> | ARRAY<BIGINT> | Int64 array |
Array<Float> | ARRAY<FLOAT> | Float array |
Array<Double> | ARRAY<DOUBLE> | Double array |
Array<VarChar> | ARRAY<STRING> | String array |
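The three mapping tables can be combined into a single lookup that answers "which Milvus types can this Paimon column map to?". The sketch below is a plain-Python transcription of the tables on this page, not an SDK function. The name `candidate_milvus_types` is hypothetical. Treating `BOOL` as an alias of `BOOLEAN` and ignoring a length suffix such as `VARCHAR(255)` are both assumptions.

```python
import re

# Transcribed from the scalar, array, and vector mapping tables above.
SCALAR_TYPES = {
    "BOOLEAN": "Bool", "TINYINT": "Int8", "SMALLINT": "Int16", "INT": "Int32",
    "BIGINT": "Int64", "FLOAT": "Float", "DOUBLE": "Double",
    "STRING": "VarChar", "VARCHAR": "VarChar", "CHAR": "VarChar",
}
ARRAY_ELEMENT_TYPES = {
    "BOOLEAN": "Bool", "TINYINT": "Int8", "SMALLINT": "Int16", "INT": "Int32",
    "BIGINT": "Int64", "FLOAT": "Float", "DOUBLE": "Double", "STRING": "VarChar",
}
VECTOR_ELEMENT_TYPES = {
    "FLOAT": ["FloatVector", "Float16Vector"],
    "DOUBLE": ["FloatVector", "Float16Vector"],
    "BOOLEAN": ["BinaryVector"],
}


def candidate_milvus_types(paimon_type):
    """Return the Milvus types a Paimon type can map to ([] if unsupported)."""
    normalized = re.sub(r"\s+", "", paimon_type).upper()
    match = re.fullmatch(r"ARRAY<(.+)>", normalized)
    if match:
        element = match.group(1)
        if element == "BOOL":  # assumption: BOOL is shorthand for BOOLEAN
            element = "BOOLEAN"
        candidates = []
        if element in ARRAY_ELEMENT_TYPES:
            candidates.append(f"Array<{ARRAY_ELEMENT_TYPES[element]}>")
        candidates.extend(VECTOR_ELEMENT_TYPES.get(element, []))
        return candidates  # nested arrays fall through to an empty list
    base = re.sub(r"\(\d+\)$", "", normalized)  # drop a length such as (255)
    return [SCALAR_TYPES[base]] if base in SCALAR_TYPES else []
```

A column of type `ARRAY<FLOAT>` returns both an array type and the vector types, which reflects the choice you make in Step 2 of the wizard: the same Paimon column can serve as a scalar array or as a vector field.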