PolarDB for AI lets you connect external databases—PostgreSQL, Hive, Elasticsearch, and others—to Data-Agent so that users can query them using plain language. Register the schema metadata of your external tables, and the built-in large language model (LLM) translates natural language questions like "find the customer with the most orders" into SQL or DSL queries that run against your actual data source.
How it works
PolarDB for AI stores metadata from your external data sources—table schemas, column descriptions, and sample values—in a dedicated table. The built-in _polar4ai_text2vec model converts that metadata into vectors and stores them in a searchable schema_index table.
When a user asks a question, the NL2SQL (natural language to SQL) feature:

1. Converts the question into a vector.
2. Searches `schema_index` to find the most relevant tables and columns.
3. Generates an executable SQL or DSL query using the retrieved metadata.
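The retrieval step can be illustrated with a minimal sketch. This is not PolarDB's internal code: the vectors below are hand-made three-dimensional stand-ins for the 768-dimensional embeddings that _polar4ai_text2vec produces, and the table entries are toy metadata.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "schema_index": each entry pairs a table with a metadata vector.
schema_index = [
    {"table": "customers", "vecs": [0.9, 0.1, 0.0]},
    {"table": "orders",    "vecs": [0.1, 0.9, 0.2]},
]

def retrieve(question_vec, index, top_k=1):
    # Rank tables by similarity between the vectorized question
    # and each table's metadata vector; keep the best matches.
    ranked = sorted(index, key=lambda e: cosine(question_vec, e["vecs"]),
                    reverse=True)
    return [e["table"] for e in ranked[:top_k]]

# A question about orders maps to a vector close to the orders entry.
print(retrieve([0.0, 1.0, 0.1], schema_index))  # -> ['orders']
```

The retrieved table metadata is then placed into the LLM prompt, which is what lets the model reference real column names instead of guessing them.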
Choose a build mode
PolarDB for AI provides two modes for registering external metadata. Choose based on how much control you need and whether your source database can export DDL.
| | schema_info mode | schema_meta mode |
|---|---|---|
| How it works | You define each table, column, data type, and sample value manually | The system parses DDL or Elasticsearch index definitions automatically |
| Best for | Fine-grained control: adding business descriptions, sample values, and foreign key relationships to improve NL2SQL accuracy | Quickly onboarding many tables from databases that can export standard DDL |
| Supported sources | Any data source (all metadata is entered manually) | Hive, MySQL, PostgreSQL, and Elasticsearch |
| Trade-off | More setup work; richer metadata quality | Less setup work; accuracy depends on what is already in the DDL |
Start with schema_info mode when query accuracy matters most or when your source database cannot export DDL. Use schema_meta mode when you need to onboard many tables quickly and the DDL contains adequate column descriptions.
Think of the metadata you provide as an onboarding package for the LLM. The more context you give—clear table descriptions, realistic sample values, explicit foreign key relationships—the more accurately it can translate questions into correct queries. A sparse metadata set produces acceptable results for simple questions but will struggle with joins and business-specific filters.
Prerequisites
Before you begin, ensure that you have:
- An AI node added to your cluster and a database account configured for the AI node connection. For details, see Enable the PolarDB for AI feature.

  If you added an AI node when you purchased the cluster, you can skip the node setup and go straight to configuring the database account. The account must have Read and Write permissions on the target database.

- A connection to your PolarDB cluster using the cluster endpoint. For details, see Log on to PolarDB for AI.

  When connecting from the command line, include the -c flag. If you use Data Management Service (DMS), it connects to the primary endpoint by default, which routes queries away from the AI node. Switch to the cluster endpoint by using the database selector at the top of the DMS page to search for and select your target database.
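A command-line connection might look like the following. The endpoint, port, and account name are placeholders; substitute your own values. The -c (--comments) option tells the mysql client not to strip comments, which matters here because PolarDB for AI statements carry the /*polar4ai*/ hint as a comment.

```shell
# Placeholders: replace the endpoint, port, and account with your own.
# -c keeps comments so the /*polar4ai*/ hint reaches the server intact.
mysql -h pc-xxxxxxxx.mysql.polardb.rds.aliyuncs.com -P 3306 -u ai_user -p -c
```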
Build a Data-Agent with manual metadata (schema_info mode)
Use this mode when you want precise control over the metadata the LLM sees. Each row in the schema_info table describes one column from an external table.
Step 1: Create the schema_info table
CREATE TABLE schema_info (
id INT AUTO_INCREMENT PRIMARY KEY COMMENT 'Auto-increment primary key',
table_name VARCHAR(255) NOT NULL COMMENT 'Table name',
table_comment TEXT COMMENT 'Table comment',
column_name VARCHAR(255) NOT NULL COMMENT 'Column name',
column_comment TEXT COMMENT 'Column comment',
data_type VARCHAR(255) COMMENT 'Column data type (optional)',
sample_values TEXT COMMENT 'Sample values',
is_primary INT COMMENT 'Whether it is a primary key (1/0)',
is_foreign INT COMMENT 'Whether it is a foreign key (1/0)',
ext TEXT COMMENT 'Foreign key information',
db_type VARCHAR(128) COMMENT 'Relational database language type (such as MySQL, PostgreSQL)'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='Database metadata information table';
Step 2: Insert metadata
Insert one row per column. Follow these constraints:
- `db_type`: Set to the source database type, such as `MySQL` or `PostgreSQL`. All columns from the same logical database must share the same `db_type`.
- `sample_values`: Separate multiple values with commas. Keep the total field length under 100 characters.
- `is_foreign`: Set to `1` if the column is a foreign key.
- `ext`: Required when `is_foreign` is `1`. Format: `<database_name>.<table_name>.<column_name>`.
The following example inserts metadata for a customers table and an orders table. Note that orders.customer_id is a foreign key pointing to customers.id—this relationship lets the LLM generate correct JOIN queries.
-- Metadata for the customers table
INSERT INTO schema_info (table_name, table_comment, column_name, column_comment, data_type, sample_values, is_primary, is_foreign, ext, db_type)
VALUES
('customers', 'Customer information table', 'id', 'Unique customer ID', 'INT', '1,2,3', 1, 0, NULL, 'PostgreSQL'),
('customers', 'Customer information table', 'name', 'Customer name', 'VARCHAR(100)', 'Zhang San,Li Si,Wang Wu', 0, 0, NULL, 'PostgreSQL'),
('customers', 'Customer information table', 'email', 'Customer email', 'VARCHAR(100) UNIQUE', 'zhangsan@example.com', 0, 0, NULL, 'PostgreSQL');
-- Metadata for the orders table
INSERT INTO schema_info (table_name, table_comment, column_name, column_comment, data_type, sample_values, is_primary, is_foreign, ext, db_type)
VALUES
('orders', 'Order information table', 'order_id', 'Unique order ID', 'INT', '1001,1005,1003', 1, 0, NULL, 'PostgreSQL'),
('orders', 'Order information table', 'order_date', 'Order date', 'DATE', '2023-01-01,2023-05-06', 0, 0, NULL, 'PostgreSQL'),
('orders', 'Order information table', 'customer_id', 'Associated customer ID', 'INT', '1,2,3', 0, 1, 'dbname.customers.id', 'PostgreSQL'),
('orders', 'Order information table', 'total_amount', 'Total order amount', 'DECIMAL(10,2)', '99.99,101.81', 0, 0, NULL, 'PostgreSQL');
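If you generate these rows programmatically, it helps to validate the constraints above before running the INSERT. The following helper is hypothetical (not part of PolarDB); the function name and parameters are my own, and it simply builds a parameter tuple matching the schema_info column order.

```python
# Hypothetical helper: builds one parameterized schema_info row and
# enforces the documented constraints before insertion.
def schema_info_row(table, table_comment, column, column_comment,
                    data_type=None, samples=(), is_primary=0,
                    fk_target=None, db_type="PostgreSQL"):
    sample_values = ",".join(str(s) for s in samples)
    # sample_values must stay under 100 characters.
    if len(sample_values) >= 100:
        raise ValueError("sample_values must be under 100 characters")
    is_foreign = 1 if fk_target else 0
    # ext is required for foreign keys, format: <db>.<table>.<column>
    if is_foreign and fk_target.count(".") != 2:
        raise ValueError("fk_target must look like dbname.table.column")
    return (table, table_comment, column, column_comment, data_type,
            sample_values, is_primary, is_foreign, fk_target, db_type)

row = schema_info_row("orders", "Order information table", "customer_id",
                      "Associated customer ID", "INT", [1, 2, 3],
                      fk_target="dbname.customers.id")
```

You can then pass a list of such tuples to your client's executemany against the INSERT statement shown above.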
Step 3: Build the search index
The following commands create the schema_index table and start an asynchronous task that vectorizes all metadata in schema_info.
-- Create the vector index table
/*polar4ai*/CREATE TABLE schema_index(
id integer,
table_name varchar,
table_comment text_ik_max_word,
table_ddl text_ik_max_word,
column_names text_ik_max_word,
column_comments text_ik_max_word,
sample_values text_ik_max_word,
vecs vector_768,
ext text_ik_max_word,
database_service text_ik_max_word,
PRIMARY key (id)
);
-- Start the asynchronous indexing task
/*polar4ai*/SELECT * FROM PREDICT (
MODEL _polar4ai_text2vec, SELECT ''
) WITH (mode='async', resource='schema_info', to_sample = 1)
INTO schema_index;
`to_sample = 1` tells the system to vectorize the sample values in `schema_info.sample_values`, which improves accuracy for value-based queries such as "orders with status 'shipped'". Set it to `0` to skip this step.
After the command runs, it returns a task ID such as b3b8fd1e-b886-11f0-9f89-97664bebacb7. Save this ID to check the task status in the next step.
Step 4: Check the task status
Run the following command with your task ID. When the status changes to finish, the index is ready.
/*polar4ai*/SHOW TASK `<task_id>`;
Replace <task_id> with the ID returned in the previous step.
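For long-running indexing jobs you may want to poll rather than re-run the command by hand. The sketch below is illustrative: run_query stands in for any callable that executes a SQL string against the cluster endpoint and returns the task status string (for example, a thin wrapper around your MySQL client library), and the exact result shape of SHOW TASK may differ in practice.

```python
import time

def wait_for_task(run_query, task_id, poll_seconds=5, timeout=600):
    # Poll SHOW TASK until the status is 'finish' or the timeout expires.
    statement = f"/*polar4ai*/SHOW TASK `{task_id}`;"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if run_query(statement) == "finish":
            return True
        time.sleep(poll_seconds)
    return False

# Example with a stand-in query function that reports 'finish' at once:
done = wait_for_task(lambda sql: "finish",
                     "b3b8fd1e-b886-11f0-9f89-97664bebacb7",
                     poll_seconds=0)
```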
Step 5: Query data using Data-Agent
With the index built, use the NL2SQL feature to query your external data in plain language.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, select 'the 10 customers with the most orders') WITH (basic_index_name='schema_index');
Build a Data-Agent by parsing DDL (schema_meta mode)
Use this mode to onboard many tables at once by importing their DDL or Elasticsearch index definitions. You do not need actual tables in the database—the LLM parses the creation statements and builds the search index directly.
Step 1: Create the schema_meta table
CREATE TABLE IF NOT EXISTS schema_meta (
db_name VARCHAR(255) COMMENT 'Database name',
create_table_statement TEXT COMMENT 'Statement to create a table or index'
) COMMENT='Stores DDL or index definitions from external databases';
Step 2: Import DDL or index definitions
Insert the creation statements from your external data source into schema_meta.
- For relational databases (Hive, MySQL, PostgreSQL): each statement must start with `CREATE`.
- For Elasticsearch: each statement must start with `PUT`.
-- Relational database DDL example (Hive)
INSERT INTO schema_meta (db_name, create_table_statement) VALUES
('my_hive_db', "CREATE TABLE dwd.dwd_col_df_court_case_detail (
case_id BIGINT COMMENT 'Case ID',
created_date STRING COMMENT 'Creation time (yyyy-mm-dd)'
)
COMMENT 'Case order table'
PARTITIONED BY (dt STRING)
STORED AS ORC;");
-- Elasticsearch index definition example
INSERT INTO schema_meta (db_name, create_table_statement) VALUES
('my_es_cluster', 'PUT /product_info
{
"settings": {"number_of_shards": 5, "number_of_replicas": 1},
"mappings" : {
"properties": {
"productName": {"type": "text", "analyzer": "ik_smart"},
"describe": {"type": "text", "analyzer": "ik_smart"}
}
}
}');
Batch import using a script
To import DDL for many tables at once, script the inserts instead of writing each statement by hand.
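One possible approach is sketched below. The function names, the ddl/*.sql path pattern, and the naive semicolon split are all assumptions; a split on ";" breaks on statements that contain literal semicolons, so adapt the parsing to your dump format.

```python
import glob

def split_create_statements(sql_text):
    # Naive split on ';' -- an assumption that works for simple DDL dumps
    # but not for statements containing literal semicolons.
    statements = []
    for stmt in sql_text.split(";"):
        stmt = stmt.strip()
        if stmt.upper().startswith("CREATE"):
            statements.append(stmt + ";")
    return statements

def load_ddl_files(pattern="ddl/*.sql"):
    # The path pattern is a placeholder; point it at your exported DDL files.
    statements = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            statements.extend(split_create_statements(f.read()))
    return statements

def to_insert_params(db_name, statements):
    # Parameter tuples for a parameterized INSERT:
    #   INSERT INTO schema_meta (db_name, create_table_statement)
    #   VALUES (%s, %s)
    return [(db_name, s) for s in statements]
```

With a MySQL client library you would then run the parameterized INSERT via executemany, for example: `cursor.executemany("INSERT INTO schema_meta (db_name, create_table_statement) VALUES (%s, %s)", to_insert_params("my_hive_db", load_ddl_files()))`.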
Step 3: Build the search index
The following operation drops the schema_index table. Any previously built index will be lost. If you already have a working index, back it up or proceed with caution. This operation is for demonstration purposes only.
-- 1. Drop the existing index table
/*polar4ai*/DROP TABLE schema_index;
-- 2. Create a new vector index table
/*polar4ai*/CREATE TABLE schema_index(
id integer,
table_name varchar,
table_comment text_ik_max_word,
table_ddl text_ik_max_word,
column_names text_ik_max_word,
column_comments text_ik_max_word,
sample_values text_ik_max_word,
vecs vector_768,
ext text_ik_max_word,
database_service text_ik_max_word,
PRIMARY key (id)
);
-- 3. Start the asynchronous indexing task from schema_meta
/*polar4ai*/SELECT * FROM PREDICT (
MODEL _polar4ai_text2vec, SELECT ''
) WITH (mode='async', resource='schema_meta', data_service = 'ES')
INTO schema_index;
You must explicitly specify `data_service = 'ES'` when your `schema_meta` table contains Elasticsearch index definitions.
After the command runs, save the returned task ID to check progress in the next step.
Step 4: Check the task status
/*polar4ai*/SHOW TASK `<task_id>`;
When the status changes to finish, the index is ready.
Step 5: Query data using Data-Agent
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, select 'short-term high-yield products') WITH (basic_index_name='schema_index');