Querying a database normally requires SQL expertise. LLM-based NL2SQL removes that barrier: the built-in large language model (LLM) translates plain-language questions into SQL statements, so analysts and non-technical users can query PolarDB directly. The model handles date arithmetic, value mappings (such as valid->isValid=1), and domain-specific patterns after calibration.
Quick demo
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
```

Output:
```sql
SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 0 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;
```

How it works
Implementing NL2SQL follows three phases:
| Phase | Goal | Key tasks |
|---|---|---|
| Phase 1: Quick launch | Get NL2SQL working on your database from scratch | Add table and column comments, build the search index, run your first query |
| Phase 2: Optimization and fine-tuning | Improve accuracy for your specific business data | Add question templates, configure pre- and post-processing rules, customize comments |
| Phase 3: Production deployment | Run NL2SQL reliably in production | Move the validated comments, index, and rules into the production environment |
The rest of this document walks through each phase in sequence.
Prerequisites
Before you begin, make sure you have:
An AI node added to your PolarDB cluster, with a database account that has Read and Write permissions. See Enable the PolarDB for AI feature.
A connection to the cluster using the Cluster Endpoint. See Log on to PolarDB for AI.
When connecting from the command line, include the -c option so the client preserves the /*polar4ai*/ comment hints instead of stripping them. If you use Data Management (DMS), DMS connects via the Primary Endpoint by default, which routes queries away from the AI node. Change the connection to the Cluster Endpoint before running NL2SQL queries.
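As a sketch, a command-line connection might look like the following; the endpoint, account, and database name are placeholders for your own values:

```shell
# Placeholders: replace with your cluster endpoint, account, and database.
# -c (--comments) keeps the /*polar4ai*/ hint so it reaches the AI node.
mysql -h <cluster-endpoint> -P 3306 -u <account> -p -c <database>
```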
Usage notes
How to phrase questions
Structure your questions with the condition first, followed by the entity (column value), then the column name. This order improves the accuracy of the generated SQL.
Example:
```sql
SELECT 'What is the property name of a "house" or "apartment" with more than one room?'
```

Here, *with more than one room* is the condition, *house* and *apartment* are the column values, and *property name* is the column name.
Factors that affect accuracy
| Factor | Effect |
|---|---|
| Richness of table and column comments | More descriptive comments improve how well the model understands your schema |
| Semantic match between your question and column comments | The closer the match, the better the generated SQL |
| Number of columns in the generated SQL | Fewer columns and simpler conditions produce more accurate results |
| SQL logical complexity | Less advanced syntax produces more accurate results |
Phase 1: Quick launch
Step 1: Add comments to your data tables
NL2SQL relies on table and column comments to understand your schema. Before running any queries, add comments to the tables you plan to query.
Table comments help the model identify which tables to use. Keep them concise (within 10 characters) and focused on the table's main content — for example, *orders* or *inventory*.
Column comments should use common nouns or phrases that accurately reflect the column's meaning — for example, *order ID*, *date*, or *store name*. You can also include value mappings, such as: Indicates whether it is valid. 0: No. 1: Yes.
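For example, comments can be added with MySQL-compatible DDL; the `students` table and its columns below are hypothetical and shown only to illustrate the syntax:

```sql
-- Hypothetical table and columns, for illustration only.
ALTER TABLE students COMMENT = 'Student roster';
ALTER TABLE students MODIFY COLUMN student_name VARCHAR(64) COMMENT 'Student name';
ALTER TABLE students MODIFY COLUMN is_valid TINYINT COMMENT 'Indicates whether it is valid. 0: No. 1: Yes.';
```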
Step 2: Build a search index table
The search index table stores vectorized representations of your schema, which the model uses to select relevant tables for each query.
You can use any table name that follows database naming rules. This guide uses schema_index as an example.
```sql
/*polar4ai*/CREATE TABLE schema_index(id integer, table_name varchar, table_comment text_ik_max_word, table_ddl text_ik_max_word, column_names text_ik_max_word, column_comments text_ik_max_word, sample_values text_ik_max_word, vecs vector_768, ext text_ik_max_word, PRIMARY KEY (id));
```

To confirm that the table was created, run `/*polar4ai*/SHOW TABLES;`. To delete the table, run `/*polar4ai*/DROP TABLE IF EXISTS schema_index;`.

Step 3: Import your schema into the search index
This step vectorizes all tables in the current database and imports the schema information into the search index table.
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='schema') INTO schema_index;
```

Parameters
| Parameter | Required | Description |
|---|---|---|
| `mode` | Yes | Data writing mode. Fixed to `async` for asynchronous mode. |
| `resource` | Yes | Resource type. Fixed to `schema` to vectorize table information. |
| `tables_included` | No | Tables to vectorize. Default: `''` (all tables). To specify multiple tables, separate names with commas. |
| `to_sample` | No | Whether to sample column values. `0` (default): no sampling. `1`: sample column values. Sampling improves SQL quality for tables with fewer than 15 columns but increases import time. |
| `columns_excluded` | No | Columns to exclude from NL2SQL. Default: `''` (none excluded). Format: `table_name1.column_name1,table_name2.column_name1`. |
Example: Vectorize specific tables and sample column values, excluding certain columns:
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='schema', tables_included='graph_info,image_info,text_info', to_sample=1, columns_excluded='graph_info.time,text_info.ext') INTO schema_index;
```

Step 4: Check the import status
After running the import, a task_id is returned (for example, bce632ea-97e9-11ee-bdd2-492f4dfe0918). Check the status using:
```sql
/*polar4ai*/SHOW TASK `bce632ea-97e9-11ee-bdd2-492f4dfe0918`;
```

When the return value is `finish`, the import is complete. To view the indexed schema information:
```sql
/*polar4ai*/SELECT * FROM schema_index;
```

Step 5: Run NL2SQL queries
Syntax
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT '<question>') WITH (basic_index_name='<basic_index_name>');
```

Replace `<question>` with your natural language question and `<basic_index_name>` with the name of your search index table.
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| `basic_index_name` | Yes | — | Name of the search index table. |
| `to_optimize` | No | 0 | Whether to optimize the generated SQL. `0`: no optimization. `1`: apply SQL optimization. |
| `basic_index_top` | No | 3 | Number of most similar tables to retrieve. Valid values: [1,10]. Increase to 4 or higher when a query spans multiple tables. |
| `basic_index_threshold` | No | 0.1 | Similarity threshold for table selection. Valid values: (0,1]. Only tables whose vector match score exceeds this threshold are selected. |
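The optional parameters can be combined in a single call. The question and parameter values in this sketch are illustrative, not prescriptive:

```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'List the names of the 3 courses with the highest average scores.') WITH (basic_index_name='schema_index', to_optimize=1, basic_index_top=5, basic_index_threshold=0.2);
```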
Examples
Sort results and display specific columns:
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Sort by teacher name in ascending alphabetical order, and display the teacher''s name and the name of the course they are assigned to teach.') WITH (basic_index_name='schema_index');
```

Output:

```sql
SELECT t.teacher_name, c.course_name FROM teachers t JOIN courses c ON t.id = c.teacher_id ORDER BY t.teacher_name ASC;
```

Filter, sort, and limit results:
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
```

Output:

```sql
SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 0 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;
```