Querying a database normally requires SQL expertise. LLM-based NL2SQL removes that barrier: the built-in large language model (LLM) translates plain-language questions into SQL statements, so analysts and non-technical users can query PolarDB directly. The model handles date arithmetic, value mappings (such as valid->isValid=1), and domain-specific patterns after calibration.
Quick demo
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
```

Output:
```sql
SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 0 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;
```

How it works
Implementing NL2SQL follows three phases:
| Phase | Goal | Key tasks |
|---|---|---|
| Phase 1: Quick launch | Get NL2SQL working on your database from scratch | Add table and column comments, build the search index, run your first query |
| Phase 2: Optimization and fine-tuning | Improve accuracy for your specific business data | Add question templates, configure pre- and post-processing rules, customize comments |
| Phase 3: Production deployment | Run NL2SQL reliably in production | Move the validated comments, index, and rules into the production environment |
The rest of this document walks through each phase in sequence.
Prerequisites
Before you begin, make sure you have:
An AI node added to your PolarDB cluster, with a database account that has Read and Write permissions. See Enable the PolarDB for AI feature.
A connection to the cluster using the Cluster Endpoint. See Log on to PolarDB for AI.
When connecting from the command line, include the -c option so the client preserves the /*polar4ai*/ comment hints instead of stripping them. If you use Data Management (DMS), DMS connects via the Primary Endpoint by default, which routes queries away from the AI node. Change the connection to the Cluster Endpoint before running NL2SQL queries.
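As a sketch, a command-line connection might look like the following; the endpoint, account, and database name are placeholders for your own values:

```shell
# Placeholders: replace with your cluster endpoint, account, and database.
# -c (--comments) keeps the /*polar4ai*/ hint so it reaches the AI node.
mysql -h <cluster-endpoint> -P 3306 -u <account> -p -c <database>
```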
Usage notes
How to phrase questions
Structure your questions with the condition first, followed by the entity (column value), then the column name. This order improves the accuracy of the generated SQL.
Example:
```sql
SELECT 'What is the property name of a "house" or "apartment" with more than one room?'
```

Here, *with more than one room* is the condition, *house* and *apartment* are the column values, and *property name* is the column name.
Factors that affect accuracy
| Factor | Effect |
|---|---|
| Richness of table and column comments | More descriptive comments improve how well the model understands your schema |
| Semantic match between your question and column comments | The closer the match, the better the generated SQL |
| Number of columns in the generated SQL | Fewer columns and simpler conditions produce more accurate results |
| SQL logical complexity | Less advanced syntax produces more accurate results |
Phase 1: Quick launch
Step 1: Add comments to your data tables
NL2SQL relies on table and column comments to understand your schema. Before running any queries, add comments to the tables you plan to query.
Table comments help the model identify which tables to use. Keep them concise (within 10 characters) and focused on the table's main content — for example, *orders* or *inventory*.
Column comments should use common nouns or phrases that accurately reflect the column's meaning — for example, *order ID*, *date*, or *store name*. You can also include value mappings, such as: Indicates whether it is valid. 0: No. 1: Yes.
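For example, comments can be added with MySQL-compatible DDL; the `students` table and its columns below are hypothetical and shown only to illustrate the syntax:

```sql
-- Hypothetical table and columns, for illustration only.
ALTER TABLE students COMMENT = 'Student roster';
ALTER TABLE students MODIFY COLUMN student_name VARCHAR(64) COMMENT 'Student name';
ALTER TABLE students MODIFY COLUMN is_valid TINYINT COMMENT 'Indicates whether it is valid. 0: No. 1: Yes.';
```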
Step 2: Build a search index table
The search index table stores vectorized representations of your schema, which the model uses to select relevant tables for each query.
You can use any table name that follows database naming rules. This guide uses schema_index as an example.
```sql
/*polar4ai*/CREATE TABLE schema_index(id integer, table_name varchar, table_comment text_ik_max_word, table_ddl text_ik_max_word, column_names text_ik_max_word, column_comments text_ik_max_word, sample_values text_ik_max_word, vecs vector_768, ext text_ik_max_word, PRIMARY KEY (id));
```

To confirm that the table was created, run `/*polar4ai*/SHOW TABLES;`. To delete the table, run `/*polar4ai*/DROP TABLE IF EXISTS schema_index;`.

Step 3: Import your schema into the search index
This step vectorizes all tables in the current database and imports the schema information into the search index table.
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='schema') INTO schema_index;
```

Parameters
| Parameter | Required | Description |
|---|---|---|
| `mode` | Yes | Data writing mode. Fixed to `async` for asynchronous mode. |
| `resource` | Yes | Resource type. Fixed to `schema` to vectorize table information. |
| `tables_included` | No | Tables to vectorize. Default: `''` (all tables). To specify multiple tables, separate names with commas. |
| `to_sample` | No | Whether to sample column values. `0` (default): no sampling. `1`: sample column values. Sampling improves SQL quality for tables with fewer than 15 columns but increases import time. |
| `columns_excluded` | No | Columns to exclude from NL2SQL. Default: `''` (none excluded). Format: `table_name1.column_name1,table_name2.column_name1`. |
Example: Vectorize specific tables and sample column values, excluding certain columns:
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='schema', tables_included='graph_info,image_info,text_info', to_sample=1, columns_excluded='graph_info.time,text_info.ext') INTO schema_index;
```

Step 4: Check the import status
After running the import, a task_id is returned (for example, bce632ea-97e9-11ee-bdd2-492f4dfe0918). Check the status using:
```sql
/*polar4ai*/SHOW TASK `bce632ea-97e9-11ee-bdd2-492f4dfe0918`;
```

When the return value is `finish`, the import is complete. To view the indexed schema information:
```sql
/*polar4ai*/SELECT * FROM schema_index;
```

Step 5: Run NL2SQL queries
Syntax
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT '<question>') WITH (basic_index_name='<basic_index_name>');
```

Replace `<question>` with your natural language question and `<basic_index_name>` with the name of your search index table.
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| `basic_index_name` | Yes | — | Name of the search index table. |
| `to_optimize` | No | 0 | Whether to optimize the generated SQL. `0`: no optimization. `1`: apply SQL optimization. |
| `basic_index_top` | No | 3 | Number of most similar tables to retrieve. Valid values: [1,10]. Increase to 4 or higher when a query spans multiple tables. |
| `basic_index_threshold` | No | 0.1 | Similarity threshold for table selection. Valid values: (0,1]. Only tables whose vector match score exceeds this threshold are selected. |
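The optional parameters can be combined in a single call. The question and parameter values in this sketch are illustrative, not prescriptive:

```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'List the names of the 3 courses with the highest average scores.') WITH (basic_index_name='schema_index', to_optimize=1, basic_index_top=5, basic_index_threshold=0.2);
```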
Examples
Sort results and display specific columns:
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Sort by teacher name in ascending alphabetical order, and display the teacher''s name and the name of the course they are assigned to teach.') WITH (basic_index_name='schema_index');
```

Output:

```sql
SELECT t.teacher_name, c.course_name FROM teachers t JOIN courses c ON t.id = c.teacher_id ORDER BY t.teacher_name ASC;
```

Filter, sort, and limit results:
```sql
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
```

Output:

```sql
SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 0 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;
```