To help users who are not familiar with SQL analyze data in databases, PolarDB for AI provides a proprietary AI model for Large Language Model-based Natural Language to SQL (LLM-based NL2SQL) translation. This model is built-in and ready to use. Compared to traditional natural language to SQL (NL2SQL) methods, the LLM-based NL2SQL model provides stronger language understanding. The SQL statements that it generates support more functions, such as date arithmetic. The model can even understand simple mappings, such as valid->isValid=1. After you make adjustments, it can also understand your common SQL patterns, such as defaulting to datastatus=1 in conditions.
Demo
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
Output:
SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 0 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;
End-to-end process for natural language to SQL translation
To help you efficiently implement NL2SQL applications, the process is divided into three phases: Quick Launch, Optimization and Fine-tuning, and Production Deployment.
Quick Launch: This phase helps you quickly build basic NL2SQL capabilities from scratch.
Optimization and Fine-tuning: This phase provides targeted optimization for specific issues that arise in real business scenarios.
Production Deployment: This phase ensures the smooth deployment of the NL2SQL system into the production environment.
Prerequisites
Add an AI node and set a database account to connect to the AI node. For more information, see Enable the PolarDB for AI feature.
Note: If you added an AI node when you purchased the cluster, you can directly set a database account for the AI node.
The database account for the AI node must have Read and Write permissions to read from and write to the target database.
Connect to the PolarDB cluster using the cluster endpoint. For more information, see Log on to PolarDB for AI.
Important: When you connect to the cluster from the command line, add the -c option. When you use the PolarDB for AI feature in Data Management (DMS), DMS connects to the PolarDB cluster by using the primary endpoint by default, which prevents SQL statements from being routed to the AI node. You must manually change the connection endpoint to the cluster endpoint.
Usage notes
How to ask questions: In your questions, clearly state the conditions and related entity values. Place the condition first, followed by the entity for the column value that you are looking for, and finally the possible column name. Example:
SELECT 'What is the property name of a "house" or "apartment" with more than one room?'
Here, "with more than one room" is the condition, "house" and "apartment" are the entities corresponding to the column values, and "property name" is the possible column name.
Query result accuracy: LLM-based NL2SQL is an AI model based on a large language model, and its performance is influenced by many factors. To ensure that the query results meet your expectations, consider the following factors that may affect the overall accuracy:
Richness of table and column comments: Adding comments to each table and its columns improves the query accuracy.
Match between the user question and the column comments in the table: The closer the semantic match between the keywords in the user question and the column comments, the better the query performance is.
Length of the generated SQL statement: Queries are more accurate if the SQL statement involves fewer columns and has simpler conditions.
Logical complexity of the SQL statement: The less advanced syntax that is used in an SQL statement, the more accurate the query is.
Instructions
Standardize data tables
A prerequisite for NL2SQL is that the model must understand the meaning of your tables, including what the column names represent. Before you use LLM-based NL2SQL, you must add comments to frequently used data tables and their columns.
Table comments
Table comments help the LLM-based NL2SQL model better understand the basic information of a table. This helps the model locate the tables that are involved in an SQL statement. Comments should provide a concise and clear overview of the main content of the table, such as orders or inventory. Keep comments within 10 characters and avoid excessive explanations.
Column comments
Column comments usually consist of common nouns or phrases, such as order ID, date, or store name. These comments must accurately reflect the meaning of the column names. You can also add sample data or mappings to column comments. For example, for a column named isValid, the comment can be "Indicates whether it is valid. 0: No. 1: Yes."
If you cannot modify the original comments, you can use the custom table and column comments feature in the advanced usage section to adjust the comments. For more information, see Advanced usage - Customize table and column comments.
Data preparation
You can prepare a test dataset based on your business scenario. This topic uses the following test dataset: test_dataset.sql.
Build a search index table
To extract data from data tables, you must build a search index table. You can use a custom table name that complies with database specifications. This topic uses schema_index as an example. The SQL statement to build the search index table is as follows:
/*polar4ai*/CREATE TABLE schema_index(id integer, table_name varchar, table_comment text_ik_max_word, table_ddl text_ik_max_word, column_names text_ik_max_word, column_comments text_ik_max_word, sample_values text_ik_max_word, vecs vector_768, ext text_ik_max_word, PRIMARY KEY (id));
Note: The search index table is not directly displayed in the database. To view related information, run /*polar4ai*/SHOW TABLES;. To delete the search index table, using schema_index as an example, run /*polar4ai*/DROP TABLE IF EXISTS schema_index;.
Import data from the data tables into the search index table
After you run the following SQL statement, PolarDB for AI performs vectorization on all tables in the current database and samples column values by default.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='schema') INTO schema_index;
Parameter description:
_polar4ai_text2vec is the text-to-vector model.
After INTO, specify the name of the search index table that you created in Step 1: Build a search index table.
You can configure multiple parameters in WITH() to set the related behavior:

Parameter | Required | Description |
mode | Yes | The data writing mode. This is fixed to `async` for asynchronous mode. |
resource | Yes | The resource type. This is fixed to `schema` to vectorize table information. |
tables_included | No | The tables to vectorize. The default value is '', which vectorizes all tables. To specify multiple tables, separate the table names with commas (,) and join them into a string. |
to_sample | No | Specifies whether to sample column values. Sampling column values increases the time it takes to import data into the search index table, but it improves the quality of the generated SQL when the table has few columns (less than 15). Valid values: 0 (default): Does not sample column values. 1: Samples column values. |
columns_excluded | No | The columns to exclude from LLM-based NL2SQL operations. The default value is '', which means that all columns in all vectorized tables participate in subsequent LLM-based NL2SQL operations. To set this parameter, concatenate the columns that should not participate into a string in the table_name1.column_name1,table_name1.column_name2,table_name2.column_name1 format. |

Example: The following SQL statement performs vectorization on the graph_info, image_info, and text_info tables in the current database and samples the column values. The time column in the graph_info table and the ext column in the text_info table are excluded from subsequent LLM-based NL2SQL operations.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='schema', tables_included='graph_info,image_info,text_info', to_sample=1, columns_excluded='graph_info.time,text_info.ext') INTO schema_index;
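If you assemble these statements programmatically, the comma-joined formats for tables_included and columns_excluded are easy to get wrong by hand. The following sketch (not an official client; the helper name and its argument shapes are this document's own illustration) builds the import statement from Python data structures:

```python
# Sketch: build the WITH() clause for the schema-vectorization statement,
# so that table and column lists follow the documented comma-joined formats.

def build_schema_import_sql(index_table, tables=None, to_sample=0, columns_excluded=None):
    """Return the /*polar4ai*/ statement that vectorizes table schemas."""
    params = ["mode='async'", "resource='schema'"]
    if tables:
        # tables_included: table names joined with commas into one string
        params.append("tables_included='%s'" % ",".join(tables))
    if to_sample:
        params.append("to_sample=1")
    if columns_excluded:
        # columns_excluded: table_name.column_name pairs joined with commas
        pairs = ["%s.%s" % (t, c) for t, c in columns_excluded]
        params.append("columns_excluded='%s'" % ",".join(pairs))
    return ("/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, "
            "SELECT '') WITH (%s) INTO %s;" % (", ".join(params), index_table))

sql = build_schema_import_sql(
    "schema_index",
    tables=["graph_info", "image_info", "text_info"],
    to_sample=1,
    columns_excluded=[("graph_info", "time"), ("text_info", "ext")],
)
print(sql)
```

Running this reproduces the example statement above, with the parameter lists joined in the documented formats.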
Check the task status
After you run the statement to import data into the search index table, the task_id for the task is returned, such as bce632ea-97e9-11ee-bdd2-492f4dfe0918. You can use the following command to check the import status. When the return value is finish, the task is complete.
/*polar4ai*/SHOW TASK `bce632ea-97e9-11ee-bdd2-492f4dfe0918`;
You can run the following SQL statement to view the search index information:
/*polar4ai*/SELECT * FROM schema_index;
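Because the import runs asynchronously, applications typically poll the task status until it reaches finish. A minimal polling sketch follows; the fetch_status callable is an assumption standing in for however your client executes `/*polar4ai*/SHOW TASK` over its MySQL connection and reads the status column:

```python
# Sketch: poll an asynchronous import task until SHOW TASK reports "finish".
import time

def wait_for_task(task_id, fetch_status, interval=2.0, timeout=600.0):
    """Return the final status, or raise TimeoutError if the task never finishes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(task_id)  # e.g. runs SHOW TASK `<task_id>`; and reads the status
        if status == "finish":
            return status
        time.sleep(interval)
    raise TimeoutError("task %s did not finish within %ss" % (task_id, timeout))

# Stubbed status source for illustration: "running" twice, then "finish".
states = iter(["running", "running", "finish"])
result = wait_for_task("bce632ea-97e9-11ee-bdd2-492f4dfe0918",
                       lambda _tid: next(states), interval=0.01)
print(result)
```

Injecting the status source keeps the loop testable without a live cluster.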
Use LLM-based NL2SQL online
Syntax
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT '<question>') WITH (basic_index_name='<basic_index_name>');
Parameter description:
After <question>, enter the question that you want to convert into an SQL statement. The following sample questions are recommended:

This example scenario:
Sort by teacher name in ascending alphabetical order, and display the teacher's name and the name of the course they are assigned to teach.
Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.
Query the names and locations of courses that start between October 1, 2023, and October 3, 2023.
Count the number of courses each student has taken, filter for students who have taken more than 2 courses, and display their names and course counts in descending order of course count.
Filter for students whose addresses contain "Beijing" or "Shanghai" and display their names and phone numbers.

Other scenarios:
What are the IDs, roles, and names of experts who have performed two or more treatments?
What is the name of the dog breed that is kept in the largest numbers?
Which owner paid the most for their dog's treatments? List the owner's ID and last name.
Tell me the ID and last name of the owner who spent the most on their dog's treatments.
What is the description of the treatment type with the lowest total cost?
You can configure multiple parameters in WITH() to set the related behavior:

Parameter name | Required | Description |
basic_index_name | Yes | The name of the search index table in the current database. |
to_optimize | No | Specifies whether to perform SQL optimization. Valid values: 0 (default): No optimization. 1: Performs SQL optimization. PolarDB for AI further processes the generated SQL statement to make it more optimal. |
basic_index_top | No | The number of most similar tables to retrieve. Valid values: [1,10]. The default value is 3, which selects the 3 most optimal tables for the current question. A setting of 1 is usually sufficient. If multiple tables are involved, you can set this value to 4 or higher to expand the retrieval and improve the results. |
basic_index_threshold | No | The threshold used to determine the similarity of retrieved tables. Valid values: (0,1]. The default value is 0.1. A database table is selected only if its vector match score exceeds 0.1. |
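The retrieval behavior that basic_index_top and basic_index_threshold control can be sketched as a threshold filter followed by a top-k cut. The scores below are made-up illustration values, not real model output:

```python
# Sketch: keep only tables whose vector match score exceeds the threshold,
# then take the top-k most similar, mirroring basic_index_threshold and
# basic_index_top.

def select_tables(scored_tables, top=3, threshold=0.1):
    """scored_tables: list of (table_name, score) pairs."""
    kept = [(name, s) for name, s in scored_tables if s > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)   # most similar first
    return [name for name, _ in kept[:top]]

scores = [("students", 0.62), ("student_courses", 0.55),
          ("courses", 0.31), ("teachers", 0.05)]
# teachers falls below the 0.1 threshold; the top-3 remaining tables are kept.
print(select_tables(scores))
```

Raising the threshold or lowering top narrows the candidate tables the model sees, which is why a larger top helps multi-table questions.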
Examples
Display specific content and sort it in a specified order.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Sort by teacher name in ascending alphabetical order, and display the teacher''s name and the name of the course they are assigned to teach.') WITH (basic_index_name='schema_index');
Output:
SELECT t.teacher_name, c.course_name FROM teachers t JOIN courses c ON t.id = c.teacher_id ORDER BY t.teacher_name ASC;
Search based on specified conditions, display the content that meets the conditions, and sort it in a specified order.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
Output:
SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 0 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;
Advanced usage
PolarDB for AI provides four advanced usage methods. If you encounter the following issues, you can refer to the corresponding advanced usage instructions to resolve them.
Configure question templates: Configure general question templates to enable the model to generate SQL statements based on specific knowledge.
Build a configuration table: Pre-process questions or post-process the generated SQL statements.
Customize table and column comments: If the comments of the original table or columns cannot be modified, you can add new comments for the table and its columns to overwrite the original comments.
Wide table support: When a table has too many columns or you receive a "Please use column index to avoid oversize table information." error, you can build a column index table to enable model support for wide tables.
Configure question templates
Question templates are created for specific domain knowledge and help the model better understand the current question. You can configure general question templates to guide the model with specific knowledge, which enables it to generate SQL statements based on that knowledge.
Instructions
Create a question template table
The name of the question template table must start with polar4ai_nl2sql_pattern, and its schema must include the five columns from the following CREATE TABLE statement.
DROP TABLE IF EXISTS `polar4ai_nl2sql_pattern`;
CREATE TABLE `polar4ai_nl2sql_pattern` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Primary key',
  `pattern_question` text COMMENT 'Template question',
  `pattern_description` text COMMENT 'Template description',
  `pattern_sql` text COMMENT 'Template SQL',
  `pattern_params` text COMMENT 'Template parameters',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Question template column description
Column name | Description |
Template question (pattern_question) | A question with parameters, provided as input to the LLM-based NL2SQL model. In the template question, write parameters in the format #{XXX}. |
Template description (pattern_description) | A summary of the template question, with some entities as parameters, such as date, year, or organization. These entities usually correspond to specific columns in a table. In the template description, write parameters in the format [XXX], and they must be in the same order as the parameters in the template question. |
Template SQL (pattern_sql) | The correct SQL statement corresponding to the template question. This SQL statement must treat the parameters from the template question as variables. Note: The parameters in the template question and template SQL can be different, but they must be related. For example, the parameters should have the same prefix, and their values must have a one-to-one mapping. For #{category} and #{categoryCode}, if the parameter values for category are common trademark, special trademark, and collective trademark, the corresponding categoryCode values are 0, 1, and 2. For more information, see the example below. |
Template parameters (pattern_params) | A JSON string with a structure consisting of three fields: table_name (string): the name of the table in the template SQL. param_info (array): a description of the parameters in the template SQL, where each entry contains param_name (string): the parameter name, and value (array): sample values for the parameter. Note: If the parameter corresponds to a finite set of enumerated values, list all the values in value. If they are just sample values, you can list 2 to 4 values. If there are corresponding parameters, map them by array index. For #{category} and #{categoryCode}, "common trademark" in category corresponds to 0 in categoryCode, and "special trademark" corresponds to 1 in categoryCode. For more information, see the example below. explanation (string): additional information, usually including requirements for the generated SQL statement, such as what information to output or descriptions of fields. |

Note: If you do not need to specify template parameters, you can set the template parameter value to any of the following: NULL, an empty string, or an empty list string [].
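The index-based mapping between corresponding parameters such as #{status} and #{statusCode} can be sketched as follows: a value's position in one parameter's value list selects the value at the same position in the mapped parameter's list. The helper name and the substitution step are this document's own illustration, not an official API:

```python
# Sketch: map a display value (e.g. a status name) to its code via the
# shared array index in param_info, then fill the template SQL.

param_info = [
    {"param_name": "#{status}", "value": ["Not started", "In progress"]},
    {"param_name": "#{statusCode}", "value": [0, 1]},
]

def map_by_index(param_info, source, target, value):
    values = {p["param_name"]: p["value"] for p in param_info}
    idx = values[source].index(value)   # position in the source value list
    return values[target][idx]          # same position in the target list

code = map_by_index(param_info, "#{status}", "#{statusCode}", "In progress")
template_sql = ("SELECT course_name, course_time, course_location "
                "FROM courses WHERE course_name=#{courseName} AND status=#{statusCode}")
filled = template_sql.replace("#{statusCode}", str(code)).replace(
    "#{courseName}", "'Mathematics'")
print(filled)
```

This mirrors how the course-status template below resolves "In progress" to status=1.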
Example
Example 1:
Template question: Query for courses with course name #{courseName} and status #{status}
Template description: What are the courses with [Course Name] and [Status]?
Template SQL: SELECT course_name, course_time, course_location FROM courses WHERE course_name=#{courseName} AND status=#{statusCode}
Template parameters: [{"table_name":"courses","param_info":[{"param_name":"#{courseName}","value":["Mathematics","Physics","Chemistry","English","History","Geography","Biology","Computer Science","Art","Music","Physical Education","Programming","Literature","Psychology","Philosophy","Economics","Sociology","Physics Lab","Chemistry Lab","Biology Lab"]},{"param_name": "#{status}", "value": ["Not started","In progress"]},{"param_name": "#{statusCode}","value": [0,1]}], "explanation": "Output the course name (course_name), course time (course_time), and course location (course_location). Note: status is a constant mapping type. The variable mapping field is statusCode."}]

Example 2:
Template question: What are the national standards planned for release in the year #{issueDate} for projects with the status #{projectStat}?
Template description: What are the national standards planned for release in [Year] for projects with [Project Status]?
Template SQL: SELECT DISTINCT planNum, projectCnName, projectStat FROM sy_cd_me_buss_std_gjbzjh WHERE `planNum` IS NOT NULL AND `dataStatus` != 3 AND `isValid` = 1 AND projectStat=#{projectStat} AND DATE_FORMAT(`issueDate`, '%Y')=#{issueDate}
Template parameters: [{"table_name":"sy_cd_me_buss_std_gjbzjh","param_info":[{"param_name":"#{issueDate}","value":[2009,2010,2011,2012]},{"param_name":"#{projectStat}","value":["Seeking comments","Published","Under review"]}],"explanation":"Output the standard name (projectCnName), plan number, and project status"}]

Example 3:
Template question: What are the trademarks of type #{category} and international classification #{intCls}?
Template description: What are the trademarks of [Trademark Type] and [International Classification]?
Template SQL: SELECT DISTINCT tmName, regNo, status FROM sy_cd_me_buss_ip_tdmk_new WHERE dataStatus!=3 AND isValid = 1 AND category=#{categoryCode} AND intCls=#{intClsCode}
Template parameters: [{"table_name":"sy_cd_me_buss_ip_tdmk_new","param_info":[{"param_name":"#{intCls}","value":["Chemical raw materials","Paints and varnishes","Cosmetics and cleaning preparations","Industrial oils and greases","Pharmaceuticals"]},{"param_name":"#{category}","value":["Common trademark","Special trademark","Collective trademark"]},{"param_name":"#{intClsCode}","value":[1,2,3,4,5]},{"param_name":"#{categoryCode}","value":[0,1,2]}],"explanation":"Output the trademark name (tmName), application/registration number (regNo), and trademark status (status). Note: category is a constant mapping type. The variable mapping field is categoryCode. intCls is a constant mapping type. The variable mapping field is intClsCode."}]
Using the scenario in this topic as an example, create the first template question in the preceding table. Run the following SQL statement:
INSERT INTO `polar4ai_nl2sql_pattern` (`pattern_question`,`pattern_description`,`pattern_sql`,`pattern_params`) VALUES ('Query for courses with course name #{courseName} and status #{status}','What are the courses with [Course Name] and [Status]?','SELECT course_name, course_time, course_location FROM courses WHERE course_name=#{courseName} AND status=#{statusCode}','[{"table_name":"courses","param_info":[{"param_name":"#{courseName}","value":["Mathematics","Physics","Chemistry","English","History","Geography","Biology","Computer Science","Art","Music","Physical Education","Programming","Literature","Psychology","Philosophy","Economics","Sociology","Physics Lab","Chemistry Lab","Biology Lab"]},{"param_name": "#{status}", "value": ["Not started","In progress"]},{"param_name": "#{statusCode}","value": [0,1]}], "explanation": "Output the course name (course_name), course time (course_time), and course location (course_location). Note: status is a constant mapping type. The variable mapping field is statusCode."}]');
Create a question template index table
You can use a custom index table name that complies with database specifications. This topic uses pattern_index as an example. The SQL statement to create the question template index table is as follows:
/*polar4ai*/CREATE TABLE pattern_index(id integer, pattern_question text_ik_max_word, pattern_description text_ik_max_word, pattern_sql text_ik_max_word, pattern_params text_ik_max_word, pattern_tables text_ik_max_word, vecs vector_768, PRIMARY KEY (id));
Note: The question template index table is not directly displayed in the database. To view related information, run /*polar4ai*/SHOW TABLES;. To delete the question template index table, using pattern_index as an example, run /*polar4ai*/DROP TABLE IF EXISTS pattern_index;.
Import information from the question template table into the index table
Note: The question template table cannot be empty. Before you run the SQL statement to import data into the index table, you must add at least one record.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='pattern') INTO pattern_index;
Parameter description:
_polar4ai_text2vec is the text-to-vector model.
After INTO, specify the name of the question template index table that you created in Step 2: Create a question template index table.
You can configure multiple parameters in WITH() to set the related behavior:

Parameter | Required | Description |
mode | Yes | The data writing mode. This is fixed to `async` for asynchronous mode. |
resource | Yes | The resource type. This is fixed to `pattern` to vectorize question template information. |
pattern_table_name | No | The name of the question template table to vectorize. This is the table name from Step 1: Create a question template table. The default value is polar4ai_nl2sql_pattern, which performs vectorization on the polar4ai_nl2sql_pattern table. When you set this parameter, specify the name of a question template table that starts with polar4ai_nl2sql_pattern. |

If you want to maintain different question template index tables for different scenarios or businesses, you can specify different table names when you create the question template index tables. For example, you can create the polar4ai_nl2sql_pattern_user table for user-related scenarios, and then set the index name to pattern_index_user when you create the question template index. When you import information into the index table, use the following SQL statement:
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='pattern', pattern_table_name='polar4ai_nl2sql_pattern_user') INTO pattern_index_user;
Check the task status
After you run the statement to import data into the index table, the task_id for the task is returned, such as bce632ea-97e9-11ee-bdd2-492f4dfe0918. You can use the following command to check the import status.
/*polar4ai*/SHOW TASK `bce632ea-97e9-11ee-bdd2-492f4dfe0918`;
When the return value is finish, the task is complete. The question template information can then be referenced by the model. You can run the following SQL statement to view the question template index information:
/*polar4ai*/SELECT * FROM pattern_index;
Note: If the data in the polar4ai_nl2sql_pattern table changes, you must run Step 3 (Import information from the question template table into the index table) again.
Use question templates online
You can run the following SQL statement to use LLM-based NL2SQL and question templates online:
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Query for courses with the course name Mathematics and status in progress') WITH (basic_index_name='schema_index', pattern_index_name='pattern_index');
Output:
SELECT course_name, course_time, course_location FROM courses WHERE course_name='Mathematics' AND status=1;
Parameter description:
After SELECT, enter the question that you want to convert into an SQL statement.
basic_index_name is the name of the search index table in the current database. pattern_index_name is the name of the question template index table.
You can configure multiple parameters in WITH() to set the related behavior. For more information about other parameters, see Use LLM-based NL2SQL online:

Parameter name | Description | Value range |
pattern_index_top | The number of most similar question templates to retrieve. The default value is 2, which selects the 2 most optimal templates for the current question. | [1,10] |
pattern_index_threshold | The threshold used to determine the similarity of retrieved question templates. The default value is 0.85. A question template is selected only if its vector match score exceeds 0.85. | (0,1] |
Build a configuration table
If you want to pre-process questions or post-process the final generated SQL statement, you can use a configuration table.
Scenarios
Scenario 1: Replace deterministic words in a question, such as names, industry terms, or product names.
For example, for all questions that involve Zhang San, you can replace Zhang San with ZS001. In this case, for the questions "What were Zhang San's sales last month?" and "What are Zhang San's total sales this year?", you can use the configuration table to pre-process them into "What were ZS001's sales last month?" and "What are ZS001's total sales this year?" before you call the large language model.
Scenario 2: Add extra information to questions that contain specific words.
For example, for all questions that involve total sales, you can add the calculation formula for total sales: Total sales = SUM(sales). Before you call the large language model, you can use the configuration table to add this type of information, which is appended when a question meets the corresponding condition.
Scenario 3: Map and replace values for specific tables or columns.
For example, for all final SQL statements that involve the student_courses table, you can replace status = 'on leave' with status = 0 as a fallback measure for column value mapping.
Syntax
The SQL statement to build the configuration table is as follows. The table name polar4ai_nl2sql_llm_config is fixed and cannot be changed.
DROP TABLE IF EXISTS `polar4ai_nl2sql_llm_config`;
CREATE TABLE `polar4ai_nl2sql_llm_config` (
`id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Primary key',
`is_functional` int(11) NOT NULL DEFAULT '1' COMMENT 'Whether it is effective',
`text_condition` text COMMENT 'Text condition',
`query_function` text COMMENT 'Query processing',
`formula_function` text COMMENT 'Formula information',
`sql_condition` text COMMENT 'SQL condition',
`sql_function` text COMMENT 'SQL processing',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
When the data in the polar4ai_nl2sql_llm_config table changes, you do not need to perform any operations. The changes take effect immediately.
Parameter description
Column name | Description |
is_functional | Indicates whether the configuration in this row is effective. When the configuration table exists, it is used by default for every NL2SQL operation. If there is a configuration item that you do not want to use but do not want to delete, you can set is_functional to 0 to disable the configuration for that row. |
text_condition | Pre-processing: Performs a text condition check on the question. If the condition is met, the query_function and formula_function columns are used for processing. Otherwise, they are not used. |
query_function | Pre-processing: Processes the question. This is used when the text_condition is met. |
formula_function | Pre-processing: Adds calculation formulas or other information related to specific business logic or concepts to the question. This is used when the text_condition is met. |
sql_condition | Post-processing: Performs a condition check on the SQL statement generated by the model. If the condition is met, sql_function is used to process the SQL statement. Otherwise, it is not used. |
sql_function | Post-processing: Processes the SQL statement. This can be used to enforce value mappings in business logic. This is used when the sql_condition is met. |
Example
Before you build the configuration table, run the following SQL statement:
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
Output:
SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 0 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;
Build the configuration table according to the Syntax section.
Add a configuration record. If the students table or the student_courses table is in the SQL statement, and the courses table is not, replace status = 0 with status = 10.
INSERT INTO `polar4ai_nl2sql_llm_config` (`is_functional`,`sql_condition`,`sql_function`) VALUES (1,'students||student_courses&&!!courses','{"replace":{"status = 0":"status = 10"}}');
Run the SQL statement from Step 1. The generated SQL statement now has the status value replaced.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
Output:
SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 10 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;
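The post-processing step above can be sketched in Python. The operator semantics (|| = OR, && = AND, !! = NOT, with && binding last) and the crude table-name extraction are inferred from this example, not from an official grammar, so treat this as an illustration of the behavior rather than the service's actual implementation:

```python
# Sketch: evaluate a sql_condition such as 'students||student_courses&&!!courses'
# against the generated SQL, then apply the sql_function replace map.
import json
import re

def tables_in(sql):
    # Crude table-name extraction: identifiers that follow FROM or JOIN.
    return set(re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", sql, re.I))

def condition_matches(condition, sql):
    tables = tables_in(sql)
    for and_term in condition.split("&&"):        # every &&-term must hold
        ok = False
        for atom in and_term.split("||"):         # any ||-alternative may hold
            if atom.startswith("!!"):
                ok = ok or atom[2:] not in tables  # !! negates membership
            else:
                ok = ok or atom in tables
        if not ok:
            return False
    return True

def apply_sql_function(sql_function, sql):
    for old, new in json.loads(sql_function).get("replace", {}).items():
        sql = sql.replace(old, new)
    return sql

generated = ("SELECT s.student_name, COUNT(sc.id) AS leave_count "
             "FROM students s JOIN student_courses sc ON s.id = sc.student_id "
             "WHERE sc.status = 0 GROUP BY s.student_name "
             "ORDER BY leave_count DESC LIMIT 2;")
cond = "students||student_courses&&!!courses"
func = '{"replace":{"status = 0":"status = 10"}}'
if condition_matches(cond, generated):
    generated = apply_sql_function(func, generated)
print(generated)
```

For the example condition, students is present, courses is absent, so the replace map rewrites status = 0 to status = 10, matching the output shown above.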
Customize table and column comments
If you find that you cannot modify the comments of an existing data table or column during the Standardize data tables process, you can add new comments for that table and its columns in the polar4ai_nl2sql_table_extra_info table. When you use LLM-based NL2SQL, the comments in this table overwrite the original comments.
Syntax
The SQL statement to create the custom table and column comments table is as follows. The table name polar4ai_nl2sql_table_extra_info is fixed and cannot be changed.
DROP TABLE IF EXISTS `polar4ai_nl2sql_table_extra_info`;
CREATE TABLE `polar4ai_nl2sql_table_extra_info` (
`id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Primary key',
`table_name` text COMMENT 'Table name',
`table_comment` text COMMENT 'Table description',
`column_name` text COMMENT 'Column name',
`column_comment` text COMMENT 'Column description',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
When the data in the polar4ai_nl2sql_table_extra_info table changes, you must run the step to import data from the data tables into the search index table again for the changes to take effect.
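The overwrite behavior can be pictured as a simple shadowing merge: when table information is assembled for the model, a comment from polar4ai_nl2sql_table_extra_info replaces the original comment for the same table and column. The plain dicts below are an illustration standing in for the two comment sources:

```python
# Sketch: custom comments shadow the original column comments when both exist.

original_comments = {
    ("student_courses", "status"): "Student status: 0-On leave, 1-Normal.",
    ("student_courses", "student_id"): "Student ID",
}
extra_info = {
    ("student_courses", "status"):
        "Student status: 0-On leave, 1-Normal, 2-Absent.",
}

effective = dict(original_comments)
effective.update(extra_info)   # custom comments overwrite the originals
print(effective[("student_courses", "status")])
```

Columns without a row in the extra-info table keep their original comments.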
Example
Create the custom table and column comments table.
Modify the comment for the status column in the student_courses table. In this example, a new option description is added for the status column: 2-Absent.
INSERT INTO `polar4ai_nl2sql_table_extra_info` (`table_name`,`table_comment`,`column_name`,`column_comment`) VALUES ('student_courses','Course and student information table','status','Student status: 0-On leave, 1-Normal, 2-Absent.');
Run the step to import data from the data tables into the search index table again.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='schema') INTO schema_index;
Check the task status. After you run the statement to import data into the search index table, the task_id for the task is returned, such as bce632ea-97e9-11ee-bdd2-492f4dfe0918. You can use the following command to check the import status. When taskStatus is finish, the task is complete.
/*polar4ai*/SHOW TASK `bce632ea-97e9-11ee-bdd2-492f4dfe0918`;
Use LLM-based NL2SQL online.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
Output:
SELECT s.student_name, COUNT(sc.id) AS absence_count FROM student_courses sc JOIN students s ON sc.student_id = s.id WHERE sc.status = 2 GROUP BY s.student_name ORDER BY absence_count DESC LIMIT 2;
From the preceding output, you can see that the LLM-based NL2SQL model has mapped "absent" to the column value 2 for the status column in the student_courses table.
Wide table support
If your data tables include wide tables with many columns, or if you receive a Please use column index to avoid oversize table information. error when you use LLM-based NL2SQL, you can use the following process to support wide tables.
When you use LLM-based NL2SQL online, you can specify column_index_name in WITH() together with basic_index_name and pattern_index_name. Unlike the schema and pattern resources, the column index is used only to simplify table information.
For most requests, the column_index_name parameter has no effect. It is required only for NL2SQL requests that trigger length limits, where it simplifies table information. This may cause a slight loss in accuracy but effectively avoids errors in the LLM-based NL2SQL model that are caused by an overly long prompt.
Build a column index table.
You can use a custom column index table name that complies with database specifications, but the name must not be the same as any existing table in the current database. In addition, only one column index table is required per database. The CREATE TABLE statement is as follows:
/*polar4ai*/CREATE TABLE column_index(id integer, table_name varchar, table_comment text_ik_max_word, column_name text_ik_max_word, column_comment text_ik_max_word, is_primary integer, is_foreign integer, vecs vector_768, ext text_ik_max_word, PRIMARY KEY (id));
Note: The column index table is not directly displayed in the database. To view related information, run /*polar4ai*/SHOW TABLES;. To delete the column index table, using column_index as an example, run /*polar4ai*/DROP TABLE IF EXISTS column_index;.
Import data from the data tables into the column index table at the column granularity.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='column') INTO column_index;
Check the task status. After you run the statement to import data into the column index table, the task_id for the task is returned, such as bce632ea-97e9-11ee-bdd2-492f4dfe0918. You can use the following command to check the import status. When taskStatus is finish, the task is complete.
/*polar4ai*/SHOW TASK `bce632ea-97e9-11ee-bdd2-492f4dfe0918`;
Use LLM-based NL2SQL for wide tables online.
/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Sort by teacher name in ascending alphabetical order, and display the teacher''s name and the name of the course they are assigned to teach.') WITH (basic_index_name='schema_index', column_index_name='column_index');

