
PolarDB:NL2SQL: Natural language querying

Last updated: Dec 15, 2025

To help users who are not familiar with SQL analyze data in databases, PolarDB for AI provides a proprietary AI model for Large Language Model-based Natural Language to SQL (LLM-based NL2SQL) translation. This model is built-in and ready to use. Compared to traditional natural language to SQL (NL2SQL) methods, the LLM-based NL2SQL model provides stronger language understanding. The SQL statements that it generates support more functions, such as date arithmetic. The model can even understand simple mappings, such as valid->isValid=1. After you make adjustments, it can also understand your common SQL patterns, such as defaulting to datastatus=1 in conditions.

Demo

/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
Output: SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 0 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;

End-to-end process for natural language to SQL translation

To help you efficiently implement NL2SQL applications, the process is divided into three phases: Quick Launch, Optimization and Fine-tuning, and Production Deployment.

  • Quick Launch: This phase helps you quickly build basic NL2SQL capabilities from scratch.

  • Optimization and Fine-tuning: We provide deep optimization for specific issues in actual business scenarios.

  • Production Deployment: This phase ensures the smooth deployment of the NL2SQL system into the production environment.

Prerequisites

  • Add an AI node and set a database account to connect to the AI node. For more information, see Enable the PolarDB for AI feature.

    Note
    • If you added an AI node when you purchased the cluster, you can directly set a database account for the AI node.

    • The database account for the AI node must have Read and Write permissions to read from and write to the target database.

  • Connect to the PolarDB cluster using the cluster endpoint. For more information, see Log on to PolarDB for AI.

    Important
    • When you connect to the cluster from the command line, add the -c option.

    • When you use the PolarDB for AI feature in Data Management (DMS), DMS connects to the PolarDB cluster by default using the Primary Endpoint. This prevents SQL statements from being routed to the AI node. You must manually change the connection endpoint to the Cluster Endpoint.

Usage notes

  • How to ask questions: In your questions, clearly state the conditions and related entity values. Place the condition first, followed by the entity for the column value that you are looking for, and finally the possible column name. Example:

    SELECT 'What is the property name of a "house" or "apartment" with more than one room?'

    Here, with more than one room is the condition, house and apartment are the entities corresponding to the column values, and property name is the possible column name.

  • Query result accuracy: LLM-based NL2SQL is an AI model based on a large language model, and its performance is influenced by many factors. To ensure that the query results meet your expectations, consider the following factors that may affect the overall accuracy:

    • Richness of table and column comments: Adding comments to each table and its columns improves the query accuracy.

    • Match between the user question and the column comments in the table: The closer the semantic match between the keywords in the user question and the column comments, the better the query performance is.

    • Length of the generated SQL statement: Queries are more accurate if the SQL statement involves fewer columns and has simpler conditions.

    • Logical complexity of the SQL statement: The less advanced syntax that is used in an SQL statement, the more accurate the query is.

Instructions

Standardize data tables

A prerequisite for NL2SQL is that the model must understand the meaning of your tables, including what the column names represent. Before you use LLM-based NL2SQL, you must add comments to frequently used data tables and their columns.

  • Table comments

    Table comments help the LLM-based NL2SQL model better understand the basic information of a table. This helps the model locate the tables that are involved in an SQL statement. Comments should provide a concise and clear overview of the main content of the table, such as orders or inventory. Keep comments within 10 characters and avoid excessive explanations.

  • Column comments

    Column comments usually consist of common nouns or phrases, such as order ID, date, or store name. These comments must accurately reflect the meaning of the column names. You can also add sample data or mappings to column comments. For example, for a column named isValid, the comment can be Indicates whether it is valid. 0: No. 1: Yes..
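    For example, you can add or update comments with standard MySQL DDL. The table and column definitions below are illustrative; replace them with your own schema:

    ALTER TABLE orders COMMENT 'Orders';
    ALTER TABLE orders MODIFY COLUMN order_date date COMMENT 'Order date';
    ALTER TABLE orders MODIFY COLUMN isValid tinyint COMMENT 'Indicates whether it is valid. 0: No. 1: Yes.';

    Note that MODIFY COLUMN restates the full column definition, so make sure the data type matches the existing one.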

Note

If you cannot modify the original comments, you can use the custom table and column comments feature in the advanced usage section to adjust the comments. For more information, see Advanced usage - Customize table and column comments.

Data preparation

Note

You can prepare a test dataset based on your business scenario. This topic uses the following test dataset: test_dataset.sql.

  1. Build a search index table

    To extract data from data tables, you must build a search index table. You can use a custom table name that complies with database specifications. This topic uses schema_index as an example. The SQL statement to build the search index table is as follows:

    /*polar4ai*/CREATE TABLE schema_index(id integer, table_name varchar, table_comment text_ik_max_word, table_ddl text_ik_max_word, column_names text_ik_max_word, column_comments text_ik_max_word, sample_values text_ik_max_word, vecs vector_768, ext text_ik_max_word, PRIMARY KEY (id));
    Note
    • The search index table is not directly displayed in the database. To view related information, you can run the SQL statement /*polar4ai*/SHOW TABLES;.

    • To delete the search index table, using schema_index as an example, you can run the SQL statement /*polar4ai*/DROP TABLE IF EXISTS schema_index;.

  2. Import data from the data tables into the search index table

    After you run the following SQL statement, PolarDB for AI performs vectorization on all tables in the current database. Column values are sampled only if you set the to_sample parameter.

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='schema') INTO schema_index;

    Parameter description

    • _polar4ai_text2vec is the text-to-vector model.

    • After INTO, specify the name of the search index table that you created in Step 1. Build a search index table.

    • You can configure multiple parameters in WITH() to set the related behavior:

      • mode (required): The data writing mode. Fixed to async for asynchronous mode.

      • resource (required): The resource type. Fixed to schema to vectorize table information.

      • tables_included (optional): The tables to vectorize. Default value: '', which vectorizes all tables. To specify multiple tables, separate the table names with commas (,) and join them into a single string.

      • to_sample (optional): Specifies whether to sample column values. Sampling increases the time it takes to import data into the search index table, but improves the quality of the generated SQL when a table has few columns (fewer than 15). Valid values: 0 (default), which does not sample column values; 1, which samples column values.

      • columns_excluded (optional): The columns to exclude from LLM-based NL2SQL operations. Default value: '', which means that all columns in all vectorized tables participate in subsequent LLM-based NL2SQL operations. To exclude columns, concatenate them into a string in the table_name1.column_name1,table_name1.column_name2,table_name2.column_name1 format.

      Example: The following SQL statement performs vectorization on the graph_info, image_info, and text_info tables in the current database and samples the column values. The time column in the graph_info table and the ext column in the text_info table are excluded from subsequent LLM-based NL2SQL operations.

      /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='schema', tables_included='graph_info,image_info,text_info', to_sample=1, columns_excluded='graph_info.time,text_info.ext') INTO schema_index;
  3. Check the task status

    After you run the statement to import data into the search index table, the task_id for the task is returned, such as bce632ea-97e9-11ee-bdd2-492f4dfe0918. You can use the following command to check the import status. When the return value is finish, the task is complete.

    /*polar4ai*/SHOW TASK `bce632ea-97e9-11ee-bdd2-492f4dfe0918`;

    You can run the following SQL statement to view the search index information:

    /*polar4ai*/SELECT * FROM schema_index;

Use LLM-based NL2SQL online

Syntax

/*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT '<question>') WITH (basic_index_name='<basic_index_name>');

Parameter description

  • Replace <question> with the question that you want to convert into an SQL statement. The following are some recommended sample questions:

    This example scenario:

    • Sort by teacher name in ascending alphabetical order, and display the teacher's name and the name of the course they are assigned to teach.

    • Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.

    • Query the names and locations of courses that start between October 1, 2023, and October 3, 2023.

    • Count the number of courses each student has taken, filter for students who have taken more than 2 courses, and display their names and course counts in descending order of course count.

    • Filter for students whose addresses contain "Beijing" or "Shanghai" and display their names and phone numbers.

    Other scenarios:

    • What are the IDs, roles, and names of experts who have performed two or more treatments?

    • What is the name of the dog breed that is kept in the largest numbers?

    • Which owner paid the most for their dog's treatments? List the owner's ID and last name.

    • Tell me the ID and last name of the owner who spent the most on their dog's treatments.

    • What is the description of the treatment type with the lowest total cost?

  • You can configure multiple parameters in WITH() to set the related behavior:

    • basic_index_name (required): The name of the search index table in the current database.

    • to_optimize (optional): Specifies whether to perform SQL optimization. Valid values: 0 (default), which performs no optimization; 1, which makes PolarDB for AI further process the generated SQL statement so that it is more optimal.

    • basic_index_top (optional): The number of most similar tables to retrieve. Valid values: [1,10]. Default value: 3, which selects the 3 tables that best match the current question; a value of 1 is often sufficient. If a query involves multiple tables, set this parameter to 4 or higher to widen the retrieval and improve the results.

    • basic_index_threshold (optional): The similarity threshold for retrieved tables. Valid values: (0,1]. Default value: 0.1. A table is selected only if its vector match score exceeds this threshold.
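    The optional parameters can be combined in a single call. The following sketch reuses this topic's question and index table; the parameter values are illustrative, not recommendations:

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index', to_optimize=1, basic_index_top=4, basic_index_threshold=0.2);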

Examples

  • Display specific content and sort it in a specified order.

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Sort by teacher name in ascending alphabetical order, and display the teacher''s name and the name of the course they are assigned to teach.') WITH (basic_index_name='schema_index');
    SELECT t.teacher_name, c.course_name FROM teachers t JOIN courses c ON t.id = c.teacher_id ORDER BY t.teacher_name ASC;
  • Search based on specified conditions, display the content that meets the conditions, and sort it in a specified order.

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
    SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 0 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;

Advanced usage

PolarDB for AI provides four advanced usage methods. If you encounter the following issues, you can refer to the corresponding advanced usage instructions to resolve them.

  • Configure question templates: Configure general question templates to enable the model to generate SQL statements based on specific knowledge.

  • Build a configuration table: Pre-process questions or post-process the generated SQL statements.

  • Customize table and column comments: If the comments of the original table or columns cannot be modified, you can add new comments for the table and its columns to overwrite the original comments.

  • Wide table support: When a table has too many columns or you receive a Please use column index to avoid oversize table information. error, you can build a column index table to enable model support for wide tables.

Configure question templates

Question templates are created for specific domain knowledge and help the model better understand the current question. You can configure general question templates to guide the model with specific knowledge, which enables it to generate SQL statements based on that knowledge.

Instructions

  1. Create a question template table

    The name of the question template table must start with polar4ai_nl2sql_pattern, and its schema must include the five columns from the following `CREATE TABLE` statement.

    DROP TABLE IF EXISTS `polar4ai_nl2sql_pattern`;
    CREATE TABLE `polar4ai_nl2sql_pattern` (
      `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Primary key',
      `pattern_question` text COMMENT 'Template question',
      `pattern_description` text COMMENT 'Template description',
      `pattern_sql` text COMMENT 'Template SQL',
      `pattern_params` text COMMENT 'Template parameters',
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

    Question template column description

    • Template question (pattern_question)

      A question with parameters, provided as input to the LLM-based NL2SQL model. In the template question, write parameters in the #{XXX} format.

    • Template description (pattern_description)

      A summary of the template question, with some entities as parameters, such as date, year, or organization. These entities usually correspond to specific columns in a table. In the template description, write parameters in the [XXX] format, in the same order as the parameters in the template question.

    • Template SQL (pattern_sql)

      The correct SQL statement corresponding to the template question. This SQL statement must treat the parameters from the template question as variables.

      Note

      The parameters in the template question and template SQL can differ, but they must be related: the parameters should share a prefix, and their values must have a one-to-one mapping. For #{category} and #{categoryCode}, if the values of category are common trademark, special trademark, and collective trademark, the corresponding categoryCode values are 0, 1, and 2. For more information, see the example below.

    • Template parameters (pattern_params)

      A JSON string that consists of three fields: table_name, param_info, and explanation. The fields are as follows:

      • table_name (string): The name of the table in the template SQL.

      • param_info (array): Descriptions of the parameters in the template SQL.

        • param_name (string): The parameter name.

        • value (array): Sample values for the parameter.

          Note
          • If the parameter corresponds to a finite set of enumerated values, list all of the values.

          • If the values are only samples, list 2 to 4 of them.

          • If two parameters correspond to each other, map them by array index. For #{category} and #{categoryCode}, "common trademark" in category corresponds to 0 in categoryCode, and "special trademark" corresponds to 1 in categoryCode. For more information, see the example below.

      • explanation (string): Additional information. This usually includes requirements for the generated SQL statement, such as what information to output or descriptions of fields.

      Note

      If you do not need to specify template parameters, set the value to any of the following: NULL, an empty string, or an empty list string [].

    Example

    Example 1

    • Template question: Query for courses with course name #{courseName} and status #{status}

    • Template description: What are the courses with [Course Name] and [Status]?

    • Template SQL: SELECT course_name, course_time, course_location FROM courses WHERE course_name=#{courseName} AND status=#{statusCode}

    • Template parameters: [{"table_name":"courses","param_info":[{"param_name":"#{courseName}","value":["Mathematics","Physics","Chemistry","English","History","Geography","Biology","Computer Science","Art","Music","Physical Education","Programming","Literature","Psychology","Philosophy","Economics","Sociology","Physics Lab","Chemistry Lab","Biology Lab"]},{"param_name": "#{status}", "value": ["Not started","In progress"]},{"param_name": "#{statusCode}","value": [0,1]}], "explanation": "Output the course name (course_name), course time (course_time), and course location (course_location). Note: status is a constant mapping type. The variable mapping field is statusCode."}]

    Example 2

    • Template question: What are the national standards planned for release in the year #{issueDate} for projects with the status #{projectStat}?

    • Template description: What are the national standards planned for release in [Year] for projects with [Project Status]?

    • Template SQL: SELECT DISTINCT planNum, projectCnName, projectStat FROM sy_cd_me_buss_std_gjbzjh WHERE `planNum` IS NOT NULL AND `dataStatus` != 3 AND `isValid` = 1 AND projectStat=#{projectStat} AND DATE_FORMAT(`issueDate`, '%Y')=#{issueDate}

    • Template parameters: [{"table_name":"sy_cd_me_buss_std_gjbzjh","param_info":[{"param_name":"#{issueDate}","value":[2009,2010,2011,2012]},{"param_name":"#{projectStat}","value":["Seeking comments","Published","Under review"]}],"explanation":"Output the standard name (projectCnName), plan number, and project status"}]

    Example 3

    • Template question: What are the trademarks of type #{category} and international classification #{intCls}?

    • Template description: What are the trademarks of [Trademark Type] and [International Classification]?

    • Template SQL: SELECT DISTINCT tmName, regNo, status FROM sy_cd_me_buss_ip_tdmk_new WHERE dataStatus!=3 AND isValid = 1 AND category=#{categoryCode} AND intCls=#{intClsCode}

    • Template parameters: [{"table_name":"sy_cd_me_buss_ip_tdmk_new","param_info":[{"param_name":"#{intCls}","value":["Chemical raw materials","Paints and varnishes","Cosmetics and cleaning preparations","Industrial oils and greases","Pharmaceuticals"]},{"param_name":"#{category}","value":["Common trademark","Special trademark","Collective trademark"]},{"param_name":"#{intClsCode}","value":[1,2,3,4,5]},{"param_name":"#{categoryCode}","value":[0,1,2]}],"explanation":"Output the trademark name (tmName), application/registration number (regNo), and trademark status (status). Note: category is a constant mapping type. The variable mapping field is categoryCode. intCls is a constant mapping type. The variable mapping field is intClsCode."}]

    Using the scenario in this topic as an example, create the first template question in the preceding table. Run the following SQL statement:

    INSERT INTO `polar4ai_nl2sql_pattern` (`pattern_question`,`pattern_description`,`pattern_sql`,`pattern_params`) VALUES ('Query for courses with course name #{courseName} and status #{status}','What are the courses with [Course Name] and [Status]?','SELECT course_name, course_time, course_location FROM courses WHERE course_name=#{courseName} AND status=#{statusCode}','[{"table_name":"courses","param_info":[{"param_name":"#{courseName}","value":["Mathematics","Physics","Chemistry","English","History","Geography","Biology","Computer Science","Art","Music","Physical Education","Programming","Literature","Psychology","Philosophy","Economics","Sociology","Physics Lab","Chemistry Lab","Biology Lab"]},{"param_name": "#{status}", "value": ["Not started","In progress"]},{"param_name": "#{statusCode}","value": [0,1]}], "explanation": "Output the course name (course_name), course time (course_time), and course location (course_location). Note: status is a constant mapping type. The variable mapping field is statusCode."}]');
  2. Create a question template index table

    You can use a custom index table name that complies with database specifications. This topic uses pattern_index as an example. The SQL statement to create the question template index table is as follows:

    /*polar4ai*/CREATE TABLE pattern_index(id integer, pattern_question text_ik_max_word, pattern_description text_ik_max_word, pattern_sql text_ik_max_word, pattern_params text_ik_max_word, pattern_tables text_ik_max_word, vecs vector_768, PRIMARY KEY (id));
    Note
    • The question template index table is not directly displayed in the database. To view related information, you can run the SQL statement /*polar4ai*/SHOW TABLES;.

    • To delete the question template index table, using pattern_index as an example, you can run the SQL statement /*polar4ai*/DROP TABLE IF EXISTS pattern_index;.

  3. Import information from the question template table into the index table

    Note

    The question template table cannot be empty. Before you run the SQL statement to import data into the index table, you must add at least one record.

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='pattern') INTO pattern_index;

    Parameter description

    • _polar4ai_text2vec is the text-to-vector model.

    • After INTO, specify the name of the question template index table that you created in Step 2. Create a question template index table.

    • You can configure multiple parameters in WITH() to set the related behavior:

      • mode (required): The data writing mode. Fixed to async for asynchronous mode.

      • resource (required): The resource type. Fixed to pattern to vectorize question template information.

      • pattern_table_name (optional): The name of the question template table to vectorize. This is the table that you created in Step 1: Create a question template table. Default value: polar4ai_nl2sql_pattern, which vectorizes the polar4ai_nl2sql_pattern table. If you set this parameter, specify a question template table whose name starts with polar4ai_nl2sql_pattern.

        To maintain different question template index tables for different scenarios or businesses, specify different table names when you create the question template tables. For example, create the polar4ai_nl2sql_pattern_user table for user-related scenarios, set the index table name to pattern_index_user in Step 2, and then import the information with the following SQL statement:

      /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='pattern', pattern_table_name='polar4ai_nl2sql_pattern_user') INTO pattern_index_user;
  4. Check the task status

    After you run the statement to import data into the index table, the task_id for the task is returned, such as bce632ea-97e9-11ee-bdd2-492f4dfe0918. You can use the following command to check the import status.

    /*polar4ai*/SHOW TASK `bce632ea-97e9-11ee-bdd2-492f4dfe0918`;

    When the return value is finish, the task is complete. The question template information can then be referenced by the model. You can run the following SQL statement to view the question template index information:

    /*polar4ai*/SELECT * FROM pattern_index;
    Note

    If the data in the polar4ai_nl2sql_pattern table changes, you must run Step 3. Import information from the question template table into the index table again.
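    The re-import uses the same statement as Step 3:

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='pattern') INTO pattern_index;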

  5. Use question templates online

    You can run the following SQL statement to use LLM-based NL2SQL and question templates online:

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Query for courses with the course name Mathematics and status in progress') WITH (basic_index_name='schema_index', pattern_index_name='pattern_index');
    SELECT course_name, course_time, course_location FROM courses WHERE course_name='Mathematics' AND status=1;

    Parameter description

    • After SELECT, enter the question that you want to convert into an SQL statement.

    • basic_index_name is the name of the search index table in the current database.

    • pattern_index_name is the name of the question template index table.

    • You can configure multiple parameters in WITH() to set the related behavior. For more information about other parameters, see Use LLM-based NL2SQL online:

      • pattern_index_top: The number of most similar question templates to retrieve. Valid values: [1,10]. Default value: 2, which selects the 2 templates that best match the current question.

      • pattern_index_threshold: The similarity threshold for retrieved question templates. Valid values: (0,1]. Default value: 0.85. A question template is selected only if its vector match score exceeds this threshold.
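    For example, the following sketch tightens template retrieval; the parameter values are illustrative, not recommendations:

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Query for courses with the course name Mathematics and status in progress') WITH (basic_index_name='schema_index', pattern_index_name='pattern_index', pattern_index_top=1, pattern_index_threshold=0.9);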

Build a configuration table

If you want to pre-process questions or post-process the final generated SQL statement, you can use a configuration table.

Scenarios

  • Scenario 1: Replace deterministic words in a question, such as names, industry terms, or product names.

    For example, for all questions that involve Zhang San, you can replace Zhang San with ZS001. In this case, for the questions What were Zhang San's sales last month? and What are Zhang San's total sales this year?, you can use the configuration table to pre-process them into What were ZS001's sales last month? and What are ZS001's total sales this year? before you call the large language model.

  • Scenario 2: Add extra information to questions that contain specific words.

    For example, for all questions that involve total sales, you can add the calculation formula for total sales: Total sales = SUM(sales). Before you call the large language model, you can use the configuration table to add this type of information, which is appended when a question meets the corresponding condition.

  • Scenario 3: Map and replace values for specific tables or columns.

    For example, for all final SQL statements that involve the student_courses table, you can replace status = 'on leave' with status = 0 as a fallback measure for column value mapping.

Syntax

The SQL statement to build the configuration table is as follows. The table name polar4ai_nl2sql_llm_config is fixed and cannot be changed.

DROP TABLE IF EXISTS `polar4ai_nl2sql_llm_config`;
CREATE TABLE `polar4ai_nl2sql_llm_config` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Primary key',
  `is_functional` int(11) NOT NULL DEFAULT '1' COMMENT 'Whether it is effective',
  `text_condition` text COMMENT 'Text condition',
  `query_function` text COMMENT 'Query processing',
  `formula_function` text COMMENT 'Formula information',
  `sql_condition` text COMMENT 'SQL condition',
  `sql_function` text COMMENT 'SQL processing',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Note

When the data in the polar4ai_nl2sql_llm_config table changes, you do not need to perform any operations. The changes take effect immediately.

Parameter description

Column name

Description

Value range

Example

is_functional

Indicates whether the configuration in this row is effective.

When the configuration table exists, it is used by default for every NL2SQL operation. If there is a configuration item that you do not want to use but do not want to delete, you can set is_functional to 0 to disable the configuration for that row.
  • 1 (default): Effective

  • 0: Not effective

  • If is_functional=1, the configuration in this row is effective.

  • If is_functional=0, the configuration in this row is not effective.

text_condition

Pre-processing: Performs a text condition check on the question.

If the condition is met, the query_function and formula_function columns are used for processing. Otherwise, they are not used.
  • Three conditional operators are supported: &&, ||, and !!, which represent AND, OR, and NOT, respectively.

  • If text_condition is empty or an empty string, it matches all questions.

If text_condition is Zhang San||Li Si&&!!Wang Wu, the condition is met if the question contains Zhang San, or if it contains Li Si and does not contain Wang Wu.

For example:

  • Question What are Zhang San's total sales this year?: The condition is met.

  • Question What are Li Si's total sales this year?: The condition is met.

  • Question What are Li Si and Wang Wu's total sales this year?: The condition is not met.

query_function

Pre-processing: Processes the question.

This is used when the text_condition is met.
  • Three processing methods are supported: append, delete, and replace.

  • The format must be a JSON string.

If query_function is set to {"append":["one","two"],"delete":["?"],"replace":{"Zhangsan":"a","Lisi":"b"}}, this means that when the text_condition is matched, one and two are appended to the query and ? is deleted. Finally, Zhangsan and Lisi are replaced with a and b respectively.

For example:

  • Question What are Zhang San's total sales this year?: When text_condition is met, it is processed into What are a's total sales this yearonetwo.

  • Question What are Li Si's total sales this year?: When text_condition is met, it is processed into What are b's total sales this yearonetwo.

formula_function

Pre-processing: Adds calculation formulas or other information related to specific business logic or concepts to the question.

This is used when the text_condition is met.

-

If formula_function is Total sales: SUM(sales), then during final processing, the SUM(sales) formula is used as additional information for "Total sales" in the question.

sql_condition

Post-processing: Performs a condition check on the SQL statement generated by the model.

If the condition is met, sql_function is used to process the SQL statement. Otherwise, it is not used.

  • Three conditional operators are supported: &&, ||, and !!, which represent AND, OR, and NOT, respectively.

  • If sql_condition is empty or an empty string, it matches all generated SQL statements.

If sql_condition=students||student_courses&&!!courses, the condition is met if the students table or the student_courses table is in the SQL statement, and the courses table is not in the SQL statement.

For example:

  • SQL statement SELECT * FROM student_courses: The condition is met.

  • SQL statement SELECT c.course_name FROM student_courses sc JOIN courses c ON sc.courses_id = c.id;: The condition is not met.

sql_function

Post-processing: Processes the SQL statement. This can be used to enforce value mappings in business logic.

This is used when the sql_condition is met.

  • Only the replace processing method is supported.

  • The format must be a JSON string.

If sql_function={"replace":{"status = 'on leave'":"status = 0","status = 'present'":"status = 1"}}, it means that if sql_condition is met, status = 'on leave' in the SQL statement is replaced with status = 0, and status = 'present' is replaced with status = 1.
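
Taken together, sql_condition and sql_function amount to: determine which tables the generated SQL statement references, and if the condition holds, apply the replacements. The Python sketch below illustrates this; the table-name extraction is a deliberately naive regex (identifiers after FROM or JOIN), and both function names are hypothetical:

```python
import json
import re


def referenced_tables(sql: str) -> set:
    # Naive extraction: identifiers that immediately follow FROM or JOIN.
    return set(re.findall(r"\b(?:FROM|JOIN)\s+`?(\w+)`?", sql, re.IGNORECASE))


def postprocess_sql(sql_condition: str, sql_function_json: str, sql: str) -> str:
    tables = referenced_tables(sql)

    def met(cond: str) -> bool:
        if not cond:  # an empty condition matches every statement
            return True
        # '||' separates OR clauses; '&&' separates AND terms;
        # a '!!' prefix means the table must NOT be referenced.
        return any(
            all((t[2:] not in tables) if t.startswith("!!") else (t in tables)
                for t in clause.split("&&"))
            for clause in cond.split("||")
        )

    if met(sql_condition):
        for old, new in json.loads(sql_function_json).get("replace", {}).items():
            sql = sql.replace(old, new)
    return sql


cond = "students||student_courses&&!!courses"
fn = '{"replace":{"status = \'on leave\'":"status = 0","status = \'present\'":"status = 1"}}'
print(postprocess_sql(cond, fn, "SELECT * FROM student_courses WHERE status = 'on leave'"))
# SELECT * FROM student_courses WHERE status = 0
```

Note that the documented condition check is table-aware: `SELECT * FROM student_courses` meets the condition even though the table name contains the substring `courses`, which is why the sketch matches against extracted table names rather than raw substrings.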

Example

The following sample records show how the configuration fields are combined. Fields that are not listed for a record are left empty.

  • Record 1 (question pre-processing): is_functional=1, text_condition=Zhang San||Li Si&&!!Wang Wu, query_function={"append":["one","two"],"delete":["?"],"replace":{"Zhang San":"a","Li Si":"b"}}

  • Record 2 (formula injection): is_functional=1, formula_function=Total sales: SUM(sales)

  • Record 3 (SQL post-processing): is_functional=1, sql_condition=students||student_courses&&!!courses, sql_function={"replace":{"status = 'on leave'":"status = 0","status = 'present'":"status = 1"}}

  1. Before you add the configuration table, run the following SQL statement:

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
    SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 0 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;
  2. Build the configuration table according to the Syntax.

  3. Add a configuration record. If the students table or the student_courses table is in the SQL statement, and the courses table is not, replace status = 0 with status = 10.

    INSERT INTO `polar4ai_nl2sql_llm_config` (`is_functional`,`sql_condition`,`sql_function`) VALUES (1,'students||student_courses&&!!courses','{"replace":{"status = 0":"status = 10"}}');
  4. Run the SQL statement from Step 1. The generated SQL statement now has the status value replaced.

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
    SELECT s.student_name, COUNT(sc.id) AS leave_count FROM students s JOIN student_courses sc ON s.id = sc.student_id WHERE sc.status = 10 GROUP BY s.student_name ORDER BY leave_count DESC LIMIT 2;

Customize table and column comments

If you find that you cannot modify the comments of an existing data table or column during the Standardize data tables process, you can add new comments for that table and its columns in the polar4ai_nl2sql_table_extra_info table. When you use LLM-based NL2SQL, the comments in this table overwrite the original comments.

Syntax

The SQL statement to create the custom table and column comments table is as follows. The table name polar4ai_nl2sql_table_extra_info is fixed and cannot be changed.

DROP TABLE IF EXISTS `polar4ai_nl2sql_table_extra_info`;
CREATE TABLE `polar4ai_nl2sql_table_extra_info` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Primary key',
  `table_name` text COMMENT 'Table name',
  `table_comment` text COMMENT 'Table description',
  `column_name` text COMMENT 'Column name',
  `column_comment` text COMMENT 'Column description',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Note

When the data in the polar4ai_nl2sql_table_extra_info table changes, you must rerun the step that imports data from the data tables into the search index table for the changes to take effect.

Example

  1. Create the custom table and column comments table.

  2. Modify the comment for the status column in the student_courses table. In this example, a new option description is added for the status column: 2-Absent.

    INSERT INTO `polar4ai_nl2sql_table_extra_info` (`table_name`,`table_comment`,`column_name`,`column_comment`) VALUES ('student_courses','Course and student information table','status','Student status: 0-On leave, 1-Normal, 2-Absent.');
  3. Run the step to import data from the data tables into the search index table again.

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='schema') INTO schema_index;
  4. Check the task status.

    After you run the statement to import data into the search index table, the task_id for the task is returned, such as bce632ea-97e9-11ee-bdd2-492f4dfe0918. You can use the following command to check the import status. When taskStatus is finish, the task is complete.

    /*polar4ai*/SHOW TASK `bce632ea-97e9-11ee-bdd2-492f4dfe0918`;
  5. Use LLM-based NL2SQL online.

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Filter for the 2 students with the most absences, sort them by the number of absences in descending order, and display their names and number of absences.') WITH (basic_index_name='schema_index');
    SELECT s.student_name, COUNT(sc.id) AS absence_count FROM student_courses sc JOIN students s ON sc.student_id = s.id WHERE sc.status = 2 GROUP BY s.student_name ORDER BY absence_count DESC LIMIT 2;

    From the preceding output, you can see that the LLM-based NL2SQL model has mapped absent to the column value 2 for the status column in the student_courses table.

Wide table support

If your data tables include wide tables with many columns, or if you receive a Please use column index to avoid oversize table information. error when you use LLM-based NL2SQL, you can use the following process to support wide tables.

Note

When you use LLM-based NL2SQL online, both basic_index_name and pattern_index_name in WITH() can be combined with the same column_index_name. Unlike the schema index or the pattern index, the column index is used only to simplify table information.

For most requests, the column_index_name parameter has no effect. It is required only for NL2SQL requests that trigger length limits, where it simplifies table information. This may cause a slight loss in accuracy but effectively avoids errors in the LLM-based NL2SQL model that are caused by an overly long prompt.

  1. Build a column index table.

    You can use a custom column index table name that complies with database specifications, but the name must not be the same as any existing table in the current database. In addition, only one column index table is required per database. The `CREATE TABLE` statement is as follows:

    /*polar4ai*/CREATE TABLE column_index(id integer, table_name varchar, table_comment text_ik_max_word, column_name text_ik_max_word, column_comment text_ik_max_word, is_primary integer, is_foreign integer, vecs vector_768, ext text_ik_max_word, PRIMARY KEY (id));
    Note
    • The column index table is not directly displayed in the database. To view related information, you can run the SQL statement /*polar4ai*/SHOW TABLES;.

    • To delete the column index table, using column_index as an example, you can run the SQL statement /*polar4ai*/DROP TABLE IF EXISTS column_index;.

  2. Import data from the data tables into the column index table at the column granularity.

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_text2vec, SELECT '') WITH (mode='async', resource='column') INTO column_index;
  3. Check the task status.

    After you run the statement to import data into the column index table, the task_id for the task is returned, such as bce632ea-97e9-11ee-bdd2-492f4dfe0918. You can use the following command to check the import status. When taskStatus is finish, the task is complete.

    /*polar4ai*/SHOW TASK `bce632ea-97e9-11ee-bdd2-492f4dfe0918`;
  4. Use LLM-based NL2SQL for wide tables online.

    /*polar4ai*/SELECT * FROM PREDICT (MODEL _polar4ai_nl2sql, SELECT 'Sort by teacher name in ascending alphabetical order, and display the teacher''s name and the name of the course they are assigned to teach.') WITH (basic_index_name='schema_index', column_index_name='column_index');

FAQ

Syntax error when you execute an SQL statement in DMS

If you encounter the following error when you run an AI SQL statement (an SQL statement prefixed with /*polar4ai*/) for the PolarDB for AI feature in DMS: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'xxx' at line xxx, check whether the DMS connection endpoint is the Cluster Endpoint of the PolarDB cluster.

The PolarDB for AI feature must run on an AI node. However, DMS connects to the PolarDB cluster using the Primary Endpoint by default. Therefore, you need to modify the DMS connection endpoint as follows.

  1. After you connect to the cluster through DMS, in the Database Instances > Logged-in Instances list in the navigation pane on the left, select the target cluster, right-click it, and select Edit Instance.

  2. In the Edit Instance dialog box, change Basic Information > Entry Method to Connection String, and enter the Cluster Endpoint of the cluster. Click Save.

  3. The original SQL window still connects to the cluster through the Primary Endpoint. After you modify the Connection String, close that window and open a new one before you run the SQL statement.