AI Function integrates Large Language Model (LLM) capabilities directly into Serverless Spark, enabling data engineers to invoke LLMs within SQL or PySpark for tasks such as sentiment analysis, translation, and information extraction—without requiring an SDK, API, or model operations and maintenance.
Limits
Supported engine versions
esr-4.x: esr-4.6.0 and later.
esr-3.x: esr-3.5.0 and later.
esr-2.x: esr-2.9.0 and later.
Unsupported scenarios
Jobs submitted through the Kyuubi Gateway or Livy Gateway.
Enable configuration
Set the Spark configuration spark.emr.serverless.ai.function.enable to true when submitting a job or creating a session to enable AI Function.
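The same configuration can be set programmatically when building a session. The following is a minimal, hypothetical PySpark sketch: the conf key comes from this page, but whether a session-level setting is honored may depend on your environment, so prefer setting it at job submission if unsure.

```python
from pyspark.sql import SparkSession

# Enable AI Function via a Spark conf when creating the session.
# The app name is illustrative; the conf key is taken from this document.
spark = (
    SparkSession.builder
    .appName("ai-function-demo")
    .config("spark.emr.serverless.ai.function.enable", "true")
    .getOrCreate()
)
```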
Function list
The following table summarizes all built-in AI Functions in Serverless Spark.
Function | Description | Default model | Typical use cases
ai_query() | General-purpose model invocation interface that supports custom prompts, model selection, and generation parameters. | qwen-plus | Batch summarization, information extraction, calling external or self-hosted model endpoints.
ai_sentiment() | Sentiment analysis (positive, negative, or neutral). | qwen-plus | Public opinion monitoring, customer service comment analysis, user feedback categorization.
ai_fix_grammar() | Grammar and word choice correction. | qwen-plus | Automated polishing before generating external-facing content.
ai_classify() | Multi-class or multi-label classification. | qwen-plus | Ticket categorization (complaint, inquiry, or suggestion), news topic identification, log type annotation.
ai_extract() | Extracts predefined STRUCT fields from text. | qwen-plus | Key information extraction (client, amount, date), comment attribute parsing (product, issue, satisfaction).
ai_similarity() | Computes semantic similarity between two text segments (0.0 to 1.0). | qwen-plus | Deduplication, retrieval matching, Q&A pair similarity scoring, RAG similar document retrieval.
ai_translate() | Translates text into a specified language. | qwen-plus | End-to-end multilingual reporting, cross-border scenarios.
ai_sql_optimize() | Optimizes SQL statements by rewriting queries or suggesting improvements. | qwen-plus | Automatically suggests predicate pushdown, bucketing or partitioning, and join order improvements.
ai_sql_hive_to_spark() | Converts HiveQL to Spark SQL-compatible syntax. | qwen-plus | Hive job migration, UDF syntax adaptation, window function standardization.
ai_embedding() | Generates text semantic embeddings. | text-embedding-v4 | RAG vector database construction, semantic clustering, similarity-based retrieval, feature engineering.
General-purpose function: ai_query()
ai_query() is the general-purpose interface provided by AI Function for executing custom prompts to invoke large language models. It supports flexible integration with built-in or external services and is ideal for advanced scenarios not covered by standard task-specific functions—such as complex reasoning, multi-turn instructions, and structured output.
Function syntax
ai_query(prompt [, service_name] [, options])

Parameter description

Parameter | Type | Required | Description
prompt | STRING | Yes | The complete prompt sent to the model, including context, instruction, and input data. Explicitly specify the output format to improve stability.
service_name | STRING | No | Specifies the model service name to invoke. If omitted, the default model qwen-plus is used. For model service names registered via Model Service, see Model Service.
options | STRING | No | Additional invocation parameters in JSON string format that control generation behavior. For supported parameters, see Options parameter description.
Return type: STRING.
Usage examples
Without specifying a model
-- ai_query() without specifying a model
SELECT ai_query('Is the following review positive or negative?\n This thing is really no good');

Return value:
This review is negative. Reason: "really no good" is a clear negative expression with strong tone, indicating strong dissatisfaction or negative evaluation toward something. Therefore, the overall sentiment is negative.

Specifying a model

-- ai_query() specifying the 'my_qwen_service' model
SELECT ai_query('Is the following review positive or negative?\n This thing is really no good', 'my_qwen_service');

Return value:
<think> Hmm, the user wants me to determine if this review is positive or negative. The review says, "This thing is really no good." First, I need to understand what this means. The user is saying "this thing" is bad, so it should be a negative evaluation. Next, I consider common features of negative reviews. They usually contain clear negative words like "no good," "bad," "terrible," etc.—all typical indicators of negative sentiment. Also, the word "really" emphasizes the degree, but the core message is still dissatisfaction. Then, I think about the user's potential needs. They might be dissatisfied with a product or service and want to confirm whether this comment effectively conveys negativity. They may also want to know the direction of the sentiment or whether further action is needed. I should also consider deeper intentions. The user might be looking for ways to handle such comments or whether improvements are necessary. So, my response should clearly state that the comment is negative and explain why, ensuring the user understands without ambiguity. Finally, keep the answer concise, direct, and professional. </think> This review is **negative**.

Specifying a model and parameters

-- ai_query() specifying the 'my_qwen_service' model and the parameter
-- '{"chat_template_kwargs": {"enable_thinking": false}}' to disable the chain-of-thought (thinking) process
SELECT ai_query(
  'Is the following review positive or negative?\n This thing is really no good',
  'my_qwen_service',
  '{"chat_template_kwargs": {"enable_thinking": false}}'
);

Return value:
This review is **negative**.
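When options become non-trivial, hand-writing the JSON string is error-prone (booleans, nesting, quoting). A small Python sketch that builds the options argument for ai_query() with json.dumps; the prompt and service name here are placeholders:

```python
import json

# Build the options JSON string from a plain dict so nested values and
# booleans are serialized correctly (Python False -> JSON false).
options = {
    "chat_template_kwargs": {"enable_thinking": False},
    "temperature": 0.2,
}
options_json = json.dumps(options)

# Embed it in the SQL call; double any single quotes so the JSON survives
# inside a SQL single-quoted literal.
sql = (
    "SELECT ai_query('Summarize the following text: ...', "
    "'my_qwen_service', '" + options_json.replace("'", "''") + "')"
)
```

The resulting string can be passed to spark.sql() or submitted as a SQL job.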
Task-specific functions
Sentiment analysis: ai_sentiment()
ai_sentiment() is a dedicated sentiment analysis function that automatically identifies emotional polarity in text and returns positive, negative, or neutral results. Use it for user comment analysis, public opinion monitoring, and similar scenarios.
Function syntax
ai_sentiment(text [, service_name] [, options])

Parameter description

Parameter | Type | Required | Description
text | STRING | Yes | Text to analyze. Use complete sentences or paragraphs. Very short or meaningless text may reduce accuracy.
service_name | STRING | No | Specifies the model service name to invoke. If omitted, the default model qwen-plus is used. For model service names registered via Model Service, see Model Service.
options | STRING | No | Additional invocation parameters in JSON string format that control generation behavior. For supported parameters, see Options parameter description.
Return type: STRING.
Usage example
SELECT ai_sentiment('I absolutely love this phone! It’s amazing!');

Return value:
positive
Grammar correction: ai_fix_grammar()
ai_fix_grammar() is a dedicated grammar correction function that automatically detects and fixes spelling errors, grammatical issues, and awkward phrasing. Use it for user-generated content cleaning, customer service dialogue optimization, and document pre-processing.
Function syntax
ai_fix_grammar(text [, service_name] [, options])

Parameter description

Parameter | Type | Required | Description
text | STRING | Yes | Original text to correct. Use complete sentences or paragraphs.
service_name | STRING | No | Specifies the model service name to invoke. If omitted, the default model qwen-plus is used. For model service names registered via Model Service, see Model Service.
options | STRING | No | Additional invocation parameters in JSON string format that control generation behavior. For supported parameters, see Options parameter description.
Return type: STRING.
Usage example
SELECT ai_fix_grammar('He go to school yesterday and dont like math.');

Return value:
He went to school yesterday and didn't like math.
Text classification: ai_classify()
ai_classify() is a dedicated text classification function that automatically categorizes input text based on a provided list of labels. Use it for news categorization, ticket routing, comment topic identification, and other multi-class classification tasks.
Function syntax
ai_classify(text, labels [, service_name] [, options])

Parameter description

Parameter | Type | Required | Description
text | STRING | Yes | Text to classify. Use complete semantic units.
labels | ARRAY | Yes | Array of candidate classification labels, such as array('technology', 'sports', 'entertainment'). Labels should be distinct and avoid semantic overlap.
service_name | STRING | No | Specifies the model service name to invoke. If omitted, the default model qwen-plus is used. For model service names registered via Model Service, see Model Service.
options | STRING | No | Additional invocation parameters in JSON string format that control generation behavior. For supported parameters, see Options parameter description.
Return type: STRING.
Usage example
SELECT ai_classify(
  'Messi scored a spectacular free kick in the match',
  array('technology', 'sports', 'entertainment', 'finance')
) AS category;

Return value:
sports
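When label sets are maintained in application code, the array(...) literal can be generated rather than hand-written. A minimal Python sketch, assuming labels contain no single quotes (escaping rules depend on the SQL parser configuration):

```python
def labels_array_literal(labels):
    """Render a Spark SQL array('a', 'b', ...) literal from a Python list.
    Assumes labels contain no single quotes."""
    quoted = ", ".join("'" + label + "'" for label in labels)
    return "array(" + quoted + ")"

lit = labels_array_literal(["technology", "sports", "entertainment", "finance"])
sql = (
    "SELECT ai_classify('Messi scored a spectacular free kick in the match', "
    + lit + ") AS category"
)
```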
Information extraction: ai_extract()
ai_extract() is a dedicated information extraction function that pulls predefined structured fields from unstructured text. Use it for user comment parsing, contract key information extraction, log structuring, and similar scenarios.
Function syntax
ai_extract(text, labels [, service_name] [, options])

Parameter description

Parameter | Type | Required | Description
text | STRING | Yes | Source text from which to extract information.
labels | ARRAY | Yes | List of predefined fields to extract, such as array('country', 'age', 'name').
service_name | STRING | No | Specifies the model service name to invoke. If omitted, the default model qwen-plus is used. For model service names registered via Model Service, see Model Service.
options | STRING | No | Additional invocation parameters in JSON string format that control generation behavior. For supported parameters, see Options parameter description.
Return type: JSON.
Usage example
SELECT ai_extract(
  'I am Alice,from china . I am 6 years old. Welcome!',
  array('country', 'age', 'name')
);

Return value:
{"country": "china", "age": "6", "name": "Alice"}
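Because ai_extract() returns JSON, downstream code that collects the results (for example from a DataFrame) can parse them into typed fields. A short Python sketch using the return value above; note the extracted values come back as strings and need explicit casting:

```python
import json

# Sample ai_extract() output, taken from the example above.
raw = '{"country": "china", "age": "6", "name": "Alice"}'

record = json.loads(raw)
age = int(record["age"])  # values are strings in the JSON; cast as needed
```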
Similarity calculation: ai_similarity()
ai_similarity() is a semantic similarity function that measures how closely two pieces of text align in meaning. It uses cosine similarity on embeddings and is useful for duplicate detection, recommendation systems, and Q&A matching.
Function syntax
ai_similarity(text1, text2 [, service_name] [, options])

Parameter description

Parameter | Type | Required | Description
text1 | STRING | Yes | First text segment.
text2 | STRING | Yes | Second text segment.
service_name | STRING | No | Specifies the model service name to invoke. If omitted, the default model qwen-plus is used. For model service names registered via Model Service, see Model Service.
options | STRING | No | Additional invocation parameters in JSON string format that control generation behavior. For supported parameters, see Options parameter description.
Return type: DOUBLE.
Usage example
SELECT ai_similarity(
  'I enjoy hiking in the mountains.',
  'I love walking through mountain trails.'
);

Return value:
0.85
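As noted above, ai_similarity() is based on cosine similarity over embeddings. For intuition, here is the cosine similarity measure itself as a small Python sketch (the actual model-side computation is not exposed):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors:
    dot(a, b) / (|a| * |b|). Returns 1.0 for identical directions,
    0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Applied to the vectors returned by ai_embedding(), this reproduces the kind of score ai_similarity() returns.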
Translation: ai_translate()
ai_translate() is a dedicated text translation function that automatically translates input text into a target language. Use it for cross-language data analytics and internationalized content processing.
Function syntax
ai_translate(text, target_lang [, service_name] [, options])

Parameter description

Parameter | Type | Required | Description
text | STRING | Yes | Original text to translate.
target_lang | STRING | Yes | Target language code. Supported values: zh (Chinese), en (English), ja (Japanese), ko (Korean), ru (Russian), de (German), es (Spanish), fr (French), it (Italian).
service_name | STRING | No | Specifies the model service name to invoke. If omitted, the default model qwen-plus is used. For model service names registered via Model Service, see Model Service.
options | STRING | No | Additional invocation parameters in JSON string format that control generation behavior. For supported parameters, see Options parameter description.
Return type: STRING.
Usage example
SELECT ai_translate('Where are you from?', 'en');

Return value:
Where are you from?
SQL statement optimization: ai_sql_optimize()
ai_sql_optimize() is an SQL optimization suggestion function that analyzes input SQL queries and provides recommendations for predicate pushdown, bucketing or partitioning, and join ordering. Use it to speed up queries, save resources, and improve execution plans.
Function syntax
ai_sql_optimize(sql_text [, service_name] [, options])

Parameter description

Parameter | Type | Required | Description
sql_text | STRING | Yes | SQL query text to optimize. Include full table schema context when possible.
service_name | STRING | No | Specifies the model service name to invoke. If omitted, the default model qwen-plus is used. For model service names registered via Model Service, see Model Service.
options | STRING | No | Additional invocation parameters in JSON string format that control generation behavior. For supported parameters, see Options parameter description.
Return type: STRING.
Usage example
SELECT ai_sql_optimize(
  'SELECT * FROM logs WHERE date = "2025-01-01" AND user_id IN (SELECT id FROM users WHERE status = "active");'
);

Return value:
SELECT l.* FROM logs l JOIN users u ON l.user_id = u.id WHERE l.date = '2025-01-01' AND u.status = 'active' ;
Hive to Spark SQL conversion: ai_sql_hive_to_spark()
ai_sql_hive_to_spark() automatically converts HiveQL to Spark SQL-compatible syntax. It recognizes Hive-specific constructs (such as LATERAL VIEW and SORT BY) and rewrites them into equivalent Spark SQL forms. Use it when migrating Hive jobs to Spark environments.
Function syntax
ai_sql_hive_to_spark(hql_text [, service_name] [, options])

Parameter description

Parameter | Type | Required | Description
hql_text | STRING | Yes | HiveQL statement to convert. Supports complex queries, window functions, and other syntax structures.
service_name | STRING | No | Specifies the model service name to invoke. If omitted, the default model qwen-plus is used. For model service names registered via Model Service, see Model Service.
options | STRING | No | Additional invocation parameters in JSON string format that control generation behavior. For supported parameters, see Options parameter description.
Return type: STRING.
Usage example
SELECT ai_sql_hive_to_spark(
  'SELECT user_id, nvl2(email, "verified", "anonymous") AS user_type, current_date() AS load_date FROM users WHERE login_time >= date_sub(current_date(), 7)'
);

Return value:
SELECT user_id, CASE WHEN email IS NOT NULL THEN "verified" ELSE "anonymous" END AS user_type, current_date() AS load_date FROM users WHERE login_time >= date_sub(current_date(), 7)
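The CASE WHEN rewrite above preserves the standard nvl2 semantics: nvl2(x, a, b) returns a when x is non-NULL and b otherwise. For reference, those semantics as a tiny Python sketch (None standing in for SQL NULL):

```python
def nvl2(value, if_not_null, if_null):
    """Hive nvl2(x, a, b): return a when x is non-NULL, else b.
    Equivalent to: CASE WHEN x IS NOT NULL THEN a ELSE b END."""
    return if_not_null if value is not None else if_null
```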
Text embedding: ai_embedding()
ai_embedding() is a dedicated text embedding function that converts text into high-dimensional semantic vectors (embeddings). Use these embeddings for downstream AI tasks such as semantic retrieval, clustering, and similarity computation.
Function syntax
ai_embedding(text [, dimension] [, service_name] [, options])

Parameter description

Parameter | Type | Required | Description
text | STRING | Yes | Text to embed. Keep length reasonable to avoid truncation.
dimension | INT | No | Embedding dimension. Supported values: 2048, 1536, 1024 (default), 768, 512, 256, 128, 64.
service_name | STRING | No | Specifies the model service name to invoke. If omitted, the default model text-embedding-v4 is used. For model service names registered via Model Service, see Model Service.
options | STRING | No | Additional invocation parameters in JSON string format that control generation behavior. For supported parameters, see Options parameter description.
Return type: ARRAY<DOUBLE>.
Usage example
SELECT ai_embedding('This is a sentence', 256);

Return value:
[-0.08807944506406784, 0.041450440883636475, 0.027626311406493187, -0.03082999959588051, -0.027538539841771126, 0.030325308442115784, -0.0742553174495697, 0.12841078639030457, ..., 0.02087882161140442]
(256-dimensional vector; output truncated for brevity)
Options parameter description
This section covers only parameters supported by built-in Serverless Spark model services. For external model services, refer to the official documentation of the specific model you are using.
Parameter | Description | qwen-plus | text-embedding-v4
temperature | Sampling temperature that controls the diversity of generated text. Higher values increase diversity; lower values increase determinism. Range: [0, 2). | ✓ | ✗
top_p | Nucleus sampling probability threshold that controls text diversity. Higher values increase diversity. Range: (0, 1.0]. | ✓ | ✗
top_k | Size of the candidate set during generation. Larger values increase randomness; smaller values increase determinism. Must be ≥ 0. | ✓ | ✗
presence_penalty | Controls repetition in generated text. Range: [-2.0, 2.0]. Positive values reduce repetition; negative values increase it. | ✓ | ✗
seed | Random number seed that makes results reproducible for identical inputs and parameters. Range: [0, 2^31 - 1]. | ✓ | ✗
stop | Specifies stop words. Generation stops immediately when the model outputs any string or token_id listed here. | ✓ | ✗
response_format | Output format, such as {"type": "text"} or {"type": "json_object"}. | ✓ | ✗
| Specifies the vector encoding format. | ✗ | ✓
| Limits maximum input tokens. Input is truncated if exceeded. | ✓ | ✗
max_tokens | Limits maximum output tokens. Generation stops early if exceeded. Useful for controlling output length. | ✓ | ✗
enable_thinking | Enables thinking mode when using hybrid-thinking models. | ✓ | ✗
thinking_budget | Maximum length of the thinking process. | ✓ | ✗
logprobs | Returns log probabilities of output tokens. | ✓ | ✗
top_logprobs | Number of top candidate tokens to return at each generation step. Range: [0, 5]. Only effective when logprobs is enabled. | ✓ | ✗
enable_search | Enables web search during text generation. | ✓ | ✗
search_options | Web search strategy. Only effective when enable_search is enabled. | ✓ | ✗
chat_template_kwargs | Extra arguments passed to the chat template to control rendering behavior (such as enabling specific capability tags). | ✓ | ✗
parameters | Generic parameter wrapper used in DashScope native APIs. All of the above parameters are passed within this field in native interfaces. | ✓ | ✓
FAQ
Q: How do I disable the "think" process (such as chain-of-thought or reasoning steps) in the ai_query() function?
A: ai_query() supports controlling model output behavior via parameters. To disable the thinking process and return only the final result, pass the following configuration: {"chat_template_kwargs": {"enable_thinking": false}}.
The enable_thinking parameter applies only to Qwen-series models. For other models, consult the specific model's documentation for how to disable thinking mode. You can adjust output randomness and diversity using parameters such as temperature and top_k; refer to your model's official documentation for supported parameters.