Built-in content filters cover common risk categories, but every platform has unique moderation needs — off-site traffic diversion, competitor attacks, or industry-specific violations. The Customize Detection Agent feature in AI Guardrails lets you describe what to detect in plain language, and uses large language models (LLMs) to apply your rules at scale.
When to use custom detection agents
Use the Customize Detection Agent feature when built-in content filters don't cover your use case. It's best suited for:
Topic-based detection: Detecting content that belongs to a semantic category you define, such as "off-site traffic diversion" or "malicious competitor comparisons"
Multi-class classification: Detecting and labeling content across several custom categories in a single pass
How it works
You define detection tags and their descriptions. Each tag is a category the LLM will look for.
The system combines your tags, a preset scenario template, and a fixed output format into a complete prompt.
The prompt is sent to the LLM you select. The LLM returns a structured moderation result.
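The assembly step above can be sketched roughly as follows. This is a minimal illustration, not the service's actual implementation: `DetectionTag`, `build_prompt`, and the placeholder preamble and output format are hypothetical names, and the real scenario template and output format are preset by AI Guardrails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionTag:
    """One user-defined category (hypothetical container for illustration)."""
    name: str          # the Audit Tag
    description: str   # the Description


def build_prompt(scenario_preamble: str, tags: list, output_format: str) -> str:
    """Combine a scenario preamble, numbered tags, and an output format into one prompt."""
    tag_lines = [
        f"{index}. {tag.name}: {tag.description}"
        for index, tag in enumerate(tags, start=1)
    ]
    return "\n".join([
        scenario_preamble,
        "The tags for moderation are as follows:",
        *tag_lines,
        "Here is a sample for moderation. Determine if the text matches "
        "any of the tags described above.",
        f"Strictly follow the format below for the output: {output_format}",
    ])


# Placeholder values: the real preamble and output format are fixed by the service.
example_prompt = build_prompt(
    scenario_preamble="<preset scenario template>",
    tags=[
        DetectionTag(
            "Off-site traffic diversion",
            "Content that directs users to other platforms or off-site channels.",
        ),
    ],
    output_format="<preset output format>",
)
```

Only the tags are under your control; the sketch shows why tag names and descriptions end up verbatim in the prompt, so their wording directly shapes what the LLM looks for.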
The complete prompt looks like this:
You are a senior ****** moderation expert, specializing in ******.
The business problem you face is ******, and the task objective is ******.
The tags for moderation are as follows:
1. Off-site traffic diversion: Content that directs users to other platforms or off-site channels...
2. Malicious negative reviews for brand xx: Unfounded malicious comparisons, false negative reviews...
Here is a sample for moderation. Determine if the text matches any of the tags described above.
Strictly follow the format below for the output: ******.
Set up custom detection agents
Step 1: Enable the Customize Detection Agent feature
Log on to the AI Guardrails console. Prerequisite: You need to activate AI Guardrails before proceeding.
In the left navigation bar, choose Protection Configuration > Configuration.
Select the service you want to configure. The following services support custom detection agents:
| Service | System identifier | Use case |
|---|---|---|
| AI Input Content Security Check | query_security_check_intl | Moderate user-submitted queries |
| AI Generated Content Security Check | response_security_check_intl | Moderate LLM-generated responses |
Click Actions in the Management column to go to Configuration.
If Customize Detection Agent is not enabled, enable it on this page. This feature is billed separately. For details, see Billing.
Step 2: Select a model
On the Customize Detection Agent card, click Configuration Management in the lower-right corner to open the configuration page.
Under Select large model, choose the model that fits your use case:
| Model | Best for | Notes |
|---|---|---|
| Qwen3Guard-Gen-4B | Security-focused scenarios requiring broad language support | Supports 9 risk categories, 3 risk levels, and 119 languages and dialects. Currently available only in Singapore. Requires a specific input format — see below. |
| Qwen3_Plus | Complex, high-performance scenarios where some latency is acceptable | Balanced performance, speed, and cost. |
| Qwen3_Flash | Simple tasks | Fast and cost-effective. |
The model you select affects billing. Different models have different billing methods. For details, see Activation and billing overview.
Input format for Qwen3Guard-Gen-4B:
To moderate a query: pass the query directly to the `content` field.
To moderate a response: concatenate the query and response as `query<|interval|>response` and pass the result to the `content` field.
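The two input rules for Qwen3Guard-Gen-4B can be captured in a small helper. This is a sketch: `build_guard_content` is a hypothetical function name, and only the `<|interval|>` separator and the `content` field come from the rules above.

```python
from typing import Optional

INTERVAL_TOKEN = "<|interval|>"  # separator between query and response


def build_guard_content(query: str, response: Optional[str] = None) -> str:
    """Return the string to pass in the `content` field.

    Query moderation: the query is passed through unchanged.
    Response moderation: the query and response are joined with the separator.
    """
    if response is None:
        return query
    return f"{query}{INTERVAL_TOKEN}{response}"


query_only = build_guard_content("How do I reset my password?")
query_and_response = build_guard_content(
    "How do I reset my password?",
    "Open Settings and choose Reset password.",
)
```

Keeping the separator in one constant avoids typos when the same helper is used for both services.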
Step 3: Configure detection tags
Under Configure custom prompt, set up your detection tags.
Choose a scenario template
Select a template under Select a scenario template. The currently available template is:
Custom Tag Template: For general scenarios. Supports configuring custom detection tags.
Add detection tags
Under Detection configuration, add the categories you want to detect. Each detection tag requires two fields:
| Field | Description |
|---|---|
| Audit Tag | A noun or noun phrase naming the category (for example, Off-site traffic diversion). |
| Description | A precise definition of what the category includes. Optionally, include one to three example phrases to illustrate boundary cases. |
Example detection tags:
| Audit tag | Description |
|---|---|
| Off-site traffic diversion | Behavior that directs users to other platforms or channels off-site through direct guidance or subtle hints, including variations and metaphors. This includes explicitly mentioning competitor platform names or their variations, such as common competitor xx. It also includes mentioning other off-site platforms or their variations, or including explicit contact information. |
| Malicious negative reviews for brand xx | Unfounded malicious attacks, false negative reviews against brand xx, or false slander and rumors against the brand's founder that intentionally damage the image of the brand or founder. For example: "xx is all false advertising, far inferior to brand xx." |
Best practices for writing detection tags
Writing clear detection tags directly affects accuracy. Follow these guidelines:
Do:
Define the category precisely and concisely. A clear definition reduces false positives and false negatives. For example: "Content that directs users to third-party platforms through direct links, implied references, or variations of competitor names."
Write the Audit Tag as a noun or noun phrase that names the category.
Write the Description as a definition — describe what the content *is*, not what to do with it.
Include one to three short example phrases in the Description if the boundary is ambiguous.
Don't:
Write the Description as an instruction. For example, "Block all content mentioning competitor names" is an instruction, not a definition.
Define a category using negation. For example, "All content except product discussions" doesn't define what to detect.
Use the Audit Tag or Description to match specific words or entity names. Use the word filter for that.
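Some of these guidelines can be checked mechanically before you paste tags into the console. The following is a rough heuristic sketch, not a feature of AI Guardrails: `lint_tag`, the verb list, and the negation pattern are all assumptions chosen for illustration, and they will miss many cases.

```python
import re

# Assumed list of imperative verbs that suggest an instruction rather than a definition.
INSTRUCTION_VERBS = ("block", "reject", "remove", "delete", "flag", "ban", "allow")
# Assumed phrases that suggest a category defined by negation.
NEGATION_PATTERN = re.compile(
    r"\b(except|other than|anything but|everything but)\b", re.IGNORECASE
)


def lint_tag(audit_tag: str, description: str) -> list:
    """Return a list of warnings for one detection tag (empty if none found)."""
    warnings = []
    words = description.strip().split()
    if not words:
        warnings.append(f"{audit_tag}: description is empty.")
        return warnings
    if words[0].lower() in INSTRUCTION_VERBS:
        warnings.append(
            f"{audit_tag}: description reads like an instruction; "
            "describe what the content is instead."
        )
    if NEGATION_PATTERN.search(description):
        warnings.append(
            f"{audit_tag}: description uses negation; define what to detect."
        )
    return warnings


# The two counter-examples from the list above each trigger one warning.
instruction_warnings = lint_tag(
    "Competitor mentions", "Block all content mentioning competitor names"
)
negation_warnings = lint_tag(
    "Off-topic", "All content except product discussions"
)
```

A check like this only catches obvious phrasing problems; testing the configuration in the console remains the real measure of tag quality.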
Billing is based on the total character length of all your detection tags and descriptions combined, in increments of 3,000 characters. Lengths under 3,000 characters are rounded up to 3,000. You can configure a maximum of 30 detection tags. Longer descriptions increase detection latency.
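To estimate how prompt length maps to billing increments, you could run a quick calculation like the one below. It is a sketch under stated assumptions: characters are counted with Python's `len()`, which may not match how the service counts them, and totals above 3,000 are assumed to round up to the next multiple of 3,000. `billable_characters` is a hypothetical helper, not part of any SDK.

```python
import math

BILLING_INCREMENT = 3000  # characters per billing increment (from the text above)
MAX_TAGS = 30             # maximum number of detection tags (from the text above)


def billable_characters(tags: dict) -> int:
    """Estimate the billed character length for a mapping of tag name to description.

    Assumption: the total is rounded up to the next multiple of 3,000,
    with a minimum of one increment.
    """
    if len(tags) > MAX_TAGS:
        raise ValueError(f"At most {MAX_TAGS} detection tags can be configured.")
    total = sum(len(name) + len(description) for name, description in tags.items())
    increments = max(1, math.ceil(total / BILLING_INCREMENT))
    return increments * BILLING_INCREMENT


small_config = {"Off-site traffic diversion": "x" * 500}
estimated = billable_characters(small_config)  # falls within the first increment
```

Because every configuration is billed for at least one increment, trimming descriptions only changes the estimate when it moves the total across a 3,000-character boundary.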
The Model output format is preset and doesn't require configuration. For the output schema, see API reference.
Step 4: Test the configuration
Before publishing, test the configuration to make sure it detects what you expect.
Click Test in the lower-left corner of the configuration page. Enter one text entry or up to 10 entries at a time and review the results.
Testing is free of charge. Each account can test up to 1,000 texts per day.
Adjust your detection tags based on the results, then test again until the output meets your expectations.
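If you keep a larger set of sample texts for regression testing, it helps to split them into groups that fit the limits above before pasting them into the Test panel. A minimal sketch, where `batch_for_testing` is a hypothetical helper and the limits are taken from this step:

```python
def batch_for_testing(texts: list, batch_size: int = 10, daily_limit: int = 1000) -> list:
    """Split sample texts into batches of at most `batch_size` entries.

    Raises ValueError if the samples would exceed the daily test quota.
    """
    if len(texts) > daily_limit:
        raise ValueError(
            f"{len(texts)} texts exceed the daily test limit of {daily_limit}."
        )
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


samples = [f"sample text {n}" for n in range(25)]
batches = batch_for_testing(samples)  # three batches: 10, 10, and 5 entries
```

Re-running the same batches after each tag adjustment makes it easier to see whether a change fixed one case at the expense of another.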
Step 5: Publish the configuration
Click Publish to deploy the configuration to the production environment. The configuration typically takes 2 to 5 minutes to take effect.
To verify the live behavior, use the online testing feature.
Step 6: View detection results
In the left navigation bar, choose Test Results to view detection results and threat reports from the custom detection agent.
What's next
Billing overview — understand how model selection and prompt length affect cost
API reference — integrate the Customize Detection Agent into your application
Online testing feature — verify production behavior after publishing