Build Custom Detection Agents to Strengthen LLM Content Safety

Built-in content filters cover common risk categories, but every platform has unique moderation needs — off-site traffic diversion, competitor attacks, or industry-specific violations. The Customize Detection Agent feature in AI Guardrails lets you describe what to detect in plain language, and uses large language models (LLMs) to apply your rules at scale.

When to use custom detection agents

Use the Customize Detection Agent feature when built-in content filters don't cover your use case. It's best suited for:

Topic-based detection: Detecting content that belongs to a semantic category you define, such as "off-site traffic diversion" or "malicious competitor comparisons"
Multi-class classification: Detecting and labeling content across several custom categories in a single pass

How it works

You define detection tags and their descriptions. Each tag is a category the LLM will look for.
The system combines your tags, a preset scenario template, and a fixed output format into a complete prompt.
The prompt is sent to the LLM you select. The LLM returns a structured moderation result.

The complete prompt looks like this:

You are a senior ****** moderation expert, specializing in ******.
The business problem you face is ******, and the task objective is ******.
The tags for moderation are as follows:
1. Off-site traffic diversion: Content that directs users to other platforms or off-site channels...
2. Malicious negative reviews for brand xx: Unfounded malicious comparisons, false negative reviews...
Here is a sample for moderation. Determine if the text matches any of the tags described above.
Strictly follow the format below for the output: ******.
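
The assembly step described above can be sketched in a few lines of Python. This is only an illustration of how tags, a template, and an output format might be joined into one prompt; it is not the service's actual implementation. The names `DetectionTag` and `build_prompt` are hypothetical, and the masked `******` portions of the template are kept as placeholders because the source does not disclose them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DetectionTag:
    """Hypothetical container for one detection tag."""
    name: str         # corresponds to the "Audit Tag" field
    description: str  # corresponds to the "Description" field


def build_prompt(template_header: str, tags: List[DetectionTag], output_format: str) -> str:
    """Join the scenario template, the numbered tags, and the output format."""
    tag_lines = [
        f"{index}. {tag.name}: {tag.description}"
        for index, tag in enumerate(tags, start=1)
    ]
    return "\n".join([
        template_header,
        "The tags for moderation are as follows:",
        *tag_lines,
        "Here is a sample for moderation. Determine if the text matches any of the tags described above.",
        f"Strictly follow the format below for the output: {output_format}.",
    ])


tags = [
    DetectionTag(
        "Off-site traffic diversion",
        "Content that directs users to other platforms or off-site channels...",
    ),
    DetectionTag(
        "Malicious negative reviews for brand xx",
        "Unfounded malicious comparisons, false negative reviews...",
    ),
]

# The masked parts of the real template are unknown, so they stay as "******".
prompt = build_prompt(
    "You are a senior ****** moderation expert, specializing in ******.",
    tags,
    "******",
)
print(prompt)
```

Because every tag and description becomes part of the prompt, each one adds to the text the model reads on every request.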

Set up custom detection agents

Step 1: Enable the Customize Detection Agent feature

Prerequisite: Activate AI Guardrails before you begin.
Log on to the AI Guardrails console.
In the left navigation bar, choose Protection Configuration > Configuration.

Select the service you want to configure. The following services support custom detection agents:

Service	System identifier	Use case
AI Input Content Security Check	`query_security_check_intl`	Moderate user-submitted queries
AI Generated Content Security Check	`response_security_check_intl`	Moderate LLM-generated responses

Click Actions in the Management column to go to Configuration.
If Customize Detection Agent is not enabled, enable it on this page. This feature is billed separately. For details, see Billing.

Step 2: Select a model

On the Customize Detection Agent card, click Configuration Management in the lower-right corner to open the configuration page.

Under Select large model, choose the model that fits your use case:

Model	Best for	Notes
Qwen3Guard-Gen-4B	Security-focused scenarios requiring broad language support	Supports 9 risk categories, 3 risk levels, and 119 languages and dialects. Currently available only in Singapore. Requires a specific input format — see below.
Qwen3_Plus	Complex, high-performance scenarios where some latency is acceptable	Balanced performance, speed, and cost.
Qwen3_Flash	Simple tasks	Fast and cost-effective.

Important

The model you select affects billing. Different models have different billing methods. For details, see Activation and billing overview.

Input format for Qwen3Guard-Gen-4B:

To moderate a query: pass the query directly in the `content` field.
To moderate a response: concatenate the query and response as `query<|interval|>response` and pass the result in the `content` field.
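
A minimal sketch of the two input rules above, assuming you build the `content` value yourself before calling the service. The helper name `build_guard_content` is hypothetical. Only the `<|interval|>` separator and the `content` field come from this page.

```python
from typing import Optional

# Separator between query and response, as given in the input-format rules.
INTERVAL_TOKEN = "<|interval|>"


def build_guard_content(query: str, response: Optional[str] = None) -> str:
    """Return the string to place in the content field for Qwen3Guard-Gen-4B."""
    if response is None:
        # Query moderation: pass the query as-is.
        return query
    # Response moderation: query and response joined by the separator.
    return f"{query}{INTERVAL_TOKEN}{response}"


print(build_guard_content("Where can I buy this?"))
print(build_guard_content("Where can I buy this?", "Try the store on another site."))
```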

Step 3: Configure detection tags

Under Configure custom prompt, set up your detection tags.

Choose a scenario template

Select a template under Select a scenario template. The currently available template is:

Custom Tag Template: For general scenarios. Supports configuring custom detection tags.

Add detection tags

Under Detection configuration, add the categories you want to detect. Each detection tag requires two fields:

Field	Description
Audit Tag	A noun or noun phrase naming the category (for example, `Off-site traffic diversion`).
Description	A precise definition of what the category includes. Optionally, include one to three example phrases to illustrate boundary cases.

Example detection tags:

Audit Tag	Description
Off-site traffic diversion	Behavior that directs users to other platforms or channels off-site through direct guidance or subtle hints, including variations and metaphors. This includes explicitly mentioning competitor platform names or their variations, such as common competitor xx. It also includes mentioning other off-site platforms or their variations, or including explicit contact information.
Malicious negative reviews for brand xx	Unfounded malicious attacks, false negative reviews against brand xx, or false slander and rumors against the brand's founder that intentionally damage the image of the brand or founder. For example: "xx is all false advertising, far inferior to brand xx."

Best practices for writing detection tags

Writing clear detection tags directly affects accuracy. Follow these guidelines:

Do:

Define the category precisely and concisely. A clear definition reduces false positives and false negatives. For example: "Content that directs users to third-party platforms through direct links, implied references, or variations of competitor names."
Write the Audit Tag as a noun or noun phrase that names the category.
Write the Description as a definition — describe what the content *is*, not what to do with it.
Include one to three short example phrases in the Description if the boundary is ambiguous.

Don't:

Write the Description as an instruction. For example, "Block all content mentioning competitor names" is an instruction, not a definition.
Define a category using negation. For example, "All content except product discussions" doesn't define what to detect.
Use the Audit Tag or Description to match specific words or entity names. Use the word filter for that.

Important

Billing is based on the total character length of all your detection tags and descriptions combined, in increments of 3,000 characters. Lengths under 3,000 characters are rounded up to 3,000. You can configure a maximum of 30 detection tags. Longer descriptions increase detection latency.
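
The rounding rule can be worked through with a short calculation. This sketch assumes that lengths above 3,000 characters are rounded up to the next multiple of 3,000 and that length is counted the way Python's `len` counts characters. Neither detail is confirmed here, so treat the result as an estimate and check the billing documentation for the exact method. The function name `billable_characters` is hypothetical.

```python
import math
from typing import Dict

BILLING_INCREMENT = 3_000  # characters per billing increment (from this page)
MAX_TAGS = 30              # maximum number of detection tags (from this page)


def billable_characters(tags: Dict[str, str]) -> int:
    """Estimate the billed character length for a tag-name -> description mapping."""
    if len(tags) > MAX_TAGS:
        raise ValueError(f"At most {MAX_TAGS} detection tags are allowed, got {len(tags)}")
    total = sum(len(name) + len(description) for name, description in tags.items())
    # Assumption: round up to the next full increment, with a minimum of one increment.
    increments = max(1, math.ceil(total / BILLING_INCREMENT))
    return increments * BILLING_INCREMENT


example = {"Off-site traffic diversion": "x" * 500}
print(billable_characters(example))  # 526 characters in total, billed as 3000
```

Under these assumptions, a combined length of 3,001 characters would be billed as 6,000, so trimming a description that sits just over a boundary can lower cost.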

The Model output format is preset and doesn't require configuration. For the output schema, see API reference.

Step 4: Test the configuration

Before publishing, test the configuration to make sure it detects what you expect.

Click Test in the lower-left corner of the configuration page. Enter up to 10 text entries at a time, then review the results.

Note

Testing is free of charge. Each account can test up to 1,000 texts per day.
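
If you keep a larger set of sample texts, a small helper can split it into groups that fit these limits before you paste them into the console. This is a planning aid only, not a call to any service API, and the name `plan_test_batches` is hypothetical.

```python
from typing import List

BATCH_SIZE = 10      # entries per test run (from this page)
DAILY_LIMIT = 1_000  # texts per account per day (from this page)


def plan_test_batches(texts: List[str], already_tested_today: int = 0) -> List[List[str]]:
    """Split texts into batches of up to 10, stopping at the daily limit."""
    remaining = max(0, DAILY_LIMIT - already_tested_today)
    allowed = texts[:remaining]
    return [allowed[i:i + BATCH_SIZE] for i in range(0, len(allowed), BATCH_SIZE)]


samples = [f"sample text {n}" for n in range(25)]
batches = plan_test_batches(samples)
print([len(batch) for batch in batches])  # [10, 10, 5]
```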

Adjust your detection tags based on the results, then test again until the output meets your expectations.

Step 5: Publish the configuration

Click Publish to deploy the configuration to the production environment. The configuration typically takes 2 to 5 minutes to take effect.

To verify the live behavior, use the online testing feature.

Step 6: View detection results

In the left navigation bar, choose Test Results to view detection results and threat reports from the custom detection agent.

What's next

Billing overview — understand how model selection and prompt length affect cost
API reference — integrate the Customize Detection Agent into your application
Online testing feature — verify production behavior after publishing