
Content Moderation: What is AI Guardrails

Last Updated: Jul 01, 2025

AI Guardrails is a security protection product designed by Alibaba Cloud for artificial intelligence systems. Through highly available, high-precision risk detection, it helps AI systems respond to user instructions safely, compliantly, and reliably.

Features

When developing and operating AI applications and AI Agents, developers and AI companies often face security threats, including compliance risks, data breach risks, prompt injection attacks, hallucinations, and jailbreaking. These AI risks not only threaten normal business operations but also bring significant compliance and social risks to enterprises.

Alibaba Cloud AI Guardrails ensures the compliance, security, and stability of AI businesses by providing a comprehensive protection system for pre-trained LLMs, AI services, and AI Agents. Particularly in generative AI input and output scenarios, AI Guardrails provides accurate risk detection and proactive defense capabilities.

  1. Risk detection capabilities

    Includes comprehensive detection capabilities such as content compliance detection, sensitive content detection, and prompt injection attack detection.

    • Content compliance detection: Conducts multi-dimensional compliance moderation of text content input and output by generative AI, covering risk categories such as politically sensitive content, illicit content, bias and discrimination, and harmful values. This ensures that AI-generated content complies with laws, regulations, and platform standards. Scenarios include chatbots, AI education, intelligent customer service, AIGC creation platforms, and more.

    • Sensitive content detection: Deeply detects privacy data and sensitive information that may be leaked during AI interactions. It supports the identification of sensitive content involving personal privacy and corporate privacy, and prevents training data leakage and conversation information overflow risks. Scenarios include AI healthcare, AI financial services, enterprise knowledge base Q&A, and more.

    • Prompt injection attack detection: Professionally defends against injection attacks targeting generative AI. It accurately identifies adversarial behaviors such as jailbreak instructions, role assumption inducement, and system instruction tampering, building an "immune defense line" for AI systems. Scenarios include instruction interaction security protection for AI Agents, adversarial attack defense for open-domain dialogue systems, permission control for third-party plugin calls, and more.
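The three detection capabilities above are typically applied to both the user prompt and the model reply. The following is a minimal sketch of that flow; the category names, `Verdict` type, and keyword heuristics are illustrative stand-ins, not the actual AI Guardrails API or label set.

```python
from dataclasses import dataclass

# Illustrative risk categories mirroring the three detection capabilities;
# the real service defines its own label taxonomy.
COMPLIANCE = "content_compliance"
SENSITIVE = "sensitive_content"
INJECTION = "prompt_injection"

@dataclass
class Verdict:
    label: str        # which detection capability flagged the text
    confidence: int   # 0-100 confidence score
    blocked: bool     # whether the text should be rejected

def check_text(text: str) -> list[Verdict]:
    """Stand-in for a guardrail detection call. The real service would
    return per-category labels with confidence scores; here we use toy
    keyword heuristics purely for illustration."""
    verdicts = []
    lowered = text.lower()
    if "ignore previous instructions" in lowered:
        verdicts.append(Verdict(INJECTION, 95, True))
    if "credit card number" in lowered:
        verdicts.append(Verdict(SENSITIVE, 90, True))
    return verdicts

def guarded_reply(prompt: str, model) -> str:
    """Run detection on the input prompt, then on the model output."""
    if any(v.blocked for v in check_text(prompt)):
        return "[input rejected by guardrails]"
    reply = model(prompt)
    if any(v.blocked for v in check_text(reply)):
        return "[output rejected by guardrails]"
    return reply
```

The key design point is that detection wraps the model on both sides: a jailbreak attempt is stopped before the model sees it, and a sensitive-data leak is stopped before it reaches the user.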

  2. Custom protection configuration

    Supports fine-grained adjustment of the risk detection items in your protection configuration. You can enable or disable individual risk detection items at any time in the AI Guardrails console to build the detection template that best fits your business.

    • Custom detection items: Configure fine-grained tags in content compliance detection.

    • Custom risk thresholds: Configure hit thresholds for fine-grained tags against the model's 0-100 confidence score, with a minimum adjustment step of 1.

    • Custom filter words: Configure sensitive words (such as competitors' names) that need to be detected and blocked, supporting dictionary management operations such as adding, deleting, and modifying.
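Conceptually, the three custom settings above combine into one detection template. The sketch below shows how such a template could be evaluated against detection results; the field names and label names are hypothetical, not the console's actual schema.

```python
# Hypothetical detection template: enabled flags, per-label thresholds
# (0-100, adjustable in steps of 1), and a custom filter-word dictionary.
TEMPLATE = {
    "labels": {
        "political_sensitivity": {"enabled": True, "threshold": 80},
        "bias_discrimination":   {"enabled": True, "threshold": 65},
        "harmful_values":        {"enabled": False, "threshold": 70},
    },
    "filter_words": {"competitorx", "competitory"},
}

def apply_template(scores: dict[str, int], text: str, template=TEMPLATE) -> bool:
    """Return True if the text should be blocked under the template.

    `scores` maps a fine-grained label to the model's 0-100 confidence
    score; a label only triggers when it is enabled and its score meets
    the configured threshold. Filter words block unconditionally.
    """
    for label, cfg in template["labels"].items():
        if cfg["enabled"] and scores.get(label, 0) >= cfg["threshold"]:
            return True
    lowered = text.lower()
    return any(word in lowered for word in template["filter_words"])
```

Note that a disabled label never triggers regardless of its score, which is what toggling a detection item off in the console amounts to.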

For more information about product features, see the Functions and Features page in the documentation.

Scenarios

AI Guardrails is recommended for risk detection in business scenarios such as the following:

  • User prompts submitted to generative AI for processing.

  • Multi-modal content output by generative AI, such as text, images, and videos.

  • Scanning and cleansing of generative AI training corpora.

  • Risk detection of AI Agent user instruction input and output.
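The scenarios above share one pattern: scan text before it reaches the model or leaves the system. For the training-corpus case, that scan runs as a batch job. A minimal sketch, where `check` stands in for a call to the detection API:

```python
def scan_corpus(records: list[str], check) -> tuple[list[str], list[str]]:
    """Split a training corpus into clean and flagged records.

    `check` is any callable returning True when a record should be
    dropped; in practice it would call the guardrail detection service
    on each record (or on batches, to reduce request overhead).
    """
    clean, flagged = [], []
    for record in records:
        (flagged if check(record) else clean).append(record)
    return clean, flagged
```

For example, `scan_corpus(["fine text", "bad text"], lambda r: "bad" in r)` keeps the first record and flags the second; only the clean partition would feed into training.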