API Gateway: What is AI Gateway

Last Updated: Nov 28, 2025

Overview

Artificial intelligence (AI) is a key driver of innovation for modern enterprises. As large language models (LLMs) develop, the use cases for AI expand. Commercial and proprietary models drive business progress across various domains, and enterprise application architecture evolves from microservices and cloud-native architectures to AI-native architectures. This evolution presents numerous challenges for enterprises, including AI integration, system stability, security and compliance, and management complexity.

To address these challenges, Alibaba Cloud's cloud-native API gateway introduces AI Gateway. It acts as a core component that connects enterprise AI applications with model services, tools, and other agents. AI Gateway helps enterprises build and manage AI-native applications by providing capabilities such as protocol conversion, security protection, traffic governance, and unified observability.

Challenges of using AI in enterprise applications

AI applications are widely used in various enterprise scenarios. Compared to traditional applications, they have a distinct architectural feature: they are model-centric, using a model's inference capabilities together with prompts, tool calling, and memory mechanisms to support and respond to specific business needs.

Based on their traffic characteristics, AI applications can be divided into the following three scenarios:

  • AI applications access various model services: The core feature of AI applications is the use of model capabilities for inference and planning. Therefore, ensuring the security and stability of the model access path is critical.

  • AI applications call external tools: Tools act as a bridge between AI applications and external systems. Tool calling is typically achieved through standardized protocols such as MCP.

  • AI applications are accessed externally: This includes access by end users or other AI applications. In this scenario, AI applications often use protocols such as A2A for communication and calls between applications.

Enterprises face numerous engineering and technical challenges when implementing these three scenarios. These include the following:

Challenges in accessing model services: Multiple factors and high requirements

Multiple factors:

  1. Multiple models: Different model providers have inconsistent API operation specifications, authentication mechanisms, and invocation methods. The lack of a standard abstraction layer prevents unified integration, makes it difficult for developers to switch between providers, and hinders parallel multi-model calls.

  2. Multimodal: Unlike text-to-text LLMs, which often adhere to the OpenAI standard, multimodal models lack unified standards for transport protocols (such as Server-Sent Events (SSE), WebSocket, and Web Real-Time Communication (WebRTC)), communication patterns (synchronous/asynchronous), and request-response structures. This diversity of interfaces complicates system integration and operations management.

  3. Multiple scenarios: Different business scenarios have vastly different requirements for model services. For example, real-time speech recognition demands low response time (low RT), while long-text understanding requires processing stability. Each scenario needs customized adaptations for rate limiting policies, fault tolerance mechanisms, and service quality guarantees.

High requirements:

  1. High security requirements: Enterprises face data breach risks when calling model services, especially when using external or open source models. The transmission and processing of sensitive data must comply with strict data regulations, requiring robust security controls such as privacy protection, audit trails, and access control.

  2. High stability requirements: Model services are constrained by underlying computing power resources, often resulting in low API operation rate-limiting thresholds. Their response time (RT) and request success rate fluctuate more significantly than those of traditional API services, leading to lower service availability. This instability poses a direct challenge to the continuity and user experience of upstream AI applications.

Challenges in accessing tools: Precision and security

The primary challenge for AI applications in tool calling is balancing efficiency with security.

As the number of available tools grows, passing the entire tool list to an LLM for selection significantly increases token consumption and inference costs. Moreover, an excessive number of candidate tools can lead to incorrect selections by the model, reducing execution accuracy.

Additionally, tools are often directly linked to core business logic. Improper tool calls can expand the system's attack surface. The emergence of new attack vectors, such as MCP malicious poisoning, demands a more secure design for tool access mechanisms.

Challenges in accessing AI applications: Stability and flexibility

Developers can build AI applications in several ways, which primarily fall into three categories:

  • High-code development: Build applications by writing code using frameworks such as Spring AI Alibaba, ADK, and LangChain. This approach offers the highest flexibility and functional extensibility but demands a higher level of technical expertise from developers.

  • Low-code development: Use platforms such as Model Studio to build applications through visual, drag-and-drop flow orchestration. This method supports quick development and iteration, lowers the barrier to entry, and is ideal for quick validation and prototyping.

  • Zero-code development: Leverage tools such as JManus to build AI applications solely through prompt configuration without any programming, suitable for fast deployment in simple scenarios.

Because these development models vary in implementation and architectural design, there is no unified standard for accessing them. This makes it difficult to achieve the centralized administration and control that is common in cloud-native applications.

An AI application's behavior and performance are highly dependent on the underlying LLM's capabilities, making its output stability uncertain. Without effective isolation and fault tolerance mechanisms, a single point of failure can trigger a chain reaction, causing widespread failures in business systems that rely on the application.

Typical AI Gateway practices for the three scenarios

To address these customer challenges, Alibaba Cloud launched AI Gateway. It acts as a bridge between AI applications and model services, tools, and other agents. The following three scenarios show typical use cases for AI Gateway.

Model access

An enterprise plans to build AI applications to improve operational efficiency and explore new business scenarios. The enterprise uses the Alibaba Cloud platform to deploy a fine-tuned model on PAI and integrates Alibaba Cloud Model Studio as a fallback service. For specific needs, such as image generation, it uses an open source model deployed on Function Compute. To ensure secure and efficient calls from all AI applications to these LLM services, the enterprise deploys AI Gateway. It configures a Model API for each application scenario and integrates control capabilities, such as traffic governance and authentication, at the API layer to provide a unified entry point for model access.

AI Gateway effectively addresses the challenges of multiple factors and high requirements:

  • Multiple models: AI Gateway supports various model routing policies, including rules based on model name, request proportion, or specific request features (such as a Header). The gateway can also unify protocols from different model providers into an OpenAI-compatible interface, allowing AI applications to seamlessly switch between multiple models by integrating with a single standard (see the sketch after this list).

  • Multimodal: AI Gateway supports proxying multimodal model calls over both HTTP and WebSocket protocols, providing a unified endpoint. This enables applications to consistently invoke various models, such as text-to-text, text-to-image, and speech recognition. Administrators can also use the plugin mechanism to enhance the security and stability of multimodal calls.

  • Multiple scenarios: You can create a separate Model API for each model application scenario (such as text generation, image generation, and speech recognition) and assign a unique consumer identity to each caller. This enables per-consumer observability, rate limiting, security protection, and metering, ensuring resource isolation and fine-grained management.

  • High security requirements: AI Gateway provides comprehensive protection across three layers: network security, data security, and content security.

    • Network security: Integrates SSL certificates, WAF protection, and IP blacklists and whitelists to defend against malicious traffic and attacks at the network ingress.

    • Data security: Supports consumer-side identity authentication to prevent direct exposure of API keys. It also implements backend authentication and API key management for backend model services. Keys can be hosted in KMS to prevent sensitive information from being stored locally on the gateway.

    • Content security: Deeply integrates with AI content moderation features to intercept non-compliant content and risky inputs in real time. Combined with a data masking plugin, it can remove sensitive information before forwarding requests to ensure content compliance.

  • High stability requirements: AI Gateway enhances system stability from two perspectives: observability and control.

    • Observability: For every request, the gateway logs the source provider, target model, calling consumer, and key metrics such as first packet latency and token count. It also flags events such as rate limiting, interceptions, and fallbacks, providing end-to-end visualization through a built-in dashboard.

    • Control: Provides load balancing, fallback mechanisms, rate limiting policies, and caching capabilities. You can configure governance rules, such as token count limits and concurrency control, on a per-consumer basis. Administrators can continuously optimize these policies and dynamically adjust resource allocation based on monitoring data to ensure stable system operation.
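The following is a minimal client-side sketch of the unified model access described above. Because the gateway exposes an OpenAI-compatible interface, the standard openai Python SDK can be pointed at it directly, and switching providers reduces to changing the model name. The endpoint URL, consumer credential, and model names below are placeholders, not real values.

```python
# Minimal sketch: calling two different providers through one
# OpenAI-compatible gateway endpoint. URL, key, and model names
# are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-ai-gateway.example.com/v1",  # hypothetical gateway endpoint
    api_key="YOUR_CONSUMER_API_KEY",                    # consumer credential issued by the gateway
)

# Switching providers is just a change of model name; the gateway
# routes the request and translates the provider-specific protocol.
for model in ("qwen-max", "deepseek-chat"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize our Q3 report in one sentence."}],
    )
    print(model, "->", resp.choices[0].message.content)
```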

Tool access

After establishing a unified access system for model services, the enterprise identifies that tool access presents several challenges, particularly high security risks that require dedicated management. To address this, the enterprise decides to implement unified control over tool access protocols and entry points. The architecture team selects MCP as the standard protocol for tool access and uses AI Gateway's HTTP-to-MCP conversion capability to automatically transform existing APIs into MCP Servers, supporting fast business iteration and innovation.

AI Gateway ensures the precision and security of tool calls through the following mechanisms:

  • Precision:

    AI Gateway supports both connecting to existing HTTP services and hosting MCP Servers. For existing HTTP services, you can dynamically update tool descriptions within the gateway. The gateway also supports flexible tool orchestration: you can create virtual MCP Servers that assemble custom tool lists on demand, meeting the needs of different business scenarios and letting providers and consumers define their own MCP Servers independently. Additionally, AI Gateway provides an intelligent tool routing feature that filters the candidate tool collection at the gateway based on the request content and returns only the tools that match the current task. This effectively reduces the token consumption required for model inference and improves tool selection accuracy. A client-side sketch of tool discovery follows this list.

  • Security: For tool access control, AI Gateway provides a multilayered security mechanism. In addition to supporting call authentication at the MCP Server level, it also enables fine-grained access permission configuration for individual tools. This enables precise authorization management based on the caller's identity, ensuring that tools with different security levels can be assigned corresponding access privileges based on their risk level.
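As a client-side illustration of the tool discovery flow above, the following sketch uses the open source MCP Python SDK (the mcp package) to connect to an MCP Server exposed over SSE and list its tools. The endpoint URL is a placeholder, not a real gateway address.

```python
# Minimal sketch: discovering tools from an MCP Server behind the
# gateway over SSE, using the open source MCP Python SDK.
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client

GATEWAY_MCP_URL = "https://your-ai-gateway.example.com/mcp/crm-tools/sse"  # hypothetical

async def main() -> None:
    async with sse_client(GATEWAY_MCP_URL) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            # With intelligent tool routing enabled, the gateway may return
            # only the subset of tools relevant to the caller's task.
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

asyncio.run(main())
```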

Agent access

As the number of AI applications grows, the enterprise decides to unify them under AI Gateway to address coordination and management challenges. It adopts the A2A protocol and uses the Nacos AI Registry for service registration and discovery.

AI Gateway can serve as a unified proxy service for AI applications, offering both stability and flexibility.

  • Stability: AI Gateway supports direct connections to various Alibaba Cloud runtime platforms (such as ACK, FC, and SAE). It provides active and passive health check mechanisms to automatically isolate abnormal nodes. By incorporating phased release capabilities, it reduces the risks associated with changes. It also supports multi-dimensional rate limiting policies to prevent application overload and ensure service stability.

  • Flexibility: Through its service discovery feature, AI Gateway uniformly exposes AI applications deployed on different computing platforms. It provides REST-to-A2A protocol conversion, enabling the automatic upgrade of existing HTTP applications to the A2A protocol. For low-code AI applications built on Model Studio, AI Gateway supports unified proxy access and can be extended with a secondary authentication mechanism.

In addition, AI Gateway is deeply integrated with the Alibaba Cloud observability ecosystem. Once an AI application is connected, you can enable end-to-end observability with a single click. This covers the entire call chain from the application layer, through MCP tools, to the final model call, enabling end-to-end tracing and fault diagnosis.

Core capabilities of AI Gateway

Unified proxy for models, MCP Servers, and agents

AI Gateway provides proxy capabilities for models, MCP Servers, and agents, supporting unified access and management for multiple service types, including the following:

  • AI service: Proxies various model services, including those from providers such as Model Studio, OpenAI, Minimax, Anthropic, Amazon Bedrock, and Azure, and is compatible with self-hosted models based on frameworks such as Ollama, vLLM, and SGLang. You can configure an API key for the AI service and specify a custom DNS Server for internal endpoints.

  • Agent service: Supports services from agent application platforms, including Model Studio, Dify, and user-defined agent workloads. You can configure an API key and APP-ID for identity authentication and access control.

  • Container service: Supports services running on Alibaba Cloud ACK or ACS clusters. A single AI Gateway instance can be associated with up to three container clusters.

  • Nacos service: Supports service instances registered in an MSE Nacos registry, suitable for both standard microservices and MCP Servers.

  • DNS service: Supports accessing backend services through DNS resolution. This lets you specify a dedicated DNS Server to resolve private network or internal domain names.

  • Fixed address: Supports configuring backend endpoints as a list of fixed IPs. You can set multiple IP:Port addresses.

  • SAE service: Supports services running on Alibaba Cloud SAE.

  • FC service: Supports Alibaba Cloud Function Compute (FC) service registration. AI Gateway can bypass the HTTP Trigger and integrate directly with the backend service, improving call efficiency.

  • Compute Nest MCP service: Supports MCP Servers hosted by Compute Nest.

AI Gateway lets you configure health checks for services, in both active and passive modes (a minimal sketch of the two modes follows this list).

  • Active health check: The gateway periodically sends health probe requests to service nodes based on user-defined probing rules to determine their availability status.

  • Passive health check: The gateway evaluates a node's health based on its performance during actual request processing, according to user-defined evaluation rules.
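The following sketch illustrates the two modes in Python. The probe path, thresholds, and failure counting are illustrative assumptions, not the gateway's actual defaults or implementation.

```python
# Minimal sketch of the two health check modes described above.
import urllib.request

def active_probe(node_url: str, timeout_s: float = 2.0) -> bool:
    """Active check: periodically send a probe request to the node."""
    try:
        # "/healthz" is an assumed probe path, not a gateway default.
        with urllib.request.urlopen(node_url + "/healthz", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        return False

class PassiveChecker:
    """Passive check: judge health from real request outcomes."""

    def __init__(self, max_consecutive_failures: int = 5) -> None:
        self.failures = 0
        self.max_failures = max_consecutive_failures

    def record(self, success: bool) -> None:
        # Reset on success; count consecutive failures otherwise.
        self.failures = 0 if success else self.failures + 1

    def healthy(self) -> bool:
        return self.failures < self.max_failures
```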

Load balancing and phased release for models and agents

Load balancing and phased release for models

The Model API provides three built-in model load balancing capabilities:

  • Single model service: You can specify a single LLM service. This mode supports either passing through the model name from the request or specifying a fixed model name. When a model name is explicitly specified, any model name passed in the user's request is ignored.

  • Multiple model services (by model name): You can configure one or more LLM services and set model name matching rules for each. For example, you can define a rule to route requests with model names matching deepseek-* to a DeepSeek LLM service, and those matching qwen-* to the Alibaba Cloud Model Studio LLM service (a routing sketch follows below).

  • Multiple model services (by proportion): You can configure one or more LLM services, specifying a model name and a request allocation percentage for each. This is suitable for scenarios such as a phased release of a new model.

The Model API supports custom route configurations. This lets you forward requests to different backend services based on request features, such as a specific Header.
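As a minimal illustration of the model-name matching mode above, the following sketch maps incoming model names to backends with glob patterns; header-based custom routes work analogously on request features instead of model names. The rule table and backend names are illustrative, not gateway configuration syntax.

```python
# Minimal sketch of model-name-based routing, matching the example
# above: deepseek-* to one backend, qwen-* to another.
from fnmatch import fnmatch

ROUTING_RULES = [
    ("deepseek-*", "deepseek-llm-service"),      # illustrative backend names
    ("qwen-*", "model-studio-llm-service"),
]

def route(model_name: str) -> str:
    # First matching pattern wins, as in ordinary route tables.
    for pattern, backend in ROUTING_RULES:
        if fnmatch(model_name, pattern):
            return backend
    raise LookupError(f"no backend configured for model {model_name!r}")

assert route("qwen-max") == "model-studio-llm-service"
assert route("deepseek-r1") == "deepseek-llm-service"
```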

Phased release for agents

Similar to the Model API, the Agent API supports phased release capabilities based on request features. You can route requests to different backend services based on specific features, such as a particular Header.

Per-consumer authentication, observability, rate limiting, and metering

AI Gateway provides independent authentication, monitoring, rate limiting, and metering functions for different business sources to meet fine-grained management needs.

Consumer authentication

You can create multiple consumers in AI Gateway and assign independent request credentials to each. You can enable consumer authentication as needed for each Model API, MCP Server, and Agent API. AI Gateway supports three consumer authentication methods: API key, JWT, and HMAC. For security-sensitive scenarios, you can host consumer credentials in KMS for secure management.

Consumer observability and metering

AI Gateway provides multi-dimensional observability capabilities, supporting monitoring and analysis by consumer and other dimensions. Key metrics include the following:

  • Queries Per Second (QPS): The number of AI requests and responses per second, broken down into AI request QPS, streaming response QPS, and non-streaming response QPS.

  • Request success rate: The success rate of AI requests, with statistics available at second, 15-second, and minute granularities.

  • Tokens consumed/s: The number of tokens consumed per second, divided into input tokens, output tokens, and total tokens.

  • Average request RT: The average response time (in milliseconds) for AI requests over a specified period (by second, 15 seconds, or minute). Breakdowns include non-streaming RT, streaming RT (total time for the streaming response), and streaming first packet RT (first packet latency for the streaming response).

  • Cache hits: The number of cache hits and misses within a specified time period.

  • Rate limiting statistics: The number of rate-limited requests and normally processed requests within a specified time period.

  • Token statistics by model: The token consumption for different models within a specified time period.

  • Token statistics by consumer: The token consumption for different consumers within a specified time period.

  • Risk statistics: Statistics on identified risky requests based on Content Moderation detection results, categorized by risk type, consumer, and other dimensions.

Based on this observability data, AI Gateway supports consumer-based metering and billing. It provides detailed data, such as the number of tokens consumed by a specific consumer when calling a particular model over a defined period. This enables you to implement accurate, per-consumer resource usage metering and billing.
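The following sketch shows the kind of per-consumer, per-model aggregation this enables. The log record fields are assumptions modeled on the metrics listed above, not the gateway's actual log schema.

```python
# Minimal sketch of per-consumer, per-model token metering over
# access-log-style records (field names are assumptions).
from collections import defaultdict

logs = [
    {"consumer": "team-a", "model": "qwen-max", "input_tokens": 120, "output_tokens": 480},
    {"consumer": "team-a", "model": "deepseek-chat", "input_tokens": 80, "output_tokens": 200},
    {"consumer": "team-b", "model": "qwen-max", "input_tokens": 300, "output_tokens": 900},
]

# Aggregate total tokens by (consumer, model) for billing.
usage: dict[tuple[str, str], int] = defaultdict(int)
for record in logs:
    key = (record["consumer"], record["model"])
    usage[key] += record["input_tokens"] + record["output_tokens"]

for (consumer, model), tokens in sorted(usage.items()):
    print(f"{consumer} / {model}: {tokens} tokens")
```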

Consumer rate limiting

AI Gateway supports rate limiting policies based on multiple dimensions, including consumer, model name, and request Header. You can set limits on the number of requests, concurrent connections, and tokens over a unit of time.
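As a minimal illustration of request-count limiting per consumer, the following token-bucket sketch admits or rejects requests by authenticated identity. The limits are illustrative, and the gateway's actual policies also cover concurrent connections and LLM token budgets, which this sketch omits.

```python
# Minimal token-bucket sketch of per-consumer rate limiting.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float) -> None:
        self.rate = rate_per_s        # refill rate (requests per second)
        self.capacity = burst         # maximum burst size
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per consumer, keyed by the authenticated identity.
buckets = {"team-a": TokenBucket(rate_per_s=10, burst=20)}

def admit(consumer: str) -> bool:
    bucket = buckets.get(consumer)
    return bucket.allow() if bucket else False
```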

Multi-dimensional, multimodal AI security protection

The AI Gateway integrates the Content Moderation protection feature to provide AI security protection. You can enable this feature on a per-API basis to effectively prevent security risks during model invocations, including those related to sensitive words, compliance, prompt injection attacks, and brute-force attacks. This improves the security and stability of your AI applications.

AI Gateway lets you configure independent interception policies for different protection dimensions. The dimensions that can be protected include the following:

  • contentModeration: Content compliance detection

  • promptAttack: Prompt attack detection

  • sensitiveData: Sensitive content detection

  • maliciousFile: Malicious file detection

  • waterMark: Digital watermarking

For each protection dimension, you can configure a corresponding interception policy (see the sketch after this list). The policies include the following:

  • High: Intercepts all requests with a risk level of low, medium, or high.

  • Medium: Intercepts requests with a risk level of medium or high.

  • Low: Intercepts only requests with a risk level of high.

  • Monitor mode: Requests are not intercepted but are logged.
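The mapping between policy levels and intercepted risk levels can be summarized in a few lines. The following sketch illustrates that decision logic; it is not the gateway's implementation.

```python
# Minimal sketch of the interception policies above: each policy
# level defines which risk levels get intercepted.
INTERCEPTED_LEVELS = {
    "High": {"low", "medium", "high"},   # strictest: intercept everything risky
    "Medium": {"medium", "high"},
    "Low": {"high"},                     # most permissive: high risk only
    "Monitor": set(),                    # log only, never intercept
}

def decide(policy: str, risk_level: str) -> str:
    if risk_level in INTERCEPTED_LEVELS[policy]:
        return "intercept"
    return "log" if policy == "Monitor" else "pass"

assert decide("Low", "medium") == "pass"
assert decide("Medium", "medium") == "intercept"
assert decide("Monitor", "high") == "log"
```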

Hot-swappable and hot-updatable policies and extension plugins

AI Gateway provides a rich set of built-in extension policies and plugins, and also supports developing custom plugins to meet specific business requirements.

For example, the Model API comes with five pre-configured core policies: tool selection, security protection, rate limiting, caching, and web search. You can also enable additional policies and plugins as needed.

All policies and plugins support hot-swapping and hot-updating, ensuring that service traffic is not affected during configuration changes.

What to do next

Learn about the gateway types and billing of AI Gateway.

Create a gateway instance to experience the capabilities of AI Gateway.