
10 Essential Capabilities of an AI Gateway

The article discusses the key capabilities and importance of AI gateways in managing and optimizing large model applications.

It is now an industry consensus that the primary battleground for large models has shifted from training to inference. More and more companies are building large model applications, for both internal needs and external commercial offerings, and deploying them in production environments. Along the way, a series of new demands has emerged that differs from the proof-of-concept stage of early large model applications. These demands center on scalability and secure usage, and the AI gateway has become one of the most discussed components of AI infrastructure.

We believe an AI gateway is not a new category independent of the API gateway, but an extension tailored to the new requirements of AI scenarios: an evolution of, and heir to, the API gateway. We therefore classify the capabilities of AI gateways from the perspective of APIs to build a common understanding.

1. Inheritance from API Gateways

Because API gateways provide many capabilities to several distinct roles, we categorize those capabilities by their users, covering three major scenarios: API development, API supply, and API consumption. These correspond, respectively, to the teams that develop API interfaces, the teams that build and operate API platforms, and the external callers of those platforms.

API Development Scenario

API First means defining API specifications before writing code. Rather than coding first and documenting APIs afterwards, API First emphasizes designing and developing API interfaces before building the application. APIs are treated as core architectural components of the system, and modularity is achieved through well-defined interface specifications. For example, products on public clouds all provide API invocation methods, and platforms such as WeChat Mini Programs and the DingTalk Open Platform offer API interfaces to developers. Like LEGO bricks, such a modular system enables flexible combinations of services through standardized interfaces, enhancing scalability and maintainability and thus improving ecosystem efficiency.

API Supply Scenario

The API Supply Scenario refers to the process by which API providers (such as enterprises, platforms, or services) expose data or functionality through standardized interfaces. The core focus is on creating, managing, and maintaining APIs to ensure their availability, security, and efficiency. Key capabilities include:

API Security: Protecting APIs from various security threats, ensuring that only authorized users and applications can access the API, and safeguarding data confidentiality, integrity, and availability during transmission and storage. Examples include identity authentication, authorization management, data encryption/decryption, and anti-attack mechanisms.
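
As a minimal illustration, the Go sketch below shows the core of gateway-side key authentication: unknown callers are rejected before any request reaches the backend. The header name, keys, and caller table are hypothetical placeholders, not a real gateway's API.

```go
// A minimal sketch of gateway-side API key authentication. Real gateways
// add key hashing, expiry, and per-key permissions on top of this.
package main

import "net/http"

// validKeys maps issued API keys to caller identities (hypothetical data).
var validKeys = map[string]string{
	"key-alice-123": "alice",
	"key-bob-456":   "bob",
}

// authenticate rejects requests whose X-Api-Key header is missing or
// unknown, so unauthorized traffic never reaches the backend service.
func authenticate(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		caller, ok := validKeys[r.Header.Get("X-Api-Key")]
		if !ok {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		r.Header.Set("X-Caller-Id", caller) // pass identity downstream
		next.ServeHTTP(w, r)
	})
}

func main() {
	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello, " + r.Header.Get("X-Caller-Id")))
	})
	http.ListenAndServe(":8080", authenticate(backend))
}
```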

Gray Release: A strategy for gradually introducing new API versions or features in production environments. It allows a portion of users or request traffic to be directed to the new version of the API while keeping the rest on the old version. This enables testing and validation of the new API without affecting overall system stability and user experience.
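
The routing decision behind a gray release can be very small. The sketch below assumes a hypothetical X-Gray opt-in header for internal testers plus a random percentage split for everyone else; both the header and the upstream names are illustrative.

```go
// A minimal sketch of gray release routing: opted-in testers always hit
// the new version, and a small random share of remaining traffic is split
// onto it as well.
package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

const grayPercent = 10 // share of ordinary traffic sent to the new version

func pickUpstream(r *http.Request) string {
	if r.Header.Get("X-Gray") == "on" {
		return "api-v2" // internal testers are pinned to the new version
	}
	if rand.Intn(100) < grayPercent {
		return "api-v2"
	}
	return "api-v1"
}

func main() {
	req, _ := http.NewRequest("GET", "http://example.com/orders", nil)
	req.Header.Set("X-Gray", "on")
	fmt.Println(pickUpstream(req)) // api-v2
}
```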

Caching: Temporarily storing API response results in a cache server. When identical requests arrive again, the response results are retrieved directly from the cache instead of re-accessing the backend server, thereby improving API response speed and system performance.
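
A response cache at the gateway is essentially a keyed store with a TTL. The sketch below is a minimal in-memory version under that assumption; a production gateway would typically use a shared cache and respect Cache-Control semantics.

```go
// A minimal sketch of gateway response caching, keyed by method and path
// and reused only within a short TTL.
package main

import (
	"fmt"
	"sync"
	"time"
)

type cacheEntry struct {
	body    []byte
	expires time.Time
}

type responseCache struct {
	mu      sync.Mutex
	entries map[string]cacheEntry
	ttl     time.Duration
}

func newResponseCache(ttl time.Duration) *responseCache {
	return &responseCache{entries: map[string]cacheEntry{}, ttl: ttl}
}

// get returns a cached body if present and not expired.
func (c *responseCache) get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.expires) {
		return nil, false
	}
	return e.body, true
}

// put stores a fresh response body under key.
func (c *responseCache) put(key string, body []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = cacheEntry{body: body, expires: time.Now().Add(c.ttl)}
}

func main() {
	c := newResponseCache(30 * time.Second)
	c.put("GET /v1/models", []byte(`{"data": ["model-a"]}`))
	if body, ok := c.get("GET /v1/models"); ok {
		fmt.Println("cache hit:", string(body)) // backend untouched
	}
}
```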

API Consumption Scenario

The API Consumption Scenario refers to the process where consumers (such as applications or developers) integrate external APIs to quickly implement functionalities or acquire data. The core focus is on utilizing the capabilities or data provided by the platform to meet business needs. Key aspects include:

Call Auditing: Comprehensive recording, monitoring, and analysis of API call activities. It meticulously records details of each API call, including call time, caller identity, called API interface, request parameters, response results, and response time.
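
The sketch below shows call auditing as a thin middleware that records caller identity, method, path, status, and latency for every request. A real deployment would ship these records to durable storage rather than stdout, and the X-Caller-Id header is an assumed convention, not a standard.

```go
// A minimal sketch of call auditing at the gateway.
package main

import (
	"log"
	"net/http"
	"time"
)

// statusRecorder captures the status code written by the backend handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// audit logs one structured record per request after it completes.
func audit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, r)
		log.Printf("caller=%s method=%s path=%s status=%d latency=%s",
			r.Header.Get("X-Caller-Id"), r.Method, r.URL.Path,
			rec.status, time.Since(start))
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", audit(ok))
}
```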

Caller Quota Throttling: Mechanisms set by the API gateway to limit the number of API calls, traffic size, or resource usage by each caller (such as users, applications, IP addresses) within a certain period.
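
A minimal version of per-caller throttling is a counter per caller that resets each window, as sketched below. Counts are kept in gateway memory here; a clustered gateway would track them in a shared store such as Redis.

```go
// A minimal sketch of per-caller throttling with a fixed time window.
package main

import (
	"fmt"
	"sync"
	"time"
)

type windowLimiter struct {
	mu     sync.Mutex
	counts map[string]int
	limit  int
}

func newWindowLimiter(limit int, window time.Duration) *windowLimiter {
	l := &windowLimiter{counts: map[string]int{}, limit: limit}
	go func() { // reset all counters at the start of each window
		for range time.Tick(window) {
			l.mu.Lock()
			l.counts = map[string]int{}
			l.mu.Unlock()
		}
	}()
	return l
}

// allow reports whether the caller still has quota in the current window.
func (l *windowLimiter) allow(caller string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.counts[caller] >= l.limit {
		return false
	}
	l.counts[caller]++
	return true
}

func main() {
	l := newWindowLimiter(2, time.Minute)
	for i := 0; i < 3; i++ {
		fmt.Println("alice allowed:", l.allow("alice")) // true, true, false
	}
}
```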

Backend Protection Throttling: Managing and controlling API access traffic to ensure stable and efficient operation of APIs. This prevents system crashes and performance degradation due to excessive or abnormal traffic. It includes load balancing, rate limiting, fallbacks, circuit breakers, and other capabilities.
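
Circuit breaking is one of the backend-protection mechanisms listed above. The sketch below trips after a run of consecutive failures and fails fast during a cooldown; the threshold and timing are illustrative, not recommendations.

```go
// A minimal sketch of a circuit breaker protecting the backend.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	openUntil time.Time
	cooldown  time.Duration
}

var errOpen = errors.New("circuit open: request shed at the gateway")

// call forwards fn to the backend unless the circuit is currently open.
func (b *breaker) call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return errOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openUntil = time.Now().Add(b.cooldown) // trip the breaker
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // a success ends the failure streak
	return nil
}

func main() {
	b := &breaker{threshold: 3, cooldown: 10 * time.Second}
	failing := func() error { return errors.New("backend timeout") }
	for i := 0; i < 4; i++ {
		fmt.Println(b.call(failing)) // 4th call is shed without a forward
	}
}
```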

2. Evolution of API Gateways

In the context of large models, richer demands emerge in development, supply, and consumption scenarios.

Large Model API Development Scenario

API First, or treating APIs as first-class citizens, is no longer just a slogan; it is gradually becoming a practical application development standard. Developing and operating agents requires calling APIs, and providing services through open platforms likewise requires offering APIs. An API gateway can cover every lifecycle stage of an API, from design, development, testing, and release through monetization, operations monitoring, and security management to decommissioning, and enterprise demand for these capabilities is becoming more pronounced. On top of the API gateway, plugins can further raise agent development efficiency. Examples include AI prompt templates [1], AI agents [2], and JSON formatting [3], which structures AI responses according to a default or user-configured JSON schema.
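
To make the JSON-formatting idea concrete, the sketch below checks that a model response parses as JSON and re-asks once with a corrective instruction if it does not. callModel is a hypothetical stand-in for the actual upstream call, and real plugins validate against a full JSON schema rather than mere parseability.

```go
// A minimal sketch of the idea behind a JSON-formatting plugin.
package main

import (
	"encoding/json"
	"fmt"
)

// callModel is a hypothetical stand-in for forwarding a prompt to the
// upstream model and returning its raw text.
func callModel(prompt string) string {
	return `{"city": "Hangzhou", "temp_c": 21}`
}

// structuredAnswer retries once when the first response is not valid JSON.
func structuredAnswer(prompt string) (map[string]any, error) {
	raw := callModel(prompt)
	var out map[string]any
	if err := json.Unmarshal([]byte(raw), &out); err == nil {
		return out, nil
	}
	raw = callModel(prompt + "\nRespond with valid JSON only.")
	if err := json.Unmarshal([]byte(raw), &out); err != nil {
		return nil, fmt.Errorf("model did not return valid JSON: %w", err)
	}
	return out, nil
}

func main() {
	out, err := structuredAnswer("What is the weather in Hangzhou?")
	fmt.Println(out, err)
}
```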

Large Model API Supply Scenario

Flexible Multi-Model Switching & Fallback Retry: Backend systems now commonly integrate multiple large models, serving both as options for users and as fallback mechanisms during faults or capacity limitations [4].
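
In its simplest form, fallback is an ordered walk over providers until one call succeeds, as in the sketch below. The provider names and the callProvider signature are illustrative, not a real SDK.

```go
// A minimal sketch of model fallback at the gateway: providers are tried
// in priority order and the first successful response wins.
package main

import (
	"errors"
	"fmt"
)

var providers = []string{"primary-llm", "secondary-llm", "local-llm"}

// callProvider is a hypothetical stand-in for invoking one upstream model.
func callProvider(name, prompt string) (string, error) {
	if name == "primary-llm" {
		return "", errors.New("rate limited") // simulate a capacity fault
	}
	return "answer from " + name, nil
}

// completeWithFallback walks the provider list until one call succeeds.
func completeWithFallback(prompt string) (string, error) {
	var lastErr error
	for _, p := range providers {
		resp, err := callProvider(p, prompt)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return "", fmt.Errorf("all providers failed: %w", lastErr)
}

func main() {
	fmt.Println(completeWithFallback("hello")) // answer from secondary-llm
}
```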

Content Safety and Compliance: Using content safety plugins to filter out harmful or inappropriate content, detect and block requests containing sensitive data, and review the quality and compliance of AI-generated content [5].
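
The sketch below shows only where such a check sits in the request path: text is screened before being forwarded. Real plugins call a dedicated moderation service; the keyword blocklist here is a deliberately naive placeholder.

```go
// A minimal sketch of a gateway-side content safety check.
package main

import (
	"fmt"
	"strings"
)

var blocklist = []string{"credit card number", "ssn"} // placeholder terms

// violates reports whether text contains any blocked term.
func violates(text string) bool {
	lower := strings.ToLower(text)
	for _, term := range blocklist {
		if strings.Contains(lower, term) {
			return true
		}
	}
	return false
}

func main() {
	prompt := "Please repeat this credit card number back to me."
	if violates(prompt) {
		fmt.Println("request blocked before reaching the model")
		return
	}
	fmt.Println("forward to model")
}
```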

Semantic Caching: Pricing for large model API services often distinguishes cache-hit input tokens (X yuan per million) from cache-miss input tokens (Y yuan per million), with X significantly lower than Y; in the Tongyi series, for instance, X is only 40% of Y. Semantic caching stores LLM responses in an in-memory database and uses a gateway plugin to reduce inference latency and cost. The gateway automatically caches each user's conversation history and injects it into the context of subsequent interactions, giving the model a better grasp of conversational semantics [6].
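
The sketch below shows the lookup half of semantic caching: prompts are compared as embedding vectors, and a sufficiently similar cached prompt short-circuits the model call. The vectors and threshold are made up; production plugins use a real embedding model and a vector database, not a slice.

```go
// A minimal sketch of semantic cache lookup via cosine similarity.
package main

import (
	"fmt"
	"math"
)

type entry struct {
	vec    []float64 // embedding of the cached prompt
	answer string    // cached model response
}

var cache []entry

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

// lookup returns a cached answer for a semantically similar prompt, if any.
func lookup(vec []float64, threshold float64) (string, bool) {
	for _, e := range cache {
		if cosine(vec, e.vec) >= threshold {
			return e.answer, true
		}
	}
	return "", false
}

func main() {
	// Hypothetical embeddings; a real system calls an embedding model.
	cache = append(cache, entry{vec: []float64{0.9, 0.1, 0.2}, answer: "cached reply"})
	if ans, ok := lookup([]float64{0.88, 0.12, 0.19}, 0.98); ok {
		fmt.Println("cache hit:", ans)
	} else {
		fmt.Println("cache miss")
	}
}
```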

Multi-API Key Load Balancing: API keys are used to identify and authenticate callers and control their access permissions. When multiple API keys exist, the API gateway distributes API requests evenly or according to specific rules across these keys for processing.
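
Round-robin is the simplest distribution rule across keys. The sketch below rotates through a pool of placeholder keys, spreading request volume (and any per-key rate limits) evenly, and is safe under concurrency.

```go
// A minimal sketch of round-robin load balancing across upstream API keys.
package main

import (
	"fmt"
	"sync/atomic"
)

type keyPool struct {
	keys []string
	next atomic.Uint64
}

// pick returns the next key in round-robin order; safe for concurrent use.
func (p *keyPool) pick() string {
	n := p.next.Add(1) - 1
	return p.keys[n%uint64(len(p.keys))]
}

func main() {
	pool := &keyPool{keys: []string{"sk-key-a", "sk-key-b", "sk-key-c"}}
	for i := 0; i < 5; i++ {
		fmt.Println("request", i, "uses", pool.pick())
	}
}
```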

Large Model API Consumption Scenario

Token Quota Management and Throttling: Tokens are the common unit of measurement for large model applications, precisely quantifying the amount of data processed. Just as traditional gateways manage service access volumes, AI gateways need token management capabilities, including usage monitoring and throttling [7] as well as precise quota limits for calling tenants [8].
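
Token accounting at the gateway can be as simple as charging reported usage against a per-tenant budget, as the sketch below shows. The figures are illustrative, and the usage numbers would come from the model response (for example, an OpenAI-style usage field).

```go
// A minimal sketch of token-based quota accounting per tenant.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type tokenQuota struct {
	mu     sync.Mutex
	used   map[string]int64 // tokens consumed per tenant this period
	budget int64            // allowed tokens per tenant per period
}

var errQuota = errors.New("token quota exhausted")

// charge records usage and rejects the tenant once its budget is spent.
func (q *tokenQuota) charge(tenant string, tokens int64) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.used[tenant]+tokens > q.budget {
		return errQuota
	}
	q.used[tenant] += tokens
	return nil
}

func main() {
	q := &tokenQuota{used: map[string]int64{}, budget: 1000}
	fmt.Println(q.charge("tenant-a", 800)) // <nil>
	fmt.Println(q.charge("tenant-a", 300)) // token quota exhausted
}
```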

Traffic Gray Release: Both base models and the applications built on them are iterating rapidly to improve generation quality, which means frequent changes. High-frequency model iteration relies heavily on A/B testing and gray release capabilities. As the entry point for traffic, the AI gateway plays a critical role in gray release and monitoring, including gray labeling and entry-level metrics such as latency and success rate.
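
The sketch below pins each session to a model variant via a stable hash and attaches a gray label, so per-variant latency and success-rate metrics can be compared. The model names and candidate share are illustrative.

```go
// A minimal sketch of model gray release with gray labeling.
package main

import (
	"fmt"
	"hash/fnv"
)

const candidateShare = 5 // percent of sessions on the candidate model

// assignModel hashes the session ID so each session consistently sees
// the same variant, and returns a label for metrics attribution.
func assignModel(sessionID string) (model, grayLabel string) {
	h := fnv.New32a()
	h.Write([]byte(sessionID))
	if h.Sum32()%100 < candidateShare {
		return "model-candidate", "gray"
	}
	return "model-stable", "base"
}

func main() {
	for _, s := range []string{"sess-1", "sess-2", "sess-3"} {
		m, label := assignModel(s)
		// The label would be attached to latency and success-rate metrics.
		fmt.Printf("%s -> model=%s label=%s\n", s, m, label)
	}
}
```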

Call Cost Auditing: The computational resources consumed by large model calls far exceed those of web application requests, making cost control more critical. This includes both direct economic costs, such as fees paid for using third-party API services or internal resource consumption (servers, storage, bandwidth), and indirect costs, such as resource costs due to API call errors.
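
Cost auditing follows directly from token accounting: multiply usage by per-model unit prices and accumulate per caller, as sketched below. All prices here are made-up placeholders, not real list prices.

```go
// A minimal sketch of per-call cost auditing from token usage.
package main

import "fmt"

// price per million tokens, split by input/output (placeholder values).
type price struct{ inPerM, outPerM float64 }

var prices = map[string]price{
	"model-a": {inPerM: 2.0, outPerM: 6.0},
	"model-b": {inPerM: 0.5, outPerM: 1.5},
}

var spend = map[string]float64{} // accumulated cost per caller

// recordCost converts one call's token usage into spend for its caller.
func recordCost(caller, model string, inTokens, outTokens int) float64 {
	p := prices[model]
	cost := float64(inTokens)/1e6*p.inPerM + float64(outTokens)/1e6*p.outPerM
	spend[caller] += cost
	return cost
}

func main() {
	c := recordCost("team-x", "model-a", 120_000, 30_000)
	fmt.Printf("this call: %.4f, total for team-x: %.4f\n", c, spend["team-x"])
}
```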

3. Why Implement These Capabilities on the Gateway Rather Than the Large Model Service Layer

Architecture Design and Decoupling

Functional Separation: The gateway and the large model service layer each have distinct core functions. The large model service layer focuses on executing complex computational tasks such as natural language processing and image recognition, providing intelligent responses to users. In contrast, the primary function of the API gateway is to manage API access, including security authentication, traffic control, protocol conversion, etc. Implementing API gateway capabilities at the gateway level achieves clear functional separation, making the responsibilities of each component more explicit, facilitating system development, maintenance, and scalability.

System Decoupling: Implementing API gateway functionalities within the large model service layer would tightly couple the large model services with API management features. Adjustments to API management strategies (such as changing authentication methods or adjusting traffic limit rules) could affect the stability and performance of the large model services. By implementing API gateway capabilities at the gateway level, the large model services can be decoupled from API management, allowing them to develop and upgrade independently, reducing system complexity and maintenance costs.

Performance Optimization

Reducing Large Model Load: Large models typically require significant computational resources and memory to run, already consuming substantial system resources for complex inference tasks. Implementing additional API gateway functionalities like authentication, throttling, and caching within the large model service layer would further increase the load, impacting its processing speed and response time. By handling these functionalities at the gateway level, requests can be preprocessed and filtered before reaching the large model service layer, reducing unnecessary requests and improving the performance and efficiency of the large model.

Enhancing Concurrent Processing Capability: The gateway can distribute a large number of API requests evenly across multiple large model service instances using load balancing techniques, enhancing the system's concurrent processing capability. If each large model service instance had to handle API management tasks independently, it would limit the system's concurrency. Centralizing these tasks at the gateway allows better handling of high-concurrency scenarios.

Security Assurance

Unified Security Protection: As the entry point of the system, the gateway can perform comprehensive security checks on all incoming API requests, forming a unified line of defense. Implementing identity verification, authorization, and anti-attack security features at the gateway effectively prevents malicious requests from reaching the large model service layer, protecting the large model and related data. Implementing these security features within the large model service layer could lead to security vulnerabilities due to the dispersed nature of the services.

Data Protection: The gateway can encrypt and anonymize data in API requests and responses, ensuring data security during transmission and storage. Handling these data protection tasks within the large model service layer could increase complexity and computational burden. Centralized handling at the gateway better protects user-sensitive information and avoids security risks associated with the large model directly accessing sensitive data.

Scalability and Flexibility

Easy Integration of New Features: As business evolves, new features may need to be added to API management, such as supporting new security protocols or introducing new traffic control algorithms. Implementing API gateway capabilities at the gateway level makes it easier to integrate these new features without extensive modifications to the large model service layer. This allows for rapid response to business needs and improves system scalability.

Support for Multiple Model Access: In practical applications, multiple different large model services might be used simultaneously. The gateway can serve as a unified entry point, providing consistent API management services for different large model services, simplifying the management and scheduling of multiple large models. Implementing API gateway functionalities separately in each large model service layer would increase system complexity and management difficulty.

Observability and Monitoring

Centralized Monitoring and Analysis: The gateway can centrally monitor and analyze all API requests, collecting various metrics such as request response times, call frequencies, error rates, etc. Analyzing this data helps identify issues like performance bottlenecks and security vulnerabilities, enabling timely optimization and fixes. Implementing monitoring functionalities within the large model service layer would make it difficult to comprehensively understand and analyze the entire system's API call situation.

Fault Diagnosis and Localization: When API call failures occur, it is easier to diagnose and locate the issue at the gateway level. The gateway records detailed information about each API request, including the source, request parameters, and response results. Analyzing this information quickly identifies the cause and location of the failure, reducing the time and cost required for troubleshooting and repair.

4. Future Directions for AI Gateways

Thanks to the dynamic extension capabilities of Wasm plugins, Higress is evolving rapidly in the AI era. All of the large model API management capabilities described above are already available in open-source Higress and in Alibaba Cloud's cloud-native API gateway.

Additionally, Alibaba Cloud's cloud-native API gateway provides AI API management capabilities, making it easier and more efficient to manage APIs in the AI era.


Going forward, we will focus on standardized content output, reducing hallucinations, and improving stability and usability. We welcome developers to join our community, share their needs and challenges, and work with us to build more user-friendly open-source products and commercial services.

[1] https://higress.cn/docs/latest/plugins/ai/api-dev/ai-prompt-template/

[2] https://higress.cn/docs/latest/plugins/ai/api-dev/ai-agent/

[3] https://higress.cn/docs/latest/plugins/ai/api-dev/ai-json-resp/

[4] https://higress.cn/docs/latest/plugins/ai/api-provider/ai-proxy/

[5] https://higress.cn/docs/latest/plugins/ai/api-provider/ai-security-guard/

[6] https://higress.cn/docs/latest/plugins/ai/api-provider/ai-cache/

[7] https://higress.cn/docs/latest/plugins/ai/api-consumer/ai-token-ratelimit/

[8] https://higress.cn/docs/latest/plugins/ai/api-consumer/ai-quota/
