
API Gateway: AI overview

Last Updated: Jan 07, 2025

This topic provides an overview of the AI capabilities of Cloud-native API Gateway.

In AI scenarios, traffic that passes through a Cloud-native API Gateway instance differs from other business traffic in the following ways:

  • Persistent connections: The WebSocket and Server-Sent Events (SSE) protocols that are commonly used in AI scenarios cause persistent connections to account for a large proportion of traffic. Configuration updates on a Cloud-native API Gateway instance have no adverse impact on persistent connections or your business.

  • High latency: The response latency of large language model (LLM) inference is much higher than that of common applications. This makes AI applications vulnerable to attacks: attackers can construct slow requests for concurrent attacks at a low cost, while the resulting overhead on the server is high.

  • Large bandwidth: Because LLM context is transmitted back and forth and latency is high, Cloud-native API Gateway instances in AI scenarios consume far more bandwidth than common applications. If an instance does not implement efficient stream processing and memory reclamation, memory usage may surge (see the streaming sketch after this list).
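The following Go sketch illustrates the stream processing point above: an SSE response is relayed chunk by chunk with a fixed-size buffer and an immediate flush per chunk, so per-connection memory stays constant no matter how long the LLM output is. This is a minimal illustration, not Higress source code; the backend URL and handler path are hypothetical.

```go
// sse_relay.go: a minimal sketch of streaming pass-through for SSE traffic.
package main

import (
	"io"
	"log"
	"net/http"
)

// relaySSE forwards an upstream SSE stream to the client chunk by chunk.
// Flushing after each read keeps memory usage bounded by the buffer size
// instead of the full response body, which matters for long LLM outputs.
func relaySSE(w http.ResponseWriter, r *http.Request) {
	upstream, err := http.Get("http://llm-backend.internal/v1/stream") // hypothetical backend
	if err != nil {
		http.Error(w, "upstream unavailable", http.StatusBadGateway)
		return
	}
	defer upstream.Body.Close()

	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	buf := make([]byte, 4096) // fixed 4 KB buffer: memory cost per connection is constant
	for {
		n, err := upstream.Body.Read(buf)
		if n > 0 {
			if _, werr := w.Write(buf[:n]); werr != nil {
				return // client disconnected
			}
			flusher.Flush() // push the chunk out immediately instead of buffering
		}
		if err != nil {
			if err != io.EOF {
				log.Printf("upstream read error: %v", err)
			}
			return
		}
	}
}

func main() {
	http.HandleFunc("/chat", relaySSE)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```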

Higress gateways have the following native advantages in handling this type of traffic:

  • Seamless rolling updates for persistent connections: NGINX Ingress gateways must reload to apply configuration updates, which may drop connections. In contrast, Higress gateways implement true rolling updates based on Envoy, and such updates have no negative impact on existing connections.

  • Security gateway: The security gateway capabilities of Higress provide HTTP flood protection across multiple dimensions, such as IP addresses and cookies. In AI scenarios, both queries-per-second (QPS) and token-based rate limiting are supported (see the rate limiting sketch after this list).

  • Efficient streaming transmission: Higress gateways support fully streamed forwarding. The data plane is Envoy, which is written in C++, so memory usage remains extremely low even in large-bandwidth scenarios. Compared with GPUs, memory is a cost-effective resource; however, if memory consumption is not properly controlled, out-of-memory (OOM) errors may occur, interrupting services and causing unpredictable loss.
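To make the token rate limiting idea above concrete, the following Go sketch grants each caller a replenishing budget of LLM tokens using the golang.org/x/time/rate package. This is an illustration of the concept, not the Higress plug-in implementation; the caller key, budget, and burst values are illustrative assumptions.

```go
// token_limit.go: a minimal sketch of per-caller token-based rate limiting.
package main

import (
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// tokenLimiter grants each caller a budget of LLM tokens per second.
type tokenLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	perSec   rate.Limit // tokens replenished per second
	burst    int        // maximum burst size
}

func newTokenLimiter(perSec rate.Limit, burst int) *tokenLimiter {
	return &tokenLimiter{
		limiters: make(map[string]*rate.Limiter),
		perSec:   perSec,
		burst:    burst,
	}
}

// allow reports whether the caller may consume n LLM tokens now.
func (t *tokenLimiter) allow(caller string, n int) bool {
	t.mu.Lock()
	l, ok := t.limiters[caller]
	if !ok {
		l = rate.NewLimiter(t.perSec, t.burst)
		t.limiters[caller] = l
	}
	t.mu.Unlock()
	return l.AllowN(time.Now(), n)
}

func main() {
	tl := newTokenLimiter(1000, 2000) // illustrative: 1,000 tokens/s per caller, burst of 2,000
	if !tl.allow("tenant-a", 350) {
		// over budget: reject the request, for example with HTTP 429
	}
}
```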

Cloud-native API Gateway provides a comprehensive set of out-of-the-box AI plug-ins that cover fields such as security protection, multi-model adaptation, observability, caching, and prompt engineering. The AI plug-ins provide the following core capabilities:

  • AI proxy plug-in: This plug-in is compatible with the protocols of multiple vendors and supports 15 LLM providers, covering most major LLM vendors.

  • AI content moderation plug-in: This plug-in can be connected to Alibaba Cloud Content Moderation to block harmful, misleading, discriminatory, or illegal content.

  • AI statistics plug-in: This plug-in calculates token throughput, generates Prometheus metric data in real time, and prints relevant information in access logs and in spans of Managed Service for OpenTelemetry.

  • AI rate limiting plug-in: This plug-in supports token-based rate limiting on the backend and allows you to configure specific upper limits on call quotas for caller tenants.

  • AI development plug-ins: These plug-ins provide capabilities such as LLM result caching and prompt decorators to help you develop and build AI applications (a caching sketch follows this list).
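To illustrate the LLM result caching capability, the following minimal Go sketch caches completed responses keyed by a hash of the model name and prompt, so repeated identical prompts can be answered without another round trip to the model. The key scheme, TTL, exact-match behavior, and the "qwen-max" model name are illustrative assumptions, not the plug-in's actual design.

```go
// result_cache.go: a minimal sketch of LLM result caching keyed by prompt hash.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
	"time"
)

type cachedResult struct {
	body    string
	expires time.Time
}

// resultCache stores completed LLM responses with a time-to-live.
type resultCache struct {
	mu   sync.RWMutex
	data map[string]cachedResult
	ttl  time.Duration
}

func newResultCache(ttl time.Duration) *resultCache {
	return &resultCache{data: make(map[string]cachedResult), ttl: ttl}
}

// key hashes the model name and prompt, so lookups are exact-match only.
func key(model, prompt string) string {
	sum := sha256.Sum256([]byte(model + "\x00" + prompt))
	return hex.EncodeToString(sum[:])
}

func (c *resultCache) get(model, prompt string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	entry, ok := c.data[key(model, prompt)]
	if !ok || time.Now().After(entry.expires) {
		return "", false
	}
	return entry.body, true
}

func (c *resultCache) put(model, prompt, body string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key(model, prompt)] = cachedResult{body: body, expires: time.Now().Add(c.ttl)}
}

func main() {
	cache := newResultCache(10 * time.Minute)
	cache.put("qwen-max", "What is an API gateway?", "An API gateway is ...")
	if body, ok := cache.get("qwen-max", "What is an API gateway?"); ok {
		println(body) // serve the cached answer without calling the model again
	}
}
```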