We previously shared "The Ten Essential Capabilities of AI Gateways" from the supplier's perspective. Today, we introduce eight common application scenarios of AI gateways from the consumer's perspective. Since most enterprises currently deploy large models mainly for internal use, most of the scenarios in this article stem from internal demand; when models are exposed as external services, the reliance on AI gateways only grows, in both the granularity and the intensity of the requirements.
No single large model dominates the market, so enterprises often adopt a multi-model strategy: employees choose among various large models on the front end and switch backend model services freely. For instance, a company may deploy multiple large models internally, such as DeepSeek, Qwen, or self-built models, letting employees pick the one best suited to the task and obtain richer, more varied generative results. The more diversified an enterprise's business, the stronger its demand for multi-model services.
● Multi-modal business integration, where enterprises need to process text, images, audio, 3D, etc. Research and product teams require models with strong inference capabilities; customer service, marketing, and graphic design teams have more scene-based needs for image large models; industrial design and film production teams have needs for audio and video large models.
● Enterprises operating in multiple vertical sectors need to call specialized models based on industry characteristics. Particularly, companies on the supply chain side often serve multiple industries, which may involve the demand for several vertical large models.
● Collaborative scenarios for complex tasks, where a single task requires multiple models working together: having several large models generate content cooperatively often produces the best results.
● Scenarios with dual requirements for safety and efficiency, such as those in medical institutions, where proprietary private models are used to analyze patient data, while general models are used for other unrelated needs to avoid mixing sensitive and non-sensitive data when writing to the database.
Alibaba Cloud's native API Gateway offers AI gateways that support switching between backend models based on model names, enabling the same interface to connect to multiple large model services. These services can be deployed on different platforms, such as Bailian, PAI, or IDC self-built clusters. Even when different models belong to different development and operations teams, there is no additional collaboration cost.
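The model-name-based routing described above can be sketched in a few lines. This is a minimal illustration, not the gateway's actual implementation; the backend URLs and model names below are hypothetical placeholders.

```python
# Minimal sketch of model-name-based routing at an AI gateway.
# All endpoints and model names are hypothetical placeholders.
BACKENDS = {
    "deepseek-r1": "http://pai-cluster.internal/v1",   # deployed on PAI
    "qwen-max": "http://bailian-proxy.internal/v1",    # hosted on Bailian
    "inhouse-7b": "http://idc-gpu-01.internal/v1",     # IDC self-built model
}

def route(model_name: str) -> str:
    """Return the backend base URL for a given model name."""
    try:
        return BACKENDS[model_name]
    except KeyError:
        raise ValueError(f"unknown model: {model_name}")
```

Because the gateway exposes one interface regardless of backend, callers only change the `model` field in their request to switch providers.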
Multi-tenant model service sub-renting scenarios: When companies provide shared large model services to different departments or teams, they distinguish tenants through API Keys to ensure data isolation and permission control. Specific requirements include:
● Assigning independent API Keys to each tenant to control their access permissions and resource allocation, e.g., Department A has a limit of 20 calls per person per day, while Department B has a limit of 30 calls per day.
● Supporting tenant-defined model parameters (such as temperature and output length), with permissions verified at the gateway.
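The per-tenant API Key checks above can be sketched as follows. The key names, department quotas, and `Tenant` structure are illustrative assumptions, not the gateway's real data model.

```python
from dataclasses import dataclass

# Hypothetical per-tenant records: each API Key maps to a daily call quota.
@dataclass
class Tenant:
    name: str
    daily_limit: int
    used_today: int = 0

TENANTS = {
    "key-dept-a": Tenant("Department A", daily_limit=20),
    "key-dept-b": Tenant("Department B", daily_limit=30),
}

def check_and_count(api_key: str) -> bool:
    """Reject unknown keys and enforce the tenant's daily call quota."""
    tenant = TENANTS.get(api_key)
    if tenant is None or tenant.used_today >= tenant.daily_limit:
        return False
    tenant.used_today += 1
    return True
```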
Internal role-based access control: Different internal roles need differentiated access to model capabilities. Specific requirements include:
● Restricting sensitive functionalities (such as model fine-tuning and data export) based on RBAC (Role-Based Access Control).
● For cost reasons, multi-modal large models are made available only to the design department.
● Recording operation logs and associating them with user identities to meet internal audit requirements. For instance, financial enterprises restrict risk assessment models to remain within the risk control department to prevent abuse by regular employees.
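The RBAC checks above reduce to a role-to-capability lookup. The role names and capability labels here are hypothetical examples, not a prescribed scheme.

```python
# Hypothetical role -> allowed-capability mapping for gateway-side RBAC.
ROLE_CAPABILITIES = {
    "designer": {"chat", "image-generation"},      # multi-modal access
    "risk-analyst": {"chat", "risk-model"},        # risk assessment models
    "employee": {"chat"},                          # text chat only
}

def is_allowed(role: str, capability: str) -> bool:
    """Return True if the given role may use the given model capability."""
    return capability in ROLE_CAPABILITIES.get(role, set())
```

In practice each decision would also be written to an audit log together with the user identity, matching the audit requirement above.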
Alibaba Cloud's native API Gateway provides AI gateways that support route-level and consumer-level authentication, enabling access control, security, and policy management for API access across the full lifecycle of generating, distributing, authorizing, enabling, and verifying API Keys, so that only authorized requests can reach the service.
● Identity Authentication: Ensuring that the requesting party is a registered/authorized user or system.
● Risk Interception: Preventing malicious attacks, illegal calls, and resource abuse.
● Compliance Assurance: Meeting data security regulations and enterprise audit requirements.
● Cost Control: Implementing precise billing and API quota management based on authentication.
● Exceptions caused by inherent model characteristics: large model outputs fluctuate probabilistically, so results can be unstable and random; releasing a new model version can also cause traffic loss.
● Exceptions caused by improper user behavior: User requests with parameters that do not conform to API specifications may lead to timeouts or interruptions, or inputs containing maliciously constructed prompts may trigger model safety protection mechanisms, returning empty results or error codes.
● Resource and performance limitations: Excessive request frequency triggering throttling strategies may make services unavailable, while long requests occupy too much memory, causing subsequent requests to be blocked and ultimately leading to timeouts.
● Dependency service failures: External APIs, such as inaccessible RAG retrieval databases, may hinder the model from obtaining necessary context.
Alibaba Cloud's native API Gateway provides AI gateways that support fallback to specified other large model services when service requests for one large model fail, ensuring the robustness and continuity of service.
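The fallback behavior can be sketched as an ordered list of backends tried in turn. This is a simplified stand-in for the gateway's actual mechanism; the backend callables are placeholders for real model services.

```python
def call_with_fallback(prompt, backends):
    """Try each model backend in order; return the first successful answer.

    `backends` is a list of callables standing in for model services;
    each may raise an exception on failure.
    """
    last_err = None
    for backend in backends:
        try:
            return backend(prompt)
        except Exception as err:
            last_err = err  # remember the failure and try the next backend
    raise RuntimeError("all model backends failed") from last_err
```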
Although internal use may not generate high concurrency, rate limiting allows a more economical hardware configuration. For instance, a company with 10,000 employees does not need hardware sized for 10,000 simultaneous online users; capacity for 7,000 is sufficient, with the excess throttled, avoiding idle resources. Other needs include:
● Enhancing resource management: The consumption of computing resources by large models can be uncontrollable; rate limiting can prevent system overload, ensuring all users receive stable performance, especially during peak periods.
● Specific user stratification: Token rate limiting can be based on ConsumerId or API Key.
● Preventing malicious use: limiting token counts reduces junk requests and attacks, protecting resources from abuse.
Alibaba Cloud's native API Gateway offers AI gateways that include the ai-token-ratelimit plugin, providing token rate limiting based on specific key values sourced from URL parameters, HTTP request headers, client IP addresses, consumer names, and cookie key names.
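A toy fixed-window version of token rate limiting keyed by consumer name, to make the idea concrete. The real ai-token-ratelimit plugin is configured declaratively at the gateway; this Python sketch only illustrates the counting logic.

```python
import time
from collections import defaultdict

class TokenRateLimiter:
    """Fixed-window token limiter keyed by consumer name (simplified sketch)."""

    def __init__(self, tokens_per_window, window_seconds=60.0):
        self.limit = tokens_per_window
        self.window = window_seconds
        # key -> [window_start_time, tokens_used_in_window]
        self.state = defaultdict(lambda: [0.0, 0])

    def allow(self, key, tokens, now=None):
        """Return True if `tokens` more tokens fit in this key's window."""
        now = time.monotonic() if now is None else now
        start, used = self.state[key]
        if now - start >= self.window:      # window expired: reset counter
            start, used = now, 0
        if used + tokens > self.limit:
            self.state[key] = [start, used]
            return False
        self.state[key] = [start, used + tokens]
        return True
```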
Enterprise work settings are serious, and self-built large models need to ensure the security and compliance of generated content: filtering out harmful or inappropriate content, detecting and blocking requests that contain sensitive data, and auditing AI-generated content for quality and compliance.
● Sensitive data processing in the financial industry: Auditing user-input financial transaction instructions and investment consulting content to prevent fraud, money laundering, and other violations.
● Medical health information exchange: Generating electronic medical records while preventing leaks of patient privacy (such as ID numbers, diagnostic records) and ensuring AI-generated medical advice complies with relevant regulations. Utilizing multi-modal large models to identify sensitive information in medical imaging and combining it with compliance rule databases for automated interception.
● Social media and UGC content management: Real-time auditing of user-published text and video content to intercept pornographic, violent, or false information. Conducting compliance checks on AI-generated recommendation content (such as short video titles and comments).
● Public service platform interaction: Auditing public submissions for government consultations to prevent malicious attacks or the spread of sensitive information, ensuring AI-generated policy interpretations and service guidelines comply with relevant regulations.
● E-commerce and live streaming platform risk control: Auditing product descriptions and live streaming content to intercept false advertising and prohibited information while conducting advertising regulation and compliance checks on AI-generated marketing copy.
Alibaba Cloud's native API Gateway offers AI gateways integrated with Alibaba Cloud's content security service, providing separate moderation of the input prompts sent to large language models and of the generated text. Its functions include:
● Preventing attacks: Validating inputs can stop harmful prompt injections, preventing the model from generating harmful content.
● Maintaining model integrity: Avoiding input manipulation that may lead to erroneous or biased outputs.
● User safety: Ensuring outputs are free from harmful or misleading content, protecting users from negative effects.
● Content moderation: Filtering out inappropriate content, such as hate speech or vulgar language, especially in public applications.
● Legal compliance: Ensuring outputs meet legal and ethical standards, particularly in medical and financial fields.
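A toy illustration of the input/output moderation flow above. A production gateway would call Alibaba Cloud's content security service rather than match a local blocklist; the blocklist terms here are purely illustrative.

```python
# Toy stand-in for a content-safety check. A real gateway would call the
# content security service for both the prompt and the generated reply.
BLOCKLIST = {"credit card number", "patient id"}  # illustrative terms only

def moderate(text):
    """Return True if the text passes the (toy) safety check."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)

def guarded_generate(prompt, model):
    """Moderate the prompt before the call and the reply after it."""
    if not moderate(prompt):
        return "[input blocked]"
    reply = model(prompt)
    return reply if moderate(reply) else "[output blocked]"
```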
Pricing for large model API services distinguishes a lower cost per million input tokens on cache hits (X) from a higher cost on misses (Y), with X typically much lower than Y; for the Tongyi series, for instance, X is only 40% of Y. By caching large model responses in an in-memory database via a gateway plugin, both inference latency and cost can be reduced. The gateway layer can also automatically cache users' historical conversations and inject them into the context of subsequent dialogues, improving the model's grasp of conversational semantics. Examples include:
● High-frequency repetitive query scenarios: In customer service systems, users often ask repetitive questions (e.g., "How do I reset my password?" "What is the refund process?"). By caching responses to frequently asked questions, redundant calls to the model can be avoided, lowering invocation costs.
● Fixed contextual repeated call scenarios: In legal document analysis (e.g., interpreting contract terms) or educational document parsing (e.g., knowledge point Q&A), the same long text may be queried multiple times. By caching the context, redundant data transmission and processing can be avoided, improving response speed and reducing invocation costs.
● Complex computation result reuse scenarios: In data analysis and generation scenarios (e.g., financial report summarization, scientific report generation), caching the results of multiple analyses on the same dataset can prevent unnecessary recalculation.
● In RAG (Retrieval-Augmented Generation) scenarios: Caching knowledge base retrieval results (e.g., company internal FAQs) can accelerate responses to subsequent similar queries.
Alibaba Cloud's native API Gateway offers AI gateways with extension points to cache request and response contents in Redis, supporting configuration of Redis service information and setting cache duration.
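The caching flow can be sketched with a dict standing in for Redis; a real deployment would issue the equivalent Redis commands (`GET`, `SETEX`) against the configured Redis service, with the cache duration as the TTL.

```python
import hashlib
import time

class ResponseCache:
    """Dict-backed stand-in for the Redis response cache a gateway plugin
    would use; keys are hashes of the prompt, entries expire after a TTL."""

    def __init__(self, ttl_seconds=3600.0):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (stored_at, response)

    def _key(self, prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt, now=None):
        """Return the cached response, or None on a miss or expired entry."""
        now = time.monotonic() if now is None else now
        entry = self.store.get(self._key(prompt))
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, prompt, response, now=None):
        now = time.monotonic() if now is None else now
        self.store[self._key(prompt)] = (now, response)
```

On a hit the gateway returns the cached response directly, skipping the model call entirely, which is where the latency and cost savings come from.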
Online search has become a standard feature for large models. If online search is not supported, or if only webpage titles, summaries, and keywords (rather than the full text) can be retrieved, the quality of the generated content drops significantly.
Alibaba Cloud's native API Gateway provides AI gateways that enhance online search by enabling full text retrieval of web pages. This includes:
● LLM Rewriting Queries: Utilizing LLMs to identify user intent and generate search commands can significantly enhance search results.
● Keyword Extraction: Different engines require different keywords; for example, many papers in Arxiv are in English, necessitating keywords to be in English.
● Domain Recognition: For instance, Arxiv categorizes various fields under computer science, physics, mathematics, biology, etc. Searching designated fields can improve search accuracy.
● Long Query Splitting: Long queries can be split into multiple short queries, enhancing search efficiency.
● High-quality Data: While Google, Bing, and Arxiv may return only article summaries, integration with Alibaba Cloud's Information Query Service (IQS) enables full-text retrieval, improving the quality of LLM-generated content.
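The steps above can be sketched as a small pipeline. The `rewrite`, `search`, and `fetch_full_text` callables are placeholders for the LLM query rewriter, the search engine, and the IQS full-text service; only the query-splitting step is implemented concretely.

```python
def split_long_query(query, max_terms=4):
    """Split a long query into shorter keyword chunks for separate searches."""
    terms = query.split()
    return [" ".join(terms[i:i + max_terms])
            for i in range(0, len(terms), max_terms)]

def search_pipeline(query, rewrite, search, fetch_full_text):
    """Sketch of the flow: rewrite -> split -> search -> full-text retrieval.

    `rewrite`, `search`, and `fetch_full_text` stand in for the LLM
    rewriter, the search engine, and the full-text service respectively.
    """
    results = []
    for sub_query in split_long_query(rewrite(query)):
        for hit in search(sub_query):
            results.append(fetch_full_text(hit))
    return results
```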
Observability is commonly found in cost control and stability scenarios. Due to the more sensitive and fragile resource consumption of large model applications compared to web applications, the demand for observability in cost control becomes more pronounced. In the absence of comprehensive observability capabilities, abnormal calls can result in losses amounting to tens of thousands or even hundreds of thousands of dollars.
In addition to traditional observability metrics like QPS, RT, and error rates, the observability of large models should also include:
● Token consumption statistics based on users (consumer).
● Token consumption statistics based on models.
● Rate limiting metrics: How many requests per unit time are intercepted due to throttling, and which consumers are being throttled.
● Cache hit rates.
● Security statistics: Risk type statistics and risk consumer statistics.
Alibaba Cloud API Gateway supports viewing gateway monitoring data, delivering and tracing logs in the native API Gateway, and monitoring REST API and interface data, helping teams manage and optimize interface performance more efficiently and improve overall service quality. In addition, through SLS it can aggregate ActionTrail events, cloud-product observability logs, LLM gateway access logs, detailed dialogue logs, prompt traces, and real-time inference call logs, building a complete, unified observability solution.
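The per-consumer and per-model token statistics listed above can be sketched with two counters. This is an illustrative aggregation only, not the gateway's actual metrics pipeline.

```python
from collections import Counter

class TokenMetrics:
    """Aggregates token consumption per consumer and per model (sketch)."""

    def __init__(self):
        self.by_consumer = Counter()
        self.by_model = Counter()

    def record(self, consumer, model, tokens):
        """Record one call's token usage under both dimensions."""
        self.by_consumer[consumer] += tokens
        self.by_model[model] += tokens
```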