By Ananda Budi, Head of Solution Architecture, Alibaba Cloud Indonesia
Every company building on large language models eventually hits the same fork in the road: do we run inference on our own GPUs, or do we call an API and let someone else handle the hardware? It sounds like a simple cost question, but the answer touches on engineering capacity, data governance, product roadmap, and long-term vendor strategy. Getting it right can save hundreds of thousands of dollars a year. Getting it wrong can lock you into a path that becomes increasingly painful to reverse.
This post lays out a practical framework for thinking through the decision. No product pitches, just the tradeoffs as they actually play out in production.
Self-hosted GPU deployment means provisioning and managing your own inference infrastructure, whether on bare-metal servers, cloud GPU instances, or a managed Kubernetes cluster. You pick the GPU hardware (A100s, H100s, and so on), deploy your model of choice, and handle scaling, monitoring, and failover yourself. In return, you get full control over latency, throughput, data residency, and model selection.
Model-as-a-Service (MaaS) means consuming inference through a provider's API. You send tokens in, you get tokens back, and you pay per use. Alibaba Cloud Model Studio, OpenAI, Anthropic, Google, AWS Bedrock, and Azure AI are the major players. The operational burden is near zero, but you're constrained to whatever models the provider offers, and your data passes through their infrastructure.
Neither approach is categorically better. The right choice depends on where you sit across several dimensions.
GPU infrastructure has a high fixed cost and a low marginal cost. Once you've provisioned a cluster and your models are deployed, serving an additional million tokens is essentially free. This means self-hosting becomes increasingly attractive as your token volume grows. The crossover point varies, but most teams find that somewhere above 10 million tokens per month the economics start to shift decisively toward self-hosted.
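The fixed-versus-marginal cost structure can be sketched as two cost curves. The prices below are illustrative assumptions, not quotes from any vendor, and where the break-even lands depends entirely on your own numbers:

```python
# Back-of-the-envelope break-even sketch. All prices are illustrative
# assumptions; plug in your own vendor and hardware numbers.

API_PRICE_PER_1M = 10.00          # assumed blended MaaS price, $ per 1M tokens
CLUSTER_FIXED_MONTHLY = 9_500.00  # assumed fixed $/month: GPUs, overhead, staffing share
SELF_HOST_MARGINAL_PER_1M = 0.50  # assumed marginal $ per 1M tokens when self-hosting

def cost_maas(tokens_m: float) -> float:
    """Pay-per-token: cost scales linearly with volume (in millions of tokens)."""
    return tokens_m * API_PRICE_PER_1M

def cost_self_hosted(tokens_m: float) -> float:
    """High fixed cost, near-zero marginal cost."""
    return CLUSTER_FIXED_MONTHLY + tokens_m * SELF_HOST_MARGINAL_PER_1M

def break_even_tokens_m() -> float:
    """Monthly volume (in millions of tokens) where the two cost curves cross."""
    return CLUSTER_FIXED_MONTHLY / (API_PRICE_PER_1M - SELF_HOST_MARGINAL_PER_1M)

for v in (10, 100, 1_000, 10_000):
    print(f"{v:>6}M tokens/mo  MaaS ${cost_maas(v):>10,.0f}  self-hosted ${cost_self_hosted(v):>10,.0f}")
print(f"break-even at {break_even_tokens_m():,.0f}M tokens/month with these inputs")
```

Running the model with your real inputs, rather than a rule of thumb, is the whole exercise: the crossover moves dramatically with model size, GPU pricing, and how much staffing cost you attribute to the cluster.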
If you're operating in a regulated industry (finance, healthcare, government) or anywhere with strict data residency requirements, sending prompts and completions to a third-party API may not be an option. Self-hosting lets you keep every byte on infrastructure you control, which simplifies compliance and eliminates the need to negotiate data processing agreements with API providers.
MaaS providers give you access to their model catalogue, and some offer fine-tuning on top of their base models. But if you've trained your own model from scratch, or you need to run a specialised open-weight model like a domain-specific Llama variant, self-hosting is the only game in town. It also gives you full control over model versioning, so you can roll back without waiting on a provider's deprecation schedule.
When you own the inference stack, you control every millisecond. You can co-locate compute with your application, optimise batching strategies, and tune serving frameworks like vLLM or TensorRT-LLM to your exact workload profile. For real-time applications where p99 latency matters (say, a conversational agent or a search autocomplete system), this level of control is hard to replicate with a shared API endpoint.
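The reason p99 (rather than average) latency is the number to watch: the mean hides the tail, and it's the tail your slowest users experience. A quick sketch, using a simulated latency distribution that is purely an assumption for illustration:

```python
# Why p99 matters: a handful of slow requests barely moves the mean.
# The latency distribution below is simulated, purely for illustration.

import random
import statistics

random.seed(0)
# Mostly fast requests, plus occasional slow outliers (queueing, long prompts).
latencies_ms = (
    [random.gauss(50, 5) for _ in range(990)]
    + [random.gauss(400, 50) for _ in range(10)]
)

def p99(samples: list[float]) -> float:
    """99th percentile via linear interpolation over the sorted samples."""
    return statistics.quantiles(samples, n=100, method="inclusive")[98]

print(f"mean = {statistics.mean(latencies_ms):.1f} ms")
print(f"p99  = {p99(latencies_ms):.1f} ms")  # the tail the mean hides
```

When you run the serving stack yourself, you can attack that tail directly, for example by capping batch sizes or isolating long-prompt traffic, rather than filing a ticket with a provider.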
If you're still iterating on your product, testing different models, or haven't locked in your usage patterns, MaaS is almost always the right starting point. The zero upfront cost and instant access let you move fast without committing capital or engineering time to infrastructure you might not need in six months.
GPU clusters don't scale to zero. If your traffic is spiky, with heavy usage during business hours and near silence overnight, you'll pay for idle capacity with self-hosted infrastructure. MaaS pricing is purely per-token, so you only pay for what you use. For teams whose workloads fluctuate significantly day-to-day or week-to-week, this elasticity can be worth a premium on the per-token rate.
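The effect of idle capacity on unit economics is easy to quantify: a fixed cluster costs the same whether it's busy or not, so your effective cost per token is inversely proportional to utilisation. The figures below are illustrative assumptions:

```python
# What idle capacity does to unit economics. All numbers are
# illustrative assumptions, not benchmarks.

CLUSTER_MONTHLY = 20_000.0   # assumed fixed $/month for the cluster
PEAK_CAPACITY_M = 2_000.0    # assumed millions of tokens/month served at full load

def effective_cost_per_1m(utilisation: float) -> float:
    """Fixed cost spread over the tokens you actually serve."""
    return CLUSTER_MONTHLY / (PEAK_CAPACITY_M * utilisation)

for u in (1.0, 0.5, 0.2):
    print(f"{u:.0%} utilisation -> ${effective_cost_per_1m(u):.2f} per 1M tokens")
```

At 20% utilisation, which is what a business-hours-only traffic pattern can look like, the effective per-token cost is five times the fully-loaded figure, which is why spiky workloads often stay on MaaS even past the nominal volume crossover.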
Running GPU infrastructure reliably is not trivial. It requires expertise in CUDA drivers, container orchestration, model serving frameworks, load balancing, and GPU health monitoring. If your team doesn't have this expertise, the staffing and ramp-up costs can dwarf the compute savings. MaaS lets you ship AI features without building a platform engineering team first.
The most capable models, GPT-4 class and beyond, are only available through their respective providers' APIs. If your use case requires frontier-level reasoning, coding ability, or multimodal understanding, MaaS may be your only option, at least until open-weight models close the gap.
Raw compute cost is the number everyone focuses on, but it's rarely the full picture. On the self-hosted side, teams frequently underestimate three things:
● Infrastructure overhead: networking, storage, monitoring, and redundancy can add 20-40% on top of raw GPU cost.
● Staffing: a production GPU cluster typically requires at least a portion of a dedicated ML platform engineer's time. At market rates, that's $150K-$250K per year in loaded cost.
● Opportunity cost: every hour your team spends debugging CUDA issues or tuning autoscaling is an hour not spent on product differentiation.
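Rolling the first two line items above into a single loaded annual figure shows why raw GPU cost alone understates the bill. The inputs are illustrative assumptions, chosen from within the ranges mentioned above:

```python
# Loaded annual cost of a self-hosted deployment.
# All inputs are illustrative assumptions within the ranges above.

raw_gpu_annual = 120_000.0        # assumed raw GPU spend per year
overhead_fraction = 0.25          # networking/storage/monitoring, within the 20-40% range
engineer_loaded_cost = 200_000.0  # within the $150K-$250K loaded range
engineer_time_share = 0.5         # assumed half of one ML platform engineer's time

loaded_annual = (
    raw_gpu_annual * (1 + overhead_fraction)
    + engineer_loaded_cost * engineer_time_share
)
print(f"raw GPU spend:      ${raw_gpu_annual:,.0f}")
print(f"loaded annual cost: ${loaded_annual:,.0f}")  # roughly double the raw line item
```

With these inputs the loaded figure comes out at roughly twice the raw GPU line item, which is the kind of gap that flips a cost comparison made on compute prices alone.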
On the MaaS side, the hidden costs are different but equally real:
● Token pricing compounds quickly at scale. What looks cheap at 1 million tokens per month can become eye-watering at 100 million.
● Vendor lock-in creeps in through prompt engineering, fine-tuning investments, and model-specific behaviours that don't transfer cleanly between providers.
● Rate limits and availability are outside your control. If a provider has an outage or throttles your account, your product goes down with it.
Here is a concise view of how the two approaches stack up across the dimensions that matter most:
| Dimension | Self-Hosted GPU | Model-as-a-Service |
|---|---|---|
| Best for volume | High, predictable throughput | Variable or growing workloads |
| Cost model | Fixed monthly + staffing | Pay-per-token, scales linearly |
| Break-even | Favourable above ~10M tokens/mo | Favourable below ~10M tokens/mo |
| Data residency | Full control on-premise | Depends on provider region |
| Operational burden | Requires infra + ML ops team | Near-zero ops overhead |
| Time to production | Weeks to months | Hours to days |
| Model flexibility | Any open-weight model | Provider catalogue only |
| Vendor lock-in risk | Low (you own the stack) | Moderate to high |
Rather than treating this as an either-or choice, many mature AI teams adopt a hybrid approach. They route high-volume, latency-sensitive, or privacy-critical workloads to self-hosted infrastructure, and use MaaS for everything else: prototyping, low-volume features, and access to frontier models.
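In practice, the hybrid approach comes down to a routing decision per workload. A minimal sketch of that logic follows; the thresholds, field names, and rule ordering are assumptions for illustration, not a prescribed policy:

```python
# Minimal routing sketch for the hybrid approach. Thresholds, field
# names, and rule ordering are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Workload:
    monthly_tokens_millions: float
    privacy_critical: bool       # e.g. regulated data that must stay on infra you control
    needs_frontier_model: bool   # capability only available through a provider API

def route(w: Workload) -> str:
    """Pick a serving backend for a workload based on the tradeoffs above."""
    if w.privacy_critical:
        return "self_hosted"     # residency wins over everything else here
    if w.needs_frontier_model:
        return "maas"            # no open-weight equivalent yet
    if w.monthly_tokens_millions >= 10:
        return "self_hosted"     # past the assumed volume crossover
    return "maas"                # low volume: elasticity beats fixed cost

print(route(Workload(50, privacy_critical=False, needs_frontier_model=False)))  # self_hosted
```

Note the deliberate ordering: privacy outranks capability in this sketch, so a privacy-critical workload stays self-hosted even if a frontier model would serve it better. Your organisation may rank those constraints differently.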
If you're making this decision today, start by answering four questions honestly:
● Is your sustained volume above roughly 10 million tokens per month, and still growing?
● Do you face strict data residency or compliance requirements that rule out third-party APIs?
● Do you need a custom or specialised open-weight model that no provider's catalogue offers?
● Do you have, or are you willing to fund, the platform engineering expertise to run GPU infrastructure reliably?
If you answered yes to two or more of those, you likely have a strong case for self-hosting, at least for part of your workload. If you answered no to most of them, MaaS is probably the right default, and you can revisit as your usage matures.
Infrastructure choices compound over time. A MaaS dependency that's cheap today can become your biggest line item in 18 months. A GPU cluster that seemed like overkill can become the backbone of your competitive advantage. The key is to make the decision deliberately, with real numbers, rather than defaulting into one path because it's what you started with.
Whatever you choose, revisit the analysis every quarter. The pricing landscape for both GPUs and API providers is shifting fast, and the best infrastructure strategy is one that adapts with it.