Yahaha Studios[1] is a metaverse platform that lowers the barrier to 3D content creation, enabling users to build, share, and play 3D games and interactive experiences without writing any code.

STRIDEN[2] is an upcoming UE5-based online FPS jointly developed with our key partner studio, 5 Fortress. We are now racing to release its Early Access version on Steam.

At the beginning of the project we faced a core challenge: our partner’s existing tech stack relied on a public-cloud Auto Scaling Group (ASG) solution. While ASG can scale out automatically, it falls short on graceful scale-in, rapid in-place updates, and multi-region releases, all of which are common operational scenarios for online games. For stateful game servers, fine-grained lifecycle control, zero-downtime updates, and standardized operations are essential; the ASG setup was far from our cloud-native vision.
Compounding the issue, we had only two months to completely refactor the deployment into a cloud-native architecture to meet the Steam launch schedule with high quality.
Thus, STRIDEN became not only an anticipated title but also a litmus test for our engineering team. It forced us to confront the limitations of traditional deployment models and rapidly turn our cloud-native game-hosting vision into a stable, efficient production solution.
Below we share in detail the thinking, challenges, and choices behind our cloud-native journey.
Before adopting OpenKruiseGame[3] (hereafter “OKG”) we suffered two major operational pain points:
Consequently, for STRIDEN we had a clear goal: find a unified platform that gracefully manages both stateless back-ends and stateful game servers. The answer ultimately pointed to the Kubernetes cloud-native ecosystem.
Kubernetes’ declarative APIs and controller pattern let us treat game servers as first-class workloads, bridging the two stacks and enabling true Infrastructure-as-Code.
With Kubernetes chosen, we looked for a mature, game-oriented solution in the community. We finally settled on OKG based on several key considerations:

With the selection made on paper, the real test was migrating STRIDEN, a complex FPS, from the ASG stack to a cloud-native architecture. Along the way we confronted several concrete technical challenges that further validated our choice of OKG.

In the ASG setup, bringing up a new game server instance was long and cumbersome:
Create EC2 → download game package → extract → fetch Steam dependencies → launch UE game server.
On average, this cold start took roughly three minutes per server. Under steady traffic that was tolerable, but during burst arrivals minute-level spin-up was unacceptable: players were left queuing and the experience suffered. One core cloud-native advantage is container startup in seconds, so efficiently containerizing our hefty game server stack became challenge number one.
For an FPS like STRIDEN, low latency demands that clients connect directly to each server’s public IP and port. Under the ASG model this led to a dilemma.
● Cost: Assigning a dedicated Elastic IP to every server wastes IP resources and inflates the bill.
● Operational overhead: Managing hundreds of lifecycle-coupled public IPs and their security rules is labor-intensive and error-prone. This extra layer hampered automation during bulk server churn and hurt operational efficiency.
The legacy platform had two fatal deficiencies:
● No automatic scale-in or scale-out: operators had to add or remove servers manually, one by one. This was inefficient, and the risk multiplied across multiple global regions.
● No globally synchronized releases: version updates were applied manually, region by region. This was time-consuming and prone to version skew, which is unacceptable for a competitive title.
Because game servers are stateful, scaling in without hurting players is a major challenge. The legacy platform could not detect whether a server still had active players, so scale-in decisions were made blindly:
● A conservative strategy keeps many idle servers, wasting money.
● An aggressive strategy may terminate live sessions, harming the user experience.
This inability to balance cost and experience drove us to search for smarter scheduling. We needed a controller that marks reclaimable servers and performs lossless scale-in.
These legacy pains are exactly what cloud-native frameworks like OKG aim to solve. OKG offers standard tools and declarative APIs, turning manual, error-prone ops into automated, reliable controller logic.

Faced with the challenges above, we were not helpless. On the contrary, they are the exact scenarios OKG is built for. Its toolset, tailored for stateful games, helped us clear each hurdle.
To cut the three-minute start time, we adopted a “process-to-image” strategy: we baked all the steps that used to run after VM boot into a pre-built, standardized container image.
● Pre-packaging: the full server binary, all dependencies (including Steam SDK), and optimized start scripts are packaged into the image.
● Streamlined scaling: scaling now just pulls the pre-built image from registry and starts a container within seconds — no EC2 creation, download, or extraction.
This reduced server boot time from minutes to seconds, enabling real elastic scaling.
OKG offers automated ingress networking[4], removing the need to manually create or tear down networks per game shard, and supports multiple models for different scenarios. Given our multi-cloud deployment, Yahaha chose the Kubernetes-HostPort model.
Because the Steam SDK requires the in-container port to match the public port, we leveraged the SameAsHost extension of the HostPort model.
SameAsHost is non-intrusive: just add certain parameters to the GameServerSet YAML. As shown in the pseudo-YAML below, modify networkConf; DownwardAPI then passes the network info into the container, allowing the game server to pick up the port.
apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
# ...
spec:
  # ...
  network:
    networkType: Kubernetes-HostPort
    networkConf:
      - name: ContainerPorts
        value: "striden-server:SameAsHost/UDP,SameAsHost/UDP"
  gameServerTemplate:
    spec:
      # ...
      containers:
        - name: striden-server  # container name referenced by ContainerPorts above
          # ...
          volumeMounts:
            - name: podinfo
              mountPath: /etc/podinfo
      volumes:
        - name: podinfo
          downwardAPI:
            items:
              - path: "network"
                fieldRef:
                  fieldPath: metadata.annotations['game.kruise.io/network-status']
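On the game-server side, reading the mounted file can be as small as the Python sketch below. It assumes the annotation value is a JSON object with an externalAddresses list whose entries carry ip and ports (each port having port and protocol); verify these field names against the OKG network documentation[4] before relying on them. The function names and the sample payload are illustrative and are not taken from STRIDEN's code.

```python
import json


def external_ports(network_status_json: str) -> list[tuple[str, int, str]]:
    """Return (ip, port, protocol) for every external port in the annotation.

    An empty string (network not ready yet) yields an empty list.
    """
    if not network_status_json.strip():
        return []
    status = json.loads(network_status_json)
    result = []
    for address in status.get("externalAddresses") or []:
        for port in address.get("ports") or []:
            # The port may be serialized as a number or a string; normalize it.
            result.append((address["ip"], int(port["port"]), port.get("protocol", "")))
    return result


def read_external_ports(path: str = "/etc/podinfo/network") -> list[tuple[str, int, str]]:
    """Read the DownwardAPI file mounted by the volume in the YAML above."""
    with open(path, encoding="utf-8") as f:
        return external_ports(f.read())


# Hypothetical payload; the IP and ports are made-up illustration values.
SAMPLE = json.dumps({
    "networkType": "Kubernetes-HostPort",
    "currentNetworkState": "Ready",
    "externalAddresses": [
        {"ip": "203.0.113.10", "ports": [
            {"protocol": "UDP", "port": 30001},
            {"protocol": "UDP", "port": "30002"},
        ]},
    ],
})
```

With SameAsHost, the ports read here are also the ports the server process should bind inside the container, which is the property the Steam SDK needs.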
OKG introduces two key CRDs—GameServer and GameServerSet—central to our automation.
● GameServerSet enables auto scaling: akin to Deployment/ReplicaSet but built for game servers.
Ops engineers no longer handle individual machines; instead a YAML declares the desired state, e.g., “keep 100 v1.2 STRIDEN servers running globally.” OKG controllers continuously reconcile actual state to desired state by creating or deleting GameServers. Combined with HPA we can autoscale on CPU, memory, or custom metrics such as player count.
● Declarative API enables global release: to ship a new version we simply bump the image tag in the GameServerSet and use GitOps-based CI/CD to apply it to all clusters worldwide. Kubernetes and OKG controllers then perform standardized rolling (or advanced) updates automatically and safely, eliminating manual release risk. The declarative workflow also prevents vendor lock-in, allowing us to run on multiple clouds.
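The reconcile idea behind both bullets can be illustrated with a toy Python model. This is not OKG's controller code: the function name, the ordinal-to-image mapping, and the "delete highest ordinals first" rule are simplifying assumptions, and the image tags are made up. The real controller also weighs opsState and deletion priority.

```python
def reconcile(desired_replicas: int, desired_image: str,
              actual: dict[int, str]) -> list[tuple[str, int]]:
    """Compute the actions that move `actual` toward the desired state.

    `actual` maps a server ordinal to the image it currently runs.
    Returns a list of ("delete" | "update" | "create", ordinal) actions.
    """
    actions: list[tuple[str, int]] = []
    ordinals = sorted(actual)

    # Too many servers: remove the surplus (highest ordinals in this toy model).
    surplus = len(ordinals) - desired_replicas
    to_delete = ordinals[-surplus:] if surplus > 0 else []
    actions += [("delete", o) for o in to_delete]

    # Surviving servers on an old image get updated.
    for o in ordinals:
        if o not in to_delete and actual[o] != desired_image:
            actions.append(("update", o))

    # Too few servers: create new ones on the lowest free ordinals.
    missing = desired_replicas - len(ordinals)
    candidate = 0
    while missing > 0:
        if candidate not in actual:
            actions.append(("create", candidate))
            missing -= 1
        candidate += 1
    return actions
```

Bumping the image tag or the replica count changes only the inputs; the same function then yields the update, create, or delete actions, which is why a single Git commit can drive every region.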
This is arguably OKG’s greatest benefit to us. Its elegant design solves blind scale-in.
● State awareness: we patched the server to periodically write player count and match status to a local file (custom QoS[6]); probe.sh then updates the GameServer’s opsState. Thus the platform knows each server’s live status in real time.
● Custom scale-in policy: OKG lets us define priority[7]; for STRIDEN we delete servers with playerCount == 0 first.
● Custom scale-out policy: beyond CPU and memory, OKG supports keeping a minimum number of servers whose opsState is None[8]. If the number of available servers falls below that threshold, OKG automatically spawns new ones. When it exceeds maxAvailable, OKG scales in, preventing waste.
● Graceful deletion: when a server is selected, its opsState is first set to WaitToBeDeleted. Using the latest Lifecycle Hook[9] and a delete-block label, our match-maker stops assigning new games and waits for existing players to exit before deletion.
This “mark → isolate → drain → close → recycle” loop offers precise, player-safe scale-in while maximizing cost savings.
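The loop can be modeled in a few lines of Python. This is a sketch of the policy described above, not OKG's implementation: the Server class, the function names, and min_available are hypothetical, while the opsState values None and WaitToBeDeleted and the maxAvailable bound come from the text.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    player_count: int
    ops_state: str = "None"  # "None" = available, "WaitToBeDeleted" = draining


def mark_for_scale_in(servers: list[Server], count: int) -> list[str]:
    """Mark up to `count` available servers, emptiest first."""
    available = [s for s in servers if s.ops_state == "None"]
    chosen = sorted(available, key=lambda s: s.player_count)[:count]
    for s in chosen:
        s.ops_state = "WaitToBeDeleted"
    return [s.name for s in chosen]


def assignable(servers: list[Server]) -> list[str]:
    """Servers the match-maker may still send new games to."""
    return [s.name for s in servers if s.ops_state == "None"]


def safe_to_delete(server: Server) -> bool:
    """Deletion stays blocked until a marked server has drained."""
    return server.ops_state == "WaitToBeDeleted" and server.player_count == 0


def replica_delta(available: int, min_available: int, max_available: int) -> int:
    """Positive: servers to add. Negative: servers to reclaim. Zero: in range."""
    if available < min_available:
        return min_available - available
    if available > max_available:
        return max_available - available
    return 0
```

Sorting by player count means empty servers are reclaimed first, and a server that still hosts a match is only isolated, never killed, until its players leave.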
| Metric | Traditional ASG Solution | Cloud-Native OKG Solution | Improvement |
|---|---|---|---|
| Server Startup Time | ~3 minutes | <10 seconds | Second-level startup |
| Scale-out Response | Minutes | Seconds | Real-time elasticity |
| Release Model | Manual, region-by-region | Declarative, global sync | Automated, high efficiency |
| Scale-in Strategy | Empirical “guesswork” | Player-state-aware reclamation | Lossless scale-in, cost optimization |
| Ops Manpower Cost | High (scripts & manual) | Low (Infrastructure as Code) | Significantly reduced |
In short, OKG is more than a tool; it is a methodology and arsenal that elevated STRIDEN’s operations from traditional reactive mode to a modern, automated cloud-native paradigm.
Our two-month cloud-native transformation for STRIDEN demonstrates a clear, viable path to solving modern online games’ operational complexity via cloud-native tech.
STRIDEN’s success is only the start. This battle-tested architecture will become the technical foundation for all future Yahaha online titles. Next, we will keep innovating in the cloud-native space, exploring possibilities such as:
● Deeper AIOps: combining observability data with AI for smarter failure prediction and capacity planning.
● More flexible scheduling: hybrid scheduling based on player profiles, latency, and other factors to further optimize global matchmaking.
Ultimately, technology serves products and users. We believe that by embracing cloud-native, Yahaha will build a more agile, stable, and scalable platform, empowering creators and delighting players worldwide with richer interactive experiences.
The stability and smoothness of our cloud-native stack are embodied in STRIDEN. Theory is proven in practice. We warmly invite you to visit Steam on July 12 and experience the second-level matchmaking and stable service powered by our cloud-native architecture. Every smooth match you play is both a stress test and the best validation of our solution.
STRIDEN Steam page: https://store.steampowered.com/app/2052970/STRIDEN/
OKG Community Discussion Group: DingTalk Group ID 44862615
[1] https://yahaha.com/
[2] https://store.steampowered.com/app/2052970/STRIDEN/
[3] https://openkruise.io/kruisegame/introduction
[4] https://openkruise.io/zh/kruisegame/user-manuals/network
[5] https://mp.weixin.qq.com/s/TOPcOsE5WCIIXkgo9jujlA
[6] https://openkruise.io/kruisegame/user-manuals/service-qualities
[7] https://openkruise.io/kruisegame/user-manuals/gameservers-scale/#sequence-of-scale-down
[8] https://openkruise.io/kruisegame/user-manuals/gameservers-scale/#set-the-maximum-number-of-game-servers-whose-opsstate-is-none
[9] https://openkruise.io/kruisegame/user-manuals/lifecycle