From Minutes to Seconds: Yahaha's Cloud-Native UE5 Game Practice Powered by OpenKruiseGame

The article introduces how Yahaha migrated its UE5 game STRIDEN to a cloud-native architecture powered by OpenKruiseGame.

About Yahaha

Yahaha Studios[1] is a metaverse platform that lowers the barrier to 3D content creation, enabling users to build, share, and play 3D games and interactive experiences without writing any code.

Preface

STRIDEN[2] is an upcoming UE5-based online FPS jointly developed with our key partner studio, 5 Fortress. We are now racing to release its Early Access version on Steam.

At the beginning of the project we faced a core challenge: our partner’s existing tech stack relied on a public-cloud Auto Scaling Group (ASG) solution. While ASG can scale out automatically, it falls short on graceful scale-in, rapid in-place updates, and multi-region releases, all of which are common operational scenarios for online games. For stateful game servers, fine-grained lifecycle control, zero-downtime updates, and standardized operations are essential; the ASG setup was far from our cloud-native vision.

Compounding the issue, we had only two months to completely refactor the deployment into a cloud-native architecture to meet the Steam launch schedule with high quality.

Thus, STRIDEN became not only an anticipated title but also a litmus test for our engineering team. It forced us to confront the limitations of traditional deployment models and rapidly turn our cloud-native game-hosting vision into a stable, efficient production solution.

Below we share in detail the thinking, challenges, and choices behind our cloud-native journey.

Why OpenKruiseGame?

Before adopting OpenKruiseGame[3] (hereafter “OKG”) we suffered two major operational pain points:

  1. Both version updates and routine maintenance required lossless server rotation via complex, high-maintenance custom scripts.
  2. Our backend microservices were already on Kubernetes, whereas game servers lived in a separate stack, creating a “dual-track” operation with high overhead and coordination cost.

Consequently, for STRIDEN we had a clear goal: find a unified platform that gracefully manages both stateless back-ends and stateful game servers. The answer ultimately pointed to the Kubernetes cloud-native ecosystem.

Kubernetes’ declarative APIs and controller pattern let us treat game servers as first-class workloads, bridging the two stacks and enabling true Infrastructure-as-Code.

With Kubernetes chosen, we looked for a mature, game-oriented solution in the community. We finally settled on OKG based on several key considerations:

  1. Feature fit and flexibility: Designed for stateful game workloads, OKG’s advanced networking, QoS descriptors, and scaling strategies offer a more elegant and reliable path to lossless scale-in/out.
  2. Community responsiveness: During evaluation the OKG community proved highly responsive and professional, a critical factor for adopting core technology in production.
  3. Localization and integration: Originating from Alibaba Cloud and battle-tested by well-known game studios, OKG offers documentation, community interaction, and integrations naturally aligned with our needs.

Challenges in Cloud-Native Practice

With the selection made on paper, the real test was migrating the complex FPS game STRIDEN from the ASG stack to a cloud-native architecture. Along the way we confronted several concrete technical challenges that further validated our choice of OKG.

Legacy Architecture

1. Cold-Start Pain: Minutes-long Server Boot

In the ASG setup, bringing up a new game server instance was long and cumbersome:

Create EC2 → download game package → extract → fetch Steam dependencies → launch UE game server.

On average, a cold start took roughly three minutes per server. During steady traffic this was tolerable, but during bursts of arrivals a minute-level spin-up was unacceptable: players were left queuing and the experience suffered. One core cloud-native advantage is second-level container startup, so efficiently containerizing our hefty game server stack became challenge number one.

2. Public IPs: Costly and Complex Connectivity

For an FPS like STRIDEN, low latency demands that clients connect directly to each server’s public IP and port. Under the ASG model this led to a dilemma.

Cost: Assigning a dedicated Elastic IP to every server wastes IP resources and inflates the bill.

Operational overhead: Managing hundreds of lifecycle-coupled public IPs and their security rules is labor-intensive and error-prone. This extra layer hampered automation during bulk server churn and hurt operational efficiency.

3. Hand-Crafted Ops: Inefficient and Risky

The legacy platform had two fatal deficiencies:

No automatic scale-in/out: operators had to manually add or remove servers one by one. This was inefficient, and the risk multiplied across multiple global regions.

No global synchronous releases: version updates were applied manually, region by region. This was time-consuming and prone to version skew, which is unacceptable for a competitive title.

4. Blind Scale-In: Cost vs. Experience

Because game servers are stateful, scaling in without hurting players is a major challenge. The legacy platform could not detect whether a server still had active players, so scale-in decisions were made blindly:

● A conservative strategy keeps many idle servers running, wasting money.

● An aggressive strategy may terminate live sessions, harming the user experience.

This inability to balance cost and experience drove us to search for smarter scheduling. We needed a controller that marks reclaimable servers and performs lossless scale-in.

These legacy pains are exactly what cloud-native frameworks like OKG aim to solve. OKG offers standard tools and declarative APIs, turning manual, error-prone ops into automated, reliable controller logic.

Cloud-Native Architecture

Confronted with the challenges above, we were not helpless. On the contrary, they are the exact scenarios OKG is built for. Its toolset, tailored for stateful games, helped us overcome each hurdle.

1. Cracking Cold Start: Second-Level Deployment via Container Images

To cut the three-minute start time, we adopted a “process-to-image” strategy. We baked all steps that used to run post-VM boot into a pre-built, standardized container image.

Pre-packaging: the full server binary, all dependencies (including Steam SDK), and optimized start scripts are packaged into the image.

Streamlined scaling: scaling now just pulls the pre-built image from the registry and starts a container within seconds, with no EC2 creation, download, or extraction.

This reduced server boot time from minutes to seconds, enabling real elastic scaling.

2. Network Standardization: Adopting OKG Network Models

OKG offers automated ingress networking[4], removing the need to manually create or tear down networks per game shard, and supports multiple models for different scenarios. Given our multi-cloud deployment, Yahaha chose the Kubernetes-HostPort model.

Because the Steam SDK requires the in-container port to match the public port, we leveraged the SameAsHost extension of the HostPort model.

SameAsHost is non-intrusive: it only requires a few extra parameters in the GameServerSet YAML. As shown in the abridged manifest below, networkConf selects SameAsHost for the game container's ports, and the DownwardAPI volume then passes the resulting network info into the container as a file, allowing the game server to pick up its port.

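apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
# ... (metadata omitted)
spec:
  # ... (other fields omitted)
  network:
    networkType: Kubernetes-HostPort
    networkConf:
      # <containerName>:<port>/<protocol>,... ; SameAsHost keeps the
      # in-container port identical to the allocated host port.
      - name: ContainerPorts
        value: "striden-server:SameAsHost/UDP,SameAsHost/UDP"
  gameServerTemplate:
    spec:
      # ... (other pod fields omitted)
      containers:
        # Container name matches the prefix used in ContainerPorts above.
        - name: striden-server
          # ... (image, command, etc. omitted)
          volumeMounts:
            - name: podinfo
              mountPath: /etc/podinfo
      volumes:
        - name: podinfo
          downwardAPI:
            items:
              # Exposes the network-status annotation as /etc/podinfo/network.
              - path: "network"
                fieldRef:
                  fieldPath: metadata.annotations['game.kruise.io/network-status']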
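
To illustrate the last step, here is a minimal Python sketch of how a process inside the container might read the file mounted by the manifest above. It is not STRIDEN's actual code. The path follows from the mountPath and item path in the manifest, but the JSON shape (an "externalAddresses" list with "ip" and "ports" entries) is an assumption that should be checked against the OKG version in use.

```python
import json
from pathlib import Path

# mountPath + item path from the manifest above.
NETWORK_FILE = Path("/etc/podinfo/network")


def external_ports(raw: str) -> list[tuple[str, int, str]]:
    """Return (ip, port, protocol) for each external port in the annotation JSON.

    Assumed shape: {"externalAddresses": [{"ip": ..., "ports": [{"port": ..., "protocol": ...}]}]}
    """
    status = json.loads(raw)
    result = []
    for address in status.get("externalAddresses") or []:
        for port in address.get("ports") or []:
            # int() accepts the port whether it is encoded as a number or a string.
            result.append((address["ip"], int(port["port"]), port.get("protocol", "")))
    return result


def read_external_ports(path: Path = NETWORK_FILE) -> list[tuple[str, int, str]]:
    """Read the downward API file; a missing or empty file means 'network not ready yet'."""
    try:
        raw = path.read_text()
    except FileNotFoundError:
        return []
    return external_ports(raw) if raw.strip() else []
```

A start script could poll read_external_ports() until it returns a non-empty list and then pass the port to the UE server command line.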

3. Farewell to Hand-Crafted Ops: Declarative APIs for Automated Operations

OKG introduces two key CRDs—GameServer and GameServerSet—central to our automation.

GameServerSet enables auto scaling: akin to Deployment/ReplicaSet but built for game servers.

Ops engineers no longer handle individual machines; instead a YAML declares the desired state, e.g., “keep 100 v1.2 STRIDEN servers running globally.” OKG controllers continuously reconcile actual state to desired state by creating or deleting GameServers. Combined with HPA we can autoscale on CPU, memory, or custom metrics such as player count.
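
The HPA part can be made concrete with the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current × currentMetric / targetMetric). The sketch below applies it to an average-players-per-server metric. The metric, target, and bounds are hypothetical values, and the real controller's tolerance band and stabilization windows are left out.

```python
import math


def hpa_desired_replicas(
    current_replicas: int,
    current_metric: float,   # e.g. observed average players per server (hypothetical metric)
    target_metric: float,    # e.g. target average players per server
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Core HPA scaling formula, clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 100 servers averaging 12 players against a target of 8 gives a desired count of 150, while the same fleet averaging 4 players shrinks to 50.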

Declarative API enables global release: to ship a new version we simply bump the image tag in the GameServerSet and use GitOps-based CI/CD to apply it to all clusters worldwide. Kubernetes and OKG controllers then perform standardized rolling (or advanced) updates automatically and safely, eliminating manual release risk. The declarative workflow also prevents vendor lock-in, allowing us to run on multiple clouds.

4. Smart Scale-In: Precise, Lossless, Cost-Effective

This is arguably OKG’s greatest benefit to us. Its elegant design solves blind scale-in.

State awareness: we patched the server to periodically write player count and match status to a local file (custom QoS[6]); probe.sh then updates the GameServer’s opsState. Thus the platform knows each server’s live status in real time.
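
The probe side of this can be sketched in a few lines. The article's probe is a shell script (probe.sh) whose contents are not shown, so the Python below is only an illustration of the idea. The file path, the JSON field names playerCount and matchStatus, and the convention that exit code 0 means "reclaimable" are all assumptions.

```python
import json
from pathlib import Path

# Hypothetical location of the state file written by the game server.
STATE_FILE = Path("/tmp/striden/state.json")


def is_reclaimable(raw: str) -> bool:
    """True only when the state clearly says: no players and no match in progress."""
    state = json.loads(raw)
    if not isinstance(state, dict):
        return False
    # A missing playerCount is treated as 'busy' so a bad write never frees a live server.
    return state.get("playerCount", 1) == 0 and state.get("matchStatus") != "in_progress"


def probe_exit_code(path: Path = STATE_FILE) -> int:
    """0 = reclaimable, 1 = keep. Unreadable or malformed state counts as 'keep'."""
    try:
        return 0 if is_reclaimable(path.read_text()) else 1
    except (OSError, ValueError):
        return 1
```

The fail-safe default matters here: any error reading the state keeps the server alive rather than marking it for deletion. A probe entry point would simply call sys.exit(probe_exit_code()).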

Custom scale-in policy: OKG lets us define priority[7]; for STRIDEN we delete servers with playerCount == 0 first.
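
As a rough illustration of "empty servers first", the sketch below picks scale-in candidates from a fleet snapshot. It is not OKG's sorting algorithm: the Server type, the rank table, and the tie-break on name are hypothetical, chosen only to show the shape of such a policy.

```python
from dataclasses import dataclass

# Lower rank is removed first. This ordering is an assumption for illustration.
OPS_RANK = {"WaitToBeDeleted": 0, "None": 1}


@dataclass(frozen=True)
class Server:
    name: str
    ops_state: str
    player_count: int


def scale_in_candidates(servers: list[Server], count: int) -> list[str]:
    """Pick up to `count` server names to remove, never one that still has players."""
    empty = [s for s in servers if s.player_count == 0 and s.ops_state in OPS_RANK]
    empty.sort(key=lambda s: (OPS_RANK[s.ops_state], s.name))
    return [s.name for s in empty[:count]]
```

Servers in any other opsState are left alone, and a request for more removals than there are empty servers simply returns fewer names.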

Custom scale-out policy: beyond CPU/Memory, OKG can keep a minimum number of servers whose opsState is None, i.e., servers available for new matches[8]. If available servers fall below that threshold, OKG automatically spawns new ones. When they exceed maxAvailable, OKG scales in, preventing waste.
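
The threshold behavior can be written down as a small function. This is a sketch of the decision rule as described here, not the code of OKG's scaler, and the parameter names other than maxAvailable are hypothetical.

```python
def desired_replicas(total: int, available: int, min_available: int, max_available: int) -> int:
    """Return the replica count that brings `available` back inside [min_available, max_available].

    total     -- current number of game servers
    available -- servers whose opsState is None (free to take a new match)
    """
    if min_available > max_available:
        raise ValueError("min_available must not exceed max_available")
    if available < min_available:
        return total + (min_available - available)   # scale out to refill the buffer
    if available > max_available:
        return total - (available - max_available)   # scale in the surplus
    return total                                      # inside the band: do nothing
```

With a band of 3 to 5 available servers, a fleet of 20 with only 1 free grows to 22, while the same fleet with 8 free shrinks to 17.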

Graceful deletion: when a server is selected, its opsState is first set to WaitToBeDeleted. Using the latest Lifecycle Hook[9] and a delete-block label, our match-maker stops assigning new games and waits for existing players to exit before deletion.

This “mark → isolate → drain → close → recycle” loop offers precise, player-safe scale-in while maximizing cost savings.
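
From the match-maker's point of view, the loop reduces to one decision per server. The sketch below is a hypothetical simplification: the action names are invented for illustration, and "release" stands for whatever step lifts the delete block so that deletion can proceed.

```python
def next_action(ops_state: str, player_count: int) -> str:
    """Decide what the match-maker does with one server.

    "assign"  -- server is available; new matches may be placed on it
    "drain"   -- marked for deletion but still has players; place nothing new and wait
    "release" -- marked for deletion and empty; lift the delete block so it can be recycled
    "skip"    -- any other state; leave the server alone
    """
    if ops_state == "None":
        return "assign"
    if ops_state == "WaitToBeDeleted":
        return "drain" if player_count > 0 else "release"
    return "skip"
```

The key property is that a server marked WaitToBeDeleted never receives a new match, and is only released once its player count reaches zero.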

Results Comparison

Metric | Traditional ASG Solution | Cloud-Native OKG Solution | Improvement
Server Startup Time | ~3 minutes | <10 seconds | Second-level startup
Scale-out Response | Minutes | Seconds | Real-time elasticity
Release Model | Manual, region-by-region | Declarative, global sync | Automated, high efficiency
Scale-in Strategy | Empirical “guesswork” | Player-state-aware reclamation | Lossless scale-in, cost optimization
Ops Manpower Cost | High (scripts & manual) | Low (Infrastructure as Code) | Significantly reduced

In short, OKG is more than a tool; it is a methodology and arsenal that elevated STRIDEN’s operations from traditional reactive mode to a modern, automated cloud-native paradigm.

Summary & Outlook: Only the Beginning

Our two-month cloud-native transformation for STRIDEN demonstrates a clear, viable path to solving modern online games’ operational complexity via cloud-native tech.

What did we gain?

  1. A standardized methodology: we seamlessly integrated stateful game servers into Kubernetes’ declarative ecosystem. From containerization and automated deployment to intelligent scaling and lossless updates, we now have a full reusable solution.
  2. Ops philosophy shift: we moved from “human-driven, reactive” to “declarative, proactive” operations. The team is freed from repetitive tasks to focus on stability, cost optimization, and toolchain innovation.
  3. Win-win on cost and experience: OKG lets us strike the optimal balance between server spend and player satisfaction. Precise, lossless scale-in delivers stable gameplay while materially reducing resource costs.

Looking Ahead

STRIDEN’s success is only the start. This battle-tested architecture will become the technical foundation for all future Yahaha online titles. Next, we will keep innovating in the cloud-native space, exploring possibilities such as:

Deeper AIOps: combining observability data with AI for smarter failure prediction and capacity planning.

More flexible scheduling: hybrid scheduling based on player profiles, latency, and other factors to further optimize global matchmaking.

Ultimately, technology serves products and users. We believe that by embracing cloud-native, Yahaha will build a more agile, stable, and scalable platform, empowering creators and delighting players worldwide with richer interactive experiences.

Experience the Power of Cloud-Native

The stability and smoothness of our cloud-native stack are embodied in STRIDEN. Theory is proven in practice. We warmly invite you to visit Steam on July 12 and experience the second-level matchmaking and stable service powered by our cloud-native architecture. Every smooth match you play is both a stress test and the best validation of our solution.

STRIDEN Steam page: https://store.steampowered.com/app/2052970/STRIDEN/

OKG Community Discussion Group: DingTalk Group ID 44862615

References

[1] https://yahaha.com/
[2] https://store.steampowered.com/app/2052970/STRIDEN/
[3] https://openkruise.io/kruisegame/introduction
[4] https://openkruise.io/zh/kruisegame/user-manuals/network
[5] https://mp.weixin.qq.com/s/TOPcOsE5WCIIXkgo9jujlA
[6] https://openkruise.io/kruisegame/user-manuals/service-qualities
[7] https://openkruise.io/kruisegame/user-manuals/gameservers-scale/#sequence-of-scale-down
[8] https://openkruise.io/kruisegame/user-manuals/gameservers-scale/#set-the-maximum-number-of-game-servers-whose-opsstate-is-none
[9] https://openkruise.io/kruisegame/user-manuals/lifecycle
