Powered by Alibaba Cloud, POPUCOM Delivers a Zero-latency Adventure for Players Worldwide

This article showcases how POPUCOM leverages Alibaba Cloud’s cloud-native architecture and observability tools to deliver a stable, low-latency global multiplayer gaming experience.

About POPUCOM

POPUCOM is a multiplayer cooperative party-adventure game developed by HYPERGRYPH. The game combines a variety of gameplay elements, including color-based shooting, match-three mechanics, physics-driven interactions, and puzzle-oriented level design, requiring players to collaborate closely to solve map mechanisms and defeat powerful bosses across different stages.

In addition to the main story levels, POPUCOM features a specially designed Arcade Room within its central hub area. This space includes a variety of interactive mini-games, providing players with differentiated social and interactive experiences beyond cooperative puzzle-solving.

Players can choose to use either a keyboard and mouse or a controller. The game is easy to pick up overall, with a brisk pace and light, enjoyable level design, offering strong playability. POPUCOM supports both local two-player split-screen gameplay and online party modes for three to four players, allowing friends or family members to team up and fight side by side as they take on a continuous stream of challenges.

To ensure that players around the world consistently enjoy a stable, low-latency, and uninterrupted online experience, POPUCOM has developed a technical architecture centered on cloud-native design, driven by automation, and underpinned by observability. This architecture provides full-stack control from infrastructure to the application layer, enabling global deployment, continuous iteration, and long-term live operations.

Cloud-native Architecture of POPUCOM

As illustrated in the architecture diagram, POPUCOM runs in a fully containerized environment on Alibaba Cloud Container Service for Kubernetes (ACK). It further integrates OpenKruiseGame (OKG) to enable fine-grained governance of game-specific workloads. Overall, the architecture has four key characteristics:

Distributed: Service modules are deployed independently, reducing coupling and enhancing system resilience.

Highly available: Containerized and multi-node deployments across multiple zones, together with automatic failover, ensure continuous service availability.

Scalable: Individual service modules can be easily scaled horizontally in response to changing business requirements.

Operable and maintainable: A well-established observability system enables O&M engineers to track system status in real time and respond rapidly to issues.

Based on this architecture, POPUCOM supports high concurrency across multiple regions while maintaining stable and reliable service operations. For game server management, POPUCOM uses ACK and OKG to deliver a smooth online multiplayer experience. In terms of observability, POPUCOM builds its game O&M system by integrating Simple Log Service (SLS), CloudMonitor, and Managed Service for OpenTelemetry, a sub-service of Application Real-Time Monitoring Service (ARMS), providing comprehensive visibility for day-to-day operations.

OKG Enables a New Online Gaming Experience: Low Latency, Elastic Scaling, and Zero-downtime Upgrades

Global Multi-region Direct-connect Architecture for Ultra-low Network Latency

To deliver a true "global server" experience, POPUCOM has deployed four regional data centers in China and seven outside China. Each regional data center runs room server clusters hosted and orchestrated by ACK. Based on ACK's cross-region cluster capabilities, unified resource scheduling, and automated O&M, these room server clusters can be rapidly deployed and efficiently managed at a global scale. At the room server layer, OKG's capability to automatically generate public network access addresses is combined with a dual strategy of geographic prioritization and network quality probing to form a matching system that dynamically selects the service node with the lowest latency in real time. ACK's networking and service orchestration capabilities enable end-to-end direct connections without relying on traditional proxy gateways, avoiding additional network hops and unnecessary jitter. This significantly reduces network round-trip time (RTT) and meets the strict real-time requirements for action synchronization and accurate skill resolution.
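The dual strategy described above can be sketched roughly as follows. This is a minimal illustration, not POPUCOM's actual matching code: the `RegionProbe` type, the `pick_region` function, the `geo_bias_ms` bonus, and the `max_loss` cutoff are all hypothetical names and values chosen for the example. It assumes the client has already probed each region and reports a round-trip time and packet-loss rate per region.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RegionProbe:
    """Result of one client-side probe (hypothetical structure)."""
    region: str
    rtt_ms: Optional[float]  # None means the probe failed
    loss_rate: float         # fraction of probe packets lost, 0.0 to 1.0


def pick_region(
    probes: Iterable[RegionProbe],
    home_region: str,
    geo_bias_ms: float = 10.0,  # assumed bonus for the geographically preferred region
    max_loss: float = 0.05,     # assumed cutoff for an unusable link
) -> Optional[str]:
    """Return the usable region with the lowest effective RTT, or None."""
    best_region, best_score = None, float("inf")
    for probe in probes:
        # Network quality probing: skip failed or lossy links outright.
        if probe.rtt_ms is None or probe.loss_rate > max_loss:
            continue
        # Geographic prioritization: the home region wins near-ties.
        bonus = geo_bias_ms if probe.region == home_region else 0.0
        score = probe.rtt_ms - bonus
        if score < best_score:
            best_region, best_score = probe.region, score
    return best_region
```

With this weighting, a home region that is only a few milliseconds slower than a remote one is still preferred, while a clearly faster remote region can take over when the home link degrades.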

Automated Elastic Scaling Based on KEDA and OKG Triggers

To address sharp fluctuations in player concurrency, POPUCOM integrates ACK's highly elastic container scheduling with the Kubernetes Event-driven Autoscaling (KEDA) framework and OKG custom triggers, forming an event-driven auto scaling mechanism for online game rooms. In each region, the system maintains a minimum threshold of available rooms. During peak hours, ACK's multi-node auto scaling capabilities enable the launch of preconfigured standby servers within seconds, ensuring that players can enter rooms immediately without waiting in queues. During off-peak hours, ACK's container orchestration works in conjunction with OKG to intelligently evaluate the status of game rooms, including idle duration and player exit history, automatically reclaim idle resources, and release computing capacity to prevent prolonged resource occupation. OKG's custom quality-of-service capabilities enable precise control over the room lifecycle, achieving an effective balance between resource utilization efficiency and player experience.
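The scaling decision can be pictured with a small sketch. It does not use the KEDA or OKG APIs; it only models the logic described above under simplifying assumptions: a room counts as available when it has no players, and an empty room is reclaimable once it has been idle longer than a timeout. `Room`, `plan_scaling`, and both parameters are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Room:
    """Snapshot of one game room (hypothetical structure)."""
    name: str
    players: int
    idle_seconds: float


def plan_scaling(
    rooms: Sequence[Room],
    min_available: int,
    idle_timeout_s: float,
) -> Tuple[int, List[str]]:
    """Return (rooms_to_add, names_of_rooms_to_reclaim) for one region."""
    empty = [r for r in rooms if r.players == 0]
    # Scale out: top the pool back up to the minimum number of available rooms.
    shortfall = max(0, min_available - len(empty))
    # Scale in: only rooms beyond the minimum buffer may be reclaimed.
    surplus = max(0, len(empty) - min_available)
    expired = sorted(
        (r for r in empty if r.idle_seconds >= idle_timeout_s),
        key=lambda r: r.idle_seconds,
        reverse=True,  # reclaim the longest-idle rooms first
    )
    return shortfall, [r.name for r in expired[:surplus]]
```

Keeping the minimum buffer out of the reclaim set is what lets the pool shrink during off-peak hours without forcing the next player to wait for a cold start.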

Zero-downtime Version Updates: Progressive Release Based on Multi-version Routing

Relying on ACK, Alibaba Cloud Container Registry's accelerated image distribution, and OKG's multi-state management, POPUCOM achieves complete decoupling between version updates and room server operations. The RoomManager serves as the central versioning hub, maintaining room servers across different versions and directing players to the appropriate version by using routing policies. During updates, ACK's cross-zone deployment and optimized image distribution ensure that room servers of new versions can be quickly synchronized and brought online across multi-region clusters. At the same time, OKG manages a progressive replacement process, allowing room servers of earlier versions to naturally retire after they complete ongoing matches. The entire process occurs without interrupting gameplay. By combining version isolation, progressive replacement, and routing control, POPUCOM eliminates the need for traditional "downtime maintenance," greatly enhancing both player satisfaction and operational flexibility.
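A toy model helps show how version isolation, routing, and draining fit together. The class below is a hypothetical stand-in, not the real RoomManager: it assumes a single routing policy (new matches always go to the target version) and a fixed room capacity of four players, and it ignores Kubernetes entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RoomServer:
    """One room server instance (hypothetical structure)."""
    name: str
    version: str
    players: int
    accepting: bool = True  # False once the server is draining


class RoomManagerSketch:
    """Illustrative router: new matches go to the target version only."""

    def __init__(self, servers: List[RoomServer], target_version: str, capacity: int = 4):
        self.servers = list(servers)
        self.target_version = target_version
        self.capacity = capacity

    def start_rollout(self, new_version: str) -> None:
        """Switch the target version and mark older servers as draining."""
        self.target_version = new_version
        for server in self.servers:
            if server.version != new_version:
                server.accepting = False

    def route(self) -> Optional[RoomServer]:
        """Pick a target-version server that accepts players and has room."""
        for server in self.servers:
            if (server.version == self.target_version
                    and server.accepting
                    and server.players < self.capacity):
                return server
        return None

    def retire_drained(self) -> List[str]:
        """Remove draining servers whose matches have ended; return their names."""
        retired = [s.name for s in self.servers if not s.accepting and s.players == 0]
        self.servers = [s for s in self.servers if s.accepting or s.players > 0]
        return retired
```

In this model an old-version server with a match in progress is never terminated; it stops receiving new players and is removed only after its player count drops to zero.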

Game O&M Middle Platform Based on SLS and CloudMonitor

In complex distributed environments, monitoring alone is insufficient to handle unexpected issues. POPUCOM implements a three-pronged observability and O&M system, which is centered on logs, metrics, and traces and augmented with unified semantic modeling and intelligent analysis. This system provides a comprehensive view of the overall operational status, enables precise root-cause attribution, and supports proactive alerts.

SLS: Global Log Collection and Behavioral Insights Hub

As the first line of observability, SLS serves as the central hub for aggregating and parsing logs, and correlating logs with business operations for analysis. Based on a unified multi-region collection architecture, SLS efficiently manages logs across countries. With SLS LoongCollector deployed across 11 global regions, SLS captures key game server logs in real time, including error stacks, state changes, and abnormal disconnects. Innovatively, SLS adopts a "local storage + global query" model: Logs are written to local storage within each region, whereas SLS StoreView enables seamless cross-region and cross-project queries, significantly improving multinational troubleshooting efficiency. Combined with a dynamic threshold alert engine, SLS provides real-time alerts for anomalies such as high-frequency crashes and logon failures. Integrated with HYPERGRYPH's SRE platform, SLS supports automated authentication, policy distribution, and collection management, forming a highly stable, self-managed PaaS-style log hub that enhances both global player experience and O&M efficiency.
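The article does not say how the dynamic threshold alert engine computes its thresholds. As a rough illustration of the general idea only, the sketch below uses a common stand-in technique: compare the current value with the mean plus a multiple of the standard deviation of recent history. The function name, the multiplier `k`, the `min_sigma` floor, and the sample counts are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_anomalous(
    history: Sequence[float],
    current: float,
    k: float = 3.0,          # assumed sensitivity multiplier
    min_sigma: float = 1.0,  # assumed floor so flat history does not alert on tiny changes
    min_samples: int = 5,    # assumed minimum history before alerting at all
) -> bool:
    """Return True if `current` exceeds a threshold derived from recent history."""
    if len(history) < min_samples:
        return False
    threshold = mean(history) + k * max(pstdev(history), min_sigma)
    return current > threshold
```

For example, if crash counts per minute have hovered between 2 and 4, a reading of 20 crosses the threshold while a reading of 4 does not. The threshold moves with the baseline, so a busy region is not held to the same fixed limit as a quiet one.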

Beyond ensuring service stability, SLS is deeply integrated with game-specific scenarios, synchronously recording behavioral data such as player actions, level progress, and item usage. Such data drives level difficulty adjustments, item distribution strategies, and tutorial iterations. When players report issues such as "lost items" or "progress rollback," the system can quickly trace the full behavior chain by using a unique session ID, correlating room server status with database transaction logs, thereby assisting customer service in precise responsibility assignment and data restoration.
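Conceptually, tracing a player report comes down to filtering several log sources by one session ID and merging the results into a single timeline. The sketch below shows that idea on in-memory records. It is not an SLS query: the field names `session_id` and `ts` and the function name are assumptions made for the example.

```python
from typing import Dict, Iterable, List

LogRecord = Dict[str, object]


def trace_session(session_id: str, *sources: Iterable[LogRecord]) -> List[LogRecord]:
    """Merge all records for one session from several sources, ordered by time."""
    matched = [
        record
        for source in sources
        for record in source
        if record.get("session_id") == session_id
    ]
    # Sorting by timestamp interleaves room server and database events.
    return sorted(matched, key=lambda record: record["ts"])


# Usage with two hypothetical sources:
room_logs = [
    {"ts": 100, "session_id": "s-1", "event": "item_granted"},
    {"ts": 130, "session_id": "s-1", "event": "disconnect"},
    {"ts": 105, "session_id": "s-2", "event": "level_start"},
]
db_logs = [
    {"ts": 110, "session_id": "s-1", "event": "txn_rollback"},
]
timeline = [r["event"] for r in trace_session("s-1", room_logs, db_logs)]
```

Here `timeline` places the database rollback between the item grant and the disconnect, which is the kind of ordering a support engineer would need when investigating a "lost items" report.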

CloudMonitor: Panoramic Visibility into Cloud Resource Health

Given the complexity of global game operations, POPUCOM relies on CloudMonitor for out-of-the-box monitoring of the core cloud resources used by the game. This enables a true leap from simply "making resource status visible" to "being able to assess system health, anticipate risks, and manage anomalies proactively."

At the infrastructure level, CloudMonitor provides one-click visibility into globally deployed critical components such as ACK clusters, cloud-native PolarDB databases, Network Load Balancer (NLB) instances, Elastic Compute Service (ECS) instances, and Redis caches. It collects key performance metrics in real time, including CPU, memory, network I/O, disk latency, and connections, and aggregates them by using a unified data pipeline. Using customized Grafana dashboards, the O&M team can monitor the operational status of all regions on a single GUI. Whether it is container scheduling pressure on an overseas node or slow query trends in a regional database, all critical operational patterns are immediately visible, significantly improving situational awareness in a multi-center, multinational deployment.

Building on this foundation, CloudMonitor deeply integrates technical metrics with business metrics. Core business metrics, such as concurrent players, logon success rates, room creation rates, and matching latency, are aligned and correlated with underlying resource usage (such as pod load, database QPS, and network bandwidth), forming a three-pronged "Resource-Service-Experience" health assessment model. Potential capacity bottlenecks are automatically identified and alerted, enabling the O&M team to scale or optimize scheduling proactively. This approach prevents service degradation due to resource saturation, achieving a shift from "reactive firefighting" to "proactive risk management."
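One simple way to picture how a business metric and a resource metric can be correlated for capacity planning is a linear projection. The article does not describe the model CloudMonitor applies, so the sketch below rests on a strong, explicitly assumed simplification: resource utilization grows in proportion to concurrent players. The function names and the 0.8 limit are hypothetical.

```python
def projected_utilization(
    current_util: float,      # e.g. 0.45 for 45% pod CPU
    current_players: int,
    forecast_players: int,
) -> float:
    """Project utilization at a forecast player count, assuming linear scaling."""
    if current_players <= 0:
        raise ValueError("current_players must be positive")
    return current_util / current_players * forecast_players


def needs_scale_out(
    current_util: float,
    current_players: int,
    forecast_players: int,
    limit: float = 0.8,       # assumed saturation limit
) -> bool:
    """Return True if the projected utilization would exceed the limit."""
    return projected_utilization(current_util, current_players, forecast_players) > limit
```

At 45% utilization with 10,000 concurrent players, a forecast of 20,000 players projects to roughly 90% and would be flagged, whereas a forecast of 12,000 players projects to about 54% and would not.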

ARMS: End-to-End Trace Diagnostics and Analysis

As POPUCOM launched globally, the technical team simultaneously upgraded the application performance monitoring architecture of their public service platform. Trace analysis was fully migrated to Managed Service for OpenTelemetry by using the standard OpenTelemetry stack without modifying any business code. The migration simply updates the backend endpoint configuration of the OpenTelemetry Collector to smoothly switch from the previous self-managed Jaeger storage solution. This reduces trace storage and O&M costs by nearly 90% and eliminates the maintenance burden of self-managed clusters.

Managed Service for OpenTelemetry offers capabilities such as service-to-service call chain reconstruction, P99 latency analysis, and service topology visualization. It clearly reveals service interactions, traffic flows, and performance bottlenecks at each stage. By correlating these traces with metrics and logs, the system enables precise root cause analysis of anomalies in core business traces such as logon, payment, and updates. This supports efficient O&M across a global, multi-region architecture. Managed Service for OpenTelemetry also handles high-traffic scenarios, such as version releases and holiday events, delivering a "usable, resilient, and transparent" observability foundation that is critical for maintaining game stability.
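For readers unfamiliar with the term, P99 latency is the duration that 99% of requests finish within. The sketch below computes it per operation from a list of span durations by using the nearest-rank method. It only illustrates the metric and is not how the managed service calculates it; the function names and the `(operation, duration_ms)` tuple shape are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p99_by_operation(spans: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group span durations by operation name and return each group's P99."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for operation, duration_ms in spans:
        grouped[operation].append(duration_ms)
    return {op: percentile(durations, 99) for op, durations in grouped.items()}
```

P99 is often more useful than an average for operations like logon or payment because it surfaces the slow tail that a small but real share of players experience.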

Based on the coordination of SLS, ARMS, and CloudMonitor, POPUCOM achieves full-stack visibility from infrastructure to application logic to user behavior. This system supports global game operations and establishes a reusable technical paradigm for real-time, interactive entertainment scenarios. Its elastic cloud-native architecture, intelligent diagnostics with full-stack observability, and continuous delivery through hot updates make the "zero-loss player experience" possible.

Looking ahead, POPUCOM will continue to enhance its observability capabilities by exploring AI-driven anomaly prediction, root cause recommendations, and automated remediation. This will advance the game toward a higher level of operational intelligence, enabling its system to not only see, but also predict and act.

Experience POPUCOM

POPUCOM is a cooperative adventure game that combines dual-color shooting with match-and-clear mechanics, offering a subtly innovative gameplay experience. Its youthful art style and user-friendly design make it accessible to players of all levels, including those who are new to gaming. The carefully crafted, interactive levels are full of fun, engaging challenges that deliver a rich gameplay experience. POPUCOM is ideal for gathering friends or family members after work for a collaborative puzzle-solving adventure, providing a joyful and relaxing social gaming experience.
