This document presents a reference architecture for gaming using Alibaba Cloud services.
This solution presents an overview of common components and design patterns used to host a game infrastructure on the Alibaba Cloud platform.
The gaming industry has evolved into a successful entertainment business over the past few decades. With the proliferation of broadband Internet, online gaming has become a major pattern in the industry.
Online gaming comes in several forms, including session-based multiplayer matches, massively multiplayer virtual worlds, and intertwined single-player experiences.
In the past, running a game on a client-server model meant building a data center and purchasing and managing dedicated on-premises servers, so only large studios and publishers could handle it. Extensive forecasting was also required because the hardware had to be provisioned in advance to handle all customer requests.
In today’s cloud-based computing environment, game planners and developers of all sizes can request and receive the resources they need on demand. This reduces excessive initial expenses and lowers the risk that the infrastructure cannot absorb a sudden spike in requests. It is a business change from CAPEX-based operation to OPEX-based operation, so risks can be reduced significantly.
In general, the process until a game is operated is listed below:
In the past, games followed a waterfall-type process that moved sequentially from planning through development to operation. Today, an agile-type process is used, in which the quality of a game is judged continuously from a business perspective.
Recently, users play online games on various types of clients (PC, mobile, and console), which places very strict requirements and challenges on the infrastructure environment.
When users experience delays while playing a game, phrases like “I’ve got lag” or “I’m lagging” are common. In IT terms, this lag phenomenon is called latency.
Game latency cannot be interpreted as a network-only problem. For example, even offline single-player games that make little use of the network may experience lag.
Some potential causes are:
These problems can be solved by performing cloud server hardware maintenance, management, and upgrades.
Cloud services provide customers with various types of infrastructure services and more edge servers, which gives customers shorter and more efficient network paths. In addition, the cloud offers hardware from various vendors, so customers can choose hardware that is optimized for their service.
In an online gaming environment, network quality is the most important factor. A user’s brain can only tolerate a network delay small enough that it is not perceived by the user.
Research papers suggest that the maximum network latency for many games should not exceed 200 ms.
Shooting games and fighting games are generally assumed to need a latency of less than 100 ms because they require fast reflexes and quick action.
Latency should not exceed 500 ms for role-playing games (RPGs) like World of Warcraft. These games do not require immediate trigger reaction times, but they do need reliable network quality to perform tasks such as casting skills or spells and reacting to events within the universe.
Asynchronous single-player gameplay, which includes most forms of mobile gaming, allows for latencies of up to 1000 ms. Users of these games are generally more tolerant of network quality than users of other game types because they are only interested in the local environment changing and maintaining status.
However, these are general guidelines, and some players may be more sensitive to increased latency.
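The latency guidelines above can be captured in a small lookup table. The following is a minimal sketch, assuming the figures are treated as inclusive upper bounds; the genre labels and function names are chosen for this illustration only.

```python
from typing import Dict, List

# Latency budgets in milliseconds, taken from the guideline figures above.
# The genre keys are labels invented for this sketch.
LATENCY_BUDGET_MS: Dict[str, int] = {
    "shooter": 100,
    "fighting": 100,
    "general": 200,
    "rpg": 500,
    "async_single_player": 1000,
}


def within_budget(genre: str, measured_ms: float) -> bool:
    """Return True if the measured latency does not exceed the genre's budget.

    Unknown genres fall back to the 200 ms "general" budget.
    """
    budget = LATENCY_BUDGET_MS.get(genre, LATENCY_BUDGET_MS["general"])
    return measured_ms <= budget


def playable_regions(genre: str, rtt_by_region: Dict[str, float]) -> List[str]:
    """List regions whose measured round-trip time fits the budget, fastest first."""
    ok = [region for region, ms in rtt_by_region.items() if within_budget(genre, ms)]
    return sorted(ok, key=lambda region: rtt_by_region[region])
```

A check like this could help decide which regions need a closer game server for a given genre, although real players may tolerate less than these general figures.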
Users continually judge service quality as part of their experience of the game. This service quality cannot be improved by simply increasing network or hardware performance.
Here are some key factors that cloud gaming service providers should consider:
Modern games offer very good graphical interfaces. Even in a 4K environment, they require a graphics infrastructure that can play smoothly without broken pixels. It is necessary to provide powerful instance nodes with built-in, high-performance graphics cards and a robust network backbone to ensure the consistency of the video streaming service.
As I mentioned before, the most important quality is the response time of the game server. For a global game service with users distributed across multiple regions, it is necessary to reduce latency and add redundancy by distributing game servers worldwide. In addition, logic is required to prevent malicious attacks from degrading service quality, to verify the behavior of the game client on the server, and to protect the game from abuse or fraud.
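As an illustration of verifying client behavior on the server, the sketch below rejects movement updates that imply an impossible speed. The `MAX_SPEED` value, the `Position` type, and the tolerance factor are hypothetical and would depend on the rules of the actual game.

```python
import math
from dataclasses import dataclass


@dataclass
class Position:
    x: float
    y: float


# Hypothetical game rule: the fastest a character may travel, in world units per second.
MAX_SPEED = 10.0


def is_move_plausible(prev: Position, new: Position, elapsed_s: float,
                      tolerance: float = 1.1) -> bool:
    """Reject moves that imply a speed above the allowed maximum.

    `tolerance` leaves a little headroom for timing jitter between client and server.
    """
    if elapsed_s <= 0:
        return False
    distance = math.hypot(new.x - prev.x, new.y - prev.y)
    return distance <= MAX_SPEED * elapsed_s * tolerance
```

The server keeps the last accepted position and ignores or flags updates that fail the check, so a modified client cannot teleport a character.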
Game users expect game downtime to approach zero. This means the individual server instances where user sessions are stored must be protected from external influences.
For this situation, it is recommended that the game maintain each player’s state in a single game-server process. This is called a sticky session. It has many advantages, such as failover in the event of a server process crash and effective load balancing.
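One possible way to pin a player to a single game server is rendezvous (highest-random-weight) hashing. This is a technique chosen for this sketch, not something the architecture prescribes, and the server names in the usage note are placeholders. Each player maps to the same server on every request, and if that server disappears, only its players are reassigned.

```python
import hashlib
from typing import List


def _score(player_id: str, server: str) -> int:
    """Deterministic score for a (player, server) pair."""
    digest = hashlib.sha256(f"{player_id}|{server}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_server(player_id: str, servers: List[str]) -> str:
    """Pin a player to one server: the highest score wins (rendezvous hashing)."""
    if not servers:
        raise ValueError("no game servers available")
    return max(servers, key=lambda server: _score(player_id, server))
```

For example, `pick_server("player-42", ["gs-1", "gs-2", "gs-3"])` returns the same server no matter which frontend instance performs the lookup, as long as the server list is the same.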
As mentioned in the previous chapter, it is difficult to satisfy the requirements of the current gaming industry without the cloud.
Cloud service providers draw on vast infrastructure to offer a large pool of resources and do not place restrictions on users.
Also, these resources are elastic. You can use only a portion of a resource, or you can burst beyond it as needed. Instances configured on the cloud platform scale up or down automatically to meet your needs. You don’t have to upgrade your hardware every time you need more processing power or storage.
Major cloud service providers usually have many servers worldwide, so they are not tied to a single region.
In particular, the cloud provides game service players with the following advantages:
Accelerated networks across multiple regions and accelerated distribution of content, such as textures, UI, audio, and special effects
High clock speed CPU instances or GPU instances that support many concurrent users while ensuring minimal latency
Data security through multi-layered protection (network, application, session, hardware) and secure connections
No single point of failure (SPOF), thanks to high-performance databases, automatic failover, and disaster recovery
Alibaba Cloud, which has the advantages above, provides a standard architecture optimized for the gaming industry using the following services:
The next chapter introduces a reference architecture that allows easy planning, development, and operation of gaming services using Alibaba Cloud’s services.
A common name for a frontend is a platform service or online service. The platform service provides an interface for essential gaming functions, such as allowing players to join the same game server instance or maintaining a friend-list social graph in the game. In general, this is the service that game clients access when connecting to online games. This service includes the following elements:
Users typically communicate with the backend through the frontend service.
However, because the frontend service communicates with the Internet, it can be exposed to various attacks. To address these security issues, it must be hardened with security services against DDoS and various network attack patterns.
1) I divided the VPC into development / test / operation. Dividing the environment by development stage makes the development-deployment process more stable.
2) You can load balance service ingress traffic using SLB. The SLB family includes ALB for application-layer (Layer 7) load balancing and CLB, which supports both Layer 4 and Layer 7 load balancing.
3) Each gaming service is deployed on ECS. Alibaba Cloud ECS instances are lightweight and fast to deploy. We can achieve stability and scalability by using instances to maintain the sessions of users accessing gaming services.
Backend services usually have no externally exposed network routes and IP addresses and only provide interfaces to frontend and other backend services. For that reason, external clients cannot communicate directly with the backend service.
On the backend, you typically place services that connect directly with data, such as game state data in a database or data warehouse and analytics events.
As games grow in popularity, they can leverage stateless protocols to scale easily. Creating HTTP/JSON APIs for most gaming functions (starting the game, exiting, etc.) allows you to add instances dynamically and recover from temporary network issues.
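The idea of a stateless HTTP/JSON API can be sketched as a handler that keeps nothing in process memory between requests. In the example below, the request fields (`player_id`, `action`) and the dictionary that stands in for a shared database are assumptions made for illustration.

```python
import json
from typing import Dict


def handle_request(body: str, store: Dict[str, dict]) -> str:
    """Handle one JSON request without keeping any state in the process.

    `store` stands in for an external database or cache shared by all
    instances, so any instance can serve any request.
    """
    req = json.loads(body)
    player = req["player_id"]
    action = req["action"]

    if action == "start":
        # Idempotent: repeating the call leaves the same state, so retries are safe.
        store[player] = {"status": "playing"}
    elif action == "exit":
        store[player] = {"status": "offline"}
    else:
        return json.dumps({"ok": False, "error": "unknown action"})

    return json.dumps({"ok": True, "status": store[player]["status"]})
```

Because each call is self-contained and idempotent, a client that hits a temporary network error can simply retry, and the retry may land on a different instance.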
1) In the architecture above, the instance type is not indicated. However, in most of the game industry, GPU instances are mainly used for graphics rendering. Alibaba Cloud offers GPU instances that perform well and are relatively inexpensive compared to other platforms. In addition, any game company that uses ML to implement functions, such as NPCs and virtual user AI, can use AIACC, which can accelerate ML training by 20% or more with software logic.
2) Depending on the type of service, the service located in the backend may also communicate with the outside. To prepare for such a case, secure Internet communication using a NAT gateway can be implemented by placing a public subnet in the backend VPC.
3) You can use Function Compute (FC), an Alibaba Cloud serverless service, to implement the stateless protocol mentioned above. You can create a more flexible and scalable environment by implementing one-time workloads (such as game access, release, execution, and termination) as logic connected to DB and OSS using FC.
4) In many recently implemented games, most of the modules that emphasize scalability run in a container environment. On Alibaba Cloud, the ACK service provides a Kubernetes environment that can manage containers efficiently. We can efficiently build the nodes underlying ACK from various types of instances, such as CPU, GPU, and high-memory instances.
5) We need to create an OSS bucket in our backend environment. This OSS bucket stores binary gaming content, such as patches, levels, and inventory. OSS uses APIs to upload and download data. In addition, OSS can be used as file storage or shared storage, providing cost-effective handling of the stateful storage required by a variety of workloads.
When operating in the existing on-premises environment, it was easy to maintain the network separation architecture, maintain the network topology, and use multi-layered security.
However, cloud-native networks and security must be considered by default in the current cloud environment. In particular, the following points should be considered.
In particular, Alibaba Cloud provides services for multi-layered security. A list of services can be found in the figure below:
1) We can use three services for network security.
First, most online gaming services are subject to more DDoS attacks than other industries. Anti-DDoS that can handle a lot of traffic is an essential component to defending against large-scale traffic attacks without affecting the service.
Next, we are still exposed to the risk of attack at the application layer even if we prevent DDoS. In particular, the OWASP Top 10 list is revised periodically as attacks evolve and become more threatening. A WAF can be configured to defend against these threats. WAF protects the application layer against these attacks and also has an AI function, so it can defend against continuously changing attack types.
Finally, threats to the most basic network levels still exist even with a successful defense against the most daunting DDoS and application attacks. We can configure a north-south / east-west network firewall using Cloud Firewall, which is a basic L4 firewall.
2) Among the workloads configured in the backend, we implemented a serverless workload using Function Compute to implement the stateless protocol. Function Compute is configured to be called as an HTTP/JSON API, so the open API must be managed.
We can use the API Gateway service to manage our API. API Gateway can manage API versions and permissions and monitor all traffic.
3) As the graphic quality of games improves and the content increases, installation sizes grow. Currently, most PC games take up more than 50 GB, and users spend a lot of time downloading and installing them.
In particular, downloadable content (DLC) that can be expressed as expansion packs has become a major source of revenue in recent years. Users continue to expect new characters, levels, and quests for months (if not years) after the game’s initial release. DLC, which can deliver this content quickly and cost-effectively, is having a major impact on the profitability of the gaming industry.
Gaming clients are usually distributed through app stores on specific platforms, but updating new versions of games to deliver new levels can be cumbersome and time-consuming.
When distributing such content (game patches, extensions, or betas) to a large number of clients, it is more efficient to use DCDN than to access the storage directly. DCDN is very useful because it can easily accelerate both static binary files and dynamic content. We can deliver binary content to clients cost-effectively by creating an OSS bucket as the origin behind DCDN.
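One way a client might decide which patch files to fetch through the CDN is to compare content hashes between a local and a remote manifest. This is a sketch under that assumption; the manifest layout (file path mapped to SHA-256 digest) and the file names in the usage note are hypothetical and not part of any Alibaba Cloud API.

```python
import hashlib
from typing import Dict, List


def file_digest(data: bytes) -> str:
    """SHA-256 hex digest used to identify a file version."""
    return hashlib.sha256(data).hexdigest()


def files_to_download(local: Dict[str, str], remote: Dict[str, str]) -> List[str]:
    """Return the paths that are new or changed in the remote manifest, sorted."""
    return sorted(path for path, digest in remote.items() if local.get(path) != digest)
```

With a local manifest holding `levels.pak` and `ui.pak`, and a remote manifest where `ui.pak` has changed and `dlc1.pak` is new, the client downloads only `dlc1.pak` and `ui.pak` instead of the whole game.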
The database that stores the game world state and player’s game progress data is the most important element of the game infrastructure.
Database capacity should be planned to handle both the expected workload under normal circumstances and the workload that arises when the game is a huge success. If a database was designed and tested only for the expected number of players and suddenly crashes under a much higher load, the game becomes unplayable, and the entire business can collapse. Therefore, database design should be treated as a top priority in the gaming industry.
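A rough capacity estimate can make this planning concrete. The sketch below multiplies concurrent players by a per-player write rate and a "success" multiplier; all three inputs are assumptions that each studio would have to supply from its own forecasts.

```python
import math


def required_write_capacity(concurrent_players: int,
                            writes_per_player_per_s: float,
                            success_multiplier: float = 1.0) -> int:
    """Estimate peak database writes per second, rounded up.

    `success_multiplier` models the case where the game attracts many more
    players than expected (for example, 10.0 for ten times the forecast).
    """
    if concurrent_players < 0 or writes_per_player_per_s < 0 or success_multiplier <= 0:
        raise ValueError("inputs must be non-negative and the multiplier positive")
    return math.ceil(concurrent_players * writes_per_player_per_s * success_multiplier)
```

For instance, 10,000 concurrent players writing once every two seconds need about 5,000 writes per second, and ten times that if the game is a runaway hit.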
The advent of horizontally scaling applications has changed the traditional approach of application tiers and relational databases. Many new databases have become popular by relaxing the traditional ACID (Atomicity, Consistency, Isolation, Durability) guarantees in favor of lightweight instances and distributed storage. Such a NoSQL database is better suited to instant games with simple character attributes, items, and game structures than to games with complex relational data.
In general, the biggest bottleneck for online games is database performance. A typical web-based app reads more and writes less, but a game is the opposite. Reads and writes to the database frequently contend with each other because the game state changes constantly. The database type and structure must be designed with this characteristic in mind.
1) Games with a large world and a large amount of data to handle (such as massively multiplayer online role-playing games, or MMORPGs) should use a relational database as the main data store. Relational databases can provide traditional forms of reliable data management.
Traditional relational databases focus on vertical scaling. However, it can be difficult to change the schema of a running database without downtime. We can use Alibaba Cloud PolarDB and RDS to achieve stability, ease of use, and scalability. This can be a great choice when you want to keep your existing data structures but still take advantage of cloud-native features.
2) NoSQL can provide a solution for scalable operations for write-heavy workloads. However, you need to understand the NoSQL data model, access patterns, and transaction guarantees.
In particular, NoSQL databases are designed with horizontal scaling in mind. Resizing a cluster can usually be done with no downtime, but there is sometimes some performance loss until the additional nodes are fully integrated.
Alibaba Cloud MongoDB is a document-oriented database. Data is stored in nested data structures similar to structures used in general programming.
MongoDB is widely used as the primary data store for games and is often used with Redis. Transient game sessions are kept in Redis, and then progress is stored in MongoDB at certain logical points (auto-saves and checkpoints). Redis provides high-speed access to latency-sensitive game data, while MongoDB provides simplified persistence.
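The division of labor between the two stores can be sketched with in-memory stand-ins. `SessionStore` plays the role of Redis and `ProgressStore` the role of MongoDB; both classes and the `checkpoint` function are hypothetical names for this illustration, and a real service would use the respective client libraries instead.

```python
from typing import Dict, Optional


class SessionStore:
    """Stand-in for Redis: fast, transient per-player session state."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def update(self, player_id: str, **fields) -> None:
        self._data.setdefault(player_id, {}).update(fields)

    def get(self, player_id: str) -> dict:
        # Return a copy so callers cannot mutate the stored session.
        return dict(self._data.get(player_id, {}))


class ProgressStore:
    """Stand-in for MongoDB: durable player progress documents."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def save(self, player_id: str, doc: dict) -> None:
        self._docs[player_id] = dict(doc)

    def load(self, player_id: str) -> Optional[dict]:
        doc = self._docs.get(player_id)
        return dict(doc) if doc is not None else None


def checkpoint(player_id: str, sessions: SessionStore, progress: ProgressStore) -> None:
    """Copy the current session state into durable storage (an auto-save)."""
    progress.save(player_id, sessions.get(player_id))
```

Frequent in-game updates touch only the session store, while `checkpoint` runs at logical points such as the end of a level, so the durable store sees far fewer writes.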
Analytics has evolved into an important component of today’s games. Both online services and gaming clients can store analytics events in a centralized database. Then, anyone from programmers and designers to business intelligence analysts and service representatives can query these events. As the complexity of the analytics data collected increases, it is necessary to maintain an architecture that allows quick and easy querying of these events.
Alibaba Cloud provides analysis, management, and monitoring using each data service in the following topology.
1) Data stored in PolarDB and RDS can be migrated quickly to the AnalyticDB data warehouse using DTS.
2) Log data can be stored in OSS using Alibaba Cloud Log Service. The stored data can then be processed with the DLA service for ETL and data analysis. You can benefit from organizing and maintaining this data in a data lake.
3) Data lake and DB migration data can be analyzed in a low latency and high concurrency environment with an AnalyticDB service suitable for each type.
4) AnalyticDB, where data is stored, can be managed efficiently using DMS and DataWorks. In addition, you can configure a dashboard for data or perform control operations using QuickBI and DataV.
The entire architecture built with the Alibaba Cloud services above is managed and operated in the cloud environment. We need to take full advantage of the cloud in terms of operations.
1) In the past, monitoring and logging for an on-premises environment only had to give the operator an intuitive view of a fixed environment. However, the number of instances in a cloud environment continually increases and decreases. There may be no externally exposed IP, and a server may be created in response to an event and then disappear immediately. In this environment, dynamic configuration has become the most important function for cloud monitoring and logging.
Alibaba Cloud has a Cloud Monitor service that can perform infrastructure and application monitoring for the entire workload. It can also send alarms for operators to react immediately when a problem arises.
In addition, logs for all services can be collected, analyzed, and monitored using the SLS service.
2) If you look at the backend area, most of the core game functions consist of containers. Containers must be created from container images, and companies must manage these important images.
Alibaba Cloud provides ACR service for container image management. Operators can use ACR to manage container image versions and control access rights. In addition, the security of the base image can be strengthened using the image vulnerability scanning function.
3) The environment configured so far includes VPCs configured differently for Dev/Test/Prod and a variety of services, so multi-layered access rights must be managed.
Alibaba Cloud RAM provides account access rights, role, and group management and performs access control on the connections between APIs and services. An operator of a gaming system can use RAM to manage access to services and the connections between them efficiently.
4) If we were to pick the most important topic in a gaming service, it would be performance.
If you check all the network steps from when the player clicks on his character’s information view to check the data, you can see that database access takes the most time.
We can use Alibaba Cloud Redis to accelerate the data reading process. Redis provides basic data types, such as counters, lists, sets, and hashes, which are accessed using high-speed, text-based protocols. These unique data types make Redis an ideal choice for rankings, lists, player counts, stats, inventory, and simple data.
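The read acceleration described above is commonly implemented as a cache-aside pattern. The sketch below uses a plain dictionary as a stand-in for Redis and a caller-supplied function as the database; the class name and the `db_reads` counter are illustrative only.

```python
from typing import Callable, Dict


class CacheAside:
    """Cache-aside reads: check the cache first, fall back to the database."""

    def __init__(self, load_from_db: Callable[[str], dict]) -> None:
        self._load_from_db = load_from_db
        self._cache: Dict[str, dict] = {}  # stand-in for Redis
        self.db_reads = 0                  # counts slow-path lookups

    def get(self, key: str) -> dict:
        if key in self._cache:
            return self._cache[key]
        self.db_reads += 1
        value = self._load_from_db(key)
        self._cache[key] = value
        return value

    def invalidate(self, key: str) -> None:
        """Drop a cached entry after the underlying record changes."""
        self._cache.pop(key, None)
```

When a player opens the character information view repeatedly, only the first request reaches the database. The entry must be invalidated (or given an expiry in a real cache) whenever the character data changes.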
5) When providing online game services, there are cases where it is necessary to send text messages, such as OTP authentication for registration and mobile push for events to players worldwide. In the past, an SMS could be sent by contract with a telecommunication service provider to provide text messages in each country, but this method requires a complicated process.
Alibaba Cloud’s SMS service makes it easy to send SMS messages using communication channels already established with telecommunication operators in various countries worldwide.
6) With Alibaba Cloud, you no longer have to waste resources on building expensive infrastructures, such as purchasing hardware and software licenses. Alibaba Cloud translates initial costs into lower operating costs and allows you to only pay for what you use.
When purchasing an instance, an operator can save money by purchasing a Reserved Instance or Spot Instance. An operator can also reduce the usage fee for each service using a Savings Plan.
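The trade-off between pay-as-you-go and reserved pricing comes down to simple arithmetic. The sketch below compares the two for one billing period; the prices in the usage note are made-up placeholders, not Alibaba Cloud list prices, and Spot Instances and Savings Plans are not modeled.

```python
def break_even_hours(pay_as_you_go_hourly: float, reserved_period_price: float) -> float:
    """Hours of use per period above which the reserved price is cheaper."""
    if pay_as_you_go_hourly <= 0:
        raise ValueError("hourly price must be positive")
    return reserved_period_price / pay_as_you_go_hourly


def cheaper_option(hours_used: float, pay_as_you_go_hourly: float,
                   reserved_period_price: float) -> str:
    """Compare the total cost of one billing period under the two pricing models."""
    pay_as_you_go_total = hours_used * pay_as_you_go_hourly
    return "reserved" if reserved_period_price < pay_as_you_go_total else "pay-as-you-go"
```

With a placeholder hourly price of 0.5 and a reserved price of 180 per month, the break-even point is 360 hours: an always-on game server (about 720 hours per month) favors the reserved option, while a test server used 100 hours per month does not.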
Alibaba Cloud provides a stable and fast infrastructure platform for services in many countries worldwide. Alibaba Cloud is contributing a lot to the development of the gaming industry using the following advantages:
If the person reading this is in charge of developing and operating a cloud-native gaming service, I hope this article will help you compose an efficient architecture.
If you have any questions about this architecture or would like any consultations, please send an email to email@example.com
JJ Lim - December 3, 2021