Changyou: Best practices of game O&M
Posted: Apr 13, 2017, 14:00
Guests in this issue:
Li Zhigang, director of Changyou O&M Department;
Xiang He, senior architect at Alibaba Cloud.
The move to the cloud is inevitable, and more and more enterprises are embarking on the cloudization path. As cloud computing comes to benefit everyone, every industry is changing.
Alibaba Cloud Industry Roundtable brings together 12 major industry categories (apps, websites, games, finance, e-business, audio and video, health, education, energy, government affairs, transportation, and manufacturing) and invites long-time Alibaba Cloud customers to share their paths to the cloud and their technical practices on it.
Changyou business briefing
Changyou has long been engaged in mobile and PC games and has accumulated a dozen years of experience in game O&M, said Li Zhigang, adding that he was very pleased to be invited by Alibaba Cloud to share that experience. Last year we launched the 2D game Hortensia SAGA, which received several rounds of recommendations from Apple. Tianlongbabu, our flagship product, is doing well in both its mobile and client editions, especially in the MMORPG category. You are welcome to try it, Li said.
What solutions does Alibaba Cloud offer for hybrid clouds?
Xiang He said that, thanks to Changyou's trust, the two teams worked through the difficult job of debugging full-chain compatibility, from hardware to virtualization to the operating system, as well as the painstaking process of bringing the old operating system into the formal production environment.
The hybrid cloud fits Changyou's situation closely, and Changyou has very clear requirements for it. First, the quality of the managed physical IDC must be excellent. Second, on-cloud and off-cloud environments must be connected by high-bandwidth, high-stability, and high-availability links. We will work with third-party partners to build this infrastructure together. At the front end, the hybrid cloud has a very important advantage in security, because Alibaba's security system is very complete. The hybrid cloud model not only meets the user's core requirements for offline scenarios but also adds an extra layer of security by relying on Alibaba's security system. Compared with conventional solutions that simply connect on-cloud and off-cloud environments, the hybrid cloud may be more practical and offers clearer advantages.
Changyou's cloud practices
What changed after the games migrated to the cloud? What was Changyou's cloudization process?
Li Zhigang holds that the biggest difference between moving games to the cloud and moving other industries is that game development, game versions, and game applications have many special characteristics. These lead to unique requirements for the cloud operating system, drivers, and special configurations that must coexist with other applications. The requirements on the cloud are therefore more demanding, and different versions and types of games place different requirements on the public cloud. When we worked with Alibaba Cloud to deploy our PC games, the products' old operating system had to be adapted, because it was too old for the application to pull data from the cloud, yet the applications could not be set up on newer operating system versions. Alibaba Cloud therefore built a customized operating system version for us.
From the very beginning, the cloud was convenient to use, and we made great progress in reducing maintenance costs and in many other areas. Our fault fixes and application launches became dozens or even hundreds of times faster than before. Amid the explosive growth of mobile games, and considering their resource needs, we gave priority to the public cloud and used the full range of Alibaba Cloud services. At present, most of our business runs in the cloud.
What are the trends of gaming on the cloud?
Li Zhigang believes that the cloud brings a great deal of convenience. While our mobile games are almost entirely on the cloud, we also hope to migrate our PC games, so as to improve redundancy and efficiency and keep maintenance costs under control. We have made many attempts to migrate PC games and to test old PC games on the cloud, and so far progress has been relatively smooth. In the future, all our mobile and web games will run on the cloud, and PC games will be migrated step by step.
Xiang He added that cloudization is a gradual process. At first nobody dared to put their games on the cloud; gradually they came to trust the environment step by step, and now a certain degree of trust has been built. When we promote our cloud products, we do not blindly push users to migrate all their core businesses to the cloud in one go. This is exactly why the hybrid cloud exists. We are also constantly learning what users need in game scenarios, including the DB layer, operating system compatibility, and the performance presentation layer, and we will keep enriching our products. We have now launched the multi-frequency instance type, which better suits old PC games and scenarios that depend heavily on CPU computing performance. We have put a lot of effort into this.
Game companies today may work with a number of cloud computing vendors. Could you briefly describe this situation?
There are only two ways to make a mobile game accessible to users: iOS and Android. Li Zhigang said that because every channel is tied to the number of users it brings in, the choice of cloud vendor may differ when a channel offers better promotion services to ensure user access; this is driven by market factors.
Changyou: A preference for MySQL as the database
Which databases does Changyou use most often? What considerations guide the choice between self-built MySQL and Alibaba Cloud RDS?
Changyou's preference for RDS can be attributed to the evolution of its game architecture. At the very beginning, the databases and applications sat together on one or two machines, and data consistency and file integrity could suffer. After moving to the cloud, we found a complete set of solutions already in place, covering the network layer, data interaction, and data usage at large. Redis provides functions that MemoryCache cannot, and it suits far more scenarios than our self-developed MemoryCache. We wrote MemoryCache only for our own games, whereas Redis has been validated by a wide variety of applications. Redis is stronger than MemoryCache in robustness and redundancy in every respect, which is why we prefer Redis services.
Changyou uses MySQL most often, and occasionally SQL Server and MongoDB. After so many years in the game industry, Changyou has developed its own set of MySQL configuration standards. We have tuned the parameters of this version to support all the games within the company, and the same version is used directly on Alibaba Cloud instances. Some of our other games may require a customized database, a specific version, or support for a specific data type. In those cases we communicate with the development team to make adjustments and follow the developer's opinion.
SQL Server optimization focuses primarily on stored procedures and SQL statements. During migration to the cloud, a particular stored procedure or SQL statement may take a very long time to execute, making access very slow. Someone familiar with SQL Server is then needed to locate and fix the problem quickly. We can also draw on MySQL troubleshooting methods, but troubleshooting may take different amounts of time on the two databases because they work differently. Games in different areas use different operating systems and database versions chosen by the developer, so whenever we bring in a game, we always need to communicate with the developer.
When choosing, we mainly look at the type of game. For a PC game, we give priority to the MySQL version verified on Tianlongbabu. For a web application or a mobile game, we use the services provided by Alibaba Cloud directly.
Xiang He added that, based on the projects he has handled, the databases used in China's game industry fall into several types: SQL Server, MySQL, MongoDB, and self-built databases. From the perspective of collaborative development, a team using a familiar technology can estimate quite accurately, from experience, the volume of concurrent writes to the database and the number of writes to data blocks for a project. With a brand-new technology, such estimates may be beyond the team's control. Alibaba Cloud RDS was designed from day one with high availability as its top priority, followed by data reliability, and it can fully meet the needs of the game industry. For example, we have also built high-specification Redis and MemoryCache instances for local caching scenarios to meet the needs of heavyweight games.
What are your thoughts on the database deletion incident at a game company before the Spring Festival, and on the GitLab database deletion incident?
Li Zhigang thought for a moment and then said that Changyou had also experienced a similar incident and had optimized and adjusted its practices immediately. We started regular checks, simulated failures to practice immediate recovery, and verified that our backup policies worked. However, recovery and checks alone are not enough. Our concerns became how to prevent accidents during deletion and formatting, and how to recover the database if an accident does happen. To this end, we modified the operating system at installation and initialization, encapsulating certain commands so that engineers cannot execute those commands and scripts directly on the system; they can only be run together with the matching scripts and customized functions. We also ensured that a deletion operation moves files to another location rather than destroying them, so that the files can be retrieved.
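The "move instead of delete" safeguard can be sketched as a small wrapper that relocates a file or directory into a trash area and returns its new path so it can be restored. This is a minimal illustration, not Changyou's actual tooling: the names `safe_delete`, `restore`, and `DEFAULT_TRASH` and the trash path are hypothetical, since the article does not say how the encapsulated commands are implemented or where deleted files go.

```python
import shutil
import tempfile
import time
from pathlib import Path

# Hypothetical trash location: the article does not say where moved files go.
DEFAULT_TRASH = Path("/var/tmp/ops-trash")


def safe_delete(target: str, trash_dir: Path = DEFAULT_TRASH) -> Path:
    """Move `target` into a unique bucket under `trash_dir` instead of deleting it.

    Returns the new location so the file or directory can be restored later.
    """
    src = Path(target).resolve()
    if not src.exists():
        raise FileNotFoundError(src)
    # Refuse the filesystem root outright; a real wrapper would guard more paths.
    if src == Path(src.anchor):
        raise PermissionError(f"refusing to delete {src}")
    trash_dir.mkdir(parents=True, exist_ok=True)
    # One fresh bucket per call, so deleting the same path twice never collides.
    bucket = Path(tempfile.mkdtemp(prefix=time.strftime("%Y%m%d-%H%M%S-"), dir=trash_dir))
    # Mirror the original absolute path inside the bucket to make restores unambiguous.
    dest = bucket / src.relative_to(src.anchor)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest


def restore(moved: Path, original: str) -> None:
    """Move a trashed file or directory back to its original path."""
    Path(original).parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(moved), original)
```

In practice such a wrapper would sit behind the encapsulated shell commands, with a separate job purging old buckets once the retention window has passed.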
Changyou's business architecture and O&M practices
How has Changyou's architecture evolved along with its business?
Moving from the basic C/S architecture of the early days to an architecture with load balancing, we split out many application functions, such as login, scenes, and verification. Payment operations and resources were moved to separate servers, which made the service architecture relatively robust. After migrating to the cloud, the business can be switched over to backup machines within minutes or even seconds; even without a backup machine, the switchover can be completed within ten minutes. With Redis, the integrity of data files has also greatly improved.
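A fast switchover to backup machines usually rests on a health check that promotes a standby after repeated failures. The sketch below assumes a simple active/standby group with a caller-supplied health probe and a threshold of consecutive failed checks; the class name, parameters, and threshold are hypothetical, not a description of Changyou's or Alibaba Cloud's mechanism.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FailoverGroup:
    """Hypothetical active/standby group: promote a standby after repeated failed checks."""

    active: str
    standbys: List[str]
    is_healthy: Callable[[str], bool]  # health probe supplied by the caller (assumption)
    max_failures: int = 3              # consecutive failures tolerated before switching
    _failures: int = field(default=0, init=False)

    def tick(self) -> Optional[str]:
        """Run one health check; return the newly promoted host if a switchover happened."""
        if self.is_healthy(self.active):
            self._failures = 0
            return None
        self._failures += 1
        if self._failures < self.max_failures:
            return None  # tolerate short network jitter before acting
        # Promote the first healthy standby; the failed node leaves the group.
        for host in self.standbys:
            if self.is_healthy(host):
                self.standbys.remove(host)
                self.active = host
                self._failures = 0
                return host
        return None  # no healthy standby: a new machine would have to be provisioned
```

The failure threshold trades detection speed for stability: with checks every few seconds, a threshold of three switches over within seconds while ignoring a single dropped probe.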
Games can face high-traffic, high-concurrency scenarios. What does the architecture need in order to handle them?
Li Zhigang said that, to prevent file data loss and minimize the impact of network jitter, a MemoryCache caching layer is added between the intermediate database and the application at the game design stage. This mechanism is meant to prevent data loss across all data interactions and to address high concurrency. Load balancing is now widely used in mobile games, and under high concurrency there is even more it can do. Previously, a single server room had a DDoS defense capacity of only 10 Gbps; after the architecture moved to the cloud, total defense bandwidth can reach 300 Gbps.
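The article does not describe how the MemoryCache layer behaves, so the following is only one plausible reading: a write-back cache that applies writes in memory immediately and persists them later, keeping an entry dirty if the database write fails so that a transient network error does not lose it. `WriteBackCache`, `load`, `store`, and the choice of `ConnectionError` as the retryable error are all assumptions for illustration.

```python
from typing import Callable, Dict, Optional, Set


class WriteBackCache:
    """Hypothetical cache layer between game logic and the database.

    Reads are served from memory when possible; writes land in memory at once
    and are persisted on flush(). A failed persist keeps the entry dirty, so a
    transient network or database error does not lose the write.
    """

    def __init__(self, load: Callable[[str], Optional[str]],
                 store: Callable[[str, str], None]) -> None:
        self._load = load    # reads one key from the database (assumed callable)
        self._store = store  # writes one key to the database (assumed callable)
        self._data: Dict[str, str] = {}
        self._dirty: Set[str] = set()

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            value = self._load(key)
            if value is None:
                return None
            self._data[key] = value
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._dirty.add(key)

    def flush(self) -> int:
        """Persist dirty keys; return how many are still pending after this attempt."""
        for key in list(self._dirty):
            try:
                self._store(key, self._data[key])
            except ConnectionError:
                continue  # keep the key dirty and retry on the next flush
            self._dirty.discard(key)
        return len(self._dirty)
```

A write-back design absorbs bursts of writes under high concurrency, at the cost of a window in which data exists only in the cache; a production version would also need replication or persistence of the cache itself.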
Global server sharing, where players worldwide share the same servers, places high demands on sharing data files and on combining network access. There are now many cloud-based practices for global server sharing.
Speaking of Alibaba Cloud's overseas data nodes, Xiang He said that Alibaba Cloud's overseas strategy aims to spread data centers across the world, and it has already put multiple data centers into service in North America, Southeast Asia, and Europe. Another very important technical feature is that Alibaba Cloud plans to connect these data centers via high-speed channels, including domestic cross-city leased lines and international leased lines from China to overseas. This connectivity provides the basic technical conditions for global server sharing. Global server sharing suits particular game types, such as SLG and chess and card games. As game companies expand overseas, such projects will become increasingly common, and Alibaba Cloud hopes to have the fundamental conditions ready before they take off so that they can be built on the cloud more easily.
What O&M tools are used inside Changyou?
Over the past five or six years, we have consistently pursued O&M automation, and we eventually built a complete O&M platform. On the one hand, game O&M is automated, as are game version releases, online updates, and maintenance. On the other hand, all asset information is fed into a data bus to build a CMDB system. The system records and labels all online asset information in the database, and all the tools, scripts, and systems used for O&M must go through this bus; any change to an online application is made here first. We also built a version release platform on which every script that runs on the servers is registered and released automatically by the system, which reduces execution risk across all servers. When an issue occurs, the system automatically calls the O&M staff. We also developed an internal app through which all alerts are issued, and we linked it to WeChat.
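One way to enforce "only registered scripts run on servers" is to record a content hash at registration and refuse anything whose current content does not match. This is a minimal sketch under that assumption; `ScriptRegistry` and its methods are hypothetical and stand in for whatever Changyou's release platform actually does.

```python
import hashlib
from typing import Dict


class ScriptRegistry:
    """Hypothetical registry: only scripts registered here may be released to servers."""

    def __init__(self) -> None:
        self._approved: Dict[str, str] = {}  # script name -> SHA-256 of approved content

    @staticmethod
    def _digest(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def register(self, name: str, content: bytes) -> str:
        """Approve a script version and return its digest for the release record."""
        digest = self._digest(content)
        self._approved[name] = digest
        return digest

    def can_run(self, name: str, content: bytes) -> bool:
        """True only if the script is registered and unchanged since registration."""
        return self._approved.get(name) == self._digest(content)
```

Hashing the content, rather than trusting the file name, means an edited or injected script is rejected even if it reuses the name of an approved one.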
Turning IT from a cost center into a profit center: today Changyou's system has become very mature, so how can we open it up for other game companies to use? We hope to turn our own systems into an open-source framework that peers across the industry can tune; the more people involved, the better the framework will become. First, our products and systems must be modularized and productized; second, we will open up our entire O&M talent development program.
As for future O&M trends, Li Zhigang believes that to relieve the heavy business pressure on O&M teams, the work must first be simplified through automated O&M, and that intelligent O&M and unattended O&M are future directions. Most O&M today is reactive, so we are also considering using big data for situation-awareness-based O&M.
Changyou has also explored container services. When Docker was emerging, we drew on other companies' experience to build containers suited to Changyou's business, with the priority of solving business problems. Our containers are now connected to the R&D team's core production workflows: after developers finish writing code, they upload it to a directory, and the system automatically syncs it to the pre-release environment for functional checks there. Our Docker containers draw heavily on the Docker community and open-source resources. Customized products are still hard to popularize, so we want to work with Alibaba on Docker containers. That way, when our own containers cannot meet our needs, we can quickly switch to the Docker container service instead. This also ties into the concept of the hybrid cloud.
Xiang He continued that Alibaba Container Service addresses compatibility first. It will support on-premises deployment to connect offline and online instances, so that applications and code running in offline vertical environments and pre-release environments can be launched on the cloud platform with one click, providing end-to-end continuous delivery. Second, Alibaba and Docker have formed an in-depth strategic partnership. We will continue to develop in the Docker container field, and we will likely launch container services designed specifically for the game industry.