Best practices of moving to the cloud - The e-commerce industry
Posted time: Mar 22, 2017 9:41 AM
I. Beginning of the story
One day in mid-December 2014, a customer contacted us. Because the lease on his cabinets would expire in February 2015, he wanted to migrate his resources to the cloud and asked for our help. His demand was clear: a cloud architecture and resource plan built around the platform's current data and characteristics, taking both cost and performance into account.
Customer background: A well-known Hangzhou-based online custom gift shopping platform established in 2006.
II. First-phase research
Number of members in the operation and maintenance team (4): 1 person for operation and maintenance, 1 operation and maintenance architect, 1 network engineer
Number of members in the customer research and development/test team: 30 members
Telecommunications room resources of the customer:
• 3 racks
• About 40 hardware servers (mainly Dell R410; 2 servers with 16 cores / 96 GB RAM were used for Xen virtualization)
• 200 Mbps of exclusive (dedicated) bandwidth
Customer environment profile:
From the chart above we can see the architecture profile of traditional e-commerce:
1. Because the e-commerce environment held a great many product pictures, a CDN was indispensable.
2. On the server side, Nginx + Varnish served as a second-level cache at the front end, mainly to reduce the back-to-origin pressure from the CDN.
3. There were more than 10 backend business systems, including designe/kderp/seach/res/seo/oc/img; the development languages used here were mainly Java, PHP and Python. The main OS here was CentOS. Several Windows environments also existed.
4. The image source files and other files were mainly shared through NFS, with mounted disks. The image data volume was more than 2TB.
5. Redis was used as the cache in front of the database to reduce the pressure on the database.
6. MySQL was the main database, deployed as master/slave on hardware servers.
III. Confirming the cooperation intention
From the customer perspective, there were 3 pain points:
1. Although the cloud has clear advantages in cost, scalability, flexibility and efficiency, using cloud products and architectures flexibly carries a certain technical threshold. Designing a low-cost, high-performance architecture from cloud resources takes experience.
2. The customer did not have a response center operating 24 hours a day, 7 days a week. When an alarm was raised, it was often hard to reach the operation and maintenance team in time, so response and resolution were delayed and 7x24 operation and maintenance could not be guaranteed.
3. The customer had 4 operation and maintenance personnel, and cost was also the most substantial pain point.
Through negotiation, we settled the cooperation intention at the end of December (the specific business details will not be covered here). To address the customer's pain points, we would deliver cloud architecture solutions, cloud migration, 7x24 monitoring, and 7x24 operation and maintenance services (with our service as the main one and the customer's as a supplement).
IV. Challenges of the migration to the cloud
• Challenge I: Time was short. The customer's rack leases would expire in February. Furthermore, Valentine's Day on February 14 is the peak business period of the year. Since the cooperation was only agreed at the end of December, the project schedule required us to finish cloud migration, testing and formal launch by mid-January (only 2 weeks). We also set aside another 2 weeks as an observation period for the transition.
• Challenge II: There were too many business systems and technology environments. Our analysis showed that the customer had more than ten business systems running on a variety of technology environments, including Nginx, Varnish, Tomcat, PHP, Python, Redis and MySQL, which greatly increased the difficulty of migration.
• Challenge III: Configuration documentation and specifications were both absent. On this point I absolutely have complaints. It is hard to imagine that a system operated and maintained for 8 years had no configuration documentation or manuals, only a few scattered architecture diagrams. What's more, the host names, firewall rules and configuration files followed no consistent specification. During the migration we also met a strange situation: the customer had forgotten the password of the switch in the cabinet room, and their network engineer had to crack it personally to recover it. From all of this, you can imagine the difficulty of the migration and the challenges we met.
V. Migration to the cloud
After New Year's Day, we formally started work on January 4, 2015. Upon arriving at the company (in Shanghai), I made some simple preparations and then packed my luggage. As the operation and maintenance leader, I drove to Hangzhou with 2 architects, 1 DBA, 1 senior operation and maintenance engineer and 2 intermediate operation and maintenance engineers. Our task was to carry out the 2-week cloud migration project.
5.1 Project initiation: January 5, 2015
This was our first day working formally at the customer's company. We spent most of the day defining the responsibilities of both parties' staff on the project and compiling the project contact list. We also settled the project implementation plan. The project cycle would be 12 days.
5.2 System architecture reorganization and evaluation: from January 6, 2015 to January 7, 2015
Next came the migration itself. First we needed to assess the original system and design the on-cloud architecture. The assessment of the original system covered the system structure, software module architecture, business architecture, interfaces and calling dependencies, performance evaluation, and the objectives of the migration to the cloud.
The on-cloud architecture design covered the system structure, software architecture, business architecture, performance targets after the migration, and the difficulties of the cloud migration. The on-cloud architecture diagram is as follows:
The differences between it and the IDC architecture were:
• Practice 1 of the migration to the cloud: Server Load Balancer (SLB) instances were added to secure a flexible and scalable architecture. At the front end we added SLB for load balancing. In the original IDC architecture, DNS resolved to different Nginx + Varnish systems, which served the front-end static cache and then forwarded requests to the corresponding backend business systems. Adding SLB made the architecture more flexible: all domain names were bound to SLB, which forwarded to Nginx on the backend. With Nginx we could configure virtual hosts and other flexible Layer-7 controls.
• Practice 2 of the migration to the cloud: Use TCP-layer Server Load Balancer (SLB) to protect performance. In practice, in scenarios with high concurrency requirements, we found a wide gap between HTTP-layer and TCP-layer load balancing. HTTP-layer load balancing could only reach a concurrency of 10,000, while TCP-layer load balancing could reach several hundred thousand or even millions. So for e-commerce and other website applications, we preferred the TCP layer when configuring SLB.
• Practice 3 of the migration to the cloud: Bandwidth billed by usage, for low cost and high efficiency. In the IDC room, the one-year cost of the 200 Mbps exclusive telecommunications bandwidth was about 100 yuan per Mbps per month x 12 months x 200 Mbps = 240,000 yuan. On the cloud, the BGP multi-line Server Load Balancer (SLB) bandwidth with a 1 Gbps peak improved bandwidth quality by several orders of magnitude. Besides, the bandwidth was billed by the amount of traffic used, which greatly reduced our cost.
• Practice 4 of the migration to the cloud: RDS, with its low cost and high efficiency, was the preferred choice for the database. On the IDC hardware, MySQL master/slave was deployed and maintained manually, which brought a lot of maintenance and management costs in the later phase. We had to monitor and maintain the master/slave state in case something went wrong, in order to keep database reads and writes continuous for the business. After RDS was adopted, these problems were handled automatically: monitoring of the master/slave state, backup, maintenance, failover and so on were all fully automated.
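The bandwidth arithmetic in Practice 3 can be written out as a minimal Python sketch. The IDC figures (200 Mbps at 100 yuan per Mbps per month) come from the text above. The pay-by-traffic unit price is a made-up placeholder, not a real Alibaba Cloud tariff, and the function names are illustrative only.

```python
def annual_fixed_bandwidth_cost(mbps, yuan_per_mbps_month, months=12):
    """Yearly cost of a fixed-bandwidth line billed per Mbps per month."""
    return mbps * yuan_per_mbps_month * months


def break_even_traffic_gb(fixed_cost_yuan, yuan_per_gb):
    """Yearly traffic (GB) below which pay-by-traffic billing is cheaper."""
    return fixed_cost_yuan / yuan_per_gb


# Figures from the article: 200 Mbps at 100 yuan per Mbps per month.
idc_cost = annual_fixed_bandwidth_cost(200, 100)
print(idc_cost)  # 240000

# HYPOTHETICAL unit price, chosen only to make the example concrete.
ASSUMED_YUAN_PER_GB = 0.5
print(break_even_traffic_gb(idc_cost, ASSUMED_YUAN_PER_GB))
```

With a real price list in place of the placeholder, the second function gives the yearly traffic volume at which usage-based billing stops being cheaper than the fixed IDC line.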
5.3 Migration plan: from January 6, 2015 to January 7, 2015
While analyzing and evaluating the system architecture, we also confirmed the migration plan, that is, how to migrate both the applications and the data to the cloud. At the same time we determined the process of system cutover and launch and the corresponding time frame. In the migration plan, we confirmed the customer's on-cloud resource list (23 ECS instances, 2 RDS instances, 1 Server Load Balancer instance) and the specific server configurations.
• Practice 5 of the migration to the cloud: The advantage of cloud computing is that it is distributed. Many users compare a single cloud host with a traditional physical server of the same configuration and then complain about the poor performance of the cloud host. A traditional physical server with a powerful multi-core, high-frequency CPU does perform far better than a single cloud host. But what is cloud computing? The key word is "cloud": distributed computing is its biggest advantage. Therefore in practice we do not pursue the performance of a single machine. Instead, we secure high business performance through distributed design. In this project, our standard server configuration was 4 cores and 8 GB, and many servers had only 2 cores and 4 GB. Through distributed computing we made full use of each server's resources and kept the final cost as low as possible (details of the cost are given below).
In the migration plan, the image files posed a certain degree of difficulty. On the one hand, the offline image data directory held more than 2 TB, whereas a single cloud disk could hold 1 TB at most at the time (the official website currently states that a single disk can hold 32 TB). On the other hand, most of the files in those 2 TB were small image files. How should we migrate these files to the cloud?
• Practice 6 of the migration to the cloud: The application of LVM in disk management. For the cloud migration, we bought four 1 TB data disks (each ECS could attach 4 data disks at most) and combined them through LVM into a single 4 TB logical volume. This left more than 2 TB of spare space for storing data on the cloud. Note that using LVM is not officially recommended. An Alibaba Cloud snapshot targets a single disk and cannot snapshot several disks at the same point in time. LVM, by contrast, abstracts several disks (physical volumes) into logical volumes. Reads and writes go to the logical volume, while the data is scattered across the underlying physical volumes (disks). If the data on one disk is damaged and that disk is restored from its snapshot, the integrity of the data on the LVM logical volume cannot be guaranteed. On the other hand, LVM can improve disk IO performance. For example, suppose we need a 100 GB data volume. Conventionally we would simply buy one 100 GB data disk, but we can also buy four 25 GB data disks and combine them into a 100 GB volume with LVM. Both meet the functional demand, but with LVM the disk IO performance can be raised by at least 20%-40%.
• Practice 7 of the migration to the cloud: The application of Rsync in the cloud. How could we migrate offline data to the cloud without service downtime? Rsync is the optimal solution for the synchronous migration of incremental files. In this project, however, the data had to travel over the public network and the data volume was large. According to our preliminary estimate, completing the data migration would take a week or more. Given this long duration, we had to start it in advance to avoid delaying the overall migration.
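To make the snapshot caveat in Practice 6 concrete, here is a minimal Python sketch of how a striped logical volume spreads data over its disks. It assumes the volume was created as a striped LV (LVM volumes are linear unless striping is requested) with an assumed 64 KiB stripe size. The article does not state the actual layout, and the function names are illustrative only.

```python
from typing import Set, Tuple

STRIPE_SIZE = 64 * 1024  # assumed stripe size in bytes


def locate(offset: int, n_disks: int,
           stripe_size: int = STRIPE_SIZE) -> Tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    stripe_no = offset // stripe_size   # index of the stripe chunk overall
    disk = stripe_no % n_disks          # chunks rotate across the disks
    row = stripe_no // n_disks          # chunks that precede it on that disk
    return disk, row * stripe_size + offset % stripe_size


def disks_touched(start: int, length: int, n_disks: int,
                  stripe_size: int = STRIPE_SIZE) -> Set[int]:
    """Disks holding any byte of a file stored at [start, start + length)."""
    first = start // stripe_size
    last = (start + length - 1) // stripe_size
    return {chunk % n_disks for chunk in range(first, last + 1)}


# A 1 MiB file on a 4-disk striped volume has chunks on every disk, so
# restoring one disk from its own snapshot rolls back only part of the file.
print(sorted(disks_touched(0, 1024 * 1024, 4)))  # [0, 1, 2, 3]
print(locate(5 * STRIPE_SIZE + 10, 4))           # (1, 65546)
```

The same layout suggests why striping can help throughput: a large sequential read is served by several disks at once instead of one.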
5.4 Migration implementation: from January 6, 2015 to January 7, 2015
More than 20 cloud hosts were involved, covering various deployment environments, including Nginx, PHP, Tomcat, Redis, Varnish and so on. We maximized deployment efficiency through automated deployment: the environments on all 23 online servers were deployed within 30 minutes.
• Practice 8 of the migration to the cloud: The domain name registration must be done in advance. The last step of the migration to the cloud is to resolve the domain name to the public IP address of the Server Load Balancer (or of the ECS). The premise is that the domain name is registered with Alibaba Cloud. Otherwise, once the domain name is resolved to Alibaba Cloud, it will be blocked and access to your business will be denied, which becomes very troublesome. So we need to register the domain name through Alibaba Cloud in advance, or transfer the registration to Alibaba Cloud if it was done through another supplier.
• Practice 9 of the migration to the cloud: Enhance cloud deployment efficiency with images. At the beginning we started one ECS and optimized it according to our operation and maintenance specifications, such as system tuning, security reinforcement and so on. Then we made a base image from that ECS and started 22 servers with the same environment in a batch. In this way we greatly improved the efficiency of the deployment.
• Practice 10 of the migration to the cloud: The application of automated operation and maintenance tools. Our internal team stored all the software installation scripts in the internal GitLab. With Ansible, we wrote the corresponding playbooks and pushed the install scripts to the target servers. Within 5 minutes, we finished the installations for the various environments, such as Java, PHP, Python and so on.
From then on, we entered the most painful period of the migration. Because operation and maintenance configuration manuals and documents were lacking, we had to carefully debug each parameter and configuration after deploying the application code to the environments we had built. The 3 of us from operation and maintenance, together with the customer's operation and maintenance personnel and R&D team, completed the debugging of all the code and the corresponding configuration files, working without sleep for 1 day and 1 night. At this point we had completed most of the migration work. The remaining core work focused on functional testing, performance testing and the on-line cutover.
5.5 Migration test: from January 9, 2015 to January 11, 2015
The main tasks in this stage were functional and performance testing, carried out mainly by the customer's test team.
5.6 On-line cutover: from January 13, 2015 to January 15, 2015
Before the on-line cutover, both the customer-facing and the internal maintenance announcements had to be made properly. The systems, code and files had already been migrated by the time the formal cutover started. However, the customer had many databases and it was impossible to migrate them in real time, so we adopted a conservative approach: downtime migration. The last step in the migration was to resolve the domain name to Alibaba Cloud. As mentioned earlier, the domain name needed to be registered in advance.
Had we then completed the migration process? In fact, we had not. The domain name had been resolved to the new IP address, and Net.cn refreshed the latest resolution records in as little as 10 minutes. But we could not control the local DNS caches on the client side, so some users would still be visiting the old site. We therefore had one last step to take in order to complete the migration process.
• Practice 11 of the migration to the cloud: Nginx reverse proxy, to guide old users' requests to Alibaba Cloud. For users who still reached the applications in the IDC server room, we configured a 302 redirection on the Nginx at the front of the IDC so that their requests were routed to Alibaba Cloud. Note that because Nginx works at Layer 7, we needed to match the domain name: the server_name in Nginx was the same as the domain name in the redirection link. To make sure that this domain name resolved to Alibaba Cloud from the Nginx server, we could pin its resolution in the hosts file of the Nginx server to the corresponding Alibaba Cloud IP address.
5.7 Project delivery and post-monitoring operation and maintenance
The follow-up work was project delivery, whose main task was to write up the documentation. We compiled more than 30 documents for the project, mainly covering system software architecture, system structure, migration plan, operation and maintenance implementation configuration documents, operation and maintenance manuals, troubleshooting documents, resource list and so on.
After delivering the documents, we entered the follow-up phase of 7x24 daily monitoring and operation and maintenance, which we won't discuss further here.
VI. Comparison of before and after the migration to the cloud
While writing this article, I kept searching my mind for the "best" practice of migration to the cloud. Among hundreds of thousands of customer practice cases, this one gave me my most profound experience, and no words can fully express my feeling. All I want to say is included in the following comparison chart:
• Practice 12 of the migration to the cloud: IT is changing to DT. With the arrival of cloud computing, the age of traditional IT is giving way to the age of Big Data (DT). Cloud computing offers low cost, high efficiency, flexible expansion and many other advantages, so it is gradually phasing out the traditional IDC-based IT model. The comparison above shows the cost difference. Before the migration, there were 4 operation and maintenance personnel. In the first year after the migration, the customer kept only 1 operation and maintenance person to deal with daily matters. In the second year, the customer eliminated that remaining position, leaving no operation and maintenance personnel at all. In some ways, the cloud era has an impact on the operation and maintenance industry. Many people are facing unemployment, because traditional small and medium-sized internet companies no longer need them to deal with trivial matters that the cloud platform now handles. On the other hand, it brings new opportunities and challenges - it requires technical staff to be more comprehensive. This is the root reason why a lot of people say that DevOps is the way to the future.