Best Practices for Efficient Use and Management of Large-Scale Cloud Servers

01 How to Migrate to the Cloud Quickly

We divide cloud migration into four stages: overall assessment before migration, the migration process itself, verification of the migrated system, and the final business cutover. The Server Migration Center product introduced here helps you optimize the migration and verification stages, making them faster and more efficient.

There are three common ways to migrate:

◾ The first option is redeployment: rebuilding the original offline environment, step by step, on the cloud. This is not recommended in terms of ease of use, speed, or fidelity of restoration.

◾ The second is image export: exporting an image from your local environment according to Alibaba Cloud's image specifications and then uploading it to Alibaba Cloud for use. Faithful system restoration is guaranteed, but ease of use and speed are still not optimal.

◾ The third option is Alibaba Cloud's Server Migration Center. You only need to download a client, run it locally, and create a migration task; the Server Migration Center then performs the entire migration for you automatically.

What are the advantages of Alibaba Cloud Server Migration Center?

◾ Firstly, it is a highly mature product that supports the mainstream image formats in the industry.

◾ Secondly, it is highly automated: a single command runs the entire process unattended. We provide APIs and a console so you can observe the whole process and its results.

◾ Thirdly, it is highly intelligent. From the start of migration onward, any issues encountered during execution are repaired automatically, making the whole process more efficient and smooth.

Users can also choose among multiple migration forms based on their own scenarios. We support both incremental and full migration to keep the online and offline environments fully consistent, and users can select from multiple replication modes depending on their situation.

The Server Migration Center is a highly automated product that supports batch migration of multiple instances; whatever the scale of the migration, it can support it efficiently. If you need to migrate workloads to Alibaba Cloud, we strongly recommend this product.

02 How to Build Large-Scale Resource Scenarios at Low Cost

How do you build large-scale server fleets at low cost? There are two core keywords here: low cost and large scale. Let's look at how to use Alibaba Cloud ECS while spending as little as possible.

When ECS is used at scale, the first question is efficiency. For example, if a business peak today requires 1,000 machines, can we deliver those 1,000 machines in the shortest possible time? Secondly, can we run these 1,000 machines at a lower cost? Thirdly, can the process be automated to reduce human involvement and lower the cost of management and maintenance?

First, efficiency. We recommend using the ECS launch template feature. Perhaps some of you here have already used it: a launch template is a persistence tool for ECS configuration data. Any ECS instance created on Alibaba Cloud can save its full configuration into a launch template; at any later time, instances can be created quickly from that configuration without reconfiguring, and every change is tracked through version management. Even if you have never used it before, it is easy to start: you can quickly generate a launch template from any existing instance, and the template's configuration is that instance's configuration.
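The version-managed configuration store that a launch template provides can be pictured with a small sketch. This is an illustrative Python model only, not the real ECS API; all class and field names here are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class LaunchTemplate:
    """Toy model of a launch template: it persists full instance
    configurations and records every change as a new version."""
    name: str
    versions: list = field(default_factory=list)  # version 1 is index 0
    default_version: int = 1

    def create_version(self, config: dict) -> int:
        """Save an instance configuration as a new template version."""
        self.versions.append(dict(config))
        return len(self.versions)

    def config_for(self, version=None) -> dict:
        """Configuration to launch from (defaults to the default version)."""
        v = version or self.default_version
        return self.versions[v - 1]

# Capture an existing instance's configuration as version 1 ...
tpl = LaunchTemplate("web-app")
tpl.create_version({"instance_type": "ecs.g6.large", "image_id": "centos_7"})
# ... and record a later change (a bigger instance type) as version 2.
tpl.create_version({"instance_type": "ecs.g6.xlarge", "image_id": "centos_7"})

print(tpl.config_for()["instance_type"])   # ecs.g6.large (default version)
print(tpl.config_for(2)["instance_type"])  # ecs.g6.xlarge
```

Because old versions are kept, "create from template" and "roll back a configuration change" become the same cheap operation: pick a version number.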

Beyond quickly creating instances, a launch template can be used in other ways. For example, suppose you need a highly elastic web application, such as an online web service with peak hours every day: more resources during peaks, fewer during off-peak hours. You can quickly create an elastic scaling group from an existing launch template.

For example, there is a scheduled mode: if the business peak is at 8 am, scale-out is scheduled for 8 am, and if the off-peak period starts at 6 pm, machines are removed at 6 pm. Second, a dynamic mode: add machines when CPU usage exceeds 50% and remove them when it falls below 40%. Third, a manual mode: users trigger scaling activities from their own locally built systems.

In addition, if you want more complete control over the process, we also provide lifecycle hooks. For example, when the scaling group is about to scale in and you find that some log files on an instance still need to be backed up, you can use a lifecycle hook to reject the current scale-in, and the scaling group will keep the resource. There is also a notification capability: any scale-out or scale-in can be reported to you via DingTalk, SMS, or email. Moreover, the scaling group can attach instances to SLB and RDS automatically, helping users quickly build a highly elastic web service.

If you do not need sustained elasticity but simply need to use large-scale computing resources in batches, say 1,000 machines at once, we recommend the auto provisioning group. It is designed for delivering large amounts of computing power in batches. For example, if you currently need 10,000 vCPUs, you can set a target of 10,000 vCPUs using the provisioning group's capacity mode, and the system will automatically work out how many instances to create to reach that target. At the same time, based on your own cost considerations, you can choose a mix of pay-as-you-go and Spot instances to support your business.
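The capacity-mode arithmetic can be illustrated with a short sketch. The helper name and the pay-as-you-go/Spot split below are hypothetical; the real provisioning group performs this calculation server-side:

```python
import math

def plan_capacity(target_vcpus, vcpus_per_instance, spot_ratio=0.0):
    """Translate a vCPU capacity target into instance counts, optionally
    splitting the fleet between pay-as-you-go and Spot instances."""
    total = math.ceil(target_vcpus / vcpus_per_instance)
    spot = math.floor(total * spot_ratio)
    return {"total": total, "spot": spot, "pay_as_you_go": total - spot}

# 10,000 vCPUs delivered as 16-vCPU instances, with 60% of the fleet on Spot:
plan = plan_capacity(10_000, 16, spot_ratio=0.6)
print(plan)  # {'total': 625, 'spot': 375, 'pay_as_you_go': 250}
```

The point of the capacity mode is that you state the target (10,000 vCPUs), not the mechanics (625 instances of a particular size); the system is free to mix instance types to hit the target.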

In addition, there are multiple allocation strategies. The cost-optimized strategy has the system create instances at the lowest price each time, minimizing your cost; the balanced strategy spreads instances across multiple zones, improving the system's availability. To cover more scenarios, the auto provisioning group also offers three delivery modes: a continuous "maintain" mode that keeps your resource amount at the target, a one-time "request" mode, and an "instant" mode. The instant mode can be understood as an upgraded version of the RunInstances API: where RunInstances supports only a single instance type and a single availability zone, instant mode adds more comprehensive capabilities on top.

Auto provisioning groups make the delivery process smoother and raise the success rate.

If you use the elasticity capabilities above to create resources, you can reliably reach a 99.9% scaling success rate and deliver 1,000 ECS instances in one minute. On this basis you can quickly build your own elastic scenarios, including the most demanding, extreme-elasticity ones.

Having covered efficiency, let's talk about using these resources at low cost. Let me briefly introduce the Spot instance, a pay-as-you-go instance with two characteristics. One is its low price, which floats between a deep discount and the full pay-as-you-go price. The other is that it can be reclaimed: you bid at a price you can accept, and if your bid falls below the market price, the instance may be released by the system. In short: cheap, but with the possibility of being released.

If your current business runs entirely, or partially, on pay-as-you-go instances, you can gradually replace some of them with Spot instances. As the Spot ratio increases, the cost approaches the minimum. At this point you will surely ask: if I use so many Spot instances and price changes cause them to be released, won't my business be affected? On top of this, we provide further capabilities to avoid that problem.
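The effect of raising the Spot ratio on cost can be worked out with a simple blended-cost calculation. This is illustrative only: the 20% Spot price used below is an assumed figure, not a quoted price:

```python
def blended_cost(on_demand_price, spot_fraction_of_od, spot_ratio):
    """Average hourly cost per instance for a fleet that is partly Spot.
    spot_fraction_of_od: Spot price as a fraction of on-demand (assumed)."""
    return on_demand_price * ((1 - spot_ratio) + spot_ratio * spot_fraction_of_od)

# Assume (hypothetically) Spot runs at 20% of the pay-as-you-go price:
for ratio in (0.0, 0.5, 0.9):
    print(ratio, round(blended_cost(1.0, 0.2, ratio), 2))
# 0.0 -> 1.0, 0.5 -> 0.6, 0.9 -> 0.28
```

As the formula shows, the blended cost falls linearly toward the Spot price as the Spot ratio grows, which is why increasing the ratio "infinitely approaches the lowest" cost.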

First, consider a deployment whose whole workload runs on a single Spot instance type: if the price of that one type spikes, all of the instances may be released at once. So we introduced an optimization for Spot scenarios: when you use Spot instances, you can specify multiple instance types and have the system create whichever are lowest-priced, for example three, as shown on the left in the figure. By spreading across multiple instance types, you avoid the problem caused by the release of a single type.
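Spreading the fleet over the lowest-priced types is a simple selection step; a sketch of it, with made-up instance type names and prices, might look like this:

```python
def cheapest_types(spot_prices, n=3):
    """Pick the n lowest-priced instance types to spread a Spot fleet
    over, so a price spike on one type cannot release everything."""
    return sorted(spot_prices, key=spot_prices.get)[:n]

prices = {  # hypothetical current Spot prices per hour
    "ecs.g6.large": 0.12,
    "ecs.c6.large": 0.10,
    "ecs.r6.large": 0.15,
    "ecs.g5.large": 0.09,
}
print(cheapest_types(prices))
# ['ecs.g5.large', 'ecs.c6.large', 'ecs.g6.large']
```

With the fleet split three ways, a price spike on any single type can reclaim at most a third of the capacity instead of all of it.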

At the same time, we added a second capability: the automatic Spot compensation mechanism. Without it, when all the Spot instances are released, the business suffers a cliff-like outage after only the 2-minute release notice. With the compensation mechanism enabled, our system automatically judges about 5 minutes in advance and creates replacement instances; the replacements are up and swapped in before the originals are released, so the cliff-like drop disappears. Through these two methods you can host business on Spot instances with much more confidence while reducing overall resource cost.

Beyond the basic capabilities above, there are also automation capabilities. Here are a few simple examples. First, elastic scaling groups provide scaling rules of several types.

◾ Simple scaling rules. A rule such as "add 4 ECS instances when CPU exceeds 20%". This mode generally suits businesses whose load does not change frequently, and can be likened to a manually controlled air conditioner.

◾ Step scaling rules. An enhanced form of simple scaling rules: you can define multiple load intervals, each responding in a different way. This lets you encode your own operational experience about how much capacity to add under different load levels, giving higher flexibility; it can be compared to a semi-automatic air conditioner.

◾ Target tracking scaling rules. A fully automatic scaling capability: you only specify the level at which the load should be held, for example CPU at 50%, and the system automatically decides how many machines to add or remove. The whole process needs no manual intervention and is smoother.
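The three rule types above differ in how the adjustment is chosen. The interval lookup at the heart of a step scaling rule can be sketched as follows; the intervals and adjustments are illustrative examples, not defaults:

```python
def step_scaling_adjustment(cpu_percent, steps):
    """Return the capacity adjustment for whichever interval the
    metric falls into; 0 if no interval matches (a dead band)."""
    for lower, upper, adjustment in steps:
        if lower <= cpu_percent < upper:
            return adjustment
    return 0

# Operational experience encoded as intervals: the hotter the CPU,
# the more instances we add; a cold group sheds capacity instead.
rules = [
    (50, 60, +2),   # mildly hot: add 2 instances
    (60, 80, +4),   # hot: add 4
    (80, 101, +8),  # very hot: add 8
    (0, 30, -2),    # cold: remove 2
]
print(step_scaling_adjustment(72, rules))  # 4
print(step_scaling_adjustment(25, rules))  # -2
print(step_scaling_adjustment(40, rules))  # 0 (dead band, no action)
```

A simple rule is this table with a single row; target tracking replaces the hand-written table with a controller that computes the adjustment from the gap between the metric and its target.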

On top of these we added a further type: predictive scaling rules. When predictive scaling is enabled for a scaling group, we use machine learning models to learn the group's resource usage and load changes over the past 1 to 14 days, predict the load changes for the next 2 days, and generate hourly scheduled tasks for the scaling group based on the prediction, so that resources are prepared in advance. This suits cyclical businesses very well: for example, if your website's daily traffic peaks arrive at a relatively fixed time and scale, you can use this mode, and once enabled no manual intervention is needed at all.

But what about sudden traffic during this process, which cannot be predicted? Predictive mode can be layered with target tracking and the other existing modes: predictive scaling covers the daily periodicity, while target tracking responds to the unexpected. By stacking multiple modes, a stable and effective result is achieved.

Next, let me share the rolling upgrade feature, which addresses the release problems we all face in daily work. With rolling upgrades, the system does the work for you: you only need to configure how many batches the machines are divided into. Before each batch is updated, its machines enter a standby state and stop serving external traffic; after the update, they exit standby and resume serving, and then the next batch proceeds. At any point you can decide whether to retry, roll back, or continue. This reduces the overall release cost and helps everyone complete daily application releases conveniently, without building their own release system.
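The batch-by-batch flow can be sketched as follows. This is a simplified model with hypothetical names; the real feature also handles health checks, retries, and rollback between batches:

```python
def rolling_upgrade(instances, batch_count, update):
    """Upgrade instances batch by batch: each batch goes into standby
    (out of service), is updated, then returns to service before the
    next batch starts, so part of the fleet is always serving."""
    batch_size = -(-len(instances) // batch_count)  # ceil division
    history = []
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        # 1) standby: this batch stops serving external traffic
        for inst in batch:
            update(inst)  # 2) apply the new release to the instance
        # 3) exit standby: the batch serves again; next batch proceeds
        history.append(batch)
    return history

upgraded = []
batches = rolling_upgrade([f"i-{n}" for n in range(5)], 2, upgraded.append)
print(batches)        # [['i-0', 'i-1', 'i-2'], ['i-3', 'i-4']]
print(len(upgraded))  # 5
```

The invariant the loop maintains is the whole value of the feature: at most one batch is out of service at a time, so capacity never drops below (batches − 1)/batches of the fleet.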

Having discussed efficiency, low cost, and automation, let's look at two customer examples. First, Huiliang Technology runs its online advertising business on elastic scaling products. Its final profit is advertising revenue minus resource cost, so resource cost matters a great deal, and it uses a large amount of resources; hence the choice of elastic scaling. By setting a mix of pay-as-you-go and Spot instances and enabling the automatic Spot compensation mechanism, it keeps the overall cost at roughly 30% to 40% of the pay-as-you-go price.

The second example is Shenshi Technology, a company specializing in artificial intelligence and molecular simulation algorithms. Its workload consists mainly of interactive tasks, each requiring a lot of resources under strict cost control. In this scenario it chose full Spot mode, minimizing cost while setting a Spot price cap for each request to ensure the overall cost boundary is never exceeded, ultimately meeting its business needs.

03 How to Efficiently Manage Resources

Once you have a large number of resources on Alibaba Cloud, how do you manage them efficiently?

There are many aspects to managing resources; here we cover just three: cost, efficiency, and security.

◾ Cost. When many teams consume resources and the resources are numerous, how do you know which resources cost how much? How do you attribute cost to each team?

◾ Efficiency. How do you quickly locate the right resources and carry out daily operation and maintenance work efficiently?

◾ Security. As sub-accounts multiply, how do you control API call permissions between them and keep things secure?

Today we bring Alibaba Cloud's recommended best practice: use tags to group your resources.

For example, suppose you have purchased various types of resources on Alibaba Cloud, belonging to different teams and environments. Say one team is the information department in Beijing, and its production environment uses a batch of resources. Looking at the resources alone, you cannot tell which ones belong to the Beijing information department's production environment. But if you define region, department, and environment as tags and attach them to the instances, you can switch to a clear tag-based view in which your resources are grouped automatically by tag, even across products. You can group by one tag or by several, defined to fit your scenario; each resource can carry up to 20 custom tags.
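The tag-based view is just a group-by over tag values, which a short sketch makes concrete. The resource IDs and tag keys below are made up for illustration:

```python
from collections import defaultdict

def group_by_tag(resources, key):
    """Group resources (possibly from different products) by the
    value of one tag key; untagged resources land in their own bucket."""
    groups = defaultdict(list)
    for res in resources:
        groups[res["tags"].get(key, "(untagged)")].append(res["id"])
    return dict(groups)

resources = [  # hypothetical mix of ECS and RDS resources
    {"id": "i-ecs1",  "tags": {"dept": "beijing-info",  "env": "prod"}},
    {"id": "i-ecs2",  "tags": {"dept": "beijing-info",  "env": "test"}},
    {"id": "rm-rds1", "tags": {"dept": "shanghai-ops", "env": "prod"}},
]
print(group_by_tag(resources, "env"))
# {'prod': ['i-ecs1', 'rm-rds1'], 'test': ['i-ecs2']}
```

Grouping by a second key ("dept") slices the same fleet the other way; that interchangeability is what makes tags more flexible than any fixed folder hierarchy.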

Once resources are tagged, many things become easier: tags enable straightforward cost accounting, operations, and security control.

After grouping, cost accounting and operations become easy. Once the relevant tags are attached, you can open the Expenses console and query the cost of all resources under a given tag, with the details broken down by month, day, or hour, achieving quick cost allocation. If you need to view resources at the intersection of several tag sets, you can perform expense analysis by creating a financial unit; the financial unit feature supports binding multiple tags for expense filtering. One thing to note: tag-based billing is T+1, so after you add a tag to a resource, the billing data only becomes visible the next day.

After tagging, you can open the operations and maintenance orchestration console and quickly perform maintenance tasks on the tagged resources, such as sending commands, executing scripts, batch restarts, and batch renewals.

Similarly, after tagging you can go to the access control console and attach policies that reference tags: an API call must carry a certain tag, and if it does not, the entire request is rejected. In this way, permissions between accounts are isolated.
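The effect of such a tag-conditioned policy can be modeled in a few lines. This is a simplified sketch of the authorization decision, not real policy syntax:

```python
def authorize(request_tags, required_tag):
    """Allow an API call only if it carries the required (key, value)
    tag pair -- the effect of a tag-conditioned access policy."""
    key, value = required_tag
    return request_tags.get(key) == value

required = ("dept", "beijing-info")  # hypothetical policy condition
print(authorize({"dept": "beijing-info", "env": "prod"}, required))  # True
print(authorize({"env": "prod"}, required))                          # False
```

Give each team's sub-accounts a policy requiring that team's tag, and one team's accounts simply cannot operate on another team's resources.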
