
Saving Cost by Resource Scheduling: Part 6 of About Distributed Systems

Divide and conquer solves the problem of data being too big to store and to compute on. This is also the core problem that a basic distributed system has to solve.

Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.

As mentioned in the previous two blogs on distributed storage, it is not enough to store the data; we also need to store it well. Storing it well is mainly about saving costs.

A massive amount of data brings a massive cost.

Similarly, it is not enough to compute fast; we also need to squeeze as much computing performance as possible out of the machines to save costs.

Unlike distributed storage engines, which focus on how data is stored, distributed computing frameworks put more effort into optimizing resource scheduling. This is because there is limited room to optimize the computing logic itself, and much of that work belongs to the application layer (of course, some general optimizations are still possible, and we will discuss them later). This led to the birth of a series of so-called resource managers. Typical examples such as Google's Borg and Kubernetes (K8s), as well as YARN and Mesos under the Apache banner, all fall into this category.


In the past, each company ran dedicated servers and clusters for each of its businesses. This is easy to understand: every business wants to operate independently, without interference from other programs and without security risks.

However, business load changes dynamically. There are regular peaks and troughs, as well as irregular business growth and decline, and both lead to frequent scaling out and in. To reduce the impact of scaling on the business, a margin of resources is generally kept in reserve.

The so-called margin is, to put it another way, waste.

Therefore, the main problem that the various resource managers set out to solve is how to use these wasted resources flexibly and promptly, and to reduce the maintenance cost of scaling along the way.


The most intuitive fix is to mix workloads. Machines are no longer exclusive to one business but shared by everyone. When Jack's business is not using its share, Harry can use it, and vice versa. In this way, overall utilization goes up.

This is called multi-tenancy.

How do multiple tenants share resources, then? And if everyone is submitting tasks, how should resources be scheduled?

The simplest way is first-in, first-out queuing, the so-called FIFO. Only after earlier tasks finish and release their resources can those resources be allocated to the tasks behind them.
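To make the FIFO idea concrete, here is a minimal simulation sketch. It is not taken from any real scheduler: the `Task` fields, the tick-based clock, and the single "cores" resource are all simplifying assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cores: int      # resources the task holds while it runs (hypothetical unit)
    duration: int   # number of ticks the task runs for


def run_fifo(tasks, total_cores):
    """Simulate strict FIFO scheduling and return {task name: start tick}."""
    if any(t.cores > total_cores for t in tasks):
        raise ValueError("a task asks for more cores than the cluster has")
    pending = deque(tasks)
    running = []            # list of (finish_tick, task)
    free = total_cores
    start_times = {}
    tick = 0
    while pending or running:
        # Release the resources of every task that has finished by now.
        for finish, task in [r for r in running if r[0] <= tick]:
            free += task.cores
            running.remove((finish, task))
        # Strict FIFO: only the head of the queue may start, nobody jumps ahead.
        while pending and pending[0].cores <= free:
            task = pending.popleft()
            free -= task.cores
            start_times[task.name] = tick
            running.append((tick + task.duration, task))
        tick += 1
    return start_times


starts = run_fifo(
    [Task("a", 8, 3), Task("b", 8, 1), Task("c", 1, 1)], total_cores=10
)
print(starts)  # {'a': 0, 'b': 3, 'c': 3}
```

Task "c" needs a single core and two cores are idle from the start, yet it still waits until tick 3 because "b" is ahead of it in the queue. That is the price of strict FIFO.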

Obviously, the benefit of multi-tenancy is higher overall resource utilization, which is good for the whole. The disadvantage is that it sacrifices the individual: any one tenant can be held up by the others, so sooner or later every tenant is affected.

So how can the resource needs of individuals be guaranteed?


To guarantee the needs of individuals, there is only one way: isolation.

Therefore, various resource scheduling systems introduced concepts such as pools and queues to partition resources logically. You can set a quota of computing resources for each pool and then allow only a certain business to use that pool. Nested, multi-level pools also make isolation within a business possible.

But after isolation, resources can no longer be shared, and overall utilization drops.

Therefore, isolation cannot be hard; it can only be soft. The quota of each pool is balanced dynamically, and pools can borrow resources from each other and return them.
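A rough sketch of soft isolation follows. The class name `SoftQuotaPools`, its methods, and the abstract "units" of resource are invented for this illustration and do not correspond to the API of YARN or any other framework. The only rule modeled is that a pool may go beyond its quota as long as the cluster still has idle capacity.

```python
class SoftQuotaPools:
    """Pools with soft quotas: idle capacity may be borrowed by any pool."""

    def __init__(self, quotas):
        self.quotas = dict(quotas)                 # guaranteed share per pool
        self.usage = {name: 0 for name in quotas}  # units currently in use
        self.capacity = sum(quotas.values())       # assume quotas cover the cluster

    def idle(self):
        return self.capacity - sum(self.usage.values())

    def acquire(self, pool, amount):
        """Grant as much of `amount` as is idle, even beyond the pool's quota."""
        granted = min(amount, self.idle())
        self.usage[pool] += granted
        return granted

    def release(self, pool, amount):
        """Return resources to the cluster, never more than the pool holds."""
        self.usage[pool] -= min(amount, self.usage[pool])

    def borrowed(self, pool):
        """Units the pool holds on top of its own quota."""
        return max(0, self.usage[pool] - self.quotas[pool])


pools = SoftQuotaPools({"jack": 60, "harry": 40})
print(pools.acquire("harry", 70))  # 70: Harry borrows 30 idle units from Jack's share
print(pools.borrowed("harry"))     # 30
print(pools.acquire("jack", 60))   # 30: Jack cannot get his full quota right now
```

Utilization is high, but Jack is now stuck at half of his guaranteed quota until Harry releases something.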

But what if a borrowed share is not returned for a long time? For example, if the resources were borrowed by a program that runs for 3 days, should the lender wait for 3 days?

This is where preemption comes into play: "Excuse me, I can't wait that long. Please return the part that exceeds your quota immediately."
To make preemption less harsh, and to avoid killing tasks on a large scale, preemption can be configured to kick in only beyond a certain percentage and/or after a certain period of time.
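The two guards can be sketched as a single decision function. This is only one possible reading of the rule: here the "percentage" is taken to be how far a pool may overshoot its quota before it becomes a preemption target, the "period" is how long the lender has been waiting, and both conditions must hold. The function name, parameters, and default values are hypothetical.

```python
def units_to_preempt(quota, usage, starved_for, tolerance=0.1, timeout=60.0):
    """Return how many units to reclaim from a pool that holds borrowed resources.

    quota       -- the borrower's guaranteed share
    usage       -- what the borrower currently holds
    starved_for -- seconds the lender has been waiting for its own quota
    tolerance   -- overshoot (as a fraction of quota) that is always tolerated
    timeout     -- grace period before preemption may start
    """
    excess = usage - quota
    if excess <= quota * tolerance:
        return 0        # within quota, or only a small overshoot: leave it alone
    if starved_for < timeout:
        return 0        # give the borrower a chance to finish and return on its own
    return excess       # reclaim only the part above the quota


print(units_to_preempt(quota=40, usage=70, starved_for=10))   # 0: still in the grace period
print(units_to_preempt(quota=40, usage=70, starved_for=120))  # 30: take back the borrowed part
print(units_to_preempt(quota=40, usage=43, starved_for=120))  # 0: overshoot is within tolerance
```

A real scheduler would also have to choose which containers or tasks to kill to free those units, which is left out here.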

Preemption acts after the fact. You can also act beforehand.

That is, set a hard resource limit for the pool: no matter how idle the others are, the pool can only borrow up to that limit.
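In code, a hard cap is just one more bound applied when a request is granted. The sketch below assumes a hypothetical `grant` helper and abstract resource units; it is not the configuration format or API of any particular scheduler.

```python
def grant(request, usage, hard_cap, cluster_idle):
    """Units granted to a pool, bounded by its hard cap and by what is idle."""
    headroom = max(0, hard_cap - usage)   # room left below the pool's hard cap
    return min(request, headroom, cluster_idle)


# A pool with quota 40 and hard cap 60 that already uses 40 units:
print(grant(request=50, usage=40, hard_cap=60, cluster_idle=100))  # 20
```

Even though 100 units are idle, the pool can borrow only 20, so the amount that might later have to be preempted is limited in advance.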

Put together, these ideas gave rise to schedulers such as the Capacity Scheduler and the Fair Scheduler. After learning from each other and improving, they have gradually converged.

For example, the Fair Scheduler also supports weights, so that fairness becomes relative, which is the only kind of fairness that is feasible in practice.
For example, the fairness policy has been improved from prioritizing tasks with low memory usage to DRF (Dominant Resource Fairness), which considers both memory and CPU.
Another example: Google's Borg explicitly supports priorities in order to support co-locating streaming and batch workloads.
Yet another example: different allocation policies are used on weekdays and weekends, so that the staggered peaks of different businesses are exploited to use resources effectively.

All of them make trade-offs between improving overall utilization and guaranteeing individual quotas.
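As an illustration of the DRF idea mentioned above, here is a simplified sketch. A user's dominant share is the largest fraction of any one resource that the user holds, and the scheduler repeatedly gives the next task to the user with the smallest dominant share. The function name and data layout are assumptions; the sketch also assumes every task demands a positive amount of at least one resource, and it drops a user once their next task no longer fits, which is a simplification.

```python
def drf(capacity, demands):
    """Dominant Resource Fairness by progressive filling.

    capacity -- {resource: total amount in the cluster}
    demands  -- {user: {resource: amount needed by ONE task}}
    Returns {user: number of tasks launched}.
    """
    used = {r: 0 for r in capacity}
    alloc = {u: {r: 0 for r in capacity} for u in demands}
    tasks = {u: 0 for u in demands}
    active = set(demands)

    def dominant_share(user):
        return max(alloc[user][r] / capacity[r] for r in capacity)

    while active:
        # Serve the user whose dominant share is currently the smallest
        # (ties are broken by name so the result is deterministic).
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
                alloc[user][r] += need[r]
            tasks[user] += 1
        else:
            active.discard(user)   # this user's next task no longer fits
    return tasks


result = drf(
    capacity={"cpu": 9, "mem": 18},
    demands={"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
)
print(result)  # {'A': 3, 'B': 2}
```

User A is memory-bound and user B is CPU-bound. Both end up with a dominant share of two thirds (12 of 18 GB for A, 6 of 9 CPUs for B), which neither a memory-only nor a CPU-only policy would achieve.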


Note that the solutions above are all just ideas, deliberately kept apart from the architecture and internals of any specific framework. Those are implementation details and will be mentioned later when necessary. What matters more than the implementation is the process of forming a design and solving the problem.

Everything so far has been technical. So how should all these mechanisms and parameters actually be set?

Take quotas as an example. With so many businesses, who needs more and who needs less?

This is a problem of resource allocation. Just as with resource allocation in real life, there need to be convincing rules, and then either all the businesses negotiate and reach an agreement (which is usually hard) or the boss decides.

If you are on an infrastructure team such as a data platform, remember that you are only an administrator of the resources: you hold the responsibility of managing them, not the right to distribute them.

What you have to do is explain the rules, use technical means to ensure overall utilization, and provide indicators that support decision-making. Don't overstep and put yourself in the line of fire.

Below is a brief list of indicators worth tracking (incomplete, for reference only):
Trend of overall computing resource utilization on the platform
Trend of the number of running and pending tasks on the platform as a whole
Total computing resources used by each business/queue, including CPU and memory, or expressed as a single combined compute unit as Alibaba Cloud does (better still, converted into money, for example RMB, which has more impact)
Distribution of the computing resource utilization ratio of each business/queue
Trend of the number of running and pending tasks in each business/queue
Amount and frequency of resources that each business/queue preempted from others or had preempted from it


Here are two past examples from our own operations that illustrate how indicators can guide the allocation of resources.

The following figure shows the distribution of the overall utilization rate of each queue over a certain time interval.


The figure shows that most queues are below 50% usage most of the time, which indicates that the quotas at that time were set inappropriately and that more of the idle resources should be allocated to the busier queues.

The following figure shows the computing resource usage of some queues in detail. The value plotted is actual resource usage / resource quota.


The first queue is constantly short of resources, so it ended up using more than 100% of its quota through preemption, while the last queue used only a little over half of its quota on average. It is precisely because queues like the last one do not use up their resources that the first queue is able to take the extra.

However, preemption is a lagging process and is constrained by the hard cap, so overall utilization stays low. The solution is therefore to reduce the quota of the last queue and increase the quota of the first queue.
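The indicator used here, actual resource usage / resource quota, is easy to compute from monitoring samples. The sketch below uses made-up queue names and numbers, and the two thresholds for flagging queues are arbitrary assumptions that would need to be agreed on with the businesses.

```python
def usage_ratios(samples, quotas):
    """Average of (actual usage / quota) per queue over the sampled period."""
    return {
        queue: (sum(values) / len(values)) / quotas[queue]
        for queue, values in samples.items()
    }


def flag_queues(ratios, low=0.6, high=1.0):
    """Split queues into those above `high` and those below `low`."""
    short = sorted(q for q, r in ratios.items() if r > high)
    surplus = sorted(q for q, r in ratios.items() if r < low)
    return short, surplus


# Hypothetical queues, each with a quota of 100 units and three usage samples.
quotas = {"etl": 100, "adhoc": 100, "report": 100}
samples = {"etl": [130, 120, 140], "adhoc": [80, 90, 70], "report": [50, 60, 55]}

ratios = usage_ratios(samples, quotas)
short, surplus = flag_queues(ratios)
print(short, surplus)  # ['etl'] ['report']
```

The output only points at the direction of the adjustment (move quota from "report" to "etl"). How much to move is not something the code should decide.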

In summary, the first graph tells us that resources are allocated unreasonably: in general, some queues do not have enough while others have more than enough. The second graph then tells us which queues are short of resources and which have a surplus, so we know in which direction to adjust the quotas.

What remains is for the businesses to negotiate, or for the boss to decide, how large the adjustment should be, and then to implement it.



Distributed computing frameworks focus on optimizing resource scheduling to save costs
The way to improve overall resource utilization is to co-locate workloads and use each other's off-peak resources
However, co-located workloads affect each other, so resource isolation is required
Isolation restricts resource borrowing, so isolation has to be soft
Borrowing needs a time limit, so preemption must be supported
Borrowing needs a capacity limit, so a hard cap must be set
There is no absolute fairness; weights and DRF are both compromises and improvements
The platform team is an administrator without the right to allocate resources, but it should provide sufficient indicators to assist decision-making


Through the recent blogs, you should now have a basic understanding of distributed storage engines and distributed computing frameworks.

You should understand why distributed systems exist, how they are designed, and how to use them better to save costs.

It all looks beautiful, without a problem.

However, there is no free lunch: with benefits come problems. Running a distributed system is far from being as effortless as sitting back and calling an API.

In the following blogs, we will take a look at a series of problems introduced by the distributed system itself and how to solve them.

This is a carefully conceived series of 20-30 articles. I hope to give everyone a basic grasp of the core of distributed systems in a storytelling way. Stay tuned for the next one!
