Today, many powerful Internet applications run in large-scale data centers, but how much do you actually know about these data centers? Technical documents about data centers often use terms such as "large scale," or even vague ones like "a sea of requests." Beyond reading technical articles, it can be hard to learn more about how data centers actually operate.
How well does each machine run in a data center? What kind of applications run on these machines? What are the characteristics of these applications? While experienced professionals and senior experts in this field may be able to provide clear answers, a typical technology practitioner or enterprise researcher may not be able to do so.
In 2015, we put forth a plan to deploy latency-insensitive offline batch computing tasks and latency-sensitive online services on the same machines in Alibaba data centers, so that idle resources not being used by online services can be used by offline tasks to improve the overall machine utilization rate. After more than three years of experimentation, architecture adjustment, and resource isolation and optimization, the plan has been put into large-scale production. By using this co-location technology, we raised the average cluster resource utilization rate from 10% to 45%. In addition, with a variety of optimization methods, we can run more tasks in the same data centers. For example, we reduced the resource consumption cost of every 10,000 transactions during Double 11 events by 17%.
So, what exactly does a computer cluster look like after receiving these optimizations? How well does co-location technology perform? In addition to articles, directly publishing data can help bridge the knowledge gap between many of us and academic researchers/industrial experts. We released this dataset to give interested students and researchers a more thorough understanding of large-scale data centers from the data perspective. This dataset contains details about servers running tasks in a production cluster. It provides insights such as how we use co-location technology to increase the resource utilization rate to 45%, exactly how many tasks we run every day, and the characteristics of our business resource requirements. How you use this dataset depends completely on your needs.
We have just released Alibaba Cluster Data V2018, which contains six files (270+ GB uncompressed; 50 GB compressed) covering eight days of runtime information for 4,000 servers and their corresponding online application containers and offline computing tasks. You can find detailed information on GitHub.
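As a rough illustration of the kind of question this data lets you ask, the sketch below computes a per-machine and cluster-wide average CPU utilization from a handful of usage samples. The record layout (`machine_id`, `timestamp`, `cpu_util_percent`) and the sample values are hypothetical and invented for this example; the actual file names and columns are given in the data format description that accompanies the dataset.

```python
def average_utilization(samples):
    """Return (per-machine mean, cluster mean) CPU utilization.

    `samples` is an iterable of (machine_id, timestamp, cpu_util_percent)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = {}
    for machine_id, _timestamp, cpu in samples:
        running_sum, count = totals.get(machine_id, (0.0, 0))
        totals[machine_id] = (running_sum + cpu, count + 1)

    # Average each machine first, so that machines with more samples
    # do not dominate the cluster-wide figure.
    machine_means = {m: s / n for m, (s, n) in totals.items()}
    cluster_mean = sum(machine_means.values()) / len(machine_means)
    return machine_means, cluster_mean


# Made-up samples, not taken from the dataset.
example_samples = [
    ("m_1", 0, 40),
    ("m_1", 10, 50),
    ("m_2", 0, 30),
    ("m_2", 10, 60),
]
machine_means, cluster_mean = average_utilization(example_samples)
```

Averaging per machine before averaging across the cluster is one design choice among several; a time-weighted or capacity-weighted mean may be more appropriate depending on the question being asked.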
With this copy of data, you can do the following:
Despite the preceding description, you may still be wondering what you can do with this data if you have no background in similar data. Let us take a look at several simple examples:
Researchers can also use this dataset to carry out more in-depth analyses.
In 2017, we published our first wave of data (Alibaba Cluster Data V2017), which contributed to many excellent academic papers. The following are examples of academic papers that reference Alibaba Cluster Data V2017, some of which appeared at leading venues such as the OSDI symposium. We look forward to seeing what you can achieve with this data!
"LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation, Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang, Purdue University. OSDI'18" (Best paper award!)
"Imbalance in the Cloud: an Analysis on Alibaba Cluster Trace, Chengzhi Lu et al. BIGDATA 2017"
"Characterizing Co-located Datacenter Workloads: An Alibaba Case Study, Yue Cheng, Zheng Chai, Ali Anwar. APSys2018"
"The Elasticity and Plasticity in Semi-Containerized Co-locating Cloud Workload: a View from Alibaba Trace, Qixiao Liu and Zhibin Yu. SoCC2018"
This section shows the two most distinct differences between V2018 and V2017.
We added DAG information for offline tasks. This is reportedly the largest DAG dataset from an actual production environment.
A DAG (Directed Acyclic Graph) is often used for orchestrating offline computing tasks, such as common tasks in MapReduce, Hadoop, Spark, and Flink. It captures the concurrency, dependencies, and other relationships between tasks. The following is an example DAG.
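To make the idea of dependencies and concurrency concrete, here is a minimal sketch that groups the tasks of a small job into "waves": every task in a wave has all of its dependencies satisfied, so the tasks within one wave could run concurrently. The task names and the `schedule_waves` helper are hypothetical and are not taken from the dataset or from any Alibaba scheduler.

```python
def schedule_waves(deps):
    """Group tasks into waves that respect the dependency order.

    `deps` maps each task name to the set of tasks it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    # Include tasks that appear only as dependencies of other tasks.
    tasks = set(deps)
    for parents in deps.values():
        tasks.update(parents)
    remaining = {t: set(deps.get(t, ())) for t in tasks}

    waves = []
    while remaining:
        # A task is ready once none of its dependencies are outstanding.
        ready = sorted(t for t, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("cycle detected: the graph is not a DAG")
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for parents in remaining.values():
            parents.difference_update(ready)
    return waves


# Hypothetical job: two map tasks feed a reduce task, followed by a join.
example_job = {
    "map_1": set(),
    "map_2": set(),
    "reduce_3": {"map_1", "map_2"},
    "join_4": {"reduce_3", "map_1"},
}
waves = schedule_waves(example_job)
```

For this example job, the two map tasks form the first wave, the reduce task the second, and the join task the third.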
V2017 includes data from around 1,300 servers over about 24 hours, while V2018 includes data from 4,000 servers over 8 days.
Visit http://alibabadeveloper.mikecrm.com/BdJtacN and complete the questionnaire to obtain the download link to the data and data format description.