By Sun Wenjie, Director of O&M Department of LAIX, and Yuan Yi, Technical Expert of Alibaba Cloud Intelligence
Contributed by Alibaba Cloud Storage Team
This article contains excerpts from the LAIX best-practices speech entitled "Unified Monitoring and Operation Practice Based on SLS Ten-million Level Online Education Platform," given at the Digital Intelligence Innovation Event - Intelligent O&M Special Session (Shanghai Station).
Affected by the COVID-19 pandemic, the online education market has grown rapidly under the slogan "suspension of classes, non-stop learning," reaching a market size of 485.8 billion CNY. After the rapid expansion of the past few years, the market has become relatively mature, and users now place different demands on the various types of online education institutions. Traffic alone can therefore no longer be exchanged for loyal users; the core competitiveness of the education industry is high-quality content and services. Enterprises can only achieve long-term development through high-quality courses, personalized plans based on the learning habits of customers, good product experience, stability, and high business operation efficiency.
After the recent adjustment of the industry, online education enterprises will gradually focus on content construction rather than on increasing the number of courses. However, in the overall environment the syllabuses are the same while the teaching methods diverge widely: the courses differ, but the content is average. Most enterprises cannot rely solely on content to stand out.
LAIX is different. In this era of artificial intelligence, LAIX relies on distinctive intelligent teaching courses and innovative technologies, such as artificial intelligence, to provide users with personalized courses and help more users improve their English. As of March 31, 2021, the total number of registered LAIX users exceeded 200 million. Its large-scale "database of Chinese people speaking English" can evaluate users’ pronunciation according to their real-world situations. While a user is learning and pronouncing words, the intelligent mouth recognition and correction system dynamically captures the key points of the user's mouth. The system then compares this data to analyze the user's pronunciation problems. This way, it can offer specific instructions to solve oral expression problems and help students fundamentally improve their oral proficiency.
LAIX’s business has grown rapidly, and the number of users has grown from a few million initially to over 200 million today. The swings in data traffic between business peaks and troughs, the complexity of the business, and the difficulty of analysis have posed huge challenges to O&M work. In the overall Internet environment, user experience is the core of competitiveness. According to statistics, each additional second of latency leads to an average loss of 7% of users. LAIX is a company without a separate O&M department, so the O&M system of its basic platform is mainly operated by the R&D engineers of the Cloud-Infra Team. The core demands of the team include SLA, performance monitoring, alerting, providing relevant data for troubleshooting, and the technical value operation of Cloud-Infra, such as utilization, cost-saving, and the business relationship network.
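To make the SLA and performance-monitoring demands concrete, here is a minimal sketch of how availability and slow-request ratios might be derived from request logs. The `RequestLog` fields, the 5xx failure rule, the one-second threshold, and the 99.9% target are all assumptions made for illustration; the source does not describe how LAIX actually defines or computes its SLA.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    """Hypothetical shape of one parsed access-log record."""
    status: int        # HTTP status code
    latency_ms: float  # end-to-end request latency in milliseconds


def availability_pct(logs):
    """Percentage of requests that did not fail with a 5xx status (assumed rule)."""
    if not logs:
        raise ValueError("no log records to evaluate")
    ok = sum(1 for r in logs if r.status < 500)
    return 100 * ok / len(logs)


def slow_request_pct(logs, threshold_ms=1000.0):
    """Percentage of requests whose latency reached the threshold."""
    if not logs:
        raise ValueError("no log records to evaluate")
    slow = sum(1 for r in logs if r.latency_ms >= threshold_ms)
    return 100 * slow / len(logs)


def sla_breached(logs, target_pct=99.9):
    """True when measured availability falls below the assumed target."""
    return availability_pct(logs) < target_pct


# Hypothetical sample: 200 requests, one server error, two slow responses.
logs = [RequestLog(200, 120.0)] * 197 + [
    RequestLog(200, 1500.0),
    RequestLog(200, 2300.0),
    RequestLog(503, 80.0),
]

print(availability_pct(logs), slow_request_pct(logs), sla_breached(logs))
```

In practice these ratios would be evaluated over a time window and fed into alerting, but the windowing and alert routing are outside what this sketch covers.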
The requirements that these core demands place on an intelligent O&M platform are listed below:
LAIX built an intelligent O&M platform. The platform needs to process time series data as well as core business availability data, which must be calculated and analyzed from various types of logs. Therefore, it requires both a Logs scheme and a Metrics scheme. There are different open-source and commercial schemes for these two types of data, such as ES, Loki, SLS, Prometheus, OpenTSDB, and InfluxDB. Alibaba Cloud Log Service (SLS) was selected as the final log scheme, and Prometheus + SLS was selected as the time series scheme. The main reasons are listed below:
At the same time, Alibaba Cloud Log Service (SLS) has developed a set of mechanisms for dynamically discovering IaaS and PaaS resources in cloud scenarios to maximize automation. It can add newly purchased and newly created resources to monitoring and collection in real time, which avoids most manual operations.
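The idea behind dynamic discovery can be sketched as a reconciliation step: compare the resources that currently exist in the cloud account with the resources already being collected, then act on the difference. The source does not describe how SLS implements this, so the function below is only an illustration, and the resource IDs are hypothetical.

```python
def plan_collection_changes(discovered, monitored):
    """Compare the discovered inventory with the current collection targets.

    Returns the resource IDs that should start being collected and the ones
    whose collection can stop because the resource no longer exists.
    """
    discovered_ids = set(discovered)
    monitored_ids = set(monitored)
    return {
        "add": sorted(discovered_ids - monitored_ids),
        "remove": sorted(monitored_ids - discovered_ids),
    }


# Hypothetical inventory snapshot and current collection targets.
discovered = ["ecs-001", "ecs-002", "rds-001"]
monitored = ["ecs-001", "slb-009"]

changes = plan_collection_changes(discovered, monitored)
print(changes)
```

Running this kind of comparison on every inventory change, rather than on a manual schedule, is what removes the need for operators to register new resources by hand.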
SLS has also been specially customized to meet the requirements of LAIX in each data scenario:
Currently, this intelligent O&M platform supports almost all of the enterprise's core O&M work. It has been operating stably since its launch and can easily cope with sudden increases in data volume during various activities. The overall business value is mainly reflected in:
Team | Percentage of utilization under 30% | p80 | p90 | p99 |
cloud-infra | 92.89% | 44.10% | 72.35% | 100% |
Interpretation: The resource data for all apps under the team comes from the catalog. | 1. Multiple metrics are merged and the maximum value is shown. 2. Data with utilization in the 0-30% range accounts for 92.89% of all data, which indicates that the vast majority of this team's resources run at 0-30% utilization. They are relatively idle, and their utilization could be improved. | p80 utilization falls in the range 0-44.10%, which means only 20% of the data can exceed 44% utilization. | p90 utilization falls in the range 0-72.35%, which means only 10% of the data can exceed 72% utilization. Peak business already carries a relatively high load. | p99 utilization falls in the range 0-100%, which shows that utilization during peak hours has reached 100% and the resources are relatively busy. |
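The columns of the table above can be reproduced with a short calculation. The sketch below assumes each resource reports several utilization metrics (the `cpu` and `mem` names are hypothetical), merges them by taking the maximum as the interpretation describes, and then computes the share of samples at or below 30% together with nearest-rank percentiles. The sample values are invented for illustration and are not LAIX data; the source also does not say which percentile method the platform uses.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def utilization_summary(resources, idle_threshold=30.0):
    """Summarize per-resource utilization (values are percentages)."""
    # Merge multiple metrics per resource and keep the maximum value.
    merged = sorted(max(metrics.values()) for metrics in resources)
    idle = sum(1 for v in merged if v <= idle_threshold)
    return {
        "under_threshold_pct": round(100 * idle / len(merged), 2),
        "p80": percentile(merged, 80),
        "p90": percentile(merged, 90),
        "p99": percentile(merged, 99),
    }


# Hypothetical utilization samples for ten resources.
resources = [
    {"cpu": 5, "mem": 3}, {"cpu": 8, "mem": 6}, {"cpu": 7, "mem": 10},
    {"cpu": 12, "mem": 9}, {"cpu": 15, "mem": 11}, {"cpu": 14, "mem": 18},
    {"cpu": 20, "mem": 16}, {"cpu": 22, "mem": 19}, {"cpu": 25, "mem": 21},
    {"cpu": 95, "mem": 40},
]

summary = utilization_summary(resources)
print(summary)
```

A large share under the threshold combined with a high p99, as in the cloud-infra row, points to resources that are idle most of the time but saturated at peak.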
In the cloud-native era, digitalization is driving business innovation in various industries. Enterprises can only stand out in the overall environment by improving the user experience, accelerating innovation, updating infrastructure and architecture, and making good use of diversified data. The intelligent O&M platform launched by Alibaba Cloud can reduce engineers' workload and free O&M engineers from mechanical, repetitive tasks. The platform takes over the dirty work, shortens failure time, and lets O&M personnel devote more creativity to digital innovation and business innovation, which makes enterprises more competitive.