The Intelligent O&M Platform Helps LAIX Improve Its Core Competitiveness through AI and Big Data

This article summarizes the LAIX Best Practices Speech based on SLS.

By Sun Wenjie, Director of O&M Department of LAIX, and Yuan Yi, Technical Expert of Alibaba Cloud Intelligence

Contributed by Alibaba Cloud Storage Team

This article contains excerpts from the LAIX Best Practices Speech entitled "Unified Monitoring and Operation Practice Based on SLS Ten-million Level Online Education Platform," given at the Digital Intelligence Innovation Event - Intelligent O&M Special Session (Shanghai Station).

High-Quality Content and Customized Services Can Enhance the Core Competitiveness of Enterprises

Affected by the COVID-19 pandemic, the online education market has grown rapidly under the slogan "classes suspended, learning continues," reaching a market size of 485.8 billion CNY. After several years of rapid expansion, the market has become relatively mature, and users now make different demands of different types of online education institutions. Traffic alone can therefore no longer be exchanged for loyal users. In the education industry, core competitiveness comes from high-quality content and services. Enterprises can achieve long-term development only through high-quality courses, personalized plans based on customers' learning habits, a good product experience, stability, and efficient business operations.

Combined with Artificial Intelligence (AI), Characteristic Teaching Is Unique

After the recent adjustment of the industry, online education enterprises will gradually focus on content construction rather than on adding more courses. However, syllabuses across the industry are largely the same, while teaching methods diverge widely. Courses differ from one another, but their content is average, so most enterprises cannot rely on content alone to stand out.

LAIX is different. In this era of artificial intelligence, LAIX relies on distinctive intelligent teaching courses and innovative technologies, such as artificial intelligence, to provide users with personalized teaching courses and help more users improve their English. As of March 31, 2021, LAIX had more than 200 million registered users. Its large-scale "database of Chinese people speaking English" allows it to evaluate users’ pronunciation according to their real-world situations. While users learn and practice pronunciation, the intelligent mouth recognition and correction system dynamically captures the key points of their mouths. The system then analyzes this data to identify pronunciation problems and offers specific instructions to solve them, helping students fundamentally improve their oral proficiency.

Product Experience Is the Key Point, and Learning How to Improve System Stability Becomes a Difficulty

LAIX’s business has grown rapidly, from a few million users initially to over 200 million today. The swings in data traffic between peak and off-peak periods, the complexity of the business, and the difficulty of analysis pose huge challenges to O&M work. In the overall Internet environment, user experience is the core of competitiveness. According to statistics, every additional second of latency leads to an average loss of 7% of users. LAIX does not have a separate O&M department, so the O&M system of its basic platform is mainly operated by the R&D engineers of the Cloud-Infra team. The core demands of the team include SLA, performance monitoring, alerting, providing relevant data for locating problems, and operating Cloud-Infra for technical value, such as utilization, cost savings, and the business relationship network.

These core demands translate into the following requirements for an intelligent O&M platform:

  1. Collect and monitor various disparate data sources, including machine metrics and utilization on K8s and ECS, call logs related to Istio, metrics related to user-created middleware, metrics provided by cloud services, and Trace data of businesses. It also includes the real-time collection of various cost data.
  2. Dynamically discover and collect various resources. Data related to departments, such as organizational relationships, also needs to be updated in real time so that the most accurate metrics and affiliation relationships are always available.
  3. Store and analyze data at large scale. Due to the large scale of LAIX’s business, the cloud resources in use and the business itself generate a huge amount of data, tens of terabytes per day. The solution must support real-time analysis and presentation at this scale.
  4. Keep the monitoring platform itself stable. Because the platform is responsible for detecting stability issues, it needs to eliminate single points of failure in each part and be able to recover quickly from exceptions.

The Overall Intelligent O&M Solution Connects Data Collection to Computing

LAIX built an intelligent O&M platform that must process both time series data and core business availability data, which has to be calculated and analyzed from various types of logs. The platform therefore requires both a Logs scheme and a Metrics scheme. Various community and commercial schemes exist for these two types of data, such as ES, Loki, SLS, Prometheus, OpenTSDB, and InfluxDB. Alibaba Cloud Log Service (SLS) was selected as the log scheme, and Prometheus + SLS was selected as the time series scheme. The main reasons are listed below:

  1. SLS can store and analyze all kinds of data in a unified way and associate Metrics and Logs data on SLS, which is not available on other platforms. The SLS platform can also adapt to a very large data scale.
  2. Compared with ES, the SLS platform is a maintenance-free service with better performance, which avoids the burden of operating a highly reliable ES cluster.
  3. Prometheus is the main time series scheme. Prometheus has a complete ecosystem, and PromQL is simple to use. The time series library of SLS can serve as highly reliable remote storage for Prometheus, which solves the reliability issue of Prometheus.
  4. SLS has the function of data processing, which allows Join analysis and processing with external data sources. Therefore, it can process various complex logs and add catalog-related information to logs in a better way.

At the same time, Alibaba Cloud Log Service (SLS) has developed a set of mechanisms for the dynamic discovery of IaaS and PaaS resources in cloud scenarios to maximize automation. Newly purchased and created resources are added to monitoring and collection in real time, which avoids most manual operations.

SLS has also been customized to meet the requirements of LAIX in each data scenario:

1. Log

  • Logs of various businesses are directly collected to different log stores through SLS logtail.
  • Not all logs need to be stored and indexed for a long time, so we classify them. Logs that need to be audited are delivered to OSS for long-term storage. Logs for business troubleshooting are stored for only two weeks with a full-text index enabled. For AccessLog, only some fields are indexed, which saves a lot of index cost.
  • For NGINX access logs, which are used to calculate SLA and PXX metrics, SLS data processing joins the logs with Catalog data already stored in RDS, such as mapping rules, departments, and applications. The URLs in the NGINX access logs are thereby mapped to the corresponding departments, applications, and methods.
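
The URL mapping in the last bullet can be sketched in Python as follows. This is a minimal illustration of the idea, not the SLS data processing rules LAIX actually uses: the `CatalogRule` fields, the URL patterns, and the `enrich` function are all hypothetical, and the rules are hard-coded here instead of being loaded from RDS.

```python
import re
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CatalogRule:
    """One hypothetical mapping rule, as it might be stored in the Catalog."""
    pattern: str      # regular expression matched against the URL path
    department: str
    application: str
    method: str


# Illustrative rules only; real rules would come from the Catalog in RDS.
RULES = (
    CatalogRule(r"^/api/course/\d+/lessons$", "learning", "course-api", "list_lessons"),
    CatalogRule(r"^/api/course/", "learning", "course-api", "other"),
    CatalogRule(r"^/api/pay/", "commerce", "pay-api", "other"),
)


def enrich(log: dict, rules: Sequence[CatalogRule] = RULES) -> dict:
    """Return a copy of the log with department, application, and method added.

    The first matching rule wins, so more specific rules must come first.
    Unmatched URLs are labeled "unknown" rather than dropped.
    """
    path = log["url"].split("?", 1)[0]  # ignore the query string
    for rule in rules:
        if re.search(rule.pattern, path):
            return {**log, "department": rule.department,
                    "application": rule.application, "method": rule.method}
    return {**log, "department": "unknown",
            "application": "unknown", "method": "unknown"}
```

Once every access-log record carries these three fields, SLA and PXX metrics can be grouped by department or application without any change to the business code.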

2. Data Monitoring

  • Prometheus is selected as the monitoring solution. For LAIX scenarios, we have developed some Exporters to obtain Metrics from various cloud products and self-built components.
  • To make better use of Prometheus, we added a Sidecar that integrates with the internal CI/CD system, monitors changes to the Git repository, and reloads Prometheus dynamically when changes occur.
  • Various Recording Rules were configured to improve the query speed on Prometheus. These are managed by Git.
  • AlertManager alerting is directly connected to the internal alerting center, which provides advanced functions, such as message formatting and alert escalation.
  • We configured Prometheus Remote Write to write directly into the SLS time series library. This removes Prometheus as a single point of failure and enables correlation analysis with the Catalog.
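
The reload step of such a Sidecar could look roughly like the Python sketch below. It is not LAIX's implementation and makes several assumptions: the Git repository has already been synced to a local directory (the sync itself is not shown), configuration files use the `.yml` extension, Prometheus runs with `--web.enable-lifecycle` so that `POST /-/reload` is available, and the base URL `http://localhost:9090` is a placeholder. The function names are hypothetical.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Optional


def config_digest(config_dir: Path) -> str:
    """Hash every .yml file so that any change produces a new digest."""
    h = hashlib.sha256()
    for path in sorted(config_dir.rglob("*.yml")):
        h.update(str(path.relative_to(config_dir)).encode())
        h.update(path.read_bytes())
    return h.hexdigest()


def http_reload(base_url: str = "http://localhost:9090") -> None:
    """Ask Prometheus to reload its configuration.

    Works only when Prometheus is started with --web.enable-lifecycle.
    The base URL is an assumption for this sketch.
    """
    req = urllib.request.Request(f"{base_url}/-/reload", method="POST")
    urllib.request.urlopen(req, timeout=10).close()


def reload_if_changed(config_dir: Path, last_digest: Optional[str],
                      reload: Callable[[], None] = http_reload) -> str:
    """Trigger a reload when the files differ from the last seen digest.

    Passing last_digest=None (first run) always triggers a reload.
    """
    digest = config_digest(config_dir)
    if digest != last_digest:
        reload()
    return digest
```

A Sidecar would call `reload_if_changed` after each repository sync and keep the returned digest for the next round. Passing the reload action as a parameter keeps the change detection testable without a running Prometheus.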

3. Metrics Computing

  • Part of the computing of core indicators comes from the AccessLog of NGINX. Users can get the QPS, error rate, and latency (average and PXX) of each business from the portal, which is not intrusive to the business.
  • Metrics, such as resource utilization, middleware, and infrastructure, are derived from the time series library written by Prometheus. Based on Catalog, users can aggregate and calculate metrics of each department and business.
  • Because the calculated metrics are small in volume, they can easily be stored in MySQL and ES and shipped to OSS for backup.
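
As a rough illustration of how QPS, error rate, and latency figures can be derived from access logs alone, consider the Python sketch below. In the platform described here this computation runs as analysis on SLS, so the code only shows the arithmetic. The record fields (`application`, `status`, `latency_ms`), the choice to count HTTP 5xx responses as errors, and the nearest-rank percentile method are all assumptions.

```python
import math
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(logs, window_seconds):
    """Aggregate access-log records per application over one time window.

    Each record is a dict with hypothetical fields: application,
    status (HTTP status code), and latency_ms.
    """
    by_app = defaultdict(list)
    for log in logs:
        by_app[log["application"]].append(log)

    result = {}
    for app, records in by_app.items():
        latencies = sorted(r["latency_ms"] for r in records)
        # Assumption: only 5xx responses count as errors.
        errors = sum(1 for r in records if r["status"] >= 500)
        result[app] = {
            "qps": len(records) / window_seconds,
            "error_rate": errors / len(records),
            "avg_ms": sum(latencies) / len(latencies),
            "p99_ms": percentile(latencies, 99),
        }
    return result
```

Because the input is only the access log, the business services need no instrumentation, which is the non-intrusive property mentioned above.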

Build a Unified Intelligent O&M Platform and Change from a Cost Center to an Innovative Productivity Tool

Currently, this intelligent O&M platform supports almost all the core O&M work of the enterprise. It has been operating stably since its launch and easily copes with sudden increases in data volume during various activities. The overall business value is mainly reflected in:

  • Monitoring: The primary value of the platform is monitoring and alerting of all kinds, especially for SLA-related issues. Since the data has been linked to specific departments and business applications, the SLA of each department and application can be obtained easily, and improvements can be promoted uniformly across the enterprise.
  • Troubleshooting and Fault Isolation: Based on the access logs of Istio and the Catalog information, the call relationship of each application can be calculated. The business relationship network can therefore be generated in real time, and the quality of each relationship (edge) can be obtained. With the business relationships understood, you can locate the root cause and isolate the fault when problems occur.
  • FinOps: The most challenging issue for the Cloud-Infra department is cost, so cost optimization is also one of its core tasks. The main method is to calculate the resource utilization of each department and team, including the average utilization and the utilization at various PXX levels (shown in the table below), to judge the resource usage of each department and promote cost optimization in each department.
Team | Percentage of utilization under 30% | p80 | p90 | p99
cloud-infra | 92.89% | 44.10% | 72.35% | 100%

Interpretation: The resource data for all apps under the team comes from the Catalog.

  1. Multiple metrics are merged, and the maximum value is shown.
  2. Data with utilization of 0-30% accounts for 92.89% of all data, indicating that the vast majority of resources under this team run at 0-30% utilization. They are relatively idle, and the team can consider improving their utilization.
  3. The p80 utilization is 44.10%, which means only 20% of data may exceed 44% utilization. The p90 utilization is 72.35%, which means only 10% of data may exceed 72% utilization, so the business already carries a relatively high load at its peak. The p99 utilization is 100%, which shows that utilization reaches 100% during peak hours and the resources are relatively busy.
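
The columns of such a report can be computed along the lines of the Python sketch below. It is one plausible reading of the method, not the platform's actual query: each sample is assumed to hold several utilization metrics in percent (the `cpu` and `mem` keys in the usage example are hypothetical), the metrics are merged by taking their maximum, and percentiles use the nearest-rank method. Any input data is illustrative and will not reproduce the figures in the table.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def utilization_report(samples, idle_threshold=30.0):
    """Summarize utilization samples for one team.

    Each sample is a dict of metric name -> utilization in percent.
    Metrics are merged by taking the maximum value per sample.
    """
    merged = sorted(max(s.values()) for s in samples)
    idle = sum(1 for v in merged if v < idle_threshold)
    return {
        "under_threshold_pct": 100 * idle / len(merged),
        "p80": percentile(merged, 80),
        "p90": percentile(merged, 90),
        "p99": percentile(merged, 99),
    }
```

For example, `utilization_report([{"cpu": 12.0, "mem": 25.0}, {"cpu": 80.0, "mem": 40.0}])` merges the two samples to 25.0 and 80.0 before computing the share under 30% and the percentiles.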

Finally

In the cloud-native era, digitalization is driving business innovation in various industries. Enterprises can stand out only by improving the user experience, accelerating innovation, updating infrastructure and architecture, and making good use of diversified data. The intelligent O&M platform launched by Alibaba Cloud can reduce engineers' workload and free O&M engineers from mechanical, repetitive tasks. The platform takes on the dirty work and reduces downtime, allowing O&M personnel to devote more creativity to digital innovation and business innovation and giving enterprises better competitiveness.
