Construction Practice of Bigo Real Time Computing Platform

1. The development history of Bigo real-time computing platform

Today I mainly share with you the construction process of the Bigo real-time computing platform, some problems we solved during the construction process, and some optimizations and improvements we made. First enter the first part, the development history of the Bigo real-time computing platform.

Let me briefly introduce Bigo's business. It mainly has three major APPs, namely Live, Likee and Imo. Among them, Live provides live broadcast services for users around the world. Likee is an app for creating and sharing short videos, very similar to Kuaishou and Douyin. Imo is a worldwide free communication tool. These major products are all related to users, so our business should revolve around how to improve the conversion rate and retention rate of users. As the basic platform, the real-time computing platform mainly serves the above businesses, and the construction of the Bigo platform should also provide some end-to-end solutions around the above business scenarios.

The development of Bigo real-time computing can be roughly divided into three stages.

Before 2018, there were very few real-time jobs. We used Spark Streaming to do some real-time business scenarios.

From 2018 to 2019, with the rise of Flink, it is generally believed that Flink is the best real-time computing engine, and we started to use Flink for discrete development. Each business line builds a Flink for simple use.

Starting in 2019, we have unified all businesses using Flink on the Bigo real-time computing platform. After two years of construction, all real-time computing scenarios are currently running on the Bigo platform.

As shown in the figure below, this is the current state of the Bigo real-time computing platform. On the Data Source side, our data are user behavior logs, mainly from APP and client. There are also some user information stored in MySQL.

These information will go through the message queue and finally collected into our platform. The message queue mainly uses Kafka, and now it is gradually adopting Pulsar. The MySQL logs mainly enter the real-time computing platform through BDP. On the real-time computing platform, the bottom layer is also based on the commonly used Hadoop ecosystem for dynamic resource management. The upper engine layer has been unified to Flink, and we do some development and optimization on it. On this one-stop platform for development, operation and maintenance, and monitoring, we built a BigoFlow management platform internally. Users can develop, debug and monitor on BigoFlow. Finally, in terms of data storage, we also connected with Hive, ClickHouse, HBase and so on.

2. Features and improvements of the Bigo real-time computing platform

Next, let's take a look at the features of the Bigo computing platform and the improvements we have made. As a developing company, the focus of our platform construction is to make it as easy as possible for business personnel to use. Thereby promoting business development and expanding the scale. We hope to build a one-stop platform for development, operation and maintenance, and monitoring.

First of all, on BigoFlow, users can develop very conveniently. The features and improvements we are developing in this area include:

* Powerful SQL editor.
* Graphical topology adjustment and configuration.
*One-click multi-cluster deployment.
* The version is managed in a unified manner, as convergent as possible.

In addition, in terms of operation and maintenance, we have also made many improvements:

*Perfect savepoint management mechanism.
*Logs are automatically collected to ES, with built-in common error troubleshooting rules.
*The task history is saved for easy comparison and problem tracking.

The last part is monitoring. Our features are:

*Monitoring is added automatically, and users basically do not need to manually configure it.
*Automatically analyze resource usage and recommend reasonable resource allocation for users.

There are three main places where our metadata is stored. They are Kafka, Hive and ClickHouse respectively. At present, we can fully open up the metadata of all storage systems. This will greatly facilitate the user, while reducing the cost of use.

After the metadata of Kafka is opened, it can be imported once and used indefinitely without DDL.
Flink and Hive are also fully connected. When users use Hive tables, they can use them directly without DDL.

ClickHouse is also similar, and can automatically track Kafka topics.

In fact, what we provide today is not only a platform, but also an end-to-end solution in general scenarios. In the ETL scenario, our solutions include:

* Universal RBI fully automated access.
*Users do not need to develop any code.
* Data goes into hive.
* Automatically update meta.

In the area of monitoring, our features are:

* Data source is automatically switched.
*Monitoring rules remain unchanged.
*The results are automatically stored in prometheus.

The third scenario is the ABTest scenario. The traditional ABTest is done offline, and the results can only be produced after a day. So today we converted ABTest to real-time output, and greatly improved the efficiency of ABTest through the integration of streaming and batching.

Improvements to Flink are mainly reflected in the following aspects:

First, at the connector level, we have customized many connectors to connect with all the systems used by the company.

Second, at the level of data formatting, we have very complete support for the three formats of Json, Protobuf, and Baina. Users do not need to do the analysis by themselves, they can use it directly.

Third, all the company's data falls directly into Hive, which is ahead of the community in the use of Hive. Including streaming reading, EventTime support, dimension table partition filtering, Parquet complex type support, etc.

Fourth, we have also made some optimizations at the State level. Includes SSD support, and RocksDB optimizations.

3. Typical business scenarios of Bigo

The traditional point-to-store storage is through Kafka to Flume, then to Hive, and finally to ClickHouse. Of course, most of ClickHouse is imported from Hive, and some are written directly through Kafka.

This link is a very old link and it has the following issues:

First, it is unstable. Once flume is abnormal, data loss and duplication often occur.
Second, poor scalability. In the face of sudden traffic peaks, it is difficult to expand.
Third, the business logic is not easy to adjust.

So we have done a lot of work after building Flink. Replace the original flow from Flume to Hive. Today, all ETL passes through Kafka, and then through Flink. All RBI will enter Hive's offline data warehouse as a historical storage, so that data will not be lost. At the same time, because many jobs require real-time analysis, we use another link to directly enter the ClickHouse real-time data warehouse from Flink for analysis.

In this process, we have made some core transformations, which are divided into three major parts. First of all, in terms of user access, our transformation includes:

As simple as possible.
General RBI fully automatic.
The meta-information is connected without DDL.
In addition, in Flink itself, our transformation includes:

Parquet write optimization.
Concurrency adjustment.
Through SSD disks, jobs with large status are supported.
RocksDB optimization for better memory control.
Finally, in the area of data sink, we have done a lot of customized development, not only supporting Hive, but also docking with ClickHouse.

4. Efficiency improvement brought by Flink to business

The following mainly introduces some of the transformations we have made in the ABTest scenario. For example, after all the data falls to Hive, offline calculations are started, and after countless workflows, a large and wide table is finally produced. There may be many dimensions on the table, recording the results of grouping experiments. After the data analyst gets the results, he analyzes which experiments are better.

Although this structure is very simple, the process is too long, the result is late, and it is not easy to increase the dimension. The main problem is actually Spark. This job has countless workflows to execute, and one workflow cannot be scheduled until the other one is executed. And offline resources are not very well guaranteed. Our biggest problem before was that the results of ABTest from the previous day could not be output until the afternoon of the next day. Data analysts often reported that they could not work in the morning and could only start analysis when they were about to leave work in the afternoon.

So we started to use Flink's real-time computing capabilities to solve the problem of timeliness. Unlike Spark tasks that have to wait for the last result before outputting, Flink consumes directly from Kafka. Basically, the results can be obtained in the morning. But at that time, because it finally produced a result with many dimensions, possibly hundreds of dimensions, the State was very large at this time, and OOM was often encountered.

Therefore, we took a compromise in the first step of the transformation process. Instead of directly using Flink to join all the dimensions in one job, we split it into several jobs. Each job calculates a part of the dimension, and then uses HBase to make a join of these results, and then imports the result of the join into ClickHouse.

In the process of remodeling, we found a problem. Maybe the job needs to adjust the logic frequently. After the adjustment, you need to see if the result is correct. Then this requires a time window of 1 day. If you read historical data directly, Kafka will save the data for a long time. When reading historical data, you have to go to the disk to read it, which puts a lot of pressure on Kafka. If you don’t read the historical data, because only the zero point can be triggered, then the logic is changed today, and the results can’t be seen until a day later, which will lead to very slow debugging iterations.

As mentioned earlier, all our data is in Hive, which was still version 1.9 at that time, and we supported streaming data from Hive. Because these data are triggered by EventTime, we support the use of EventTime to trigger on Hive. For stream batch unification, Spark is not used here, because if Spark is used for job verification, two sets of logic need to be maintained.

We use stream-batch integration on Flink to do offline supplementary data or offline job verification. The real-time one is used for daily operations.

As I said just now, this is actually a compromise solution, because it depends on HBase and does not fully utilize the capabilities of Flink. So we carried out the second round of transformation to completely remove the dependence on HBase.

After the second round of iterations, we have been able to handle day-level window transactions for large tables today on Flink. This stream-batch unified solution has been launched. We directly use Flink to calculate the entire large and wide table. After the daily window is triggered, the results are directly written into ClickHouse, and the results can be produced in the early morning.

During the whole process, our optimization of Flink includes:

*State supports SSD drives.
* Streaming read Hive, support EventTime.
*Hive dimension table join, supports partition partition load.
The most complete ClickHouse Sinker.
*After the optimization, our hourly tasks are no longer delayed, and the completion time of the daily level is brought forward from the afternoon to before work, which greatly speeds up the iteration efficiency.

V. Summary and Outlook

Summarize the current status of real-time computing in Bigo. First, very close to the business. Secondly, it seamlessly connects with all the ecology used in the company, basically eliminating the need for users to do any development. In addition, the real-time data warehouse has taken shape. In the end, our scenes are not rich enough compared with big factories. For some typical real-time scenarios, because the business requirements are not so high, many businesses have not really switched to real-time scenarios.

Our development plan has two major blocks.

The first piece is to expand more business scenarios. Including real-time machine learning, advertising, risk control and real-time reporting. In these fields, it is necessary to promote the concept of real-time computing more and connect with business.

Another piece is on Flink itself, we have many scenarios to do internally. For example, it supports large Hive dimension table join, automatic resource configuration, CGroup isolation, etc. The above are some of the work we will do in the future.

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us