Application of Real-Time Compute for Apache Flink in Weibo

This article introduces the application of Realtime Compute for Apache Flink at Weibo.

In this blog, Cao Fuqiang, Senior System Engineer and Head of Data Computing at the Machine Learning R&D Center at Weibo, introduces the application of Realtime Compute for Apache Flink at Weibo:

1. An Introduction to Weibo

Weibo is China's leading social media platform, with 241 million daily active users and 550 million monthly active users. Mobile users account for more than 94% of the daily active users.

[Figure 1]

2. An Introduction to Data Computing Platforms

2.1 Overview

The following figure shows the architecture of the data computing platform.

  • The bottom layer is scheduling, based on Kubernetes and YARN, which deploys Flink and Storm for real-time data processing and SQL services for offline processing.
  • On top of these clusters, we built the Weibo AI platform to manage jobs, data, resources, and samples.
  • On the platform, we have built a number of services that support various business parties in a service-oriented way.
  1. The real-time compute services mainly include data synchronization, content deduplication, multimodal content understanding, real-time feature generation, real-time sample joining, and streaming model training, all of which are closely related to business. In addition, the platform provides Flink-based and Storm-based real-time compute as general, basic computing frameworks.
  2. For offline computing, HiveSQL and SparkSQL are used to build an SQL computing service that currently supports most of the business parties at Weibo.
  • Data is output through the data warehouse and feature engineering. Overall, we run nearly 1,000 real-time compute jobs online and more than 5,000 offline jobs, processing more than 3 PB of data every day.

[Figure 2]

2.2 Data Computing

The following two figures show data computing, including real-time compute and offline computing.

  • Real-time compute mainly includes real-time feature generation, multimedia feature generation, and real-time sample generation. These are closely related to business. In addition, some basic real-time compute services for Flink and Storm are also provided.
  • Offline computing is mainly SQL computing, covering ad hoc queries, data generation, data queries, and table management. Table management covers data warehouse management, including table metadata, table access permissions, and upstream and downstream table lineage.

[Figure 3]

2.3 Real-Time Features

As shown in the following figure, we built a real-time feature generation service based on Flink and Storm. In general, a job consists of job details, input sources, feature generation, output, and resource configuration. Users develop feature-generation UDFs according to our pre-defined interfaces; everything else, such as reading the input and writing the features, is provided by the platform and only needs to be configured on the page. In addition, the platform automatically provides monitoring of input data sources, job exceptions, feature writing, and feature reading.
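To make the division of labor concrete, here is a minimal sketch of what such a pre-defined feature-generation interface could look like. FeatureUdf and FeatureResult are hypothetical names for illustration only, not Weibo's actual API; the platform would supply the input records and persist the returned features.

```java
// Hypothetical interface sketch; not Weibo's actual platform API.
public interface FeatureUdf extends java.io.Serializable {

    /** Called once before processing starts, e.g. to load dictionaries. */
    default void open() throws Exception {}

    /**
     * Turn one raw input record (e.g. a user behavior log) into zero or more
     * features keyed by user ID; the platform handles reading the input and
     * writing the returned features to the feature store.
     */
    java.util.List<FeatureResult> eval(String rawRecord) throws Exception;
}

/** A hypothetical feature record: key (e.g. user ID), feature name, and value. */
class FeatureResult {
    public final String key;
    public final String featureName;
    public final double value;

    public FeatureResult(String key, String featureName, double value) {
        this.key = key;
        this.featureName = featureName;
        this.value = value;
    }
}
```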

[Figure 4]

2.4 Batch and Stream Integration

The following describes the architecture of batch and stream integration built on FlinkSQL. First, we unify metadata and integrate real-time and offline logs through the metadata management platform. After integration, we provide a unified scheduling layer when users submit jobs. Scheduling refers to dispatching jobs to different clusters based on the job type, job characteristics, and current cluster load.

Currently, the scheduling layer supports the HiveSQL, SparkSQL, and FlinkSQL computing engines. HiveSQL and SparkSQL are used for batch computing, while FlinkSQL runs batch and streaming jobs together. The processing results are output to the data warehouse for the business side to use. The batch and stream integration architecture has four key points (a FlinkSQL sketch follows the list):

  1. Unify batch and stream code to improve development efficiency
  2. Unify the metadata of batch and stream processing and unify management to ensure metadata consistency
  3. Run the batch and stream program together to save resources
  4. Unify batch and stream scheduling to improve cluster utilization
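As a minimal sketch of the first point (unified batch and stream code), the same FlinkSQL statement can be executed in either mode by switching only the environment settings. The table names here are illustrative, and the source and sink tables are assumed to be registered in a shared catalog by the metadata management platform.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class UnifiedJob {
    public static void main(String[] args) {
        // Switch between streaming and batch execution without changing the query.
        boolean streaming = args.length > 0 && "streaming".equals(args[0]);
        EnvironmentSettings settings = streaming
                ? EnvironmentSettings.newInstance().inStreamingMode().build()
                : EnvironmentSettings.newInstance().inBatchMode().build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Source and sink tables are assumed to be registered in a shared catalog
        // by the metadata platform; the names below are made up for illustration.
        tEnv.executeSql(
                "INSERT INTO dws_user_action_count "
                + "SELECT user_id, action, COUNT(*) AS cnt "
                + "FROM ods_user_action_log "
                + "GROUP BY user_id, action");
    }
}
```

The same query text is submitted in both cases; only the execution mode and the scheduler's placement decision change.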

[Figure 5]

2.5 Data Warehouse

  • For the offline data warehouse, we divide data into three layers: raw logs, the middle layer, and the data service layer. Metadata is unified at the middle layer, and beneath it sits the real-time data warehouse.
  • For the real-time data warehouse, we perform streaming ETL on these raw logs through FlinkSQL and then write the final results to the data service layer through a streaming summary. The data service layer stores data in various real-time stores, such as Elasticsearch, HBase, Redis, and ClickHouse. This real-time storage allows external users to query data and also supports further data computing. Building a real-time data warehouse addresses the long generation period of offline features, and using FlinkSQL shortens the long development cycle of streaming jobs. One of the key points is managing the metadata of the offline and real-time data warehouses consistently.
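Below is a minimal sketch of this layered streaming ETL using Flink's Table API in Java. All table names and fields are illustrative; the real table definitions (including time attributes and sink connectors such as Elasticsearch or Redis) would come from the unified metadata platform.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableEnvironment;

public class RealtimeDwJob {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        StatementSet statements = tEnv.createStatementSet();

        // Raw logs -> middle layer: streaming ETL that cleans the raw logs.
        statements.addInsertSql(
                "INSERT INTO dwd_interaction "
                + "SELECT user_id, item_id, action, event_time "
                + "FROM ods_raw_log WHERE action IS NOT NULL");

        // Middle layer -> data service layer: streaming summary into a sink table
        // backed by a real-time store such as Elasticsearch, HBase, Redis, or ClickHouse.
        statements.addInsertSql(
                "INSERT INTO dws_item_action_1min "
                + "SELECT item_id, action, "
                + "       TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start, "
                + "       COUNT(*) AS cnt "
                + "FROM dwd_interaction "
                + "GROUP BY item_id, action, TUMBLE(event_time, INTERVAL '1' MINUTE)");

        statements.execute(); // submit both inserts as one streaming job
    }
}
```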

[Figure 6]

3. Typical Application of Flink in Data Computing Platforms

3.1 Streaming Machine Learning

First, I will introduce several characteristics of streaming machine learning. The most significant is real-time processing, which can be divided into real-time feature processing and real-time model processing.

  • The main purpose of real-time feature processing is to reflect user behavior in a more timely manner and describe users at a finer granularity.
  • Real-time model processing uses online samples to train models in real-time and reflect online changes of objects in a timely manner.

[Figure 7]

Characteristics of Streaming Machine Learning at Weibo

  • The sample size is large; real-time samples currently reach millions of QPS.
  • The model scale is large; the framework supports training models with hundreds of billions of parameters.
  • Job stability requirements are high.
  • Real-time requirements for samples are high.
  • Real-time requirements for models are high.
  • There are many business requirements on the platform.

Several Problems Facing Streaming Machine Learning

  • One is the full link: the end-to-end link is relatively long. For example, a streaming machine learning process runs from log collection through feature generation, sample generation, and model training to service launch, so the whole process is very long, and a problem in any link affects the final user experience. Therefore, we have deployed a fairly complete full-link monitoring system with rich monitoring metrics for each link.
  • The other is the large data scale, including massive user logs, large sample volumes, and large model sizes. We investigated common real-time compute frameworks and finally chose Flink to solve this problem.

[Figure 8]

The Process of Streaming Machine Learning

  • The first is offline training. After we get offline logs and generate samples offline, we read the samples through Flink and do offline training. When training is completed, the resulting parameters are saved in the offline parameter server, and the result serves as the base model of the model service for real-time cold start.
  • Next comes the real-time streaming machine learning process. We collect real-time logs, such as Weibo post logs and interaction logs, use Flink to generate samples from them, and then perform real-time training. When training is completed, the parameters are saved in a real-time parameter server and periodically synchronized from the real-time parameter server to the parameter server used by the model service.
  • Finally, the model service pulls the parameters corresponding to the model from the parameter server and scores the user- and material-related features and behaviors. The sorting service then retrieves the scoring results, applies recommendation strategies to select the most suitable material for the user, and returns it to the user. The user interacts on the client side, which sends new online requests and generates new logs, so the whole streaming learning process forms a closed loop.
  • Offline sample generation and model updates are delayed by hours or days, while the streaming method brings this down to minutes or hours.
  • The computing pressure for offline model training is concentrated, while the computing pressure for real-time training is distributed.

[Figure 9]

Sample

The following briefly introduces the development history of streaming machine learning samples. In October 2018, we launched our first streaming sample job, implemented with Storm and external Redis storage. In May 2019, we adopted the new real-time compute framework, Flink, and used a union + timer scheme instead of window computing to join multiple data streams. In October 2019, we launched a sample job whose single-job QPS reached hundreds of thousands. In April 2020, the sample generation process was made platform-based. By June 2020, the platform had iterated to support sample persistence, including the sample library and improved sample monitoring metrics.

[Figure 10]

  • Sample generation for streaming machine learning joins multiple data streams by the same key. For example, suppose there are three data streams; after data cleaning, each record is stored as a key-value pair, where k is the join key and v holds the fields wanted in the sample. After unioning the streams, we perform a KeyBy aggregation and keep the aggregated data in a value state in memory, as shown in the following figure (a code sketch follows this list):
  • If k1 does not exist in the state, register a timer and store the record in the state.
  • If k1 exists, take it out of the state, update it, and save it back. When the timer fires, the joined data is output and cleared from the state.
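A minimal sketch of this union + timer join as a Flink KeyedProcessFunction is shown below. The field layout and the 10-minute window are illustrative, not Weibo's actual configuration.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;

/**
 * Joins the unioned streams by key: buffer fields in a ValueState, register a
 * timer on first arrival, and emit the merged sample when the timer fires.
 */
public class SampleJoinFunction
        extends KeyedProcessFunction<String, Tuple2<String, Map<String, String>>, Map<String, String>> {

    private static final long WINDOW_MS = 10 * 60 * 1000L; // illustrative 10-minute join window

    private transient ValueState<Map<String, String>> sampleState;

    @Override
    public void open(Configuration parameters) {
        sampleState = getRuntimeContext().getState(new ValueStateDescriptor<>(
                "sample", Types.MAP(Types.STRING, Types.STRING)));
    }

    @Override
    public void processElement(Tuple2<String, Map<String, String>> in, Context ctx,
                               Collector<Map<String, String>> out) throws Exception {
        Map<String, String> sample = sampleState.value();
        if (sample == null) {
            // First record for this key: start a new sample and register the timer.
            sample = new HashMap<>();
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + WINDOW_MS);
        }
        sample.putAll(in.f1);        // merge the fields from this stream into the sample
        sampleState.update(sample);  // save it back into the state
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Map<String, String>> out) throws Exception {
        Map<String, String> sample = sampleState.value();
        if (sample != null) {
            out.collect(sample);     // emit the joined sample when the timer fires
        }
        sampleState.clear();         // and clear the state
    }
}
```

The unioned streams would then be wired together as something like stream1.union(stream2, stream3).keyBy(t -> t.f0).process(new SampleJoinFunction()).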

[Figure 11]

Sample Platform

We have turned the whole sample joining process into a platform-based operation and divided it into five modules: input, data cleaning, sample joining, sample formatting, and output. With platform-based development, users only need to care about the business logic. Users are required to develop the following (a sketch follows below):

  • Data cleaning logic corresponding to the input data
  • Data formatting logic before sample output

The rest of the configuration can be implemented on the UI:

  • The time window for sample joining
  • The aggregation operation for the fields within the window

The resources are reviewed and configured by the platform. In addition, the entire platform provides some basic monitoring, including the monitoring of input data, sample indicators, job exceptions, and sample output.
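For illustration, the two user-developed pieces might be captured by interfaces along the following lines. The names are hypothetical, not the platform's actual API; each interface would live in its own file, and the platform wires them into the input, joining, and output modules.

```java
import org.apache.flink.api.java.tuple.Tuple2;

import java.io.Serializable;
import java.util.Map;

// Hypothetical user-implemented cleaning logic for one input stream.
interface DataCleaner extends Serializable {
    /** Parse one raw record into (join key, field map); return null to drop it. */
    Tuple2<String, Map<String, String>> clean(String rawRecord);
}

// Hypothetical user-implemented formatting logic applied before sample output.
interface SampleFormatter extends Serializable {
    /** Turn the joined fields into the final sample record written to the sample library. */
    String format(String key, Map<String, String> joinedFields);
}
```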

[Figure 12]

Sample UI for Streaming Machine Learning Project

The following figure shows the sample UI of the streaming machine learning project. The left shows the job configuration for sample generation, and the right shows the sample library. The sample library mainly manages and displays samples, including sample descriptions, permissions, and sample sharing.

[Figure 13]

Application of Streaming Machine Learning

Finally, let's take a look at the effect of the streaming machine learning applications. Currently, we support real-time sample joining with QPS in the millions and streaming model training with hundreds of models trained simultaneously. Models can be updated within minutes or hours. Streaming learning supports full-flow disaster tolerance and automatic full-link monitoring. We are currently working on deep learning models for streaming data to improve the expressive power of real-time models, and we will also explore new application scenarios in reinforcement learning.

[Figure 14]

3.2 Multimodal Content Understanding

Introduction

Multimodal content understanding is the ability to process and understand information from multiple modalities using machine learning. At Weibo, this mainly covers images, videos, audio, and text.

  • The image section includes object recognition and tagging, OCR, face recognition, celebrity recognition, appearance scoring, and intelligent cropping.
  • The video section includes copyright detection and logo recognition.
  • The audio section includes speech-to-text and audio tagging.
  • The text section mainly includes the text's subtext, timeliness, and classification labels.

For example, when we first did video classification, we only used the image frames extracted from the video. Later, during a second round of optimization, we added the audio and blog text associated with the video, which amounts to fusing audio, images, and text in a multimodal way to generate more precise video classification tags.

[Figure 15]

Platform

The following figure shows the platform architecture of multimodal content understanding. In the middle part, Flink performs real-time compute, receiving data such as image streams, video streams, and blog streams. Through a model plug-in, it calls the basic deep learning model services shown in the lower part, and the content features returned by those calls are saved in feature engineering and provided to various business parties through the data mid-end. Full-link monitoring and alerting throughout the job allows abnormal conditions to be handled immediately, and the platform automatically provides functions such as log collection, metric statistics, and case tracking. ZooKeeper is used for service discovery to synchronize service status between real-time compute and the deep learning model services; in addition to state synchronization, there are also load-balancing policies. Finally, a data reconciliation system is used to improve the data processing success rate.
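A common way to call external model services from Flink without blocking the pipeline is asynchronous I/O. The sketch below assumes a hypothetical ModelClient that wraps ZooKeeper-based service discovery and load balancing; it is not Weibo's actual plug-in API, and the addresses and service path are made up.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.CompletableFuture;

/** Calls an external deep learning model service asynchronously. */
public class ModelCallFunction extends RichAsyncFunction<String, String> {

    private transient ModelClient client;

    @Override
    public void open(Configuration parameters) {
        // Hypothetical: discover healthy model instances via ZooKeeper.
        client = new ModelClient("zk-host:2181", "/services/image-tagging");
    }

    @Override
    public void asyncInvoke(String material, ResultFuture<String> resultFuture) {
        CompletableFuture
                .supplyAsync(() -> client.predict(material))              // remote model call
                .whenComplete((tags, err) -> resultFuture.complete(
                        Collections.singleton(err == null ? tags : ""))); // empty result on failure
    }

    @Override
    public void timeout(String material, ResultFuture<String> resultFuture) {
        resultFuture.complete(Collections.singleton(""));                 // degrade gracefully on timeout
    }
}

/** Hypothetical client stub; a real one would discover and load-balance model instances. */
class ModelClient {
    ModelClient(String zkAddress, String servicePath) { /* connect via ZooKeeper */ }
    String predict(String material) { return "tag:example"; }
}

// Wiring (sketch): at most 100 in-flight calls, 500 ms timeout per call.
// DataStream<String> features = AsyncDataStream.unorderedWait(
//         materialStream, new ModelCallFunction(), 500, TimeUnit.MILLISECONDS, 100);
```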

[Figure 16]

UI

The UI for multimodal content understanding mainly includes job information, input source information, model information, output information, and resource configuration. Configuration-based development improves development efficiency. Monitoring metrics for model calls, such as the success rate and time spent per call, are generated automatically: when a job is submitted, a companion metrics statistics job is generated automatically.

[Figure 17]

3.3 Content Deduplication Service

Background

In recommendation scenarios, repeatedly pushing duplicate content to a user degrades the user experience. With this in mind, we built a content deduplication platform based on the Flink real-time stream computing platform, a distributed vector retrieval system, and deep learning model services. The platform features low latency, high stability, and a high recall rate. It currently supports multiple business parties with over 99.9% stability.

[Figure 18]

Architecture

The following figure shows the architecture of the content deduplication service. The bottom section describes multimedia model training, which is offline. For example, we collect sample data, process it, and store the samples in the sample library. When we need to train a model, we select samples from the sample library and train on them. The training results are saved to the model library.

[Figure 19]

The main models used here for content deduplication are vector generation models, which produce vectors for images, text, and videos.

After we verify that the trained model is correct, we save it to the model library. The model library stores some basic information about each model, including its running environment and version. Then, the model needs to be deployed online. The deployment process pulls the model from the model library together with the runtime environment it needs.

After the model is deployed, Flink reads materials from the real-time material library and calls the multimedia prediction service to generate vectors for these materials. These vectors are stored in the Weiss library, a vector recall and retrieval system developed by Weibo. When a material is written into the Weiss library, a batch of materials similar to it is recalled. In the fine comparison stage, a certain strategy is used to select the most similar item from all the recalled results. The most similar item and the current material are then aggregated under one content ID. Finally, when the business uses the materials, it deduplicates them by their content IDs.
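The per-material deduplication step can be sketched as follows. WeissClient and its methods are hypothetical stand-ins for Weibo's internal vector retrieval system, and the similarity threshold is illustrative.

```java
import java.util.List;

/** Sketch of the recall, fine comparison, and content ID aggregation steps. */
public class DedupStep {

    private static final double SIM_THRESHOLD = 0.9; // illustrative threshold

    private final WeissClient weiss = new WeissClient();

    /** Returns the content ID the material is assigned to. */
    public String assignContentId(String materialId, float[] vector) {
        // 1. Recall a batch of candidate materials with similar vectors.
        List<WeissClient.Hit> candidates = weiss.search(vector, 20);

        // 2. Fine comparison: pick the most similar candidate above the threshold.
        WeissClient.Hit best = null;
        for (WeissClient.Hit hit : candidates) {
            if (hit.similarity >= SIM_THRESHOLD && (best == null || hit.similarity > best.similarity)) {
                best = hit;
            }
        }

        // 3. Aggregate: reuse the duplicate's content ID, otherwise start a new one,
        //    and index this material's vector for future queries.
        String contentId = (best != null) ? best.contentId : materialId;
        weiss.insert(materialId, contentId, vector);
        return contentId;
    }
}

/** Hypothetical vector store client standing in for the Weiss library. */
class WeissClient {
    static class Hit { String contentId; double similarity; }
    List<Hit> search(float[] vector, int topK) { return java.util.Collections.emptyList(); }
    void insert(String materialId, String contentId, float[] vector) {}
}
```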

Application

There are three application scenarios of content deduplication:

  1. It supports video copyright protection. The stability of pirated video recognition reaches 99.99%, and the piracy recognition rate is 99.99%.
  2. It supports full-site Weibo video deduplication. The application stability for recommendation scenarios reaches 99.99%, with second-level processing delay.
  3. The stability of the recommended material deduplication reaches 99%, with second-level processing delay, and the accuracy rate reaches 90%.

[Figure 20]

Summary

We have done a lot of platform and service work by combining the Flink real-time stream computing framework with our business scenarios, and we have made many optimizations for development efficiency and stability. Development efficiency is improved through modular design and platform-based development. The real-time data computing platform currently comes with end-to-end monitoring, metric statistics, and debug case tracing (log review). In addition, FlinkSQL implements batch-stream integration. These are some of the new changes that Flink has brought to us, and we will continue to explore the wider application of Flink at Weibo.
