How to discover bugs in online advertising in real time

1. Background

The data processing pipeline for search advertising on an e-commerce platform is usually long and generally goes through the following stages:

Advertisers set up advertisements in the advertiser backend;
Advertisement and keyword data are written into the database;
The database data is built either in full (imported into the data warehouse and processed offline) or incrementally (via message queues and a stream computing engine) to produce the "content files" used to build the online index;
BuildService builds the index that the search service uses for retrieval from these "content files".
The following figure shows the buyer and seller data processing pipeline of ICBU's advertising system:

The right half (BP -> DB) together with the offline part constitutes the update path for advertising data.

A complex data processing pipeline combined with a massive volume of product data (usually more than 100 million items) makes it very challenging to verify the correctness of the delivery status of every product online. From the database, through large-scale offline table joins, to online index construction, an anomaly or data delay at any node in the pipeline can cause financial loss to advertisers and to the platform, for example:

The advertiser cancels the advertisement for product A in the backend, but because of a processing delay in the data pipeline, A's status in the search engine is still "promoting", so A keeps being exposed when buyers search, and every subsequent "click" results in an incorrect charge.
An advertiser configures a product to be advertised only to buyers in a certain region or country, but because of faulty filter logic in the search engine, the advertisement is shown in all regions, which likewise leads to charges for invalid clicks.
Traditional testing approaches focus either on functional tests of the advertiser backend application or on functional tests of the search engine modules; there is no effective, comprehensive way to test the full pipeline. Online business monitoring focuses on business performance indicators such as CTR (click-through rate), CPC (cost per click), and RPM (revenue per mille). There is no effective mechanism for discovering ad misdelivery, which directly affects advertisers' vital interests and the platform's revenue.

Our approach: for every product actually exposed by the online search advertising engine, look up its last state in the database before the exposure time and verify that the delivery status in the database is consistent with the status in the search engine, thereby detecting ad misdelivery online and in real time. In addition, different trigger mechanisms provide effective coverage of each data-change node in the pipeline.

2. Results so far

With the help of the log stream synchronization service (TTLog), the massive-scale NoSQL storage system (Lindorm), the real-time business verification platform (BCP), the message queue (MetaQ), the online data real-time synchronization service (Jingwei), and the massive-scale real-time log analysis system (Xflush), we have achieved real-time, online detection of ICBU search advertisement misdelivery, covering the real exposure traffic of all online users. In addition, by adding active verification at data-change nodes, problems in specific scenarios (where the advertisement has not yet been retrieved by any user) can be discovered before users hit them.

In addition, the combination of TTLog, the real-time computing engine Blink, Alibaba Cloud Log Service (SLS), and Xflush gives us real-time visibility into the online behavior of the engine and the algorithms.

The following is the real-time quality dashboard for ICBU advertisements:

The system has been in production since the end of August. So far it has discovered a number of online problems, almost all of which directly involve financial loss and advertisers' interests.

3. Technical implementation

1. Engine exposure log data processing

In an e-commerce search advertising system, when a real user request arrives (1.1 in Figure 1), an advertisement exposure is generated in real time, and a corresponding exposure record is written to the search engine's log (2 in Figure 1). We use the log stream synchronization service TTLog to collect the log data from every search engine server node in a unified way (3 in Figure 1), then use the real-time verification platform BCP to consume the streaming data from TTLog (4 in Figure 1), where it is cleaned, filtered, and sampled, and finally push the data to be verified to the message queue service MetaQ (5 in Figure 1).
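A minimal sketch of the "filter, sample, push to MetaQ" step (5 in Figure 1) is shown below. MetaQ is Alibaba-internal; the open-source RocketMQ producer API is used here as a stand-in, and the topic name, sampling rate, and exposure-record marker are illustrative assumptions rather than the actual implementation.

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;

public class ExposureForwarder {
    private static final double SAMPLE_RATE = 1.0;   // 100% online sampling, as described later
    private final DefaultMQProducer producer;

    public ExposureForwarder(String nameServer) throws Exception {
        producer = new DefaultMQProducer("ad_check_producer_group");
        producer.setNamesrvAddr(nameServer);
        producer.start();
    }

    /** Called for each exposure log line pulled from TTLog. */
    public void onExposureRecord(String logLine) throws Exception {
        // Filter: keep only ad exposure records (hypothetical marker).
        if (!logLine.contains("ad_expose")) {
            return;
        }
        // Sample: forward only a configurable fraction of the traffic.
        if (ThreadLocalRandom.current().nextDouble() >= SAMPLE_RATE) {
            return;
        }
        // Push the record to MetaQ for asynchronous verification by igps.
        Message msg = new Message("AD_EXPOSURE_CHECK", logLine.getBytes(StandardCharsets.UTF_8));
        producer.send(msg);
    }
}
```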

2. DB data processing

As shown in Figure 2, the business database MySQL usually stores only the latest data for each domain object. In order to obtain the last state of an advertised product before it was actually exposed by the engine, we use Jingwei to listen to every data change in the database and write a "snapshot" of the changed data into Lindorm (backed by HBase storage, supporting random reads and writes over massive data sets).
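Below is a minimal sketch of writing such a change snapshot into Lindorm, which exposes an HBase-compatible API. The table name, column family, and row-key layout are assumptions; the key idea is that every change becomes a new cell version stamped with its change time, so the state at any past moment can be read back later.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SnapshotWriter {
    private final Connection connection;  // HBase/Lindorm connection, created elsewhere

    public SnapshotWriter(Connection connection) {
        this.connection = connection;
    }

    /** Called by the Jingwei change listener for every changed row. */
    public void writeSnapshot(String tableAlias, String productId,
                              long changeTimeMs, byte[] rowJson) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("ad_snapshot"))) {
            // Row key: source table + product id; each change is stored as a new
            // cell version whose timestamp is the change time.
            Put put = new Put(Bytes.toBytes(tableAlias + ":" + productId));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("row"), changeTimeMs, rowJson);
            table.put(put);
        }
    }
}
```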

3. Data consistency check

In the advertising testing service igps (an application we built ourselves), we listen to MetaQ and pull the data to be verified (6 in Figure 1), parse out the state each advertised product had in the search engine at the moment it was exposed, and record the exposure timestamp. Then, keyed on that timestamp, we query Lindorm to obtain the last state the product had in MySQL before the exposure time (7 in Figure 1). Finally, igps verifies that the state in the engine and the state in MySQL are consistent for each exposed advertisement.
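A minimal sketch of this consistency check follows. It reads, via the HBase-compatible API, the last snapshot written before the exposure time and compares it with the state parsed from the engine's exposure log. Table, column family, and field names follow the SnapshotWriter sketch above and are assumptions.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ConsistencyChecker {
    private final Connection connection;

    public ConsistencyChecker(Connection connection) {
        this.connection = connection;
    }

    /** Returns true if the engine state matches the last DB state before the exposure. */
    public boolean check(String tableAlias, String productId,
                         long exposureTimeMs, String engineStatus) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("ad_snapshot"))) {
            Get get = new Get(Bytes.toBytes(tableAlias + ":" + productId));
            // Restrict to versions strictly before the exposure time; HBase returns the
            // newest version inside the range, i.e. the last DB state before exposure.
            get.setTimeRange(0, exposureTimeMs);
            Result result = table.get(get);
            byte[] rowJson = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("row"));
            if (rowJson == null) {
                return false;  // no snapshot before exposure: flag for investigation
            }
            String dbStatus = parseStatus(rowJson);  // extract the status field from the snapshot
            boolean consistent = dbStatus.equals(engineStatus);
            if (!consistent) {
                // Error log picked up by Xflush (8 in Figure 1) for aggregation and alerting.
                System.err.printf("AD_CHECK_ERROR table=%s id=%s db=%s engine=%s%n",
                        tableAlias, productId, dbStatus, engineStatus);
            }
            return consistent;
        }
    }

    private String parseStatus(byte[] rowJson) {
        // Placeholder: in practice, parse the JSON snapshot and return its status field.
        return new String(rowJson, java.nio.charset.StandardCharsets.UTF_8);
    }
}
```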

If the verification finds an inconsistency, an error log is printed. Finally, with the help of Xflush, the real-time log analysis system (8 in Figure 1), we get real-time aggregated statistics, visual dashboards, and monitoring and alerting on the error data.

4. Active verification of data change nodes

Because real-time user search traffic is inherently random, the scenarios it covers are uncertain. As a supplement, we added active verification at the data-change nodes.

Across the entire data pipeline there are two important data-change nodes:

Data changes in MySQL;
Engine index switching.
For data changes in MySQL: we use Jingwei to listen for changes, construct an engine query request string specific to the changed record, and issue the query (1.3 in Figure 1).
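A minimal sketch of this active check is shown below: build an engine query targeted at the changed record and feed the response into the same consistency logic used for real traffic. The engine endpoint, query parameters, and response parsing are illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ActiveChangeVerifier {
    private static final String ENGINE_URL = "http://ad-search-engine/query";  // hypothetical
    private final HttpClient client = HttpClient.newHttpClient();

    /** Called by the Jingwei listener after a product row changes in MySQL. */
    public void verifyChange(String productId, String expectedStatus) throws Exception {
        // Query the engine directly for this one product.
        String url = ENGINE_URL + "?feedId=" + productId + "&debug=true";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Reuse the data consistency verification described above: compare the delivery
        // status returned by the engine with the state just written to MySQL.
        String engineStatus = parseDeliveryStatus(response.body());
        if (!expectedStatus.equals(engineStatus)) {
            System.err.printf("ACTIVE_CHECK_ERROR feedId=%s db=%s engine=%s%n",
                    productId, expectedStatus, engineStatus);
        }
    }

    private String parseDeliveryStatus(String engineResponse) {
        // Placeholder: parse the engine's response format and extract the delivery status.
        return engineResponse;
    }
}
```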

For engine index switching (mainly full index switches): offline, we aggregate and rewrite historical online advertising traffic (for example, the past 7 days) into a set of test-case requests. We then monitor the online engine's index switch operations; whenever a full index switch happens, we actively send this batch of requests to the engine service (1.2 in Figure 1).

Both kinds of proactive requests ultimately reuse the data consistency verification system described above to verify the advertisement delivery status.

The picture above is the real-time error monitoring view for advertisement delivery status verification. From it we can clearly see the data quality of the search advertising pipeline at any moment. Whether the cause is DB synchronization delay between the China and US sites, delay in the incremental data path from the DB to the engine, or a logic error introduced by a release, it shows up as an abnormal rise in the error curve. The verification rules cover campaign, adgroup, customer status, keyword status, and product (feed) status; the verified nodes cover the two separate exposure and click paths.

5. Real-time quality of engines and algorithms

The search engine log (pvlog) contains a lot of valuable information. Used well, it not only enables real-time discovery of online problems but also helps algorithm engineers perceive online effects in real time and gives them a starting point for analysis. As shown in Figure 3, we use the real-time computing engine Blink to parse and split the pv records in TTLog, write the split results to Alibaba Cloud Log Service (SLS), and then connect SLS to Xflush for real-time aggregation and visualization.

As shown in the figure above, in the first half of the year we had an online incident that caused financial loss: a parameter was missing from the SP request string that the search application constructs for the search advertising engine, so the ads of some top customers were not delivered to their designated regions. More than 10 hours passed between the fault occurring and more than 10 customers reporting it. By monitoring the keys and important values of SP request strings in real time, we can quickly detect scenarios where a key or value is missing.

In addition, the distributions of recall types, charge types, and charge prices not only reveal abnormal online states but also give algorithm engineers a reference for running experiments, tuning parameters, and troubleshooting online problems.

4. Several core issues
1. Why Lindorm?

In the initial implementation, we used Jingwei to listen to changes in the business DB and wrote them into another, newly created MySQL database, but performance was a major bottleneck. Our business database is split across 5+ physical databases and 1000+ sub-tables; a single table holds more than 10 million rows on average, and the total reaches the tens-of-billions range.

After migrating the snapshot storage to Lindorm, storage optimization and logical table partitioning improved the average query latency from 1 s to 70 ms.

2. Why BCP + MetaQ + igps?

Initially, we wanted to do the verification directly in BCP: wrap the Lindorm query interface in igps and expose it as an HSF interface for BCP to call directly.

Again, the problem was performance: a single TTLog message contains 60+ PVs on average, each PV may carry 5 or more advertisements, and each advertisement needs 6 tables checked, so verifying one message in BCP requires roughly 60 x 5 x 6 = 1800 HSF calls. With only 10% of TTLog data sampled in BCP, the backend service igps was already the bottleneck: its HSF thread pool was exhausted and the average CPU usage across its seven servers exceeded 70%.

Introducing MetaQ eliminates the network overhead of per-record HSF calls and decouples message production from consumption: when a traffic peak arrives, igps keeps consuming at its own rate and the extra messages simply queue up. With this optimization we not only handled 10% sampling comfortably; even with the online sampling rate set to 100%, the average CPU usage of the igps servers stays at around 20% and no backlog builds up in MetaQ.
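A minimal sketch of the decoupled consumer side is shown below, again using the open-source RocketMQ API as a stand-in for MetaQ. The consumer group, topic, address, and thread counts are illustrative assumptions; the point is that consumption concurrency is bounded, so peaks accumulate in the queue rather than overloading igps.

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

public class IgpsConsumer {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("ad_check_consumer_group");
        consumer.setNamesrvAddr("nameserver:9876");          // hypothetical address
        consumer.subscribe("AD_EXPOSURE_CHECK", "*");
        // Bound the consumption concurrency: during traffic peaks the surplus messages
        // accumulate in the queue instead of overloading igps.
        consumer.setConsumeThreadMin(8);
        consumer.setConsumeThreadMax(8);
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
            for (MessageExt msg : msgs) {
                // Parse the exposure record and run the Lindorm-based consistency check
                // (see the ConsistencyChecker sketch above).
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}
```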

The trade-off is that BCP's role shrinks from "sampling, filtering, verification, and alerting" to just "sampling and filtering", so we lose its ability to adapt quickly to business changes through online scripting.

3. Why not Blink everywhere?

In fact, the "BCP + MetaQ + igps" pipeline could be replaced by "Blink + SLS", so why not use Blink throughout?

On the one hand, because the click verification traffic is relatively small, we write that verification code directly in BCP, which skips the release process and is faster; BCP also offers purpose-built features such as delayed verification and rate limiting. On the other hand, from our experience so far Blink still has some instability, in particular network jitter (possibly because the data sources and the Blink workers are in different data centers).

4. How to split the key-value pairs in SP requests?

While building real-time key-value monitoring for SP request strings, we hit a small problem: the parameter keys in an SP request string are dynamic; not every key appears in every string, and the keys appear in different orders in different requests. How do we split them to satisfy Xflush's "group by column value" format requirement?

Our solution is to use a Blink UDTF (user-defined table function) to parse each SP request string and extract all of its keys and values, then emit one output row per key in the form "validKey={key}, validValue={value}". Xflush can then group by validKey and count rows.
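A minimal sketch of such a UDTF is shown below, written against the open-source Flink Table API that Blink is based on; class and column names are assumptions.

```java
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.table.functions.TableFunction;
import org.apache.flink.types.Row;

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class SpRequestSplitter extends TableFunction<Row> {

    /** Each call parses one SP request query string and emits one row per key. */
    public void eval(String spRequest) {
        if (spRequest == null || spRequest.isEmpty()) {
            return;
        }
        for (String pair : spRequest.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            try {
                value = URLDecoder.decode(value, StandardCharsets.UTF_8.name());
            } catch (Exception ignored) {
                // keep the raw value if it is not valid URL encoding
            }
            // One output row per parameter: Xflush groups by validKey and counts rows.
            collect(Row.of(key, value));
        }
    }

    @Override
    public TypeInformation<Row> getResultType() {
        return Types.ROW(Types.STRING, Types.STRING);
    }
}
```

In Blink SQL the function can then be registered and applied with a lateral table join, for example LATERAL TABLE(sp_split(sp_request)) AS T(validKey, validValue), before the result is written to SLS.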

Summary and follow-up plans

This article described how big-data processing techniques enable real-time discovery of end-to-end data consistency problems in the e-commerce search advertising scenario, and how "real-time discovery" combined with "active verification at data-change nodes" achieves consistency checking across the whole data path.

There are two main directions for subsequent optimization:

1: Expose real-time data along richer dimensions, in combination with the business's usage scenarios.

2: "Move left" this set of technical systems to the offline/pre-release testing stage to realize one-click automated testing of "function, performance, and effect", while covering the entire link from search applications to engines.
