Engineering practice for real-time business support of B-side algorithms

1. Background

In marketing scenarios, the algorithm team provides advertisers with personalized marketing tools that help them run more refined marketing and achieve better ROI at a controllable cost. Over the past period we have supported several real-time business scenarios, such as real-time evaluation of bidding strategies (Replay), batch synchronization of keywords, and real-time features. Most business-side engineers can already use ODPS comfortably, but experience with Blink is still limited. Here we share the experience accumulated in these scenarios, in the hope that it is helpful to others.

2. Technology selection

Why choose Blink? For most offline scenarios, where there is no timeliness requirement and the data source is batch rather than streaming (i.e., not TT, SLS, SWIFT, or a sequentially consumed source), ODPS is the better choice. Conversely, when the data source is real-time (TT/SLS/SWIFT), when ODPS needs to be read sequentially, or when timeliness matters, Blink is the better choice.

Blink currently supports both Batch mode and Streaming mode. Batch mode runs with a fixed start time and end time; compared with ODPS, its biggest advantage is that resources are requested in advance and can be exclusive, which guarantees timeliness. Streaming mode is real-time consumption in the traditional sense and can achieve millisecond-level processing.

From the perspective of development mode, there are two main options: DataStream mode, which is similar to ODPS MR, and SQL mode. In terms of ease of use, SQL is undoubtedly the lowest-cost option, but for complex scenarios DataStream offers the strongest control: it can flexibly define custom caches and data structures and serve multiple scenarios within one job. A brief contrast is sketched below.
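To make the contrast concrete, here is a minimal sketch in open-source Flink terms (Blink's hosted environment and internal connectors differ; the ad ids and costs below are made up). The SQL-mode version of a per-ad cost aggregation is a single declarative statement, while the DataStream-mode version of the same logic is written imperatively and leaves room for custom state and data structures:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ModeContrastSketch {
    public static void main(String[] args) throws Exception {
        // SQL mode: the whole job is one declarative statement, e.g.
        //   INSERT INTO sink SELECT ad_id, SUM(cost) FROM bid_log GROUP BY ad_id
        // and the engine decides the operators and state layout.

        // DataStream mode: the same aggregation written imperatively; process functions,
        // custom caches and data structures can be plugged in when SQL is not enough.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Double>> bidLog = env.fromElements(
                Tuple2.of("ad1", 1.0), Tuple2.of("ad1", 2.5), Tuple2.of("ad2", 0.5));

        bidLog.keyBy(t -> t.f0)                                  // group by ad id
              .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))    // running sum of cost
              .print();

        env.execute("mode-contrast-sketch");
    }
}
```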

3. Main scenarios

3.1 Real-time replay bid strategy evaluation

Business background

The Replay system is a simulation system that collects, structures, and post-processes online bidding logs. It records the bidding information of the Zhitongche (through-train) online engine after recall, mainly covering the online recall, bidding, and scoring queues. Combined with the ranking and charging formulas, these logs can be used to simulate the online bidding environment; put simply, it can estimate what result a different bid on a bidword would have produced at that time. With the Replay system, the algorithm team and advertisers can use offline traffic to estimate the effect of a strategy change before running an online A/B test, which makes online experiments more controllable. At the same time, when testing strategies that may have a negative impact, the effect on overall market revenue can be kept as small as possible.

The algorithm team wants to evaluate multiple business-side bidding strategies on top of the online fine-ranking and recall logs: replay one day of sampled logs (about 1 billion records), evaluate the bidding strategies, and track ads going offline in real time so that already-offline ads do not distort the evaluation. The expectation is that the 1 billion records can be processed within 1-2 hours.

Main challenges

How to load 10 million material records;
How to synchronize offline-ad information in real time under high QPS (around 1 million);
How to decouple the whole real-time job link from the business-side strategy logic.

Solution

Material data loading: all material data is loaded when the Blink job starts, avoiding pressure on iGraph under high-QPS access; in addition, broadcast mode is used, so the data is loaded only once and is available on every node, avoiding repeated loading of the ODPS data. The pattern is sketched below.
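A minimal sketch of that broadcast pattern, written against the open-source Flink DataStream API rather than Blink's internal connectors; the pipe-delimited records and the in-memory sources stand in for the ODPS material table and the TT bid log:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class MaterialBroadcastSketch {
    // Broadcast state holding the material rows, keyed by ad id (kept as plain strings here).
    static final MapStateDescriptor<String, String> MATERIAL_STATE =
            new MapStateDescriptor<>("material",
                    BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-ins: in the real job these come from the ODPS material table and the TT bid log.
        DataStream<String> materials = env.fromElements("ad1|material1", "ad2|material2");
        DataStream<String> bidLogs = env.fromElements("ad1|bid", "ad2|bid");

        // The material stream is broadcast once and replicated to every parallel node.
        BroadcastStream<String> broadcastMaterials = materials.broadcast(MATERIAL_STATE);

        bidLogs.connect(broadcastMaterials)
               .process(new BroadcastProcessFunction<String, String, String>() {
                   @Override
                   public void processElement(String log, ReadOnlyContext ctx, Collector<String> out)
                           throws Exception {
                       // Per-record lookup hits the local broadcast state, not a remote store.
                       String adId = log.split("\\|")[0];
                       String material = ctx.getBroadcastState(MATERIAL_STATE).get(adId);
                       out.collect(log + " -> " + material);
                   }

                   @Override
                   public void processBroadcastElement(String row, Context ctx, Collector<String> out)
                           throws Exception {
                       String[] kv = row.split("\\|");
                       ctx.getBroadcastState(MATERIAL_STATE).put(kv[0], kv[1]);
                   }
               })
               .print();

        env.execute("material-broadcast-sketch");
    }
}
```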

Offline-ad information is stored in iGraph in buckets, and the full set of offline ads is read through a periodically refreshed cache, which reduces more than 2 million QPS of queries to about 10,000; the RateLimit throttling component controls access concurrency, capping iGraph concurrency at around 400,000 so that overall traffic stays smooth. The access pattern is sketched below.
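A rough sketch of this access pattern, assuming a hypothetical IGraphClient interface in place of the internal iGraph SDK and using Guava's RateLimiter to play the role of the RateLimit throttling component:

```java
import com.google.common.util.concurrent.RateLimiter;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Periodic cache plus throttled fallback lookup; IGraphClient is a hypothetical stand-in. */
public class OfflineAdCache {

    /** Hypothetical lookup interface; the real iGraph client is Alibaba-internal. */
    public interface IGraphClient {
        Map<String, Boolean> scanBucket(int bucket);   // bucketed full scan
        Boolean getOfflineFlag(String adId);           // point lookup
    }

    private final IGraphClient client;
    private final RateLimiter limiter;                 // caps direct iGraph queries
    private final Map<String, Boolean> offlineCache = new ConcurrentHashMap<>();
    private final ScheduledExecutorService refresher =
            Executors.newSingleThreadScheduledExecutor();

    public OfflineAdCache(IGraphClient client, double maxQps, int buckets, long refreshSeconds) {
        this.client = client;
        this.limiter = RateLimiter.create(maxQps);
        // Refresh the full offline-ad set bucket by bucket on a fixed period,
        // so per-record processing hits the local cache instead of iGraph.
        refresher.scheduleAtFixedRate(() -> {
            for (int b = 0; b < buckets; b++) {
                offlineCache.putAll(client.scanBucket(b));
            }
        }, 0, refreshSeconds, TimeUnit.SECONDS);
    }

    /** Per-record check: served from the cache; falls back to a throttled point lookup. */
    public boolean isOffline(String adId) {
        Boolean cached = offlineCache.get(adId);
        if (cached != null) {
            return cached;
        }
        limiter.acquire();                             // blocks to keep iGraph QPS bounded
        Boolean remote = client.getOfflineFlag(adId);
        boolean offline = remote != null && remote;
        offlineCache.put(adId, offline);
        return offline;
    }
}
```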

The real-time engineering framework reserves a UDF interface so that the business side only needs to implement the SDK, while engineering concerns such as performance, concurrency, throttling, and instrumentation are handled inside the framework. This decouples the engineering framework from the Replay algorithm strategies; a hypothetical shape of the SDK is sketched below.
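Purely as an illustration of the decoupling, the business-facing SDK might look roughly like the interface below; the real interface, method names, and record format are internal and may differ:

```java
import java.util.Map;

/**
 * Hypothetical strategy SDK: the framework owns I/O, concurrency, throttling and tracking,
 * and only invokes these hooks, so a bidding strategy can be swapped without touching
 * the engineering code.
 */
public interface ReplayStrategy {

    /** Called once per subtask with job parameters, e.g. to load strategy configuration. */
    void open(Map<String, String> jobParams);

    /**
     * Evaluate one recalled bid record and return the simulated bid,
     * or null to skip the record. bidLine is one structured line of the recall/ranking log.
     */
    Double rebid(String bidLine);

    /** Called once on shutdown to flush any strategy-side state. */
    void close();
}
```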

Summary

For this business requirement, we used the flexibility of Blink's Batch mode to process TT data within a fixed start and end time, and along the way built reusable TT read/write, ODPS, iGraph, and tracking components. These components have supported subsequent job development for similar businesses and provide the basic capabilities for later productization of such jobs.

3.2 Real-time features

Business background

As B-side algorithms develop, the incremental gains from model upgrades are becoming smaller and smaller. To further expand the growth space from the B-side perspective, we need to capture user intent from real-time customer information and mine potential demand more comprehensively and in real time. We therefore generate real-time user-behavior features from online behavior logs, and the algorithm team uses this real-time data to improve the online models.

Based on this requirement, we built a real-time user-feature pipeline that parses the upstream A+ data source to produce real-time user features, mainly of the following kinds:

compute roughly 50 user feature values and write them to iGraph;
output user IDs that exhibit a certain behavior, aggregated by minute;
output the sum, mean, or count of a certain signal over the past hour (sketched below).
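As an illustration of the last kind of feature, here is a minimal sketch in open-source Flink SQL; the real job consumes the A+ log through an internal connector and writes to iGraph, so the datagen/print connectors and the column names below are stand-ins:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HourlyFeatureSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Stand-in for the upstream behavior log.
        tEnv.executeSql(
            "CREATE TABLE behavior (" +
            "  user_id BIGINT," +
            "  click_cnt BIGINT," +
            "  event_time TIMESTAMP(3)," +
            "  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND" +
            ") WITH ('connector' = 'datagen', 'rows-per-second' = '10')");

        // Stand-in for the iGraph feature sink.
        tEnv.executeSql(
            "CREATE TABLE feature_sink (user_id BIGINT, window_end TIMESTAMP(3), " +
            "  click_sum BIGINT, click_avg DOUBLE, click_count BIGINT) " +
            "WITH ('connector' = 'print')");

        // Sum / mean / count of a signal over the past hour, per user.
        tEnv.executeSql(
            "INSERT INTO feature_sink " +
            "SELECT user_id, TUMBLE_END(event_time, INTERVAL '1' HOUR), " +
            "       SUM(click_cnt), AVG(CAST(click_cnt AS DOUBLE)), COUNT(*) " +
            "FROM behavior " +
            "GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' HOUR)");
    }
}
```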
Main challenges

The volume of real-time feature development is very large: each feature needs its own real-time data link to be developed and maintained, so development and operations costs are high and the same wheels keep being reinvented.
To develop feature data, a developer has to understand:

the data source, since ETL is performed on the raw fact data;
the computing engine: Flink SQL has its own computing semantics that must be learned and applied proficiently per scenario;
the storage engine: real-time data has to be served online, so a storage engine such as iGraph, HBase, or Hologres must be chosen;
query optimization: different storage engines have their own query clients, usage patterns, and optimization methods, all of which have to be learned.

Solution

From a product-design perspective, we built a set of real-time platform capabilities that make developing a real-time feature as easy as developing an offline table in ODPS. The advantage of the product is that users only need to know SQL to develop real-time features:

no need to understand the real-time data source;
no need to understand the underlying storage engine;
real-time feature data can be queried with SQL alone, without learning each engine's query method.

The whole real-time development product chains the Jiguang platform, the Dolphin engine, the Blink engine, and the storage engine together, giving users an end-to-end development experience without exposing technical details unrelated to their own work.

Introduction to the related platforms:

Dolphin intelligent acceleration analysis engine: Dolphin originated from the Alimama data marketing platform Dharma Disk (DMP). On top of a general OLAP MPP computing framework, it applies extensive optimizations in storage, indexing, and computing operators for the typical computations of marketing scenarios (tag-based audience selection, insight analysis, etc.), achieving substantial improvements in computing performance, storage cost, and stability. Dolphin is positioned as an acceleration engine: data storage and computing operators rely on underlying engines such as ODPS and Hologres. In plug-in form, it has completed operator integration and underlying storage and index optimization in Hologres, achieving order-of-magnitude improvements in computing performance and supported business scale for specific computing scenarios. Dolphin's core computing capabilities currently include a cardinality computing kernel, an approximate computing kernel, a vector computing kernel, SQL result materialization, and cross-DB access. Dolphin also implements SQL translation and optimization, automatically converting the user's original SQL into the optimized underlying storage format and computing operators, so users do not need to care about the underlying storage and computing mode and only need to write SQL against the original data tables, which greatly improves convenience.

Jiguang consumer operation platform: Jiguang is a one-stop R&D platform for accelerated marketing scenarios. Through productization it exposes the characteristic engine capabilities to users more effectively. The featured scenarios Jiguang supports include very-large-scale tag intersection and difference (tens of billions of tags, audience output in milliseconds), crowd insight (second-level queries over hundreds of billions of records), second-level effect attribution (event analysis, attribution analysis), and real-time, million-level audience targeting. On top of the marketing data engine, Jiguang provides one-stop operations and maintenance control, data governance, and self-service access, making it easier to use. Jiguang has also accumulated common data-engine templates for search and display advertising, including cardinality-calculation templates, report templates, attribution templates, crowd-insight templates, vector-calculation templates, approximate-calculation templates, and real-time delivery templates; based on these mature business templates, users can get started at zero cost without writing code.

Based on the current business needs, the real-time data sources and the storage data sources have been encapsulated.

Implemented real-time feature operators:

concat_id:

Meaning: from the records in the input table, select a field and concatenate its values sorted by timestamp in descending order; parameters allow deduplication by id or by timestamp, and users can keep only the top-k values.
Example of use: see the illustrative sketch below.
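Purely to illustrate these semantics, here is a minimal concat_id-style aggregate function written against the open-source Flink Table API; the in-house operator's actual name, signature, and configuration parameters may differ, and k is hard-coded to 10 in this sketch:

```java
import org.apache.flink.table.functions.AggregateFunction;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Concatenates a field's values ordered by timestamp descending, deduplicated, top-k. */
public class ConcatIdSketch extends AggregateFunction<String, ConcatIdSketch.Acc> {

    /** Accumulator: the (value, timestamp) pairs seen so far. */
    public static class Acc {
        public List<String> values = new ArrayList<>();
        public List<Long> timestamps = new ArrayList<>();
    }

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    /** Called by the framework once per input row. */
    public void accumulate(Acc acc, String value, Long ts) {
        if (value != null && ts != null) {
            acc.values.add(value);
            acc.timestamps.add(ts);
        }
    }

    /** Emit the deduplicated, timestamp-descending, top-k concatenation. */
    @Override
    public String getValue(Acc acc) {
        int k = 10;
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < acc.values.size(); i++) {
            order.add(i);
        }
        order.sort(Comparator.comparing((Integer i) -> acc.timestamps.get(i)).reversed());
        return order.stream()
                .map(i -> acc.values.get(i))
                .distinct()
                .limit(k)
                .collect(Collectors.joining(","));
    }
}
```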

Aggregation operator:

Meaning: from the records in the input table, select a field and compute its sum, average, or count over a specified time range.
Example of use: analogous to the hourly-window SQL sketched earlier in this section.

Summary

Driven by the real-time feature requirements of B-side algorithms, we built a real-time feature output system based on Blink SQL + UDFs. It translates the SQL entered by the user, generates a Blink SQL streaming task on the Bayes platform, and writes the produced real-time feature data into iGraph. Along the way we accumulated basic capabilities such as the Blink-to-iGraph writer component, the concat_id operator, and the aggregation operators, laying the foundation for the later Dolphin streaming real-time feature system, supporting multiple ways of extending feature operators, and allowing such user needs to be met quickly.

3.3 Keyword batch synchronization

Business background

Every day many merchants join Zhitongche (the through-train) through different channels, so there is considerable room for onboarding new customers; on the other hand, there is also a lot of room to optimize the low-activity portion of existing customers. System keyword buying is an important lever for both new-customer onboarding and low-activity promotion. The hope is that more frequent keyword updates (moving from day-level to hour-level) for new and low-activity through-train customers will help target customers try more keywords on their ads, keep the good ones and drop the bad ones, and thereby promote customer activation.

Based on this requirement, we added an hour-level update link on top of the existing day-level offline link to support updates of the word packages under standard plans and the system keyword updates of smart plans. The hourly update volume is on the order of ten million messages: Blink reads the request parameters from ODPS, calls the FaaS function service for each request, and writes each result to an ODPS output table. Updates run every two hours between 08:00 and 22:00, and a single round adds up to 5 million keywords and deletes up to 5 million.

Main challenges

Blink batch jobs need hour-level scheduling;
FaaS function calls need throttling.

Solution

A Blink UDF makes the HSF call to the function service;
the UDF throttles with RateLimiter, so the QPS against the function service can be strictly controlled through the node parallelism (a sketch follows this list);
shell scripts configured on the DataWorks platform schedule the batch computing tasks on the Bayes platform.
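A sketch of such a throttled UDF in open-source Flink terms; FunctionServiceClient is a hypothetical stand-in for the HSF/FaaS client, and the overall QPS against the service is roughly the per-subtask QPS multiplied by the job parallelism:

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;

/** Calls an external function service for each input row, throttled per subtask. */
public class ThrottledCallUdf extends ScalarFunction {

    /** Hypothetical synchronous client for the keyword function service. */
    public interface FunctionServiceClient {
        String call(String requestJson);
    }

    private final double perSubtaskQps;
    private transient RateLimiter limiter;
    private transient FunctionServiceClient client;

    public ThrottledCallUdf(double perSubtaskQps) {
        this.perSubtaskQps = perSubtaskQps;
    }

    @Override
    public void open(FunctionContext context) {
        // One limiter per parallel subtask; total QPS ≈ parallelism * perSubtaskQps.
        limiter = RateLimiter.create(perSubtaskQps);
        client = requestJson -> "{\"status\":\"ok\"}";   // replace with the real HSF/FaaS client
    }

    /** Invoked once per ODPS row; blocks until the limiter grants a permit. */
    public String eval(String requestJson) {
        limiter.acquire();
        return client.call(requestJson);
    }
}
```

Registered by instance (for example tEnv.createTemporarySystemFunction("throttled_call", new ThrottledCallUdf(100))), the batch job body then reduces to an INSERT INTO the ODPS result table that selects throttled_call(request_json) from the ODPS request table; the function and table names here are illustrative.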
Summary

For this requirement, Blink SQL in Batch mode was used to build this near-real-time update link and to open up the scheduling path for this kind of batch job, laying the groundwork for subsequent productization of batch jobs.

4. Future outlook

For B-side algorithm businesses, the Dolphin engine team has designed and developed the Dolphin streaming link, which makes developing real-time features on the Jiguang platform as simple as developing an offline table in ODPS: users do not need to understand the real-time data source or the underlying storage engine, and can query real-time feature data with SQL. However, B-side algorithm businesses also include batch-processing jobs like those described in this article. Today these require developing Blink batch SQL, Blink streaming Batch-mode jobs, ODPS UDFs, and Java code, plus scheduling scripts, before the whole project can be packaged and handed to the algorithm team. In the future we hope users can develop batch computing services on the Jiguang platform by themselves, reducing algorithm development cost and providing a scalable, low-cost batch computing capability that supports rapid business iteration and helps businesses land results quickly.
