Real-Time Data Warehouse Development Trends, Seen Through Alibaba's Core Scenarios

One: Real-time data warehouses have become a standard configuration for business

The first trend is that real-time data warehouses have become standard configurations.

Businesses have ever-higher requirements for timeliness and flexibility, which makes real-time data a hard requirement. The large advantages of real-time data warehouses in cost and flexibility lead businesses to choose them first as the platform for producing, storing, and using real-time data. Within Alibaba, Hologres serves about 90% of business units (BUs), with a cluster size exceeding 600,000 cores and a growth rate of 100%. Among these businesses, the more common real-time data warehouse scenarios include:

1. Digital operations: Upstream, the warehouse connects to Flink for stream processing; downstream, it connects to BI tools, large-screen dashboards, and similar consumers, so that business teams can develop and launch on a self-service basis. This greatly improves development efficiency and flexibility and supports a what-you-see-is-what-you-get development experience.

2. Network traffic analysis and metrics analysis: By storing and monitoring network traffic and other metrics data in real time, the system can quickly warn of and locate potential equipment failures. Queries over trillions of records respond within seconds, and faults are detected within seconds.

3. Real-time logistics tracking: Logistics information is tracked through the real-time data warehouse, so that logistics status is updated and queryable in real time.

Beyond these relatively common scenarios, Hologres is also used in many atypical real-time data warehouse scenarios because of its analysis service integration (Hybrid Serving/Analytical Processing, hereinafter HSAP) capabilities, together with the corresponding high-speed, purely real-time write and point-query capabilities. For example:

4. Audience selection for merchant advertising: Hologres provides high-QPS, low-latency audience selection and ad delivery services for merchants (to B).

5. Unmanned vehicle delivery: Hologres carries the order and logistics index information for the goods on each unmanned vehicle and reports logistics information in real time to B-side post stations, helping station owners complete tasks such as intelligent parcel sorting and mobile drop-in. On the user side, the system then dispatches delivery capacity to achieve "timed door-to-door delivery, delivery to the building".

6. Feature storage and sample storage in search and recommendation: The powerful point-query capability of Hologres is used to implement real-time feature storage, real-time sample storage, and real-time analysis of algorithm effectiveness.

7. Customer full-link experience: The customer service department stores customers' multi-channel data in Hologres and provides various detailed query capabilities directly to consumers (to C).


There are many similar scenarios. Being able to both "see" and "use" data in real time has become a driving force behind the rapid development of enterprises.

Two: Real-time data warehouses support online production systems

The second trend is that the real-time data warehouse is increasingly becoming a part of the production system.

Traditionally, a real-time data warehouse is a non-production system. It mainly serves internal customers, and although large-screen dashboards are highly important, the real-time data warehouse is not on the critical path of production. In other words, if the real-time data warehouse is unavailable, the impact on customers is small. This is why most real-time data warehouse products lag well behind systems such as databases in high availability, resource isolation, disaster recovery, and similar capabilities.

Traditionally, external services are provided through offline or stream processing plus result lookup; that is, the critical link that interacts with users is the result lookup, hosted by systems such as HBase, Redis, and MySQL. This model is simple and reliable, but its limitations are also large: the service functions it can provide are very limited and inflexible. Businesses are eager to open internal real-time data warehouse capabilities to external customers (to B, to C) in a controlled manner, and to keep data and logic consistent between internal and external systems. The scenarios mentioned above, such as Alibaba advertising, unmanned vehicle delivery, and customer full-link experience, are all to B or even to C cases.

As the real-time data warehouse is offered as a service, users have raised their requirements for its concurrency, availability, and stability. This is where Hologres has focused its efforts over the past year, introducing capabilities such as multiple replicas, hot upgrades, fast failover, resource isolation, read-write separation, and disaster recovery to achieve production-grade high availability. These capabilities were put to good use in this year's Double 11. A few examples:

Last year, Alibaba's Chief Customer Office (CCO) ensured high availability through dual-link writes and redundant storage. For this year's Double 11, it used the Hologres native high-availability solution and removed the manually maintained dual link. This saved the manpower spent on real-time task development and data comparison for the backup link and reduced data inconsistency during link switching. Overall development labor fell by 200 person-days, a decrease of more than 50% compared with last year; more than 100 backup-link jobs for critical real-time assurance were eliminated, and computing resources were reduced by 2,000 CU.

Alibaba's Data Technology and Product Department (hereinafter DT) uses the Hologres read-write separation solution so that high-throughput writes and flexible queries do not interfere with each other. Analysis query QPS increased by 80%, while query jitter was significantly reduced.

We believe that turning real-time data warehouses into production systems is an inevitable trend, and that all real-time data warehouse products will gradually increase their development investment in this area.

Three: Analysis Service Integration (HSAP)

The third trend is the integration of analytical services (HSAP).

Hologres pioneered this direction, which originated from the strong demand for analysis service integration among businesses within Alibaba Group. The best practices for analysis service integration were first implemented at Alibaba, but we are also seeing more and more products and enterprises in the industry advocating and practicing it.

Analysis service integration (HSAP) can be understood from several levels:

At the most basic level, users can use one technology stack (Flink + Hologres) to handle both ad-hoc query analysis (internal) and online services (internal, to B, to C), thereby reducing development and maintenance costs. Traditionally, a real-time data warehouse implements ad-hoc queries while a Lambda architecture implements online services. The two differ completely in technology stack, data link, development, and maintenance, yet the data they process often comes from the same source. The result is a large amount of redundant development, and data consistency also becomes a big problem. Using a unified technology stack to meet both needs makes development, operations and maintenance, and governance simpler.

Take Alibaba CCO as an example. Data is first written into a Hologres row-store table, which offers high write throughput, fast primary key queries, and a low binlog update cost. Flink then consumes the table's binlog for secondary processing and stores the result in a Hologres column-store table, which is fast for statistical queries. The row-store table serves online services and point lookups, and the column-store table provides analysis capabilities.
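
The flow described above can be sketched in a few lines of Python. This is only a toy in-memory model under stated assumptions: `RowTable`, `ColumnTable`, and `process_binlog` are hypothetical names, the binlog event shape is invented for illustration, and none of it reflects the actual Hologres or Flink APIs.

```python
from collections import defaultdict


class RowTable:
    """Toy row-store table: keyed upserts plus an append-only change log."""

    def __init__(self):
        self.rows = {}    # primary key -> latest row
        self.binlog = []  # ordered change events (hypothetical format)

    def upsert(self, key, row):
        before = self.rows.get(key)
        self.rows[key] = row
        self.binlog.append({"key": key, "before": before, "after": row})

    def get(self, key):
        # Point lookup by primary key, the "online service" path.
        return self.rows.get(key)


class ColumnTable:
    """Toy column-store table: one list per column, easy to scan and aggregate."""

    def __init__(self, columns):
        self.columns = {name: [] for name in columns}

    def append(self, record):
        for name, values in self.columns.items():
            values.append(record[name])

    def sum_by(self, group_col, value_col):
        totals = defaultdict(int)
        for group, value in zip(self.columns[group_col], self.columns[value_col]):
            totals[group] += value
        return dict(totals)


def process_binlog(binlog, sink):
    """Stand-in for the stream job: retract the old value, then add the new one."""
    for event in binlog:
        before, after = event["before"], event["after"]
        if before is not None:
            sink.append({"channel": before["channel"], "minutes_delta": -before["minutes"]})
        sink.append({"channel": after["channel"], "minutes_delta": after["minutes"]})


# Made-up customer-service tickets, written to the row table.
tickets = RowTable()
tickets.upsert("t1", {"channel": "chat", "minutes": 5})
tickets.upsert("t2", {"channel": "phone", "minutes": 12})
tickets.upsert("t1", {"channel": "chat", "minutes": 8})  # update to an existing key

analytics = ColumnTable(["channel", "minutes_delta"])
process_binlog(tickets.binlog, analytics)

print(tickets.get("t1"))                             # serving path
print(analytics.sum_by("channel", "minutes_delta"))  # analysis path
```

The point of the sketch is the division of labor: the keyed table answers lookups on the latest state, while the columnar copy, fed from the change log, answers aggregate questions.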

At a higher level, HSAP means that users can use one copy of the data to implement both ad-hoc queries and online services on one platform, while still achieving good resource isolation and availability.

For example, for this year's Double 11 the DT department launched the Hologres read-write separation solution: two Hologres instances are responsible for real-time writes and real-time queries respectively but share the same underlying data storage, and multiple read instances are responsible for different types of queries. This guarantees isolation between reads and writes and between analysis queries and serving queries, while keeping only one copy of the data. This is the so-called One Data, Multi Workload.
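
As a rough structural illustration of One Data, Multi Workload, the sketch below models several compute instances that hold no data of their own and point at a single shared store. The `SharedStorage` and `Instance` classes are hypothetical and say nothing about how Hologres actually implements shared storage or isolation.

```python
class SharedStorage:
    """The single copy of the data that every instance reads from."""

    def __init__(self):
        self.rows = {}


class Instance:
    """Toy compute instance: it references shared storage and tracks its own load."""

    def __init__(self, name, storage, writable=False):
        self.name = name
        self.storage = storage
        self.writable = writable
        self.queries_served = 0  # load is counted per instance, not globally

    def write(self, key, row):
        if not self.writable:
            raise PermissionError(f"{self.name} is read-only")
        self.storage.rows[key] = row

    def read(self, key):
        self.queries_served += 1
        return self.storage.rows.get(key)


storage = SharedStorage()
writer = Instance("writer", storage, writable=True)
analysis_reader = Instance("analysis-reader", storage)
serving_reader = Instance("serving-reader", storage)

writer.write("order_1", {"status": "paid"})
print(serving_reader.read("order_1"))  # readers see the writer's data
print(analysis_reader.queries_served)  # load on one reader does not count against another
```

Each workload gets its own instance and its own load counter, yet there is still only one copy of the rows.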

In addition to the benefits above, another significant advantage of analysis service integration is that services launch much faster. After integration, the boundary between analysis and serving blurs, so developing a service differs little from developing an analysis; a service can be regarded as a simple analysis with a fixed pattern. The traditionally complex process of launching a service is thus greatly simplified, and urgent ad hoc requirements can be launched immediately without cumbersome procedures.

We believe that the concept of analysis service integration will take hold in more scenarios as products like Hologres develop. This will in turn feed back into HSAP products like Hologres, distilling the concepts, methodology, and supporting capabilities of HSAP into the product so that more users can benefit from HSAP more easily.

Four: Real-time data governance has become a hard requirement

The fourth trend is that real-time data governance becomes more and more important.

Real-time data is irresistibly attractive to enterprises, so they gradually increase their investment in real-time data warehouses, consciously or not. However, because of real-time requirements, the real-time data warehouses of many enterprises lack the rigorous methodology and management systems used for offline data warehouses. Without governance, large amounts of redundant or unreasonable data often lead to sharply rising costs and declining data credibility. In a very large enterprise like Alibaba, this cost is especially prominent, which makes governance a hard requirement for real-time data warehouses.

By governing data across the full link, including real-time data warehouses, offline data warehouses, stream computing, and message queues, it becomes possible to leave no data outside the reach of governance, saving costs while improving data quality and truly turning data into an asset of the business.

Five: Database-like real-time data warehouse

The fifth trend is the database-like transformation of real-time data warehouses.

Big data was born from the sublation of traditional databases, keeping some of their ideas and discarding others. Starting from NoSQL, big data products set out on a road independent of databases. But just as NoSQL evolved into NewSQL, the real-time data warehouses among big data products are also learning from databases and providing better database compatibility, so that users can adopt real-time data warehouse products at a lower cost.

This includes several aspects:

1. Operational SQL and compatibility with traditional databases in protocol and syntax, so that developers can connect and develop with familiar tools (BI tools, development tools, and so on). What big data has accumulated in this area still falls short of what databases have accumulated over decades. Many business developers are highly proficient with databases but find it hard to get started with big data, especially real-time data warehouses.

2. A data model and semantics closer to those of traditional databases. For example, traditional data warehouse products lack the concept of a primary key and often cannot guarantee atomic operations, which limits their use in many scenarios. ClickHouse, for instance, lacks a primary key in the database sense (what ClickHouse calls a primary key is something else and does not enforce uniqueness), so it is not well suited to database CDC synchronization scenarios. Over the past two years the big data industry has clearly strengthened this area. The most typical example is the addition of ACID capabilities to near-real-time data warehouses, represented by Delta Lake, Iceberg, and Hudi. Constrained by their architecture, however, these near-real-time ACID implementations hit performance and latency bottlenecks in frequent-update scenarios.

In Alibaba, a large number of scenarios require this primary-key-based update capability. Take the following internal scenarios as examples:

1. Real-time database synchronization: Synchronizing (mirroring) upstream sharded databases and tables and multiple business databases into a big data real-time data warehouse provides powerful analysis capabilities over business data. This requires handling purely real-time, high-frequency UPDATE and DELETE operations well.

2. UPDATE and DELETE (RETRACTION) operations generated by Flink computations: For example, when computing GMV, Flink generates UPDATE records when results change and, in some scenarios, RETRACTION (DELETE) records. The downstream system must handle both types of events well.

3. Businesses such as risk control, where the computation is completed by multiple jobs that jointly update one large wide table in real time, each job updating some of the fields. This requires the downstream system to support partial updates based on primary keys.

Traditionally, such workloads are handled by NoSQL systems such as HBase and Redis, or by RDS databases such as MySQL and PostgreSQL. However, NoSQL systems generally have weak analysis capabilities, while databases are limited in write performance and scale.
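
To make the three kinds of primary-key operations concrete, here is a minimal Python sketch of a keyed table that accepts whole-row upserts, deletes (retractions), and partial updates. The `KeyedTable` class, its operation names, and the sample data are all assumptions made for illustration; they are not a Hologres or Flink interface.

```python
class KeyedTable:
    """Toy in-memory table with a unique primary key (not a Hologres API)."""

    def __init__(self):
        self.rows = {}  # primary key -> dict of column values

    def apply(self, op, key, values=None):
        if op == "UPSERT":
            # INSERT or UPDATE: the new row replaces any existing row for the key.
            self.rows[key] = dict(values)
        elif op == "DELETE":
            # DELETE or RETRACTION: removing a missing key is a no-op.
            self.rows.pop(key, None)
        elif op == "PARTIAL":
            # Partial update: touch only the supplied columns, keep the rest.
            self.rows.setdefault(key, {}).update(values)
        else:
            raise ValueError(f"unknown operation: {op}")


# GMV per shop: a later result overwrites the earlier one, a retraction removes it.
gmv = KeyedTable()
gmv.apply("UPSERT", "shop_a", {"gmv": 100})
gmv.apply("UPSERT", "shop_a", {"gmv": 250})
gmv.apply("UPSERT", "shop_b", {"gmv": 40})
gmv.apply("DELETE", "shop_b")
print(gmv.rows)

# Risk control: two jobs fill different columns of the same wide row.
risk = KeyedTable()
risk.apply("PARTIAL", "user_1", {"login_score": 0.2})
risk.apply("PARTIAL", "user_1", {"payment_score": 0.9})
print(risk.rows)
```

Because every operation is addressed by a unique key, replaying the same change stream always converges to one row per key, which is the property that CDC synchronization and multi-job wide-table updates rely on.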

These workloads are common in big data processing. The challenge at Alibaba is that, because of the huge scale (especially in scenarios such as Double 11), there are strict requirements on the performance and latency of primary-key-based updates.

Hologres has been designed with these two points in mind from the very beginning. Hologres is fully compatible with the PostgreSQL 11 protocol, syntax, functions, and so on, and many PostgreSQL extensions (such as PostGIS) can be used directly. At the same time, Hologres provides a complete primary key concept and powerful update capabilities, along with single-SQL ACID. During this year's Double 11, some businesses measured real-time write and update performance of more than 3.5 million per second. These capabilities greatly broaden the application scenarios of real-time data warehouses, moving scenarios traditionally carried by NoSQL and RDS onto real-time data warehouses and giving users more powerful analysis and processing tools.

The database-like transformation of the real-time data warehouse does not make it equivalent to an HTAP database. Compared with HTAP, HSAP weakens transaction capabilities, because serving scenarios do not need the full transaction capabilities of a traditional database. Giving these up brings large improvements in real-time write performance and query performance, as well as better scalability (because no global transaction manager is needed). HSAP is therefore better suited to big data scenarios than HTAP.

Six: Agile real-time data warehouse development

The last trend is the change in development methodology. The development of real-time data warehouses is becoming more and more agile to adapt to the flexible and changeable analysis scenarios.

In the past, the development of data warehouses often followed the classic methodology, adopting the ODS->DWD->DWS->ADS layer-by-layer development method, and using event-driven or micro-batch scheduling between layers. Layering brings better semantic layer abstraction and data reuse, but it also increases the dependence on scheduling, reduces the timeliness of data, and reduces the agility of flexible data analysis.

Real-time data warehouses drive real-time business decisions, and decision-making usually requires rich contextual information. The traditional ADS development approach, which is highly tailored to each business need, is therefore being challenged: thousands of ADS tables are hard to maintain and have low utilization. More business teams want to perform multi-angle comparative analysis directly on DWS or even DWD data, which places higher demands on the computing, scheduling, and I/O efficiency of the query engine.

Thanks to query engine optimizations such as vectorized rewriting of computing operators, fine-grained indexing, asynchronous execution, and multi-level caching, the computing power of Hologres has improved greatly with each version. As a result, we see more and more users adopting agile development methods. In the pre-computation stage, it is enough to clean data for quality, perform basic joins that widen large tables, and model up to DWD and DWS, which reduces the number of construction tasks. Flexible queries are then executed in the interactive query engine at analysis time. This second-level interactive analysis experience supports the important trend of democratizing data analysis.
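
The difference between maintaining one ADS table per report and querying a DWD-level wide table on demand can be illustrated with a small Python sketch. The `dwd_orders` rows and the `query` helper are made up for this example; in practice the aggregation would be an SQL query executed by the interactive engine.

```python
from collections import defaultdict

# Hypothetical DWD-level wide rows: cleaned and joined, but not pre-aggregated.
dwd_orders = [
    {"city": "Hangzhou", "category": "books", "channel": "app", "amount": 30},
    {"city": "Hangzhou", "category": "food", "channel": "web", "amount": 20},
    {"city": "Beijing", "category": "books", "channel": "app", "amount": 50},
    {"city": "Beijing", "category": "food", "channel": "app", "amount": 10},
]


def query(rows, group_by, measure, where=None):
    """Aggregate at query time instead of maintaining one ADS table per report."""
    totals = defaultdict(int)
    for row in rows:
        if where is None or where(row):
            key = tuple(row[col] for col in group_by)
            totals[key] += row[measure]
    return dict(totals)


# Two different "reports" served from the same detail data, with no new pipeline.
by_city = query(dwd_orders, ["city"], "amount")
app_by_category = query(
    dwd_orders, ["category"], "amount", where=lambda r: r["channel"] == "app"
)
print(by_city)
print(app_by_category)
```

A new angle of analysis is just another call with different grouping columns or filters, which is why this style depends so heavily on the query engine being fast enough to scan detail data interactively.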

Seven: Summary

Alibaba was one of the early companies in the industry to use real-time data warehouses to process massive amounts of data, and the development of real-time data warehouses at Alibaba has gradually entered deep waters. Whether in becoming production systems, integrating analysis and serving, real-time data governance (platformization), or becoming database-like and agile, real-time data warehouses are iterating rapidly alongside fast-changing business needs. They shine ever brighter in annual events such as Double 11 and have become an indispensable partner and assistant to the business.

Business drives technology, and data brings value. The real-time data warehouse Hologres has grown and been refined together with Alibaba's core businesses, from multi-dimensional complex OLAP analysis to high-QPS point queries, and from high-performance real-time writes and updates to high availability. It provides a unified analysis and serving outlet for the big data platform and meets one-stop real-time data warehouse needs for storage, development, management, and serving across the whole process and all scenarios.

We believe that these real-time data warehouse trends also apply to the entire industry. We will gradually bring the capabilities accumulated during Alibaba's Double 11 into our cloud products to help customers make good use of real-time data warehouses and grow together!
