Looking at the Development Trend of Real-Time Data Warehouses from the Core Scenarios of Alibaba

By Guobei, Senior Technical Expert of Alibaba Cloud and Head of Hologres

This article examines the problems facing real-time data warehouses and their core development trends to help you choose products and plan your data warehouse.

The real-time data warehouse is a highly popular concept in big data, probably as popular as the lakehouse. After more than ten years of development, big data has become standard for every company. Traditionally, three components form the standard architecture for big data processing, the Lambda architecture: offline data warehouses (open-source products represented by Hive and Spark; closed-source products represented by Alibaba Cloud MaxCompute, Snowflake, Amazon Redshift, and Google BigQuery; and products from traditional IT vendors such as Vertica, Oracle, and HANA), stream computing (represented by Apache Flink and Spark Structured Streaming), and data service layers (Apache HBase, MySQL, Elasticsearch, Redis, etc.). The Lambda architecture provides serving capabilities for real-time data. However, it typically suffers from complex development, data redundancy, and inflexible analysis.

In recent years, real-time data warehouses represented by ClickHouse, Apache Doris, and Alibaba Cloud Hologres have emerged. By writing detailed data in real time and supporting flexible interactive queries, they remove the need for the Lambda architecture and strike a good balance among real-time performance, flexibility, cost, management, and operations and maintenance.

With the successful conclusion of the Double 11 Global Shopping Festival in 2021, real-time data warehouse technology has again been put into practice and advanced in the Double 11 scenarios of Alibaba. From early siloed development, to data warehouses based on domain-oriented hierarchical modeling, to the new all-in-one architecture that integrates analysis and serving, development efficiency has gradually improved, data quality has risen, and more technological innovations have emerged. All of these point to possibilities and trends for the future development and application of data warehouses.

Let's talk about some trends in the development of real-time data warehouses seen during Alibaba Double 11.

Real-Time Data Warehouses Have Become Standard for Business

The first trend is that real-time data warehouses have become standard.

Business requirements for timeliness and flexibility keep rising, making real-time data a hard requirement. Their large advantages in cost and flexibility make real-time data warehouses the first choice as the platform for producing, storing, and using real-time data. Hologres serves about 90% of business units at Alibaba. The cluster size exceeds 600,000 cores and maintains a growth rate of 100%. Common real-time data warehouse scenarios across these businesses include:

1) Digital Operation: In this scenario, the upstream is connected to Flink for data streaming and processing, and the downstream is connected to business intelligence (BI) tools and data dashboards to enable self-service development and launch of the business. It improves development efficiency and flexibility and supports a What You See Is What You Get development experience.

2) Network Traffic Analysis and Metrics Analysis: By storing and monitoring network traffic and other metrics data in real time, the system can quickly generate alerts and locate potential device faults. Even when querying trillions of records, it can find and respond to faults within seconds.

3) Real-Time Logistics Tracking: Real-time data warehouses are used to track logistics information in real-time. This ensures the real-time update and real-time query of logistics flow status.

In addition to these relatively common scenarios, Hologres is used in many atypical real-time data warehouse scenarios thanks to its Hybrid Serving/Analytics Processing (HSAP) capability and its high-speed, purely real-time write and point query capabilities. Examples:

4) Crowd Selection for Merchants: Hologres provides high-QPS and low-latency crowd selection and advertising services for merchants (to B).

5) Delivery from Driverless Vehicles: Hologres stores the order, logistics, and other metric information of goods on driverless vehicles. For post stations (to B), Hologres reports logistics information in real time and helps station owners complete tasks such as intelligent parcel sorting and mobile delivery. For users, the scheduling capacity of the system enables scheduled door-to-door service and delivery to the building.

6) Feature Store and Sample Store in Search Recommendation: The powerful point query capability of Hologres enables real-time features (feature store), real-time samples (sample store), and real-time analysis of algorithm effects.

7) End-to-End Customer Experience: The customer service department stores customer-related multi-channel data in Hologres to directly provide customers with various detailed query capabilities (to C).

In scenarios like these, making real-time data visible and usable has become a driving force for the rapid development of enterprises.

Real-Time Data Warehouses Support Online Production System

The second trend is that real-time data warehouses are increasingly becoming a part of the production system.

Traditionally, a data warehouse, including a real-time one, is a non-production system. Since it mainly serves internal customers, it is not on the critical path of production, even though dashboards are highly important. If the real-time data warehouse is unavailable, the impact on external customers is negligible. This is why most real-time data warehouse products lag far behind databases and similar systems in high availability, resource isolation, and disaster recovery.

Traditionally, external services are provided through offline or stream processing followed by point queries on the results. That means the critical path for interaction with users is the result point query served by systems such as HBase, Redis, and MySQL. Despite its simplicity and reliability, this model is severely limited: the serving functions it offers are few and inflexible. Businesses are eager to open the internal real-time data warehouse capability to external customers (to B and to C) in a controllable manner and to keep data and logic consistent between internal and external systems. The scenarios listed above, such as Alibaba advertising, delivery from driverless vehicles, and end-to-end customer experience, are all cases of opening to B and even to C.

Since real-time data warehouses are provided as a service, users have stricter requirements for the concurrency, availability, and stability of the service. This is also where Hologres has focused over the past year. In 2021, Hologres introduced capabilities to achieve production-level high availability, such as multi-replica, hot upgrade, fast failover, resource isolation, read/write splitting, and disaster recovery. These capabilities were put to good use during Double 11 2021. For example:

In 2020, the Chief Customer Office of Alibaba (CCO) relied on dual-pipeline writes and redundant storage to ensure high availability. During Double 11 2021, the native high availability scheme of Hologres replaced the manually maintained dual pipeline. This eliminated the effort of developing real-time tasks and comparing data on the standby pipeline, and reduced data inconsistency during pipeline switching. Overall development labor cost decreased by 200 workdays, more than 50% compared to 2020. It also removed over 100 backup pipeline tasks used for real-time protection during major events, saving 2,000 CUs of computing resources.
The Data Technology Office of Alibaba (DT) uses the read/write splitting solution of Hologres so that high-throughput writes and flexible queries do not interfere with each other. Analysis query QPS increased by 80% while query jitter decreased.

We believe that hardening real-time data warehouses into production-grade systems is an inevitable trend, and each real-time data warehouse product will gradually invest more in this area.

Hybrid Serving/Analytics Processing (HSAP)

The third trend is the Hybrid Serving/Analytics Processing (HSAP).

Hologres is a pioneer in this area because the business of Alibaba has strong demands for HSAP, and the best practices for HSAP were first implemented within Alibaba. However, we also see more products and enterprises in the industry advocating and practicing HSAP.

HSAP can be understood from several levels:

At the most basic level, you can use one technology stack (Flink and Hologres) to handle both ad hoc query analysis (internal) and online services (internal, to B, and to C), which reduces the costs of development and operations and maintenance. Traditionally, real-time data warehouses focus on ad hoc queries, while the Lambda architecture implements online services. The two differ completely in technology stack, data pipeline, development, operations, and maintenance. However, they often process the same source data, which results in a large amount of redundant development work and makes data consistency a serious problem. A unified technology stack meets both requirements and simplifies development, operations, maintenance, and management.

Take CCO scenarios as an example. Data is first written to Hologres row store tables, which offer high write throughput, fast primary key queries, and low binlog overhead in update scenarios. Flink then consumes the binlogs of these tables, processes the data a second time, and stores the results in Hologres column store tables for analysis. (The column store is fast for statistical queries.) Row store tables provide online services or point queries, and column store tables provide analysis capabilities.
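The row store, binlog, and column store flow described above can be sketched in a few lines of Python. This is a toy in-memory model, not Hologres or Flink code: the names `RowTable`, `ColumnTable`, and `consume_binlog`, the event layout, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class RowTable:
    """Toy row-oriented table keyed by primary key, with an append-only change log."""

    def __init__(self):
        self.rows = {}    # primary key -> latest row (dict)
        self.binlog = []  # ordered change events: (op, old_row, new_row)

    def upsert(self, key, row):
        old = self.rows.get(key)
        self.rows[key] = row
        op = "INSERT" if old is None else "UPDATE"
        self.binlog.append((op, old, row))

    def get(self, key):
        # Point query by primary key: the serving path.
        return self.rows.get(key)


class ColumnTable:
    """Toy column-oriented table: one list per column, suited to scans and aggregates."""

    def __init__(self, columns):
        self.columns = {name: [] for name in columns}

    def append(self, row):
        for name, values in self.columns.items():
            values.append(row[name])

    def sum_by(self, group_col, value_col):
        totals = defaultdict(int)
        for group, value in zip(self.columns[group_col], self.columns[value_col]):
            totals[group] += value
        return dict(totals)


def consume_binlog(binlog, sink, measure, offset=0):
    """Stand-in for the stream job: turn change events into signed delta rows."""
    for _op, old, new in binlog[offset:]:
        if old is not None:
            # Retract the previous value so that sums over the sink stay correct.
            sink.append({**old, measure: -old[measure]})
        sink.append(new)
    return len(binlog)  # next offset to resume from


tickets = RowTable()
tickets.upsert("t1", {"channel": "chat", "amount": 10})
tickets.upsert("t2", {"channel": "phone", "amount": 5})
tickets.upsert("t1", {"channel": "chat", "amount": 12})  # update of an existing key

analysis = ColumnTable(["channel", "amount"])
offset = consume_binlog(tickets.binlog, analysis, measure="amount")
by_channel = analysis.sum_by("channel", "amount")

print(tickets.get("t1"))  # serving: latest row for one key
print(by_channel)         # analysis: aggregate over all changes
```

The row table answers the point query from its latest state, while the column table only ever receives appended delta rows, which is one simple way to keep aggregates correct when upstream rows are updated.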

A higher level of HSAP is that you can use one copy of data on one platform to serve both ad hoc queries and online services while still achieving good resource isolation and availability.

For example, DT launched the read/write splitting solution of Hologres during Double 11 2021. Separate Hologres instances handle real-time writes and queries but share the same underlying data store, and multiple read instances handle different types of queries. This separates reads from writes and isolates analysis queries from serving queries while keeping only one copy of the data. That is the so-called One Data, Multi Workload.
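As a rough illustration of One Data, Multi Workload, the sketch below models one shared store with a write instance and two read instances that each have their own concurrency budget. It is a conceptual toy under stated assumptions: `SharedStore`, `WriteInstance`, `ReadInstance`, and the `max_concurrent` limit are hypothetical names, and the real isolation mechanism in Hologres is not described in this article.

```python
class SharedStore:
    """Single physical copy of the data, shared by every instance."""

    def __init__(self):
        self.rows = {}


class WriteInstance:
    """Compute that only ingests data into the shared store."""

    def __init__(self, store):
        self.store = store

    def write(self, key, row):
        self.store.rows[key] = row


class ReadInstance:
    """Read-only compute with its own concurrency budget (the isolation boundary)."""

    def __init__(self, name, store, max_concurrent):
        self.name = name
        self.store = store
        self.max_concurrent = max_concurrent
        self.running = 0

    def try_start(self):
        # Admission control is per instance, so one workload cannot starve another.
        if self.running >= self.max_concurrent:
            return False
        self.running += 1
        return True

    def finish(self):
        self.running -= 1

    def point_query(self, key):
        return self.store.rows.get(key)

    def scan_sum(self, column):
        return sum(row[column] for row in self.store.rows.values())


store = SharedStore()
writer = WriteInstance(store)
serving = ReadInstance("serving", store, max_concurrent=2)
analytics = ReadInstance("analytics", store, max_concurrent=1)

writer.write("u1", {"clicks": 3})
writer.write("u2", {"clicks": 4})

analytics_admitted = analytics.try_start()  # takes the only analytics slot
analytics_rejected = not analytics.try_start()  # analytics is now saturated
serving_admitted = serving.try_start()  # serving has its own budget

profile = serving.point_query("u1")
total_clicks = analytics.scan_sum("clicks")
```

Both read instances see the writer's data immediately because they hold a reference to the same store, while saturation of the analytics instance does not block the serving instance.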

In addition to the benefits mentioned above, another significant advantage of HSAP is that services can be launched much faster. Because the boundary between analysis and serving blurs after integration, developing a service differs little from developing an analysis. A service can be regarded as simple analysis with a fixed pattern. Therefore, the traditionally complex process of launching a service is simplified. When an urgent, temporary need arises, a service can be launched immediately without a complicated process.

The concept of HSAP will be implemented in more scenarios as products like Hologres develop. These scenarios will in turn feed back into HSAP products like Hologres, helping to refine the concept and methodology and strengthen HSAP capabilities in the products, so that more users can benefit from HSAP.

Real-Time Data Management Becomes a Hard Requirement

The fourth trend is that real-time data management is becoming increasingly important.

Real-time data holds considerable appeal for enterprises, so they will, consciously or unconsciously, increase their investment in real-time data warehouses step by step. However, because of their timeliness requirements, real-time data warehouses often lack the rigorous methodology and management system of offline data warehouses. Without management, large amounts of redundant or unreasonable data accumulate, costs increase sharply, and data credibility falls. In super-large enterprises like Alibaba, this cost becomes especially visible, which makes data management a hard requirement for real-time data warehouses.

By managing data across the complete pipeline, including real-time data warehouses, offline data warehouses, stream computing, and message queues, you can ensure that no data escapes governance. This saves costs, improves data quality, and turns data into company assets.

Database-like Development of Real-Time Data Warehouses

The fifth trend is database-like real-time data warehouses.

Big data was born by keeping the useful parts of traditional databases and discarding the rest. Since NoSQL, big data products have developed separately from databases. However, just as NoSQL evolved into NewSQL, real-time data warehouses are now learning from databases and offering better compatibility with them. As a result, you can adopt real-time data warehouse products at a lower cost.

This includes several aspects:

Operations are based on SQL and are compatible with traditional databases in protocol and syntax, so developers can use regular tools (BI tools, development tools, etc.) for development. The big data ecosystem in this area still lags behind what databases have accumulated over decades. Many business personnel are skilled with databases but find big data, especially real-time data warehouses, difficult to get started with.
Data models and semantics move closer to traditional databases. For example, traditional data warehouse products lack the concept of primary keys and often do not guarantee atomicity, which limits their application in many scenarios. For instance, ClickHouse lacks a primary key in the database sense (the primary key in ClickHouse is a different thing and is not a unique constraint), so it is not well suited to CDC synchronization from databases. The big data industry has improved in this area in recent years. The most typical example is that near-real-time data warehouses represented by Delta Lake, Iceberg, and Hudi provide atomicity, consistency, isolation, and durability (ACID). However, constrained by their architecture, the performance and latency of this near-real-time ACID become bottlenecks in frequently updated scenarios.

At Alibaba, a large number of scenarios require the update capability based on primary keys. For example:

Real-Time Database Synchronization: Upstream database shards and multiple business databases are synchronized (mirrored) in real time into one large real-time data warehouse, which provides powerful analysis capabilities for business data. This requires handling purely real-time, high-frequency UPDATE and DELETE operations well.
The UPDATE and DELETE (RETRACTION) Operations Generated by Flink Computing: For example, when Flink counts GMV (Gross Merchandise Volume), it generates an UPDATE record each time the result changes, but in some scenarios, it generates a RETRACTION record (DELETE). This requires the downstream system to handle both types of events well.
A business calculation (such as risk control) is completed by multiple jobs that update one large wide table in real time, with each job updating some of the fields. This requires the downstream system to support partial updates based on primary keys.
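The three requirements above share one mechanism: applying keyed change events to a table. A minimal sketch follows, assuming a hypothetical event layout with `op`, `key`, and `fields`; it is not the Flink changelog format or a Hologres API.

```python
def apply_event(table, event):
    """Apply one change event to a table keyed by primary key.

    INSERT and UPDATE are both treated as upserts that touch only the
    fields carried by the event, so several jobs can each maintain their
    own columns of the same wide row. DELETE (a retraction) removes the row.
    """
    op, key = event["op"], event["key"]
    if op == "DELETE":
        table.pop(key, None)  # deleting a missing key is a no-op
    elif op in ("INSERT", "UPDATE"):
        table.setdefault(key, {}).update(event["fields"])
    else:
        raise ValueError(f"unknown op: {op}")


wide_table = {}
events = [
    # Job A maintains the GMV column.
    {"op": "INSERT", "key": "order-1", "fields": {"gmv": 100}},
    # Job B maintains the risk column of the same row.
    {"op": "UPDATE", "key": "order-1", "fields": {"risk_score": 0.2}},
    # Job A emits a new result for the same key.
    {"op": "UPDATE", "key": "order-1", "fields": {"gmv": 120}},
    # A row that is later retracted.
    {"op": "INSERT", "key": "order-2", "fields": {"gmv": 50}},
    {"op": "DELETE", "key": "order-2"},
]
for event in events:
    apply_event(wide_table, event)

print(wide_table)
```

Treating INSERT as an upsert makes replaying the same event harmless, which matters when an upstream job restarts and resends part of its output.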

Traditionally, such services are undertaken by NoSQL systems (such as HBase and Redis) or databases (such as MySQL and PostgreSQL). However, NoSQL is generally weak in analysis, while databases are limited in write performance and scale.

Such workloads are common in big data processing. The challenge for Alibaba lies in its huge scale, especially in scenarios like Double 11, which impose strict requirements on the performance and latency of updates based on primary keys.

Hologres considered these two points from the beginning of its design. Hologres is completely compatible with the protocol, syntax, and functions of PostgreSQL 11. Many PostgreSQL extensions, such as PostGIS, can be used directly. Meanwhile, Hologres provides a complete concept of primary keys, powerful update capabilities, and ACID for single SQL statements. During Double 11 2021, some businesses measured real-time write and update throughput of over 3.5 million per second. These capabilities extend the application scenarios of real-time data warehouses and move scenarios traditionally hosted on NoSQL and RDS onto real-time data warehouses, providing users with more powerful analysis and processing tools.
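To make primary keys and single-statement atomicity concrete, here is a small sketch that uses the SQLite engine bundled with Python as a stand-in, because it runs without a server. The `orders` table and its columns are hypothetical. The `INSERT ... ON CONFLICT ... DO UPDATE` form is PostgreSQL-style upsert syntax that SQLite also accepts; whether the same statements run unchanged on Hologres is an assumption to check against its documentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT, amount INTEGER)"
)

# Upsert keyed on the primary key: insert a new row or update the existing one.
UPSERT = """
INSERT INTO orders (order_id, status, amount) VALUES (?, ?, ?)
ON CONFLICT (order_id) DO UPDATE
SET status = excluded.status, amount = excluded.amount
"""

with conn:  # commits on success, rolls back on error
    conn.execute(UPSERT, ("o1", "created", 100))
    conn.execute(UPSERT, ("o1", "paid", 100))  # same key: updates in place

# A single multi-row statement is atomic: the second row collides with the
# existing key o1, so the statement fails as a whole and o2 is not kept.
try:
    with conn:
        conn.execute(
            "INSERT INTO orders (order_id, status, amount) VALUES (?, ?, ?), (?, ?, ?)",
            ("o2", "created", 30, "o1", "created", 999),
        )
    statement_failed = False
except sqlite3.IntegrityError:
    statement_failed = True

rows = conn.execute(
    "SELECT order_id, status, amount FROM orders ORDER BY order_id"
).fetchall()
print(rows)
```

The unique primary key is what lets the second write replace the first instead of producing a duplicate row, which is the property that CDC synchronization depends on.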

Database-like real-time data warehouses are not equivalent to Hybrid Transactional/Analytical Processing (HTAP) databases. Compared with HTAP, HSAP is weaker in transactional capability because the full transactional capabilities of traditional databases are not required in serving scenarios. Giving them up brings a big improvement in real-time write performance and query performance, and it improves scalability because a global transaction manager is not required. Therefore, HSAP is more suitable for big data scenarios than HTAP.

Agility in Real-Time Data Warehouse Development

The last trend is the change in development methodology. The development of real-time data warehouses is becoming more agile to adapt to the flexibility of analysis scenarios.

Previously, the development of data warehouses was often based on the classic methodology. It used the layer-by-layer development method of Operational Data Store (ODS) → Data Warehouse Detail (DWD) → Data Warehouse Summary (DWS) → Application Data Service (ADS), with event-driven or micro-batch scheduling between layers. Layering brings better semantic layer abstraction and data reuse, but it increases dependency on scheduling and reduces data timeliness and the agility of flexible data analysis.

Real-time data warehouses drive real-time business decision-making, which usually requires rich contextual information. Therefore, the traditional approach of building ADS tables heavily customized for each business has been severely challenged. Thousands of ADS tables are difficult to maintain, and their utilization is low. More business teams expect to compare and analyze data from multiple angles directly on DWS or DWD. This places higher requirements on the computing efficiency, scheduling efficiency, and I/O efficiency of a query engine.

With query engine optimizations such as vectorized rewriting of computing operators, refined indexing, asynchronous execution, and multi-level caching, the computing capability of Hologres has improved with each version. As a result, more users are adopting the agile development method. In the pre-computing stage, users only need to clean data for quality, perform basic joins that widen large tables, and build models at the DWD and DWS layers, which reduces the number of modeling layers. Flexible queries are then executed in the interactive query engine at analysis time. This interactive analytics experience, with responses in seconds, supports the important trend of data analysis democratization.
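The trade-off between customized ADS tables and flexible queries on a DWD wide table can be illustrated with a toy example. The table contents and the helper names `build_ads_gmv_by_city` and `adhoc_sum` are hypothetical; a real engine would run the equivalent SQL over far larger data.

```python
from collections import defaultdict

# A small DWD-style wide table: cleaned detail rows with dimensions already joined in.
dwd_orders = [
    {"city": "Hangzhou", "category": "books", "channel": "app", "amount": 30},
    {"city": "Hangzhou", "category": "toys", "channel": "web", "amount": 20},
    {"city": "Beijing", "category": "books", "channel": "app", "amount": 50},
]


def build_ads_gmv_by_city(rows):
    """Classic approach: precompute one ADS table per fixed question."""
    ads = defaultdict(int)
    for row in rows:
        ads[row["city"]] += row["amount"]
    return dict(ads)


def adhoc_sum(rows, group_by, measure, where=None):
    """Agile approach: aggregate the wide table at query time along any dimensions."""
    totals = defaultdict(int)
    for row in rows:
        if where is not None and not where(row):
            continue
        totals[tuple(row[dim] for dim in group_by)] += row[measure]
    return dict(totals)


# The ADS table answers only the question it was built for.
ads_gmv_by_city = build_ads_gmv_by_city(dwd_orders)

# New questions need no new table or scheduled job, only a new query.
by_category = adhoc_sum(dwd_orders, ("category",), "amount")
books_by_city_channel = adhoc_sum(
    dwd_orders,
    ("city", "channel"),
    "amount",
    where=lambda row: row["category"] == "books",
)

print(ads_gmv_by_city)
print(by_category)
print(books_by_city_channel)
```

The ad hoc path shifts cost from scheduled pre-computation to query time, which is why it depends on the query engine being fast enough to answer interactively.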

Summary

Alibaba was one of the first companies in the industry to apply real-time data warehouses to massive amounts of data, and its real-time data warehouse development has now entered a stage where the remaining problems are the hard ones. Whether in production-grade hardening, HSAP, real-time data management (platformization), database-like development, or agility, real-time data warehouses are iterating rapidly to keep pace with business requirements. They are also tested and improved by major annual events such as Double 11 and have become an indispensable partner and assistant for the business.

Hologres grows together with the core business of Alibaba as business drives technology and data adds value. From complex multi-dimensional Online Analytical Processing (OLAP) to high-QPS point queries, and from high-performance real-time writes and updates to high availability, Hologres provides a unified analysis service for big data platforms and covers the full process of storage, development, management, and serving for real-time data warehouses.

We believe these real-time data warehouse trends also apply to the entire industry. We will gradually release the capabilities accumulated during Double 11 through cloud products to help customers make good use of real-time data warehouses and grow together!
