E-MapReduce:Doris

Last Updated:Feb 05, 2025

Apache Doris is a high-performance, real-time analytical database that can be used in scenarios such as report analysis, ad hoc queries, and federated queries across data lakes. This topic describes Apache Doris.

Background information

For more information about Apache Doris, see Introduction to Apache Doris.

Scenarios

As shown in the following figure, after data from various sources is integrated and processed, it is usually stored in the real-time data warehouse Apache Doris and in an offline data lake or data warehouse, such as Apache Hive, Apache Iceberg, or Apache Hudi.

Figure: Apache Doris

Apache Doris is widely used in the following scenarios:

  • Report analysis:

    • Real-time dashboards

    • Reports for in-house analysts and managers

    • Highly concurrent user-oriented or customer-oriented report analysis, such as website analysis and ad reporting, which usually requires thousands of queries per second (QPS) and response times measured in milliseconds

  • Ad hoc queries: Analyst-oriented self-service analytics with irregular query patterns and high throughput requirements

  • Unified data warehouse construction: Apache Doris allows users to build a unified data warehouse on a single platform, which saves the trouble of handling complicated software stacks. A unified data warehouse that is built on Apache Doris replaces the old complex architecture, which consists of Apache Spark, Apache Hive, Apache Kudu, Apache HBase, and Apache Phoenix.

  • Federated queries across data lakes: Apache Doris performs federated analytics on data in Apache Hive, Apache Iceberg, and Apache Hudi by using external tables. This achieves outstanding query performance without the need to copy data.

Technical overview

The following figure shows the overall architecture of Apache Doris.

Figure: Architecture of Apache Doris

The architecture of Apache Doris is simple, with only two types of processes:

  • Frontend (FE): It is responsible for user request processing, query parsing and planning, metadata management, and node management.

  • Backend (BE): It is used to store data and execute query plans.

Both types of processes are horizontally scalable. A single cluster supports up to hundreds of machines and tens of petabytes of storage capacity. In addition, these two types of processes provide high availability of services and high reliability of data by using consensus protocols. This highly integrated architecture design greatly reduces the O&M cost of a distributed system.

The technology of Apache Doris is introduced from the following five aspects:

  • In terms of interfaces, Apache Doris adopts the MySQL protocol, supports standard SQL, and is highly compatible with the MySQL dialect. You can access Apache Doris by using various client tools. Apache Doris also supports seamless integration with business intelligence (BI) tools.

  • In terms of storage engines, Apache Doris uses a columnar storage engine that encodes, compresses, and reads data by column. This enables a high compression ratio and greatly reduces scans of irrelevant data, so that I/O and CPU resources are used more efficiently.

    Apache Doris also supports various index structures to minimize data scans:

    • Sorted compound key index: allows you to specify up to three columns to form a compound sort key. This way, you can effectively prune data to better support highly concurrent reporting scenarios.

    • Z-order index: allows you to efficiently run range queries on any combination of fields in a schema.

    • Min/Max index: enables effective filtering of equivalence and range queries on numeric types of data.

    • BloomFilter index: enables effective equivalence filtering and pruning of high cardinality columns.

    • Inverted index: enables fast search on any field.

  • In terms of storage models, Apache Doris supports a variety of storage models and optimizes the models for different scenarios:

    • Aggregate key model: This model merges the value columns that have the same key in advance. This significantly improves performance.

    • Unique key model: Keys are unique in this model. Data with the same key is overwritten to achieve row-level data updates.

    • Duplicate key model: This is a detailed data model that can store the data of fact tables as details.

    Apache Doris also supports strongly consistent materialized views. Materialized views are automatically selected and updated. This greatly reduces maintenance costs for users.

  • In terms of query engines, Apache Doris adopts the Massively Parallel Processing (MPP) model to achieve parallel execution between and within nodes. It also supports distributed shuffle join for multiple large tables to handle complex queries. The following figure shows the query engine.

    Figure: Query

    The query engine of Apache Doris is vectorized, with all memory structures laid out in a columnar format. This can greatly reduce virtual function calls, improve cache hit rates, and make efficient use of single instruction multiple data (SIMD) instructions. In wide table aggregation scenarios, the performance of the query engine of Apache Doris is 5 to 10 times higher than that of non-vectorized engines.

  • In terms of optimizers, Apache Doris uses a combination of Cost-Based Optimization (CBO) and Rule-Based Optimization (RBO). RBO supports constant folding, subquery rewriting, and predicate pushdown. CBO supports Join Reorder. The Doris CBO is under continuous optimization for more accurate statistics collection and derivation and more accurate cost model prediction.
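
To make the three storage models described above more concrete, the following Python sketch mimics how each model treats rows that share a key. It is only an illustration of the semantics, not Apache Doris code: the function names and sample rows are hypothetical, and SUM is assumed as the merge function for the aggregate key model.

```python
from collections import defaultdict

# Hypothetical sample rows in load order: (key, value).
rows = [("page_a", 3), ("page_b", 5), ("page_a", 4)]


def aggregate_key(rows):
    """Aggregate key model: value columns with the same key are merged.

    SUM is assumed here as the merge function for illustration.
    """
    merged = defaultdict(int)
    for key, value in rows:
        merged[key] += value
    return dict(merged)


def unique_key(rows):
    """Unique key model: a later row overwrites an earlier row with the same key."""
    latest = {}
    for key, value in rows:
        latest[key] = value
    return latest


def duplicate_key(rows):
    """Duplicate key model: every row is kept as detail data."""
    return list(rows)


agg_result = aggregate_key(rows)
uniq_result = unique_key(rows)
dup_result = duplicate_key(rows)
```

With these sample rows, the aggregate key model keeps one merged row per key, the unique key model keeps only the most recent row per key, and the duplicate key model keeps all three rows.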

Apache Doris uses Adaptive Query Execution (AQE) technology to dynamically adjust the execution plan based on runtime statistics. For example, Apache Doris can generate a runtime filter, push the filter to the probe side, and then automatically push it down to the Scan node at the bottom. This greatly reduces the amount of data on the probe side and improves join performance. The following figure shows the process.

Figure: AQE

Apache Doris supports the following runtime filters: In, Min, Max, and BloomFilter.
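
The runtime filter idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Apache Doris implementation: the function names and data are hypothetical, only In and Min/Max filters are modeled, and everything runs in a single process.

```python
def build_runtime_filters(build_keys):
    """Derive In and Min/Max filters from the join keys of the build side."""
    keys = set(build_keys)
    return {"in": keys, "min": min(keys), "max": max(keys)}


def scan_probe(probe_rows, filters):
    """Scan the probe side and drop rows that cannot match any build key."""
    for key, payload in probe_rows:
        if key < filters["min"] or key > filters["max"]:
            continue  # rejected by the Min/Max filter
        if key not in filters["in"]:
            continue  # rejected by the In filter
        yield key, payload


def hash_join(build_rows, probe_rows):
    """Join two inputs on their first field, filtering the probe side at scan time."""
    build_table = {}
    for key, payload in build_rows:
        build_table.setdefault(key, []).append(payload)
    if not build_table:
        return [], 0  # an empty build side can never produce matches

    filters = build_runtime_filters(build_table.keys())
    survivors = list(scan_probe(probe_rows, filters))
    joined = [(k, b, p) for k, p in survivors for b in build_table[k]]
    return joined, len(survivors)


# Hypothetical data: a small dimension table (build) and a larger fact table (probe).
build_rows = [(2, "dim_b"), (5, "dim_e")]
probe_rows = [(1, "f1"), (2, "f2"), (3, "f3"), (5, "f5"), (9, "f9")]

joined, rows_after_filter = hash_join(build_rows, probe_rows)
```

In this sketch, only two of the five probe rows pass the filters and reach the join, which is the effect that a runtime filter aims for on a real probe-side scan.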

Note

The content and figures in this topic are referenced from Introduction to Apache Doris.