An In-Depth Understanding of Presto (1): Presto Architecture

Part 1 of this series introduces some principles of Presto from a macro perspective, from the outside to the inside.

By Yunlei
Contributed by Alibaba Cloud Storage


Presto is an open-source distributed SQL query engine developed by Facebook. It is suited to interactive queries and analysis on data volumes ranging from gigabytes to petabytes. The architecture of Presto evolved from the RDS architecture. Presto stands out among in-memory computing databases in the following respects:

  1. It has a clear architecture and runs independently, without depending on any external system. For scheduling, for example, Presto monitors the cluster itself and schedules according to that monitoring information.
  2. It has a simple data structure: storage is columnar, rows are logical, and most data can easily be converted into the structure Presto requires.
  3. It offers rich plug-in interfaces for connecting to external storage systems and adding custom functions.

This article introduces Presto from the outside to the inside.



Presto uses a typical master-slave model:

  1. The coordinator (master) is responsible for metadata management, worker management, and query parsing and scheduling.
  2. The worker is responsible for computation and for reading and writing data.
  3. The discovery server handles node heartbeats. It is usually embedded in the coordinator node but can also be deployed separately. In the following, discovery and coordinator are assumed to share a machine.

In the worker's configuration, the discovery address can be specified in one of three ways:

1.  The ip:port of discovery.

2.  An HTTP address that serves the service inventory, which includes the discovery address:

    {
        "environment": "production",
        "services": [
            {
                "id": "ffffffff-ffff-ffff-ffff-ffffffffffff",
                "type": "discovery",
                "location": "/ffffffff-ffff-ffff-ffff-ffffffffffff",
                "pool": "general",
                "state": "RUNNING",
                "properties": {
                    "http": ""
                }
            }
        ]
    }

3.  A local file path whose content is the same as in option 2.

Options 2 and 3 are both based on the service inventory. The worker watches the inventory dynamically. If it changes, the worker loads the latest configuration and points to the latest discovery node.
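As a rough illustration of what a worker has to do with the service inventory, the sketch below parses an inventory document and extracts the discovery address. It is not Presto's own code. The function name `discovery_addresses` and the address `http://192.0.2.10:8080` are assumptions made for this example; the field names follow the JSON shown above.

```python
import json

# Example inventory modeled on the JSON above. The "http" value is a
# made-up placeholder address, not taken from the article.
INVENTORY = """
{
  "environment": "production",
  "services": [
    {
      "id": "ffffffff-ffff-ffff-ffff-ffffffffffff",
      "type": "discovery",
      "location": "/ffffffff-ffff-ffff-ffff-ffffffffffff",
      "pool": "general",
      "state": "RUNNING",
      "properties": {"http": "http://192.0.2.10:8080"}
    }
  ]
}
"""


def discovery_addresses(inventory_text):
    """Return the HTTP addresses of all running discovery services."""
    inventory = json.loads(inventory_text)
    return [
        service["properties"]["http"]
        for service in inventory.get("services", [])
        if service.get("type") == "discovery"
        and service.get("state") == "RUNNING"
    ]


print(discovery_addresses(INVENTORY))
```

A worker that re-reads the inventory periodically and calls a function like this one would pick up a new discovery address as soon as the inventory changes.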

By design, discovery and coordinator are each a single node. If multiple coordinators are alive at the same time, each worker reports its process and task status to one of them at random, which results in split-brain, and query scheduling may deadlock.

Discovery and Coordinator Availability Design

Because workers locate discovery through the service inventory, a monitoring program can modify the inventory to point to the discovery service on the standby machine when the primary discovery goes down, so the switch is seamless. The coordinator, however, must be specified when the process starts, and multiple coordinators cannot be alive in the same cluster. The best approach is to deploy the coordinator on the same machine as discovery and to deploy a standby discovery and coordinator on a secondary machine. Normally, the secondary machine forms a cluster that contains only itself. When the primary machine goes down, the workers' heartbeats are switched to the secondary machine immediately.
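The switching step performed by the monitoring program can be sketched as follows. This is a minimal illustration under assumptions, not part of Presto: the function `point_to_standby`, the health flag `primary_alive`, and both host names are hypothetical, and a real monitor would also need to write the result back to the file or HTTP endpoint that the workers watch.

```python
import copy


def point_to_standby(inventory, primary_alive, standby_http):
    """Return an inventory whose discovery entry points at the standby
    when the primary is down; otherwise return the inventory unchanged."""
    if primary_alive:
        return inventory
    updated = copy.deepcopy(inventory)  # leave the caller's copy untouched
    for service in updated["services"]:
        if service["type"] == "discovery":
            service["properties"]["http"] = standby_http
    return updated


# Hypothetical addresses, used only for this example.
inventory = {
    "environment": "production",
    "services": [
        {
            "type": "discovery",
            "state": "RUNNING",
            "properties": {"http": "http://primary.example:8080"},
        }
    ],
}

after = point_to_standby(
    inventory, primary_alive=False, standby_http="http://standby.example:8080"
)
print(after["services"][0]["properties"]["http"])
```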

Data Model

Presto adopts a three-layer table structure:

  1. A catalog corresponds to a type of data source, such as Hive or MySQL.
  2. A schema corresponds to a database in MySQL.
  3. A table corresponds to a table in MySQL.
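In queries, the three layers appear as a fully qualified name of the form catalog.schema.table. The sketch below splits such a name into its parts. The helper `parse_table_name` and the example name `hive.web.page_views` are hypothetical, and the sketch ignores quoted identifiers that contain dots.

```python
from typing import NamedTuple


class QualifiedTableName(NamedTuple):
    catalog: str  # type of data source, e.g. a Hive or MySQL catalog
    schema: str   # corresponds to a database in MySQL
    table: str    # corresponds to a table in MySQL


def parse_table_name(name):
    """Split 'catalog.schema.table' into its three layers."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected catalog.schema.table, got %r" % name)
    return QualifiedTableName(*parts)


name = parse_table_name("hive.web.page_views")
print(name.catalog, name.schema, name.table)
```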


Presto's storage units are:

  1. Page: A page is a collection of multiple rows that contains data for multiple columns. It is stored in a columnar format, and rows exist only logically.
  2. Block: A block is one column of data. Different encoding methods are used for different types of data. Understanding these encodings helps when connecting your own storage system to Presto.
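The relationship between pages, blocks, and logical rows can be sketched in a few lines. This is a simplified illustration, not Presto's Java implementation: each block is represented here by a plain Python list, and the `Page` class and its `row` method are names chosen for this example.

```python
class Page:
    """A batch of rows stored column by column (one block per column)."""

    def __init__(self, blocks):
        lengths = {len(block) for block in blocks}
        if len(lengths) > 1:
            raise ValueError("all blocks must have the same number of rows")
        self.blocks = blocks
        self.position_count = lengths.pop() if lengths else 0

    def row(self, position):
        """Assemble a logical row by picking one value from each block."""
        return tuple(block[position] for block in self.blocks)


# Two columns (blocks) holding three rows.
page = Page([[1, 2, 3], ["a", "b", "c"]])
print(page.position_count)  # 3
print(page.row(1))          # (2, 'b')
```

No row object is ever stored. A row is only materialized on request by reading the same position from every block, which is what "logical rows" means here.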

Different types of blocks:

1.  An array block is used for fixed-width types, such as int, long, and double.

It consists of two parts:

  • boolean valueIsNull[]: indicates whether each row is null (has no value).
  • T values[]: the value of each row.

2.  A variable-width block is used for string data and consists of three parts:

  • Slice: a single buffer that holds the concatenated data of all rows.
  • int offsets[]: the start offset of each row. The length of a row equals the start offset of the next row minus the start offset of the current row.
  • boolean valueIsNull[]: indicates whether a row is null (has no value). If a row has no value, its offset equals the offset of the previous row.

3.  A fixed-width string block. The data of all rows is concatenated into one long Slice, and every row has the same fixed length.

4.  Dictionary block: Some columns have only a small number of distinct values and are well suited to dictionary encoding. A dictionary block has two main parts:

  • A dictionary, which can be any type of block (it can even be another dictionary block, nested). Each row in the dictionary block is stored in order.
  • int ids[]: the index in the dictionary of the value for each row. To look up a row, first find its id and then fetch the actual value from the dictionary.
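The block layouts described above can be sketched as small classes. This is an illustrative model rather than Presto's Java code. The class names, the `get` method, and the use of Python lists and `bytes` in place of arrays and Slice are assumptions of this example. The offsets array here also stores one extra trailing entry so that the length of the last row can be computed.

```python
class FixedWidthBlock:
    """Array block: one value slot per row plus a null flag per row."""

    def __init__(self, values):
        self.value_is_null = [v is None for v in values]
        self.values = [0 if v is None else v for v in values]

    def get(self, row):
        return None if self.value_is_null[row] else self.values[row]


class VariableWidthBlock:
    """String block: one concatenated byte buffer plus row offsets."""

    def __init__(self, strings):
        self.value_is_null = [s is None for s in strings]
        self.offsets = [0]
        buffer = bytearray()
        for s in strings:
            if s is not None:
                buffer += s.encode("utf-8")
            self.offsets.append(len(buffer))  # a null row adds no bytes
        self.slice = bytes(buffer)

    def get(self, row):
        if self.value_is_null[row]:
            return None
        start, end = self.offsets[row], self.offsets[row + 1]
        return self.slice[start:end].decode("utf-8")


class DictionaryBlock:
    """Dictionary block: distinct values stored once, rows store ids."""

    def __init__(self, dictionary, ids):
        self.dictionary = dictionary  # any block that exposes get(row)
        self.ids = ids

    def get(self, row):
        # First find the id of the row, then fetch the real value.
        return self.dictionary.get(self.ids[row])


numbers = FixedWidthBlock([10, None, 30])
names = VariableWidthBlock(["ab", None, "cde"])
colors = DictionaryBlock(VariableWidthBlock(["red", "blue"]), [0, 1, 1, 0])
print(numbers.get(1), names.get(2), colors.get(2))  # None cde blue
```

Because `DictionaryBlock` only requires its dictionary to expose `get`, the dictionary can itself be another dictionary block, which matches the nesting mentioned above.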

Plug-ins

Once you understand the data model of Presto, you can write plug-ins that connect Presto to your own storage system. Presto provides a set of connector interfaces for reading metadata and columnar data from custom storage. First, look at the basic concepts of a connector:

  1. ConnectorMetadata: It manages table metadata, partitions, and other information. When a request is processed, the metadata is consulted to determine where the data to be read is located. Presto passes in the filter condition so that the range of data to read can be reduced. Metadata can be read from disk or cached in memory.
  2. ConnectorSplit: It is the collection of data processed by one I/O task and the unit of scheduling. One split can correspond to one partition or to multiple partitions.
  3. SplitManager: It constructs splits based on the metadata of the table.
  4. SlsPageSource: It reads zero or more pages from disk based on the split information and the columns to be read.
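How these four concepts cooperate can be sketched with simplified stand-ins. The real connector interfaces are Java interfaces in Presto, so everything below is a hypothetical Python model: the class names `Metadata`, `SplitManager`, and `PageSource`, their methods, the in-memory `storage` dictionary, and the `logs` table are all assumptions of this example. A split is mapped to exactly one partition here to keep the sketch short.

```python
class Metadata:
    """Holds table metadata and partitions (cached in memory here)."""

    def __init__(self, tables):
        self.tables = tables

    def partitions(self, table, predicate=lambda partition: True):
        # The filter condition prunes partitions before any data is read.
        return [p for p in self.tables[table]["partitions"] if predicate(p)]


class SplitManager:
    """Constructs splits, the units of scheduling, from table metadata."""

    def __init__(self, metadata):
        self.metadata = metadata

    def splits(self, table, predicate=lambda partition: True):
        return [
            {"table": table, "partition": p}
            for p in self.metadata.partitions(table, predicate)
        ]


class PageSource:
    """Reads zero or more columnar pages for one split."""

    def __init__(self, storage, split, columns, page_size=2):
        self.rows = storage[(split["table"], split["partition"])]
        self.columns = columns
        self.page_size = page_size

    def pages(self):
        for start in range(0, len(self.rows), self.page_size):
            chunk = self.rows[start:start + self.page_size]
            # One list per requested column: a page of column blocks.
            yield [[row[c] for row in chunk] for c in self.columns]


storage = {
    ("logs", "2024-01-01"): [
        {"status": 200, "path": "/a"},
        {"status": 404, "path": "/b"},
        {"status": 200, "path": "/c"},
    ],
    ("logs", "2024-01-02"): [{"status": 500, "path": "/d"}],
}
metadata = Metadata({"logs": {"partitions": ["2024-01-01", "2024-01-02"]}})
manager = SplitManager(metadata)

splits = manager.splits("logs", lambda p: p >= "2024-01-02")
pages = [page for s in splits for page in PageSource(storage, s, ["status"]).pages()]
print(pages)  # [[[500]]]
```

With the filter applied, only the second partition becomes a split, so the three rows of the first partition are never read.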

Plug-ins can help developers add these features:

  1. Connect your storage system
  2. Add a custom data type
  3. Add a custom processing function
  4. Custom permission control
  5. Custom resource control
  6. Add query event processing logic

Presto provides a simple connector (the local file connector) that can serve as a reference for implementing your own. Note, however, that the local file connector traverses data with a cursor, which yields one row at a time rather than a page. The Hive connector implements readers for three file formats: Parquet, ORC, and RCFile.
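The difference between cursor-based and page-based traversal can be illustrated by batching the rows of a cursor into columnar pages. This is a minimal sketch, not the local file connector's code: the function `pages_from_cursor` and its parameters are hypothetical, and the cursor is modeled as any Python iterator that yields one row tuple at a time.

```python
def pages_from_cursor(cursor, column_count, page_size):
    """Group rows from a row-at-a-time cursor into columnar pages."""
    columns = [[] for _ in range(column_count)]
    rows_in_page = 0
    for row in cursor:
        for index, value in enumerate(row):
            columns[index].append(value)
        rows_in_page += 1
        if rows_in_page == page_size:
            yield columns
            columns = [[] for _ in range(column_count)]
            rows_in_page = 0
    if rows_in_page:
        yield columns  # final, partially filled page


cursor = iter([(1, "a"), (2, "b"), (3, "c")])
for page in pages_from_cursor(cursor, column_count=2, page_size=2):
    print(page)
```

With a page size of 2, the three rows come out as two pages: one holding two rows and one holding the remaining row.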


This article introduces some principles of Presto from a macro perspective. The next article of the series will go deeper into Presto to help readers understand some internal designs, which will be of great use for performance tuning and the addition of custom operators.

Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
