Snowflake is a much-discussed cloud-native data warehouse product. It went public in the second half of 2020 and now has a market value of hundreds of billions of dollars. Its popularity further confirms the importance of the cloud. Looking at today's database vendors of all sizes, moving to the cloud is an inevitable strategic step, and Snowflake is even more direct: like AWS Aurora or our PolarDB, it is an OLAP database product built around cloud infrastructure.
Preface
This paper, published in 2016, gives an overall introduction to the design philosophy and overall architecture of Snowflake:
- multi-tenant, elastic, comprehensive SQL support, ACID transactions, built-in support for semi-structured data, highly scalable, highly available
- the design goal is Software-as-a-Service: minimal configuration and user O&M (very important!), you don't even need to do physical design
- multi-cluster + shared data
- compute and storage are decoupled and scaled independently
The traditional shared-nothing data warehouse architecture has several problems:
- Data and compute are coupled. All machines are homogeneous, but the workload is heterogeneous: some queries are I/O-intensive, some are CPU-intensive. To handle all of them, every node needs a good configuration across the board, which lowers resource utilization. You can see this in the machine configurations recommended for TiDB.
- Membership changes cause data reshuffling, which affects node performance and causes cluster-wide performance jitter.
- Online upgrades are difficult. Each node holds only part of the data, so an upgrade requires coordination between nodes to avoid inconsistent data processing.
The latter two points are especially important in the cloud, because nodes fail frequently, and efficient upgrades shorten the development cycle and improve a vendor's competitiveness.
Therefore Snowflake decouples compute and storage into two services that can scale independently:
- The compute layer is the shared-nothing Virtual Warehouse, composed of an EC2 cluster; each EC2 instance is a worker node with a local disk that serves as a table data cache.
- The storage layer is AWS S3 object storage, which can be considered as having unlimited capacity + no data loss + very low cost.
The combination of the two forms a multi-cluster + shared data architecture.
Overall architecture
The system is divided into three layers, from top to bottom: Cloud Service -> Virtual Warehouse -> Data Storage, and the layers interact through RESTful APIs.
Data Storage
AWS S3 object storage has excellent availability and durability. It is cloud storage with high access latency but high throughput. Through PUT, GET, and DELETE you can only write or overwrite a whole file; appending is not possible. This property greatly affects Snowflake's design for concurrency control and other aspects (data files are immutable!!). However, GET supports range reads, so it is possible to fetch only part of a file, i.e., only the needed columns.
A table is partitioned into a number of micro-partitions (MP), each of which corresponds to a file. Inside a file, data is organized by column, compressed, and stored, and the file header records the offset of each column to support partial reads; the format is similar to Parquet.
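As an illustration of such a partial read, the sketch below issues an S3 ranged GET for a single column, assuming a hypothetical header that maps each column name to an (offset, length) pair; the bucket, key, and header layout are invented, since Snowflake's actual file format is proprietary.

```python
# Minimal sketch: fetch one column of a micro-partition via an S3 ranged GET.
# The bucket/key names and the header layout are hypothetical; Snowflake's
# real file format is proprietary (the paper only says it is PAX/Parquet-like).
import boto3

s3 = boto3.client("s3")

def read_column(bucket, key, header, column):
    """header: dict mapping column name -> (byte offset, byte length)."""
    offset, length = header[column]
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",  # inclusive byte range
    )
    return resp["Body"].read()  # compressed column bytes, to be decoded by the engine

# Usage (hypothetical names): only the 'price' column is transferred, not the whole file.
# header = {"price": (4096, 65536), "quantity": (69632, 32768)}
# col_bytes = read_column("my-table-bucket", "mp_000123.dat", header, "price")
```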
The temporary data generated during a query can also be spilled to S3 (when the VW's local disk is not enough), so in theory there are no OOM / out-of-disk problems.
The Cloud Service layer (backed by FoundationDB) stores metadata such as the table -> files mapping, file statistics (zonemaps), transaction locks / transaction logs, schemas, and access control information!
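To make this concrete, here is a tiny illustration of how such metadata might be laid out as key-value pairs; the key scheme, table IDs, and file names are invented, as the paper does not describe the real metadata schema.

```python
# Illustrative only: one possible key-value layout for Cloud Service metadata.
metadata = {
    ("table", 42, "schema"): ["id INT", "price DECIMAL(10,2)", "ts TIMESTAMP"],
    ("table", 42, "version", 7): ["mp_000120.dat", "mp_000121.dat", "mp_000123.dat"],
    ("file", "mp_000123.dat", "zonemap", "price"): {"min": 1.99, "max": 799.00},
    ("file", "mp_000123.dat", "zonemap", "ts"): {"min": "2016-01-01", "max": "2016-01-31"},
}
```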
Virtual Warehouse
A Virtual Warehouse (VW) is an abstraction of compute resources for a specific user: a shared-nothing EC2 cluster. A Snowflake deployment can create multiple VWs, each created on demand for a different workload and scaled up/down independently, making full use of the elastic resources of the cloud.
A VW is stateless; if it fails, the query can simply be retried at the query level.
Each worker node has its own local cache (local SSD) that stores the headers and column data of the files it has accessed. The cache lives inside the worker node, is shared by the multiple worker processes on it, and is replaced through a simple LRU mechanism.
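As a rough illustration of the per-node cache, here is a tiny LRU cache keyed by (file, column); the capacity accounting and the fetch_from_s3 callback are stand-ins, not Snowflake internals.

```python
from collections import OrderedDict

class LocalFileCache:
    """LRU cache of (file, column) -> bytes on a worker node (illustrative only)."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # (file, column) -> bytes

    def get(self, key, fetch_from_s3):
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            return self.entries[key]
        data = fetch_from_s3(*key)             # cache miss: pull from S3 on demand
        self.entries[key] = data
        self.used += len(data)
        while self.used > self.capacity:       # evict least recently used entries
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
        return data
```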
When a query is to be executed, the query optimizer in the upper layer (Cloud Service) determines the set of table data files the query needs to access and assigns them to the workers in the VW through consistent hashing, effectively forming a temporary shared-nothing cluster.
Note: this consistent hashing is lazy! A worker process only determines the set of data files it is responsible for; it does not fetch them from S3 immediately. Instead it reads data on demand and gradually replaces entries in its local cache via the LRU policy, so resizing the VW is relatively smooth.
For load skew, file stealing is allowed at the query level: a fast worker node can take files from a slow worker node and scan and process them itself, balancing the load.
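Below is a minimal sketch of assigning micro-partition files to worker nodes via consistent hashing (file stealing is not shown); the hash-ring construction, virtual-node count, and names are illustrative, not Snowflake internals.

```python
import bisect
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash assignment of micro-partition files to worker nodes (sketch)."""
    def __init__(self, workers, vnodes=64):
        self.points = sorted(
            (_h(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [p for p, _ in self.points]

    def owner(self, file_name):
        i = bisect.bisect(self.keys, _h(file_name)) % len(self.points)
        return self.points[i][1]

# Usage: adding or removing a worker only remaps a small fraction of files,
# so the caches on the surviving workers remain mostly valid.
ring = Ring(["worker-1", "worker-2", "worker-3"])
assignment = {}
for f in ["mp_000120.dat", "mp_000121.dat", "mp_000123.dat"]:
    assignment.setdefault(ring.owner(f), []).append(f)
```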
The execution engine features:
- column-oriented (compression + SIMD + cache efficient)
- vectorized execution (small batch pipeline)
- push-based (avoids iterator calls in tight loops; it also allows DAG-shaped plans, where one batch of data can be pushed to multiple downstream operators); see the sketch after this list
- no transaction-processing overhead inside the engine; transactions are handled entirely in the upper layer
- no buffer pool!!! Memory is given to the compute operators instead
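To make the push model concrete, here is a toy push-based pipeline in which each operator pushes small batches to a list of downstream operators (so DAG fan-out is natural); it is a teaching sketch, not Snowflake's engine, and it uses rows rather than real column vectors for brevity.

```python
# Toy push-based, vectorized pipeline (illustrative; not Snowflake's engine).
class Filter:
    def __init__(self, predicate, downstreams):
        self.predicate = predicate
        self.downstreams = downstreams          # a DAG: one batch can fan out

    def push(self, batch):                      # batch: list of rows (stand-in for column vectors)
        out = [row for row in batch if self.predicate(row)]
        if out:
            for d in self.downstreams:
                d.push(out)

class Sink:
    def __init__(self):
        self.rows = []
    def push(self, batch):
        self.rows.extend(batch)

sink = Sink()
op = Filter(lambda r: r["price"] > 100, [sink])
for batch in ([{"price": 50}, {"price": 150}], [{"price": 250}]):  # small batches flow through
    op.push(batch)
# sink.rows == [{'price': 150}, {'price': 250}]
```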
Cloud Service
This layer serves all users (multi-tenant) and provides the important system management functions: access control, query optimization, transaction management, schemas, and the table -> file mapping. The metadata store (FoundationDB) is replicated across AZs to ensure high availability.
- Query optimization
A Cascades-style cost-based optimizer is used, but the search space is much smaller because there are no indexes (pure column storage). Instead of maintaining indexes, zonemaps holding accurate statistics are used to restrict data access. The optimizer does not over-optimize: some decisions are deferred to runtime, which avoids disastrously bad plans and keeps performance predictable.
Because the underlying S3 files are immutable, table changes are copy-on-write (COW): every change to a table file produces a new file and thus a new table version, and new metadata is created to maintain the new mapping. As a result, random updates perform very poorly; bulk loads / bulk updates are a much better fit.
This COW scheme is naturally MVCC and supports snapshot isolation (SI).
The evolution of table versions is stored in FoundationDB.
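A minimal sketch of this versioning idea: each commit appends a new (timestamp, file list) version, and a snapshot-isolation reader picks the newest version at or before its snapshot timestamp. The class, timestamps, and file names are invented for illustration.

```python
import bisect

class TableVersions:
    """COW table metadata: each commit appends (commit_ts, file list). Illustrative only."""
    def __init__(self):
        self.versions = []                     # list of (commit_ts, tuple_of_files), ts increasing

    def commit(self, ts, files):
        self.versions.append((ts, tuple(files)))   # old versions are never mutated

    def snapshot(self, ts):
        """Return the file list visible to a reader whose snapshot timestamp is ts."""
        i = bisect.bisect_right([t for t, _ in self.versions], ts) - 1
        return self.versions[i][1] if i >= 0 else ()

t = TableVersions()
t.commit(100, ["mp_1.dat"])
t.commit(200, ["mp_1.dat", "mp_2.dat"])        # bulk load adds a file, keeps the old ones
t.commit(300, ["mp_3.dat"])                    # an update rewrites files wholesale
assert t.snapshot(250) == ("mp_1.dat", "mp_2.dat")   # reader at ts=250 sees version 200
```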
- Zonemap pruning
Because table files are immutable, accurate statistics can be collected for them. Snowflake therefore does not use indexes, which are hard to maintain, but relies on zonemaps to filter out as much data as possible and reduce data access.
Zonemaps fit COW very well: every time a new file is generated, a new zonemap is computed for it. Zonemaps are very small, so the storage overhead is low and they can be accessed quickly.
In addition to static pruning, Snowflake also attempts dynamic pruning: for example, after building a hash table, it collects information about the join-key distribution and pushes it to the probe side for pruning; see the sketch below.
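Here is a compact sketch of both kinds of pruning: static pruning compares a query predicate's range against each file's min/max zonemap, and dynamic pruning reuses the same check with the min/max of the join keys observed on the build side. The zonemap layout and file names are illustrative only.

```python
def prune_static(files, zonemaps, column, lo, hi):
    """Keep only files whose [min, max] range for `column` overlaps [lo, hi]."""
    kept = []
    for f in files:
        zmin, zmax = zonemaps[(f, column)]
        if zmax >= lo and zmin <= hi:          # ranges overlap -> file may contain matches
            kept.append(f)
    return kept

def prune_probe_side(files, zonemaps, join_column, build_keys):
    """Dynamic pruning: use the build-side join-key range to skip probe-side files."""
    return prune_static(files, zonemaps, join_column, min(build_keys), max(build_keys))

zonemaps = {
    ("mp_1.dat", "price"): (1, 99),
    ("mp_2.dat", "price"): (100, 500),
}
assert prune_static(["mp_1.dat", "mp_2.dat"], zonemaps, "price", 200, 300) == ["mp_2.dat"]
```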
For more details on Snowflake's query optimization, see the CMU talk:
https://www.youtube.com/watch?v=CPWn1SZUZqE
Continuous Availability
At the metadata layer, the metadata store (FoundationDB) is replicated across AZs.
At the data layer, S3 is used for storage.
Both are highly available, so all of the system's state is highly available. The other nodes, whether the various Cloud Service nodes or the Virtual Warehouses (which do not span AZs), are all stateless; if they fail, the query can simply be retried at the upper layer.
Because these nodes are stateless, upgrades are very simple:
For the metadata store, schema changes must remain backward compatible; with FoundationDB's fast failover and fast restart/recovery, it can be smoothly upgraded to a new version.
For service nodes, a new service can simply be deployed and traffic gradually drained from the old one.
Time travel + Cloning
Because of MVCC + COW files, the files of old table versions can be retained in S3 for a period of time, so you can specify a snapshot timestamp in a query to do time travel.
This also makes it possible to UNDROP mistakenly dropped data, since the data files are not immediately discarded.
In addition, COW enables table cloning, i.e., creating a new table from an existing table's definition and data. As you might expect, the clone does not copy any data, only metadata; afterwards the two sets of metadata evolve independently, similar to git branches. A sketch follows below.
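A minimal sketch of zero-copy cloning under an illustrative model where table metadata is just a mapping from commit timestamp to file list: the clone copies only the metadata, the two tables then evolve independently, and no S3 data files are duplicated. All names and timestamps are made up.

```python
import copy

# Illustrative only: table metadata as {commit_ts: tuple_of_files}; data files live in S3.
orig_meta = {100: ("mp_1.dat", "mp_2.dat")}

# CLONE copies only the metadata; no table data is copied on S3.
clone_meta = copy.deepcopy(orig_meta)

# The two metadata sets then evolve independently, like git branches.
clone_meta[200] = ("mp_1.dat", "mp_2.dat", "mp_9.dat")   # new file only in the clone
orig_meta[300] = ("mp_1.dat",)                           # original evolves on its own
assert orig_meta != clone_meta and orig_meta[100] == clone_meta[100]
```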
Summary
This paper stays at a very high level, so the technical details are relatively vague, but it still reveals some of Snowflake's most important product design decisions. I personally feel that these decisions are what made it so successful today.
- It takes full advantage of cloud infrastructure, especially elastic compute resources and stable, cost-effective, virtually unlimited storage; combined with multi-tenancy, this minimizes instance deployment and O&M costs.
- It does not pursue ultimate performance. The Virtual Warehouse architecture is very different from traditional shared-nothing data warehouses (Redshift/Greenplum); with a cross-network storage hierarchy of S3 -> SSD -> memory, it is presumably hard to achieve good all-round query performance, unless the workload is relatively fixed so that the relevant data stays cached in the VW's EC2 instances.
- The system emphasizes ease of use, avoiding user-tuned knobs and user-side maintenance as much as possible. This matters a great deal: lowering the barrier to entry and the O&M workload is very important for cloud users, and personally I think it is the issue customers care about most besides cost. As for query response time, it should be good enough for AP scenarios.
- It supports semi-structured and schema-less data.
Snowflake's cloud-native form of compute-storage separation has had a great influence on many database vendors. For example, SingleStore (the commercialized MemSQL) is similar, but because it started from a shared-nothing architecture, data can only be asynchronously synchronized from the local cache to S3 to reduce storage costs while keeping hot data cached locally. TiDB also seems to be studying a similar problem, i.e., reducing cluster costs via large-scale object storage underneath.