6 Optional Technologies for Data Storage
1. Structured data storage
Businesses often use their transactional databases for reporting as well, so data is read frequently but written far less often. As the demand for reads has grown, structured data storage has seen more innovation on the query side, such as columnar file formats, which improve read performance and serve analytical needs.
Row-based formats store data in a file row by row. Writing a row at a time is the fastest way to get data onto disk, but it is not necessarily the fastest to read, because a query has to skip over a lot of irrelevant data.
Column-based formats store all column values together in the file. This results in better compression since the same data types are now grouped together. Often, it also provides better read performance because you can skip columns you don't need.
For example, suppose you need the total sales for a certain month from an orders table that has 50 columns. With a row-based layout, the query scans all 50 columns of every row. With a columnar layout, it scans only the columns it needs (here, the sales column and the column used to filter by month), which improves query performance. Let's look at the common choices for structured data storage: relational databases, which focus on transactional data, and data warehouses, which handle data analysis.
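The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the table, its column names (`order_id`, `month`, `sales`), and both helper functions are hypothetical, and only 3 of the 50 columns are shown.

```python
# Hypothetical orders table; only 3 of the 50 columns are shown for brevity.
rows = [
    {"order_id": 1, "month": "2023-01", "sales": 120.0},
    {"order_id": 2, "month": "2023-01", "sales": 80.0},
    {"order_id": 3, "month": "2023-02", "sales": 200.0},
]


def monthly_sales_row_store(rows, month):
    """Row layout: each record is read as a whole, needed or not."""
    total = 0.0
    for row in rows:
        if row["month"] == month:
            total += row["sales"]
    return total


# Column layout: one list per column, values kept in row order.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def monthly_sales_column_store(columns, month):
    """Column layout: only the 'month' and 'sales' columns are read."""
    return sum(
        sales
        for row_month, sales in zip(columns["month"], columns["sales"])
        if row_month == month
    )


print(monthly_sales_row_store(rows, "2023-01"))
print(monthly_sales_column_store(columns, "2023-01"))
```

Both functions return the same total. The difference is how much data each has to touch: the row version visits every record in full, while the column version reads two lists and ignores the rest.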
A relational database management system (RDBMS) is best suited to online transaction processing (OLTP) applications. Popular relational databases include Microsoft SQL Server (MSSQL), MariaDB, and PostgreSQL. Some of these traditional databases have been around for decades.
Many applications, including e-commerce, banking, and hotel reservations, are powered by relational databases. Relational databases are very good at handling transactional data that requires complex join queries between tables. To meet the requirements of transactional data, relational databases should adhere to the principles of atomicity, consistency, isolation, and durability (ACID), as follows:
Atomicity: The transaction either executes completely from start to finish or, if an error occurs, is rolled back in its entirety.
Consistency: A transaction takes the database from one valid state to another, so committed data always satisfies the defined rules and constraints.
Isolation: Multiple transactions can run simultaneously, each in isolation, without interfering with each other.
Durability: Once a transaction is committed, its changes survive any interruption (such as a network or power failure), and the database can recover to the last known committed state.
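Atomicity and consistency can be demonstrated with Python's built-in `sqlite3` module. The sketch below is a minimal example under stated assumptions: the `accounts` table, the `transfer` and `balances` helpers, and the sample balances are all hypothetical. A `CHECK` constraint forbids negative balances, and a transfer that would violate it is rolled back as a whole, including the credit that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts (name, balance) VALUES (?, ?)",
    [("alice", 100), ("bob", 50)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts as one transaction."""
    try:
        # 'with conn' commits if the block succeeds and rolls back on error.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so neither update is kept.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 500), balances(conn))
print(transfer(conn, "alice", "bob", 30), balances(conn))
```

The first transfer fails because it would overdraw the source account, and the balances are left unchanged. The second one succeeds and both updates are committed together.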
Typically, data from relational databases is dumped into a data warehouse for reporting and aggregation.
Data warehouses are best suited to online analytical processing (OLAP) applications. They provide fast aggregation over massive amounts of structured data. Data is typically loaded in batches, which makes it difficult for warehouses to provide real-time insights on hot data.
Modern data warehouses use columnar storage to improve query performance. Thanks to columnar storage, these data warehouses provide very fast query speed and improve I/O efficiency.
A data warehouse is a central repository that stores accumulated data from one or more databases. It holds both current and historical data, which is used to create analytical reports on business data.
Although data warehouses centrally store data from multiple systems, they cannot be considered data lakes. A data warehouse handles only structured relational data, while a data lake can handle structured relational data as well as semi-structured and unstructured data such as JSON, log, and CSV data.
2. NoSQL database
NoSQL uses a variety of data models, including columnar, key-value, search, document, and graph models. NoSQL databases offer scalable performance, high availability, and resilience.
NoSQL databases usually don't enforce a strict schema, so each record can have any number of columns (attributes): one row can have 4 columns while another row in the same table has 10. A partition key is used to retrieve the values or documents that contain the related attributes. NoSQL databases are highly distributed and can be replicated, which makes them durable and highly available while sustaining performance.
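The idea of schema-free records located by a partition key can be illustrated with a toy in-process table. This is a sketch only: `TinyPartitionedTable`, its methods, and the sample items are hypothetical and do not reflect any real NoSQL product, which would add replication, persistence, and networking.

```python
import zlib


class TinyPartitionedTable:
    """Toy key-value table: items are spread over partitions by key hash."""

    def __init__(self, partition_count=4):
        self._partitions = [{} for _ in range(partition_count)]

    def _partition(self, partition_key):
        # A stable hash of the partition key picks the partition.
        index = zlib.crc32(partition_key.encode("utf-8")) % len(self._partitions)
        return self._partitions[index]

    def put(self, partition_key, **attributes):
        # No schema: each item keeps whatever attributes it was given.
        self._partition(partition_key)[partition_key] = attributes

    def get(self, partition_key):
        return self._partition(partition_key).get(partition_key)


table = TinyPartitionedTable()
table.put("user#1", name="Ana", email="ana@example.com")
table.put(
    "user#2",
    name="Bo",
    email="bo@example.com",
    plan="pro",
    last_login="2024-01-01",
)

print(table.get("user#1"))
print(table.get("user#2"))
```

The two items live in the same table but carry different sets of attributes, and each lookup goes straight to one partition instead of scanning all of them.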
3. NoSQL database types
Columnar databases: Columnar data stores scan only the relevant column when querying data, rather than scanning entire rows. If the items table has 10 columns and 1 million rows, and you want to query the quantity of an item in the inventory, the columnar database applies the query to the item quantity column only, without scanning the entire table. Apache Cassandra is often grouped with this family as a wide-column store.
Document databases: Popular document databases include MongoDB, Couchbase, MarkLogic, and Amazon DynamoDB (which supports both key-value and document models). A document database can be used to store semi-structured data in JSON and XML formats.
Graph databases: A graph database stores vertices and the links between vertices (called edges). Graphs can be built on relational and non-relational databases.
In-memory key-value stores: These keep data in memory and are used in scenarios where data is read frequently. An application's query goes to the in-memory store first, and if the data is found in the cache, the query never hits the main database. In-memory stores are well suited to user session information and to frequently requested data that would otherwise require complex queries, such as user profiles.
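The read path described for in-memory stores is commonly called the cache-aside pattern. A minimal sketch follows, assuming a plain dictionary as the cache and a hypothetical `load_profile_from_database` function standing in for the main database. A real deployment would use a dedicated in-memory store and handle expiry and invalidation.

```python
class CachedProfileReader:
    """Cache-aside reads: check memory first, fall back to the database."""

    def __init__(self, load_from_database):
        self._load_from_database = load_from_database
        self._cache = {}
        self.database_hits = 0

    def get(self, user_id):
        if user_id in self._cache:
            # Cache hit: the main database is not touched.
            return self._cache[user_id]
        # Cache miss: query the database once, then remember the result.
        self.database_hits += 1
        profile = self._load_from_database(user_id)
        self._cache[user_id] = profile
        return profile


# Hypothetical stand-in for an expensive query against the main database.
FAKE_DATABASE = {"u1": {"name": "Ana", "plan": "pro"}}


def load_profile_from_database(user_id):
    return FAKE_DATABASE.get(user_id)


reader = CachedProfileReader(load_profile_from_database)
first = reader.get("u1")
second = reader.get("u1")
print(first, second, reader.database_hits)
```

The second call is served from memory, so the database is queried only once for the two reads.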
4. Search data storage
NoSQL has many use cases, but to build a data search service, all of the data needs to be indexed.
Log search and analysis are common big data application scenarios, and Elasticsearch, a search and analytics engine, can help you analyze log data from websites, servers, and IoT sensors. Elasticsearch is used across many industries and applications, such as banking, gaming, marketing, application monitoring, ad tech, fraud detection, recommendation, and IoT.
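Search engines answer queries quickly because they index the data ahead of time, typically with an inverted index that maps each term to the documents containing it. The sketch below shows that general idea only. It is not Elasticsearch's implementation or API, and the sample log lines and function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical log lines keyed by document id.
logs = {
    1: "ERROR payment service timeout",
    2: "INFO payment accepted",
    3: "ERROR sensor offline",
}

index = build_inverted_index(logs)
print(search(index, "error payment"))
print(search(index, "error"))
```

Once the index is built, a query looks up only the posting sets for its terms and intersects them, instead of rescanning every log line.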
5. Unstructured data storage
Hadoop uses an architecture of one master node and multiple child (worker) nodes. Data is distributed across the child nodes, and the master node coordinates operations and the queries that run on the data. The Hadoop system relies on massively parallel processing across these nodes, which makes it possible to quickly query various types of data, whether structured or unstructured.
When a Hadoop cluster is created, each child node comes with a block of local disk storage that forms part of the Hadoop Distributed File System (HDFS). You can query the stored data using common processing frameworks such as Hive, Pig, and Spark. However, the data on the local disk is only persisted for the lifetime of the associated instance.
If you use Hadoop's storage layer (i.e. HDFS) to store data, then storage and computation will be coupled together. Adding storage means having to add more machines, which also increases computing power. For maximum flexibility and best cost-effectiveness, compute and storage need to be separated and scaled independently of each other.
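A rough sizing calculation shows how this coupling plays out. All figures below (300 TB of data, 24 TB of disk and 16 cores per node) are hypothetical. The replication factor of 3 is the HDFS default. The `nodes_needed` helper is an illustrative assumption, not part of any Hadoop tooling.

```python
import math


def nodes_needed(data_tb, disk_per_node_tb, replication_factor=3):
    """Nodes required just to hold the data, including HDFS replicas."""
    raw_tb = data_tb * replication_factor
    return math.ceil(raw_tb / disk_per_node_tb)


# Hypothetical cluster: 300 TB of data, 24 TB disk and 16 cores per node.
nodes = nodes_needed(data_tb=300, disk_per_node_tb=24)
cores_added = nodes * 16

print(nodes, cores_added)
```

Under these assumptions, storage alone requires 38 nodes, which brings 608 cores along with it whether or not the workload needs that much compute.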
In general, object storage is better suited for data lakes to store all kinds of data in a cost-effective manner. Supported by object storage, cloud-based data lakes can flexibly decouple computing and storage.
6. Data Lake
The advantages of a data lake are as follows:
Ingest data from various sources: A data lake allows you to store and analyze data from various sources (such as relational databases, non-relational databases, and streams) in a centralized location to produce a single source of truth. This answers questions such as: why is the data spread across multiple places, and where is the single source of truth?
Ingest and store data efficiently: Data lakes can ingest any type of data, including semi-structured and unstructured data, without the need for any schema. This answers the question of how to quickly ingest data from various sources, in various formats, and store it efficiently at scale.
Scale with the amount of data generated: A data lake allows you to separate the storage tier from the compute tier and scale each one independently. This answers the question of how to scale with the amount of data produced.
Apply analytics to data from disparate sources: With a data lake, you can define the structure of the data at read time (schema on read) and create a centralized catalog of the data collected from disparate sources. This enables you to analyze data quickly and at any time, and it answers the question of whether multiple analysis and processing frameworks can be applied to the same data.
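Storing data without a schema and applying structure only when reading can be sketched as follows. The example is a simplification under stated assumptions: the object names, their contents, and the `read_amounts` helper are hypothetical, and an in-memory dictionary stands in for object storage.

```python
import csv
import io
import json

# Raw objects are stored as-is, with no schema declared at ingestion time.
raw_objects = {
    "orders.csv": "order_id,amount\n1,120.5\n2,80\n",
    "clicks.json": '{"user": "u1", "amount": "3"}\n{"user": "u2"}\n',
}


def read_amounts(name, payload):
    """Apply a schema at read time: extract 'amount' as a float."""
    if name.endswith(".csv"):
        records = csv.DictReader(io.StringIO(payload))
    elif name.endswith(".json"):
        # One JSON document per line.
        records = (json.loads(line) for line in payload.splitlines() if line.strip())
    else:
        raise ValueError(f"unsupported format: {name}")
    # Records without an 'amount' field are skipped rather than rejected.
    return [
        float(record["amount"])
        for record in records
        if record.get("amount") not in (None, "")
    ]


amounts = {name: read_amounts(name, payload) for name, payload in raw_objects.items()}
print(amounts)
```

The same stored objects could be read again later with a different projection, for example by a second framework that only needs the `user` field, without re-ingesting anything.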