In this tutorial series, we will learn how to optimize file storage on a Linux server to ensure data integrity using ZFS.
By Alexandru Andrei, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud's incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.
Warning: If you're going to use ZFS on your Cloud Disks, don't use the snapshot feature included in the Alibaba console. Rolling back to those backups can corrupt data, especially when the pool spans multiple disks. ZFS keeps structural data on each of these devices, and if the snapshots aren't all taken at exactly the same moment, the discrepancies may lead to missing data or corruption. ZFS includes its own snapshot feature and can easily replicate snapshots to other systems with commands such as zfs send.
While the default ext4 filesystem, used by most Linux distributions, might cover basic user needs, when advanced features are required you need to look at alternatives. ZFS is one of the most important and most capable of these alternatives, packing a long list of features and sophisticated mechanisms for ensuring data integrity. Since it's a complete data storage solution, ZFS includes not only a filesystem but also a volume manager, similar in scope to the Logical Volume Manager (LVM) but more efficient in certain areas. To help you decide whether you actually need this tool in your infrastructure, here's a list of things it can do, which will be detailed in the next section:
- Makes administering storage devices much easier and more flexible. Allows most of the operations, like expanding, replacing devices, correcting errors, to be done online (on a running system and without requiring a reboot).
- Data compression can significantly boost performance and reduce disk space requirements.
- Snapshots can be used to easily roll back undesired or unexpected changes. They also provide a means to create fast, off-site, consistent, incremental or differential backups.
- Clones allow users to branch off/fork content, working simultaneously on multiple versions, without affecting the original. Only the differences are stored, reducing disk space requirements.
- Overall read/write performance can be significantly increased in certain setups.
- Redundant ZFS pools can self heal in the event of errors. Extremely reliable checksumming of data, Merkle Trees and many other techniques ensure not even a bit of information can change unnoticed.
- Atomic transactions -- they either entirely succeed or they are entirely cancelled/aborted. This protects you against partial writes/changes that leave other filesystems in inconsistent states. If power is lost during a write, you only lose data that didn't have time to reach the disk, but you won't get corruption.
- More efficient cache (ARC - Adaptive Replacement Cache).
- Being a copy-on-write system, it rarely overwrites data, which gives a lot of windows of opportunity to recover from mistakes. It also makes it SSD-friendly.
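To make the copy-on-write, snapshot and incremental backup items above more concrete, here is a minimal Python sketch of the idea. It is a toy model, not how ZFS is implemented: the class `ToyDataset` and all of its method names are hypothetical, blocks are plain byte strings, and deleted blocks are ignored. It only illustrates why a snapshot costs almost nothing to create and why the changes between two snapshots can be listed without scanning files.

```python
class ToyDataset:
    """Toy copy-on-write store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}     # block id -> data; existing entries are never overwritten
        self.live = {}       # live dataset: block number -> block id
        self.snapshots = {}  # snapshot name -> frozen copy of the pointer map
        self._next_id = 0

    def write(self, block_no, data):
        # Copy-on-write: allocate a new block instead of overwriting the old one.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.live[block_no] = block_id

    def snapshot(self, name):
        # A snapshot copies only the pointers, never the data itself.
        self.snapshots[name] = dict(self.live)

    def rollback(self, name):
        # Rolling back restores the old pointers; no data has to be copied.
        self.live = dict(self.snapshots[name])

    def read(self, block_no):
        return self.blocks[self.live[block_no]]

    def changed_between(self, older, newer):
        # Blocks whose pointer differs between two snapshots: this is all
        # an incremental transfer would need to carry.
        old, new = self.snapshots[older], self.snapshots[newer]
        return {no: self.blocks[bid] for no, bid in new.items() if old.get(no) != bid}


ds = ToyDataset()
ds.write(0, b"index.html v1")
ds.write(1, b"logo.png")
ds.snapshot("snap1")
ds.write(0, b"index.html v2")
ds.snapshot("snap2")

print(ds.changed_between("snap1", "snap2"))  # {0: b'index.html v2'}
ds.rollback("snap1")
print(ds.read(0))  # b'index.html v1'
```

Only the one block that changed shows up in the difference between the two snapshots, and the rollback is a pointer swap regardless of how much data changed.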
Advantages of Using ZFS
- It greatly simplifies administration of data storage. Let's take a practical example. In a traditional setup, multiple partitions or disks are used to logically split data. An administrator might create one disk/partition for storing image files uploaded through a website and another for storing the database. They have to estimate how much space to allocate to each, guessing which one will store more data, how to accommodate future growth, etc. When the time comes to expand disks, either a reboot is required, which causes downtime for the services running on that instance, or the administrator has to come up with complicated workarounds to migrate data to larger disks without temporarily shutting down the website. Extending partitions that don't have enough contiguous space available is another complication. With ZFS, we eliminate many of these problems. We create a storage pool with a few simple commands, then we create two datasets: one for storing the image files, one for the database. There is no need to preallocate specific sizes for these datasets, so we don't have to estimate how much storage each one will require. As long as the storage pool has free space available, each dataset can grow at its own rate. We can, however, limit growth, if there's a reason to do so, by setting quotas, which define the maximum disk space a dataset can use. The storage pool can be expanded later by adding more disks or by replacing older, smaller disks with new, larger ones (the second method is preferred, because the first can reduce efficiency and performance). All of these actions can be executed without rebooting the machine or pausing services that read from or write to ZFS datasets.
- You can take snapshots of datasets. A snapshot "freezes" the state of the dataset at a point in time and keeps it on the system, unchanged, for as long as you require it. The origin dataset continues to operate as normal and, initially, no additional disk space is required to keep the frozen dataset; space is consumed only as the origin diverges from the snapshot. Snapshots are useful for taking consistent backups, because data cannot change while you are copying it somewhere else. They also provide a safe point you can return to (rollback). For example, you can take a snapshot before risky operations such as cleaning up or optimizing a database. If something goes wrong, you can quickly restore the dataset containing your database. Because of the way ZFS is built (copy-on-write), rolling back is fast, even if you have to revert hundreds of GB of changes.
- Snapshots are very efficient and don't require preallocating space for storing changes (as is the case with LVM). There's also virtually no limit to the number of snapshots you can create and no noticeable performance degradation even when a large number of these are created. This allows freedom to implement extreme use-cases, such as taking a snapshot of a filesystem every minute (and also backing up off-site if required).
- High-frequency, off-site backups, that could only take seconds to upload, are made possible. Snapshot replication is very efficient because, unlike other tools, ZFS doesn't need to waste time first scanning data and comparing what has changed. Simply put, a tool like rsync would have to scan, file by file, and look for changes, then also write, file by file, on the destination server, which besides taking more time, results in more I/O operations on disks. ZFS just "knows" that "these data blocks here are what changed between snapshot1 and snapshot2". Copying raw data, instead of creating file by file, results in less I/O.
- Clones can be used to save disk space on very similar objects or to work on multiple versions of a product at the same time. Example: you can snapshot an entire website (code files, images, resources, database, etc.). Afterwards, you can create 10 clones, which will initially use no additional disk space (except for metadata, but that is negligible). Only the changes made to each clone are stored (the differences between origin and clone). Then, you can have different teams work on improvements, each with its own approach. They all have their own isolated environments (clones), so you can easily compare results at the end of the project. After testing each clone, you can promote the most successful one to take over the original dataset. Of course, clones can be used in many other creative ways.
- When storage pools are set up with redundancy (mirrored devices or devices set up in Raid-Z arrays), ZFS uses very sophisticated and efficient algorithms for ensuring data integrity and is also able to self-heal when errors are detected.
- In certain setups, reads and writes can be distributed, in parallel, across multiple storage devices, speeding up these operations.
- Data can be compressed on the fly, speeding up reads and writes and saving disk space. This should be activated on every dataset except those that store data that is already compressed. For example, there is no point in turning on compression when storing video files, zip archives or jpeg images, since all of these are already compressed.
- Adaptive Replacement Cache (ARC) is much better than the default Least Recently Used (LRU) caching mechanism that operates when we use the ext4 filesystem. Files that we read from or write to storage are cached (copied) in random access memory (RAM), which is much faster than hard drives or SSDs. This greatly speeds up subsequent reads of the same data, since the system bypasses the physical device and gets the data directly from memory. LRU's problem is that it's very rudimentary: it caches the most recently read file unconditionally, potentially evicting (deleting) useful files from memory. For example, we might have useful files in the cache when we download a 7GB file. If we only have about 8GB of RAM, nearly all of the previously cached data will be evicted to make room for the downloaded file. This is detrimental, because the previous cache contained files that the system reads often, while the new cache contains a file that we may not need in the near future. The ARC is much smarter and doesn't evict files that it sees are being accessed often.
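The cache behaviour described above can be sketched in a few lines of Python. This is a simplified illustration, not the real ARC algorithm: `LRUCache` is a plain least-recently-used cache, while `TwoListCache` is a hypothetical stand-in that keeps items seen once and items seen repeatedly in separate lists. The real ARC also adapts the split between the two lists at run time, which is omitted here. The key names and the capacity of 8 entries are made up for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Plain LRU: every access moves the key to the most-recent end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        if key in self.items:
            self.items.move_to_end(key)
        else:
            self.items[key] = True
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # evict the least recently used key

    def __contains__(self, key):
        return key in self.items


class TwoListCache:
    """Simplified stand-in for the ARC idea (NOT the real algorithm):
    keys seen once live in `recent`; keys seen again are promoted to
    `frequent`, which one-off reads cannot evict."""

    def __init__(self, capacity):
        self.half = capacity // 2
        self.recent = OrderedDict()
        self.frequent = OrderedDict()

    def access(self, key):
        if key in self.frequent:
            self.frequent.move_to_end(key)
        elif key in self.recent:
            del self.recent[key]
            self.frequent[key] = True
            if len(self.frequent) > self.half:
                self.frequent.popitem(last=False)
        else:
            self.recent[key] = True
            if len(self.recent) > self.half:
                self.recent.popitem(last=False)

    def __contains__(self, key):
        return key in self.recent or key in self.frequent


def run(cache):
    # Two "hot" files are read repeatedly...
    for _ in range(3):
        for hot in ("config", "index"):
            cache.access(hot)
    # ...then one large download passes through the cache exactly once.
    for i in range(8):
        cache.access(f"download-{i}")
    return cache


lru = run(LRUCache(8))
two = run(TwoListCache(8))
print("config" in lru)  # False: the download pushed the hot file out
print("config" in two)  # True: the frequently used file survived
```

The one-off download fills the plain LRU cache completely, while the two-list variant sacrifices only the "seen once" half of its space.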
Structure of ZFS Pool
The basic building blocks for ZFS are the physical storage devices. These are added to various types of vdevs. Although partitions and even files can also be used to back data, it's strongly recommended to use whole disks in ZFS structures to get the best performance and reliability. Creating vdevs out of files should only be used to test and experiment.
Here are some of the components that go into making a ZFS storage pool:
pool -- One or more vdevs build a storage pool. You can think of it as a virtual large disk made up of all the virtual devices added to it.
vdev (virtual device) -- One or more physical storage devices grouped together build a vdev. There are multiple types of virtual devices:
disk -- When the vdev consists of a single storage device. Multiple one disk vdevs can be added to a single pool. This will increase pool capacity and make ZFS stripe (split and spread) data across all of its vdevs, therefore reading and writing will be faster. To illustrate, in a pool consisting of 4 disk vdevs, each having the same capacity, when you write 4GB of data, this will be split (striped) in four equal parts and 1GB will be sent to each disk at the same time.
mirror -- Made from two or more disks. In this type of vdev, ZFS keeps data mirrored (identical) on all of the physical devices it contains. You can add multiple mirror vdevs to a pool to stripe data across them, in a similar fashion to the scenario described earlier. Example: you add disk1 and disk2 to mirror1, disk3 and disk4 to mirror2. Writing 4GB of data will send 2GB to mirror1 and 2GB to mirror2. You can lose disk1 and disk3 and still recover all data. But if you lose all disks in the same mirror (disk1 and disk2), then the whole pool is lost.
Raid-Z -- Two or more devices can be added to this type of vdev, and data can be recovered if sufficient redundant information, called parity, survives. Raid-Z1 dedicates the equivalent of one disk's capacity to parity, Raid-Z2 two, and Raid-Z3 three. In a Raid-Z1 vdev, we can lose only one device and still rebuild the data; in a Raid-Z2 vdev, we can lose up to two.
log -- The ZFS Intent Log (ZIL) is normally stored on the same devices as the rest of the data, but a Separate ZFS Intent Log (SLOG) can be configured. When hard disks are used to store data, a log vdev backed by faster storage media, such as SSDs, can help increase performance and reliability. Here's an overly simplified illustration of how it works. Most writes are delayed so that the system can do some optimizations before physically storing data. Other writes are considered urgent, so a program can instruct the system to flush data as fast as possible. A database server can say: "Write this customer data now since it's important and we want to make sure we don't lose it". This is called a synchronous write. Without a SLOG, ZFS sends the ZIL to the hard disks, and the data in the ZIL is later organized and written to its final destination. At the same time, other write jobs may be active on the same disk, and mechanical hard drives are inefficient when writing to multiple locations at once, which can degrade performance. With a SLOG, ZIL data can be sent to an SSD while the hard disk completes other jobs, resulting in faster response times and higher throughput when dealing with a lot of synchronous writes. Furthermore, because the SLOG device is faster, chances are higher that it will finish capturing a synchronous write before a power failure. When the system reboots, the ZIL can be replayed, and the data that reached the SLOG but didn't have time to go to the hard disk can now be stored properly. This helps eliminate data loss, or at least minimize it, when write operations are abruptly interrupted. And, as mentioned, operations are atomic, meaning that a "half-write" will be discarded, so you don't get inconsistent data or corruption. In the worst case, you will have older data on disk (what was written in the last few seconds before power loss won't be available).
cache -- A vdev that can be used to cache frequently accessed data. Only makes sense to be used when the cache device is faster than the devices used to store data. Example: create a cache vdev on an SSD when your storage pool is made out of (slower) mechanical hard-disks. If you already use SSDs in your storage pools, then a cache device doesn't help. In the cloud, this hybrid structure can be used to optimize costs: build your pools out of cheaper storage devices and add a more expensive, faster device with more IOPS for the cache vdev.
spare -- In this vdev, you add a device that you want to designate as a replacement for any drive that fails in your ZFS array. By default, you have to manually use a command to replace a failed drive with your spare but steps can be taken to instruct ZFS to automatically do so when needed.
dataset -- This can be a ZFS filesystem, snapshot, clone or volume.
volume -- The storage pool, in its entirety or in part, can be used as one or more volumes. These are virtual block devices that can be used just like real partitions or disks. You can think of a volume as a ZFS partition that is not formatted with the ZFS filesystem. It's useful when you need the advantages offered by ZFS (a volume can still be snapshotted, cloned, etc.) but you have to use a different filesystem on the storage pool or just need raw (virtual) devices to feed to other applications, for example as virtual disks for virtual machines.
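The striping and mirroring examples from the disk and mirror entries above can be checked with a short Python sketch. It is an idealised model under stated assumptions: the functions `stripe` and `pool_survives` are hypothetical helpers, and the even split ignores the fact that real allocation also depends on things like how much free space each vdev has.

```python
def stripe(write_gb, vdevs):
    """Split a write evenly across top-level vdevs (idealised assumption)."""
    share = write_gb / len(vdevs)
    return {name: share for name in vdevs}


def pool_survives(mirrors, failed):
    """A pool of mirror vdevs survives while every mirror keeps
    at least one working disk."""
    return all(
        any(disk not in failed for disk in disks)
        for disks in mirrors.values()
    )


# Four single-disk vdevs: a 4GB write sends 1GB to each disk.
print(stripe(4, ["disk1", "disk2", "disk3", "disk4"]))
# {'disk1': 1.0, 'disk2': 1.0, 'disk3': 1.0, 'disk4': 1.0}

# Two mirror vdevs: the same write sends 2GB to each mirror.
mirrors = {"mirror1": ["disk1", "disk2"], "mirror2": ["disk3", "disk4"]}
print(stripe(4, list(mirrors)))  # {'mirror1': 2.0, 'mirror2': 2.0}

# Losing one disk in each mirror is survivable...
print(pool_survives(mirrors, {"disk1", "disk3"}))  # True
# ...but losing both disks of the same mirror loses the pool.
print(pool_survives(mirrors, {"disk1", "disk2"}))  # False
```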
How to Choose the Right Type of ZFS vdev
- Do you need redundancy? ZFS has a strong focus on being extremely reliable and needs redundant vdevs to be able to recover from various types of data corruption caused by device failures. But it gives users the option to run less reliable, non-redundant structures (made of disk vdevs), if they don't mind the risks. Keep in mind that the more disks you add, the higher the risk. To illustrate, compare a ZFS pool consisting of a single hard disk with a pool consisting of 1000 disks. The first may well work for years without failing, while in the second it's almost certain that at least one of so many devices will fail, if not more. And if one device fails in a non-redundant pool, even if it represents a small piece of the structure, the entire pool becomes compromised because of the way data is striped across all devices. A 1000MB file may have 1MB of data stored on each of those disks. Even if only one device fails, and we still have 999MB of data available, the missing 1MB is irrecoverable and we are also missing important metadata (data about data) that helps ZFS connect the dots. You should only use non-redundant arrays if you have a good off-site backup scheme in place or the data that you are storing is easily recoverable (e.g. software downloads, games, etc.). Alibaba Cloud Disks are already redundant, so the risks are greatly reduced, but another layer of protection can be useful. It's up to you to decide the acceptable level of risk for your infrastructure. Whatever type of vdev you choose, don't forget that off-site backups (preferably to a different datacenter zone) are always a good strategy to prevent data loss.
Replication in the form of mirrors, parity used by Raid-Z, Alibaba's own measures to protect your data, are all excellent factors that increase reliability under normal circumstances but you have to think about unexpected situations too, such as a hacker/malware/ransomware that overwrites all of your data, or accidents caused by administrators that overwrite important files. In such situations, even with redundancy, all of these changes would be replicated across all devices, so you still have data, but bad data, and no way to revert changes. But an off-site incremental or differential backup protects you against these scenarios.
- How much redundancy can you afford? Say you need to store 10 terabytes of information. You would have to pay for 20 terabytes of space to create a mirror with two devices, where you can lose one of them and still recover. And while you pay for 20 terabytes, you can only use 10. If you want to be able to recover after losing 2 devices, you would have to pay for 30 terabytes of space, and so on. Raid-Z can help here. In Raid-Z2, for example, if you buy 12 one-terabyte disks, you are able to store 10 terabytes of data and only 2 terabytes are used for redundancy, instead of the 10 used by the two-device mirror. And you can still lose two devices and recover. But although it can save money and resources, Raid-Z has other costs: recovery (resilvering) takes longer than with mirrors and puts more stress on the devices, read performance is slightly lower, and you can't add or remove disks to/from a Raid-Z vdev after it has been created. Every vdev type comes with its own advantages and disadvantages, so you will have to decide what works best for you.
- Do you prefer speed over reliability or vice versa? Is read speed more important than write speed or vice versa? Example: adding more devices to a mirror vdev doesn't increase write throughput but it does increase read bandwidth and data reliability. Adding more regular disk vdevs increases both read/write speeds but decreases reliability. Raid-Z has better write performance than mirrors but it's slower on reads.
- Flexibility: certain types of vdevs can be more flexible than others. You can add devices to a mirror vdev or remove them (if more than 2 devices are used). But the number of devices in a Raid-Z vdev cannot be changed after creation.
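The numbers used in the questions above can be reproduced with a little arithmetic. The following Python sketch is only an approximation: all three function names are hypothetical, the Raid-Z figure ignores padding and metadata overhead, and the 2% chance of a single disk failing is an assumed value chosen purely for illustration (it also assumes disks fail independently of each other).

```python
def mirror_raw_tb(usable_tb, copies):
    """Raw space to buy for a mirror that keeps `copies` copies of the data."""
    return usable_tb * copies


def raidz_usable_tb(disks, disk_tb, parity):
    """Usable space of one Raid-Z vdev (idealised: ignores padding and metadata)."""
    return (disks - parity) * disk_tb


def p_any_failure(disks, p_single):
    """Probability that at least one of `disks` independent devices fails."""
    return 1 - (1 - p_single) ** disks


# Storing 10 TB on mirrors: 20 TB to survive one failure, 30 TB to survive two.
print(mirror_raw_tb(10, 2))  # 20
print(mirror_raw_tb(10, 3))  # 30

# Raid-Z2 with 12 one-terabyte disks: 10 TB usable, 2 TB of parity.
print(raidz_usable_tb(12, 1, 2))  # 10

# Assumed 2% failure chance per disk: 1 disk versus 1000 disks.
print(round(p_any_failure(1, 0.02), 4))     # 0.02
print(round(p_any_failure(1000, 0.02), 4))  # 1.0
```

With the assumed failure rate, a single-disk pool will very likely be fine, while in a non-redundant pool of 1000 disks a failure somewhere is practically guaranteed.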
In the next tutorial, Use Your Storage Space More Effectively with ZFS - Exploring vdevs, we will configure ZFS on an Ubuntu instance, create a pool and learn how to use each type of vdev.