ZFS Without Tears

Introduction

What is ZFS?

ZFS is a combined file system and volume manager that actively protects your data. It replaces lots of software you may be using such as LVM, RAID, and backup applications.

What is this document?

This is a set of notes I use to remember how to manage my ZFS configuration. You can find dozens of other ZFS tutorials and manuals. Most of them are better. Mine is shorter because it's far from comprehensive.

Traditional storage system organization

RAID levels are used to describe how an array of devices can be managed as one big storage resource. To understand ZFS, we only need a few:

Redundancy mechanisms

A stripe has no redundancy. Storage for each file will be spread out across all the devices. If any device fails, the entire file system on the array will be compromised. Stripes are used to gain speed: All the devices work in parallel to transfer one file.

In a mirror with N devices, N-1 devices can fail and the data can still be recovered. All the devices are the same size, so the capacity of the array is the size of one device. Write performance is roughly that of a single device; reads can be somewhat faster because they can be spread across the devices.

A parity array is more complex: The data is striped along with parity information. In RAID 5, data can be recovered if one device fails. In RAID 6, two devices may fail. Parity information takes up space: If all the devices in a RAID 5 array are the same size "X", one X will be required for the distributed parity information. For RAID 6, 2X will be consumed for parity. A parity array is slower than a mirror, but it provides more storage space with the same number of drives.

RAID 10 is used to get the redundancy of a mirror with more performance: It's created from an N-stripe of M-mirrors. By fussing with N and M, safety, size, and speed can be balanced for a given application.

ZFS storage systems

ZFS supports all of the traditional organizations described above: stripes, mirrors, and parity (raidz) arrays.

A raidz array can have up to three levels of parity. The levels are named:

raidz1 (or simply raidz)
raidz2
raidz3

Understanding ZFS

The concepts

The big picture

Virtual Devices

Simple vdevs are disk drives, partitions, or files:

/dev/sdb
/dev/sdb1
/home/myself/myGreatBigFile

Complex vdevs are specified using an expression:

<type> device1 ... deviceN

The type is one of seven keywords taken from one of two groups:

Storage vdevs:

mirror, raidz (same as raidz1), raidz2, raidz3

Optimization vdevs:

log, cache, spare

Storage vdevs, as the name suggests, are where you get space for your files. Optimization vdevs are used to optimize pool performance or reliability. They are optional.

In some contexts, zfs commands require the use of simple vdevs. We will denote these as devices, reserving the term vdev for contexts where a complex or a simple device may be specified.

Pools

Pools are the top-level zfs construct for managing a collection of virtual devices. A pool is created by specifying a stripe of vdevs. Space for datasets is allocated dynamically from all the storage vdevs in the pool.

Pool design

Each top-level vdev in a pool is allowed to be a different type and/or size, but this is seldom (if ever) a good idea. The most common redundant pool organizations are:

a stripe of mirrors (the ZFS equivalent of RAID 10)
a stripe of raidz1, raidz2, or raidz3 vdevs

Compound vdevs

You cannot create mirrors of raid arrays or raid arrays of mirrors, etc. Only the types listed in the previous section are allowed (for now.)

Datasets

There are four types:

file systems
volumes
snapshots
clones

ZFS File systems

By default, when you create a new dataset, ZFS also creates a filesystem which manages space automatically. When you create a file in that filesystem, it is likely that bits of it are stored on every storage vdev in the pool. When you delete a file, the blocks are returned to the pool.

Volumes

A volume, also called a zvol, is a large fixed-size dataset that you can format with any filesystem you prefer. Zvols are often used as disk drives for virtual machines. When a virtual machine is no longer needed, the volume can be destroyed, returning all the space to the pool.

Snapshots

A snapshot is conceptually a read-only copy of another parent dataset. Snapshots are implemented in such a way that they can be created nearly instantaneously and take very little space. A snapshot depends on the continued existence of the parent.

Clones

A clone is a snapshot you can modify. Clones grow as they are modified. Like snapshots, they depend on their parent.

A good policy to follow: Never create a clone you don't intend to destroy or promote within a few days. If you actively use a clone, it will diverge until it consumes as much space as the parent, with the added disadvantage that you can never destroy the parent.

I mention this caveat because many beginners think it would be cool to have a pristine installation of Windows or Linux that gets cloned for use by virtual machines. It doesn't end well.


Preparation for ZFS

Installation

ZFS is an out-of-tree kernel module that dkms rebuilds automatically whenever your kernel is updated. Consequently, it depends on the kernel-devel package.

Install kernel-devel:

yum install kernel-devel

Install a link to the zfs package repository:

yum localinstall --nogpgcheck \
    http://archive.zfsonlinux.org/fedora/zfs-release$(rpm \
    -E %dist).noarch.rpm

Install zfs:

yum install zfs

Partitioning drives for ZFS

THIS NEEDS REVISION. (There is a lot of misinformation on the web about this topic, perhaps because the behavior of ZFS-on-Linux has changed.)

At one time, the standard advice was to build pools from unpartitioned devices. There are many discussions on the web about the merits of this convention. But it's no longer true: ZFS now partitions unpartitioned devices automatically. I believe this is the reason:

Drives sold by different manufacturers with the same nominal size may have slightly different physical sizes. If you need to replace a device in an existing pool, the new device must not be smaller. If it is even one byte smaller, it cannot be used. By introducing a "spacer" partition, ZFS can adjust the size of the ZFS data partition so the new disk size will match the old ones.

To let ZFS have its way, it's best to create a pool with drives that have no partitions. When you create a pool, zfs will create two partitions:

/dev/sdx1   : The zfs data partition
/dev/sdx9   : The spacer partition

If you intend to boot a UEFI operating system on a ZFS device, you will have to create partitions by hand. This is covered in "Reliably boot Fedora with root on ZFS."

To clean up an old disk that might have been part of a zfs array, you should first use gdisk to list the partitions. Common partition types used for zfs are:

FreeBSD ZFS (0xa504)
TBD

If you see one of these types, exit gdisk and clear the ZFS label:

zpool labelclear -f /dev/sdxN

After that, use gdisk to remove all the partitions.

Limiting memory usage

ZFS wants a lot of memory. The recommended minimum is 1G per terabyte of storage in your pool(s). Without constraints, ZFS is likely to run off with all your memory and sell it at a pawn shop. To prevent this, you should specify a memory limit. This is done using a module parameter. A reasonable setting is half your total memory:

Edit or create:

/etc/modprobe.d/zfs.conf

Add this line:

options zfs zfs_arc_max=17179869184

The size is in bytes; powers of 2 are customary:

64GB  = 68719476736
32GB  = 34359738368
16GB  = 17179869184
8GB   = 8589934592
4GB   = 4294967296
2GB   = 2147483648
1GB   = 1073741824
512MB = 536870912
256MB = 268435456
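
After a reboot (or after reloading the module), you can confirm that the limit took effect. On my systems the value shows up in these two places:

cat /sys/module/zfs/parameters/zfs_arc_max
grep c_max /proc/spl/kstat/zfs/arcstats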

Using ZFS without ECC memory

ZFS "wants" you to use ECC memory, which is typically only available on server class motherboards with Intel Xeon processors.

If you don't use ECC memory, you are taking a risk. Just how big a risk is the subject of considerable controversy and beyond the scope of these notes. Feelings run high on this topic:

"When you run a NAS without ECC and with software RAID, people tell legends of your idiocy which survive the ages. Statues will be built to your folly. Children centuries in the future will read of your pure, unbridled stupidity as something to herd their tender aspirations away from the path you once trod."
- Hat Monster

First, give yourself a fighting chance by testing your memory. Obtain a copy of this utility and run a 24-hour test:

http://www.memtest86.com

This is particularly important on a new server, because defective memory often arrives defective from the factory. If your memory passes the test, there is a good chance it will be reliable for some time.


They didn't use ECC memory...

You might be tempted to keep your server in a 1-meter thick lead vault buried 45 miles underground. It turns out that many of the radioactive sources for memory-damaging particles are already in the ceramics used to package integrated circuits. The cost and effort would likely be wasted.

Instead, we're going to enable the unsupported ZFS_DEBUG_MODIFY flag. This will mitigate, but not eliminate, the risk of using ordinary memory.

Edit:

/etc/modprobe.d/zfs.conf

Add the line:

options zfs zfs_flags=0x10

Reboot to be sure it "takes".

You can see the current value here:

cat /sys/module/zfs/parameters/zfs_flags

Command line operations

The rest of this guide describes only two commands:

zpool : creates and configures pools
zfs   : creates and configures datasets


ZPOOL Commands

The zpool command is used to create and configure pools.

The big picture

Creating a pool

The basic pattern:

zpool create [options] myPool vdev1 vdev2 ... vdevN

This expression creates a stripe. Blocks for datasets will be allocated across all the vdevs. If each vdev is a simple device, the pool will be vulnerable: If one device fails, the whole pool is lost. Consequently, the vdevs are usually mirrors or raidz arrays.

There are many options, but I want to mention one right now because it is irreversible if you get it wrong:

-o ashift=12

This specifies that your disk is Advanced Format, which is the same as saying it has 4096 byte sectors instead of the old 512 byte sectors. Most disks made after 2011 are advanced format so you'll need this option most of the time. If you forget, ZFS assumes the sector size is 512. If that's the wrong answer, you'll take a big performance hit. More details about this are covered later. I won't show this option in the examples, because it clutters up the logic. But don't forget!
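
If you want to check what an existing pool ended up with, zdb can usually show the ashift recorded for each vdev (the exact output varies with the zfs version):

zdb -C myPool | grep ashift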

Adding vdevs

You can add more vdevs to a pool anytime:

zpool add myPool vdev1 ... vdevN

Removing vdevs

You can only remove log, cache or spare vdevs:

zpool remove myPool aVdev

Storage allocation

Blocks for datasets are allocated evenly across all the top-level storage vdevs. This is why you can't remove a top-level storage vdev.

Simple vdevs (devices)

Simple vdevs are specified using path notation. ZFS puts "/dev/" in front of a path name unless the path begins with "/".

You can specify devices like this:

sdb sdc sdd sde

Or using the full path:

/dev/sdb /dev/sdc /dev/sdd /dev/sde

File vdevs must have absolute paths:

/home/aFile /var/temp/goop

File vdevs are used mostly (always?) for experimentation.

Create a stripe of two devices

zpool create myPool sdb sdc

Create a mirror of two devices

zpool create myPool mirror sdb sdc

Create a raidz1 of four devices

zpool create myPool raidz sdb sdc sdd sde

Create a stripe of two mirrors (RAID 10)

zpool create myPool mirror sdb sdc mirror sdd sde

Create a stripe of two raidz1 arrays

zpool create myPool raidz sda sdb sdc raidz sdd sde sdf

Create a stripe of files

This kind of pool is used to experiment without using disk drives. An example:

Create some empty files. (The minimum size of a file vdev is 64M.)

dd if=/dev/zero of=myFile1 bs=1M count=64
dd if=/dev/zero of=myFile2 bs=1M count=64

Create the pool:

zpool create myPool ~/myFile1 ~/myFile2

The paths must be absolute (the shell expands ~ to an absolute path here).

Creating a mirror by attaching a device

We start with a pool that has one device:

zpool create myPool sdb

Then we attach another, creating a mirror:

zpool attach myPool sdb sdc

We can add another, continuing from the previous example:

zpool attach myPool sdc sdd

You specify an existing device in the mirror followed by the new one.

Decreasing the level of a mirror

Continuing from the previous example, we remove the "end" device:

zpool detach myPool sdd

Now the mirror has only sdb and sdc.

Demoting a mirror to a simple storage vdev

Continuing from the previous example:

zpool detach myPool sdc

Now only sdb is left.

Mounting a pool

By default a new pool is mounted at the root of the file system where it appears as a directory named after the pool.

You can specify an alternative mount point for the pool when creating:

zpool create -m aPath myPool ...

In this expression, aPath is a regular file system path or the special keyword none. If none is specified, the pool will not be mounted. The last element of the path names the actual mountpoint, not just the parent directory, so you can use any name you like for the mountpoint. If an empty directory already exists at that location, zfs will mount the pool on that directory. Otherwise, a new mount point is created.

You can set or change the mountpoint of a pool anytime later:

zfs set mountpoint=aPath myPool

If the pool was already mounted, the old mount point will be removed.

Note that we are using the command zfs instead of zpool. That's because the pool itself is a dataset, which will be discussed in more detail later.

Adding a hot spare drive

You can add one or more spares at any time:

zpool add myPool spare /dev/sdf

Spares can be used with any kind of pool.

Pitfalls when adding top-level vdevs

A chain is only as strong as its weakest link. Suppose you have a raidz pool:

zpool create myPool raidz sdb sdc

Now you add a device:

zpool add myPool sdd

  1. You have added a top-level vdev that is a simple device.
  2. You have not added a device to the raidz vdev.
  3. The pool will be irreparably damaged if anything goes wrong with sdd.

The raidz is spoiled because blocks for files are allocated from all the top-level vdevs: some from the raidz vdev and some from the lone device sdd.

Show the status of a pool

zpool status

Show data statistics

For the whole pool

zpool iostat

Also individual vdevs

zpool iostat -v

Continuous monitor every 5 seconds

zpool iostat -v 5

Show the history of a pool

zpool history myPool

Destroy a pool

zpool destroy myPool

Physically move a pool to another machine

After exporting a pool, you can remove the devices and install them in another computer:

zpool export myPool

Exporting a pool unmounts all the filesystems and "offlines" all the devices, so it is sometimes used in other situations when you want to stop all access to the pool.

On the destination machine, you resurrect the pool by importing:

zpool import myPool

After the import, all mountable datasets and filesystems will be mounted in their original locations. Sometimes this is inconvenient. You can arrange for all the new mountpoints to be placed under a different path using the altroot property:

zpool import -o altroot=/my/new/path myPool

If you export a pool created from file vdevs, zfs has no way to find the files again on its own. To import such a pool, you must specify the directory that contains them, like this:

zpool import -d ~/zfsPlayroom myPool

Rename a pool

First export the pool:

zpool export myPool

Then import it and specify a new name:

zpool import myPool myNewName

Common pool properties

These can be specified when importing a pool using -o or by using the set command. The default value is indicated first when there are alternatives:

altroot=path
readonly=off | on
autoreplace=off | on
ashift=12

Example:

zpool set autoreplace=on myPool

(The readonly property can only be set at import time, using -o.)

Pool information properties

These are read-only:

    health
    size
    free
    capacity
    allocated

free + allocated  = size
health = ONLINE | DEGRADED | FAULTED | OFFLINE | REMOVED | UNAVAIL
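
You can read them with zpool get, or see most of them in the zpool list output:

zpool get size,allocated,free,capacity,health myPool
zpool list myPool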

Using device ids

When you create a pool, it's easy to type and remember device names. These are the familiar "/dev/sdx" names. But there is a serious problem with device names: For scsi or sata disks, they are associated with the port where you plugged in the cable. And for USB disks and other removable devices, they are determined by the order they were plugged in. If you import a pool with the cables switched, disaster follows. And if you move the disks to another machine and try import them, there will likely be device name conflicts.

To avoid all this, it's better to switch to names that are associated with the volume (the physical disk) rather than the connection. Linux provides several choices, listed in the /dev/disk directory. Only two of them are really useful: device IDs and UUIDs.

Device IDs are composed from the disk model number and serial number. When I have to replace a disk, I can be sure I got the right one because the serial number is printed on the paper label and "zpool status" will show the ID.
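
To see the device IDs on your system before making the switch, just list the directory:

ls -l /dev/disk/by-id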

To make the switch to device ids:

zpool export myPool
zpool import -d /dev/disk/by-id myPool

You can also use the very fashionable UUIDs:

zpool export myPool
zpool import -d /dev/disk/by-uuid myPool

The good thing about UUIDs (and device IDs) is that they aren't optional: a disk always has both. The bad thing is that they are far too long and complex to type or remember. That's why I usually partition and assemble disks using device names and then switch to device IDs.

You can switch back to the old device names like this:

zpool export myPool
zpool import -d /dev myPool

You can switch device names anytime. They become part of the data structures on disk the next time you export the pool or shut down.

Using device aliases

If you have a lot of drives in many different racks, locating a drive given the device ID isn't easy. ZFS provides a way to assign your own alias names. You specify alias names in this file:

/etc/zfs/vdev_id.conf

In the following example, we have two 4-bay enclosures that contain pools "mrpool" and "mrback."

Run "zpool status" and capture the output in a prototype vdev_id.conf file. Edit the file so it looks like this:

alias r1c1_7TNO ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2YR7TN0-part1
alias r1c2_8PT2 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E4FP8PT2-part1
alias r1c3_VNSR ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E7ADVNSR-part1
alias r1c4_FNN3 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E3FLFNN3-part1

alias r2c1_4CX4 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1XS4CXA-part1
alias r2c2_AXC9 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1EFAXC9-part1
alias r2c3_3Y4P ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5KJ3Y4P-part1
alias r2c4_3C3J ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5KJ3C3J-part1

In this example, the short name was made by combining location in the box (row/col) with the last four characters of the device id, which is part of the drive serial number printed on the disk label. If a drive is reported as defective, you have the location and a positive id you can physically see.

After creating vdev_id.conf, run:

udevadm trigger

Now you can see your new alias shortcuts in:

/dev/disk/by-vdev

When you confirm that all is well in /dev/disk/by-vdev, you can rename the drives in your pools:

zpool export mrpool
zpool import -d /dev/disk/by-vdev mrpool

Now when you run "zpool status" you'll see the alias names.

These aliases can be configured and used before a pool even exists. You can use the alias names when creating a pool, and this is a good way to make sure you're using the correct devices. Always remember that linux device names e.g. "/dev/sdx" can change after rebooting even if you didn't deliberately add or remove anything. They can, for example, depend on the order the drives power up. Or if a device fails to start, all subsequent names in the scan order will change.
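
For example, here is a sketch of creating a pool directly from the alias names defined above (the aliases point at the -part1 partitions listed in vdev_id.conf; the raidz level and ashift are only illustrations):

zpool create -o ashift=12 mrpool raidz2 \
    /dev/disk/by-vdev/r1c1_7TNO /dev/disk/by-vdev/r1c2_8PT2 \
    /dev/disk/by-vdev/r1c3_VNSR /dev/disk/by-vdev/r1c4_FNN3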

Adding a log device

By default, the mysterious "ZIL" (ZFS intent log) is created in the storage pool. A separate log device can be specified to improve performance. This can be done when a pool is created:

zpool create fastPool /dev/sdb3 log /dev/myOtherSSD

Or it can be added later:

zpool add fastPool log /dev/myOtherSSD

Log vdevs can be redundant:

zpool add fastPool log mirror /dev/ssd1 /dev/ssd2

You can attach a device to a mirrored log vdev to increase the level:

zpool attach fastPool /dev/ssd2 /dev/sdd3

You can also attach a device to a non-mirrored log vdev to create a mirror.

Adding a cache device

ZFS always uses the memory-based "ARC" cache. It is possible to add one or more secondary cache vdevs. In ZFS-speak, this is called an L2ARC. These are usually fast devices such as SSDs.

Cache devices can be specified when a pool is created:

zpool create fastPool /dev/sdb3 cache /dev/mySSD

Or they can be added later:

zpool add fastPool cache /dev/mySSD

Cache vdevs cannot be redundant.

Scrubbing a pool

Scrubbing verifies all checksums and repairs any blocks that fail, using the pool's redundancy.

zpool scrub myPool

Although the command returns immediately, the scrub itself may take many hours or even days to complete. Use "zpool status" to view the progress. You can stop a scrub using:

zpool scrub -s myPool
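
Scrubs are usually scheduled rather than run by hand. A minimal sketch using a cron.d entry (the schedule, path, and pool name are only examples):

# /etc/cron.d/zfs-scrub : scrub myPool early on the first of each month
30 3 1 * * root /usr/sbin/zpool scrub myPool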

Using the ZED monitor daemon

The zed daemon is part of the zfs installation and runs all the time. There is no explicit systemd enable/disable script.

The configuration file is:

/etc/zfs/zed.d/zed.rc

One essential change is to specify your email address by uncommenting the line:

ZED_EMAIL_ADDER="root"

You can change "root" to your own email address or forward messages to root in /etc/aliases.

To reload the configuration file after editing, you must send the daemon a HUP signal. First get the process id number for zed:

ps ax | grep zed

Then send the HUP signal:

kill -HUP <pid>

Zed can do lots of nifty things like replacing defective drives with hot spares automatically. Caution: It might be better to take care of backups before you swap in a spare and start resilvering.

Replacing defective devices

Mirror and raidz vdevs can be repaired when a device fails:

  1. Take the old device offline.
  2. Replace the device.
  3. Put the new device online.

As soon as the new device comes online, it will start resilvering - the data will be restored using other drives in the mirror or raidz. You can inspect the progress of resilvering using zpool status.

Taking a device offline

zpool offline myPool aDrive

Taking a device offline until next reboot (temporary)

zpool offline -t myPool aDrive

Replacing a device

zpool replace myPool oldDrive newDrive

The newDrive can be a hot spare or any unused drive.

If newDrive is not specified, the oldDrive will be used: That can occur when oldDrive is part of a mirror or raidz pool and it has been replaced by a new drive with the same device path.

ZFS can "tell" if a drive was part of a zfs pool. It will object if you try to add such a drive online in a new roll. To force the issue, add the "-f" option to the replace command.

Putting a device back online

zpool online myPool aDrive

Using autoreplace

If the pool property autoreplace is "on" and spare drive is part of the pool, the defective drive will be replaced by the spare and resilvering will start automatically. Using this option seems attractive, but frequently it is better to attend to your backups before starting a lengthy resilvering process.
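
A sketch of what that looks like, assuming an unused drive sdg is available for the spare:

zpool set autoreplace=on myPool
zpool add myPool spare sdg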

Replacing a device - example

I recently replaced a device in a raidz2 pool. Here are the details.

The pool device names are from /dev/disk/by-id. The bad disk (which still works, but simply has too many errors to make me comfortable) shows up at:

/dev/disk/by-id/ata-WDC...3C3J-part1

Currently, this is at device name:

/dev/sde

First, I preserve the partition structure:

sfdisk -d /dev/sde > oldmap.dat

The oldmap.dat file is plain text. Edit the file and note the line in the header that assigns a value to "label-id". Also note the expression that assigns a value to "uuid" in the last line of the file. Here is a sample:

label: gpt
label-id: B303994D-9CCE-4732-AA3F-B3755BFD2E69
device: /dev/sdh
unit: sectors
first-lba: 34
last-lba: 7814037134

/dev/sdh1 : start=        2048, size=  7759462400, \
    type=516E7CBA-6ECF-11D6-8FF8-00022D09712B, \
    uuid=B973915D-7169-4377-ACD2-58285E608D73, \
    name="FreeBSD ZFS"

In this example, I added the line continuation marks for illustration only. Now remove the entire line that assigns label-id and cut out the assignment expression for uuid so the file now looks like this:

    label: gpt
    device: /dev/sdh
    unit: sectors
    first-lba: 34
    last-lba: 7814037134

    /dev/sdh1 : start=        2048, size=  7759462400, \
            type=516E7CBA-6ECF-11D6-8FF8-00022D09712B, \
    name="FreeBSD ZFS"

As noted before, DO NOT use line continuations ("\") in the real file.

If your device is too clobbered to read the partition map, you could take the map from another device in the pool, provided they were all partitioned the same way.

Take the device offline with respect to zfs:

zpool offline mrPool /dev/disk/by-id/ata-WDC...3C3J-part1

Now shutdown the computer, remove the device, and put in the new one. Before you slide in the new device, note the last four letters of the serial number. In this case, they were:

0545

Boot up the computer and list the device id mapping to confirm that the new drive is still at /dev/sde:

ls -l /dev/disk/by-id | grep 0545

(It was.)

Now partition the new device like the old one using your edited map file:

sfdisk /dev/sde < oldmap.dat

The sfdisk utility will generate a new label-id and uuid.

List the device id again so you can copy the partition id to the clipboard to avoid typing it later:

ls -l /dev/disk/by-id | grep 0545 | grep part

Tell zfs to replace the old device with the new one:

zpool replace mrPool \
    /dev/disk/by-id/ata-WDC...3C3J-part1 \
    /dev/disk/by-id/ata-WDC...0545-part1

Confirm that resilvering has started:

zpool status mrPool

The status shows:

  state: DEGRADED
 status: One or more devices is currently being resilvered.  The pool will
         continue to function, possibly in a degraded state.
 action: Wait for the resilver to complete.
   scan: resilver in progress since Mon Aug 27 09:11:41 2018
         57.5G scanned out of 21.1T at 207M/s, 29h41m to go
         6.98G resilvered, 0.27% done
 ...

        replacing-7                      DEGRADED  0  0  0
          /dev/disk/by-id/...3C3J-part1  OFFLINE   0  0  0
          /dev/disk/by-id/...0545-part1  ONLINE    0  0  0  (resilvering)

Dealing with disk sector sizes

As we stated earlier, ZFS will operate a lot faster if it knows the physical sector size of your disk(s). If you're building a big storage array, it's worth spending a little time to get this right.

Unfortunately, ZFS cannot reliably detect the sector size. To make matters worse, many disks lie about their size. The history of this peculiar behavior is beyond the scope of this article.

If you take the trouble to find your disk on the manufacturer's web site, they will often reveal the physical sector size. Otherwise, they may state that the disk has the Advanced Format, which is marketing-speak for 4096 byte sectors. It might even be printed on the box.

If that's too much trouble, here's a heuristic that's likely to work. First, ask the disk. You can use fdisk:

fdisk /dev/sdx

Or the more impressive:

lsblk -o NAME,PHY-SEC

You'll get back either 512 or 4096. Now use this heuristic:

IF the reported sector size is 4096 THEN
    The true sector size is 4096.
ELSE IF the disk was made before 2010 THEN
    The true sector size is 512.
ELSE IF the disk was made after 2011 THEN
    The true sector size is probably 4096.
ELSE
    Do you really want to use this old P.O.S.?
END IF

The option to specify a sector size of 4096 is:

-o ashift=12

The "12" here is the power of 2 that makes 4096. We're all geeks in here you see.

Example: Creating a pool with specified sector size:

zpool create -o ashift=12 mrnas sdb1 sdc1 sdd1 sde1

This optimization is only effective if all the disks have the same sector size.

Maxims for pools


ZFS Commands

The zfs command is used to create and configure datasets.

The big picture

Dataset types

Creating datasets

A dataset path always begins with a pool name:

zfs create myPool/myDataset

Any hierarchy of datasets can be created:

zfs create myPool/myDataset/mySubset1/mySubset2

Each dataset to the right is called a descendant of the datasets to the left.

Mounting datasets

If the pool was created with the default mountpoint, datasets are automatically mounted under the pool with the same path used to create them.

Alternatively, you can specify a mount point when a dataset is created using a property:

zfs create -o mountpoint=aPath myPool/myDataset

In this expression, aPath is any filesystem path or one of the keywords none or legacy. If none is specified, the dataset will not be mounted. If legacy is specified, you can mount the dataset using the regular mount command:

mount -t zfs myPool/myDataset /mnt/here

Or in /etc/fstab:

myPool/myDataset  /mnt/myStuff   zfs   defaults   0 0

Or in /etc/auto.mount:

myStuff  -fstype=zfs  :myPool/myDataset

You can change the mountpoint of a dataset anytime:

zfs set mountpoint=aPath myPool/myDataset

The old mountpoint will be removed and a new one created.

To restore the default mountpoint behavior, first give the pool a mountpoint at the root:

zfs set mountpoint=/myPool myPool

Then change the mountpoints of each dataset you've created to be inherited:

zfs inherit mountpoint myPool/myDataset1
...
zfs inherit mountpoint myPool/myDatasetN

When switching back to inherited mountpoints, I found it necessary to delete the old mountpoint directories by hand. Perhaps this is a bug in ZFS for Linux?

Creating a volume

You can use datasets for non-zfs filesystems. These are called "volumes."

This will create a blank 32G volume:

zfs create -V 32G myPool/myVolume

A new device entry will be created automatically at:

/dev/zd0

The next volume you create will be associated with another zd device node, and so on each time you add a new volume. (The numbering is not necessarily sequential; each zvol reserves a block of minor numbers for its partitions, so the next one may show up as /dev/zd16, for example.)

You can see the association between zvol devices and datasets here:

/dev/zvol/myPool/myVolume -> ../../zd0

You can now (optionally) partition the device:

fdisk /dev/zvol/myPool/myVolume

Create a filesystem:

mkfs -t ext4 /dev/zvol/myPool/myVolume

And mount the volume:

mount /dev/zvol/myPool/myVolume /mnt/here

In these examples, /dev/zd0 could be used in place of the longer symlink path, but it's easy to lose track of these associations if you have many volumes. It's safer to use the symbolic links under /dev/zvol/...

Rename a dataset

zfs rename oldPath newPath

Destroy a dataset

Destroy a dataset that has no descendants:

zfs destroy myPool/myDataset

The command will refuse to run if the dataset still has descendant datasets, snapshots, or clones. If you want everything gone, descendants, snapshots, and clones included:

zfs destroy -R myPool/myDataset

List zfs datasets

zfs list

Dataset compression

Using compression will make zfs faster unless the dataset contains mostly compressed data (such as media.)

Enable default compression scheme:

zfs set compression=on myPool/myDataset

Enable a better compression scheme:

zfs set compression=lz4 myPool/myDataset

Turning on compression only affects data written after the property is set; existing data is not rewritten.
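
You can check how effective compression has been for a dataset at any time:

zfs get compressratio myPool/myDataset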

Dataset access time recording

Although turned on by default, it should usually be turned off:

zfs set atime=off myPool/myDataset

If you don't turn off access time recording, your incremental backups will include every file you've accessed even if it hasn't changed.

File sharing with NFS

IMPORTANT: NFS sharing won't work unless the dataset is mounted locally.

To specify NFS sharing and immediately enable remote access:

zfs set sharenfs=on myPool/myDataset

Linux clients can mount the share from the command line using:

mount -t nfs zfs.host.com:/myPool/myDataset /mnt/here

Using /etc/auto.mount the specification is:

myLocalName -fstype=nfs zfs.host.com:/myPool/myDataset

Note: The NFS path is the dataset's mountpoint on the server, not the zfs dataset path. With the default mountpoints, that path includes the pool name.

Note: It is said to be "auspicious" to avoid sharing zfs datasets using native "/etc/exports" and the sharenfs property: Use one or the other.

File sharing with samba

IMPORTANT: Samba sharing won't work unless the dataset is mounted locally.

The samba server requires some preparation. Add these directives to the smb.conf global section:

[global]
    ...
    usershare path = /var/lib/samba/usershares
    usershare max shares = 100
    usershare allow guests = yes
    usershare owner only = no

The usershares directory must be created by hand:

cd /var/lib/samba
mkdir usershares
chmod o+t usershares
chmod g+w usershares

Make sure that the mountpoint directory for the dataset is owned by the samba guest account user. If you want more restricted access, the procedure should be obvious.

When the configuration is complete, restart:

systemctl restart smb
zfs share -a

To specify samba sharing and immediately enable remote access:

zfs set sharesmb=on myPool/myDataset

After you execute that command, a new share description file should appear in /var/lib/samba/usershares. If you don't see the new file there, make sure that samba is running and that the dataset has a mountpoint property that is a valid file system path. You can't share a dataset with a legacy mountpoint.

Windows clients will see this at the path:

\\zfs.host.com\mypool_mydataset

Notice that the client sees the path components all in lowercase.

To check that samba is working, you can list the shares:

net usershare list

NOTE: It is said to be "auspicious" to avoid sharing zfs datasets using native "smb.conf" and also using "set sharesmb": Use one or the other.

Other file sharing commands

After an NFS or Samba share is enabled, you can disable or enable the sharing for a specified dataset using:

zfs share myPool/myDataset
zfs unshare myPool/myDataset

These operations take effect immediately. But unsharing a dataset will only be effective until the next reboot. At boot time, the startup scripts will mount all configured shares using:

zfs share -a

At shutdown, some other script will run:

zfs unshare -a

To permanently disable a share, turn off one or both share properties:

zfs set sharenfs=off myPool/myDataset
zfs set sharesmb=off myPool/myDataset

Note: Turning off sharenfs will immediately disable remote access, but turning off sharesmb will not. You must explicitly unshare a previously active samba share.

Working with properties

We have seen a few dataset properties in the previous sections. Properties are set using the expression:

zfs set propertyName=propertyValue myPool/myDataset

Here are a few common properties and values:

mountpoint      aPath
compression     off | on | lz4
atime           on  | off
readonly        off | on
exec            on  | off
sharenfs        off | on
sharesmb        off | on
quota           size
reservation     size

In these expressions, size can be specified with the M, G, or T suffix. When the option is "off" or "on", the default value is shown first.
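
For example, to cap a dataset at 100G and guarantee it at least 10G of the pool (the sizes are only illustrations):

zfs set quota=100G myPool/myDataset
zfs set reservation=10G myPool/myDataset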

Descendant datasets inherit all properties from their parents except the quota and reservation properties.

If you explicitly set a property and later decide it would be better to inherit the value, you can issue the command:

zfs inherit propertyName myPool/myDataset

By adding the "-r" option, all descendants of the specified dataset will inherit the property.

Showing a property value

zfs get propertyName myPool/myDataset

Showing all property values

zfs get all myPool/myDataset

Property sources

When you get a property value (as described in the previous section) the listing will show the source of the property, which can be:

default            Never specified. The default value is used.
inherited from x   Inherited value from parent dataset x.
local              The user previously used "set" to specify a value.
temporary          The property was specified with a (temporary) -o option when mounting.
received           The property was set when the dataset was created by "zfs receive".
(none)             The property is read-only.

Listing properties by source

You can list only properties that have specific source:

zfs get -s local all myPool/myDataset

And you can add -r to list the properties of all children:

zfs get -r all myPool/myDataset

The -s and -r options can be combined.
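
For example, to list only the locally-set properties of a dataset and all of its descendants:

zfs get -r -s local all myPool/myDataset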

Listing received property values

A property can have both a received and a local (in effect) value.

zfs get -o all myProperty myPool/myDataset

This will show the local and received property values.

Inheriting properties

Some properties are automatically inherited from parent datasets. Others are not. After changing the value of an otherwise-inherited property, you can restore the inherited value using:

zfs inherit propertyName myPool/myDataset

This will replace any previously set value with the value specified somewhere on the parent path.

Snapshots

A snapshot of a dataset behaves like a backup: it contains the dataset frozen in time when the snapshot was created.

Note the special syntax:

zfs snapshot myPool/myDataset@mySnap1

The process of making a snapshot is nearly instantaneous.

All snapshots of a dataset are accessible in a hidden directory below the original dataset mountpoint:

.zfs/snapshot

Inside that directory, you would find, for example, mySnap1 which was created in the previous example.
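
For example, assuming the dataset is mounted at its default location, you can browse the snapshot and copy files out of it directly (lostFile is just an illustration):

ls /myPool/myDataset/.zfs/snapshot/mySnap1
cp /myPool/myDataset/.zfs/snapshot/mySnap1/lostFile /tmp/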

You can use "zfs rename" on snapshots, but their parent path cannot be changed.

Listing snapshots

To see snapshots, add an option:

zfs list -t snapshot

Or use:

zfs list -t all

Finding files in snapshots

If you need to look for a deleted file in an old snapshot, remember that each dataset has its own hidden .zfs directory. A parent dataset's .zfs/snapshot directory may show child mountpoint directories, but they will be empty. Don't panic. You're just not looking in the right place.

Renaming snapshots

You can only rename the snapshot, not the dataset path before the "@" symbol:

zfs rename myPool/myDataset@oldName myPool/myDataset@newName

Recursively rename a snapshot and the snapshots of all descendants:

zfs rename -r myPool/myDataset@oldName myPool/myDataset@newName

Recursive snapshots

By default, descendant datasets are not part of a snapshot. (Not to be confused with descendant directories in the dataset filesystem, which will be included.) To include descendant datasets, use the -r option. For example,

zfs create myPool/myDataset1
zfs create myPool/myDataset1/myDataset2

This will not include myDataset2:

zfs snapshot myPool/myDataset1@mySnap1

But this will include myDataset2:

zfs snapshot -r myPool/myDataset1@mySnap1

Rollbacks

A rollback transforms a dataset back into the state it was in when a snapshot was created:

First make sure the snapshot itself isn't mounted, then:

zfs rollback -rf myPool/myDataset@mySnap

The -r option destroys all snapshots more recent than the one you are rolling back to. Without this option, you can only roll back to the most recent snapshot. Using -R also destroys any clones of those snapshots. The -f option forces an unmount of the filesystem if necessary.

Clones

A clone works like a copy of a snapshot, but you can modify the files inside:

zfs clone myPool/mySub1@gleep myPool/myFirstClone

Note: Clones must be destroyed before the parent snapshot and dataset can be destroyed.

Promotion

You can promote a clone so it becomes a normal dataset, independent of the original snapshot.

zfs promote myPool/myClone

You might want to rename it to reflect its elevated status:

zfs rename myPool/myClone myPool/myNewDataset

Snapshots of volumes

The syntax is the same:

zfs snapshot myPool/myVolume@mySnap

A new device entry is automatically created:

/dev/zvol/myPool/myVolume@mySnap

This can be mounted in the usual way. Read-only status is implicit.
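
For example, assuming the volume was formatted directly with ext4 as in the earlier example (no partition table), something like this should work:

mount -o ro /dev/zvol/myPool/myVolume@mySnap /mnt/here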

Replication

Replication is used to copy a source dataset from one pool to a destination dataset in another pool. The destination dataset may be part of a pool on another host. The most common use of replication is making backups.

First time backup:

zfs snapshot myPool/myDataset@now
zfs send myPool/myDataset@now | zfs receive -d myBack

Later incremental backups:

zfs rename myPool/myDataset@now myPool/myDataset@then
zfs snapshot myPool/myDataset@now
zfs send -i myPool/myDataset@then myPool/myDataset@now | zfs receive -dF myBack
zfs destroy myPool/myDataset@then

zfs send options:

-i : Incremental (old and new snapshots are the following parameters)
-I : Same as -i, but all snapshots taken since the previous "now" are included.

zfs receive options:

-d : Discard the pool name from the sending path
-F : Force rollback to most recent snapshot.

In the example above, using -d produces this result on the receiving side:

myBack/myDataset@now

Without -d, you would get:

myBack/myPool/myDataset@now

The -F option does a rollback on the receiving dataset, discarding any changes since the last snapshot. This seems alarming, but the practice is necessary because simply browsing a mounted backup will alter access times (if enabled) and zfs will treat the data as modified, causing the next incremental receive to be refused. If the destination dataset is used only for backups, there shouldn't be any useful changes since the last snapshot.

An alternative to using -F is to make the destination dataset readonly:

zfs set readonly=on myBack/myDataset

If myBack is used exclusively for backups, you can make the whole pool readonly:

zfs set readonly=on myBack

The readonly property will be inherited by all new descendant datasets created under myBack. It only applies to normal filesystem operations, not to the receive command, which will modify the destination dataset to match the source.

Replicating a pool

By adding a few options, you can backup the entire pool including all descendant datasets.

First time backup:

zfs snapshot -r myPool@now
zfs send -R myPool@now | zfs receive -ud myBack

Later incremental backups:

zfs rename -r myPool@now myPool@then
zfs snapshot -r myPool@now
zfs send -Ri myPool@then myPool@now | zfs receive -uFd myBack
zfs destroy -r myPool@then

zfs rename options:

-r : Recursively rename the snapshots of all descendant datasets.
     (Only snapshots can be recursively renamed!)

zfs snapshot options:

-r : Recursively create snapshots of all descendants

zfs send options:

-R : Create a replication package (gets all descendants including snapshots and clones)
-i : Incremental (old and new snapshots are the following parameters)

When doing an incremental send, the "old" snapshot must be the one previously sent.

zfs receive options:

-u : Don't mount anything created on the receiving side.
-d : Discard the pool name from the sending path

When used with -R on the sending side, -F will delete everything on the receiving side that no longer exists on the sending side. This makes the readonly option unnecessary.

The -u option is useful when the datasets being created have non-default mountpoint properties. It prevents them from being mounted and possibly shadowing existing directories or mountpoints in the receiving filesystem. Unfortunately, the systemd startup scripts will attempt to mount them the next time you reboot. Please see "Avoiding mountpoint conflicts" below.

Secure replication between hosts

You must have ssh configured and working first. In this example, myBack is assumed to exist on the destination host named "destHost". For a non-incremental backup:

zfs send myPool@now | ssh destHost zfs receive -d myBack
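
Incremental backups over ssh follow the same pattern as the local case, using the naming from the earlier examples:

zfs send -Ri myPool@then myPool@now | ssh destHost zfs receive -uFd myBack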

Dealing with received properties

By default, when a pool receives a dataset, whatever property values existed on the sending side will be preserved. However it is possible to override this behavior by setting properties in the zfs receive command or by setting them locally after the dataset is received. In this case, the property will have both a local and a received value. If you subsequently clear the local value, the received value will take effect again. See the "zfs get" section above for a listing of possible property sources. In the next section, we cover the most common case for modifying a received property.

Avoiding mountpoint conflicts

When several hosts use zfs replication to backup file systems, there is a potential for mountpoint conflicts on the backup host. If root file systems are backed up, they will almost certainly conflict with mountpoints active on the backup host.

We would usually prefer that none of the mountpoints in the backup dataset be active on the backup host. But if it should ever be necessary to restore a backup, we want the original mountpoints to be remembered and restored. ZFS provides special options for send and receive to get exactly this effect.

In the following example, the dataset "srcPool/srcDataset" will be sent to "dstMachine", which has a pool named "dstPool".

First, on the dstMachine, you must create a parent dataset. It can have the name "srcDataset" to help you remember where it came from:

(On dstMachine)

zfs create -o mountpoint=none dstPool/srcDataset

Now on srcMachine execute:

zfs snapshot -r srcPool/srcDataset@now
zfs send -R srcPool/srcDataset@now | ssh dstMachine \
    zfs receive -ud -x mountpoint dstPool/srcDataset

This will replicate srcPool/srcDataset to dstPool/srcDataset so they match except for mountpoints: They will be "latent" but appear to be "none" because that's what they inherit from the newly-created srcDataset on dstMachine.

The "-x mountpoint" options means "don't change the mountpoints on the target system, but remember the mountpoint property value so it will be set if the dataset is ever replicated without the "-x".

Sounds a bit complex, but it's exactly what you want when doing backups from many machines to a common backup server pool.

Later incremental backups:

zfs rename -r srcPool/srcDataset@now srcPool/srcDataset@then
zfs snapshot -r srcPool/srcDataset@now
zfs send -Ri srcPool/srcDataset@then srcPool/srcDataset@now  \
    | ssh dstMachine zfs receive -ud -x mountpoint dstPool/srcDataset
zfs destroy -r srcPool/srcDataset@then

New zfs receive option:

-x mountpoint : Remember but don't modify the mountpoint

The effect of -x is to make sure we don't lose the mountpoint value, but keep the local value (previously set using the -o option.) I like to call these "latent" mountpoints. Perhaps there's a more official-sounding zfs term?

Displaying local and received values

zfs get -o all mountpoint myBack/myDataset

This command will show the value of mountpoint locally in effect and also the original received value.

Restoring a backup

When restoring a backup to the original host, use the "-b" option for "zfs send":

zfs send -R -b myBack/myDataset@now | zfs receive -ud myPool

New zfs send options:

-b : Send previously received property values instead of the local values.

In our example, this will restore myDataset and all the descendant mountpoints to their original state.

Accessing backups

In the examples shown above, we suppress mountpoints on the backup device using "-x mountpoint". If you want to access files in one of these snapshots, you'll have to mount the dataset. But that causes a big problem: changing the mountpoint from "none" to some local directory modifies the dataset, so the next incremental receive from the source will fail. You'll see a message like this:

cannot receive: destination has been modified since most recent snapshot

The right way to access files in a backup snapshot is by using a temporary clone. Here is an example of accessing the root (user) directory in a backup:

Create a clone:

zfs clone mrPool/server/root@now mrPool/gleeb

Give it a local mountpoint:

zfs set mountpoint=/gleeb mrPool/gleeb

Now recover your files from /gleeb.

Destroy the clone when you're done:

zfs destroy mrPool/gleeb

Get rid of the mountpoint directory:

rmdir /gleeb

There are other ways to deal with this issue:

1) Using the -F option with "zfs receive": The -F option will revert any changes on the destination and, combined with -R on the sending side, destroy all snapshots that no longer exist on the sender, keeping only the most recent state. The drawback is precisely that destruction of other snapshots: it can be a problem if you perform secondary backups (replication of a replication) because you'll need to preserve the snapshots associated with each incremental stage.

2) Access snapshots through the invisible ".zfs" directory. This is a fine idea if the backup dataset has an active local mountpoint. But when backing up many computers to a common backup pool, having active mountpoints isn't a practical idea.


Problems and solutions

Starting zfs by hand

These commands are useful if you build and install zfs from source or if the modules fail to load during the boot process. Normally, zfs starts using the "zfs.target" under control of systemd. You can do this yourself:

systemctl start zfs.target

Here's what starting the zfs.target will do:

# Load zfs and dependencies:

    modprobe zfs

# Import all pools:

    zpool import -c /etc/zfs/zpool.cache -aN

# Mount all file systems:

    zfs mount -a

# Activate smb and nfs shares:

    rm -f /etc/dfs/sharetab
    zfs share -a

# Start the zed daemon:

    zed -F

Somehow, systemd makes the last command above daemonize zed, but to start it from the command line, I had to use systemctl:

systemctl start zed.target

ZFS disappears after kernel update

When you update the linux kernel, dkms is supposed to update the zfs modules. Occasionally, this mechanism fails and your zfs filesystems will be missing after a reboot. To fix this problem, you can force a rebuild using dkms commands, which is somewhat tedious, or you can simply re-install the packages:

yum reinstall spl-dkms
yum reinstall zfs-dkms

Now bring up zfs without rebooting:

systemctl start zfs.target

ZFS disappears after Fedora upgrade

After doing a major upgrade, e.g. version 22 to 23, ZFS will not start or automatically rebuild the modules even if the associated dkms packages are re-installed. I haven't figured out why this happens, but the fix is easy: Just add and install the modules by hand:

First, make sure your modules are up to date:

dnf update -y

Find out what version you have:

rpm -q spl-dkms
rpm -q zfs-dkms

Usually spl and zfs will have the same version numbers. Ignore the extension part of the version string. For example, if "rpm -q" shows something like:

zfs-dkms-0.6.5.4-1.fc23.noarch

Your version string is just:

0.6.5.4

Now you can add and install the modules:

dkms add -m spl -v 0.6.5.4
dkms add -m zfs -v 0.6.5.4
dkms install -m spl -v 0.6.5.4
dkms install -m zfs -v 0.6.5.4

And bring up zfs:

systemctl start zfs.target

Dealing with ZFS updates

After your package manager updates zfs, you may find new features. To avoid problems when multiple hosts share pools, this process is controlled by a feature-flags protocol. You can decide which new pool features to enable or ignore, even if you accept the rest of a software update. This happened to me recently, so I'll outline the process with a specific example:

After doing a routine "yum update", I got this message from "zpool status":

status: Some supported features are not enabled on the pool. The pool
can still be used, but some features are unavailable.

action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not
        support the features. See zpool-features(5) for details.

Ok. Next I tried running:

zpool upgrade

In older versions of zfs, the pool data structure had a version number which would be incremented by the upgrade process. It was an "all or nothing" deal. But that was yesterday. Running "zpool upgrade" now reports:

This system supports ZFS pool feature flags.

All pools are formatted using feature flags.

Some supported features are not enabled on the following pools. Once a
feature is enabled the pool may become incompatible with software
that does not support the feature. See zpool-features(5) for details.

POOL  FEATURE
---------------
mrBack
    filesystem_limits
    large_blocks

mrPool
    filesystem_limits
    large_blocks

It turns out you have to enable each feature on each of your pools by hand:

zpool set feature@filesystem_limits=enabled mrBack
zpool set feature@large_blocks=enabled mrBack

zpool set feature@filesystem_limits=enabled mrPool
zpool set feature@large_blocks=enabled mrPool

The possible values for a feature are:

disabled : The feature is not in use and the on-disk format is unchanged.
enabled  : The feature may be used; compatibility is only lost once it becomes active.
active   : The feature is in use on disk, and older software may not be able to import the pool.

This is clearly a better idea: You can decide how to trade off the value of a new feature against potential compatibility problems if you ever want to import your pools on another host.
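
To see where each feature stands on a pool, the feature flags show up as pool properties:

zpool get all myPool | grep feature@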

References