ZFS Without Tears

Introduction

What is ZFS?

ZFS is a combined file system and volume manager that actively protects your data. It replaces lots of software you may be using such as LVM, RAID, and backup applications.

What is this document?

This is a set of notes I use to remember how to manage my ZFS configuration. You can find dozens of other ZFS tutorials and manuals. Most of them are better. Mine is shorter because it's far from comprehensive.

Understanding traditional storage systems

RAID levels are used to describe how an array of devices can be managed as one big storage resource. To understand ZFS, we only need a few:

Redundancy and performance

A stripe has no redundancy. Storage for each file will be spread out across all the devices. If any device fails, the entire file system on the array will be compromised. Stripes are used to gain speed: All the devices work in parallel to transfer one file.

In a mirror with N devices, N-1 devices can fail and the data can still be recovered. All the devices are the same size, so the capacity of the array is the size of one device. The write performance of a mirror is about the same as that of one component device; reads can be faster because they can be served from any device.

A parity array is more complex: The data is striped along with parity information. In RAID 5, data can be recovered if one device fails. In RAID 6, two devices may fail. Parity information takes up space: If all the devices in a RAID 5 array are the same size "X", one X will be required for the distributed parity information. For RAID 6, 2X will be consumed for parity. A parity array is slower than a mirror, but it provides more storage space with the same number of drives.

RAID 10 is used to get the redundancy of a mirror with more performance: It is implemented in ZFS using N-stripes of M-mirrors. By fussing with N and M, safety, size, and speed can be balanced for a given application.

ZFS supports 1, 2, or 3 level parity and improves on traditional RAID parity implementations. In zfs, these are called raidz levels to distinguish them from the old RAID level terminology.
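The capacity rules above can be sketched as a small shell function. This is only back-of-the-envelope arithmetic: the function name usable_capacity and its arguments are made up for illustration, all devices are assumed to be the same size, and real pools lose a little more space to metadata and padding.

```shell
#!/bin/sh
# usable_capacity LAYOUT DEVICES SIZE  (hypothetical helper)
# Prints the usable capacity, in the same unit as SIZE, of an array of
# DEVICES equal-sized devices, using the rules described above.
usable_capacity() {
    layout=$1 devices=$2 size=$3
    case $layout in
        stripe)        parity=0 ;;
        mirror)        echo "$size"; return 0 ;;   # capacity of one device
        raid5|raidz1)  parity=1 ;;
        raid6|raidz2)  parity=2 ;;
        raidz3)        parity=3 ;;
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
    if [ "$devices" -le "$parity" ]; then
        echo "need more than $parity devices for $layout" >&2
        return 1
    fi
    # Each parity level consumes the space of one device.
    echo $(( (devices - parity) * size ))
}

usable_capacity stripe 4 4    # prints 16
usable_capacity mirror 4 4    # prints 4
usable_capacity raid5  4 4    # prints 12
usable_capacity raidz2 6 4    # prints 16
```

With four 4T drives, for example, the choice is between 16T with no safety net, 12T that survives one failure, or 4T that survives three.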

Understanding ZFS

The concepts

The big picture

Virtual Devices

Simple vdevs are disk drives, partitions, or files:

/dev/sdb
/dev/sdb1
/home/myself/myGreatBigFile

Complex vdevs are specified using an expression:

<type> device1 ... deviceN

The type is one of seven keywords taken from one of two groups:

Storage vdevs:

mirror
raidz1 (may also be written raidz)
raidz2
raidz3

Optimization vdevs:

log
cache
spare

Storage vdevs, as the name suggests, are where you get space for your files. Optimization vdevs are used to optimize pool performance or reliability. They are optional.

In some contexts, zfs commands require the use of simple vdevs. We will denote these as devices, reserving the term vdev for contexts where a complex or a simple device may be specified.

Pools

Pools are the top-level zfs construct for managing a collection of virtual devices. A pool is created by specifying a stripe of vdevs. Space for datasets is allocated dynamically from all the storage vdevs in the pool.

Pool design

Each top-level vdev in a pool is allowed to be a different type and/or size, but this is seldom (if ever) a good idea. The most common redundant pool organizations are:

Mirror of M devices
RaidzN of M devices
N-Stripe of M-mirrors

Complex vdevs

You cannot create mirrors of raid arrays or raid arrays of mirrors or any other compound vdev. The top level is always a stripe. Each element of the stripe must be a simple device or one of the three vdev types listed in the previous section.

Datasets

There are four types:

ZFS File systems

By default, when you create a new dataset, ZFS also creates a file system which manages space automatically. When you create a file in that filesystem, it is likely that bits of it are stored on every storage vdev in the pool. When you delete a file, the blocks are returned to the pool.

Volumes

A volume, also called a zvol, is a large fixed-sized dataset formatted with any file system you prefer. Zvols are often used as disk drives for virtual machines. When a virtual machine is no longer needed, the volume can be destroyed, returning all the space to the pool.

Snapshots

A snapshot is conceptually a read-only copy of another parent dataset. Snapshots are implemented in such a way that they can be created nearly instantaneously and take very little space. A snapshot depends on the continued existence of the parent.

Clones

A clone is a snapshot you can modify. Clones grow as they are modified. Like snapshots, they depend on their parent.


Preparation for ZFS

Installation

ZFS is a foreign kernel module that gets rebuilt automatically by dkms whenever your kernel is updated. Consequently, it depends on the kernel-devel package.

Install kernel-devel:

yum install kernel-devel

Install a link to the zfs package repository:

yum localinstall --nogpgcheck \
    http://archive.zfsonlinux.org/fedora/zfs-release$(rpm \
    -E %dist).noarch.rpm

Install zfs:

yum install zfs

Partitioning drives for ZFS

ZFS likes to use whole disks, but there is a drawback: If you try to replace a drive with a new one that is smaller, even by one byte, you are totally out of luck.

By using a partition slightly smaller than the drive, you have room to adjust the size when partitioning a new drive.

  1. You must use GPT - Use gdisk to write the label and create partitions.
  2. The partition type for ZFS is "FreeBSD ZFS" (0xa504)
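Here is a rough sketch of how the two rules might be scripted. It assumes sgdisk, the scriptable companion to gdisk, is installed; the function name partition_cmd and the 100M of headroom are my own inventions. The function only prints the command so you can check the disk name before running anything.

```shell
#!/bin/sh
# partition_cmd DISK [HEADROOM]  (hypothetical helper)
# Prints, but does not run, an sgdisk command that creates partition 1
# ending HEADROOM short of the end of DISK, with type code a504.
partition_cmd() {
    disk=$1
    headroom=${2:-100M}    # assumed safety margin; choose your own
    echo "sgdisk -n 1:0:-$headroom -t 1:a504 $disk"
}

partition_cmd /dev/sdX
```

A start value of 0 asks sgdisk to use the first available sector, and a negative end is measured back from the end of the disk.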

Limiting memory usage

ZFS wants a lot of memory. The recommended minimum is 1G per terabyte of storage in your pool(s). Without constraints, ZFS is likely to run off with all your memory and sell it at a pawn shop. To prevent this, you should specify a memory limit. This is done using a module parameter. A reasonable setting is half your total memory:

Edit or create:

/etc/modprobe.d/zfs.conf

Add this line:

options zfs zfs_arc_max=17179869184

The size is in bytes. It does not have to be a power of 2, but these are convenient values:

16GB  = 17179869184
8GB   = 8589934592
4GB   = 4294967296
2GB   = 2147483648
1GB   = 1073741824
512MB = 536870912
256MB = 268435456
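If you'd rather not do the arithmetic by hand, something like the following sketch can work out "half your total memory" in bytes. It assumes a Linux /proc/meminfo that reports MemTotal in kB; the function name half_of_kib is made up for this example.

```shell
#!/bin/sh
# half_of_kib KIB  (hypothetical helper)
# Prints half of KIB kibibytes, expressed in bytes.
half_of_kib() {
    echo $(( $1 * 1024 / 2 ))
}

# Read total memory (in kB) and print a ready-to-paste modprobe line.
mem_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo 2>/dev/null)
if [ -n "$mem_kib" ]; then
    echo "options zfs zfs_arc_max=$(half_of_kib "$mem_kib")"
fi
```

The printed line is what goes into /etc/modprobe.d/zfs.conf.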

Using ZFS without ECC memory

ZFS wants you to use ECC memory, which is typically only available on "server class" motherboards with Intel Xeon processors.

If you don't use ECC memory, you are taking a risk. Just how big a risk is the subject of considerable controversy and beyond the scope of these notes.

First, give yourself a fighting chance by testing your memory. Obtain a copy of this utility and run a 24 hour test:

http://www.memtest86.com

This is particularly important on a new server because memory with defects is often purchased with those defects. If your memory passes the test, there is a good chance it will be reliable for some time.

You might be tempted to keep your server in a 1-meter thick lead vault buried 45 miles underground. It turns out that many of the radioactive sources for memory-damaging particles are already in the ceramics used to package integrated circuits. So these time-consuming measures are probably not worth the cost or effort.

Instead, we're going to enable the unsupported ZFS_DEBUG_MODIFY flag. This will mitigate, but not eliminate, the risk of using ordinary memory.

Edit:

/etc/modprobe.d/zfs.conf

Add the line:

options zfs zfs_flags=0x10

Reboot to be sure it "takes".

You can see the current value here:

cat /sys/module/zfs/parameters/zfs_flags
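The parameter file shows the whole flags word, so other bits may be set as well. A small sketch that tests just the 0x10 bit (the function name debug_modify_enabled is hypothetical):

```shell
#!/bin/sh
# debug_modify_enabled FLAGS  (hypothetical helper)
# Succeeds if bit 0x10 is set in FLAGS.
debug_modify_enabled() {
    [ $(( $1 & 0x10 )) -ne 0 ]
}

param=/sys/module/zfs/parameters/zfs_flags
if [ -r "$param" ]; then
    if debug_modify_enabled "$(cat "$param")"; then
        echo "ZFS_DEBUG_MODIFY is enabled"
    else
        echo "ZFS_DEBUG_MODIFY is not enabled"
    fi
fi
```

If the zfs module is not loaded, the file does not exist and the script prints nothing.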

Command line operations

The rest of this guide describes only two commands: zpool and zfs.


ZPOOL Commands

The zpool command is used to create and configure pools.

The big picture

Creating a pool

The basic pattern:

zpool create myPool vdev1 vdev2 ... vdevN (options)

This expression creates a stripe. Blocks for datasets will be allocated across all the vdevs. If each vdev is a simple device, the pool will be vulnerable: If one device fails, the whole pool is lost. Consequently, the vdevs are usually mirrors or raidz arrays.

There are many options, but I want to mention one right now because it is irreversible if you get it wrong:

-o ashift=12

This specifies that your disk is Advanced Format, which is the same as saying it has 4096 byte sectors instead of the old 512 byte sectors. Most disks made after 2011 are advanced format so you'll need this option most of the time. If you forget, ZFS assumes the sector size is 512. If that's the wrong answer, you'll take a big performance hit. More details about this are covered later. I won't show this option in the examples, because it clutters up the logic. But don't forget!

Adding vdevs

You can add more vdevs to a pool anytime:

zpool add myPool vdev1 ... vdevN

Removing vdevs

You can only remove log, cache or spare vdevs:

zpool remove myPool aVdev

Storage allocation

Blocks for datasets are allocated evenly across all the top-level storage vdevs. This is why you can't remove a top-level storage vdev.

Simple vdevs (devices)

Simple vdevs are specified using path notation. ZFS puts "/dev/" in front of a path name unless the path begins with "/".

You can specify devices like this:

sdb sdc sdd sde

Or using the full path:

/dev/sdb /dev/sdc /dev/sdd /dev/sde

File vdevs must have absolute paths:

/home/aFile /var/temp/goop

File vdevs are used mostly (always?) for experimentation.

Create a stripe of two devices

zpool create myPool sdb sdc

Create a mirror of two devices

zpool create myPool mirror sdb sdc

Create a raidz1 of four devices

zpool create myPool raidz sdb sdc sdd sde

Create a stripe of two mirrors (RAID 10)

zpool create myPool mirror sdb sdc mirror sdd sde

Create a stripe of two raidz1 arrays

zpool create myPool raidz sda sdb sdc raidz sdd sde sdf

Create a stripe of files

This kind of pool is used to experiment without using disk drives. An example:

Create some empty files. (The minimum size of a file vdev is 64M.)

dd if=/dev/zero of=$HOME/myFile1 bs=1M count=64
dd if=/dev/zero of=$HOME/myFile2 bs=1M count=64

Create the pool:

zpool create myPool ~/myFile1 ~/myFile2

The paths must be absolute. (The shell expands ~ to an absolute path before zpool sees it.)
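When I want more than a couple of file vdevs, a loop saves typing. The sketch below makes the files in a scratch directory and prints the zpool command instead of running it, since creating a pool needs root. The names make_file_vdevs, playroom, and vdev1..vdevN are invented for the example.

```shell
#!/bin/sh
# make_file_vdevs DIR COUNT  (hypothetical helper)
# Creates COUNT zero-filled 64M files named vdev1..vdevCOUNT in DIR
# and prints their absolute paths, one per line.
make_file_vdevs() {
    dir=$(cd "$1" && pwd) || return 1    # resolve DIR to an absolute path
    i=1
    while [ "$i" -le "$2" ]; do
        dd if=/dev/zero of="$dir/vdev$i" bs=1M count=64 2>/dev/null || return 1
        echo "$dir/vdev$i"
        i=$(( i + 1 ))
    done
}

playroom=$(mktemp -d)
files=$(make_file_vdevs "$playroom" 2)

# Print the command rather than running it.
echo zpool create myPool $files
```

Remember to remove the scratch directory when you are done experimenting.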

Creating a mirror by attaching a device

We start with a pool that has one device:

zpool create myPool sdb

Then we attach another, creating a mirror:

zpool attach myPool sdb sdc

We can add another, continuing from the previous example:

zpool attach myPool sdc sdd

You specify a device that is already in the mirror, followed by the new one.

Decreasing the level of a mirror

Continuing from the previous example, we remove the "end" device:

zpool detach myPool sdd

Now the mirror has only sdb and sdc.

Demoting a mirror to a simple storage vdev

Continuing from the previous example:

zpool detach myPool sdc

Now only sdb is left.

Mounting a pool

By default a new pool is mounted at the root of the file system where it appears as a directory named after the pool.

You can specify an alternative mount point for the pool when creating:

zpool create myPool ... -m aPath

In this expression, aPath is a regular file system path or the special keyword none. If none is specified, the pool will not be mounted. The last element of the path is the name of the actual mountpoint, not just the parent path. Consequently, you can use any name for the mountpoint. If an empty directory already exists at that location, zfs will mount the pool on that directory. Otherwise, a new mount point is created.

You can set or change the mountpoint of a pool anytime later:

zfs set mountpoint=aPath myPool

If the pool was already mounted, the old mount point will be removed.

Note that we are using the command zfs instead of zpool. That's because the pool itself is a dataset, which will be discussed in more detail later.

Adding a hot spare drive

You can add one or more spares at any time:

zpool add myPool spare /dev/sdf

Spares can be used with any kind of pool.

Pitfalls when adding top-level vdevs

A chain is only as strong as its weakest link. Suppose you have a raidz pool:

zpool create myPool raidz sdb sdc

Now you add a device:

zpool add myPool sdd

  1. You have added a top-level vdev that is a simple device.
  2. You have not added a device to the raidz vdev.
  3. The pool will be irreparably damaged if anything goes wrong with sdd.

The raidz is spoiled because blocks for files are allocated from all the top-level vdevs, some from the raidz vdev and some from the device sdd.
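One way to avoid this mistake is to route every "zpool add" through a wrapper that complains when the first word after the pool name is a bare device. This is a crude sketch, not a substitute for reading the command twice: the function name check_add is made up, it only looks at the first word, and it prints the command instead of running it.

```shell
#!/bin/sh
# check_add POOL VDEV_SPEC...  (hypothetical helper)
# Prints the zpool add command if the spec starts with a vdev keyword;
# otherwise refuses, because a bare device would become an unprotected
# top-level vdev.
check_add() {
    pool=$1; shift
    case $1 in
        mirror|raidz|raidz1|raidz2|raidz3|log|cache|spare)
            echo "zpool add $pool $*" ;;
        *)
            echo "refusing: '$1' would become an unprotected top-level vdev in $pool" >&2
            return 1 ;;
    esac
}

check_add myPool mirror sdd sde
check_add myPool sdd || true
```

The first call prints the command; the second prints the refusal on standard error.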

Show the status of a pool

zpool status

Show data statistics

For the whole pool

zpool iostat

Also individual vdevs

zpool iostat -v

Continuous monitor every 5 seconds

zpool iostat -v 5

Show the history of a pool

zpool history myPool

Destroy a pool

zpool destroy myPool

Physically move a pool to another machine

After exporting a pool, you can remove the devices and install them in another computer:

zpool export myPool

Exporting a pool unmounts all the filesystems and "offlines" all the devices, so it is sometimes used in other situations when you want to stop all access to the pool.

On the destination machine, you resurrect the pool by importing:

zpool import myPool

After the import, all mountable datasets and filesystems will be mounted in their original locations. Sometimes this is inconvenient. You can arrange for all the new mountpoints to be under a new path using the altroot property:

zpool import myPool -o altroot=/my/new/path

If you used partitions instead of whole disks when creating the pool, you cannot import the pool on a machine with a different "endianness" - The information needed to do that is part of the zfs label that can only be present when zfs owns the entire drive.

If you export a pool created using file vdevs, there is no place for them to store their parent directory. To import such a pool, you must specify the parent like this:

zpool import myPool -d ~/zfsPlayroom

Rename a pool

First export the pool:

zpool export myPool

Then import it and specify a new name:

zpool import myPool myNewName

Common pool properties

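These can be specified when importing a pool using -o. Some can also be changed later using the zpool set command; altroot and readonly can only be given at import time, and ashift applies when a vdev is created. The default value is indicated first when there are alternatives:

altroot=path
readonly=off | on
autoreplace=off | on
ashift=12

Example:

zpool set autoreplace=on myPool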

Pool information properties

These are read-only:

    health
    size
    free
    capacity
    allocated

free + allocated  = size
health = ONLINE | DEGRADED | FAULTED | OFFLINE | REMOVED | UNAVAIL
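The health property is handy in scripts. Here is a sketch that lists only the pools that are not ONLINE. It relies on "zpool list -H -o name,health", which prints tab-separated fields without a header; the function name unhealthy_pools is made up.

```shell
#!/bin/sh
# unhealthy_pools  (hypothetical helper)
# Reads "name<TAB>health" lines on standard input and prints the name
# and health of every pool whose health is not ONLINE.
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 " " $2 }'
}

# Only query real pools when the zpool command is available.
if command -v zpool >/dev/null 2>&1; then
    zpool list -H -o name,health | unhealthy_pools
fi
```

No output means every pool reported ONLINE.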

Using device ids

When you create a pool, it's easy to type and remember device names. These are the familiar "/dev/sdx" names. But there is a serious problem with device names: For scsi or sata disks, they are associated with the port where you plugged in the cable. And for USB disks and other removable devices, they are determined by the order they were plugged in. If you import a pool with the cables switched, disaster follows. And if you move the disks to another machine and try to import them, there will likely be device name conflicts.

To avoid all this, it's better to switch to names that are associated with the volume (the physical disk) rather than the connection. Linux provides several choices. They are listed in the /dev/disk directory. There are only two choices that are really useful: device IDs and UUIDs.

Device IDs are composed from the disk model number and serial number. When I have to replace a disk, I can be sure I got the right one because the serial number is printed on the paper label and "zpool status" will show the ID.

To make the switch to device ids:

zpool export myPool
zpool import myPool -d /dev/disk/by-id

You can also use the very fashionable UUIDs:

zpool export myPool
zpool import myPool -d /dev/disk/by-uuid

The good thing about UUIDs (and device IDs) is that they aren't optional: a disk always has both. A bad thing about UUIDs and device IDs is that they are far too long and complex to type or remember. That's why I usually partition and assemble disks using device name and then switch to device IDs.

You can switch back to the old device names like this:

zpool export myPool
zpool import myPool -d /dev

You can switch device names anytime. They become part of the data structures on disk the next time you export the pool or shut down.

Using device aliases

If you have a lot of drives in many different racks, locating a drive given the device ID isn't easy. ZFS provides a way to assign your own alias names. You specify alias names in this file:

/etc/zfs/vdev_id.conf

In the following example, we have two 4-bay enclosures that contain pools "mrpool" and "mrback."

Run "zpool status" and capture the output in a prototype vdev_id.conf file. Edit the file so it looks like this:

alias mrback_7TN0 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2YR7TN0-part1
alias mrback_8PT2 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E4FP8PT2-part1
alias mrback_VNSR ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E7ADVNSR-part1
alias mrback_FNN3 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E3FLFNN3-part1

alias mrpool_4CXA ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1XS4CXA-part1
alias mrpool_AXC9 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1EFAXC9-part1
alias mrpool_3Y4P ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5KJ3Y4P-part1
alias mrpool_3C3J ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5KJ3C3J-part1

In this example, the short name was made by combining the pool/box name with the last four characters of the device id, which is part of the drive serial number printed on the disk label. If a drive is reported as defective, you have the location and a positive id you can physically see.
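Typing these lines by hand invites mistakes, so here is a sketch that builds an alias line from a box name and a device id, taking the last four characters of the serial number as described above. The function name make_alias is made up, and it assumes the id ends in the serial number, optionally followed by -partN.

```shell
#!/bin/sh
# make_alias BOX DEVID  (hypothetical helper)
# Prints a vdev_id.conf alias line named BOX_xxxx, where xxxx is the
# last four characters of DEVID with any trailing -partN removed.
make_alias() {
    box=$1 devid=$2
    serial=${devid%-part*}              # drop a trailing -partN, if any
    tail4=${serial#"${serial%????}"}    # keep the last four characters
    echo "alias ${box}_${tail4} $devid"
}

make_alias mrback ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2YR7TN0-part1
```

Feeding it the names found in /dev/disk/by-id gives you a first draft of the file to edit.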

ZFS fires a UDEV rule at boot time to transform this file into a set of shortcuts that will appear in the automatically created directory:

/dev/disk/by-vdev

You can fire the rule without rebooting by entering the command:

udevadm trigger

There is a defect in the current zfs distribution that prevents this script from working if your devices refer to partitions instead of whole drives. The fix is simple. Edit the file:

/usr/lib/udev/rules.d/69-vdev.rules

The first active line in the rule reads:

ENV{DEVTYPE}=="disk", IMPORT{program}="/usr/lib/udev/vdev_id -d %k"

Duplicate this line and change "disk" to "partition". The two lines should look like this:

ENV{DEVTYPE}=="disk", IMPORT{program}="/usr/lib/udev/vdev_id -d %k"
ENV{DEVTYPE}=="partition", IMPORT{program}="/usr/lib/udev/vdev_id -d %k"

Now your vdev_id.conf file will work as expected when you trigger the rule.

After creating and populating the directory /dev/disk/by-vdev, you can rename the drives in your pools:

zpool export mrpool
zpool import -d /dev/disk/by-vdev mrpool

zpool export mrback
zpool import -d /dev/disk/by-vdev mrback

Adding a log device

By default, the mysterious "ZIL" (ZFS intent log) is created in the storage pool. A separate log device can be specified to improve performance. This can be done when a pool is created:

zpool create fastPool /dev/sdb3 log /dev/myOtherSSD

Or it can be added later:

zpool add fastPool log /dev/myOtherSSD

Log vdevs can be redundant:

zpool add fastPool log mirror /dev/ssd1 /dev/ssd2

You can attach a device to a mirrored log vdev to increase the level:

zpool attach fastPool /dev/ssd2 /dev/sdd3

You can also attach a device to a non-mirrored log vdev to create a mirror.

Adding a cache device

ZFS always uses the memory-based "ARC" cache. It is possible to add one or more secondary cache vdevs. In ZFS-speak, this is called an L2ARC. These are usually fast devices such as SSDs.

Cache devices can be specified when a pool is created:

zpool create fastPool /dev/sdb3 cache /dev/mySSD

Or they can be added later:

zpool add fastPool cache /dev/mySSD

Cache vdevs cannot be redundant.

Scrubbing a pool

Scrubbing verifies all checksums and repairs any blocks that fail, provided the pool has enough redundancy to reconstruct them.

zpool scrub myPool

Although the command returns immediately, the scrub itself may take many hours or even days to complete. Use "zpool status" to view the progress. You can stop a scrub using:

zpool scrub -s myPool

Using the ZED monitor daemon

The zed daemon is part of the zfs installation and runs all the time. There is no explicit systemd enable/disable script.

The configuration file is:

/etc/zfs/zed.d/zed.rc

One essential change is to specify your email address by uncommenting the line:

ZED_EMAIL_ADDR="root"

You can change "root" to your own email address or forward messages to root in /etc/aliases.

To reload the configuration file after editing, you must "HUP" the daemon. First get the process id number for zed:

ps ax | grep zed

Then send the HUP signal:

kill -HUP <pid>

Zed can do lots of nifty things like replacing defective drives with hot spares automatically. Caution: It might be better to take care of backups before you swap in a spare and start resilvering.

Replacing defective devices

Mirror and raidz vdevs can be repaired when a device fails:

  1. Take the old device offline.
  2. Replace the device.
  3. Put the new device online.

As soon as the new device comes online, it will start resilvering - the data will be restored using other drives in the mirror or raidz. You can inspect the progress of resilvering using zpool status.

Taking a device offline

zpool offline myPool aDrive

Taking a drive offline until next reboot (temporary)

zpool offline -t myPool aDrive

Replacing a drive

zpool replace myPool oldDrive newDrive

The newDrive can be a hot spare or any unused drive.

If newDrive is not specified, the oldDrive will be used: That can occur when oldDrive is part of a mirror or raidz pool and it has been replaced by a new drive with the same device path.

ZFS can "tell" if a drive was part of a zfs pool. It will object if you try to bring such a drive online in a new role. To force the issue, add the "-f" option to the replace command.

Putting a drive back online

zpool online myPool aDrive

Using autoreplace

If the pool property autoreplace is "on" and a spare drive is part of the pool, the defective drive will be replaced by the spare and resilvering will start automatically. Using this option seems attractive, but frequently it is better to attend to your backups before starting a lengthy resilvering process.

Dealing with disk sector sizes

As we stated earlier, ZFS will operate a lot faster if it knows the physical sector size of your disk(s). If you're building a big storage array, it's worth spending a little time to get this right.

Unfortunately, ZFS cannot reliably detect the sector size. To make matters worse, many disks lie about their size. The history of this peculiar behavior is beyond the scope of this article.

If you take the trouble to find your disk on the manufacturer's web site, they will often reveal the physical sector size. Otherwise, they may state that the disk has the Advanced Format, which is marketing-speak for 4096 byte sectors. It might even be printed on the box.

If that's too much trouble, here's a heuristic that's likely to work: First, ask the disk: You can use fdisk:

fdisk -l /dev/sdx

Or the more impressive:

lsblk -o NAME,PHY-SEC

You'll get back either 512 or 4096. Now use this heuristic:

IF the reported sector size is 4096 THEN
    The true sector size is 4096.
ELSE IF the disk was made before 2010 THEN
    The true sector size is 512.
ELSE IF the disk was made after 2011 THEN
    The true sector size is probably 4096.
ELSE
    Do you really want to use this old P.O.S.?
END IF
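For the script-minded, the heuristic above translates to shell roughly as follows. The function name true_sector_size is made up, and the year of manufacture is something you have to supply yourself. Disks from 2010 and 2011 fall through to the last branch, where this sketch simply answers "unknown".

```shell
#!/bin/sh
# true_sector_size REPORTED YEAR  (hypothetical helper)
# REPORTED is the sector size the disk claims (512 or 4096); YEAR is
# the year the disk was made. Prints the guessed physical sector size.
true_sector_size() {
    reported=$1 year=$2
    if [ "$reported" -eq 4096 ]; then
        echo 4096
    elif [ "$year" -lt 2010 ]; then
        echo 512
    elif [ "$year" -gt 2011 ]; then
        echo 4096       # "probably", per the heuristic
    else
        echo unknown    # 2010-2011: the heuristic gives no answer
    fi
}

true_sector_size 512 2014    # prints 4096
```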

The option to specify a sector size of 4096 is:

-o ashift=12

The "12" here is the power of 2 that makes 4096. We're all geeks in here you see.
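If you ever meet a disk with some other sector size, the ashift value is just the base-2 logarithm of the size in bytes. A sketch (the function name ashift_for is made up) that works it out by repeated halving:

```shell
#!/bin/sh
# ashift_for SECTOR_SIZE  (hypothetical helper)
# Prints the power of 2 that makes SECTOR_SIZE; fails if SECTOR_SIZE
# is not a positive power of 2.
ashift_for() {
    size=$1 shift_bits=0
    [ "$size" -gt 0 ] || return 1
    while [ "$size" -gt 1 ]; do
        [ $(( size % 2 )) -eq 0 ] || return 1    # not a power of 2
        size=$(( size / 2 ))
        shift_bits=$(( shift_bits + 1 ))
    done
    echo "$shift_bits"
}

ashift_for 512     # prints 9
ashift_for 4096    # prints 12
```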

Example: Creating a pool with specified sector size:

zpool create mrnas -o ashift=12 sdb1 sdc1 sdd1 sde1

This optimization is only effective if all the disks have the same sector size.

Maxims for pools


ZFS Commands

The zfs command is used to create and configure datasets.

The big picture

Dataset types

Creating datasets

A dataset path always begins with a pool name:

zfs create myPool/myDataset

Any hierarchy of datasets can be created:

zfs create myPool/myDataset/mySubset1/mySubset2

Each dataset to the right is called a descendant of the datasets to the left.

Mounting datasets

If the pool was created with the default mountpoint, datasets are automatically mounted under the pool with the same path used to create them.

Alternatively, you can specify a mount point when a dataset is created using a property:

zfs create -o mountpoint=aPath myPool/myDataset

In this expression, aPath is any filesystem path or one of the keywords none or legacy. If none is specified, the dataset will not be mounted. If legacy is specified, you can mount the dataset using the regular mount command:

mount -t zfs myPool/myDataset /mnt/here

Or in /etc/fstab:

myPool/myDataset  /mnt/myStuff   zfs

Or in /etc/auto.mount:

myStuff  -fstype=zfs  :myPool/myDataset

You can change the mountpoint of a dataset anytime:

zfs set mountpoint=aPath myPool/myDataset

The old mountpoint will be removed and a new one created.

To restore the default mountpoint behavior, first give the pool a mountpoint at the root:

zfs set mountpoint=/myPool myPool

Then change the mountpoints of each dataset you've created to be inherited:

zfs inherit mountpoint myPool/myDataset1
...
zfs inherit mountpoint myPool/myDatasetN

When switching back to inherited mountpoints, I found it necessary to delete the old mountpoint directories by hand. Perhaps this is a bug in ZFS for Linux?

Creating a volume

You can use datasets for non-zfs filesystems. These are called "volumes."

This will create a blank 32G volume:

zfs create -V 32G myPool/myVolume

A new device entry will be created automatically at:

/dev/zd0

The next volume you create will be associated with:

/dev/zd1

And so on each time you add a new volume.

Also, for each new volume, a symlink will be created using the names:

/dev/zvol/myPool/myVolume -> ../../zd0

You can now (optionally) partition the device:

fdisk /dev/zvol/myPool/myVolume

Create a filesystem:

mkfs -t ext4 /dev/zvol/myPool/myVolume

And mount the volume:

mount /dev/zvol/myPool/myVolume /mnt/here

In these examples, /dev/zd0 could be used in place of the longer symlink path, but it's easy to lose track if you have many volumes.

Rename a dataset

zfs rename oldPath newPath

Destroy a dataset

Destroy a dataset that has no children, snapshots, or clones:

zfs destroy myPool/myDataset

The command refuses to run while the dataset still has descendent datasets, snapshots, or clones. If you want everything gone, including dependent clones:

zfs destroy -R myPool/myDataset

List zfs datasets

zfs list

Dataset compression

Using compression will make zfs faster unless the dataset contains mostly compressed data (such as media.)

Enable default compression scheme:

zfs set compression=on myPool/myDataset

Enable a better compression scheme:

zfs set compression=lz4 myPool/myDataset

Turning on compression only takes effect for files added later.

Dataset access time recording

Although turned on by default, it should usually be turned off:

zfs set atime=off myPool/myDataset

If you don't turn off access time recording, your incremental backups will include every file you've accessed even if it hasn't changed.

Filesharing with NFS

You can make any zfs dataset visible to nfs clients:

zfs set sharenfs=on myPool/myDataset

Linux clients can mount this using:

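mount -t nfs zfs.host.com:/myPool/myDataset /mnt/here

The path after the colon is the dataset's mountpoint on the server (shown here for the default mountpoint).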

Note: It is said to be "auspicious" to avoid sharing zfs datasets using native "/etc/exports" and also using "set sharenfs": Use one or the other.

Filesharing with samba

You can make any zfs dataset visible to samba clients:

zfs set sharesmb=on myPool/myDataset

Windows clients will see this at the path:

\\zfs.host.com\mypool_mydataset

Notice that the client sees the path in lowercase.
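Judging only from the one example above, the share name appears to be the dataset path in lowercase with "/" replaced by "_". A sketch of that guess (the function name smb_share_name is made up, and the rule itself is inferred, not documented here):

```shell
#!/bin/sh
# smb_share_name DATASET  (hypothetical helper)
# Prints DATASET in lowercase with each "/" turned into "_".
smb_share_name() {
    printf '%s\n' "$1" | tr 'A-Z/' 'a-z_'
}

smb_share_name myPool/myDataset    # prints mypool_mydataset
```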

To make this work takes a little preparation. Add these directives to the smb.conf global section:

[global]
    ...
    usershare path = /var/lib/samba/usershares
    usershare max shares = 100
    usershare allow guests = yes
    usershare owner only = no

Make sure that the mountpoint directory for the dataset is owned by the samba guest account user. If you want more restricted access, the procedure should be obvious.

Note: It is said to be "auspicious" to avoid sharing zfs datasets using native "smb.conf" and also using "set sharesmb": Use one or the other.

Working with properties

We have seen a few dataset properties in the previous sections. Properties are set using the expression:

zfs set optionName=optionValue myPool/myDataset

Here are a few common properties and values:

mountpoint      aPath
compression     off | on | lz4
atime           on  | off
readonly        off | on
exec            on  | off
sharenfs        off | on
sharesmb        off | on
quota           size
reservation     size

In these expressions, size can be specified with the M, G, or T suffix. When the option is "off" or "on", the default value is shown first.
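ZFS treats these suffixes as binary multiples, so 32G means 32 x 1024 x 1024 x 1024 bytes. If you need the byte count, for example to compare against the numbers "zfs get -p" prints, a sketch like this does the conversion. The function name size_to_bytes is made up, and it only handles whole numbers.

```shell
#!/bin/sh
# size_to_bytes SIZE  (hypothetical helper)
# Converts a whole-number size with an M, G, or T suffix into bytes.
# A value without a suffix is printed unchanged.
size_to_bytes() {
    number=${1%[MGT]}    # strip the suffix, if present
    case $1 in
        *M) echo $(( number * 1024 * 1024 )) ;;
        *G) echo $(( number * 1024 * 1024 * 1024 )) ;;
        *T) echo $(( number * 1024 * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

size_to_bytes 32G    # prints 34359738368
```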

Descendant datasets inherit all properties from their parents except the quota and reservation properties.

If you explicitly set a property and later decide it would be better to inherit the value, you can issue the command:

zfs inherit propertyName myPool/myDataset

By adding the "-r" option, all descendants of the specified dataset will inherit the property.

Showing a property value

zfs get propertyName myPool/myDataset

Showing all property values

zfs get all myPool/myDataset

Property sources

When you get a property value (as described in the previous section) the listing will show the source of the property, which can be:

default             Never specified. The default value is used.
inherited from x    Inherited value from parent dataset x.
local               The user previously used "set" to specify a value.
temporary           The property was specified in a (temporary) -o option when mounting.
received            The property was set when the dataset was created by "zfs receive"
(none)              The property is read-only.

Listing properties by source

You can list only properties that have specific source:

zfs get -s local all myPool/myDataset

And you can add -r to list the properties of all children:

zfs get -r all myPool/myDataset

The -s and -r options can be combined.

Inheriting properties

Some properties are automatically inherited from parent datasets. Others are not. To force the issue, you can ask that a specific property be inherited:

zfs inherit propertyName myPool/myDataset

This will replace any previously set value.

Snapshots

A snapshot of a dataset behaves like a backup: it contains the dataset frozen in time when the snapshot was created.

Note the special syntax:

zfs snapshot myPool/myDataset@mySnap1

The process of making a snapshot is nearly instantaneous.

All snapshots of a dataset are accessible in a hidden directory below the original dataset mountpoint:

.zfs/snapshot

Inside that directory, you would find, for example, mySnap1 which was created in the previous example.

You can use "zfs rename" on snapshots, but their parent path cannot be changed.

Listing snapshots

Snapshots do not appear unless you add an option:

zfs list -t snapshot

Or using:

zfs list -t all

Renaming snapshots

You can only rename the snapshot, not the dataset path before the "@" symbol:

zfs rename myPool/myDataset@oldName myPool/myDataset@newName

Recursively rename a snapshot and the snapshots of all descendents:

zfs rename -r myPool/myDataset@oldName myPool/myDataset@newName

Recursive snapshots

By default, descendent datasets are not part of a snapshot. (Not to be confused with descendent directories in the dataset filesystem, which will be included.) To include descendent datasets, use the -r option. For example,

zfs create myPool/myDataset1
zfs create myPool/myDataset1/myDataset2

This will not include myDataset2:

zfs snapshot myPool/myDataset1@mySnap1

But this will include myDataset2:

zfs snapshot -r myPool/myDataset1@mySnap1

Rollbacks

A rollback transforms a dataset back into the state it was in when a snapshot was created:

First make sure the snapshot itself isn't mounted, then:

zfs rollback -rf myPool/myDataset@mySnap

The -r option destroys any snapshots that are more recent than the one you are rolling back to. Without this option, you can only roll back to the most recent snapshot. Using -R destroys all dependent clones as well. The -f option unmounts the filesystem if necessary.

Clones

A clone works like a copy of a snapshot, but you can modify the files inside:

zfs clone myPool/mySub1@gleep myPool/myFirstClone

Note: Clones must be destroyed before the parent snapshot and dataset can be destroyed.

Promotion

You can promote a clone so it becomes a normal dataset, independent of the original snapshot.

zfs promote myPool/myClone

You might want to rename it to reflect its elevated status:

zfs rename myPool/myClone myPool/myNewDataset

Snapshots of volumes

The syntax is the same:

zfs snapshot myPool/myVolume@mySnap

A new device entry is automatically created:

/dev/zvol/myPool/myVolume@mySnap

This can be mounted in the usual way. Read-only status is implicit.

Replication

Replication is used to copy a source dataset from one pool to a destination dataset in another pool. The destination dataset may be part of a pool on another host. The most common use of replication is making backups.

In the following example, "myPool" is the source pool and "myBack" is the destination pool. We are going to replicate the dataset "myDataset" so it appears at the top level of myBack.

First time backup:

zfs snapshot myPool/myDataset@now
zfs send myPool/myDataset@now | zfs receive -d myBack

Later incremental backups:

zfs rename myPool/myDataset@now myPool/myDataset@then
zfs snapshot myPool/myDataset@now
zfs send -i myPool/myDataset@then myPool/myDataset@now | zfs receive -dF myBack
zfs destroy myPool/myDataset@then
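
The incremental steps above can be wrapped in a small script. This is a sketch rather than a finished backup tool: the function names and the DRY_RUN switch are hypothetical additions, and by default the script only prints the commands so they can be checked before anything is changed.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical switch for this example:
# 1 = only print the commands (default), 0 = print and run them.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command line, and execute it unless this is a dry run.
    printf '%s\n' "$1"
    if [ "$DRY_RUN" = 0 ]; then
        sh -c "$1"
    fi
}

incremental_backup() {
    src=$1      # source dataset, e.g. myPool/myDataset
    dst=$2      # destination pool, e.g. myBack
    # Each step runs only if the previous one succeeded, so the old
    # snapshot is kept whenever the send/receive step reports an error.
    run "zfs rename ${src}@now ${src}@then" &&
    run "zfs snapshot ${src}@now" &&
    run "zfs send -i ${src}@then ${src}@now | zfs receive -dF ${dst}" &&
    run "zfs destroy ${src}@then"
}

incremental_backup myPool/myDataset myBack
```

Run it once as is to review the four commands, then again with DRY_RUN=0 in the environment to execute them.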

zfs send options:

-i : Incremental (old and new snapshots are the following parameters)

zfs receive options:

-d : Discard the pool name from the sending path
-F : Force rollback to most recent snapshot.

In the example above, using -d produces this result on the receiving side:

myBack/myDataset@now

Without -d, zfs receive uses the name you give it exactly as written, so the destination dataset has to be spelled out:

zfs receive myBack/myDataset
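
The -d naming rule can be mimicked with plain string handling, which helps when predicting where a deeper dataset will land. The helper receive_d_target is hypothetical and never calls zfs. It assumes only what is described above: the first element of the sent path (the pool name) is dropped and the rest is appended to the target.

```shell
# receive_d_target TARGET SENT: print the name "zfs receive -d TARGET"
# would give to the snapshot SENT. String manipulation only.
receive_d_target() {
    case $2 in
        */*) printf '%s/%s\n' "$1" "${2#*/}" ;;   # drop "pool/" and append the rest
        *)   printf '%s@%s\n' "$1" "${2#*@}" ;;   # a snapshot of the pool itself
    esac
}

receive_d_target myBack myPool/myDataset@now
receive_d_target myBack myPool/projects/web@now
```

The second call shows that intermediate datasets keep their place: the result is myBack/projects/web@now.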

The -F option does a rollback on the receiving dataset, discarding any changes since the last snapshot. This seems alarming, but the practice is necessary because simply browsing a mounted backup will alter access times (if enabled) and zfs will treat the data as modified, forcing things to be copied. If the destination dataset is used only for backups, there shouldn't be any useful changes since the last snapshot.

An alternative to using -F is to make the destination dataset readonly:

zfs set readonly=on myBack/myDataset

If myBack is used exclusively for backups, you can make the whole pool readonly:

zfs set readonly=on myBack

The readonly property will be inherited by all new descendant datasets created under myBack. It only applies to normal filesystem operations, not to the receive command, which will modify the destination dataset to match the source.

Replicating a pool

By adding a few options, you can back up the entire pool including all descendant datasets.

First time backup:

zfs snapshot -r myPool@now
zfs send -R myPool@now | zfs receive -uFd myBack

Later incremental backups:

zfs rename -r myPool@now myPool@then
zfs snapshot -r myPool@now
zfs send -Ri myPool@then myPool@now | zfs receive -uFd myBack
zfs destroy -r myPool@then

zfs snapshot options:

-r : Snapshot all descendent datasets as well.

zfs send options:

-R : Create a replication package (gets all descendants including snapshots and clones)
-i : Incremental (old and new snapshots are the following parameters)

When doing an incremental send, the "old" snapshot must be the one previously sent.

zfs receive options:

-u : Don't mount anything created on the receiving side.
-F : Force rollback to most recent snapshot.
-d : Discard the pool name from the sending path

When used with -R on the sending side, -F will delete everything that doesn't exist on the sending side. This makes the readonly option unnecessary.

The -u option is useful when the datasets being created have non-default mountpoint options. It prevents them from being mounted and possibly overriding existing directories or mountpoints in the receiving filesystem.

Secure replication between hosts

You must have ssh configured and working first. In this example, myBack is assumed to exist on the destination host with host name "destHost". For a non-incremental backup:

zfs send myPool@now | ssh destHost zfs receive -d myBack
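
The whole-pool options from the previous section should combine with ssh in the same way. The sketch below only prints the resulting pipelines. The helper remote_send_cmd is hypothetical, and it assumes the myPool@then and myPool@now snapshots have been prepared as in "Replicating a pool".

```shell
# remote_send_cmd POOL HOST DESTPOOL full|incr
# Print (do not run) the pipeline that replicates POOL to DESTPOOL on HOST.
remote_send_cmd() {
    case $4 in
        full) send="zfs send -R $1@now" ;;
        incr) send="zfs send -Ri $1@then $1@now" ;;
        *)    echo "usage: remote_send_cmd pool host destpool full|incr" >&2
              return 1 ;;
    esac
    printf '%s | ssh %s zfs receive -uFd %s\n' "$send" "$2" "$3"
}

remote_send_cmd myPool destHost myBack full
remote_send_cmd myPool destHost myBack incr
```

Once the printed line looks right, it can be pasted into a shell or piped to sh.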

Dealing with received properties

By default, when a pool receives a dataset, whatever property values existed on the sending side will be preserved. However, it is possible to override this behavior by setting properties in the zfs receive command or by setting them locally after the dataset is received. In this case, the property will have both a local and a received value. If you subsequently clear the local value, the received value will take effect again. See the "zfs get" section above for a listing of possible property sources.


Problems and solutions

Starting zfs by hand

These commands are useful if you build and install zfs from source or if the modules fail to load during the boot process. Normally, zfs starts using the "zfs.target" under control of systemd. You can do this yourself:

systemctl start zfs.target

Here's what starting the zfs.target will do:

# Load zfs and dependencies:

    modprobe zfs

# Import all pools:

    zpool import -c /etc/zfs/zpool.cache -aN

# Mount all filesystems:

    zfs mount -a

# Activate smb and nfs shares:

    rm -f /etc/dfs/sharetab
    zfs share -a

# Start the zed daemon:

    zed -F

Somehow, systemd makes the last command above daemonize zed, but to start it from the command line, I had to use systemctl:

systemctl start zed.target

ZFS disappears after kernel update

When you update the Linux kernel, dkms is supposed to update the zfs modules. Occasionally, this mechanism fails and your zfs filesystems will be missing after a reboot. To fix this problem, you can force a rebuild using dkms commands, which is somewhat tedious, or you can simply re-install the packages:

yum reinstall spl-dkms
yum reinstall zfs-dkms

Now bring up zfs without rebooting:

systemctl start zfs.target

ZFS disappears after Fedora upgrade

After doing a major upgrade, e.g. version 22 to 23, ZFS will not start or automatically rebuild the modules even if the associated dkms packages are re-installed. I haven't figured out why this happens, but the fix is easy: just add and install the modules by hand.

First, make sure your modules are up to date:

dnf update -y

Find out what version you have:

rpm -q spl-dkms
rpm -q zfs-dkms

Usually spl and zfs will have the same version numbers. Ignore the extension part of the version string. For example, if "rpm -q" shows something like:

zfs-dkms-0.6.5.4-1.fc23.noarch

Your version string is just:

0.6.5.4

Now you can add and install the modules:

dkms add -m spl -v 0.6.5.4
dkms add -m zfs -v 0.6.5.4
dkms install -m spl -v 0.6.5.4
dkms install -m zfs -v 0.6.5.4
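
The version trimming and the four dkms commands can be generated rather than typed. This is a sketch under the assumptions stated above: spl and zfs share one version number, and the package name has the form shown in the example. The helper dkms_version is hypothetical, and the script prints the commands instead of running them.

```shell
# dkms_version PACKAGE: reduce an "rpm -q" line such as
# zfs-dkms-0.6.5.4-1.fc23.noarch to the bare version, 0.6.5.4.
dkms_version() {
    v=${1#*-dkms-}              # drop everything up to and including "-dkms-"
    printf '%s\n' "${v%%-*}"    # drop the release and architecture suffix
}

# On a real system, use: version=$(dkms_version "$(rpm -q zfs-dkms)")
version=$(dkms_version zfs-dkms-0.6.5.4-1.fc23.noarch)

# Print the add commands first, then the install commands.
for module in spl zfs; do
    echo "dkms add -m $module -v $version"
done
for module in spl zfs; do
    echo "dkms install -m $module -v $version"
done
```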

And bring up zfs:

systemctl start zfs.target

Dealing with ZFS updates

After your package manager updates zfs, you may find new features. To avoid problems where there may be multiple hosts that share pools, this process is controlled by a features protocol. You can decide which new pool features to enable or ignore, even if you accept the rest of a software update. This happened to me recently, so I'll outline the process with a specific example:

After doing a routine "yum update", I got this message from "zpool status":

status: Some supported features are not enabled on the pool. The pool
can still be used, but some features are unavailable.

action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not
        support the features. See zpool-features(5) for details.

Ok. Next I tried running:

zpool upgrade

In older versions of zfs, the pool data structure had a version number which would be incremented by the upgrade process. It was an "all or nothing" deal. But that was yesterday. Running "zpool upgrade" now reports:

This system supports ZFS pool feature flags.

All pools are formatted using feature flags.

Some supported features are not enabled on the following pools. Once a
feature is enabled the pool may become incompatible with software
that does not support the feature. See zpool-features(5) for details.

POOL  FEATURE
---------------
mrBack
    filesystem_limits
    large_blocks

mrPool
    filesystem_limits
    large_blocks

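Run without arguments, "zpool upgrade" only reports what is missing. You can enable everything on one pool with "zpool upgrade poolName", or pick and choose by enabling each feature on each pool by hand: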

zpool set feature@filesystem_limits=enabled mrBack
zpool set feature@large_blocks=enabled mrBack

zpool set feature@filesystem_limits=enabled mrPool
zpool set feature@large_blocks=enabled mrPool
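
With more pools or features, the same commands can be produced by a loop. In this sketch the pool and feature names are the ones from the report above, while the variable and function names are hypothetical. The commands are only printed, so the list can be trimmed before anything is enabled.

```shell
# Pools and features taken from the "zpool upgrade" report above.
pools="mrBack mrPool"
features="filesystem_limits large_blocks"

# Print one "zpool set" command per pool and feature.
enable_feature_cmds() {
    for pool in $pools; do
        for feature in $features; do
            echo "zpool set feature@${feature}=enabled ${pool}"
        done
    done
}

enable_feature_cmds
```

Afterwards, "zpool get all mrPool | grep feature@" shows the state of every feature on that pool.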

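The possible values for a feature are:

disabled : The feature is not in use and the on-disk format is unchanged.
enabled : The feature may be used, but nothing on disk depends on it yet.
active : The feature is in use and the on-disk format has changed. Software that does not support the feature may be unable to import the pool.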

This is clearly a better idea: You can decide how to trade off the value of a new feature against potential compatibility problems if you ever want to import your pools on another host.

References