Reliably boot Fedora with root on ZFS

Verified to work on 2018-07-09 using:

Older versions

As of ZFS-0.7.1, it's a lot easier to do this deed and I've removed several sections. Users of zfs-0.6.x.y will need to follow an older version of this page.


Introduction

This article shows how to boot a recent version of Fedora (circa 26) with root on ZFS. To distinguish this article from the crowd, the features include:

I was motivated to collect and publish these methods for several reasons: Most of the root-on-zfs articles are either out of date, don't apply to Fedora, or have a key defect: They may work (or might have worked), but they don't tell you how to solve the inevitable problems that happen when things go wrong as the system is updated or upgraded. Recipes are not enough. Understanding is required.

Why ZFS?

ZFS has significant advantages because it replaces things like software RAID and LVM with a more flexible and general framework. Here are just a few highlights:

Why root on ZFS?

James Earl Jones voice: "If you only knew the POWER of ZFS (wheeze gasp) other solutions would seem weak and ineffective."

The good

If you like to hack away on your operating system, it's very nice to have zfs snapshots. You can take a snapshot and then do any damn thing without a care in the world. By rolling back the snapshot, all is as before. (Occasionally you might have to do the rollback from a rescue disk!) By taking a snapshot before and after installing a complex package group, you can use "zfs diff" to see what changed. It's like having subversion for your entire operating system.

It's also nice to use an automated rolling snapshot process. You can go back in time by hour, day, week, month - whatever you like. This takes care of regrets that take time to discover. And it doesn't take an enormous amount of space.

Backups are easy: two lines to replicate your whole system on a pool at a remote destination. Using incremental replication makes the process much faster than things like rsync.

There are countless articles on the web extolling the virtues of ZFS. If this is all new to you, you probably shouldn't attempt a root-on-zfs installation right away.

The bad

Running root on ZFS with linux is an experimental setup in most distributions. The Fedora developers don't support ZFS and they occasionally do things (by accident) that break it. You have to be prepared for situations where your system won't boot after updates (rarely) or upgrades (frequently.) Hopefully this article will get you through those little disappointments efficiently. But I must emphasize that root-on-zfs is not a great idea for a production server. But the same could be said about Fedora itself...

The ugly

In the world of ZFS, there has been and still is a great deal of F.U.D. created by licensing concerns. Some of these concerns have abated, but not enough for Red Hat or Fedora to embrace the technology. Competing solutions have been created or resurrected such as BTRFS, XFS, and Stratis but it's hard to catch up with what ZFS already achieves. Especially with the reliability that's possible only through years of evolution and deployment in a large community.

Red Hat recently deprecated BTRFS after years of spending and tireless promotion. An interesting discussion appears here. Promoting an advanced file system is not so easy.

"It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage than a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who gain by the new ones." - Niccolò Machiavelli


How does UEFI boot an operating system on ZFS?

The following description is vastly oversimplified. The Grub2 boot loader is a marvel of complexity that could probably start linux on a soroban. But you only need a superficial understanding to proceed with this tutorial.

At boot time, the UEFI bios on your motherboard looks for a small vfat partition called the EFI or sometimes ESP. This partition contains a binary boot program. The BIOS launches the program, which in this case will be the grub2 boot loader. In the discussion that follows, it's called simply "grub."

Grub executes a plain-text configuration script also found in the EFI partition. This program creates the boot menu the user sees and executes steps to launch a selected operating system.

When a root-on-zfs item is selected, a section of the configuration script will search a device named by UUID for a specified zfs pool and dataset. The dataset must have a filesystem that contains the regular linux boot files, vmlinuz and initramfs.

Grub loads the linux kernel from the vmlinuz file. It also loads and mounts a root filesystem in memory from the initramfs file. This "ramfs" filesystem contains a subset of your complete operating system, including systemd startup scripts and kernel modules to support ZFS.

Systemd proceeds to generate the cacophony displayed on your monitor as the computer boots. One of the early steps switches the root file system from the ramfs to the physical device, pool, and dataset(s) that host the complete operating system. When systemd is done, you are allowed to use the computer.

Believe it or not, there are good reasons for this process, but there's no more space to justify or explain.

Ars longa.
Vita brevis.
Tempus fugit.

Quick summary of the installation process

We will create two bootable Fedora systems. One on a rescue device and one on a target device, which will have root-on-zfs. The rescue device will be used to create the target system. It's essential to keep the rescue device available because accidents will happen. I use a fast USB stick and keep it on top of the target computer case.

  1. Install and run Fedora on the rescue device.
  2. Add support for ZFS. (But the rescue system doesn't boot from zfs.)
  3. Create two partitions on the target device: One for EFI and one for ZFS.
  4. Create a ZFS pool and root dataset on the ZFS partition.
  5. Copy the operating system from the installer to the root dataset.
  6. Install the special boot files and directories on the EFI partition.
  7. Configure the target so it will boot from the EFI.

To simplify the discussion, only one device is used for the ZFS pool. The process of building more complex pools is covered on dozens of ZFS websites. I've written one myself. It is, however, both unusual and unwise to deploy ZFS without some form of redundancy. With that in mind, this article should be treated as a tutorial rather than a practical solution. Proceed at your own risk.


Create an installer system

Obtain Fedora

The procedure has been tested using multiple versions of Fedora:

Fedora-Workstation-Live-x86_64-24-1.2.iso
...
Fedora-Workstation-Live-x86_64-28-1.1.iso

Configure the BIOS

Configure your BIOS to operate the target disk controller in AHCI mode (the default on most modern motherboards) and boot from the device where you've mounted the installation media. You should see two choices for the installation volume: One will mention UEFI. That's the one you must use. Otherwise look for settings that specify booting in UEFI mode. If the installer doesn't believe you booted using EFI firmware, it will gype up the rest of the process, so take time to figure this out.

Install Fedora

The installer is a bootable Fedora installation that has the same GPT, UEFI, and partition structure as our target zfs system.

Because we're going to copy the root of the installer to the target, it makes sense to configure the installer to be as similar to the target as possible. After we do the copy, the target is almost ready to boot. A nice thing about UEFI on GPT disks is that no "funny business" is written to boot blocks or other hidden locations. Everything is in regular files.

We're going to make the target system with GPT partitions. The installer needs to have the same structure, but Anaconda will not cooperate if the installer is a small disk. To force Anaconda to create GPT partitions, we must add an option.

Boot the installation media. On the menu screen, select

Start Fedora-workstation-live xx

Press "e" key to edit the boot line. It will look something like this:

vmlinuz ... quiet

At the end of the line add the string "inst.gpt" so it looks like this:

vmlinuz ... quiet inst.gpt

IMPORTANT: If you see an instruction to use the tab key instead of the "e" key to edit boot options, you have somehow failed to boot in EFI mode. Go back to your BIOS and try again.

Proceed with the installation until you get to the partitioning screen. Here, you must take control and specify standard (not LVM) partitioning.

Create two:

1) A 200M partition mounted at /boot/efi
2) A partition that fills the rest of the disk mounted at root "/".

Anaconda will recognize that you want a UEFI system. Press Done to proceed. You'll have to do it twice because you must be harassed for not creating a swap partition.

The rest of the installation should proceed normally. Reboot the new system and open a terminal session. Elevate yourself to superuser:

su 

Then update all the packages:

dnf update

Disable SELinux

Everything is supposed to work with SELinux. But I'm sorry to report that everything doesn't: Fixes are still being done frequently as of 2016. Unless you understand SELinux thoroughly and know how to fix problems, it's best to turn it off. In another year, this advice could change.

Edit:

/etc/sysconfig/selinux

Inside, disable SELinux:

SELINUX=disabled

Save and exit. Then reboot:

shutdown -r now
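
After the reboot it's worth confirming that the edit stuck. Here is a minimal sketch, not an official tool: the function name selinux_config_mode is my own invention, and it only reads the configuration file. The getenforce command reports the live state.

```shell
# selinux_config_mode [FILE] - hypothetical helper, not part of any package.
# Prints the value of the last uncommented SELINUX= line in FILE.
# (SELINUXTYPE= lines do not match the pattern.)
selinux_config_mode() {
    sed -n 's/^SELINUX=//p' "${1:-/etc/sysconfig/selinux}" | tail -n 1
}

# Usage on the rescue system:
#   selinux_config_mode     # prints "disabled" if the edit took
#   getenforce              # prints "Disabled" once you have rebooted
```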

Check for correct UEFI installation

rpm -qa | grep grub2-efi
rpm -qa | grep shim

These packages should already be installed. If not, you somehow failed to install a UEFI system.
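
A second, independent check is to ask the running kernel how it was started. This is a small sketch under one assumption: the kernel creates /sys/firmware/efi only when it was launched by EFI firmware. The name boot_mode and its optional argument (a substitute sysfs root, useful only for testing) are hypothetical.

```shell
# boot_mode [SYSFS_ROOT] - hypothetical helper.
# Prints "uefi" if the EFI firmware directory exists, otherwise "bios".
boot_mode() {
    if [ -d "${1:-/sys}/firmware/efi" ]; then
        echo uefi
    else
        echo bios
    fi
}

boot_mode
```

If this prints "bios" on the rescue system, go back to the BIOS settings before continuing.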

Add extra grub2 modules

dnf install grub2-efi-modules

This pulls in the zfs module needed by grub2 at boot time.

Create some helper scripts

The following scripts will save a lot of typing and help avoid mistakes. When and how they are used will be explained later. For now, consider the descriptions an overview of the process.

The zmogrify script

Create a text file "zmogrify" in /usr/local/sbin (or somewhere on the PATH) that contains:

#!/bin/bash
# zmogrify - Run dracut and configure grub to boot with root on zfs.

# The kernel version to build for is the first argument.
kver="$1"
sh -x /usr/bin/dracut -fv --kver "$kver"
mount /boot/efi
grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
grubby --set-default-index 0
mkdir -p /boot/efi/EFI/fedora/x86_64-efi
cp -a /usr/lib/grub/x86_64-efi/zfs* /boot/efi/EFI/fedora/x86_64-efi
umount /boot/efi

Save the file and make it executable:

chmod a+x zmogrify

What is zmogrify doing?

The first step runs dracut to produce a new /boot/initramfs file that includes zfs support. Dracut is able to do this because of the zfs-dracut package (installed in a later step), which provides a "plug-in."

The peculiar way of running dracut with sh -x overcomes a totally mysterious problem with Fedora 25 that makes dracut hang otherwise. Dracut will hang after displaying the line:

*** Including module: kernel-modules *** 

If you understand why, I'd very much like to hear from you.

The grub2-mkconfig command creates a new grub.cfg script in the EFI boot partition.

The grubby command makes sure your new kernel is the one that boots by default.

Finally, we create a directory in the EFI partition and copy the boot-time version of the zfs module needed by grub2 to mount your zfs root file system.
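
The script hands its first argument straight to dracut, so a missing or mistyped kernel version only shows up as a dracut failure. A possible guard is sketched below. The name kver_ok is hypothetical, and the optional second argument (an alternate modules directory) exists only to make the function testable.

```shell
# kver_ok KVER [MODULES_ROOT] - hypothetical guard for the top of zmogrify.
# Succeeds only if KVER is non-empty and has a module directory.
kver_ok() {
    [ -n "$1" ] && [ -d "${2:-/lib/modules}/$1" ]
}

# Possible use inside zmogrify, before calling dracut:
#   kver_ok "$1" || { echo "usage: zmogrify KERNEL-VERSION" >&2; exit 1; }
```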

The zenter script

Create a text file "zenter" in /usr/local/sbin (or somewhere on the PATH) that contains:

#!/bin/bash
# zenter - Mount system directories and enter a chroot

target=$1

# Make the kernel's virtual file systems visible inside the new root.
mount -t proc  proc "$target/proc"
mount -t sysfs sys "$target/sys"
mount -o bind /dev "$target/dev"
mount -o bind /dev/pts "$target/dev/pts"

chroot "$target" /bin/env -i \
    HOME=/root TERM="$TERM" PS1='[\u@chroot \W]\$ ' \
    PATH=/bin:/usr/bin:/sbin:/usr/sbin \
    /bin/bash --login

echo "Exiting chroot environment..."

# Unmount in the reverse order.
umount "$target/dev/pts"
umount "$target/dev"
umount "$target/sys"
umount "$target/proc"

Save the file and make it executable:

chmod a+x zenter

What is zenter doing?

The chroot command makes (temporarily) a specified path the root of the linux file system as far as shell commands are concerned. But we need more than just the file system root: The extra mount commands make the linux system devices available inside the new root. Dracut expects those to be in place when building a boot-time ramfs.

The zenter script puts you in a nested shell session. The command prompt changes to something like "[root@chroot /]#" to remind you that you're in a special place. When you type "exit", the script exits the chroot environment and returns to the original shell.
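
If zenter is pointed at the wrong directory, the mount commands fail one after another and leave a mess. A possible pre-flight check is sketched here. The name chroot_target_ok is hypothetical, and the list of things it tests is my own guess at a reasonable minimum.

```shell
# chroot_target_ok DIR - hypothetical guard for the top of zenter.
# Succeeds only if DIR looks like an installed system we can chroot into:
# the three mount points exist and there is a shell to run.
chroot_target_ok() {
    [ -n "$1" ] && [ -d "$1/proc" ] && [ -d "$1/sys" ] && [ -d "$1/dev" ] \
        && [ -x "$1/bin/bash" ]
}

# Possible use inside zenter, before the first mount:
#   chroot_target_ok "$1" || { echo "zenter: bad target: $1" >&2; exit 1; }
```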

Make grub2 include zfs support when building configuration files

Edit:

/etc/default/grub

Add the line:

GRUB_PRELOAD_MODULES="zfs"

This will come into play later when we create a grub configuration file on the efi partition.

Also add the line:

GRUB_DISABLE_OS_PROBER=true

Without this fix, running grub2-mkconfig (which we'll need later) will throw an error and possibly disrupt other scripts. This is caused by a problem that has nothing to do with ZFS and will probably be fixed Real Soon Now.
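
If you'd rather script these edits than open an editor, something like the following sketch could work. The function name set_default is hypothetical. It replaces an existing KEY= line or appends a new one, so running it twice doesn't create duplicates.

```shell
# set_default FILE KEY VALUE - hypothetical helper.
# Replaces an existing KEY=... line in FILE, or appends one if none exists.
# VALUE must not contain "|", "&" or a backslash, which are special to sed.
set_default() {
    if grep -q "^$2=" "$1"; then
        sed "s|^$2=.*|$2=$3|" "$1" > "$1.tmp" \
            && cat "$1.tmp" > "$1" \
            && rm -f "$1.tmp"
    else
        printf '%s=%s\n' "$2" "$3" >> "$1"
    fi
}

# Usage (as root):
#   set_default /etc/default/grub GRUB_PRELOAD_MODULES '"zfs"'
#   set_default /etc/default/grub GRUB_DISABLE_OS_PROBER true
```

The file is rewritten in place with cat rather than mv so that its permissions, and any symbolic link pointing at it, stay as they were.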

Fix grub2 bug with pool device path names

Later, we will need to run a script called grub2-mkconfig. This script uses the zpool status command to list the components of a pool. But when these components aren't simple device names, the script fails. Many fixes have been proposed, mostly involving new udev rules. A simple fix is to add a system environment variable. Create this file:

/etc/profile.d/grub2_zpool_fix.sh

That contains:

export ZPOOL_VDEV_NAME_PATH=YES

This will take effect the next time you login or reboot. Which will be soon...

Prepare for future kernel updates

ln -s /usr/local/sbin/zmogrify /etc/kernel/postinst.d

The scripts in the postinst.d directory run after every kernel update. The zmogrify script takes care of updating the initramfs and the grub.cfg file on the EFI partition. (Stuff you need to boot with root in a zfs pool.)

Configure the ZFS repository

Using your browser, visit ZFS on Fedora.

Click on the link for your version of Fedora. When prompted, open it with "Software Install" and press the install button. This will add the repository zfs-release to your package database. It tells the package manager where to find zfs packages and updates.

Install zfs

dnf install kernel-devel
dnf install zfs zfs-dracut

During the installation, the process will pause to build the spl module and then the zfs module. A line will appear showing the path to their locations in /lib/modules/x.y.z/extra. If you don't see those lines, you don't have a match between the kernel-devel package you installed and your currently running kernel.

Load zfs

modprobe zfs

If you get a message that zfs isn't present, the modules failed to build during installation. Try rebooting.
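
Before rebooting, you can look for the module file yourself. This is a sketch that assumes the module is called zfs.ko (possibly with a compression suffix) and lives somewhere under the kernel's module directory. The name zfs_module_path is hypothetical.

```shell
# zfs_module_path KVER [MODULES_ROOT] - hypothetical helper.
# Prints the path of the first zfs kernel module found for KVER, or fails.
zfs_module_path() {
    found=$(find "${2:-/lib/modules}/$1" -name 'zfs.ko*' 2>/dev/null | head -n 1)
    [ -n "$found" ] && echo "$found"
}

# Usage:
#   zfs_module_path "$(uname -r)" || echo "no zfs module for this kernel"
```

If nothing is found, compare the version printed by uname -r with the version of the kernel-devel package you installed.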

Configure systemd services to their presets

For some reason, (2016-10) they don't get installed this way, so execute:

systemctl preset zfs-import-cache
systemctl preset zfs-mount
systemctl preset zfs-share
systemctl preset zfs-zed

I think this step was needed after upgrading from Fedora 24 to 25 with an earlier version of ZFS. When starting over on a new machine, I'd skip this step.

Configure the zfs cache file

ZFS uses a cache file to keep track of pools and their associated devices. It speeds up the boot process, especially when you have complex pools.

When booting from ZFS, the cache can be a headache because you have to rebuild the ramdisk (initrd) using dracut each time you change your pool structure. (If this doesn't make sense to you, just proceed with the advice that follows.)

For that reason, I suggest turning off the cache.

At boot time, by default, systemd runs a script:

zfs-import-cache.service

This loads the previous pool configuration stored in the cache file:

/etc/zfs/zpool.cache

But you can use a different service that scans all devices for pools. Proceed as follows:

systemctl disable zfs-import-cache
systemctl enable zfs-import-scan

Now tell zfs not to make a new cache file and delete the old one.

zpool set cachefile=none mypool
rm /etc/zfs/zpool.cache

It's very important to recreate your initramfs now because otherwise it will contain a zpool.cache that's no longer up to date:

dracut -fv --kver `uname -r`

If you decide to revert to using a cache:

systemctl disable zfs-import-scan
systemctl enable zfs-import-cache
zpool set cachefile=/etc/zfs/zpool.cache mypool

Setting the cachefile path recreates the file immediately.

Note: Setting the cachefile property value has side effects as described above. But you can't see the cachefile path by "getting" the value. This seems to be a bug.
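
To find out whether an initramfs still carries a stale cache file, you can search its file listing. The sketch below assumes the lsinitrd tool that ships with dracut and the usual Fedora image name. The function name has_zpool_cache is hypothetical. It reads a listing on standard input so it can be tried on any text.

```shell
# has_zpool_cache - hypothetical helper; reads a file listing on stdin.
# Succeeds if any line ends with etc/zfs/zpool.cache.
has_zpool_cache() {
    grep -q 'etc/zfs/zpool\.cache$'
}

# Usage (as root), assuming the default Fedora initramfs name:
#   lsinitrd "/boot/initramfs-$(uname -r).img" | has_zpool_cache \
#       && echo "initramfs still carries a cache file; rerun dracut"
```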


Create the target system

Prepare a target disk

If your target is a USB disk, you simply need to plug it in. Tail the log and note the new disk identifier:

journalctl -f

If you have a "real" disk, first look at your mounted partitions and take note of the names of your active disks. If you're using UUIDs, make a detailed list:

ls /dev/disk/by-uuid

The shortcuts will point to the /dev/sdn devices you're using now. Now you can shutdown, install the new disk, and reboot. You should find a new disk when you take a listing:

ls /dev/sd*

From now on, we'll assume you've found your new disk named "sdx".
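
If you don't trust your eyes, save a listing before and after installing the disk and let the computer compare them. This is only a sketch: the name new_devices is hypothetical, and it assumes your existing disks keep their names across the reboot, which isn't guaranteed.

```shell
# new_devices BEFORE AFTER - hypothetical helper.
# Prints names present in the AFTER listing but not in the BEFORE listing.
new_devices() {
    sort "$1" > "$1.sorted"
    sort "$2" > "$2.sorted"
    comm -13 "$1.sorted" "$2.sorted"
    rm -f "$1.sorted" "$2.sorted"
}

# Usage:
#   ls /dev/sd* > /root/before     # before installing the disk
#   ls /dev/sd* > /root/after      # after rebooting with the disk attached
#   new_devices /root/before /root/after
```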

If you got your target disk out of a junk drawer, the safest way to proceed is to zero the whole drive. This will get rid of ZFS and any RAID labels that might cause trouble. Bring up a terminal window, enter superuser mode, and run:

dd if=/dev/zero of=/dev/sdx bs=10M status=progress

Zeroing takes a while for large disks. If you know which partition on the old disk has a zfs pool, for example sdxn, you can speed things up by importing the pool and destroying it. But if the partition is already corrupted so the pool won't import properly, you can blast it by zeroing the first and last megabyte:

mysize=`blockdev --getsz /dev/sdxn`
dd if=/dev/zero of=/dev/sdxn bs=512 count=2048
dd if=/dev/zero of=/dev/sdxn bs=512 count=2048 seek=$((mysize - 2048))
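
The seek value is plain sector arithmetic: blockdev --getsz reports the size in 512-byte sectors, and 2048 of those make one megabyte. The size must be that of the device dd is writing to. A worked example, using a hypothetical helper called tail_seek:

```shell
# tail_seek SECTORS [COUNT] - hypothetical helper.
# Prints the dd seek offset that places COUNT sectors (default 2048, i.e.
# 1 MiB of 512-byte sectors) at the very end of a device SECTORS long.
tail_seek() {
    echo $(( $1 - ${2:-2048} ))
}

tail_seek 20971520    # a 10 GiB partition: prints 20969472
```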

Alternatively, there is the command:

zpool labelclear /dev/sdx

Which can be forced if necessary:

zpool labelclear -f /dev/sdx

Partition the target device

I'm too lazy to create a step-by-step gdisk walk-through. If you need such a thing, you probably don't belong here. Run gdisk, parted, or whatever tool you prefer on /dev/sdx. Erase any existing partitions and create two new ones:

Partition  Code        Size      File system   Purpose
        1  EF00     200 MiB      EFI System    EFI boot partition
        2  BF01  (rest of disk)  ZFS           ZFS file system

Write the new table to disk and exit. Tell the kernel about the new layout:

partprobe

If you are prompted to reboot, do so. You now have two new devices:

/dev/sdx1 - For the target EFI partition
/dev/sdx2 - For the target ZFS file system
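
For those who prefer a non-interactive tool, the same layout can be described with sgdisk, which comes in the gdisk package. The sketch below only prints the commands so you can read them before anything touches the disk. The function name partition_cmds is hypothetical, and /dev/sdx is the same placeholder used throughout this article.

```shell
# partition_cmds DISK - hypothetical helper; prints commands, runs nothing.
# Partition 1: 200 MiB, type EF00. Partition 2: rest of the disk, type BF01.
partition_cmds() {
    echo "sgdisk --zap-all $1"
    echo "sgdisk -n 1:0:+200M -t 1:EF00 $1"
    echo "sgdisk -n 2:0:0 -t 2:BF01 $1"
}

partition_cmds /dev/sdx
# Review the output, then run it as root with:
#   partition_cmds /dev/sdx | sh
```

Run partprobe afterwards, exactly as you would after using gdisk.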

Format the target EFI partition

mkfs.fat -F32 /dev/sdx1

Create a pool

zpool create pool -m none /dev/sdx2 -o ashift=12

By default, zfs will mount the pool and all descendant datasets automatically. We turn that feature off using "-m none". The ashift property specifies the physical sector size of the disk. That turns out to be a Big Ugly Deal, but you don't need to be concerned about it in a tutorial. I've added a section at the end of the article about sector sizes. You need to know this stuff if you're building a production system.

Configure performance options

zfs set compression=on pool
zfs set atime=off pool

Compression improves zfs performance unless your dataset contains mostly already-compressed files. Here we're setting the default value that will be used when creating descendant datasets. If you create a dataset for your music and movies, you can turn it off just for that dataset.

The atime property controls file access time tracking. This hurts performance and most applications don't need it. But it's on by default.

Create the root dataset

zfs create -p pool/ROOT/fedora

Enable extended file attributes

zfs set xattr=sa pool/ROOT/fedora

This has the side effect of making ZFS significantly faster.

Provide some redundancy

If you think you might continue to use this one-device zfs pool, you can slightly improve your odds of survival by using the copies property: This creates extra copies of every block so you can recover from a "bit fatigue" event. A value of 2 will double the space required for your files:

zfs set copies=2 pool/ROOT/fedora

I use this property when booting off USB sticks because I don't trust the little devils.

It would be nice to find two new symbolic links for the partitions we've created on the target disk in here:

/dev/disk/by-uuid

The symbolic link to the EFI partition is always there pointing to /dev/sdx1, but the one for the ZFS partition pointing to /dev/sdx2 is not. I've tried re-running the rules:

udevadm trigger

But the link for /dev/sdx2 still isn't there. Regrettably, you have to reboot now. When you're back up and running proceed...

Export the pool and re-import to an alternate mount point

zpool export pool
zpool import pool -d /dev/disk/by-uuid -o altroot=/sysroot

At this point don't panic if you look for /sysroot: It's not there because there are no specified mount points yet.

You will also notice the clause -d /dev/disk/by-uuid. This will rename the disk(s) so their UUIDs appear when you execute zpool status. You could also use -d /dev/disk/by-id - Details about this are covered later.

Note: It's possible to arrive at a state where a newly formatted disk will have an entry in some-but-not-all of the /dev/disk subdirectories. I'm not sure what causes this annoyance, but if the command to import the pool using UUIDs fails, try using /dev/disk/by-id instead. The important thing is to get away from using device names.
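
To see which of the /dev/disk subdirectories actually have an entry for a partition, you can resolve the symbolic links and compare. A minimal sketch follows. The name links_to is hypothetical, and it relies on readlink -f from GNU coreutils.

```shell
# links_to DEVICE DIR - hypothetical helper.
# Prints every symlink in DIR that resolves to DEVICE.
links_to() {
    want=$(readlink -f "$1")
    for link in "$2"/*; do
        [ -L "$link" ] || continue
        [ "$(readlink -f "$link")" = "$want" ] && echo "$link"
    done
    return 0
}

# Usage:
#   links_to /dev/sdx2 /dev/disk/by-uuid
#   links_to /dev/sdx2 /dev/disk/by-id
```

An empty result for a directory means the import with -d pointing there has nothing to find.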

Specify the real mount point

zfs set mountpoint=/ pool/ROOT/fedora

Because we imported the pool with the altroot option, /sysroot will now be mounted.

Copy files to the target

We use rsync because it provides some fine control over how things are copied. The options used come from an article by Rudd-O. This is what the options do:

a: Archive
v: Verbose
x: Only local filesystems
H: Preserve hard links
A: Preserve ACLs
S: Handle sparse files efficiently
X: Preserve extended attributes

Copy root from the installer to the target

rsync -avxHASX / /sysroot/

Don't get careless and leave out the trailing "/" character.

You'll see some errors near the end about "journal". That's ok.

Copy the EFI partition from the installer to the target

Recall that "/dev/sdx1" is the device name of the target EFI partition:

cd
mkdir here
mount /dev/sdx1 here
cp -a /boot/efi/* here
umount here
rmdir here

Note the UUID of the target EFI partition

blkid /dev/sdx1

In my case, it was "0FEE-5943"

Edit the target fstab

/sysroot/etc/fstab

Erase everything and add this line:

UUID=0FEE-5943    /boot/efi  vfat  umask=0077,shortname=winnt 0 2

Even that line is unnecessary for booting the system, but utilities such as grub2-mkconfig expect the efi partition to be mounted.

(Obviously, use your EFI partition's UUID.)
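
Copying a UUID by hand invites typos, so here is a sketch that builds the line for you. The function name efi_fstab_line is hypothetical. The blkid options shown in the usage comment print just the UUID value.

```shell
# efi_fstab_line UUID - hypothetical helper.
# Prints the single fstab entry for the EFI partition.
efi_fstab_line() {
    printf 'UUID=%s    /boot/efi  vfat  umask=0077,shortname=winnt 0 2\n' "$1"
}

efi_fstab_line 0FEE-5943
# From the rescue system (as root), with the pool mounted at /sysroot:
#   efi_fstab_line "$(blkid -s UUID -o value /dev/sdx1)" > /sysroot/etc/fstab
```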

Chroot into the target

zenter /sysroot

Deal with a Fedora 28 aggravation

Ignore this section for earlier versions of Fedora. Fedora 28 (and probably beyond) contains patches for the grub2-tools package that break booting on ZFS. The reasons are complex, but there's a simple work-around. Run the command:

zpool status

Note the name of the first device in your root pool. As an example let's assume you used device IDs so the name might be some mess like:

/dev/disk/by-id/wwn-0x5002538d415dc9b5-part2

Edit the file:

/etc/default/grub

Add this line:

GRUB_DEVICE_BOOT=/dev/disk/by-id/wwn-0x5002538d415dc9b5-part2

Run zmogrify

zmogrify `uname -r`

This expression assumes that the kernel you're running is the same as the kernel for the target system. Otherwise, you must supply the target kernel name.

Exit the chroot

exit

Export the pool

zpool export pool

Reboot

If you're the adventurous type, simply reboot. Otherwise, first skip down to Checklist after update and before reboot and run through the tests. Then reboot.

If all is well, you'll be running on zfs. To find out, run:

mount | grep zfs

You should see your pool/ROOT/fedora mounted as root. If it doesn't work, please try the procedure outlined below in Recovering from boot failure.
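
If you want a yes-or-no answer for a script, the kernel's mount table can be parsed directly. A small sketch, with a hypothetical function name and an optional file argument that is there only for testing:

```shell
# root_fstype [MOUNTS_FILE] - hypothetical helper.
# Prints the file system type of the last entry mounted at "/".
root_fstype() {
    awk '$2 == "/" { t = $3 } END { print t }' "${1:-/proc/mounts}"
}

# Usage:
#   [ "$(root_fstype)" = "zfs" ] && echo "root is on zfs"
```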

Take a snapshot

Before disaster strikes:

zfs snapshot pool/ROOT/fedora@firstBoot

We are finished. Now for the ugly bits!


Post installation enhancements

More about device names

When we created the pool, we used the old-style "sdx" device names. They are short, easy to type and remember. But they have a big drawback. Device names are associated with the ports on your motherboard, or sometimes just the order in which the hardware was detected. It would really be better to call them "dynamic connection names." If you removed all your disks and reconnected them to different ports, the mapping from device names to drives would change.

You might think that would play havoc when you try to re-import the pool. Actually, ZOL (ZFS on Linux) protects you by automatically switching the device names to UUIDs when device names in the pool conflict with active names in your system.

Linux provides several ways to name disks. The most useful of these are IDs and UUIDs. Both are long complex strings impossible to remember or type. I prefer to use IDs because they include the serial number of the drive. That number is also printed on the paper label. If you get a report that a disk is bad, you can find it by reading the label.

First, boot into your installer linux system.

Here's how to use IDs:

zpool import pool -o altroot=/sysroot -d /dev/disk/by-id

If you prefer UUIDs:

zpool import pool -o altroot=/sysroot -d /dev/disk/by-uuid

Should you ever want to switch back to device names:

zpool import pool -o altroot=/sysroot -d /dev

To switch between any two, first export and then import as shown above. The next time you export the pool or shut down, the new names will be preserved in the disk data structures.

Optimize performance by specifying sector size

The problem: ZFS will run a lot faster if you correctly specify the physical sector size of the disks used to build the pool. For this optimization to be effective, all the disks must have the same sector size. Discovering the physical sector size is difficult because disks lie. The history of this conundrum is too involved to go into here.

The short answer: There is no way to be sure about physical sector sizes unless you find authoritative information from the maker. Guessing too large wastes space but ensures performance. Guessing too small harms performance.

Here are some heuristic steps:

First, ask the disk for its sector size:

lsblk -o NAME,PHY-SEC /dev/sdx

If it reports any value except 512, you can believe the answer. If it shows 512, you may be seeing the logical sector size.

When creating a pool, the sector size is specified as an exponent n, meaning 2^n bytes. The default is 9 (512 bytes).

A table of likely values:

n       2^n
------------
9        512
10      1024
11      2048
12      4096
13      8192

Specify sector size using the ashift parameter:

zpool create pool /dev/sdx2 -m none -o ashift=13

To be effective this should be done before the pool is used. If you decide to change it later, it's best to take a backup, re-create the pool and restore the backup.
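
The table is just a base-2 logarithm, so the conversion is easy to script. Here's a sketch with a hypothetical function name. It refuses sizes that aren't a power of two.

```shell
# ashift_for BYTES - hypothetical helper.
# Prints n such that 2^n = BYTES; fails if BYTES is not a power of two.
ashift_for() {
    size=$1
    n=0
    [ "$size" -ge 1 ] 2>/dev/null || return 1
    while [ "$size" -gt 1 ]; do
        [ $(( size % 2 )) -eq 0 ] || return 1
        size=$(( size / 2 ))
        n=$(( n + 1 ))
    done
    echo "$n"
}

ashift_for 4096    # prints 12
# The kernel's own idea of the physical sector size (which may still be a lie):
#   ashift_for "$(cat /sys/block/sdx/queue/physical_block_size)"
```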

Configure swapping

There is/was a certain amount of controversy about the safety of swapping on ZFS. I simply quote the advice given on the ZOL GitHub site:

zfs create -V 4G -b $(getconf PAGESIZE) \
    -o compression=zle \
    -o logbias=throughput \
    -o sync=always \
    -o primarycache=metadata \
    -o secondarycache=none \
    -o com.sun:auto-snapshot=false pool/swap

Some notes about the properties:

The swap dataset isn't part of ROOT/fedora. That allows us to share the space with other linux installations.

The -V 4G means this will be a 4G ZVOL - a fixed-size, non-zfs volume allocated from the pool.

A new device will appear as:

/dev/zvol/pool/swap

The reason for turning off data caching is that the operating system will manage swap like a cache anyway.

The same reasoning (and cache settings) are used for ZVOLs created for virtual machine disk images.

The compression algorithm is changed from the default (lz4) to something very fast and simple: zero-length encoding.

After you're running with root on ZFS, complete swap setup by adding a line to /etc/fstab:

/dev/zvol/pool/swap none swap defaults 0 0

Format the swap volume:

mkswap -f /dev/zvol/pool/swap

And enable swapping:

swapon -av

The article reminds us:

  1. Make sure the devices in the pool are named using UUIDs or IDs. (Not device names.)
  2. Don't enable hibernation: The memory can't be restored from swap in ZFS because the pool isn't accessible when the hardware tries to return from hibernation.

Create more datasets

More complex dataset trees are possible and usual. Some reasons:

  1. To share data between operating systems.
  2. To apply special properties to some datasets.
  3. To isolate user accounts and enforce quotas.
  4. To avoid mixing user data with operating system files in the same dataset.
  5. To protect user data and selected O.S. directories from system rollbacks.

Examples:

You might want to boot a previously installed operating system, but keep user data, mail, webpage, etc. current. The ROOT datasets are separate:

pool/fedora24
pool/fedora25
pool/fedora26
...

But other datasets such as /home, /spool, /var/www will be retained no matter which operating system root is selected.

WARNING: It's known that Fedora doesn't like to see /usr mounted on a separate dataset. It has to be part of the root file system or you won't be able to boot.

Real world example

The following section shows the datasets, mountpoints and properties configured on my internet-facing server. It looks complicated, but there's a simple idea at work here: I want to rollback changes to the root file system, but not disturb:

This example uses legacy mountpoints. It's nice to do everything with zfs when possible, but some mountpoints must be processed early and I like to see all the information in one place.

Legacy mountpoints in /etc/fstab

UUID=693A-C0B1             /boot/efi       vfat    umask=0077,shortname=winnt 0 2
/dev/zvol/pool/swap        none            swap    defaults 0 0
pool/fedora/var/cache      /var/cache      zfs     defaults 0 0
pool/fedora/var/lib        /var/lib        zfs     defaults 0 0
pool/fedora/var/log        /var/log        zfs     defaults 0 0
pool/fedora/var/spool      /var/spool      zfs     defaults 0 0
pool/fedora/var/tmp        /var/tmp        zfs     defaults 0 0
pool/fedora/var/www        /var/www        zfs     defaults 0 0
pool/home                  /home           zfs     defaults 0 0
pool/root                  /root           zfs     defaults 0 0

Note that the last column, "pass", is always zero for zfs volumes. Linux fsck can't deal with zfs.
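
A slip in that column is easy to miss in a long fstab. The following sketch lists any zfs entries whose pass field is not zero. The function name zfs_fsck_entries is hypothetical, and the optional argument is there only so you can try it on a copy.

```shell
# zfs_fsck_entries [FSTAB] - hypothetical helper.
# Prints uncommented fstab lines of type zfs whose sixth field ("pass")
# is not 0. A missing sixth field counts as 0.
zfs_fsck_entries() {
    awk '$1 !~ /^#/ && $3 == "zfs" && ($6 + 0) != 0' "${1:-/etc/fstab}"
}

# Usage:
#   zfs_fsck_entries            # no output means every zfs line is fine
```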

Pool properties

pool                   cachefile              none
pool                   ashift                 12
pool                   mountpoint             none
pool                   compression            on
pool                   atime                  off

Dataset properties

pool/fedora            mountpoint             /
pool/fedora            xattr                  sa

pool/fedora/var        exec                   off
pool/fedora/var        setuid                 off
pool/fedora/var        canmount               off
pool/fedora/var        mountpoint             legacy

pool/fedora/var/cache  com.sun:auto-snapshot  false

pool/fedora/var/tmp    exec                   on
pool/fedora/var/tmp    com.sun:auto-snapshot  false

pool/fedora/var/log    (inherit from var)
pool/fedora/var/spool  (inherit from var)
pool/fedora/var/lib    (inherit from var)
pool/fedora/var/www    (inherit from var)

pool/root              mountpoint             legacy

pool/home              mountpoint             legacy
pool/home              setuid                 off

pool/swap              volsize                4G
pool/swap              compression            zle
pool/swap              refreservation         4.13G
pool/swap              primarycache           metadata
pool/swap              secondarycache         none
pool/swap              logbias                throughput
pool/swap              sync                   always
pool/swap              com.sun:auto-snapshot  false

The dataset pool/fedora/var is created just to provide inheritable properties. If it worries you, "canmount" is not an inheritable property and this example illustrates why that's a good idea.

Doing experiments

Note that /var/lib is separate, so rolling back the root will not undo package installations. That's better handled by dnf history rollback ....

My usual drill when I want to try something risky is to snapshot the root:

zfs snapshot pool/fedora@mark

Now I can proceed. Perhaps this pattern is familiar to you?

Install packages. Modify configuration files. Test. Install and configure more stuff. Try alternatives. Nothing seems to work out... Get discouraged. Decide to give up.

Now for the recovery:

dnf history rollback (to whatever I installed first)

zfs rollback pool/fedora@mark

Now you can try again or just forget the whole thing:

zfs destroy pool/fedora@mark

Getting there

If you started with a root-on-zfs system using only the EFI and root partitions, you can convert it to a more elaborate arrangement using your rescue system.

For each path "X" you'd prefer to have in a dataset:

mv X oldX
mkdir X
mount X
rsync -axHASX oldX/ X
rm -rf oldX

The fearless may conserve space using:

rsync -axHASX --remove-source-files oldX/ X
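
The steps above can be wrapped in a small function. This is only a sketch under a few assumptions: the names run and migrate_dir are invented for illustration, the old copy is kept as "X.old" rather than "oldX", and the path must already have an fstab entry so that "mount X" works. By default it just prints the commands, so you can read them before letting it loose.

```shell
# Print a command instead of running it unless DRY_RUN=0 is set.
# run and migrate_dir are hypothetical helper names.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Move the contents of directory $1 into the dataset mounted at $1.
# Assumes $1 already has an entry in /etc/fstab.
migrate_dir() {
    x=$1
    run mv "$x" "$x.old" &&
    run mkdir "$x" &&
    run mount "$x" &&
    run rsync -axHASX "$x.old/" "$x" &&
    run rm -rf "$x.old"
}
```

Calling migrate_dir /var/log prints the five commands for that path. Setting DRY_RUN=0 in the environment executes them, stopping at the first failure.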

Compression properties

In the example above, compression is enabled for the entire pool. The default compression algorithm is lzjb or lz4. This is determined by the value of a feature flag, which you can inspect using:

zpool get feature@lz4_compress pool

The default value for this feature when you create a new pool is "active", so lz4 is effectively the default and that is the recommended algorithm for datasets containing filesystems.
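
To make the rule concrete, here's a tiny sketch that maps the state of the feature flag to the algorithm that "compression=on" selects. The function name default_compression is hypothetical. The mapping simply restates the paragraph above.

```shell
# Given the value of feature@lz4_compress, print the algorithm used
# when a dataset has compression=on.
# default_compression is a hypothetical helper name.
default_compression() {
    case "$1" in
        enabled|active) echo lz4 ;;
        *)              echo lzjb ;;
    esac
}

# Usage on a live system (pool name "pool" as in the example above):
#   default_compression "$(zpool get -H -o value feature@lz4_compress pool)"
```

The -H and -o value options make zpool print just the bare value, which is convenient for scripts.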

Using zenter with a complex pool

Immediately after using zenter, you will normally want to execute:

mount -a

This will mount all the datasets listed in fstab. Many commands will need these directories to work properly. Worse, if you forget, those commands may write into one or more of the empty mount point directories, which will prevent the datasets from mounting normally when you try to boot.

Auto snapshots

Although not covered in this article, there are many application level programs to help you create and manage automatic rolling snapshots. These are snapshots taken at regular intervals (hourly, daily, etc.) By convention, the user property "com.sun:auto-snapshot" is used to turn this feature on or off for specific datasets. Examples: It makes no sense to snapshot the swap partition and it would be dangerous to roll back nfs lock files.
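
Whatever tool you choose, it helps to see which datasets have opted out. Here's a minimal sketch that filters the tab-separated output of "zfs get" for datasets where the property is "false". The function name excluded_from_autosnap is invented for this example.

```shell
# Read "name<TAB>value" lines on stdin and print the names of datasets
# whose com.sun:auto-snapshot property is "false".
# excluded_from_autosnap is a hypothetical helper name.
excluded_from_autosnap() {
    awk -F '\t' '$2 == "false" { print $1 }'
}

# Usage on a live system:
#   zfs get -H -o name,value -t filesystem,volume -r com.sun:auto-snapshot pool |
#       excluded_from_autosnap
# To opt a dataset out:
#   zfs set com.sun:auto-snapshot=false pool/swap
```

Datasets where the property was never set show "-" as the value and are not listed.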

Limiting ARC memory usage

ZFS wants a lot of memory for its ARC cache. The recommended minimum is 1G per terabyte of storage in your pool(s). Without constraints, ZFS is likely to run off with all your memory and sell it at a pawn shop. To prevent this, you should specify a memory limit. This is done using a module parameter. A reasonable setting is half your total memory:

Edit or create:

/etc/modprobe.d/zfs.conf

This expression (for example) limits the ARC to 16G:

options zfs zfs_arc_max=17179869184

The size is in bytes. Powers of 2 make convenient round values:

16GB  = 17179869184
8GB   = 8589934592
4GB   = 4294967296
2GB   = 2147483648
1GB   = 1073741824
512MB = 536870912
256MB = 268435456
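
If you'd rather compute the number than look it up, here's a small sketch. The function names gib_to_bytes and half_of_ram are made up for illustration. The second one reads MemTotal (reported in kB) from /proc/meminfo and halves it, following the "half your total memory" suggestion above.

```shell
# Convert a size in GiB to bytes.
# gib_to_bytes and half_of_ram are hypothetical helper names.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print half of MemTotal in bytes, reading a meminfo-style file.
half_of_ram() {
    kb=$(awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}")
    echo $(( kb * 512 ))          # kB * 1024 / 2
}

# Usage: print a line suitable for /etc/modprobe.d/zfs.conf
#   echo "options zfs zfs_arc_max=$(half_of_ram)"
```

For example, gib_to_bytes 16 prints 17179869184, the value used in the example line above.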

The modprobe.d parameter files are needed at boot time, so it's important to rebuild your initramfs after adding or changing this parameter:

dracut -fv --kver `uname -r`

And then reboot.

Running without ECC memory

To operate with ultimate safety and win the approval of ZFS zealots, you really ought to use ECC memory. ECC memory is typically only available on "server class" motherboards with Intel Xeon processors.

If you don't use ECC memory, you are taking a risk. Just how big a risk is the subject of considerable controversy and beyond the scope of these notes.

First, give yourself a fighting chance by testing your memory. Obtain a copy of this utility and run a 24 hour test:

http://www.memtest86.com

This is particularly important on a new server because memory with defects is often purchased with those defects. If your memory passes the test, there is a good chance it will be reliable for some time.

To avoid ECC, you might be tempted to keep your server in a 1-meter thick lead vault buried 45 miles underground. It turns out that many of the radioactive sources for memory-damaging particles are already in the ceramics used to package integrated circuits. So this time-consuming measure is probably not worth the cost or effort.

Instead, we're going to enable the unsupported ZFS_DEBUG_MODIFY flag. This will mitigate, but not eliminate, the risk of using ordinary memory. It's supposed to make the zfs software do extra checksums on buffers before they're written. Or something like that. Information is scarce because people who operate without ECC memory are usually dragged down to the Bad Place.

Edit:

/etc/modprobe.d/zfs.conf

Add the line:

options zfs zfs_flags=0x10

As previously discussed, you need to rebuild initramfs after changing this parameter:

dracut -fv --kver `uname -r`

And then reboot. You can confirm the current value here:

cat /sys/module/zfs/parameters/zfs_flags
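
The value you get back may have other bits set, so a plain comparison with 16 isn't quite right. Here's a minimal sketch that tests only the 0x10 bit. The function name debug_modify_enabled is hypothetical, and I'm assuming the sysfs file prints the flags as a decimal number.

```shell
# Succeed if the ZFS_DEBUG_MODIFY bit (0x10) is set in the given flags value.
# debug_modify_enabled is a hypothetical helper name.
debug_modify_enabled() {
    [ $(( $1 & 0x10 )) -ne 0 ]
}

# Usage on a live system:
#   if debug_modify_enabled "$(cat /sys/module/zfs/parameters/zfs_flags)"; then
#       echo "ZFS_DEBUG_MODIFY is on"
#   fi
```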

Rebuilding grubx64.efi

This idea might be interesting to those who want to build a new zfs-friendly Linux distribution.

You'll recall the step where we copied zfs.mod to the EFI partition. This is necessary because zfs is not one of the modules built into Fedora's grub2-efi rpm. Other Linux distributions support zfs without this hack so I decided to find out how it was done.

You can see the list of built-in modules by downloading the grub2-efi source and reading the spec file. It would be straightforward but perhaps a bit tedious to add "zfs" to the list of built-in modules, remake the rpm and install over the old one. An easier way is to rebuild one file found here:

/boot/efi/EFI/fedora/grubx64.efi

The trick is to get hold of the original list of modules. They need to be listed as one giant string without the ".mod" suffix. Here's the list from the current version of grub2.spec with "zfs" already appended:

all_video boot btrfs cat chain configfile echo efifwsetup efinet 
ext2 fat font gfxmenu gfxterm gzio halt hfsplus iso9660 jpeg loadenv 
loopback lvm mdraid09 mdraid1x minicmd normal part_apple part_msdos
part_gpt password_pbkdf2 png reboot search search_fs_uuid search_fs_file 
search_label serial sleep syslinuxcfg test tftp video xfs zfs

Copy the text block above to a temporary file "grub2_modules". Then execute:

grub2-mkimage \
    -O x86_64-efi \
    -p /boot/efi/EFI/fedora \
    -o /boot/efi/EFI/fedora/grubx64.efi \
    `xargs < grub2_modules`

The grub2.spec script does something like this when building the binary rpm.

Now you can delete the directory where we added zfs.mod:

rm -rf /boot/efi/EFI/fedora/x86_64-efi

Better? A matter of taste I suppose. Rebuilding grubx64.efi is dangerous because it could be replaced by a Fedora update. I prefer using the x86_64-efi directory.

The package grub2-efi-modules installs a large module collection in:

/usr/lib/grub/x86_64-efi

The total size of all the .mod files there is about 3MB so you might wonder why not include all the modules? For reasons I don't have the patience to discover, at least two of them conflict and derail the boot process.


Checklist after update and before reboot

Here's a list of steps you can take that help prevent boot problems. It won't hurt to do this after any update that includes a new kernel or version of zfs.

After an update that includes a new kernel and before you reboot, you must get the new kernel version number by listing:

ls /boot/vmlinuz*

To save typing, I'll assume your new kernel version is in the variable kver.

Create it like this (for example)

kver=4.7.7-200.fc24.x86_64

You can see what version of kernel you're running using:

uname -r

If this name matches the newest entry in /boot, you can use the expression uname -r instead of defining and using the kver variable.
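
If you don't want to pick the version out by eye, here's a sketch that finds the newest kernel image in /boot. The function name newest_kver is invented for this example. It assumes a version-aware sort ("sort -V", as in GNU coreutils) and skips the rescue image.

```shell
# Print the version string of the newest vmlinuz-* file in a directory,
# ignoring rescue images. newest_kver is a hypothetical helper name.
newest_kver() {
    ls "${1:-/boot}"/vmlinuz-* 2>/dev/null |
        grep -v rescue |
        sed 's|.*/vmlinuz-||' |
        sort -V |
        tail -n 1
}

# Usage:
#   kver=$(newest_kver)
```

A plain alphabetical sort would put 4.9 after 4.16, which is why the version sort matters here.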

Find out what version of zfs you're about to install. It could be newer than what you're running:

rpm -q zfs

Create the variable "zver":

zver=0.6.5.8

First, check to see if the modules for zfs and spl are present in:

/lib/modules/$kver/extra

If they aren't there, build them now:

dkms install -m spl -v $zver -k $kver
dkms install -m zfs -v $zver -k $kver

Next, check to see if there is an initramfs file for your new kernel:

ls /boot/initramfs-$kver.img

If not, you need to run dracut:

dracut -fv --kver $kver

Next, check to see if the zfs module is inside the new initramfs:

lsinitrd -k $kver | grep "zfs.ko"

If it's not, run dracut:

dracut -fv --kver $kver

Check to see if your new kernel is mentioned in grub.cfg:

grep $kver /boot/efi/EFI/fedora/grub.cfg

If it's not, you need to build a new configuration file:

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

And make sure your new kernel is the default:

grubby --default-index

The command should return 0 (zero). If it doesn't:

grubby --set-default-index 0

If your system passes all these tests, it will probably boot.
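
Several of these checks only look at files, so they're easy to script. What follows is a sketch, not a replacement for the list above: the function names and the ROOT prefix are invented for illustration, and it leaves out the lsinitrd and grubby steps. Each message names the fix from the checklist.

```shell
# ROOT is a hypothetical prefix so the checks can run against a chroot
# or a test tree. Leave it empty to check the running system.
ROOT=${ROOT:-}

have_zfs_module() {      # $1 = kernel version
    find "$ROOT/lib/modules/$1/extra" -name 'zfs.ko*' 2>/dev/null | grep -q .
}

have_initramfs() {       # $1 = kernel version
    [ -f "$ROOT/boot/initramfs-$1.img" ]
}

kernel_in_grub_cfg() {   # $1 = kernel version
    grep -q -F -- "$1" "$ROOT/boot/efi/EFI/fedora/grub.cfg" 2>/dev/null
}

# Print one line for every check that fails; print nothing if all pass.
preflight() {            # $1 = kernel version
    have_zfs_module "$1"    || echo "missing zfs module: run dkms install"
    have_initramfs "$1"     || echo "missing initramfs: run dracut -fv --kver $1"
    kernel_in_grub_cfg "$1" || echo "kernel not in grub.cfg: run grub2-mkconfig"
}
```

Run it as preflight $kver before rebooting. Silence is good news.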

Recovering from boot failure

Don't panic. It happens to me all the time and I never lost anything. Lest you feel discouraged by this remark, please know that I get into trouble frequently because I'm constantly messing with the system so I can publish lengthy screeds like this one.

ZFS on Fedora fails to boot for several reasons

Common boot problems after a linux or zfs update:

You can usually avoid these horrors by doing the recommended pre-reboot checks. But accidents happen...

You see the dread prompt "#"

The screen is all black with confusing white text. You see a "#" prompt at the bottom of the screen. This is the emergency shell. Some-but-not-all shell commands work. You can use zfs and zpool commands.

The most common message is:

"cannot import 'mypool': pool was previously in use from another system"

The fix is easy. First import your pool somewhere else to avoid conflicts and supply the "-f" option to force the issue:

zpool import -f mypool -o altroot=/sysroot

Now export the pool:

zpool export mypool

And reboot:

reboot

You may see a message that suggests you run journalctl. Do so and take note of the messages near the end. A common message is that ZFS can't import the pool using the cache file:

zfs-import-cache.service: Main process exited, ...
Failed to start import ZFS pools by cache file.
zfs-import-cache.service: Unit entered failed state.
...

If you followed my suggestion and disabled the cache, this shouldn't happen to you. But if you do see complaints about the cache file, it means that it's still enabled in your initramfs through some other accident.

1) Check that /etc/zfs/zpool.cache does not exist

2) Make sure zfs-import-cache is disabled:

systemctl status zfs-import-cache

    Should be inactive.

3) Make sure zfs-import-scan is enabled:

systemctl status zfs-import-scan

    Should be active. 

4) Rebuild the initrd

dracut -fv --kver `uname -r`

Note that you can't use uname -r unless you're actually running on the same kernel you're trying to fix. If you're doing a rescue inside a chroot, list the /boot directory and use the version of the kernel you're trying to boot. Example:

dracut -fv --kver 4.16.13-300.fc28.x86_64 

If, on the other hand, you know what you're doing and want to use the cache file, you'll have to make a new one:

Boot from your rescue system and go into the target as usual:

zpool import mypool -o altroot=/mypool
zenter /mypool

Now refresh the cache file:

zpool set cachefile=/etc/zfs/zpool.cache mypool

Exit and reboot

exit
...

The pool won't boot for some other reason

This is a cure-all procedure that almost always fixes trouble caused by updates or upgrades. The most common problems are fixed automatically by the zmogrify script we installed. But it doesn't rebuild kernel modules because that process is too time consuming and because the spl and zfs packages are supposed to do this automatically when the kernel is updated. (Unfortunately, that mechanism doesn't work when you upgrade between Fedora versions.)

Begin by booting your recovery device. If you don't have one, go to the beginning of this document, follow the procedures and return here when ready.

The procedure outlined here is similar to the pre-reboot checklist above except that now we're operating from a chroot environment.

Display a list of zfs pools

zpool import

One of these will be the one that won't boot. To distinguish this from our other examples, we will use the pool name "cool". Import the problematic pool:

zpool import -f cool -o altroot=/cool

Enter the chroot environment

zenter /cool

Note the new kernel version

ls -lrt /boot

Assign it to a variable

kver=4.13.16-202.fc26.x86_64

Note the zfs version

rpm -q zfs

Assign it to a variable

zver=0.7.3

Note that we omit parts of the zfs version string including and after any hyphen.
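
Here's a small sketch of that trimming, in case you want to script it. The function name zver_from_rpm is made up for this example. It strips the leading "zfs-" and everything from the next hyphen onward.

```shell
# Reduce an rpm package name such as "zfs-0.7.3-1.fc26.x86_64" to "0.7.3".
# zver_from_rpm is a hypothetical helper name.
zver_from_rpm() {
    echo "$1" | sed -e 's/^zfs-//' -e 's/-.*$//'
}

# Usage:
#   zver=$(zver_from_rpm "$(rpm -q zfs)")
```

Another way to get the same result is to ask rpm for just the version field with a query format: rpm -q --qf '%{VERSION}\n' zfs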

Check to see if you have zfs modules

ls /lib/modules/$kver/extra

If not, build them now

dkms install -m spl -v $zver -k $kver
dkms install -m zfs -v $zver -k $kver

Add the new modules to initramfs

dracut -fv --kver $kver

You may notice a complaint from dracut about having no functioning syslog:

dracut: No '/dev/log' or 'logger' included for syslog logging

This happens because there are "dangling" symbolic links in your chroot. The message is safe to ignore.

Mount the boot partition

mount /boot/efi

Update the grub.cfg file

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

Make sure the new kernel is the default

grubby --set-default-index 0

Update the grub2 boot-time zfs module

cp -a /usr/lib/grub/x86_64-efi/zfs.mod /boot/efi/EFI/fedora/x86_64-efi

Unmount the EFI partition

umount /boot/efi

Exit the chroot

exit

Export the pool

zpool export cool

Shutdown, disconnect your recovery device, and try rebooting.
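
For those who end up here often, the steps performed inside the chroot can be collected into one function. This is a sketch under stated assumptions: run and repair_boot are invented names, the dkms steps are skipped when the "extra" directory already exists, and by default the function only prints what it would do. The import, zenter, exit and export steps stay manual.

```shell
# Print a command instead of running it unless DRY_RUN=0 is set.
# run and repair_boot are hypothetical helper names.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps to perform inside the chroot. $1 = kernel version, $2 = zfs version.
repair_boot() {
    if [ ! -d "/lib/modules/$1/extra" ]; then
        run dkms install -m spl -v "$2" -k "$1" || return
        run dkms install -m zfs -v "$2" -k "$1" || return
    fi
    run dracut -fv --kver "$1" &&
    run mount /boot/efi &&
    run grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg &&
    run grubby --set-default-index 0 &&
    run cp -a /usr/lib/grub/x86_64-efi/zfs.mod /boot/efi/EFI/fedora/x86_64-efi &&
    run umount /boot/efi
}
```

Inside the chroot, repair_boot $kver $zver prints the plan. Set DRY_RUN=0 to carry it out.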

You see the dread prompt "grub>"

The screen is entirely black except for the text "grub>"

You're talking to the grub boot loader. It understands a universe of commands you fortunately won't need to learn. If you're curious, enter "help", but that won't be necessary and it might frighten the horses.

First load the zfs module:

grub> insmod zfs

If that doesn't work, you didn't put the right stuff in the directory:

/boot/efi/EFI/fedora/x86_64-efi

Go back to the catchall procedure and pay attention.

If grub doesn't complain, try listing the top level devices:

grub> ls

    (hd0) (hd1) (hd1,gpt2) (hd1,gpt1) (hd2) (hd2,gpt2) (hd2,gpt1)

In this example, the devices hd1 and hd2 were used to create a ZFS mirror. Each of them is formatted GPT with an EFI partition gpt1 and a ZFS partition gpt2.

You can see the partition types:

grub> ls (hd1,gpt1)

    (hd1,gpt1): Filesystem is fat.

grub> ls (hd1,gpt2)

    (hd1,gpt2): Filesystem is zfs.

Grub2 has the concept of a default location which is stored in a variable named root. To see where you are:

grub> echo $root

    hd2,gpt1

That happens to be the EFI partition on device hd2. To see the directories and files in that partition:

grub> ls /

    efi/ System/ mach_kernel

Unlike linux, using "/" as the path does not mean the absolute root of grub's search space. It refers to wherever you (or some previous script) set the value of the variable "root".

You can change the location of "/" by setting a new value for root:

grub> set root=(hd1,gpt2)

Now we're at the top level of the zfs partition. To see the top level datasets:

grub> ls /

    @/ ROOT/
    error: incorrect dnode type

Grub is telling us that there is one top level dataset in this pool named "ROOT".

Because "ROOT" has no file system, we get the message:

error: incorrect dnode type

A less disturbing message, IMHO, might have been:

No file system

You see the same thing if you use the partition name:

grub> ls (hd1,gpt2)/

In that case, the path is absolute and the value of "root" isn't used.

Let's go deeper:

grub> ls /ROOT

    @/ fedora/
    error: incorrect dnode type

This tells us that the only child dataset of ROOT is the dataset "fedora". And there's no file system here either.

A complete path will have a sequence of dataset names followed by a traditional filesystem path.

To tell grub we're done specifying the dataset path and now want regular file system paths, we introduce the "@" character. To display the boot image files:

grub> ls ROOT/fedora@/boot

    <big list of files including vmlinuz and initramfs>

Here, "ROOT" and "fedora" are datasets. "boot" is a filesystem directory.

The "@" symbol on "fedora@" refers to a snapshot. In this case, we don't have a snapshot, but the "@" is required to separate dataset path components from filesystem path components.

If you want to look into a particular snapshot, "2018-06-05" for example, the expression would be:

grub> ls ROOT/fedora@2018-06-05/boot

    ...

The last dataset element "fedora@" can be followed by a regular filesystem path.

You may see lines like this mixed in with other files and directories:

error: compression algorithm inherit not supported

This means that grub found something it can't "look into" for some reason.

If you try to list files in the efi partition:

grub> ls ROOT/fedora@/boot/efi

    <nothing>

That's because we're not dealing with linux mount points here. The "efi" is just an empty directory. You have to look on the EFI partition. In this case:

grub> ls (hd1,gpt1)/

    efi/ System/ mach_kernel

If you know the uuid of the root device, you can make that the default root:

grub> search --no-floppy --fs-uuid --set=root cb2cb31ba32112e6

    <no output means it found something>

This expression tells grub to search through all the devices until it finds one with the uuid specified. Then it makes that the root. You can see what it found:

grub> echo $root

    hd1,gpt2

On bootable devices like thumb drives, uuid's are essential because device numbers will be different depending on the number of other devices on the system. In fact, it's a bad idea In Our Time to use any device numbers, especially in a grub.cfg file. History leaves much debris.
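
If you're wondering where that uuid comes from: on this system it matches the pool GUID that zfsinfo reports, written as 16 hex digits. Linux tools print the GUID in decimal, so here's a minimal conversion sketch. The function name guid_to_grub_uuid is hypothetical, and I'm assuming your shell's printf copes with 64-bit unsigned values (bash does).

```shell
# Format a decimal pool GUID as a zero-padded 16-digit hex string.
# guid_to_grub_uuid is a hypothetical helper name.
guid_to_grub_uuid() {
    printf '%016x\n' "$1"
}

# Usage on a live system (pool name "cool" as in this section):
#   guid_to_grub_uuid "$(zpool get -H -o value guid cool)"
```

Compare the result with the uuid in your grub.cfg search line before trusting it.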

To boot from a cantankerous system that drops you into the dread prompt grub, you have to poke around using "ls" to find your vmlinuz and initramfs image files. Once found you tell grub about them:

grub> set root=(hd1,gpt2)
grub> linuxefi /ROOT/fedora@/boot/vmlinuz-4.16.13-300.fc28.x86_64 root=ZFS=cool/ROOT/fedora ro rhgb quiet
grub> initrdefi /ROOT/fedora@/boot/initramfs-4.16.13-300.fc28.x86_64.img

If you found the location using "search", you don't need to set the root again.

(On non-efi systems, the keywords are just "linux" and "initrd")

Now you're ready to boot:

grub> boot

For reference, here's the complete grub2 boot script for this system:

load_video
set gfxpayload=keep
insmod gzio
insmod part_gpt
insmod zfs
search --no-floppy --fs-uuid --set=root cb2cb31ba32112e6
linuxefi /ROOT/fedora@/boot/vmlinuz-4.16.13-300.fc28.x86_64 root=ZFS=cool/ROOT/fedora ro rhgb quiet
initrdefi /ROOT/fedora@/boot/initramfs-4.16.13-300.fc28.x86_64.img
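
Typing those long lines at the grub prompt is error prone. Here's a sketch that prints them from three values, so you can generate the text on another machine and copy it by hand. The function name grub_boot_commands is invented. The kernel arguments are the ones used in this article's example and may differ on your system.

```shell
# Print the grub commands needed to boot a kernel stored on a zfs dataset.
# $1 = pool name, $2 = dataset path inside the pool, $3 = kernel version.
# grub_boot_commands is a hypothetical helper name.
grub_boot_commands() {
    echo "linuxefi /$2@/boot/vmlinuz-$3 root=ZFS=$1/$2 ro rhgb quiet"
    echo "initrdefi /$2@/boot/initramfs-$3.img"
    echo "boot"
}

# Usage:
#   grub_boot_commands cool ROOT/fedora 4.16.13-300.fc28.x86_64
```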

You can get a list of all grub2 commands:

grub> help

And details:

grub> help <command name>

Hopefully, grub is now a little less dreadful for you.

Using zfsinfo at the dread prompt grub>

While you're messing around with zfs in grub, you can use another command "zfsinfo" to show the name and structure of a zfs pool. To access zfsinfo, the file zfsinfo.mod has to be in your boot-time grub2 modules folder. Here's a list of all the modules in my system: (We're running linux now, not grub.)

mount /boot/efi
ls /boot/efi/EFI/fedora/x86_64-efi/

    zfscrypt.mod
    zfsinfo.mod
    zfs.mod

If you're missing "zfsinfo.mod", copy it to the EFI like this:

cp -a /usr/lib/grub/x86_64-efi/zfsinfo.mod /boot/efi/EFI/fedora/x86_64-efi/

Back at the grub prompt, you have to install it:

grub> insmod zfsinfo

Now you can run:

grub> zfsinfo (hd1,gpt2)

And see (in this case)

Pool name: cool
Pool GUID: cb2cb31ba32112e6
Pool state: Active
This VDEV is a mirror
 VDEV with 2 children

 VDEV element number 0:
  Leaf virtual device (file or disk)
  Virtual device is online
  Bootpath: unavailable
  Path: nvme1n1p2
  Devid: unavailable

 VDEV element number 1:
  Leaf virtual device (file or disk)
  Virtual device is online
  Bootpath: unavailable
  Path: nvme0n1p2
  Devid: unavailable

Update all the EFI boot files

"These things never happen, but always are." -- Sallust (4th century A.D.)

Here's how to repopulate the EFI from scratch.

Boot from your rescue system and go into the target as usual:

zpool import mypool -o altroot=/mypool
zenter /mypool
mount /boot/efi

Before you can proceed, networking must operate. That usually isn't a problem, but resolving network names will be a problem because Fedora has elaborated the traditional /etc/resolv.conf into a symbolic link that points into nowhere when you're in a chroot. To fix that, move the link somewhere else temporarily and create a new resolv.conf:

mv /etc/resolv.conf /root

Create a new /etc/resolv.conf that contains:

nameserver www.xxx.yyy.zzz

If you don't know the numbers for your nameserver, this always works:

nameserver 8.8.8.8
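
Creating the file takes one line. Here's a sketch with the target path as a parameter so you can try it somewhere harmless first. The function name write_resolv_conf is made up for this example. Move the symbolic link out of the way first, as described above.

```shell
# Write a one-line resolv.conf.
# $1 = nameserver address, $2 = target file (default /etc/resolv.conf).
# write_resolv_conf is a hypothetical helper name.
write_resolv_conf() {
    printf 'nameserver %s\n' "$1" > "${2:-/etc/resolv.conf}"
}

# Usage inside the chroot:
#   write_resolv_conf 8.8.8.8
```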

If you like, you can delete everything inside /boot/efi:

rm -rf /boot/efi/*

Prior to Fedora 28 use:

dnf reinstall grub2-efi
dnf reinstall shim
dnf reinstall fwupdate-efi
dnf reinstall mactel-boot

Fedora 28 and later:

dnf reinstall grub2-efi-x64
dnf reinstall shim-x64
dnf reinstall fwupdate-efi
dnf reinstall mactel-boot

Now add the necessary zfs modules:

mkdir -p /boot/efi/EFI/fedora/x86_64-efi
cp -a /usr/lib/grub/x86_64-efi/zfs* /boot/efi/EFI/fedora/x86_64-efi

Restore the old resolv.conf:

mv /root/resolv.conf /etc

Unmount the EFI partition and exit the chroot:

umount /boot/efi
exit

Export the pool

zpool export mypool

Try booting again.

Fedora 28 won't boot on multi-device pools

This fresh horror came to my attention when I upgraded from FC27 to FC28 on a system with a pair of nvme devices organized as a mirror. When the system rebooted after the upgrade, I saw this on the black screen of death:

error: file '/ROOT/fedora@/boot/vmlinuz-4.16.13-300.fc28.x86_64 not found.
error: you need to load the kernel first.

The problem is in the grub configuration file:

/boot/efi/EFI/fedora/grub.cfg

If you examine the file, all the linuxefi entries will look something like this:

...
    load_video
    set gfxpayload=keep
    insmod gzio
    insmod part_gpt
    insmod zfs
    if [ x$feature_platform_search_hint = xy ]; then
      search --no-floppy --fs-uuid --set=/dev/sdb2  5f34daa4b2ae5710
    else
      search --no-floppy --fs-uuid --set=/dev/sdb2 5f34daa4b2ae5710
    fi
    linuxefi /ROOT/fedora@/boot/vmlinuz-4.17.3-200.fc28.x86_64 root=ZFS=cool/ROOT/fedora ro rhgb quiet intel_iommu=on
    initrdefi /ROOT/fedora@/boot/initramfs-4.17.3-200.fc28.x86_64.img
...

The problem is in the search expression. The --set parameter should be the name of a variable, not the first pool device. In this case, we wish it was "root":

search --no-floppy --fs-uuid --set=root 5f34daa4b2ae5710

To get going temporarily, reboot and "catch" the boot process at the kernel selection screen. Select the first entry and type "e" to edit the script.

Scroll down and edit the search line in the "else" clause so it reads:

search --no-floppy --fs-uuid --set=root 5f34daa4b2ae5710

Obviously, use the UUID number shown in your case.

Now type "control x" to boot.
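
Once the system is up, the same correction can be applied to every entry in grub.cfg at once. This is a minimal sketch: the function name fix_search_set is invented, and the pattern assumes the bad value always starts with /dev/ as in the listing above. It writes the corrected text to standard output so you can inspect it before replacing anything. The next run of grub2-mkconfig will undo the change.

```shell
# Rewrite "search ... --set=/dev/<something>" as "search ... --set=root".
# $1 = path to a grub.cfg file; the corrected text goes to stdout.
# fix_search_set is a hypothetical helper name.
fix_search_set() {
    sed -E 's|^([[:space:]]*search .*--set=)/dev/[^[:space:]]+|\1root|' "$1"
}

# Usage:
#   fix_search_set /boot/efi/EFI/fedora/grub.cfg > /tmp/grub.cfg.new
#   diff /boot/efi/EFI/fedora/grub.cfg /tmp/grub.cfg.new
```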

A work-around was described earlier in the section Deal with a Fedora 28 aggravation. That solution is a bit of a hack, but it's immune to regression by future Fedora updates.

A better fix involves changing one of the scripts in the grub2-tools package. If you do this, you won't have to define GRUB_DEVICE_BOOT as described above. But changing a script like this means you have to keep track of future updates that might change it back again.

Perhaps I can convince the ZFS-On-Linux group or Fedora to make this change permanent.

Edit the file:

/usr/share/grub/grub-mkconfig_lib

Locate the function prepare_grub_to_access_device. It begins like this:

prepare_grub_to_access_device ()
{
  local device=$1 && shift
  if [ "$#" -gt 0 ]; then
    local variable=$1 && shift
  else
    local variable=root
  fi
...

Change the first few lines to read:

prepare_grub_to_access_device ()
{
  local device=$1
  if [ "x${GRUB_ENABLE_BLSCFG}" = "xtrue" ] ; then
    local variable=boot
  else
    local variable=root
  fi
...

Now run grub2-mkconfig again:

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

You should be good to go.

When performing "dnf update" in the future, pay attention if package grub2-tools is updated. If so, the fix will have to be installed again.

Deal with ZFS updates

Most of this article is about making kernel updates as painless with root on zfs as they are without. But when an update comes out for ZFS without a kernel update, there can be problems.

After performing a dnf update that includes a new ZFS, the dkms process will run and build new spl and zfs modules. Unfortunately, the installers don't run dracut so the initramfs won't contain the new modules. You should be able to boot since the initramfs is self-consistent. If you want everything to be up to date, run this before rebooting:

zmogrify `uname -r`

If the update group includes both zfs and kernel updates, there is a puzzle - if the kernel was installed last, it will contain the new zfs modules because of our post-install script. But if zfs gets installed last, things will be out of whack. To be sure:

zmogrify <your new kernel version>

This annoyance is the last frontier to making root on zfs transparent to updates. Without sounding peevish (I hope) I want to point out that there's a dkms.conf file in the zfs source package that contains the setting:

REMAKE_INITRD="no"

It would be nice if this option was changed to "yes" - Then zfs updates would be totally carefree provided you've done all the other hacks described in this article. The issue was debated by the zfs-on-linux developers and they decided otherwise. Dis aliter visum.

Pool property conflicts at boot time

It's possible for the zfs.mod that lives on the EFI partition to get out of sorts with the zfs.ko in the root file system. Or more precisely, the set of zfs properties supported in the pool that contains the root file system may not be supported by zfs.mod if the pool was created by a more recent zfs.ko.

Right now (2016-10) I live in a brief era when the version of zfs.mod supplied by grub2-efi-modules agrees with the property set created by default using zfs 0.6.5.8. But such a happy situation cannot last. The pool itself retains the property set used when it was created. But when you create a new pool with a future version of zfs, it may not be mountable by an older zfs.mod.

The fix is to enumerate all the properties your zfs.mod supports and specify only those when creating a new pool. To discover the set of properties supported by a given version of zfs.mod, it appears that you have to study the source. Being a lazy person, I'll just quote a reference that shows an example of how a pool is created with a subset of available features.

The following quotation and code block come from the excellent article ArchLinux - ZFS in the section GRUB compatible pool creation.

"By default, zpool will enable all features on a pool. If /boot resides on ZFS and when using GRUB, you must only enable read-only, or non-read-only features supported by GRUB, otherwise GRUB will not be able to read the pool. As of GRUB 2.02.beta3, GRUB supports all features in ZFS-on-Linux 0.6.5.7. However, the Git master branch of ZoL contains one extra feature, large_dnodes that is not yet supported by GRUB."

zpool create -f -d \
    -o feature@async_destroy=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@hole_birth=enabled \
    -o feature@bookmarks=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@embedded_data=enabled \
    -o feature@large_blocks=enabled \
    <pool_name> <vdevs>

The article goes on to say: "This example line is only necessary if you are using the Git branch of ZoL."

Evidently I got away with ZFS 0.6.5.8 because it doesn't have the large_dnodes feature yet or grub got updated. To find out, run:

zpool get all pool | grep feature

For my pool, this shows:

pool  feature@async_destroy       enabled                     local
pool  feature@empty_bpobj         active                      local
pool  feature@lz4_compress        active                      local
pool  feature@spacemap_histogram  active                      local
pool  feature@enabled_txg         active                      local
pool  feature@hole_birth          active                      local
pool  feature@extensible_dataset  enabled                     local
pool  feature@embedded_data       active                      local
pool  feature@bookmarks           enabled                     local
pool  feature@filesystem_limits   enabled                     local
pool  feature@large_blocks        enabled                     local

No sign of large_dnodes yet, but be aware that it's out there waiting for you.
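
To keep an eye out for it, here's a sketch that reads the output of "zpool get all" and reports any enabled or active feature that isn't on a known-good list. The list is copied from the zpool create example and the pool listing above. It's an assumption for illustration, not an authoritative statement of what any particular grub build can read. The names GRUB_OK and features_outside_allow_list are invented.

```shell
# Features seen in the examples above. Treat this list as an assumption,
# not as a definitive record of what your zfs.mod supports.
GRUB_OK="async_destroy empty_bpobj lz4_compress spacemap_histogram \
enabled_txg hole_birth extensible_dataset embedded_data bookmarks \
filesystem_limits large_blocks"

# Read "zpool get all" output on stdin and print every enabled or active
# feature that is not in GRUB_OK.
# features_outside_allow_list is a hypothetical helper name.
features_outside_allow_list() {
    awk -v ok="$GRUB_OK" '
        BEGIN {
            n = split(ok, a, " ")
            for (i = 1; i <= n; i++) allowed[a[i]] = 1
        }
        $2 ~ /^feature@/ && ($3 == "enabled" || $3 == "active") {
            name = substr($2, 9)        # drop the "feature@" prefix
            if (!(name in allowed)) print name
        }'
}

# Usage on a live system:
#   zpool get all pool | features_outside_allow_list
```

No output means the pool uses nothing beyond the list. Anything printed deserves a closer look before you upgrade the pool or rely on grub to read it.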


Complaints and suggestions

Share your woes by mail with Hugh Sparks. (I like to hear good news too.)

References