Reliably boot Fedora with root on ZFS

Updated on 2017-12-05 using:

Fedora 26

Older versions

As of ZFS-0.7.1, it's a lot easier to do this deed and I've removed several sections. Users of zfs-0.6.x.y will need to follow an older version of this page.


Introduction

This article shows how to boot a recent version of Fedora (circa 26) with root on ZFS. To distinguish this article from the crowd, the features include:

I was motivated to collect and publish these methods for several reasons: Most of the root-on-zfs articles are either out of date, don't apply to Fedora, or have a key defect: They may work (or might have worked), but they don't survive automatic kernel updates. In other words, if you run dnf update and it includes a new kernel, the system won't boot. That is the problem this article shows how to fix.

UPDATE: My fans and detractors remind me to say that ZFS is now very robust on almost all Linux distributions. And I should add that I've never lost a byte of data. Any hint of peevishness comes from pushing the envelope: Both ZFS and Fedora are rapidly evolving: With two moving targets, some extra effort should be expected.

Why ZFS?

ZFS has significant advantages because it replaces things like software RAID and LVM with a more flexible and general framework. Here are just a few highlights:

Why root on ZFS?

If you only knew the power of ZFS (wheeze gasp), other solutions would seem ugly and ineffective.

Quick summary of the process

  1. Create a separate Fedora installation on installer.
  2. Add support for ZFS.
  3. Create two partitions on the target disk: efi, zfs.
  4. Create a ZFS pool and root dataset on zfs.
  5. Copy the operating system on installer to the root dataset on zfs.
  6. Configure efi so linux will use the root on zfs.

To simplify the discussion, only one disk is used for the ZFS pool. The process of building more complex pools is covered on dozens of ZFS websites. I've written one myself. It is, however, both unusual and unwise to deploy ZFS without some form of redundancy. With that in mind, this article should be treated as a tutorial rather than a practical solution. Proceed at your own risk.


Create an installer system

Obtain Fedora

I used:

Fedora-Workstation-Live-x86_64-24-1.2.iso

Configure the BIOS

Configure your BIOS to operate the target disk controller in AHCI mode (the default on most modern motherboards) and boot from the device where you've mounted the installation media. You should see two choices for the installation volume: One will mention UEFI. That's the one you must use. Otherwise look for settings that specify booting in UEFI mode. If the installer doesn't believe you booted using EFI firmware, it will gum up the rest of the process, so take time to figure this out.

Install Fedora

The installer is a bootable Fedora installation that has the same GPT, UEFI, and partition structure as our target zfs system.

Because we're going to copy the root of the installer to the target, it makes sense to configure the installer to be as similar to the target as possible. After we do the copy, the target is almost ready to boot. A nice thing about UEFI on GPT disks is that no "funny business" is written to boot blocks or other hidden locations. Everything is in regular files.

We're going to make the target system with GPT partitions. The installer needs to have the same structure, but Anaconda will not create GPT partitions on a small disk unless we force it. To do that, we must add a boot option.

Boot the installation media. On the menu screen, select

Start Fedora-workstation-live xx

Press "e" key to edit the boot line. It will look something like this:

vmlinuz ... quiet

At the end of the line add the string "inst.gpt" so it looks like this:

vmlinuz ... quiet inst.gpt

IMPORTANT: If you see an instruction to use the tab key instead of the "e" key to edit boot options, you have somehow failed to boot in EFI mode. Go back to your BIOS and try again.

Proceed with the installation until you get to the partitioning screen. Here, you must take control and specify standard (not LVM) partitioning.

Create two partitions:

1) A 200M partition mounted at /boot/efi
2) A partition that fills the rest of the disk, mounted at root "/".

Anaconda will recognize that you want a UEFI system. Press Done to proceed. You'll have to press it twice because Anaconda will harass you for not creating a swap partition.

The rest of the installation should proceed normally. Reboot the new system and open a terminal session. Elevate yourself to superuser:

su 

Then update all the packages:

dnf update

Disable SELinux

Everything is supposed to work with SELinux. But I'm sorry to report that everything doesn't: Fixes are still being done frequently as of 2016. Unless you understand SELinux thoroughly and know how to fix problems, it's best to turn it off. In another year, this advice could change.

Edit:

/etc/sysconfig/selinux

Inside, disable SELinux:

SELINUX=disabled

Save and exit. Then reboot:

shutdown -r now
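
After the reboot, you can confirm SELinux is really off; getenforce should report "Disabled":

getenforce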

Check for correct UEFI installation

dnf list grub2-efi shim

These packages should already be installed. If not, you somehow failed to install a UEFI system.

Add extra grub2 modules

dnf install grub2-efi-modules

This pulls in the zfs module needed by grub2 at boot time.

Create some helper scripts

The following scripts will save a lot of typing and help avoid mistakes. When and how they are used will be explained later. For now, consider the descriptions an overview of the process.

The zmogrify script

Create a text file "zmogrify" in /usr/local/sbin (or somewhere on the PATH) that contains:

#!/bin/bash
# zmogrify - Run dracut and configure grub to boot with root on zfs.

kver=$1
sh -x /usr/bin/dracut -fv --kver $kver
mount /boot/efi
grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
grubby --set-default-index 0
mkdir -p /boot/efi/EFI/fedora/x86_64-efi
cp -a /usr/lib/grub/x86_64-efi/zfs.mod /boot/efi/EFI/fedora/x86_64-efi
umount /boot/efi

Save the file and make it executable:

chmod a+x zmogrify

What is zmogrify doing?

The first step runs dracut to produce a new /boot/initramfs file that includes zfs support. Dracut is able to do this because of the zfs-dracut package we added, which installs a "plug-in."

The peculiar way of running dracut with "sh -x" overcomes a totally mysterious problem with Fedora 25 that otherwise makes dracut hang. Dracut will hang after displaying the line:

*** Including module: kernel-modules *** 

If you understand why, I'd very much like to hear from you.

The grub2-mkconfig command creates a new grub.cfg script in the EFI boot partition.

The grubby command makes sure your new kernel is the one that boots by default.

Finally, we create a directory in the EFI partition and copy the boot-time version of the zfs module needed by grub2 to mount your zfs root file system.

The zenter script

Create a text file "zenter" in /usr/local/sbin (or somewhere on the PATH) that contains:

#!/bin/bash
# zenter - Mount system directories and enter a chroot

target=$1

mount -t proc /proc $target/proc
mount --rbind /sys $target/sys
mount --rbind /dev $target/dev

chroot $target

Save the file and make it executable:

chmod a+x zenter

What is zenter doing?

The chroot command makes (temporarily) a specified path the root of the linux file system as far as shell commands are concerned. We need to do that before running zmogrify. But we need more than just the file system root: The extra mount commands make the linux system devices available inside the new root. Dracut expects those to be in place when building a boot-time ramfs.

Make grub2 include zfs support when building configuration files

Edit:

/etc/default/grub

Add the line:

GRUB_PRELOAD_MODULES="zfs"

This will come into play later when we create a grub configuration file on the efi partition.

Also add the line:

GRUB_DISABLE_OS_PROBER=true

Without this fix, running grub2-mkconfig (which we'll need later) will throw an error and possibly disrupt other scripts. This is caused by a problem that has nothing to do with ZFS and will probably be fixed Real Soon Now.

Fix grub2 bug with pool device path names

Later, we will need to run a script called grub2-mkconfig. This script uses the "zpool status" command to list the components of a pool. But when these components aren't simple device names, the script fails. Many fixes have been proposed, mostly involving new udev rules. A simple fix is to add a system environment variable. Create this file:

/etc/profile.d/grub2_zpool_fix.sh

That contains:

export ZPOOL_VDEV_NAME_PATH=YES

This will take effect the next time you log in or reboot. Which will be soon...
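
If you don't want to wait, you can pull the setting into your current shell right away. Either line works; the second simply sources the file you just created:

export ZPOOL_VDEV_NAME_PATH=YES
. /etc/profile.d/grub2_zpool_fix.sh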

Prepare for future kernel updates

ln -s /usr/local/sbin/zmogrify /etc/kernel/postinst.d

The scripts in the postinst.d directory run after every kernel update. The zmogrify script takes care of updating the initramfs and the grub.cfg file on the EFI partition. (Stuff you need to boot with root in a zfs pool.)

Configure the ZFS repository

Using your browser, visit ZFS on Fedora.

Click on the link for your version of Fedora. When prompted, open it with "Software Install" and press the install button. This will add the repository zfs-release to your package database. It tells the package manager where to find zfs packages and updates.

Install zfs

dnf install kernel-devel
dnf install zfs zfs-dracut

During the installation, the process will pause to build the spl module and then the zfs module. A line will appear showing the path to their locations in /lib/modules/x.y.z/extra. If you don't see those lines, you don't have a match between the kernel-devel package you installed and your currently running kernel.
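
To double-check that the build really succeeded, you can look for the modules yourself. Assuming dkms put them in the usual place for your running kernel, you should see spl and zfs listed (possibly as compressed .ko files):

ls /lib/modules/`uname -r`/extra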

Load zfs

modprobe zfs

If you get a message that zfs isn't present, the modules failed to build during installation. Try rebooting.

Configure systemd services to their presets

For some reason (as of 2016-10), they don't get set to their presets automatically, so execute:

systemctl preset zfs-import-cache
systemctl preset zfs-mount
systemctl preset zfs-share
systemctl preset zfs-zed

I think this step was needed after upgrading from Fedora 24 to 25 with an earlier version of ZFS. When starting over on a new machine, I'd skip this step.


Create the target system

Prepare a target disk

If your target is a USB disk, you simply need to plug it in. Tail the log and note the new disk identifier:

journalctl -f

If you have a "real" disk, first look at your mounted partitions and take note of the names of your active disks. If you're using UUIDs, make a detailed list:

ls /dev/disk/by-uuid

The shortcuts will point to the /dev/sdn devices you're using now. Now you can shut down, install the new disk, and reboot. You should find a new disk when you take a listing:

ls /dev/sd*

From now on, we'll assume you've found your new disk named "sdx".
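
If you're not sure which device is the newcomer, listing sizes, models, and serial numbers usually settles it (these are standard lsblk columns):

lsblk -o NAME,SIZE,MODEL,SERIAL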

If you got your target disk out of a junk drawer, the safest way to proceed is to zero the whole drive. This will get rid of ZFS and any RAID labels that might cause trouble. Bring up a terminal window, enter superuser mode, and run:

dd if=/dev/zero of=/dev/sdx bs=10M status=progress

This takes quite a while for large disks. If you know which partition on the old disk has a zfs pool, for example sdxn, you can speed things up by importing the pool and destroying it. But if the partition is already corrupted so the pool won't import properly, you can blast it by zeroing the first and last megabyte:

mysize=`blockdev --getsz /dev/sdxn`
dd if=/dev/zero of=/dev/sdxn bs=512 count=2048
dd if=/dev/zero of=/dev/sdxn bs=512 count=2048 seek=$((mysize - 2048))

Alternatively, there is the command:

zpool labelclear /dev/sdx

Which can be forced if necessary:

zpool labelclear -f /dev/sdx

Partition your target device

I'm too lazy to create a step-by-step gdisk walk-through. If you need such a thing, you probably don't belong here. Run gdisk, parted, or whatever tool you prefer on /dev/sdx. Erase any existing partitions and create two new ones:

Partition  Code        Size      File system        Purpose
        1  EF00     200 MiB      FAT32 (see below)  EFI boot partition
        2  BF01  (rest of disk)  (none - ZFS)       ZFS file system

Write the new table to disk and exit. Tell the kernel about the new layout:

partprobe

If you are prompted to reboot, do so. You now have two new devices:

/dev/sdx1 - For the target EFI partition
/dev/sdx2 - For the target ZFS file system
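
If you'd rather script the partitioning than drive gdisk interactively, something like the following sketch produces the same layout. It assumes sgdisk (from the gdisk package) is installed and, as always, that /dev/sdx really is your target disk:

sgdisk --zap-all /dev/sdx                          # wipe any existing partition table
sgdisk -n 1:0:+200M -t 1:EF00 -c 1:"EFI" /dev/sdx  # 200 MiB EFI system partition
sgdisk -n 2:0:0 -t 2:BF01 -c 2:"ZFS" /dev/sdx      # rest of the disk for ZFS
partprobe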

Format the target EFI partition

mkfs.fat -F32 /dev/sdx1

Create a pool

zpool create pool -m none /dev/sdx2 -o ashift=12

By default, zfs will mount the pool and all descendant datasets automatically. We turn that feature off using "-m none". The ashift property specifies the physical sector size of the disk. That turns out to be a Big Ugly Deal, but you don't need to be concerned about it in a tutorial. I've added a section at the end of the article about sector sizes. You need to know this stuff if you're building a production system.

Configure performance options

zfs set compression=on pool
zfs set atime=off pool

Compression improves zfs performance unless your dataset contains mostly already-compressed files. Here we're setting the default value that will be used when creating descendant datasets. If you create a dataset for your music and movies, you can turn it off just for that dataset.

The atime property controls file access time tracking. This hurts performance and most applications don't need it. But it's on by default.
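
You can verify that both settings took, and that descendant datasets will inherit them:

zfs get compression,atime pool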

Create the root dataset

zfs create -p pool/ROOT/fedora

Enable extended file attributes

zfs set xattr=sa pool/ROOT/fedora

This has the side effect of making ZFS significantly faster.

Provide some redundancy

If you think you might continue to use this one-device zfs pool, you can slightly improve your odds of survival by using the copies property: This creates extra copies of every block so you can recover from a "bit fatigue" event. A value of 2 will double the space required for your files:

zfs set copies=2 pool/ROOT/fedora

I use this property when booting off USB sticks because I don't trust the little devils.

Check for device symbolic links

It would be nice to find two new symbolic links for the partitions we've created on the target disk in here:

/dev/disk/by-uuid

The symbolic link to the EFI partition is always there pointing to /dev/sdx1, but the one for the ZFS partition pointing to /dev/sdx2 is not. I've tried re-running the udev rules:

udevadm trigger

But the link for /dev/sdx2 still isn't there. Regrettably, you have to reboot now. When you're back up and running proceed...

Export the pool and re-import to an alternate mount point

zpool export pool
zpool import pool -d /dev/disk/by-uuid -o altroot=/sysroot

At this point don't panic if you look for /sysroot: It's not there because there are no specified mount points yet.

You will also notice the clause -d /dev/disk/by-uuid. This will rename the disk(s) so their UUIDs appear when you execute zpool status. You could also use -d /dev/disk/by-id - Details about this are covered later.

Note: It's possible to arrive at a state where a newly formatted disk will have an entry in some-but-not-all of the /dev/disk subdirectories. I'm not sure what causes this annoyance, but if the command to import the pool using UUIDs fails, try using /dev/disk/by-id instead. The important thing is to get away from using device names.

Specify the real mount point

zfs set mountpoint=/ pool/ROOT/fedora

Because we imported the pool with the altroot option, /sysroot will now be mounted.

Copy root from the installer to the target

rsync -avxHASX / /sysroot/

Don't get careless and leave out the trailing "/" character.

You'll see some errors near the end about "journal". That's ok.

Copy /boot from the installer to the target:

rsync -avxHASX /boot/ /sysroot/boot/

Understanding the rsync command

The nice incantation is from Rudd-O. This is what the options do:

a: Archive
v: Verbose
x: Only local filesystems
H: Preserve hard links
A: Preserve ACLs
S: Handle sparse files efficiently
X: Preserve extended attributes

Copy the EFI partition from the installer to the target

Recall that "/dev/sdx1" is the device name of the target EFI parition:

cd
mkdir here
mount /dev/sdx1 here
cp -a /boot/efi/EFI here
umount here
rmdir here

Note the UUID of the target EFI partition

blkid /dev/sdx1

In my case, it was "0FEE-5943"

Edit the target fstab

/sysroot/etc/fstab

It should have only one line:

UUID=0FEE-5943    /boot/efi  vfat  umask=0077,shortname=winnt 0 2

Even that line is unnecessary for booting the system, but utilities such as grub2-mkconfig expect the efi partition to be mounted.

(Obviously, use your EFI partition's UUID.)

Chroot into the target

zenter /sysroot

Run zmogrify

zmogrify `uname -r`

Exit the chroot

exit
umount -lf /sysroot

Export the pool

zpool export pool

Reboot

If you're the adventurous type, simply reboot. Otherwise, first skip down to Checklist after update and before reboot and run through the tests. Then reboot.

If all is well, you'll be running on zfs. To find out, run:

mount | grep zfs

You should see your pool/ROOT/fedora mounted as root. If it doesn't work, please try the procedure outlined below in Recovering from boot failure.

Take a snapshot

Before disaster strikes:

zfs snapshot pool/ROOT/fedora@firstBoot

We are finished. Now for the ugly bits!


Post installation enhancements

More about device names

When we created the pool, we used the old-style "sdx" device names. They are short, easy to type and remember. But they have a big drawback. Device names are associated with the ports on your motherboard, or sometimes just the order in which the hardware was detected. It would really be better to call them "dynamic connection names." If you removed all your disks and reconnected them to different ports, the mapping from device names to drives would change.

You might think that would play havoc when you try to re-import the pool. Actually, ZOL (ZFS on Linux) protects you by automatically switching the device names to UUIDs when device names in the pool conflict with active names in your system.

Linux provides several ways to name disks. The most useful of these are IDs and UUIDs. Both are long complex strings impossible to remember or type. I prefer to use IDs because they include the serial number of the drive. That number is also printed on the paper label. If you get a report that a disk is bad, you can find it by reading the label.

First, boot into your installer linux system.

Here's how to use IDs:

zpool import pool -o altroot=/sysroot -d /dev/disk/by-id

If you prefer UUIDs:

zpool import pool -o altroot=/sysroot -d /dev/disk/by-uuid

Should you ever want to switch back to device names:

zpool import pool -o altroot=/sysroot -d /dev

To switch between any two, first export and then import as shown above. The next time you export the pool or shut down, the new names will be preserved in the disk data structures.

Optimize performance by specifying sector size

The problem: ZFS will run a lot faster if you correctly specify the physical sector size of the disks used to build the pool. For this optimization to be effective, all the disks must have the same sector size. Discovering the physical sector size is difficult because disks lie. The history of this conundrum is too involved to go into here.

The short answer: There is no way to be sure about physical sector sizes unless you find authoritative information from the maker. Guessing too large wastes space but ensures performance. Guessing too small harms performance.

Here are some heuristic steps:

First, ask the disk for its sector size:

lsblk -o NAME,PHY-SEC /dev/sdx

If it reports any value except 512, you can believe the answer. If it shows 512, you may be seeing the logical sector size.

When creating a pool, the sector size is specified as a power of 2. The default is 9 (512 bytes).

A table of likely values:

n       2^n
------------
9        512
10      1024
11      2048
12      4096
13      8192

Specify sector size using the ashift parameter:

zpool create pool /dev/sdx2 -m none -o ashift=13

To be effective this should be done before the pool is used. If you decide to change it later, it's best to take a backup, re-create the pool and restore the backup.
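
If you later want to check what ashift an existing pool actually ended up with, one common recipe is to ask zdb. This assumes the pool is listed in the zpool cache file; if zdb can't find it there, try zdb -e pool, which reads the labels from the devices directly:

zdb -C pool | grep ashift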

Create more datasets

More complex dataset trees are possible and usual. Some reasons:

  1. To share content between operating systems.
  2. To apply special properties to system directories.
  3. To isolate user accounts and apply quotas.

Examples:

In the future, you may have multiple versions of fedora and possibly other operating systems:

pool/ROOT/fedora24
pool/ROOT/fedora25
pool/ROOT/fedora26
...

Using this construct, all of them will share the same home directories:

zfs create pool/home -o mountpoint=/home

You could break it down further by having a dataset for each user.
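
For example, a per-user dataset with a quota might look like this; the user name and quota value are purely illustrative:

zfs create -o quota=50G pool/home/alice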

For security reasons, it's a good idea to restrict what goes in the /var tree. To do that:

zfs create pool/ROOT/fedora/var -o canmount=off -o setuid=off  -o exec=off

But some installers blow up if they don't have execute privilege in /var/tmp, so you have to create another:

zfs create pool/ROOT/fedora/var/tmp -o exec=on

Dissecting the /tmp directory into datasets with different properties is actually far more complex. This subject will take you into deep water rapidly. Search for advice and make up your own mind.

UPDATE: It is known that Fedora doesn't like to see /usr mounted on a separate dataset. It has to be part of the root file system or the system won't boot.

Example

On my internet gateway server with root on zfs, here are the choices I made:

zool/ROOT/fedora25
    mountpoint             /

zool/home
    mountpoint             /home

zool/var
    exec                   off
    setuid                 off
    canmount               off

zool/var/spool
    mountpoint             legacy

zool/var/www
    mountpoint             /var/www
    sharesmb               on

zool/swap
    volsize                16G
    compression            zle
    refreservation         17.0G
    primarycache           metadata
    secondarycache         none
    logbias                throughput
    sync                   always
    com.sun:auto-snapshot  false

zool/archive
    mountpoint             /archive
    sharenfs               on
    sharesmb               on
    compression            on

The only tricky bit was the swap partition. It deserves a new section:

Enable swapping

There is/was a certain amount of controversy about the safety of swapping on ZFS. I simply quote the advice given on the ZOL GitHub site:

zfs create -V 4G -b $(getconf PAGESIZE) \
    -o compression=zle \
    -o logbias=throughput \
    -o sync=always \
    -o primarycache=metadata \
    -o secondarycache=none \
    -o com.sun:auto-snapshot=false pool/swap

Note that "swap" isn't part of ROOT/fedora. That allows us to share the space with other linux installations. The "-V 4g" means this will be a 4G ZVOL - a fixed-size, non-zfs filesystem allocated from the pool. The reason for turning off data caching is that the operating system will manage swap like a cache anyway. The same reasoning and cache settings are used for ZVOLs created for virtual machine disk images.

After you're running with root on ZFS, complete swap setup by adding a line to /etc/fstab:

/dev/zvol/pool/swap none swap defaults 0 0

Format the swap dataset:

mkswap -f /dev/zvol/pool/swap

And enable swapping:

swapon -av
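
You can confirm the ZVOL is actually being used for swap:

swapon --show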

The article reminds us:

  1. Make sure the devices in the pool are named using UUIDs or IDs. (Not device names.)
  2. Don't enable hibernation: The memory can't be restored from swap in ZFS because the pool isn't accessible when the hardware tries to return from hibernation.

Limiting ARC memory usage

ZFS wants a lot of memory for its ARC cache. The recommended minimum is 1G per terabyte of storage in your pool(s). Without constraints, ZFS is likely to run off with all your memory and sell it at a pawn shop. To prevent this, you should specify a memory limit. This is done using a module parameter. A reasonable setting is half your total memory:

Edit or create:

/etc/modprobe.d/zfs.conf

This expression (for example) limits the ARC to 16G:

options zfs zfs_arc_max=17179869184

The size is in bytes. Some convenient values:

16GB  = 17179869184
8GB   = 8589934592
4GB   = 4294967296
2GB   = 2147483648
1GB   = 1073741824
512MB = 536870912
256MB = 268435456

The modprobe.d parameter files are needed at boot time, so it's important to rebuild your initramfs after adding or changing this parameter:

dracut -fv --kver `uname -r`

And then reboot.
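
After rebooting, you can confirm the module picked up the limit by reading it back:

cat /sys/module/zfs/parameters/zfs_arc_max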

Running without ECC memory

To operate with ultimate safety and win the approval of ZFS zealots, you really ought to use ECC memory. ECC memory is typically only available on "server class" motherboards with Intel Xeon processors.

If you don't use ECC memory, you are taking a risk. Just how big a risk is the subject of considerable controversy and beyond the scope of these notes.

First, give yourself a fighting chance by testing your memory. Obtain a copy of this utility and run a 24 hour test:

http://www.memtest86.com

This is particularly important on a new server because memory is often defective right out of the box. If your memory passes the test, there is a good chance it will be reliable for some time.

To avoid ECC, you might be tempted to keep your server in a 1-meter thick lead vault buried 45 miles underground. It turns out that many of the radioactive sources for memory-damaging particles are already in the ceramics used to package integrated circuits. So this time-consuming measure is probably not worth the cost or effort.

Instead, we're going to enable the unsupported ZFS_DEBUG_MODIFY flag. This will mitigate, but not eliminate, the risk of using ordinary memory. It's supposed to make the zfs software do extra checksums on buffers before they're written. Or something like that. Information is scarce because people who operate without ECC memory are usually dragged down to the Bad Place.

Edit:

/etc/modprobe.d/zfs.conf

Add the line:

options zfs zfs_flags=0x10

As previously discussed, you need to rebuild initramfs after changing this parameter:

dracut -fv --kver `uname -r`

And then reboot. You can confirm the current value here:

cat /sys/module/zfs/parameters/zfs_flags

Rebuilding grubx64.efi

This idea might be interesting to those who want to build a new zfs-friendly Linux distribution.

You'll recall the step where we copied zfs.mod to the EFI partition. This is necessary because zfs is not one of the modules built into Fedora's grub2-efi rpm. Other Linux distributions support zfs without this hack so I decided to find out how it was done.

You can see the list of built-in modules by downloading the grub2-efi source and reading the spec file. It would be straightforward but perhaps a bit tedious to add "zfs" to the list of built-in modules, remake the rpm and install it over the old one. An easier way is to rebuild one file found here:

/boot/efi/EFI/fedora/grubx64.efi

The trick is to get hold of the original list of modules. They need to be listed as one giant string without the ".mod" suffix. Here's the list from the current version of grub2.spec with "zfs" already appended:

all_video boot btrfs cat chain configfile echo efifwsetup efinet 
ext2 fat font gfxmenu gfxterm gzio halt hfsplus iso9660 jpeg loadenv 
loopback lvm mdraid09 mdraid1x minicmd normal part_apple part_msdos
part_gpt password_pbkdf2 png reboot search search_fs_uuid search_fs_file 
search_label serial sleep syslinuxcfg test tftp video xfs zfs

Copy the text block above to a temporary file "grub2_modules". Then execute:

grub2-mkimage \
    -O x86_64-efi \
    -p /boot/efi/EFI/fedora \
    -o /boot/efi/EFI/fedora/grubx64.efi \
    `xargs < grub2_modules`

The grub2.spec script does something like this when building the binary rpm.

Now you can delete the directory where we added zfs.mod:

rm -rf /boot/efi/EFI/fedora/x86_64-efi

Better? A matter of taste I suppose. Rebuilding grubx64.efi is dangerous because it could be replaced by a Fedora update. I prefer using the x86_64-efi directory.

The package grub2-efi-modules installs a large module collection in:

/usr/lib/grub/x86_64-efi

The total size of all the .mod files there is about 3MB so you might wonder why not include all the modules? For reasons I don't have the patience to discover, at least two of them conflict and derail the boot process.


Problem solving

Checklist after update and before reboot

Here's a list of steps you can take that help prevent boot problems. It won't hurt to do this after any update that includes a new kernel or version of zfs.

To save typing, I'll assume your new kernel version is in the variable kver.

Create it like this (for example)

kver=4.7.7-200.fc24.x86_64

You also may need the version of zfs. For example:

zver=0.6.5.8

Make sure you set the variables kver and zver properly: they may be different from the versions you're running on your boot repair disk.

First, check to see if the modules for zfs and spl are present in:

/lib/modules/$kver/extra

If they aren't there, build them now:

dkms install -m spl -v $zver -k $kver
dkms install -m zfs -v $zver -k $kver

Next, check to see if there is an initramfs file for your new kernel:

ls /boot/initramfs-$kver.img

If not, you need to run dracut:

dracut -fv --kver $kver

Next, check to see if the zfs module is inside the new initramfs:

lsinitrd -k $kver | grep "zfs.ko"

If it's not, run dracut:

dracut -fv --kver $kver

Check to see if your new kernel is mentioned in grub.cfg:

grep $kver /boot/efi/EFI/fedora/grub.cfg

If it's not, you need to build a new configuration file:

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

And make sure your new kernel is the default:

grubby --default-index

The command should return 0 (zero). If it doesn't:

grubby --set-default-index 0

If your system passes all these tests, it will probably boot.
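
If you'd rather not step through these checks by hand, here's a minimal sketch that strings them together. The script name "zcheck" and its structure are my own invention, not part of any package; it assumes /boot/efi is mounted and that kver and zver are passed as arguments:

#!/bin/bash
# zcheck - a convenience wrapper for the pre-reboot checklist above.
# Usage: zcheck <kernel-version> <zfs-version>

kver=$1
zver=$2

# 1. Are the spl and zfs kernel modules present for the new kernel?
if ! ls /lib/modules/$kver/extra 2>/dev/null | grep -q zfs; then
    dkms install -m spl -v $zver -k $kver
    dkms install -m zfs -v $zver -k $kver
fi

# 2. Is there an initramfs for the new kernel, and does it contain zfs?
if [ ! -f /boot/initramfs-$kver.img ] || \
   ! lsinitrd /boot/initramfs-$kver.img | grep -q "zfs.ko"; then
    dracut -fv --kver $kver
fi

# 3. Does grub.cfg mention the new kernel?
if ! grep -q $kver /boot/efi/EFI/fedora/grub.cfg; then
    grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
fi

# 4. Is the new kernel the default?
if [ "`grubby --default-index`" != "0" ]; then
    grubby --set-default-index 0
fi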

Recovering from boot failure

Don't panic. It happened to me many times when I was learning. I never lost anything and it's always easy to fix the problem once you know how.

Common boot problems after a kernel or zfs update:

  1. The system boots to the previous kernel, rather than the new one.
  2. Dracut complains about importing the pool.
  3. The pool won't boot for some other reason.

You can usually avoid these horrors by doing the recommended checks. But accidents happen...

The system boots to the previous kernel

Some part of the installation process failed. It's best, in this case, to go through the Checklist after update and before reboot.

Dracut complains about importing the pool

If you did maintenance with a USB stick and forgot to export the pool, your system may fail to boot, showing the message:

"cannot import 'mypool': pool was previously in use from another system"

At this stage, you'll be in the emergency shell. The fix is easy. First import your pool somewhere else to avoid conflicts and supply the "-f" option to force the issue:

zpool import -f mypool -o altroot=/sysroot

Now export the pool:

zpool export mypool

And reboot.

The pool won't boot for some other reason

There can be many causes, but most can be fixed by using your recovery system. Upgrading to a new Fedora major version is a reliable way to experience this problem.

Boot your recovery device. If you don't have one, go to the beginning of this document, follow the procedures and return here when ready.

Display a list of zfs pools

zpool import

One of these will be the one that won't boot. To distinguish this from our other examples, we will use the pool name "cool". Import the problematic pool:

zpool import -f cool -o altroot=/cool

Enter the chroot environment

zenter /cool

Note the new kernel version

ls -lrt /boot

Assign it to a variable

kver=4.13.16-202.fc26.x86_64

Note the zfs version

rpm -qa | grep zfs

Assign it to a variable

zver=0.7.3

(Note that we omit parts of the zfs version string including and after any hyphen.)

Check to see if you have zfs modules

ls /lib/modules/`echo $kver`/extra

If not, build them now

dkms install -m spl -v $zver -k $kver
dkms install -m zfs -v $zver -k $kver

Add the new modules to initramfs

dracut -fv --kver $kver

Mount the boot partition

mount /boot/efi

Update the grub.cfg file

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

Make sure the new kernel is the default

grubby --set-default-index 0

Update the grub2 boot-time zfs module

cp -a /usr/lib/grub/x86_64-efi/zfs.mod /boot/efi/EFI/fedora/x86_64-efi

Exit the chroot

exit

Unmount the special directories

umount -lf /cool

Export the pool

zpool export cool

Shutdown, disconnect your recovery device, and try rebooting.

Deal with ZFS updates

Most of this article is about making kernel updates as painless with root on zfs as they are without. But when an update comes out for ZFS without a kernel update, there can be problems.

After performing a dnf update that includes a new ZFS, the dkms process will run and build new spl and zfs modules. Unfortunately, the installers don't run dracut, so the initramfs won't contain the new modules. You should still be able to boot since the initramfs is self-consistent. If you want everything to be up to date, run this before rebooting:

zmogrify `uname -r`

If the update group includes both zfs and kernel updates, there is a puzzle - if the kernel was installed last, the rebuilt initramfs will contain the new zfs modules because of our post-install script. But if zfs gets installed last, things will be out of whack. To be sure:

zmogrify <your new kernel version>

This annoyance is the last frontier to making root on zfs transparent to updates. Without sounding peevish (I hope) I want to point out that there's a dkms.conf file in the zfs source package that contains the setting:

REMAKE_INITRD="no"

It would be nice if this option was changed to "yes" - Then zfs updates would be totally carefree provided you've done all the other hacks described in this article. The issue was debated by the zfs-on-linux developers and they decided otherwise. Dis aliter visum.

Pool property conflicts at boot time

It's possible for the zfs.mod that lives on the EFI partition to get out of sorts with the zfs.ko in the root file system. Or more precisely, the set of zfs properties supported in the pool that contains the root file system may not be supported by zfs.mod if the pool was created by a more recent zfs.ko.

Right now (2016-10) I live in a brief era when the version of zfs.mod supplied by grub2-efi-modules agrees with the property set created by default using zfs 0.6.5.8. But such a happy situation cannot last. The pool itself retains the property set used when it was created. But when you create a new pool with a future version of zfs, it may not be mountable by an older zfs.mod.

The fix is to enumerate all the properties your zfs.mod supports and specify only those when creating a new pool. To discover the set of properties supported by a given version of zfs.mod, it appears that you have to study the source. Being a lazy person, I'll just quote a reference that shows an example of how a pool is created with a subset of available features.

The following quotation and code block come from the excellent article ArchLinux - ZFS in the section GRUB compatible pool creation.

"By default, zpool will enable all features on a pool. If /boot resides on ZFS and when using GRUB, you must only enable read-only, or non-read-only features supported by GRUB, otherwise GRUB will not be able to read the pool. As of GRUB 2.02.beta3, GRUB supports all features in ZFS-on-Linux 0.6.5.7. However, the Git master branch of ZoL contains one extra feature, large_dnodes that is not yet supported by GRUB."

zpool create -f -d \
    -o feature@async_destroy=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@hole_birth=enabled \
    -o feature@bookmarks=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@embedded_data=enabled \
    -o feature@large_blocks=enabled \
    <pool_name> <vdevs>

The article goes on to say: "This example line is only necessary if you are using the Git branch of ZoL."

Evidently I got away with ZFS 0.6.5.8 because it doesn't have the large_dnodes feature yet or grub got updated. To find out, run:

zpool get all pool | grep feature

For my pool, this shows:

pool  feature@async_destroy       enabled                     local
pool  feature@empty_bpobj         active                      local
pool  feature@lz4_compress        active                      local
pool  feature@spacemap_histogram  active                      local
pool  feature@enabled_txg         active                      local
pool  feature@hole_birth          active                      local
pool  feature@extensible_dataset  enabled                     local
pool  feature@embedded_data       active                      local
pool  feature@bookmarks           enabled                     local
pool  feature@filesystem_limits   enabled                     local
pool  feature@large_blocks        enabled                     local

No sign of large_dnodes yet, but be aware that it's out there waiting for you.


Complaints and suggestions

Share your woes by mail with Hugh Sparks. (I like to hear good news too.)

References