Reliably boot Fedora with root on ZFS

Revised 2020-10-01

What's all this?

This article is a walk-through for installing Fedora Linux with root on ZFS. It has been tested with:

zfs-0.8.4 requires a patch to work with kernel-5.8.x. The procedure for obtaining and applying the patch is described below. If you do an update from kernel-5.7.x to kernel-5.8.x and forget the patch, a reboot will leave you at the Black Screen Of Dracut. To avoid this, apply the patch before running "dnf update". Otherwise, you can apply the patch in chroot using your rescue/installer system.

Update: As of zfs-0.8.5, the patch is no longer needed.

Prior art

Earlier (and unfortunately far more complex) versions of this document exist:

Success with BLS

This version of the guide features BLS - "Boot Loader Specification", which (along with other improvements by the packagers) makes it possible to update the kernel or upgrade Fedora and reboot successfully.

Fear, Uncertainty and Doubt

Fedora is a rapidly evolving distribution. Sometimes the kernel package gets ahead of the ZFS package making it impossible to build zfs modules. To make matters worse, the documentation for ZFS is sometimes out of date so you really have no recourse but reading the ZFS Issue Tracker to see if people are complaining. Since it takes only a few minutes to create a virtual machine using these instructions, you can use that to foresee difficulties.

Followup

Preliminaries

Hardware

You can work with real hardware or a virtual machine. Some section names start with [RH] "Real hardware" or [VM] "Virtual machine" - they only apply to those respective cases. Everything else applies to both. If this is your first time, following the virtual machine path is a good way to learn without committing hardware or accidentally reformatting your working system disk.

Installer system

You'll need a Fedora Linux system that has support for ZFS to follow this guide. After installing Fedora, visit the ZFS on Linux site and follow the instructions.

I suggest creating this system on a removable device and keeping it in a safe place because it's occasionally necessary to rescue root-on-zfs systems.

Helper script

We will create a root-on-zfs operating system by running commands mostly in the host environment. But some steps have to be taken inside the target, which is done via the "chroot" command. Without additional configuration, many Linux commands won't work inside a chroot. To fix that, we need a special script, "zenter". Some-but-not-all Linux distributions provide a command that does this. (Not Fedora...)

Here's the source. Save it in a file "zenter.sh" and proceed. (Or you can download zenter here.)

#!/bin/bash
# zenter - Mount system directories and enter a chroot

target=$1

# Make the kernel and device file systems visible inside the target
mount -t proc  proc "$target/proc"
mount -t sysfs sys "$target/sys"
mount -o bind /dev "$target/dev"
mount -o bind /dev/pts "$target/dev/pts"

# Start a login shell in the target with a clean environment
chroot "$target" /bin/env -i \
    HOME=/root TERM="$TERM" PS1='[\u@chroot \W]\$ ' \
    PATH=/bin:/usr/bin:/sbin:/usr/sbin \
    /bin/bash --login

echo "Exiting chroot environment..."

# Unmount in reverse order
umount "$target/dev/pts"
umount "$target/dev/"
umount "$target/sys/"
umount "$target/proc/"

Install the script to a directory on your PATH:

cp -a zenter.sh /usr/local/sbin/zenter
chmod +x /usr/local/sbin/zenter
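
The script trusts its argument completely. If that worries you, here is a minimal sketch of a guard, assuming a hypothetical function name check_target. It is not part of the original zenter; it only confirms that a target was given and that the directories zenter mounts on exist.

```shell
# check_target DIR: succeed only if DIR is given and contains the proc, sys
# and dev directories that zenter mounts on. Hypothetical helper.
check_target() {
    if [ -z "$1" ]; then
        echo "usage: zenter <target>" >&2
        return 2
    fi
    for ct_dir in proc sys dev; do
        if [ ! -d "$1/$ct_dir" ]; then
            echo "zenter: $1/$ct_dir is missing" >&2
            return 1
        fi
    done
}
```

One way to use it would be to paste the function into zenter and add a line reading check_target "$target" || exit right after target=$1.
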
Variables

Installation variables

VER=32
POOL=Magoo
USER=hugh
PASW=mxyzptlk
NAME="Hugh Sparks"

Define a group of variables from one of the following two sections:

[RH] Variables for working with a real storage device
DEVICE=/dev/sda
PART1=1
PART2=2
PART3=3

The device name is only an example: when you add a physical disk, you must identify the new device name and carefully avoid blasting a device that's already part of your operating system.

IMPORTANT: Adding or removing devices can alter all device and partition names after reboot. This is why modern Linux distributions avoid using them in places like fstab. We will convert device names to UUIDs as we proceed.

[VM] Variables for working with a virtual machine
DEVICE=/dev/nbd0
PART1=p1
PART2=p2
PART3=p3
IMAGE=/var/lib/libvirt/images/$POOL.qcow2

In the virtual machine case, the device name will always be the same unless you're using nbd devices for some other purpose.
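
The two variable groups differ only in how partition names are built: a disk whose name ends in a digit (nbd0) gets a "p" before the partition number, while one that ends in a letter (sda) does not. A small sketch of that rule, assuming a hypothetical helper named part_dev that the rest of this guide does not use:

```shell
# part_dev DEVICE NUMBER: print the partition device path.
# Hypothetical helper; a "p" is inserted when the disk name itself
# ends in a digit, which is the kernel's naming convention.
part_dev() {
    case "$1" in
        *[0-9]) printf '%s\n' "${1}p${2}" ;;
        *)      printf '%s\n' "${1}${2}" ;;
    esac
}
```

With this, part_dev /dev/sda 1 prints /dev/sda1 and part_dev /dev/nbd0 3 prints /dev/nbd0p3, matching the PART1..PART3 values above.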

[VM] Create a virtual disk
qemu-img create -f qcow2 ${IMAGE} 10G
[VM] Mount the virtual disk in the host file system
modprobe nbd
qemu-nbd --connect=/dev/nbd0 ${IMAGE} -f qcow2
[RH] Deal with old ZFS residue

If your target disk was ever part of a zfs pool, you need to clear the label before you repartition the device. First list all partitions:

sgdisk -p $DEVICE

For each partition number "n" that has type BF01 "Solaris /usr & Mac ZFS", execute:

zpool labelclear -f ${DEVICE}n

If you suspect the whole disk (no partitions) was part of a zfs array, clear that label using:

zpool labelclear -f ${DEVICE}
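
If the disk has many partitions, picking out the BF01 entries by eye gets tedious. The following is a sketch, assuming the usual column layout of "sgdisk -p" (number, start, end, size, unit, code, name); the function name zfs_partitions is made up for this example.

```shell
# zfs_partitions: read "sgdisk -p" output on stdin and print the number
# of every partition whose type code is BF01.
zfs_partitions() {
    awk '$1 ~ /^[0-9]+$/ && $6 == "BF01" { print $1 }'
}

# Assumed usage, with DEVICE defined as above:
# for n in $(sgdisk -p "$DEVICE" | zfs_partitions); do
#     zpool labelclear -f "${DEVICE}${n}"
# done
```

Print the list first and read it before letting the loop loose on a real disk.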

Partition the target

This example uses a very simple layout: An EFI partition, a boot partition and a ZFS partition that fills the rest of the disk.

Erase the existing partition table
sgdisk -Z $DEVICE
Create a 200MB EFI partition (PART1)
sgdisk -n 1:0:+200Mib -t 1:EF00 -c 1:EFI $DEVICE
Create a 500MB boot partition (PART2)
sgdisk -n 2:0:+500Mib -t 2:8300 -c 2:Boot $DEVICE
Create a ZFS partition (PART3) using the rest of the disk:
sgdisk -n 3:0:0 -t 3:BF01 -c 3:ZFS $DEVICE
Format EFI and boot partitions
mkfs.fat -F32 ${DEVICE}${PART1}
mkfs.ext4 ${DEVICE}${PART2}

Create the ZFS pool and datasets

Create a pool
zpool create $POOL -m none ${DEVICE}${PART3} -o ashift=12 -o cachefile=none

This is a very simple layout that has no redundancy. For a production system, you would create a mirror, raidz array or some combination. These topics are covered on many websites such as ZFS Without Tears.

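If for some reason you want to keep using a system with one device, adding the following option to zpool create stores two copies of every block (and halves the usable space). Because copies is a file system property rather than a pool property, it takes an uppercase -O. This guards against damaged blocks, not against losing the whole device:

-O copies=2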
Set pool properties
zfs set compression=on $POOL
zfs set atime=off $POOL
Re-import the pool so devices are identified by UUIDs
zpool export $POOL
udevadm trigger --settle
zpool import $POOL -d /dev/disk/by-uuid -o altroot=/target -o cachefile=none
Create datasets
zfs create $POOL/fedora -o xattr=sa -o acltype=posixacl
zfs create $POOL/fedora/var       -o exec=off -o setuid=off -o canmount=off
zfs create $POOL/fedora/var/cache 
zfs create $POOL/fedora/var/log
zfs create $POOL/fedora/var/spool 
zfs create $POOL/fedora/var/lib   -o exec=on
zfs create $POOL/fedora/var/tmp   -o exec=on
zfs create $POOL/www              -o exec=off -o setuid=off 
zfs create $POOL/home                         -o setuid=off
zfs create $POOL/root

The motivation for using multiple datasets is similar to the reason more conventional systems use multiple LVM volumes:

Set ZFS mountpoints
zfs set mountpoint=/        $POOL/fedora
zfs set mountpoint=/var     $POOL/fedora/var
zfs set mountpoint=/var/www $POOL/www
zfs set mountpoint=/home    $POOL/home
zfs set mountpoint=/root    $POOL/root

The reason for using ZFS mountpoints during installation is to avoid modifying the host system's fstab and to smooth the transition to the chroot environment for the final installation steps.

Later we'll switch to legacy mountpoints. During Fedora updates or upgrades, files sometimes get saved in mountpoint directories before ZFS gets around to mounting the datasets at boot time. This is a catastrophe because datasets can't be mounted on non-empty directories. The files they contain will become invisible and the system will fail to boot or exhibit bizarre symptoms. Fedora's update scripts know about fstab and make sure things are mounted at the right time. Hence we must accommodate.
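
To make the failure concrete, here is a sketch of a check for stray files hiding in mountpoint directories. It assumes you run it from the rescue system while the child datasets are not mounted; the function name nonempty_mountpoints is invented for this example.

```shell
# nonempty_mountpoints ROOT DIR...: print each DIR under ROOT that exists
# and already contains files. Run it while the datasets are NOT mounted.
nonempty_mountpoints() {
    nm_root=$1
    shift
    for nm_dir in "$@"; do
        if [ -d "$nm_root$nm_dir" ] &&
           [ -n "$(ls -A "$nm_root$nm_dir" 2>/dev/null)" ]; then
            printf '%s\n' "$nm_dir"
        fi
    done
}
```

For instance, nonempty_mountpoints /target /home /root /var/www would list any of those directories that picked up files behind ZFS's back. An empty result is what you want.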

Don't snapshot useless data
zfs set com.sun:auto-snapshot=false $POOL/fedora/var/tmp 
zfs set com.sun:auto-snapshot=false $POOL/fedora/var/cache

When com.sun:auto-snapshot=false, 3rd party snapshot software is supposed to exclude the dataset. Otherwise all datasets are included in snapshots.

This is an example of a user-created property. ZFS itself doesn't attach any meaning to such properties. They conventionally have "owned" names based on DNS to avoid conflicts.
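
Since ZFS only stores the property, it's up to you to check what was set. A sketch of a filter for the tab-separated output of "zfs get -H -o name,value", where an unset user property shows up as "-"; the function name snapshot_excluded is an assumption of this example.

```shell
# snapshot_excluded: read "zfs get -H -o name,value com.sun:auto-snapshot"
# output (tab-separated) on stdin and print the datasets set to false.
snapshot_excluded() {
    awk -F '\t' '$2 == "false" { print $1 }'
}

# Assumed usage:
# zfs get -H -r -o name,value com.sun:auto-snapshot "$POOL" | snapshot_excluded
```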

Mount the boot partition
mkdir /target/boot
mount -U `lsblk -nr ${DEVICE}${PART2} -o UUID` /target/boot
rm -rf /target/boot/*
Mount the EFI partition
mkdir /target/boot/efi
mount -U `lsblk -nr ${DEVICE}${PART1} -o UUID` /target/boot/efi -o umask=0077,shortname=winnt
rm -rf /target/boot/efi/*

The "rm -rf" expressions are there in case you're repeating these instructions on a previously partitioned device where an operating system was installed.


Install the operating system

Install a minimal Fedora system
dnf install -y --installroot=/target --releasever=$VER \
    @minimal-environment \
    kernel kernel-modules kernel-modules-extra \
    grub2-efi-x64 shim-x64 mactel-boot

Optional: Add your favorite desktop environment to the list e.g. @cinnamon-desktop.

Install ZFS
dnf install -y --installroot=/target --releasever=$VER \
    http://download.zfsonlinux.org/fedora/zfs-release.fc$VER.noarch.rpm

dnf install -y --installroot=/target --releasever=$VER \
    zfs zfs-dracut

Configure the target

Configure name resolver
cat > /target/etc/resolv.conf <<-EOF
search csparks.com
nameserver 192.168.1.2
EOF

(Be yourself.)

You may object that NetworkManager likes to use a symbolic link here that vectors off into NetworkManager Land. This concept has caused numerous boot failures on most of the systems I manage because of permission problems in the target directory. These can be corrected by hand, but I've had an easier life since I took over this file and used the traditional contents. Your mileage may vary. Someday Fedora will correct the problem. If you're in the mood to find out, don't create this file.

Show full path names in "zpool status"
cat > /target/etc/profile.d/grub2_zpool_fix.sh <<-EOF
export ZPOOL_VDEV_NAME_PATH=YES
EOF
[VM] Tell dracut to include the virtio_blk device
cat > /target/etc/dracut.conf.d/fs.conf <<-EOF
filesystems+=" virtio_blk "
EOF

Keep the spaces around virtio_blk!

Don't use zpool.cache
cat > /target/etc/default/zfs <<-EOF
ZPOOL_CACHE="none"
ZPOOL_IMPORT_OPTS="-o cachefile=none"
EOF
Set grub parameters
cat > /target/etc/default/grub <<-EOF
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=Fedora
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT=console
GRUB_DISABLE_RECOVERY=true
GRUB_DISABLE_OS_PROBER=true
GRUB_PRELOAD_MODULES=zfs
GRUB_ENABLE_BLSCFG=false
EOF

We're going to switch to BLS later.

Disable selinux
sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /target/etc/selinux/config
Create a hostid file
chroot /target zgenhostid
Add user+password
chroot /target useradd $USER -c "$NAME" -G wheel
echo "$USER:$PASW" | chpasswd -R /target
Prepare for first boot
systemd-firstboot \
--root=/target \
--locale=C.UTF-8 \
--keymap=us \
--hostname=$POOL \
--setup-machine-id
Create fstab for legacy mountpoints
cat > /target/etc/fstab <<-EOF
UUID=`lsblk -nr ${DEVICE}${PART2} -o UUID` /boot ext4 defaults 0 0
UUID=`lsblk -nr ${DEVICE}${PART1} -o UUID` /boot/efi vfat umask=0077,shortname=winnt 0 2
$POOL/fedora/var/cache   /var/cache  zfs   defaults 0 0
$POOL/fedora/var/lib     /var/lib    zfs   defaults 0 0
$POOL/fedora/var/log     /var/log    zfs   defaults 0 0
$POOL/fedora/var/spool   /var/spool  zfs   defaults 0 0
$POOL/fedora/var/tmp     /var/tmp    zfs   defaults 0 0
$POOL/www                /var/www    zfs   defaults 0 0
$POOL/home               /home       zfs   defaults 0 0
$POOL/root               /root       zfs   defaults 0 0
EOF
Switch to legacy mountpoints
zfs set mountpoint=legacy $POOL/fedora/var
zfs set mountpoint=legacy $POOL/www
zfs set mountpoint=legacy $POOL/home
zfs set mountpoint=legacy $POOL/root
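
Every zfs line in fstab has to name a dataset that really ends up with mountpoint=legacy, either set directly or inherited from $POOL/fedora/var. A sketch for cross-checking the two, assuming a hypothetical function name fstab_zfs_datasets:

```shell
# fstab_zfs_datasets: read an fstab on stdin and print the dataset name of
# every entry whose filesystem type is zfs, skipping comment lines.
fstab_zfs_datasets() {
    awk '$1 !~ /^#/ && $3 == "zfs" { print $1 }'
}

# Assumed usage: each dataset should report "legacy".
# for ds in $(fstab_zfs_datasets < /target/etc/fstab); do
#     printf '%s %s\n' "$ds" "$(zfs get -H -o value mountpoint "$ds")"
# done
```
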
Chroot into the target
zenter /target
mount -a
Prepare for grub2-mkconfig
source /etc/profile.d/grub2_zpool_fix.sh

Running grub2-mkconfig will fail without this definition. It will always be defined after logging into the target, but we're not there yet.

Configure boot loader
grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
grub2-switch-to-blscfg
Use import scanning instead of zfs cache:
systemctl disable zfs-import-cache
systemctl enable zfs-import-scan
Collect kernel and zfs version strings
kver=`rpm -q --last kernel | sed '1q' | sed 's/kernel-//' | sed 's/ .*$//'`
zver=`rpm -q zfs | sed 's/zfs-//' | sed 's/\.fc.*$//' | sed 's/-[0-9]//'`
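
The sed chains are terse, so here they are again as filters you can feed sample text. This is only a sketch: the function names are invented, and the sample package strings mentioned below are assumed examples rather than output captured from a real system.

```shell
# The same sed steps as above, wrapped as filters for experimenting.
# kernel_version: keep the first line, drop "kernel-" and everything
# from the first space (the install date) onward.
kernel_version() {
    sed '1q' | sed 's/kernel-//' | sed 's/ .*$//'
}

# zfs_version: drop "zfs-", the ".fcNN..." tail, and the "-N" release.
zfs_version() {
    sed 's/zfs-//' | sed 's/\.fc.*$//' | sed 's/-[0-9]//'
}
```

Feeding 'kernel-5.8.12-200.fc32.x86_64 Thu 01 Oct 2020' to kernel_version yields 5.8.12-200.fc32.x86_64, and 'zfs-0.8.4-1.fc32.x86_64' through zfs_version yields 0.8.4. Those are the forms dkms expects for -k and -v.
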
Patch zfs for kernel-5.8.x with zfs-0.8.4

This topic is very active on the ZOL issue tracker and has already been fixed in the upstream project. As of zfs-0.8.5, the patch is no longer needed. If you're running zfs-0.8.4, proceed:

dnf install -y patch wget
cd /usr/src/zfs-$zver
wget -q https://server.csparks.com/BootFedoraZFS/vmalloc.patch
patch -u -s -p1 < vmalloc.patch

This ugliness patches two lines in the zfs source code where the __vmalloc function is used. As of kernel-5.8.x, the function has two parameters rather than three.
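
If you script this procedure, the patch step should only run for the one combination that needs it. A minimal sketch of that test, assuming the version strings collected in kver and zver above; the function name needs_vmalloc_patch is hypothetical.

```shell
# needs_vmalloc_patch ZVER KVER: succeed when zfs is 0.8.4 and the kernel
# is a 5.8.x release, the combination this section describes.
needs_vmalloc_patch() {
    [ "$1" = "0.8.4" ] || return 1
    case "$2" in
        5.8.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Assumed usage:
# if needs_vmalloc_patch "$zver" "$kver"; then
#     echo "apply vmalloc.patch before building"
# fi
```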

Build and install zfs modules
dkms install -m zfs -v $zver -k $kver
Exit the chroot
umount /boot/efi
umount /boot
exit
Export the pool
zpool export $POOL 

Boot the target

[RH] Reboot and select the new UEFI disk
It works!
[VM] Disconnect the virtual disk
qemu-nbd --disconnect /dev/nbd0

If you forget to disconnect the nbd device, the virtual machine won't be able to access the virtual disk.

[VM] Create a virtual machine
virt-install \
--name=$POOL \
--os-variant=fedora$VER \
--vcpus=4 \
--memory=32000 \
--boot=uefi \
--disk path=$IMAGE,format=qcow2 \
--import \
--noreboot \
--noautoconsole \
--wait=-1

You only need to do this once. By replacing the disk image file, other configurations can be tested on the same vm.

[VM] Startup

Use the VirtManager GUI or:

virsh start $POOL
virt-viewer $POOL

Additional configuration

Things to do after you've successfully logged in.

Set the timezone
timedatectl set-timezone America/Chicago
timedatectl set-ntp true
Give your system a nice name
hostnamectl set-hostname magoo

Complaints and suggestions

I detest superstitions, gratuitous complications, obscure writing, and bugs. If you get stuck or if your understanding exceeds mine, please share your thoughts. (I like to hear good news too.)


References


Appendix - Deal with updates and upgrades

In the past, it was necessary to be vigilant when doing "dnf update" or a Fedora upgrade because a new kernel or zfs version made it necessary to run a fixup script before rebooting. In the dark ages before Fedora 31, this script was fairly complicated.

With the advent of BLS combined with other improvements by the kernel and zfs packagers, this is no longer necessary. After any update you can reboot with confidence that you'll never see the Black Screen Of Dracut or the Dread Prompt Of Grub.

If you're rash enough to be booting Fedora on ZFS in a production system, it's almost imperative that you maintain a simple virtual machine in parallel. When you see that updates are available, clone the VM and update that first. If it won't boot, attempt your fixes there. If all else fails, freeze kernel updates on your production system and wait for better times. (See Appendix - Freeze kernel updates.)


Appendix - Pure ZFS systems

With UEFI motherboards, the only way to "ZFS purity" is to put your EFI partition on a separate device, rather than on a partition of a device that also has all or part of a ZFS pool. It's also possible to do away with the ext4 /boot partition by keeping it in a dataset, but this will put you into contention with the "pool features vs grub supported features" typhoon of uncertainty. (See Grub-compatible pool creation.)

A better way, in my opinion, is to use a small SSD with both EFI and boot partitions. The ZFS pool for the rest of the operating system can be assembled from disks without partitions, "whole disks", which most ZFS pundits recommend. This example doesn't follow that advice because it's intended to be a simplified tutorial.

If you still want to have /boot on ZFS, it's necessary to add the grub2 zfs modules to the efi partition:

dnf install grub2-efi-x64-modules
mkdir -p /target/boot/efi/EFI/fedora/x86_64-efi
cp -a /target/usr/lib/grub/x86_64-efi/zfs* /target/boot/efi/EFI/fedora/x86_64-efi

The zfs.mod file in that collection does not support all possible pool features, but it will work if you find a compromise. Currently, the zfs.mod with Fedora 32 will handle a ZFS pool with default "compression=on" settings created using zfs-0.8.4.


Appendix - Fix boot problems

You'll need a thumb drive or other detachable device that has a Linux system and ZFS support. Boot the device.

Import the pool
zpool import -f $POOL -o altroot=/target
Chroot into the system
zenter /target
mount -a
Rebuild the zfs modules
dnf reinstall zfs-dkms

If you see errors from dkms, you'll probably have to revert to an earlier kernel and/or version of zfs. Such problems are temporary and rare.

Rebuild the EFI partition

First make sure you're running in chroot (zenter) and that the right /boot/efi partition is mounted:

df -h

Next run:

rm -rf /boot/efi/*
dnf reinstall grub2-efi-x64 shim-x64 fwupdate-efi mactel-boot
Reinstall BLS

Edit /etc/default/grub and disable BLS:

...
GRUB_ENABLE_BLSCFG=false
...

Then run:

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
grub2-switch-to-blscfg
Delete the Abominable Cache File
rm -f /etc/zfs/zpool.cache

This thing has a way of rising from the dead.

Update initrd
kver=`rpm -q --last kernel | sed '1q' | sed 's/kernel-//' | sed 's/ .*$//'`
dracut -fv --kver $kver
After any or all of these interventions, exit with:
umount /boot/efi
umount /boot
exit
zpool export $POOL
Reboot
Learn from others

Visit the ZFS Issue Tracker and see what others discover. If your problem is unique, join up and post a question.


Appendix - Freeze kernel updates

If you discover that you can't build the zfs modules for a new kernel, you'll have to use your recovery device and revert. (Or use a virtual machine to find out without blowing yourself up.)

Once you've got your system running again, you can "version lock" the kernel packages. This will allow other Fedora updates to proceed, but hold the kernel at the current version:

dnf versionlock add kernel-`uname -r`
dnf versionlock add kernel-core-`uname -r`
dnf versionlock add kernel-devel-`uname -r`
dnf versionlock add kernel-modules-`uname -r`
dnf versionlock add kernel-modules-extra-`uname -r`
dnf versionlock add kernel-headers-`uname -r`

When it's safe to allow kernel updates, you can release all locks using the expression:

dnf versionlock clear

If you have locks on other packages and don't want to clear all of them, you can release only the previous kernel locks:

dnf versionlock delete kernel-`uname -r`
dnf versionlock delete kernel-core-`uname -r`
dnf versionlock delete kernel-devel-`uname -r`
dnf versionlock delete kernel-modules-`uname -r`
dnf versionlock delete kernel-modules-extra-`uname -r`
dnf versionlock delete kernel-headers-`uname -r`
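
The add and delete lists are the same six package names with a different verb, so a loop can replace either block. A sketch, assuming the hypothetical function name kernel_packages:

```shell
# kernel_packages VERSION: print the six kernel package specs used above,
# one per line, each with VERSION appended.
kernel_packages() {
    for kp_name in kernel kernel-core kernel-devel kernel-modules \
                   kernel-modules-extra kernel-headers; do
        printf '%s-%s\n' "$kp_name" "$1"
    done
}

# Assumed usage (swap "add" for "delete" to release the locks):
# for p in $(kernel_packages "$(uname -r)"); do
#     dnf versionlock add "$p"
# done
```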

Appendix - Stuck in the emergency shell

The screen is mostly black with plain text. You see:

[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.

This is the Black Screen Of Dracut.

You'll be invited to run journalctl which will list the whole boot sequence. Near the end, carefully inspect lines that mention ZFS. There are three common cases:

1) Journal entry looks like this:

systemd[1]: Failed to start Import ZFS pools by cache file.

You are a victim of the Abominable Cache File. The fix is easy. Boot your recovery device, enter the target, and follow the section that deals with getting rid of the cache file in Appendix - Fix boot problems.

2) Journal entry looks like this:

...
Starting Import ZFS pools by device scanning...
cannot import 'Magoo': pool was previously in use from another system.

You probably forgot to export the pool after tampering with it from another system. (Such as when you previously used the recovery device.) You can fix the problem from the emergency shell, substituting your pool name for myPool:

zpool import -f myPool -N
zpool export myPool
reboot

3) If you see messages about not being able to load the zfs modules, that may be normal because it takes several tries during the boot sequence. But if the boot ends up being unable to load the modules, try this:

modprobe zfs

If that fails, the zfs modules were never built or they were left out of the initramfs. To fix that, go through the entire sequence described in Appendix - Fix boot problems.

If you can execute the modprobe successfully, you should try the next fix:


Appendix - Work-around for a race condition

During boot, it's normal to see a few entries like this in the journal:

dracut-pre-mount[508]: The ZFS modules are not loaded.
dracut-pre-mount[508]: Try running '/sbin/modprobe zfs' as root to load them.

But if the zfs modules aren't loaded by the time dracut wants to mount the root filesystem, the boot will fail. This problem was reported in 2019: ZOL 0.8 Not Loading Modules or ZPools on Boot #8885. I never saw this until I tried to boot a fast flash drive on a slow computer. Since I knew the flash drive worked on other machines, I was surprised to see The Black Screen Of Dracut.

Here's a fix you can apply when your root-on-zfs device is mounted for repair on /target:

mkdir /target/etc/systemd/system/systemd-udev-settle.service.d
cat > /target/etc/systemd/system/systemd-udev-settle.service.d/override.conf <<-EOF
[Service]
ExecStartPre=/usr/bin/sleep 5
EOF
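
Five seconds was enough on my slow machine, but yours may differ. Here is a sketch of the same drop-in with the delay as a parameter, assuming a hypothetical helper named write_settle_delay. It writes exactly the file shown above, only with your number in it.

```shell
# write_settle_delay ROOT SECONDS: create the drop-in shown above under
# ROOT with a configurable delay. Hypothetical helper for experimenting.
write_settle_delay() {
    wsd_dir="$1/etc/systemd/system/systemd-udev-settle.service.d"
    mkdir -p "$wsd_dir" &&
    printf '[Service]\nExecStartPre=/usr/bin/sleep %s\n' "$2" \
        > "$wsd_dir/override.conf"
}

# Assumed usage:
# write_settle_delay /target 10
```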

Appendix - Stuck in grub

A black screen with an enigmatic prompt:

grub>

This is the Dread Prompt Of Grub.

Navigating this little world merits a separate document: Grub Expressions. A nearly-foolproof solution is to run through Appendix - Fix boot problems. Pay particular attention to the step where the entire /boot/efi partition is recreated.


Appendix - Enable swapping

Using a zvol for swapping is problematic (as of 2020-08, zfs 0.8.4). If you feel the urge to try, first read the swap deadlock thread.

Sooner or later, the issues will be fixed. (Maybe now?) Here's how to try it out:

Create a swap dataset
zfs create $POOL/swap \
    -V 4G \
    -o volblocksize=4k \
    -o compression=zle \
    -o refreservation=4.13G \
    -o primarycache=metadata \
    -o secondarycache=none \
    -o logbias=throughput \
    -o sync=always \
    -o com.sun:auto-snapshot=false
Add the swap volume to fstab:
...
/dev/zvol/pool/swap   none  swap   defaults 0 0
...

After you're running the target, format the volume and enable swapping (substitute your pool name for "pool"):

mkswap /dev/zvol/pool/swap
swapon -av

This setting is remembered so swapping will operate after reboot.

Don't enable hibernation. It tries to use swap space for the memory image but the dataset is not available early enough in the boot process.