Reliably boot Fedora with root on ZFS

Revised 2024-10-09, Corrections/complaints contact Hugh Sparks

What's all this?

This article is a walk-through for installing Fedora Linux with root on ZFS. It has been tested with:

Fedora 32, kernel-5.7.*,   zfs-0.8.3
...
Fedora 39, kernel-6.10.10, zfs-2.2.7

Notes:

1) If running "dnf update" shows that both zfs and the kernel will be updated, it's best to cancel the update and do the zfs update by itself first. Then reboot and continue the update.

dnf update zfs zfs-dkms zfs-dracut
dnf update

2) Sometimes Fedora will introduce a kernel beyond what is currently supported by ZFS. When you see an update that includes a new kernel, check the ZFS on Linux website to make sure your prospective new kernel is supported.
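
The check in note 1 can be scripted. The following is only a sketch: the function name both_kernel_and_zfs is made up for illustration, and it assumes the usual "name.arch version repo" columns printed by "dnf check-update".

```shell
#!/bin/bash
# Sketch: succeed (exit 0) only if a pending update list names both a
# kernel package and a zfs package. Reads "dnf check-update" style
# lines on stdin; the package name is expected in the first column.
both_kernel_and_zfs() {
    awk '
        $1 ~ /^kernel(-core)?\./ { kernel = 1 }
        $1 ~ /^zfs(-dkms)?\./    { zfs = 1 }
        END { exit !(kernel && zfs) }
    '
}

# Typical use on a live system (not executed here):
#   if dnf -q check-update | both_kernel_and_zfs; then
#       dnf update zfs zfs-dkms zfs-dracut
#   fi
```

The pipeline's status comes from the function, not from dnf, so the nonzero exit code dnf uses to signal "updates available" does not interfere.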

Prior art

Earlier (and unfortunately far more complex) versions of this document exist:

Followup

Appendix - A script to build ZFS modules
Appendix - Deal with updates and upgrades
Appendix - Pure ZFS systems
Appendix - Fix boot problems
Appendix - Stuck in the emergency shell
Appendix - Work-around for a race condition
Appendix - Stuck in grub
Appendix - Freeze kernel updates
Appendix - Enable swapping

Preliminaries

Hardware

You can work with real hardware or a virtual machine. Some section names start with [RH] "Real hardware" or [VM] "Virtual machine" - they only apply to those respective cases. Everything else applies to both. If this is your first time, following the virtual machine path is a good way to learn without committing hardware or accidentally reformatting your working system disk.

Installer system

You'll need a Fedora Linux system that has support for ZFS to follow this guide. After installing Fedora, visit the ZFS on Linux site and follow the instructions.

I suggest creating this system on a removable device and keeping it in a safe place because it's occasionally necessary to rescue root-on-zfs systems.

Helper script

We will create a root-on-zfs operating system by running commands mostly in the host environment. Some steps, however, have to be taken inside the target, which is done via the "chroot" command. Without additional configuration, many linux commands won't work inside a chroot. To fix that, we need a special script, "zenter". Some-but-not-all linux distributions provide a command that does this. (Not Fedora...)

Here's the source. Save it in a file "zenter.sh" and proceed. (Or you can download zenter here.)

#!/bin/bash
# zenter - Mount system directories and enter a chroot

target=$1

mount -t proc  proc $target/proc
mount -t sysfs sys $target/sys
mount -o bind /dev $target/dev
mount -o bind /dev/pts $target/dev/pts

chroot $target /bin/env -i \
    HOME=/root TERM="$TERM" PS1='[\u@chroot \W]\$ ' \
    PATH=/bin:/usr/bin:/sbin:/usr/sbin \
    /bin/bash --login

echo "Exiting chroot environment..."

umount $target/dev/pts
umount $target/dev/
umount $target/sys/
umount $target/proc/

Install the script to a directory on your PATH:

cp -a zenter.sh /usr/local/sbin/zenter

Variables

Installation variables:

VER=34
POOL=Magoo
USER=hugh
PASW=mxyzptlk
NAME="Hugh Sparks"

Define a group of variables from one of the following two sections:

[RH] Variables for working with a real storage device

DEVICE=/dev/sda
PART1=1
PART2=2
PART3=3

The device name is only an example: when you add a physical disk, you must identify the new device name and carefully avoid blasting a device that's already part of your operating system.

IMPORTANT: Adding or removing devices can alter all device and partition names after reboot. This is why modern linux distributions avoid using them in places like fstab. We will convert device names to UUIDs as we proceed.

[VM] Variables for working with a virtual machine

DEVICE=/dev/nbd0
PART1=p1
PART2=p2
PART3=p3
IMAGE=/var/lib/libvirt/images/$POOL.qcow2

In the virtual machine case, the device name will always be the same unless you're using nbd devices for some other purpose.
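
The PART variables differ between the two cases because linux inserts a "p" before the partition number when the disk name itself ends in a digit (nbd0, nvme0n1), and nothing otherwise (sda). A small sketch of that rule follows; the function name part_path is hypothetical and not used elsewhere in this guide.

```shell
#!/bin/bash
# Sketch: build a partition device path from a disk path and a
# partition number, following the kernel naming convention.
part_path() {
    local disk=$1 num=$2
    case $disk in
        *[0-9]) printf '%s\n' "${disk}p${num}" ;;  # e.g. /dev/nbd0 -> /dev/nbd0p3
        *)      printf '%s\n' "${disk}${num}"  ;;  # e.g. /dev/sda  -> /dev/sda3
    esac
}
```

If your real hardware is an NVMe drive, this is why you would set PART1=p1 and so on, just as in the virtual machine case.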

[VM] Create a virtual disk

qemu-img create -f qcow2 ${IMAGE} 10G

[VM] Mount the virtual disk in the host file system

modprobe nbd
qemu-nbd --connect=/dev/nbd0 ${IMAGE} -f qcow2

[RH] Deal with old ZFS residue

If your target disk was ever part of a zfs pool, you need to clear the label before you repartition the device. First list all partitions:

sgdisk -p $DEVICE

For each partition number "n" that has type BF01 "Solaris /usr & Mac ZFS", execute:

zpool labelclear -f ${DEVICE}n

If you suspect the whole disk (no partitions) was part of a zfs array, clear that label using:

zpool labelclear -f ${DEVICE}
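
Picking out the BF01 partitions by eye is easy to get wrong on a crowded disk. Here is a hedged sketch that does it with awk. It assumes the usual "sgdisk -p" column layout (Number, Start, End, Size, unit, Code, Name), so the type code is the sixth field; the function name is invented for this example.

```shell
#!/bin/bash
# Sketch: read "sgdisk -p" output on stdin and print the numbers of
# partitions whose type code is BF01. Header lines are skipped because
# their first field is not a number.
zfs_partition_numbers() {
    awk '$1 ~ /^[0-9]+$/ && $6 == "BF01" { print $1 }'
}

# Use on a real disk (not executed here):
#   for n in $(sgdisk -p "$DEVICE" | zfs_partition_numbers); do
#       zpool labelclear -f "${DEVICE}${n}"
#   done
```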

Partition the target

This example uses a very simple layout: An EFI partition, a boot partition and a ZFS partition that fills the rest of the disk.

Erase the existing partition table

sgdisk -Z $DEVICE

Create a 200MB EFI partition (PART1)

sgdisk -n 1:0:+200Mib -t 1:EF00 -c 1:EFI $DEVICE

Create a 500MB boot partition (PART2)

sgdisk -n 2:0:+500Mib -t 2:8300 -c 2:Boot $DEVICE

Create a ZFS partition (PART3) using the rest of the disk:

sgdisk -n 3:0:0 -t 3:BF01 -c 3:ZFS $DEVICE

Format EFI and boot partitions

mkfs.fat -F32 ${DEVICE}${PART1}
mkfs.ext4 ${DEVICE}${PART2}

Create the ZFS pool and datasets

Create a pool

zpool create $POOL -m none ${DEVICE}${PART3} -o ashift=12 -o cachefile=none

This is a very simple layout that has no redundancy. For a production system, you would create a mirror, raidz array or some combination. These topics are covered on many websites such as ZFS Without Tears.

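If for some reason you want to keep using a system with one device, adding the following option to zpool create stores every block twice (at the cost of half the space). Note the capital -O: copies is a dataset property, not a pool property, so a lowercase -o is rejected. This protects against bad sectors, not against losing the whole device:

-O copies=2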

Set pool properties

zfs set compression=on $POOL
zfs set atime=off $POOL

Re-import the pool so devices are identified by UUIDs

zpool export $POOL
udevadm trigger --settle
zpool import $POOL -d /dev/disk/by-uuid -o altroot=/target -o cachefile=none

Create datasets

zfs create $POOL/fedora -o xattr=sa -o acltype=posixacl
zfs create $POOL/fedora/var       -o exec=off -o setuid=off -o canmount=off
zfs create $POOL/fedora/var/cache 
zfs create $POOL/fedora/var/log
zfs create $POOL/fedora/var/spool 
zfs create $POOL/fedora/var/lib   -o exec=on
zfs create $POOL/fedora/var/tmp   -o exec=on
zfs create $POOL/www              -o exec=off -o setuid=off 
zfs create $POOL/home                         -o setuid=off
zfs create $POOL/root

The motivation for using multiple datasets is similar to the reason more conventional systems use multiple LVM volumes:

To preserve user data between operating systems.
To assign special properties to selected datasets and their children.
To isolate user accounts and enforce quotas.
To avoid mixing user data with operating system files.
To segregate static and dynamic operating system files.
To control the snapshot process.

Set ZFS mountpoints

zfs set mountpoint=/        $POOL/fedora
zfs set mountpoint=/var     $POOL/fedora/var
zfs set mountpoint=/var/www $POOL/www
zfs set mountpoint=/home    $POOL/home
zfs set mountpoint=/root    $POOL/root

The reason for using ZFS mountpoints during installation is to avoid modifying the host system's fstab and to smooth the transition to the chroot environment for the final installation steps.

Later we'll switch to legacy mountpoints. During Fedora updates or upgrades, files sometimes get saved in mountpoint directories before ZFS gets around to mounting the datasets at boot time. This is a catastrophe because datasets can't be mounted on non-empty directories. The files they contain will become invisible and the system will fail to boot or exhibit bizarre symptoms. Fedora's update scripts know about fstab and make sure things are mounted at the right time. Hence we must accommodate.
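
If you suspect that stray files have landed in a mountpoint directory, you can look for them while the datasets are unmounted. The following is a sketch only; nonempty_dirs is a hypothetical helper, and it relies on GNU find's -quit option, which Fedora provides.

```shell
#!/bin/bash
# Sketch: print every argument that is a non-empty directory.
# Returns 1 if at least one such directory was found, 0 otherwise.
nonempty_dirs() {
    local d status=0
    for d in "$@"; do
        # find stops at the first entry it sees, so large trees are cheap
        if [ -d "$d" ] && [ -n "$(find "$d" -mindepth 1 -maxdepth 1 -print -quit)" ]; then
            printf '%s\n' "$d"
            status=1
        fi
    done
    return $status
}

# Example, run with the child datasets unmounted (not executed here):
#   nonempty_dirs /var/cache /var/lib /var/log /var/spool /var/tmp /var/www /home /root
```

Any path it prints holds files that would be hidden once the dataset is mounted on top.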

Don't snapshot volatile directories

zfs set com.sun:auto-snapshot=false $POOL/fedora/var/tmp 
zfs set com.sun:auto-snapshot=false $POOL/fedora/var/cache

When com.sun:auto-snapshot=false, 3rd party snapshot software is supposed to exclude the dataset. Otherwise all datasets are included in snapshots.

This is an example of a user-created property. ZFS itself doesn't attach any meaning to such properties. They conventionally have "owned" names based on DNS to avoid conflicts.
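
To see which datasets currently opt out, you can filter the output of "zfs get". The sketch below assumes the tab-separated format produced by the -H option; the function name is made up for illustration.

```shell
#!/bin/bash
# Sketch: read "name<TAB>value" lines on stdin and print the names of
# datasets whose com.sun:auto-snapshot property is "false".
# Datasets where the property is unset show "-" and are skipped.
excluded_from_snapshots() {
    awk -F '\t' '$2 == "false" { print $1 }'
}

# Use on a live pool (not executed here):
#   zfs get -H -r -o name,value com.sun:auto-snapshot "$POOL" | excluded_from_snapshots
```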

Mount the boot partition

mkdir /target/boot
mount -U `lsblk -nr ${DEVICE}${PART2} -o UUID` /target/boot
rm -rf /target/boot/*

Mount the EFI partition

mkdir /target/boot/efi
mount -U `lsblk -nr ${DEVICE}${PART1} -o UUID` /target/boot/efi -o umask=0077,shortname=winnt
rm -rf /target/boot/efi/*

The "rm -rf" expressions are there in case you're repeating these instructions on a previously partitioned device where an operating system was installed.

Install the operating system

Install a minimal Fedora system

dnf install -y --installroot=/target --releasever=$VER \
    @minimal-environment \
    kernel kernel-modules kernel-modules-extra \
    grub2-efi-x64 shim-x64 mactel-boot

Optional: Add your favorite desktop environment to the list e.g. @cinnamon-desktop.

UPDATE: A few errors/warnings will be reported because some of the grub2 components expect the system to be live. This gets resolved in a later step.

Install the ZFS repository

dnf install -y --installroot=/target --releasever=$VER \
    http://download.zfsonlinux.org/fedora/zfs-release.fc$VER.noarch.rpm

Install ZFS

dnf install -y --installroot=/target --releasever=$VER zfs zfs-dracut

Configure the target

Configure name resolver

cat > /target/etc/resolv.conf <<-EOF
search csparks.com
nameserver 192.168.1.2
EOF

(Be yourself.)

You may object that NetworkManager likes to use a symbolic link here that vectors off into NetworkManager Land. This concept has caused numerous boot failures on most of the systems I manage because of permission problems in the target directory. These can be corrected by hand, but I've had an easier life since I took over this file and used the traditional contents. Your mileage may vary. Someday Fedora will correct the problem. If you're in the mood to find out, don't create this file.

Show full path names in "zpool status"

cat > /target/etc/profile.d/grub2_zpool_fix.sh <<-EOF
export ZPOOL_VDEV_NAME_PATH=YES
EOF

[VM] Tell dracut to include the virtio_blk device

cat > /target/etc/dracut.conf.d/fs.conf <<-EOF
filesystems+=" virtio_blk "
EOF

Keep the spaces around virtio_blk!

Don't use zpool.cache

cat > /target/etc/default/zfs <<-EOF
ZPOOL_CACHE="none"
ZPOOL_IMPORT_OPTS="-o cachefile=none"
EOF

Set grub parameters

cat > /target/etc/default/grub <<-EOF
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=Fedora
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT=console
GRUB_DISABLE_RECOVERY=true
GRUB_DISABLE_OS_PROBER=true
GRUB_PRELOAD_MODULES=zfs
GRUB_ENABLE_BLSCFG=false
EOF

We're going to switch to BLS later.

Disable selinux

sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /target/etc/selinux/config

Create a hostid file

chroot /target zgenhostid

Add user+password

chroot /target useradd $USER -c "$NAME" -G wheel
echo "$USER:$PASW" | chpasswd -R /target

Prepare for first boot

systemd-firstboot \
--root=/target \
--locale=C.UTF-8 \
--keymap=us \
--hostname=$POOL \
--setup-machine-id

Create fstab for legacy mountpoints

cat > /target/etc/fstab <<-EOF
UUID=`lsblk -nr ${DEVICE}${PART2} -o UUID` /boot ext4 defaults 0 0
UUID=`lsblk -nr ${DEVICE}${PART1} -o UUID` /boot/efi vfat umask=0077,shortname=winnt 0 2
$POOL/fedora/var/cache   /var/cache  zfs   defaults 0 0
$POOL/fedora/var/lib     /var/lib    zfs   defaults 0 0
$POOL/fedora/var/log     /var/log    zfs   defaults 0 0
$POOL/fedora/var/spool   /var/spool  zfs   defaults 0 0
$POOL/fedora/var/tmp     /var/tmp    zfs   defaults 0 0
$POOL/www                /var/www    zfs   defaults 0 0
$POOL/home               /home       zfs   defaults 0 0
$POOL/root               /root       zfs   defaults 0 0
EOF

Switch to legacy mountpoints

zfs set mountpoint=legacy $POOL/fedora/var
zfs set mountpoint=legacy $POOL/www
zfs set mountpoint=legacy $POOL/home
zfs set mountpoint=legacy $POOL/root

Chroot into the target

zenter /target
mount -a

Prepare for grub2-mkconfig

source /etc/profile.d/grub2_zpool_fix.sh

Running grub2-mkconfig will fail without this definition. It will always be defined after logging into the target, but we're not there yet.

Configure boot loader

grub2-mkconfig -o /etc/grub2-efi.cfg
grub2-switch-to-blscfg

Use import scanning instead of zfs cache:

systemctl disable zfs-import-cache
systemctl enable zfs-import-scan

Collect kernel and zfs version strings

kver=`rpm -q --last kernel | sed '1q' | sed 's/kernel-//' | sed 's/ .*$//'`
zver=`rpm -q zfs | sed 's/zfs-//' | sed 's/\.fc.*$//' | sed 's/-[0-9]//'`

If using the zfs testing repository, strip off the "-rcN" suffix:

zver=`echo $zver | sed 's/-rc[0-9]//'`
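
To see what those sed pipelines do, here they are again as two functions that read the rpm output from stdin. This is just an illustration with invented function names; the sed expressions are the same ones used above.

```shell
#!/bin/bash
# Sketch: reduce "rpm -q --last kernel" output to the newest kernel version.
newest_kernel_version() {
    sed '1q' |              # keep only the first (most recent) line
    sed 's/kernel-//' |     # drop the package name prefix
    sed 's/ .*$//'          # drop the install date that follows the name
}

# Sketch: reduce "rpm -q zfs" output to the version string dkms expects.
zfs_module_version() {
    sed 's/zfs-//' |        # drop the package name prefix
    sed 's/\.fc.*$//' |     # drop the ".fcNN.arch" suffix
    sed 's/-[0-9]//'        # drop the single-digit release number
}
```

For example, a line starting with "kernel-6.10.10-200.fc39.x86_64" becomes "6.10.10-200.fc39.x86_64", and "zfs-2.2.7-1.fc39.x86_64" becomes "2.2.7".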

Build and install zfs modules

dkms install -m zfs -v $zver -k $kver

Add zfs modules to initrd

dracut -fv --kver $kver

Exit the chroot

umount /boot/efi
umount /boot
exit

Export the pool

zpool export $POOL

Boot the target

[RH] Reboot and select the new UEFI disk

It works!

[VM] Disconnect the virtual disk

qemu-nbd --disconnect /dev/nbd0

If you forget to disconnect the nbd device, the virtual machine won't be able to access the virtual disk.

[VM] Create a virtual machine

virt-install \
--name=$POOL \
--os-variant=fedora$VER \
--vcpus=4 \
--memory=32000 \
--boot=uefi \
--disk path=$IMAGE,format=qcow2 \
--import \
--noreboot \
--noautoconsole \
--wait=-1

You only need to do this once. By replacing the disk image file, other configurations can be tested using the same vm.

[VM] Startup

Use the VirtManager GUI or:

virsh start $POOL
virt-viewer $POOL

Additional configuration

Things to do after you've successfully logged in.

Set the timezone

timedatectl set-timezone America/Chicago
timedatectl set-ntp true

Give your system a nice name

hostnamectl set-hostname magoo

Complaints and suggestions

I detest superstitions, gratuitous complications, obscure writing, and bugs. If you get stuck or if your understanding exceeds mine, please share your thoughts. (I like to hear good news too.)

References

Appendix - Deal with updates and upgrades

The cardinal rule when running "dnf update" is to check for the situation where both the kernel and zfs will be updated at the same time. Cancel the update and instead update zfs by itself. Then update the rest and reboot.

dnf update zfs zfs-dkms zfs-dracut
dnf update

If you forget to do this, all is not lost: Run this script to build and install zfs in the new kernel.

If you're rash enough to be booting Fedora on ZFS in a production system, it's almost imperative that you maintain a simple virtual machine in parallel. When you see that updates are available, clone the VM and update that first. If it won't boot, attempt your fixes there. If all else fails, freeze kernel updates on your production system and wait for better times. (See Appendix - Freeze kernel updates.)

Appendix - Pure ZFS systems

With UEFI motherboards, the only way to "ZFS purity" is to put your EFI partition on a separate device, rather than on a partition of a device that also has all or part of a ZFS pool. It's also possible to do away with the ext4 /boot partition by keeping it in a dataset, but this will put you into contention with the "pool features vs grub supported features" typhoon of uncertainty. (See Grub-compatible pool creation.)

A better way, in my opinion, is to use a small SSD with both EFI and boot partitions. The ZFS pool for the rest of the operating system can be assembled from disks without partitions, "whole disks", which most ZFS pundits recommend. This example doesn't follow that advice because it's intended to be a simplified tutorial.

If you still want to have /boot on ZFS, it's necessary to add the grub2 zfs modules to the efi partition:

dnf install grub2-efi-x64-modules
mkdir -p /target/boot/efi/EFI/fedora/x86_64-efi
cp -a /target/usr/lib/grub/x86_64-efi/zfs* /target/boot/efi/EFI/fedora/x86_64-efi

The zfs.mod file in that collection does not support all possible pool features, but it will work if you find a compromise. Currently, the zfs.mod with Fedora 32 will handle a ZFS pool with default "compression=on" settings created using zfs-0.8.4.

Appendix - Fix boot problems

Prevention

By far the best way to fix boot problems is to avoid them by recognizing problematic situations before you reboot after an update.

Before rebooting after an update, check to make sure that a new initramfs was created in the /boot directory. Then check that file to make sure it contains a zfs module:

cd /boot
ls -lrt

The commands above will list the contents of /boot such that the last file listed is the newest. It should be the initramfs file with the current date and most recent kernel version. Example:

...
initramfs-5.13.8-200.fc34.x86_64.img

Now list the contents of the initramfs and check for zfs:

lsinitrd initramfs-5.13.8-200.fc34.x86_64.img | grep zfs.ko

If zfs.ko is present, you are probably good to go for a reboot.

If zfs.ko is not present, run this script to build and install the zfs modules.
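
The check is easy to wrap in a function for use after every update. This is a sketch with a hypothetical name; it assumes lsinitrd lists the module with a path ending in /zfs.ko, possibly followed by a compression suffix such as .xz.

```shell
#!/bin/bash
# Sketch: succeed if an initramfs listing on stdin mentions the zfs module.
# Matches zfs.ko and compressed variants like zfs.ko.xz.
listing_has_zfs() {
    grep -q '/zfs\.ko'
}

# Use after an update (not executed here), with kver set to the new
# kernel version:
#   if lsinitrd "/boot/initramfs-$kver.img" | listing_has_zfs; then
#       echo "zfs module present"
#   else
#       echo "zfs module MISSING - do not reboot yet"
#   fi
```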

Disaster recovery

You reboot and get the Black Screen of Death.

You'll need a thumb drive or other detachable device that has a linux system and ZFS support. Boot the device.

Import the pool

zpool import -f $POOL -o altroot=/target

Chroot into the system

zenter /target
mount -a

Rebuild the zfs modules

dnf reinstall zfs-dkms

If you see errors from dkms, you'll probably have to revert to an earlier kernel and/or version of zfs. Such problems are temporary and rare.

Rebuild the EFI partition

First make sure you're running in chroot (zenter) and that the right /boot/efi partition is mounted:

df -h

Next run:

rm -rf /boot/efi/*
dnf reinstall grub2-efi-x64 shim-x64 fwupdate-efi mactel-boot

Reinstall BLS

Edit /etc/default/grub and disable BLS:

...
GRUB_ENABLE_BLSCFG=false
...

Then run:

grub2-mkconfig -o /etc/grub2-efi.cfg
grub2-switch-to-blscfg

Delete the Abominable Cache File

rm -f /etc/zfs/zpool.cache

This thing has a way of rising from the dead...

Update initrd

kver=`rpm -q --last kernel | sed '1q' | sed 's/kernel-//' | sed 's/ .*$//'`
dracut -fv --kver $kver

After any or all of these interventions, exit with:

umount /boot/efi
umount /boot
exit
zpool export $POOL

Reboot

Learn from others

Visit the ZFS Issue Tracker and see what others discover. If your problem is unique, join up and post a question.

Appendix - A script to build zfs modules

This script builds zfs into the most recently installed kernel, which may not be the running kernel. It also updates initramfs.

#!/bin/bash
# zfsupdate.sh - Build and install zfs modules
# 2020-08-11 

# Exit on error

    set -o errexit
    set -o pipefail
    set -o nounset

# Get version number of newest kernel

    kver=`rpm -q --last kernel \
        | sed '1q' \
        | sed 's/kernel-//' \
        | sed 's/ .*$//'`

# Get version number of newest zfs

    zver=`rpm -q zfs \
        | sed 's/zfs-//' \
        | sed 's/\.fc.*$//' \
        | sed 's/-[0-9]//'`

# Install the new zfs module

    dkms install -m zfs -v $zver -k $kver

# Build initrd

    dracut -fv --kver $kver

# EOF

Appendix - Freeze kernel updates

If you discover that you can't build the zfs modules for a new kernel, you'll have to use your recovery device and revert. (Or use a virtual machine to find out without blowing yourself up.)

Once you've got your system running again, you can "version lock" the kernel packages. This will allow other Fedora updates to proceed, but hold the kernel at the current version:

dnf versionlock add kernel-`uname -r`
dnf versionlock add kernel-core-`uname -r`
dnf versionlock add kernel-devel-`uname -r`
dnf versionlock add kernel-modules-`uname -r`
dnf versionlock add kernel-modules-extra-`uname -r`
dnf versionlock add kernel-headers-`uname -r`

When it's safe to allow kernel updates, you can release all locks using the expression:

dnf versionlock clear

If you have locks on other packages and don't want to clear all of them, you can release only the previous kernel locks:

dnf versionlock delete kernel-`uname -r`
dnf versionlock delete kernel-core-`uname -r`
dnf versionlock delete kernel-devel-`uname -r`
dnf versionlock delete kernel-modules-`uname -r`
dnf versionlock delete kernel-modules-extra-`uname -r`
dnf versionlock delete kernel-headers-`uname -r`
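
Both lists repeat the same six package names, so a loop can generate them. The following is a sketch; kernel_lock_specs is an invented name, and the package list is simply the one used in the commands above.

```shell
#!/bin/bash
# Sketch: print one "package-version" spec per kernel package,
# given a kernel version string such as the output of "uname -r".
kernel_lock_specs() {
    local kver=$1 pkg
    for pkg in kernel kernel-core kernel-devel \
               kernel-modules kernel-modules-extra kernel-headers; do
        printf '%s-%s\n' "$pkg" "$kver"
    done
}

# Use on a live system (not executed here):
#   kernel_lock_specs "$(uname -r)" | xargs -n 1 dnf versionlock add
#   kernel_lock_specs "$(uname -r)" | xargs -n 1 dnf versionlock delete
```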

Appendix - Stuck in the emergency shell

The screen is mostly black with plain text. You see:

[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.

This is the Black Screen Of Dracut.

You'll be invited to run journalctl which will list the whole boot sequence. Near the end, carefully inspect lines that mention ZFS. There are three common cases:

1) Journal entry looks like this:

systemd[1]: Failed to start Import ZFS pools by cache file.

You are a victim of the Abominable Cache File. The fix is easy. Boot your recovery device, enter the target, and follow the section that deals with getting rid of the cache file in Appendix - Fix boot problems.

2) Journal entry looks like this:

...
Starting Import ZFS pools by device scanning...
cannot import 'Magoo': pool was previously in use from another system.

You probably forgot to export the pool after tampering with it from another system. (Such as when you previously used the recovery device.) You can fix the problem from the emergency shell, substituting your pool name for myPool:

zpool import -f myPool -N
zpool export myPool
reboot

3) If you see messages about not being able to load the zfs modules, that may be normal because it takes several tries during the boot sequence. But if the system ends up being unable to load the modules, try this:

modprobe zfs

If that fails, the zfs modules were never built or they were left out of the initramfs. To fix that, go through the entire sequence described in Appendix - Fix boot problems.

If you can execute the modprobe successfully, you should try the next fix:

Appendix - Work-around for a race condition

During boot, it's normal to see a few entries like this in the journal:

dracut-pre-mount[508]: The ZFS modules are not loaded.
dracut-pre-mount[508]: Try running '/sbin/modprobe zfs' as root to load them.

But if the zfs modules aren't loaded by the time dracut wants to mount the root filesystem, the boot will fail. This problem was reported in 2019 as "ZOL 0.8 Not Loading Modules or ZPools on Boot" (#8885). I never saw this until I tried to boot a fast flash drive on a slow computer. Since I knew the flash drive worked on other machines, I was surprised to see The Black Screen Of Dracut.

Here's a fix you can apply when your root-on-zfs device is mounted for repair on /target:

mkdir /target/etc/systemd/system/systemd-udev-settle.service.d
cat > /target/etc/systemd/system/systemd-udev-settle.service.d/override.conf <<-EOF
[Service]
ExecStartPre=/usr/bin/sleep 5
EOF

Appendix - Stuck in grub

A black screen with an enigmatic prompt:

grub>

This is the Dread Prompt Of Grub.

Navigating this little world merits a separate document: Grub Expressions. A nearly-foolproof solution is to run through Appendix - Fix boot problems. Pay particular attention to the step where the entire /boot/efi partition is recreated.

Appendix - Enable swapping

Using a zvol for swapping is problematic (as of 2020-08, zfs 0.8.4). If you feel the urge to try, first read the swap deadlock thread.

Sooner or later, the issues will be fixed. (Maybe now?) Here's how to try it out:

Create a swap volume

A swap device must be a zvol, which is created with -V (the size) rather than by setting volsize with -o:

zfs create -V 4G $POOL/swap \
    -o volblocksize=4k \
    -o compression=zle \
    -o refreservation=4.13G \
    -o primarycache=metadata \
    -o secondarycache=none \
    -o logbias=throughput \
    -o sync=always \
    -o com.sun:auto-snapshot=false

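Add the swap volume to fstab (substitute your pool name for "pool"):

...
/dev/zvol/pool/swap   none  swap   defaults 0 0
...

After you're running the target, write a swap signature to the volume once, then enable swapping:

mkswap /dev/zvol/pool/swap
swapon -av

Because the volume is listed in fstab, swapping will resume after every reboot.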

Don't enable hibernation. It tries to use swap space for the memory image but the dataset is not available early enough in the boot process.