As of ZFS-0.7.1, it's a lot easier to do this deed and I've removed several sections. Users of zfs-0.6.x.y will need to follow an older version of this page.
This article shows how to boot a recent version of Fedora (circa 26) with root on ZFS. To distinguish this article from the crowd, the features include:
I was motivated to collect and publish these methods for several reasons: Most of the root-on-zfs articles are either out of date, don't apply to Fedora, or have a key defect: They may work (or might have worked), but they don't tell you how to solve the inevitable problems that happen when things go wrong as the system is updated or upgraded. Recipes are not enough. Understanding is required.
ZFS has significant advantages because it replaces things like software RAID and LVM with a more flexible and general framework. Here are just a few highlights:
James Earl Jones voice: "If you only knew the POWER of ZFS (wheeze gasp) other solutions would seem weak and ineffective."
If you like to hack away on your operating system, it's very nice to have zfs snapshots. You can take a snapshot and then do any damn thing without a care in the world. By rolling back the snapshot, all is as before. (Occasionally you might have to do the rollback from a rescue disk!) By taking a snapshot before and after installing a complex package group, you can use "zfs diff" to see what changed. It's like having subversion for your entire operating system.
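The snapshot-hack-rollback cycle described above looks roughly like this (an illustrative sketch, not runnable outside a real pool; the dataset name matches the one created later in this article):

```
zfs snapshot pool/ROOT/fedora@before     # save the current state
dnf install some-package-group           # hack away without a care
zfs diff pool/ROOT/fedora@before         # see exactly what changed
zfs rollback pool/ROOT/fedora@before     # never mind -- put it all back
```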
It's also nice to use an automated rolling snapshot process. You can go back in time by hour, day, week, month - whatever you like. This takes care of regrets that take time to discover. And it doesn't take an enormous amount of space.
Backups are easy: two lines to replicate your whole system on a pool at a remote destination. Using incremental replication makes the process much faster than things like rsync.
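For example, a full replication followed by an incremental one might look like this (a sketch only; "backuphost" and "backuppool" are hypothetical names):

```
zfs snapshot -r pool@backup1
zfs send -R pool@backup1 | ssh backuphost zfs receive -Fdu backuppool

# Later, only the changes since backup1 travel over the wire:
zfs snapshot -r pool@backup2
zfs send -R -i pool@backup1 pool@backup2 | ssh backuphost zfs receive -Fdu backuppool
```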
There are countless articles on the web extolling the virtues of ZFS. If this is all new to you, you probably shouldn't attempt a root-on-zfs installation right away.
Running root on ZFS with linux is an experimental setup in most distributions. The Fedora developers don't support ZFS and they occasionally do things (by accident) that break it. You have to be prepared for situations where your system won't boot after updates (rarely) or upgrades (frequently). Hopefully this article will get you through those little disappointments efficiently. But I must emphasize that root-on-zfs is not a great idea for a production server. But the same could be said about Fedora itself...
In the world of ZFS, there has been and still is a great deal of F.U.D. created by licensing concerns. Some of these concerns have abated, but not enough for Redhat or Fedora to embrace the technology. Competing solutions have been created or resurrected such as BTRFS, XFS, and Stratis but it's hard to catch up with what ZFS already achieves. Especially with the reliability that's possible only through years of evolution and deployment in a large community.
Redhat recently deprecated BTRFS after years of spending and tireless promotion. An interesting discussion appears here. Promoting an advanced file system is not so easy.
"It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage than a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who gain by the new ones." - Niccolò Machiavelli
The following description is vastly oversimplified. The Grub2 boot loader is a marvel of complexity that could probably start linux on a soroban. But you only need a superficial understanding to proceed with this tutorial.
At boot time, the UEFI bios on your motherboard looks for a small vfat partition called the EFI or sometimes ESP. This partition contains a binary boot program. The BIOS launches the program, which in this case will be the grub2 boot loader. In the discussion that follows, it's called simply "grub."
Grub executes a plain-text configuration script also found in the EFI partition. This program creates the boot menu the user sees and executes steps to launch a selected operating system.
When a root-on-zfs item is selected, a section of the configuration script will search a device named by UUID for a specified zfs pool and dataset. The dataset must have a filesystem that contains the expected linux boot files, vmlinuz and initramfs.
Grub loads the linux kernel from the vmlinuz file. It also loads and mounts a root filesystem in memory from the initramfs file. This "ramfs" filesystem contains a subset of your complete operating system, including systemd startup scripts and kernel modules to support ZFS.
Systemd proceeds to generate the cacophony displayed on your monitor as the computer boots. One of the early steps switches the root file system from the ramfs to the disk-based root-on-zfs filesystem. When systemd is done, you're allowed to use the computer.
We will create two bootable Fedora systems. One on a rescue device and one on a target device, which will have root-on-zfs. The rescue device will be used to create the target system. It's essential to keep the rescue device available because accidents will happen. I use a fast USB stick and keep it on top of the target computer case.
To simplify the discussion, only one device is used for the ZFS pool. The process of building more complex pools is covered on dozens of ZFS websites. I've written one myself. It is, however, both unusual and unwise to deploy ZFS without some form of redundancy. With that in mind, this article should be treated as a tutorial rather than a practical solution. Proceed at your own risk.
The procedure has been tested using multiple versions of Fedora:
Fedora-Workstation-Live-x86_64-24-1.2.iso ... Fedora-Workstation-Live-x86_64-28-1.1.iso
Configure your BIOS to operate the target disk controller in AHCI mode (the default on most modern motherboards) and boot from the device where you've mounted the installation media. You should see two choices for the installation volume: one will mention UEFI. That's the one you must use. Otherwise, look for settings that specify booting in UEFI mode. If the installer doesn't believe you booted using EFI firmware, it will gum up the rest of the process, so take time to figure this out.
The installer is a bootable Fedora installation that has the same GPT, UEFI, and partition structure as our target zfs system.
Because we're going to copy the root of the installer to the target, it makes sense to configure the installer to be as similar to the target as possible. After we do the copy, the target is almost ready to boot. A nice thing about UEFI on GPT disks is that no "funny business" is written to boot blocks or other hidden locations. Everything is in regular files.
We're going to make the target system with GPT partitions. The installer needs to have the same structure, but Anaconda will not cooperate if the installer is a small disk. To force Anaconda to create GPT partitions, we must add an option.
Boot the installation media. On the menu screen, select
Start Fedora-workstation-live xx
Press "e" key to edit the boot line. It will look something like this:
vmlinuz ... quiet
At the end of the line add the string "inst.gpt" so it looks like this:
vmlinuz ... quiet inst.gpt
IMPORTANT: If you see an instruction to use the tab key instead of the "e" key to edit boot options, you have somehow failed to boot in EFI mode. Go back to your BIOS and try again.
Proceed with the installation until you get to the partitioning screen. Here, you must take control and specify standard (not LVM) partitioning.
1) A 200M partition mounted at /boot/efi
2) A partition that fills the rest of the disk mounted at root "/".
Anaconda will recognize that you want a UEFI system. Press Done to proceed. You'll have to do it twice because you must be harassed for not creating a swap partition.
The rest of the installation should proceed normally. Reboot the new system and open a terminal session. Elevate yourself to superuser:
Then update all the packages:
Everything is supposed to work with SELinux. But I'm sorry to report that everything doesn't: Fixes are still being done frequently as of 2016. Unless you understand SELinux thoroughly and know how to fix problems, it's best to turn it off. In another year, this advice could change.
Inside, disable SELinux:
Save and exit. Then reboot:
shutdown -r now
rpm -qa | grep grub2-efi
rpm -qa | grep shim
These packages should already be installed. If not, you somehow failed to install a UEFI system.
dnf install grub2-efi-modules
This pulls in the zfs module needed by grub2 at boot time.
The following scripts will save a lot of typing and help avoid mistakes. When and how they are used will be explained later. For now, consider the descriptions an overview of the process.
Create a text file "zmogrify" in /usr/local/sbin (or somewhere on the PATH) that contains:
#!/bin/bash
# zmogrify - Run dracut and configure grub to boot with root on zfs.
kver=$1
sh -x /usr/bin/dracut -fv --kver $kver
mount /boot/efi
grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
grubby --set-default-index 0
mkdir -p /boot/efi/EFI/fedora/x86_64-efi
cp -a /usr/lib/grub/x86_64-efi/zfs* /boot/efi/EFI/fedora/x86_64-efi
umount /boot/efi
Save the file and make it executable:
chmod a+x zmogrify
The first step runs dracut to produce a new /boot/initramfs file that includes zfs support. Dracut is able to do this because of the zfs-dracut package we added, which installs a "plug-in."
The peculiar way of running dracut with "sh -x" overcomes a totally mysterious problem with Fedora 25 that makes dracut hang otherwise. Dracut will hang after displaying the line:

*** Including module: kernel-modules ***
If you understand why, I'd very much like to hear from you.
The grub2-mkconfig command creates a new grub.cfg script in the EFI boot partition.
The grubby command makes sure your new kernel is the one that boots by default.
Finally, we create a directory in the EFI partition and copy the boot-time version of the zfs module needed by grub2 to mount your zfs root file system.
Create a text file "zenter" in /usr/local/sbin (or somewhere on the PATH) that contains:
#!/bin/bash
# zenter - Mount system directories and enter a chroot
target=$1
mount -t proc proc $target/proc
mount -t sysfs sys $target/sys
mount -o bind /dev $target/dev
mount -o bind /dev/pts $target/dev/pts
chroot $target /bin/env -i \
    HOME=/root TERM="$TERM" PS1='[\u@chroot \W]\$ ' \
    PATH=/bin:/usr/bin:/sbin:/usr/sbin \
    /bin/bash --login
echo "Exiting chroot environment..."
umount $target/dev/pts
umount $target/dev/
umount $target/sys/
umount $target/proc/
Save the file and make it executable:
chmod a+x zenter
The chroot command makes (temporarily) a specified path the root of the linux file system as far as shell commands are concerned. But we need more than just the file system root: the extra mount commands make the linux system devices available inside the new root. Dracut expects those to be in place when building a boot-time ramfs.
The zenter script puts you in a nested shell session. A new command prompt "chroot>" reminds you that you're in a special place. When you type "exit", the script exits the chroot environment and returns to the original shell.
Add the line:
This will come into play later when we create a grub configuration file on the efi partition.
Also add the line:
Without this fix, running grub2-mkconfig (which we'll need later) will throw an error and possibly disrupt other scripts. This is caused by a problem that has nothing to do with ZFS and will probably be fixed Real Soon Now.
Later, we will need to run a script called grub2-mkconfig. This script uses the "zpool status" command to list the components of a pool. But when these components aren't simple device names, the script fails. Many fixes have been proposed, mostly involving new udev rules. A simple fix is to add a system environment variable. Create this file:
This will take effect the next time you login or reboot. Which will be soon...
ln -s /usr/local/sbin/zmogrify /etc/kernel/postinst.d
The scripts in the postinst.d directory run after every kernel update. The zmogrify script takes care of updating the initramfs and the grub.cfg file on the EFI partition. (Stuff you need to boot with root in a zfs pool.)
Using your browser, visit ZFS on Fedora.
Click on the link for your version of Fedora. When prompted, open it with "Software Install" and press the install button. This will add the repository zfs-release to your package database. It tells the package manager where to find zfs packages and updates.
dnf install kernel-devel
dnf install zfs zfs-dracut
During the installation, the process will pause to build the spl module and then the zfs module. A line will appear showing the path to their locations /lib/modules/x.y.z/extras. If you don't see those lines, you don't have a match between the kernel-devel package you installed and your currently running kernel.
If you get a message that zfs isn't present, the modules failed to build during installation. Try rebooting.
For some reason, (2016-10) they don't get installed this way, so execute:
systemctl preset zfs-import-cache
systemctl preset zfs-mount
systemctl preset zfs-share
systemctl preset zfs-zed
I think this step was needed after upgrading from Fedora 24 to 25 with an earlier version of ZFS. When starting over on a new machine, I'd skip this step.
ZFS uses a cache file to keep track of pools and their associated devices. It speeds up the boot process, especially when you have complex pools.
When booting from ZFS, the cache can be a headache because you have to rebuild the ramdisk (initrd) using dracut each time you change your pool structure. (If this doesn't make sense to you, just proceed with the advice that follows.)
For that reason, I suggest turning off the cache.
At boot time, by default, systemd runs a script:
This loads the previous pool configuration stored in the cache file:
But you can use a different service that scans all devices for pools. Proceed as follows:
systemctl disable zfs-import-cache
systemctl enable zfs-import-scan
Now tell zfs not to make a new cache file and delete the old one.
zpool set cachefile=none pool
rm /etc/zfs/zpool.cache
It's very important to recreate your initramfs now because otherwise it will contain a zpool.cache that's no longer up to date:
dracut -fv --kver `uname -r`
If you decide to revert to using a cache:
systemctl disable zfs-import-scan
systemctl enable zfs-import-cache
zpool set cachefile=/etc/zfs/zpool.cache pool
Setting the cachefile path recreates the file immediately.
Note: Setting the cachefile property value has side effects as described above. But you can't see the cachefile path by "getting" the value. This seems to be a bug.
If your target is a USB disk, you simply need to plug it in. Tail the log and note the new disk identifier:
If you have a "real" disk, first look at your mounted partitions and take note of the names of your active disks. If you're using UUIDs, make a detailed list:
The shortcuts will point to the /dev/sdn devices you're using now. Now you can shutdown, install the new disk, and reboot. You should find a new disk when you take a listing:
From now on, we'll assume you've found your new disk named "sdx".
If you got your target disk out of a junk drawer, the safest way to proceed is to zero the whole drive. This will get rid of ZFS and any RAID labels that might cause trouble. Bring up a terminal window, enter superuser mode, and run:
dd if=/dev/zero of=/dev/sdx bs=10M status=progress
Zeroing takes a while for large disks. If you know which partition on the old disk has a zfs pool, for example sdxn, you can speed things up by importing the pool and destroying it. But if the partition is already corrupted so the pool won't import properly, you can blast it by zeroing the first and last megabyte:
mysize=`blockdev --getsz /dev/sdxn`
dd if=/dev/zero of=/dev/sdxn bs=512 count=2048
dd if=/dev/zero of=/dev/sdxn bs=512 count=2048 seek=$((mysize - 2048))
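If you'd like to convince yourself that the arithmetic above does what it claims, here is the same first-and-last-megabyte zeroing demonstrated on a throwaway scratch file instead of a real partition (the file simply stands in for /dev/sdxn):

```shell
# Demo on a scratch file standing in for /dev/sdxn.
img=$(mktemp)
# Build a 10 MiB "partition" filled with 0xff bytes.
dd if=/dev/zero bs=1M count=10 2>/dev/null | tr '\0' '\377' > "$img"
# Same arithmetic as above: size in 512-byte sectors.
mysize=$(( $(stat -c %s "$img") / 512 ))
# Zero the first and last megabyte (2048 sectors x 512 bytes = 1 MiB).
dd if=/dev/zero of="$img" conv=notrunc bs=512 count=2048 2>/dev/null
dd if=/dev/zero of="$img" conv=notrunc bs=512 count=2048 seek=$((mysize - 2048)) 2>/dev/null
# The first MiB now reads back as all zeros; the middle is untouched.
head -c 1048576 "$img" | tr -d '\0' | wc -c    # prints 0
rm -f "$img"
```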
Alternatively, there is the command:
zpool labelclear /dev/sdx
Which can be forced if necessary:
zpool labelclear -f /dev/sdx
I'm too lazy to create a step-by-step gdisk walk-through. If you need such a thing, you probably don't belong here. Run gdisk, parted, or whatever tool you prefer on /dev/sdx. Erase any existing partitions and create two new ones:
Partition  Code  Size            File system  Purpose
1          EF00  200 MiB         EFI System   EFI boot partition
2          BF01  (rest of disk)  ZFS          ZFS file system
Write the new table to disk and exit. Tell the kernel about the new layout:
If you are prompted to reboot, do so. You now have two new devices:
/dev/sdx1 - For the target EFI partition
/dev/sdx2 - For the target ZFS file system
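If you prefer to script the partitioning rather than drive gdisk interactively, a sgdisk equivalent of the table above might look like this (an untested sketch; substitute your real device for /dev/sdx):

```
sgdisk --zap-all /dev/sdx
sgdisk -n1:0:+200M -t1:EF00 -c1:"EFI System" /dev/sdx
sgdisk -n2:0:0     -t2:BF01 -c2:"ZFS"       /dev/sdx
```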
mkfs.fat -F32 /dev/sdx1
zpool create -o ashift=12 -m none pool /dev/sdx2
By default, zfs will mount the pool and all descendant datasets automatically. We turn that feature off using "-m none". The ashift property specifies the physical sector size of the disk. That turns out to be a Big Ugly Deal, but you don't need to be concerned about it in a tutorial. I've added a section at the end of the article about sector sizes. You need to know this stuff if you're building a production system.
zfs set compression=on pool zfs set atime=off pool
Compression improves zfs performance unless your dataset contains mostly already-compressed files. Here we're setting the default value that will be used when creating descendant datasets. If you create a dataset for your music and movies, you can turn it off just for that dataset.
The atime property controls file access time tracking. This hurts performance and most applications don't need it. But it's on by default.
zfs create -p pool/ROOT/fedora
zfs set xattr=sa pool/ROOT/fedora
This has the side effect of making ZFS significantly faster.
If you think you might continue to use this one-device zfs pool, you can slightly improve your odds of survival by using the copies property: This creates extra copies of every block so you can recover from a "bit fatigue" event. A value of 2 will double the space required for your files:
zfs set copies=2 pool/ROOT/fedora
I use this property when booting off USB sticks because I don't trust the little devils.
It would be nice to find two new symbolic links for the partitions we've created on the target disk in here:
The symbolic link to the EFI partition is always there pointing to /dev/sdx1, but the one for the ZFS partition pointing to /dev/sdx2 is not. I've tried re-running the rules:
But the link for /dev/sdx2 still isn't there. Regrettably, you have to reboot now. When you're back up and running, proceed...
zpool export pool
zpool import -d /dev/disk/by-uuid -o altroot=/sysroot pool
At this point don't panic if you look for /sysroot: It's not there because there are no specified mount points yet.
You will also notice the clause "-d /dev/disk/by-uuid". This will rename the disk(s) so their UUIDs appear when you execute zpool status. You could also use -d /dev/disk/by-id - details about this are covered later.
Note: It's possible to arrive at a state where a newly formatted disk will have an entry in some-but-not-all of the /dev/disk subdirectories. I'm not sure what causes this annoyance, but if the command to import the pool using UUIDs fails, try using /dev/disk/by-id instead. The important thing is to get away from using device names.
zfs set mountpoint=/ pool/ROOT/fedora
Because we imported the pool with the altroot option, /sysroot will now be mounted.
We use rsync because it provides some fine control over how things are copied. The options used come from an article by Rudd-O. This is what the options do:
a: Archive
v: Verbose
x: Only local filesystems
H: Preserve hard links
A: Preserve ACLs
S: Handle sparse files efficiently
X: Preserve extended attributes
rsync -avxHASX / /sysroot/
Don't get careless and leave out the trailing "/" character.
You'll see some errors near the end about "journal". That's ok.
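If you want to see some of these flags in action before trusting them with a whole root filesystem, here's a small throwaway demonstration (assuming rsync is installed; the temp directories are scratch paths):

```shell
# Throwaway demo of the rsync flags: hard links (-H) survive the copy.
src=$(mktemp -d); dst=$(mktemp -d)
echo hello > "$src/file"
ln "$src/file" "$src/hardlink"
rsync -axHASX "$src/" "$dst/"          # note the trailing "/" on the source
# Both names in the copy share one inode, so -H did its job.
[ "$(stat -c %i "$dst/file")" = "$(stat -c %i "$dst/hardlink")" ] && echo "hard links preserved"
rm -rf "$src" "$dst"
```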
Recall that "/dev/sdx1" is the device name of the target EFI partition:
cd
mkdir here
mount /dev/sdx1 here
cp -a /boot/efi/* here
umount here
rmdir here
In my case, it was "0FEE-5943"
UUID=0FEE-5943 /boot/efi vfat umask=0077,shortname=winnt 0 2
Even that line is unnecessary for booting the system, but utilities such as grub2-mkconfig expect the efi partition to be mounted.
(Obviously, use your EFI partition's UUID.)
Fedora 28 adds optional support for something called "Boot Loader Specification" (BLS). They did this by patching the grub2-tools package in a way that conflicts with booting ZFS when root is on a multi-device pool.
The example shown in this tutorial uses a single device, but if you plan to deploy a "real" root-on-zfs system, you'll need this work-around.
zmogrify `uname -r`
This expression assumes that the kernel you're running is the same as the kernel for the target system. Otherwise, you must supply the target kernel name.
zpool export pool
If you're the adventurous type, simply reboot. Otherwise, first skip down to Checklist after update and before reboot and run through the tests. Then reboot.
If all is well, you'll be running on zfs. To find out, run:
mount | grep zfs
You should see your pool/ROOT/fedora mounted as root. If it doesn't work, please try the procedure outlined below in Recovering from boot.
Before disaster strikes:
zfs snapshot pool/ROOT/fedora@firstBoot
We are finished. Now for the ugly bits!
When we created the pool, we used the old-style "sdx" device names. They are short, easy to type and remember. But they have a big drawback. Device names are associated with the ports on your motherboard, or sometimes just the order in which the hardware was detected. It would really be better to call them "dynamic connection names." If you removed all your disks and reconnected them to different ports, the mapping from device names to drives would change.
You might think that would play havoc when you try to re-import the pool. Actually, ZOL (ZFS on Linux) protects you by automatically switching the device names to UUIDs when device names in the pool conflict with active names in your system.
Linux provides several ways to name disks. The most useful of these are IDs and UUIDs. Both are long complex strings impossible to remember or type. I prefer to use IDs because they include the serial number of the drive. That number is also printed on the paper label. If you get a report that a disk is bad, you can find it by reading the label.
First, boot into your installer linux system.
Here's how to use IDs:
zpool import -o altroot=/sysroot -d /dev/disk/by-id pool
If you prefer UUIDs:
zpool import -o altroot=/sysroot -d /dev/disk/by-uuid pool
Should you ever want to switch back to device names:
zpool import -o altroot=/sysroot -d /dev pool
To switch between any two, first export and then import as shown above. The next time you export the pool or shut down, the new names will be preserved in the disk data structures.
The problem: ZFS will run a lot faster if you correctly specify the physical sector size of the disks used to build the pool. For this optimization to be effective, all the disks must have the same sector size. Discovering the physical sector size is difficult because disks lie. The history of this conundrum is too involved to go into here.

The short answer: There is no way to be sure about physical sector sizes unless you find authoritative information from the maker. Guessing too large wastes space but ensures performance. Guessing too small harms performance.
Here are some heuristic steps:
First, ask the disk for its sector size:
lsblk -o NAME,PHY-SEC /dev/sdx
If it reports any value except 512, you can believe the answer. If it shows 512, you may be seeing the logical sector size.
When creating a pool, the sector size is specified as a power of 2. The default is 9 (512 bytes).
A table of likely values:
 n    2^n
------------
 9     512
10    1024
11    2048
12    4096
13    8192
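The table above is just powers of two, which you can check with shell arithmetic:

```shell
# ashift is log2 of the sector size: bytes = 2^ashift.
for ashift in 9 10 11 12 13; do
    printf 'ashift=%-2d -> %5d-byte sectors\n' "$ashift" $(( 1 << ashift ))
done
# A 4096-byte (4K) physical sector therefore calls for ashift=12.
```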
Specify sector size using the ashift parameter:
zpool create -o ashift=13 -m none pool /dev/sdx2
To be effective this should be done before the pool is used. If you decide to change it later, it's best to take a backup, re-create the pool and restore the backup.
This is a widely-cited example and I use it on my systems:
zfs create -V 4G \
    -o volblocksize=$(getconf PAGESIZE) \
    -o compression=zle \
    -o logbias=throughput \
    -o sync=always \
    -o primarycache=metadata \
    -o secondarycache=none \
    -o com.sun:auto-snapshot=false \
    pool/swap
Some notes about the properties:
The swap dataset isn't part of the linux root filesystem. That allows us to share swap space with other bootable datasets.
-V 4G means this will be a 4G ZVOL - a fixed-size, non-zfs volume allocated from the pool.
A new device will appear as:
The reason for turning off caching is that the operating system will manage swap like a cache anyway.
The same reasoning (and cache settings) are used for ZVOLs created for virtual machine disk images.
The compression algorithm is changed from the default (lz4) to something very fast and simple: zero-length encoding.
After you're running with root on ZFS, complete swap setup by adding a line to
/dev/zvol/pool/swap none swap defaults 0 0
Format the swap volume:
mkswap -f /dev/zvol/pool/swap
And enable swapping:
More complex dataset trees are possible and common. Some reasons:
You might want to boot a previously installed operating system, but keep user data, mail, webpage, etc. current. The ROOT datasets are separate:
pool/fedora24
pool/fedora25
pool/fedora26
...
But other datasets such as /home, /spool, /var/www will be retained no matter which operating system root is selected.
WARNING: It's known that Fedora doesn't like to see /usr mounted on a separate dataset. It has to be part of the root file system or you won't be able to boot.
The following section shows the datasets, mountpoints and properties configured on my internet-facing server. It looks complicated, but there's a simple idea at work here: I want to rollback changes to the root file system, but not disturb:
This example uses some legacy mountpoints. It's nice to do everything with zfs when possible, but some mountpoints must be processed early. Also note that you can't use the sharesmb property on datasets with legacy mountpoints.
UUID=693A-C0B1         /boot/efi   vfat  umask=0077,shortname=winnt  0 2
/dev/zvol/pool/swap    none        swap  defaults                    0 0
pool/fedora/var/cache  /var/cache  zfs   defaults                    0 0
pool/fedora/var/lib    /var/lib    zfs   defaults                    0 0
pool/fedora/var/log    /var/log    zfs   defaults                    0 0
pool/fedora/var/spool  /var/spool  zfs   defaults                    0 0
pool/fedora/var/tmp    /var/tmp    zfs   defaults                    0 0
Note that the last column, "pass", is always zero for zfs volumes. Linux fsck can't deal with zfs.
pool  cachefile    none
pool  ashift       12
pool  mountpoint   none
pool  compression  on
pool  atime        off
pool/fedora            mountpoint             /
pool/fedora            xattr                  sa
pool/fedora/var        exec                   off
pool/fedora/var        setuid                 off
pool/fedora/var        canmount               off
pool/fedora/var        mountpoint             legacy
pool/fedora/var/cache  com.sun:auto-snapshot  false
pool/fedora/var/tmp    com.sun:auto-snapshot  false
pool/fedora/var/tmp    exec                   on
pool/fedora/var/log    (inherit from var)
pool/fedora/var/spool  (inherit from var)
pool/fedora/var/lib    (inherit from var)
pool/www               mountpoint             /var/www
pool/www               exec                   off
pool/www               setuid                 off
pool/root              mountpoint             /root
pool/home              mountpoint             /home
pool/home              setuid                 off
pool/swap              volsize                4G
pool/swap              compression            zle
pool/swap              refreservation         4.13G
pool/swap              primarycache           metadata
pool/swap              secondarycache         none
pool/swap              logbias                throughput
pool/swap              sync                   always
pool/swap              com.sun:auto-snapshot  false
pool/fedora/var is created just to provide inheritable properties. If it worries you, "canmount" is not an inheritable property and this example illustrates why that's a good idea.

The /var directory and all its subdirectories except those mounted from other zfs datasets are part of the root dataset, pool/fedora. This allows you to rollback pool/fedora while preserving state information in some-but-not-all of the /var subdirectories.
/var/lib is separate, so rolling back the root will not undo package installations. That's better handled by "dnf history rollback ...".
My usual drill when I want to try something risky is to snapshot the root:
zfs snapshot pool/fedora@mark
Now I can proceed. Perhaps this pattern is familiar to you?
Install packages. Modify configuration files. Test. Install and configure more stuff. Try alternatives. Nothing seems to work out... Get discouraged. Decide to give up.
Now for the recovery:
dnf history rollback (to whatever I installed first)
zfs rollback pool/fedora@mark
Now you can try again or just forget the whole thing:
zfs destroy pool/fedora@mark
If you started with a root-on-zfs system using only the EFI and root partitions, you can convert it to a more elaborate arrangement using your rescue system.
For each path "X" you'd prefer to have in a dataset:
mv X oldX
mkdir X
mount X
rsync -axHASX oldX/ X
rm -rf oldX
The fearless may conserve space using:
rsync -axHASX --remove-source-files oldX/ X
In the example above, compression is enabled for the entire pool. The default compression algorithm is lzjb or lz4. This is determined by the value of a feature flag, which you can inspect using:
zpool get feature@lz4_compress pool
The default value for this feature when you create a new pool is "active", so lz4 is effectively the default and that is the recommended algorithm for datasets containing filesystems.
Immediately after using zenter, you will normally want to execute:

mount -a
This will mount all the datasets listed in fstab. Many commands will need these directories to work properly. Worse, if you forget, they will put something into one or more mount point directories, which will prevent them from mounting normally when you try to boot.
When it's time to exit the chroot, move up to the root and unmount everything:
cd /
umount -a
Although not covered in this article, there are many application-level programs to help you create and manage automatic rolling snapshots. These are snapshots taken at regular intervals (hourly, daily, etc.) By convention, the user property "com.sun:auto-snapshot" is used to turn this feature on or off. With the software I use, the default is to include every dataset where com.sun:auto-snapshot isn't explicitly false.
ZFS wants a lot of memory for its ARC cache. The recommended minimum is 1G per terabyte of storage in your pool(s). Without constraints, ZFS is likely to run off with all your memory and sell it at a pawn shop. To prevent this, you should specify a memory limit. This is done using a module parameter. A reasonable setting is half your total memory:
Edit or create:
This expression (for example) limits the ARC to 16G:
options zfs zfs_arc_max=17179869184
The size is in bytes; round binary sizes are customary:

16 GB  = 17179869184
 8 GB  =  8589934592
 4 GB  =  4294967296
 2 GB  =  2147483648
 1 GB  =  1073741824
512 MB =   536870912
256 MB =   268435456
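Rather than copying from a table, you can compute the value with shell arithmetic (here assuming a 32 GiB machine, so the limit is 16 GiB):

```shell
# Compute zfs_arc_max for a given limit in GiB and emit the option line.
gib=16
echo "options zfs zfs_arc_max=$(( gib * 1024 * 1024 * 1024 ))"
# prints: options zfs zfs_arc_max=17179869184
```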
The modprobe.d parameter files are needed at boot time, so it's important to rebuild your initramfs after adding or changing this parameter:
dracut -fv --kver `uname -r`
And then reboot.
To operate with ultimate safety and win the approval of ZFS zealots, you really ought to use ECC memory. ECC memory is typically only available on "server class" motherboards with Intel Xeon processors.
If you don't use ECC memory, you are taking a risk. Just how big a risk is the subject of considerable controversy and beyond the scope of these notes.
First, give yourself a fighting chance by testing your memory. Obtain a copy of this utility and run a 24 hour test:
This is particularly important on a new server because memory with defects is often purchased with those defects. If your memory passes the test, there is a good chance it will be reliable for some time.
To avoid ECC, you might be tempted to keep your server in a 1-meter thick lead vault buried 45 miles underground. It turns out that many of the radioactive sources for memory-damaging particles are already in the ceramics used to package integrated circuits. So this time-consuming measure is probably not worth the cost or effort.
Instead, we're going to enable the unsupported ZFS_DEBUG_MODIFY flag. This will mitigate, but not eliminate, the risk of using ordinary memory. It's supposed to make the zfs software do extra checksums on buffers before they're written. Or something like that. Information is scarce because people who operate without ECC memory are usually dragged down to the Bad Place.
Add this line to your modprobe.d parameter file:
options zfs zfs_flags=0x10
As previously discussed, you need to rebuild initramfs after changing this parameter:
dracut -fv --kver `uname -r`
And then reboot. You can confirm the current value by reading /sys/module/zfs/parameters/zfs_flags.
This idea might be interesting to those who want to build a new zfs-friendly Linux distribution.
You'll recall the step where we copied zfs.mod to the EFI partition. This is necessary because zfs is not one of the modules built into Fedora's grub2-efi rpm. Other Linux distributions support zfs without this hack so I decided to find out how it was done.
You can see the list of built-in modules by downloading the grub2-efi source and reading the spec file. It would be straightforward but perhaps a bit tedious to add "zfs" to the list of built-in modules, remake the rpm and install it over the old one. An easier way is to rebuild one file found here:
The trick is to get hold of the original list of modules. They need to be listed as one giant string without the ".mod" suffix. Here's the list from the current version of grub2.spec with "zfs" already appended:
all_video boot btrfs cat chain configfile echo efifwsetup efinet ext2 fat font gfxmenu gfxterm gzio halt hfsplus iso9660 jpeg loadenv loopback lvm mdraid09 mdraid1x minicmd normal part_apple part_msdos part_gpt password_pbkdf2 png reboot search search_fs_uuid search_fs_file search_label serial sleep syslinuxcfg test tftp video xfs zfs
Copy the text block above to a temporary file "grub2_modules". Then execute:
grub2-mkimage \
    -O x86_64-efi \
    -p /boot/efi/EFI/fedora \
    -o /boot/efi/EFI/fedora/grubx64.efi \
    `xargs < grub2_modules`
The grub2.spec script does something like this when building the binary rpm.
Now you can delete the directory where we added zfs.mod:
rm -rf /boot/efi/EFI/fedora/x86_64-efi
Better? A matter of taste, I suppose. Rebuilding grubx64.efi is dangerous because it could be replaced by a Fedora update. I prefer using the x86_64-efi directory.
The package grub2-efi-modules installs a large module collection in /usr/lib/grub/x86_64-efi.
The total size of all the .mod files there is about 3MB, so you might wonder why we don't just include all of them. For reasons I don't have the patience to discover, at least two of them conflict and derail the boot process.
Here's a list of steps you can take that help prevent boot problems. It won't hurt to do this after any update that includes a new kernel or version of zfs.
After an update that includes a new kernel, and before you reboot, get the new kernel version number by listing /boot:
To save typing, I'll assume your new kernel version is in the variable kver.
Create it like this (for example)
You can see what version of kernel you're running using uname -r. If this matches the newest entry in /boot, you can use the expression uname -r instead of defining and using the kver variable.
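For example, you could define kver by picking the newest vmlinuz in /boot. This is just one hedged way to do it; the version string in the comment is hypothetical:

```shell
# Set kver to the newest kernel version installed in /boot,
# e.g. kver=4.16.13-300.fc28.x86_64 (your version will differ).
kver=$(ls /boot/vmlinuz-* | sed 's|.*/vmlinuz-||' | sort -V | tail -n 1)
echo "$kver"
```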
Find out what version of zfs you're about to install. It could be newer than what you're running:
rpm -q zfs
Create the variable "zver":
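One hedged way to set it is to parse the rpm -q output, dropping everything from the first hyphen after the package name (the catchall section below notes the same trimming rule):

```shell
# Set zver from the installed zfs package, keeping only the upstream
# version, e.g. 0.7.1 from zfs-0.7.1-1.fc26.x86_64.
zver=$(rpm -q zfs | sed -e 's/^zfs-//' -e 's/-.*//')
echo "$zver"
```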
First, check to see if the modules for zfs and spl are present in /lib/modules/$kver/extra.
If they aren't there, build them now:
dkms install -m spl -v $zver -k $kver
dkms install -m zfs -v $zver -k $kver
Next, check to see if there is an initramfs file for your new kernel:
If not, you need to run dracut:
dracut -fv --kver $kver
Next, check to see if the zfs module is inside the new initramfs:
lsinitrd -k $kver | grep "zfs.ko"
If it's not, run dracut:
dracut -fv --kver $kver
Check to see if your new kernel is mentioned in grub.cfg:
grep $kver /boot/efi/EFI/fedora/grub.cfg
If it's not, you need to build a new configuration file:
grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
And make sure your new kernel is the default:

grubby --default-index

The command should return 0 (zero). If it doesn't:
grubby --set-default-index 0
If your system passes all these tests, it will probably boot.
Don't panic. It happens to me all the time and I've never lost anything. Lest you feel discouraged by this remark, please know that I get into trouble frequently because I'm constantly messing with the system so I can publish lengthy screeds like this one.
ZFS on Fedora fails to boot for several reasons
Common boot problems after a linux or zfs update:
You can usually avoid these horrors by doing the recommended pre-reboot checks. But accidents happen...
The screen is all black with confusing white text. You see a "#" prompt at the bottom of the screen. This is the emergency shell. Some-but-not-all shell commands work. You can use zfs and zpool commands.
The most common message is:
"cannot import 'mypool': pool was previously in use from another system"
The fix is easy. First import your pool somewhere else to avoid conflicts and supply the "-f" option to force the issue:
zpool import -f mypool -o altroot=/sysroot
Now export the pool:
zpool export mypool
You may see a message that suggests you run journalctl. Do so and take note of the messages near the end. A common message is that ZFS can't import the pool using the cache file:
zfs-import-cache.service: Main process exited, ... Failed to start import ZFS pools by cache file. zfs-import-cache.service: Unit entered failed state. ...
If you followed my suggestion and disabled the cache, this shouldn't happen to you. But if you do see complaints about the cache file, it means that it's still enabled in your initramfs through some other accident.
1) Check that /etc/zfs/zpool.cache does not exist
2) Make sure zfs-import-cache is disabled:
systemctl status zfs-import-cache

This should be inactive.
3) Make sure zfs-import-scan is enabled:

systemctl status zfs-import-scan

This should be active.
4) Rebuild the initrd
dracut -fv --kver `uname -r`
Note that you can't use uname -r unless you're actually running on the same kernel you're trying to fix. If you're doing a rescue inside a chroot, list the /boot directory and use the version of the kernel you're trying to boot. Example:
dracut -fv --kver 4.16.13-300.fc28.x86_64
If, on the other hand, you know what you're doing and want to use the cache file, you'll have to make a new one:
Boot from your rescue system and go into the target as usual:
zpool import mypool -o altroot=/mypool
zenter /mypool
Now refresh the cache file:
zpool set cachefile=/etc/zfs/zpool.cache mypool
Exit and reboot
For reasons I don't know how to fix, doing a combined update will fail to run the zmogrify script, and frequently the spl module will be built by dkms, but not the zfs module. These be Mysteries.
Boot to your previous working kernel (or failing that, use your recovery device.) Follow the steps to confirm everything as noted in the catchall section.
In addition, before running dracut, confirm that you don't have a cache file:
And that the systemd scripts are configured correctly:
systemctl enable zfs-import-scan
systemctl disable zfs-import-cache
After that, you should be good to go.
This is a cure-all procedure that almost always fixes trouble caused by updates or upgrades. The most common problems are fixed automatically by the zmogrify script we installed. But it doesn't rebuild kernel modules because that process is too time consuming and because the spl and zfs packages are supposed to do this automatically when the kernel is updated. (Unfortunately, that mechanism doesn't work when you upgrade between Fedora versions.)
Begin by booting your recovery device. If you don't have one, go to the beginning of this document, follow the procedures and return here when ready.
The procedure outlined here is similar to the pre-reboot checklist above, except that now we're operating from a chroot environment.
Display a list of zfs pools
One of these will be the one that won't boot. To distinguish this from our other examples, we will use the pool name "cool". Import the problematic pool:
zpool import -f cool -o altroot=/cool
Enter the chroot environment:

zenter /cool
Note the new kernel version
ls -lrt /boot
Assign it to a variable
Note the zfs version
rpm -q zfs
Assign it to a variable
Note that we omit parts of the zfs version string including and after any hyphen.
Check to see if you have zfs modules
ls /lib/modules/$kver/extra
If not, build them now
dkms install -m spl -v $zver -k $kver
dkms install -m zfs -v $zver -k $kver
If, as suggested previously, you have configured systemd to use zfs-import-scan.service, make sure that a cache file doesn't exist:
And for good measure
systemctl disable zfs-import-cache
systemctl enable zfs-import-scan
This precaution is also necessary because zfs updates will occasionally (and silently) switch you back to using zfs-import-cache service.
Add the new modules to initramfs
dracut -fv --kver $kver
You may notice a complaint from dracut about having no functioning syslog:
dracut: No '/dev/log' or 'logger' included for syslog logging
This happens because there are "dangling" symbolic links in your chroot. The message is safe to ignore.
Mount the boot partition:

mount /boot/efi
Update the grub.cfg file
grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
Make sure the new kernel is the default
grubby --set-default-index 0
Update the grub2 boot-time zfs module
cp -a /usr/lib/grub/x86_64-efi/zfs.mod /boot/efi/EFI/fedora/x86_64-efi
Unmount the EFI partition and exit the chroot:

umount /boot/efi
exit
Export the pool
zpool export cool
Shutdown, disconnect your recovery device, and try rebooting.
The screen is entirely black except for the text "grub>"
You're talking to the grub boot loader. It understands a universe of commands you fortunately won't need to learn. If you're curious, enter "help", but that won't be necessary and it might frighten the horses.
First load the zfs module:
grub> insmod zfs
If that doesn't work, you didn't put the zfs grub modules in the directory:
Go back to the catchall procedure and pay attention.
If grub doesn't complain, try listing the top level devices:
grub> ls
(hd0) (hd1) (hd1,gpt2) (hd1,gpt1) (hd2) (hd2,gpt2) (hd2,gpt1)
In this example, the devices hd1 and hd2 were used to create a ZFS mirror. Each of them is formatted GPT with an EFI partition gpt1 and a ZFS partition gpt2.
You can see the partition types:
grub> ls (hd1,gpt1)
(hd1,gpt1): Filesystem is fat.
grub> ls (hd1,gpt2)
(hd1,gpt2): Filesystem is zfs.
Grub2 has the concept of a default location which is stored in a variable named root. To see where you are:
grub> echo $root
hd2,gpt1
That happens to be the EFI partition on device hd2. To see the directories and files in that partition:
grub> ls /
efi/ System/ mach_kernel
Unlike linux, using "/" as the path does not mean the absolute root of grub's search space. It refers to wherever you (or some previous script) set the value of the variable "root".
You can change the location of "/" by setting a new value for root:
grub> set root=(hd1,gpt2)
Now we're at the top level of the zfs partition. To see the top level datasets:
grub> ls /
@/ ROOT/
error: incorrect dnode type
Grub is telling us that there is one top level dataset in this pool named "ROOT".
Because "ROOT" has no file system, we get the message:
error: incorrect dnode type
A less disturbing message, IMHO, might have been:
No file system
You see the same thing if you use the partition name:
grub> ls (hd1,gpt2)/
In that case, the path is absolute and the value of "root" isn't used.
Let's go deeper:
grub> ls /ROOT
@/ fedora/
error: incorrect dnode type
This tells us that the only child dataset of ROOT is the dataset "fedora". And there's no file system here either.
A complete path will have a sequence of dataset names followed by a traditional filesystem path.
To tell grub we're done specifying the dataset path and now want regular file system paths, we introduce the "@" character. To display the boot image files:
grub> ls ROOT/fedora@/boot
<big list of files including vmlinuz and initramfs>
Here, "ROOT" and "fedora" are datasets. "boot" is a filesystem directory.
The "@" symbol on "fedora@" refers to a snapshot. In this case, we don't have a snapshot, but the "@" is required to separate dataset path components from filesystem path components.
If you want to look into a particular snapshot, "2018-06-05" for example, the expression would be:
grub> ls ROOT/fedora@2018-06-05/boot ...
The last dataset element "fedora@" can be followed by a regular filesystem path.
You may see lines like this mixed in with other files and directories:
error: compression algorithm inherit not supported
This means that grub found something it can't "look into" for some reason.
If you try to list files in the efi partition:
grub> ls ROOT/fedora@/boot/efi
<nothing>
That's because we're not dealing with linux mount points here. The "efi" is just an empty directory. You have to look on the EFI partition. In this case:
grub> ls (hd1,gpt1)/ efi/ System/ mach_kernel
If you know the uuid of the root device, you can make that the default root:
grub> search --no-floppy --fs-uuid --set=root cb2cb31ba32112e6
<no output means it found something>
This expression tells grub to search through all the devices until it finds one with the uuid specified. Then it makes that the root. You can see what it found:
grub> echo $root
hd1,gpt2
On bootable devices like thumb drives, uuid's are essential because device numbers will be different depending on the number of other devices on the system. In fact, it's a bad idea In Our Time to use any device numbers, especially in a grub.cfg file. History leaves much debris.
To boot from a cantankerous system that drops you into the dread grub prompt, you have to poke around using "ls" to find your vmlinuz and initramfs image files. Once found, you tell grub about them:
grub> set root=(hd1,gpt2)
grub> linuxefi /ROOT/fedora@/boot/vmlinuz-4.16.13-300.fc28.x86_64 root=ZFS=cool/ROOT/fedora ro rhgb quiet
grub> initrdefi /ROOT/fedora@/boot/initramfs-4.16.13-300.fc28.x86_64.img
If you found the location using "search", you don't need to set the root again.
(On non-efi systems, the keywords are just "linux" and "initrd")
Now you're ready to boot:

grub> boot
For reference, here's the complete grub2 boot script for this system:
load_video
set gfxpayload=keep
insmod gzio
insmod part_gpt
insmod zfs
search --no-floppy --fs-uuid --set=root cb2cb31ba32112e6
linuxefi /ROOT/fedora@/boot/vmlinuz-4.16.13-300.fc28.x86_64 root=ZFS=cool/ROOT/fedora ro rhgb quiet
initrdefi /ROOT/fedora@/boot/initramfs-4.16.13-300.fc28.x86_64.img
You can get a list of all grub2 commands:
grub> help <command name>
Hopefully, grub is now a little less dreadful for you.
While you're messing around with zfs in grub, you can use another command "zfsinfo" to show the name and structure of a zfs pool. To access zfsinfo, the file zfsinfo.mod has to be in your boot-time grub2 modules folder. Here's a list of all the modules in my system: (We're running linux now, not grub.)
mount /boot/efi
ls /boot/efi/EFI/fedora/x86_64-efi/
zfscrypt.mod zfsinfo.mod zfs.mod
If you're missing "zfsinfo.mod", copy it to the EFI partition like this:

cp -a /usr/lib/grub/x86_64-efi/zfsinfo.mod /boot/efi/EFI/fedora/x86_64-efi/
Back at the grub prompt, you have to load it:
grub> insmod zfsinfo
Now you can run:
grub> zfsinfo (hd1,gpt2)
And see (in this case)
Pool name: cool
Pool GUID: cb2cb31ba32112e6
Pool state: Active
This VDEV is a mirror VDEV with 2 children
VDEV element number 0:
  Leaf virtual device (file or disk)
  Virtual device is online
  Bootpath: unavailable
  Path: nvme1n1p2
  Devid: unavailable
VDEV element number 1:
  Leaf virtual device (file or disk)
  Virtual device is online
  Bootpath: unavailable
  Path: nvme0n1p2
  Devid: unavailable
"These things never happen, but always are." -- Sallust (4th century A.D.)
Here's how to repopulate the EFI from scratch.
Boot from your rescue system and go into the target as usual:
zpool import mypool -o altroot=/mypool
zenter /mypool
mount /boot/efi
Before you can proceed, networking must operate. That usually isn't a problem, but resolving network names will be, because Fedora has elaborated /etc/resolv.conf into a symbolic link that points into nowhere when you're in a chroot. To fix that, move the link somewhere else temporarily:

mv /etc/resolv.conf /root

Create a new /etc/resolv.conf that contains a nameserver line, for example:

nameserver 8.8.8.8

If you don't know the numbers for your nameserver, Google's public DNS (8.8.8.8) usually works.
If you like, you can delete everything inside /boot/efi:
rm -rf /boot/efi/*
Prior to Fedora 28 use:
dnf reinstall grub2-efi
dnf reinstall shim
dnf reinstall fwupdate-efi
dnf reinstall mactel-boot
Fedora 28 and later:
dnf reinstall grub2-efi-x64
dnf reinstall shim-x64
dnf reinstall fwupdate-efi
dnf reinstall mactel-boot
Now add the necessary zfs modules:
mkdir -p /boot/efi/EFI/fedora/x86_64-efi
cp -a /usr/lib/grub/x86_64-efi/zfs* /boot/efi/EFI/fedora/x86_64-efi
Restore the old resolv.conf:
mv /root/resolv.conf /etc
Unmount the EFI partition and exit the chroot:
umount /boot/efi
exit
Export the pool:

zpool export mypool
Try booting again.
Most of this article is about making kernel updates with root on zfs as painless as they are without it. But when an update comes out for ZFS without a kernel update, there can be problems.
After performing a dnf update that includes a new ZFS, the dkms process will run and build new spl and zfs modules. Unfortunately, the installers don't run dracut, so the initramfs won't contain the new modules. You should still be able to boot, since the initramfs is self-consistent. If you want everything to be up to date, run this before rebooting:
zmogrify `uname -r`
If the update group includes both zfs and kernel updates, there is a puzzle - if the kernel was installed last, it will contain the new zfs modules because of our post-install script. But if zfs gets installed last, things will be out of whack. To be sure:
zmogrify <your new kernel version>
This annoyance is the last frontier to making root on zfs transparent to updates. Without sounding peevish (I hope), I want to point out that there's a dkms.conf file in the zfs source package that contains the setting:

REMAKE_INITRD="no"
It would be nice if this option was changed to "yes" - Then zfs updates would be totally carefree provided you've done all the other hacks described in this article. The issue was debated by the zfs-on-linux developers and they decided otherwise. Dis aliter visum.
It's possible for the zfs.mod that lives on the EFI partition to get out of sorts with the zfs.ko in the root file system. Or more precisely, the set of zfs properties supported in the pool that contains the root file system may not be supported by zfs.mod if the pool was created by a more recent zfs.ko.
Right now (2016-10) I live in a brief era when the version of zfs.mod supplied by grub2-efi-modules agrees with the property set created by default using zfs 0.6.5.8. But such a happy situation cannot last. The pool itself retains the property set used when it was created. But when you create a new pool with a future version of zfs, it may not be mountable by an older zfs.mod.
The fix is to enumerate all the properties your zfs.mod supports and specify only those when creating a new pool. To discover the set of properties supported by a given version of zfs.mod, it appears that you have to study the source. Being a lazy person, I'll just quote a reference that shows an example of how a pool is created with a subset of available features.
The following quotation and code block come from the excellent article ArchLinux - ZFS in the section GRUB compatible pool creation.
"By default, zpool will enable all features on a pool. If /boot resides on ZFS and when using GRUB, you must only enable read-only, or non-read-only features supported by GRUB, otherwise GRUB will not be able to read the pool. As of GRUB 2.02.beta3, GRUB supports all features in ZFS-on-Linux 0.6.5.7. However, the Git master branch of ZoL contains one extra feature, large_dnodes that is not yet supported by GRUB."
zpool create -f -d \
    -o feature@async_destroy=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@hole_birth=enabled \
    -o feature@bookmarks=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@embedded_data=enabled \
    -o feature@large_blocks=enabled \
    <pool_name> <vdevs>
The article goes on to say: "This example line is only necessary if you are using the Git branch of ZoL."
Evidently I got away with ZFS 0.6.5.8 because it doesn't have the large_dnodes feature yet or grub got updated. To find out, run:
zpool get all pool | grep feature
For my pool, this shows:
pool feature@async_destroy enabled local
pool feature@empty_bpobj active local
pool feature@lz4_compress active local
pool feature@spacemap_histogram active local
pool feature@enabled_txg active local
pool feature@hole_birth active local
pool feature@extensible_dataset enabled local
pool feature@embedded_data active local
pool feature@bookmarks enabled local
pool feature@filesystem_limits enabled local
pool feature@large_blocks enabled local
No sign of large_dnodes yet, but be aware that it's out there waiting for you.
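To make the comparison easier, you can filter the output down to just the features that are enabled or active on your pool, which is the list you'd have to check against your zfs.mod. A small sketch, using the pool name "pool" as in the example above:

```shell
# List the features that are enabled or active on the pool named "pool".
# Anything here that your grub zfs.mod doesn't support is a booting hazard.
zpool get all pool | awk '$2 ~ /^feature@/ && $3 != "disabled" {print $2, $3}'
```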
Share your woes by mail with Hugh Sparks. (I like to hear good news too.)