Migrate Linux (home)server to ZFS
Recently I moved my home server to a (refurbished) Dell Optiplex 7050. The previous machine ran on ext4, and I decided to try ZFS without reinstalling the system.
Why ZFS?
ZFS is an enterprise-grade file system. It supports a very long list of features; most target big data storage. However, some of these are a “nice-to-have” in a home server. In my case:
- Datasets: I really like them. You can manage data with all the benefits of a partition (quotas, snapshots, etc.) while sharing the same underlying “raw space”, so you never lose space because you assigned too much to the /var partition and too little to /home (a short example follows this list)
- Snapshotting: the feature I need when experimenting. You can do this with other filesystems and with LVM as well, but they are somewhat limited (mainly because the concept of “dataset” is missing in many other filesystems)
- Checksums: I’d like to know if some data blocks are damaged. In RAID setups, I like the fact that ZFS also tries to repair the damaged block
- Native encryption and compression per dataset, meaning that I can use different keys for different datasets or disable compression when it isn’t useful (e.g., for already compressed files)
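To make the dataset idea concrete, here is a minimal sketch (the pool and dataset names are just examples) of creating a quota-limited dataset and snapshotting it before an experiment:
# Create a dataset with a 10G quota inside an existing pool called "tank"
zfs create -o quota=10G tank/experiments
# Snapshot it before experimenting, roll back if things go wrong
zfs snapshot tank/experiments@before-experiment
zfs rollback tank/experiments@before-experiment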
The ability to transfer ZFS snapshots is fantastic, although I don’t think I’ll use it much at home. I’ll keep doing my backups with restic: relying on a single tool (like ZFS snap/send/recv for backups too) might expose me to some weird bug, like the one present in some versions of ZFS on Ubuntu (note that even the ZFS snapshots were corrupted in that case).
What about btrfs? Well, if you take a look at the “Status” page for btrfs features, you may discover that replacing a disk might result in issues on I/O errors, or that their RAID support is “mostly OK”, which is not very encouraging.
Prepare the new machine
The new PC is equipped with a single NVMe disk (refurbished), so no RAID (on the previous PC there was a RAID1) and non-ECC RAM (like before). Currently, I have no budget for a 2nd disk or ECC RAM, so I’ll live with it :-(
For the ZFS preparation, I adapted the good guide that you can find on the OpenZFS website. I used GRML, a Debian-based (bullseye at the time of writing) live distribution that ships with some handy tools. GRML comes with zfs-dkms installed: I removed it and installed the zfs-modules package (the ZFS kernel module) for the running kernel. I’m pre-building the ZFS package using dkms mkbmdeb; alternatively, you’ll need zfs-dkms and the kernel headers installed.
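For completeness, the pre-built module package can be produced on another Debian machine roughly like this (the ZFS version is just an example; match it to your kernel and ZFS release):
# On a build machine with zfs-dkms and the headers for the target kernel
apt install --yes zfs-dkms linux-headers-$(uname -r)
# Build a binary-only zfs-modules package for the running kernel
dkms mkbmdeb zfs/2.0.3 -k $(uname -r)
# Copy the resulting .deb to the live system and install it there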
Old storage layout
- RAID1 (mdadm) over mechanical disks:
  - sda1 and sdb1 were part of a RAID1 md0 for the /boot partition (no encryption)
  - sda2 and sdb2 were part of a RAID1 md1 for the / partition
- md1 (the root RAID) was managed by dmcrypt; the plain-text block device md1_crypt was formatted using ext4
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465,8G 0 disk
├─sda1 8:1 0 953M 0 part
│ └─md0 9:0 0 952M 0 raid1 /boot
└─sda2 8:2 0 464,8G 0 part
└─md1 9:1 0 464,7G 0 raid1
└─md1_crypt 253:0 0 464,7G 0 crypt /
sdb 8:16 0 465,8G 0 disk
├─sdb1 8:17 0 953M 0 part
│ └─md0 9:0 0 952M 0 raid1 /boot
└─sdb2 8:18 0 464,8G 0 part
└─md1 9:1 0 464,7G 0 raid1
└─md1_crypt 253:0 0 464,7G 0 crypt /
New storage layout
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0 0 477G 0 disk
├─nvme0n1p2 259:1 0 512M 0 part
├─nvme0n1p3 259:2 0 2G 0 part
└─nvme0n1p4 259:3 0 474,4G 0 part
We’ll create a new GPT partition table with the EFI partition (nvme0n1p2), a boot pool named bpool (nvme0n1p3), and a root pool named rpool (nvme0n1p4). The bpool is needed because GRUB cannot boot from a pool that has newer features, such as encryption. I’ll use native encryption on some datasets (no more LUKS, sorry). Quotas will limit datasets so that no single dataset can stall the system. Everything will be prepared inside /target while on the live disk.
Here is a summary of the commands that I used (most of them come from the OpenZFS guide mentioned before):
# Install requirements
# See the OpenZFS guide to install zfs-dkms and linux headers
apt install --yes debootstrap gdisk zfs-modules-$(uname -r) zfsutils-linux
# Set useful variable `DISK` with the path for the *new* physical disk (in my case, the NVMe)
# The path depends on your system
DISK=/dev/disk/by-id/nvme-eui.0000000000000000
# ZAP every partition on the new disk
sgdisk --zap-all $DISK
# UEFI partition
sgdisk -n2:1M:+512M -t2:EF00 $DISK
# Create boot pool partition
sgdisk -n3:0:+1G -t3:BF01 $DISK
# Create root pool partition
sgdisk -n4:0:0 -t4:BF00 $DISK
# Create the boot pool bpool
zpool create \
-o cachefile=/etc/zfs/zpool.cache \
-o ashift=12 -d \
-o feature@async_destroy=enabled \
-o feature@bookmarks=enabled \
-o feature@embedded_data=enabled \
-o feature@empty_bpobj=enabled \
-o feature@enabled_txg=enabled \
-o feature@extensible_dataset=enabled \
-o feature@filesystem_limits=enabled \
-o feature@hole_birth=enabled \
-o feature@large_blocks=enabled \
-o feature@lz4_compress=enabled \
-o feature@spacemap_histogram=enabled \
-o feature@zpool_checkpoint=enabled \
-O acltype=posixacl -O canmount=off -O compression=lz4 \
-O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
-O mountpoint=/boot -R /target \
bpool ${DISK}-part3
# Create the encrypted root pool rpool
zpool create \
-o ashift=12 \
-O encryption=aes-256-gcm \
-O keylocation=prompt -O keyformat=passphrase \
-O acltype=posixacl -O canmount=off -O compression=lz4 \
-O dnodesize=auto -O normalization=formD -O relatime=on \
-O xattr=sa -O mountpoint=/ -R /target \
rpool ${DISK}-part4
# Create the ZFS dataset for /boot
zfs create -o mountpoint=/boot bpool/boot
# Set safe quota (plenty of space for kernels and initramfs)
zfs set quota=1200M bpool/boot
# Create the ZFS dataset for /
zfs create -o mountpoint=/ rpool/ROOT
zfs mount rpool/ROOT
# Create datasets for home dirs
zfs create rpool/home
zfs create -o mountpoint=/root rpool/home/root
# Set safe quota (usually there is nothing there)
zfs set quota=50G rpool/home
# Create datasets for /var
zfs create -o canmount=off rpool/var
zfs create rpool/var/log
zfs create rpool/var/spool
zfs create -o com.sun:auto-snapshot=false rpool/var/cache
zfs create -o com.sun:auto-snapshot=false rpool/var/tmp
chmod 1777 /target/var/tmp
zfs create -o com.sun:auto-snapshot=false rpool/tmp
chmod 1777 /target/tmp
# Set safe quota values
zfs set quota=50G rpool/var
zfs set quota=50G rpool/tmp
# Create separate datasets for rancher k3s
zfs create -o com.sun:auto-snapshot=false -o mountpoint=/var/lib/rancher/k3s rpool/k3s
zfs create rpool/k3s/agent
zfs create rpool/k3s/data
zfs create rpool/k3s/server
zfs create rpool/k3s/storage
# Set safe quota values (there we have some data, and container images)
zfs set quota=50G rpool/k3s/agent
zfs set quota=50G rpool/k3s/data
zfs set quota=50G rpool/k3s/server
zfs set quota=300G rpool/k3s/storage
If everything is correct, zfs list should show something like:
NAME USED AVAIL REFER MOUNTPOINT
bpool 112M 1.64G 96K /target/boot
bpool/boot 112M 1.06G 112M /target/boot
rpool 149G 309G 192K /target
rpool/ROOT 12.7G 309G 12.7G /target
rpool/home 1.79G 48.2G 264K /target/home
rpool/home/root 1.79G 48.2G 1.79G /target/root
rpool/k3s 134G 309G 296K /target/var/lib/rancher/k3s
rpool/k3s/agent 18.0G 32.0G 18.0G /target/var/lib/rancher/k3s/agent
rpool/k3s/data 147M 49.9G 147M /target/var/lib/rancher/k3s/data
rpool/k3s/server 7.16M 50.0G 7.16M /target/var/lib/rancher/k3s/server
rpool/k3s/storage 116G 184G 116G /target/var/lib/rancher/k3s/storage
rpool/tmp 4.91M 50.0G 4.91M /target/tmp
rpool/var 112M 49.9G 192K /target/var
rpool/var/cache 48.3M 49.9G 48.3M /target/var/cache
rpool/var/log 61.4M 49.9G 61.4M /target/var/log
rpool/var/spool 1.50M 49.9G 1.50M /target/var/spool
rpool/var/tmp 224K 49.9G 224K /target/var/tmp
As you can see, some datasets have a smaller AVAIL space thanks to the quota.
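Quotas can be inspected or adjusted later on a per-dataset basis; for example (dataset name taken from the layout above):
# Show the quota and the current usage of a dataset
zfs get quota,used,available rpool/k3s/storage
# Raise the quota if the dataset gets close to the limit
zfs set quota=400G rpool/k3s/storage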
Clone data and system
Now that we have a skeleton for the new system, we can proceed to copy everything.
Will the new system accept ZFS? What kind of issues will I face? I didn’t know yet, so I did some test runs using backup copies before powering off the old system and copying everything. Thanks to restic mount, I mounted the latest backup of my home server in /mnt/backup and rsync’ed everything:
rsync -ahPHAXx --info=progress2 -e ssh /mnt/backup/hosts/proxima/latest/ /target
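For reference, the backup was exposed with something along these lines (the repository URL is a placeholder; use your own repository and credentials):
# Mount the restic repository read-only under /mnt/backup
restic -r sftp:backup@backuphost:/srv/restic-repo mount /mnt/backup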
I already have the ZFS module installed because I’m using it on an external drive. However, we’ll need zfs-initramfs to load ZFS on boot. Also, we need to install grub-efi and switch to EFI boot (the previous system was on CSM/BIOS).
# Save old ZFS cache and replace with the temporary one
mv /target/etc/zfs/zpool.cache /target/etc/zfs/zpool.cache.old
zpool set cachefile=/etc/zfs/zpool.cache rpool
zpool set cachefile=/etc/zfs/zpool.cache bpool
cp /etc/zfs/zpool.cache /target/etc/zfs/zpool.cache
# Prepare chroot environment
mount -o bind /dev/ /target/dev
mount -t proc none /target/proc
mount -t sysfs none /target/sys
# Switch root
chroot /target /usr/bin/env DISK=$DISK bash --login
export PS1="(chroot) $PS1"
# I'm using LC_ALL="it_IT.UTF-8"
export LC_ALL="it_IT.UTF-8"
apt update
# Remove old packages
apt remove cryptsetup-initramfs
# If you don't have ZFS already installed, you should do it now. See the OpenZFS guide
# Install ZFS initramfs tools
apt install -t buster-backports zfs-initramfs
# Prepare EFI (see OpenZFS page)
apt install dosfstools
mkdosfs -F 32 -s 1 -n EFI ${DISK}-part2
mkdir /boot/efi
echo /dev/disk/by-uuid/$(blkid -s UUID -o value ${DISK}-part2) \
/boot/efi vfat defaults 0 0 >> /etc/fstab
mount /boot/efi
apt install --yes grub-efi-amd64 shim-signed
# I don't have any other OS, so we can safely remove os-prober
apt purge --yes os-prober
Now we need to fix /etc/fstab and /etc/crypttab: there is no need for those mounts there (ZFS will handle its mount points), so I deleted the entries for /boot and / in fstab, and md1_crypt in crypttab.
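After the cleanup, my /etc/fstab essentially contains only the EFI entry added earlier (the UUID below is a placeholder):
# /etc/fstab - ZFS mounts /, /boot and the other datasets by itself
/dev/disk/by-uuid/XXXX-XXXX /boot/efi vfat defaults 0 0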
After that, we can configure the auto-import of bpool (this snippet comes from the OpenZFS manual) by creating a new file at /etc/systemd/system/zfs-import-bpool.service with this content:
[Unit]
DefaultDependencies=no
Before=zfs-import-scan.service
Before=zfs-import-cache.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/zpool import -N -o cachefile=none bpool
# Work-around to preserve zpool cache:
ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache
[Install]
WantedBy=zfs-import.target
Enable the service by issuing systemctl enable zfs-import-bpool.service.
I had an issue at this point: grub-probe /boot was not recognizing zfs. I don’t know what happened, but I had to reboot grml and launch chroot again, and the problem disappeared…
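In other words, before moving on this check should succeed:
# Should print "zfs"; otherwise GRUB won't be able to handle the boot pool
grub-probe /boot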
Now we can configure grub by editing /etc/default/grub and adding root=ZFS=rpool/ROOT to the GRUB_CMDLINE_LINUX variable. Update the grub configuration and install grub:
update-grub
grub-install --target=x86_64-efi --efi-directory=/boot/efi \
--bootloader-id=debian --recheck --no-floppy
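For reference, the relevant line in /etc/default/grub ends up looking something like this (any other options you already have on that line stay as they are):
GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT"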
Finally, we need to fix the ethernet interface name: I had eno1 on the previous system, while now I have enp0s31f6.
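A minimal sketch of that change, assuming a classic ifupdown configuration with DHCP (adjust accordingly if you use systemd-networkd or NetworkManager):
# /etc/network/interfaces - just replace the old interface name with the new one
auto enp0s31f6
iface enp0s31f6 inet dhcp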
Boot
At this point, everything should be OK. If Debian can’t boot and drops you into a shell, verify that you installed the correct version of zfs-initramfs (at least 2.0.3), as older versions have issues with encryption in rpool.
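A quick way to check which version is installed, from the chroot or the booted system:
# Print the installed zfs-initramfs version (it should be at least 2.0.3)
dpkg-query -W zfs-initramfs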
Enjoy your ZFS-on-root server :-)
Acknowledgements
Thanks to @lerrigatto for the advice on some ZFS features :-)
Addendum: remotely unlock encrypted ZFS root pool/dataset
To remotely unlock ZFS encrypted datasets or dmcrypt partitions, we need:
- the kernel to have an IP address (and, optionally, a gateway)
- an SSH server inside the initramfs
So, let’s install dropbear. Debian provides a convenient pre-configured package:
apt install dropbear-initramfs
Put your SSH public key inside /etc/dropbear-initramfs/authorized_keys. Note that older versions of dropbear (like the one in Debian Buster) don’t support ed25519 keys; I have a specific RSA SSH key for that occasion. After configuring dropbear, update all initramdisks using update-initramfs -k all -u.
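Put together, that step looks roughly like this (the key path is just an example):
# Authorize a dedicated RSA key for the initramfs SSH server
cat ~/.ssh/id_rsa_initramfs.pub >> /etc/dropbear-initramfs/authorized_keys
# Rebuild all the initramdisks so that dropbear and the key end up inside them
update-initramfs -k all -u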
Now we need to add the IP configuration to the GRUB_CMDLINE_LINUX_DEFAULT variable (/etc/default/grub). The syntax is well explained in the official Linux kernel documentation:
ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>:<dns0-ip>:<dns1-ip>:<ntp0-ip>
- client-ip is the IP address of the client, for static assignment
- server-ip is the IP address of the NFS server for root (used for PXE diskless booting, not needed in our case)
- gw-ip is the IP address of the gateway, for static assignment
- netmask is the network mask, for static assignment
- hostname is the network host name (or host + domain name)
- device is the device to use (both with static and automatic config)
- autoconf is the autoconfiguration protocol to use for automatic addressing; it can be:
  - off or none: no autoconfiguration (static assignment)
  - on or any: use any protocol available in the kernel (default)
  - dhcp: use DHCP
  - bootp: use BOOTP
  - rarp: use RARP
  - both: use both BOOTP and RARP but not DHCP (old option kept for backwards compatibility)
- dns0-ip and dns1-ip: IP addresses of DNS servers
- ntp0-ip: IP address of an NTP server
Unused fields can be left empty. Alternatively, a single valid autoconf value can be specified (e.g., ip=dhcp) instead of the empty fields.
Static assignment example:
ip=192.0.2.10::192.0.2.1:255.255.255.0
Update the configuration using update-grub. At boot, you can then SSH into the server’s initramfs and issue the zfsunlock command.
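Putting the pieces together, the GRUB line and the unlock step might look like this (addresses, interface name, and any pre-existing options are examples):
# /etc/default/grub - static IP configuration for the initramfs
GRUB_CMDLINE_LINUX_DEFAULT="quiet ip=192.0.2.10::192.0.2.1:255.255.255.0::enp0s31f6:off"
# From another machine, after rebooting the server:
ssh root@192.0.2.10
zfsunlock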