
Getting ZFS to run on your dedicated server in a data-center far away.


For those in the know, ZFS has been the file system of choice for bit-rot-resistant large-scale storage. But its snapshot capabilities and the ability to easily send entire file systems between physical storage arrays also hold some attraction for hypervisors and mission-critical data center applications.

Whatever your motivation, this guide will show you how to get a pure ZFS system running on a machine that you only have console access to. We are basing this guide on Ubuntu 18.04 Server, but it should mostly work on 20.04 Server as well. Note that on 20.04 Desktop, Zsys, Ubuntu's clever but in this instance not entirely helpful ZFS management layer, will get in the way. So you'll be on your own there, although why are you running a desktop install on a dedicated server machine anyway?

The usual disclaimer applies. Use this guide at your own risk.

Prerequisites:

  • Basic understanding of ZFS concepts.

Software:

  • To start with, any type of live Linux booted into RAM on your server. Ubuntu 18.04 or Debian-based distros are preferred, but you can get debootstrap to work on pretty much any distro.
  • root - can't do this without it.

Hardware:

  • Preferably a server with ECC RAM. Yes, ZFS also works without it, but it trusts the RAM content unconditionally, so a corrupted DRAM cell could corrupt your stored data if you go without ECC. And since space weather is looking to get rougher, it might not be a bad investment.
  • One or more drives, preferably whole ones. Yes, ZFS can work on a single partition, but giving it a whole drive is better for reasons that are out of scope here. Ideally you want more than one drive, because a single-drive setup will only allow you to spot bit rot, not recover from it.
  • Something fast, durable and low-latency as a read cache (L2ARC), and something decently fast and durable as a SLOG for the ZIL. Think of the SLOG as a bucket that writes to your array get dumped into, used to replay those writes in case your machine is reset before the array has caught up with them. The read caching is hopefully self-explanatory.
  • If you are running multiple VMs on a hypervisor with just 3-4 drives, don't even think about doing this without an L2ARC and SLOG. ZFS can handle petabytes of data, but a speed demon on small arrays it ain't. Incidentally, Intel XPoint (Optane) storage is great for both L2ARC and SLOG, chiefly because of its low latency.

Alright let's get into it.

Preparing the drives:

This guide was adapted from the existing OpenZFS material. You will need to refer to it in part if you are configuring a UEFI boot system.

This part assumes we had to boot into a live rescue system based on Debian Buster. We will enable ZFS support in the rescue image and then debootstrap Ubuntu Bionic from it onto the ZFS pools we create.

This creates a system with separate boot and root pools. The former has fewer features enabled, since GRUB does not support some of the more advanced ones. While not applicable here, the newer version of ZFS that ships with Ubuntu 20.04 supports native ZFS encryption, which does upset GRUB even when it is only used on the root pool, at least with the GRUB version that was current around May 2020.

This guide has some customizations regarding disk quotas and the checksum algorithm.

We already have root, and SSH is installed (obviously); otherwise take care of that first.
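
If the rescue image happens to be missing SSH, a minimal sketch for a Debian-based rescue system (stock openssh-server package assumed):

apt update
apt install --yes openssh-server

With that out of the way, install the debootstrap and ZFS prerequisites in the rescue image: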

apt-get update
apt install --yes debootstrap gdisk dkms dpkg-dev linux-headers-$(uname -r)
apt install --yes -t buster-backports --no-install-recommends zfs-dkms
modprobe zfs
apt install --yes -t buster-backports zfsutils-linux

Formatting and partitioning the disks. Create some variables that will make the rest of the process easier, so you don't need to refer to massive path names. Be sure to use the ATA or NVMe names.

ls /dev/disk/by-id/
DISK1=/dev/disk/by-id/ata-WDC_WD3000FYYZ-01UL1B2_WD-WCC138XFVD4
DISK2= ….

Whenever $DISK is used below, assume you may need to repeat the command for each of $DISK1 .. $DISKn.

If the disk was previously used in an MD array:

apt install --yes mdadm

If so, stop them (replace md0 as required):

mdadm --stop /dev/md0
# For an array using the whole disk:
mdadm --zero-superblock --force $DISK
# For an array using a partition:
mdadm --zero-superblock --force ${DISK}-part2

Clear the partition table:

sgdisk --zap-all $DISK

Run this if you need legacy (BIOS) booting:

sgdisk -a1 -n1:24K:+2000K -t1:EF02 $DISK

Run this for UEFI booting (for use now or in the future), or size it for a swap partition instead if you like (-t2:8200; a sketch follows the command below):

sgdisk     -n2:2M:+512M   -t2:EF00 $DISK
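
If you would rather use that partition as swap, a sketch of the swap variant (the +8G size here is only an illustration, pick whatever suits your RAM and workload):

sgdisk     -n2:2M:+8G     -t2:8200 $DISK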

Run this for the boot pool:

sgdisk     -n3:0:+1G      -t3:BF01 $DISK

For the root pool:

sgdisk     -n4:0:0        -t4:BF00 $DISK
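
Since every member of the array needs the same layout, you may find it convenient to loop over your disk variables. A sketch assuming three disks and the BIOS-plus-future-UEFI layout above:

for DISK in $DISK1 $DISK2 $DISK3; do
    # wipe and repartition each array member identically
    sgdisk --zap-all $DISK
    sgdisk -a1 -n1:24K:+2000K -t1:EF02 $DISK
    sgdisk     -n2:2M:+512M   -t2:EF00 $DISK
    sgdisk     -n3:0:+1G      -t3:BF01 $DISK
    sgdisk     -n4:0:0        -t4:BF00 $DISK
done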

We used ZFS “reserved” partition types for the cache and log devices, which are obviously on separate NVMe drives.

Now create the root pool. Note that ashift is drive dependent. Use smartctl to determine the physical and logical sector sizes of the drive. For a true 512-byte-sector drive use ashift=9; for a true 4K drive use ashift=12. ashift=12 will work on 512-byte drives as well, but you will lose a lot of capacity for certain file types, up to 25 percent in my tests. Also note that lower-case o and upper-case O matter: -o sets pool properties and feature flags, while -O sets filesystem properties on the pool's root dataset. Use the -f flag when a disk might still be part of old pools, or better still, delete those pools first by importing them and then running zpool destroy…
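
To check the sector sizes before settling on an ashift, something along these lines works (assuming smartmontools is installed; output wording varies by drive):

# apt install --yes smartmontools
smartctl -i ${DISK1} | grep -i 'sector size'
# or, for all block devices at once:
lsblk -o NAME,PHY-SEC,LOG-SEC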

Creating pools:

Now let's set this up for Ubuntu. First the boot pool:

zpool create -f -o ashift=9 -d \
    -o feature@async_destroy=enabled \
    -o feature@bookmarks=enabled \
    -o feature@embedded_data=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@extensible_dataset=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@hole_birth=enabled \
    -o feature@large_blocks=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
    -O acltype=posixacl -O canmount=off -O compression=lz4 -O devices=off \
    -O normalization=formD -O relatime=on -O xattr=sa \
    -O mountpoint=/ -R /mnt bpool raidz ${DISK1}-part3 ${DISK2}-part3 ${DISK3}-part3

Now for the root pool:

zpool create -f -o ashift=9 \
    -O acltype=posixacl -O canmount=off -O compression=lz4 \
    -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \
    -O mountpoint=/ -R /mnt rpool raidz ${DISK1}-part4 ${DISK2}-part4 ${DISK3}-part4

To add a separate ZIL (SLOG) and L2ARC to the root pool, do this, pointing to the relevant SSD or NVMe device or partition:

# for ZIL (SLOG):
sudo zpool add -f rpool log /dev/disk/by-id/nvme...
# for L2ARC:
sudo zpool add rpool cache /dev/disk/by-id/nvme...

Create filesystem datasets to act as containers:

zfs create -o checksum=sha256 -o canmount=off -o mountpoint=none rpool/ROOT
zfs create -o checksum=sha256 -o canmount=off -o mountpoint=none bpool/BOOT

Create filesystem datasets for the root and boot filesystems:

zfs create -o quota=20G -o canmount=noauto -o mountpoint=/ rpool/ROOT/ubuntu
zfs mount rpool/ROOT/ubuntu
zfs create -o quota=1.5G -o canmount=noauto -o mountpoint=/boot bpool/BOOT/ubuntu
zfs mount bpool/BOOT/ubuntu

Create datasets (adjust layout to suit your needs):

zfs set checksum=sha256 rpool
zfs create -o quota=20G                                rpool/home
zfs create -o quota=5G -o mountpoint=/root             rpool/home/root
zfs create -o quota=6G -o canmount=off                 rpool/var
zfs create -o quota=4.5G -o canmount=off               rpool/var/lib
zfs create -o quota=5G                                 rpool/var/log
zfs create -o quota=1512M                              rpool/var/spool

If you wish to exclude these from snapshots:

zfs create -o quota=1G -o com.sun:auto-snapshot=false  rpool/var/cache
zfs create -o quota=1G -o com.sun:auto-snapshot=false  rpool/var/tmp
chmod 1777 /mnt/var/tmp

For a disk-based /tmp:

zfs create -o quota=2G -o com.sun:auto-snapshot=false  rpool/tmp
chmod 1777 /mnt/tmp

Use zfs get checksum path/to/dataset to check that sha256 was inherited correctly; the same goes for compression. We are using sha256 rather than sha512 for compatibility reasons.
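
For example, to check every dataset in one go (adjust the pool names if yours differ):

zfs get -r checksum,compression rpool
zfs get -r checksum,compression bpool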

Install the minimal system:

First make debootstrap work so it can install Ubuntu from the Debian-based system we are in.

wget http://at.archive.ubuntu.com/ubuntu/pool/main/u/ubuntu-keyring/ubuntu-keyring_2018.09.18.1~18.04.0.tar.gz
tar xzf ubuntu-keyring_2018.09.18.1~18.04.0.tar.gz
cd ubuntu-keyring-2018.09.18.1/
cp -f keyrings/* /usr/share/keyrings/
debootstrap bionic /mnt
zfs set devices=off rpool

Replace HOSTNAME with the desired hostname:

echo HOSTNAME > /mnt/etc/hostname
vi /mnt/etc/hosts
# Add a line:
127.0.1.1       HOSTNAME
# or if the system has a real name in DNS:
127.0.1.1       FQDN HOSTNAME

Find the interface name:

ip addr show

Adjust NAME below to match your interface name:

vi /mnt/etc/netplan/01-netcfg.yaml
network:
  version: 2
  ethernets:
    NAME:
      dhcp4: true

Your ISP / data center may require you to hard-code your IP, so use whatever method of configuring 01-netcfg.yaml will ensure you can connect to your machine after reboot.
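
A hypothetical static configuration sketch, in case DHCP is not an option; the addresses, gateway and DNS servers here are placeholders, so substitute whatever your provider gave you:

vi /mnt/etc/netplan/01-netcfg.yaml
network:
  version: 2
  ethernets:
    NAME:
      dhcp4: false
      addresses: [203.0.113.10/24]
      gateway4: 203.0.113.1
      nameservers:
        addresses: [203.0.113.1, 8.8.8.8]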

Configure the package sources:

vi /mnt/etc/apt/sources.list
deb http://archive.ubuntu.com/ubuntu bionic main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu bionic-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu bionic-backports main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu bionic-security main restricted universe multiverse

"Enter" the new system:

Bind the virtual filesystems from the rescue environment to the new system and chroot into it. Adjust the DISK variable handover as appropriate; separate multiple variables with spaces:

mount --rbind /dev  /mnt/dev
mount --rbind /proc /mnt/proc
mount --rbind /sys  /mnt/sys
chroot /mnt /usr/bin/env DISK=$DISK bash --login

Note: This is using --rbind, not --bind

Configure a basic system environment:

ln -s /proc/self/mounts /etc/mtab
apt update
dpkg-reconfigure locales

Even if you prefer a non-English system language, always ensure that en_US.UTF-8 is available:
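
A quick way to check, and to generate the locale if it is missing (a sketch):

locale -a | grep -i 'en_us.utf8'
# if it is not listed:
locale-gen en_US.UTF-8

Then set the timezone and install an editor plus the SSH server: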

dpkg-reconfigure tzdata
apt install --yes vim openssh-server

Configure sshd_config to ensure it permits root login for now (temporarily).
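
For example, something along these lines; revert this once you have a regular admin user on the new system:

vi /etc/ssh/sshd_config
# Set, temporarily:
# PermitRootLogin yes
# If you are not installing SSH keys yet, also:
# PasswordAuthentication yes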

Install ZFS in the chroot environment for the new system:

apt install --yes --no-install-recommends linux-image-generic
# OR see below:
apt install --yes --no-install-recommends linux-image-generic-hwe-18.04
apt install --yes zfs-initramfs

Hint: For the HWE kernel, install linux-image-generic-hwe-18.04 instead of linux-image-generic.

We’ll need that for the encrypted swap later:

apt install --yes cryptsetup

Install GRUB and select all disks that apply, i.e. the full array, anything bootable. This guide covers BIOS systems; refer to the OpenZFS guide linked at the start of the article to set up UEFI booting. You must use one or the other.

apt install --yes grub-pc

To get rid of annoying error messages:

dpkg --purge os-prober

Set root password for reboot. Confirm that sshd_config permits root login.

passwd

Enable importing bpool:

This ensures that bpool is always imported, regardless of whether /etc/zfs/zpool.cache exists, whether it is in the cachefile or not, or whether zfs-import-scan.service is enabled.

vi /etc/systemd/system/zfs-import-bpool.service

[Unit]
DefaultDependencies=no
Before=zfs-import-scan.service
Before=zfs-import-cache.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/zpool import -N -o cachefile=none bpool

[Install]
WantedBy=zfs-import.target

systemctl enable zfs-import-bpool.service

Verify that the ZFS boot filesystem is recognized (this should print zfs):

grub-probe /boot

Refresh the initrd files:

update-initramfs -c -k all

Work around GRUB's missing zpool-features support:

vi /etc/default/grub
# Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/ubuntu"
# While you are there, disable memory zeroing:
# Add: init_on_alloc=0 to: GRUB_CMDLINE_LINUX_DEFAULT
# Save and quit.

This addresses a performance regression. Since we are using the 5.3 kernel (from linux-image-generic-hwe-18.04), we would be affected and should compensate for it.

Optional (but highly recommended): Make debugging GRUB easier:

vi /etc/default/grub
# Comment out: GRUB_TIMEOUT_STYLE=hidden
# Set: GRUB_TIMEOUT=5
# Below GRUB_TIMEOUT, add: GRUB_RECORDFAIL_TIMEOUT=5
# Remove quiet and splash from: GRUB_CMDLINE_LINUX_DEFAULT
# Uncomment: GRUB_TERMINAL=console
# Save and quit.

Update the boot configuration:

update-grub

For legacy (BIOS) booting, install GRUB to the MBR:

grub-install $DISK

Fix filesystem mount ordering (once again we are compensating for systemD fuckery):

zfs set mountpoint=legacy bpool/BOOT/ubuntu
echo bpool/BOOT/ubuntu /boot zfs \
    nodev,relatime,x-systemd.requires=zfs-import-bpool.service 0 0 >> /etc/fstab

zfs set mountpoint=legacy rpool/var/log
echo rpool/var/log /var/log zfs nodev,relatime 0 0 >> /etc/fstab

zfs set mountpoint=legacy rpool/var/spool
echo rpool/var/spool /var/spool zfs nodev,relatime 0 0 >> /etc/fstab

# If you created a /var/tmp dataset:
zfs set mountpoint=legacy rpool/var/tmp
echo rpool/var/tmp /var/tmp zfs nodev,relatime 0 0 >> /etc/fstab

# If you created a /tmp dataset:
zfs set mountpoint=legacy rpool/tmp
echo rpool/tmp /tmp zfs nodev,relatime 0 0 >> /etc/fstab

Snapshot the initial installation:

zfs snapshot bpool/BOOT/ubuntu@install
zfs snapshot rpool/ROOT/ubuntu@install

Exit from the chroot environment back to the LiveCD environment:

exit

Run these commands in the LiveCD environment to unmount all filesystems:

mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {}
zpool export -a

Exporting the zpool above IS CRUCIAL. If you reboot without it, the system won't come up, since the zpool will not have been exported cleanly.
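
If you want to double-check before rebooting:

zpool list
# should report: no pools available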

You did install the SSH server in the chroot and enable root login, right?

reboot

TUNING after reboot:

apt-get update
apt dist-upgrade --yes

Install a command-line environment only:

apt install --yes ubuntu-standard

Mirror GRUB

If you installed to multiple disks, install GRUB on the additional disks:

For legacy (BIOS) booting: THIS IS ONLY NEEDED if you didn’t install grub on all disks already. Otherwise ignore.

dpkg-reconfigure grub-pc

Hit enter until you get to the device selection screen.

Select (using the space bar) all of the disks (not partitions) in your pool.
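
Alternatively, you can run grub-install against each additional disk directly. A sketch, assuming you have redefined $DISK2 and $DISK3 (from /dev/disk/by-id) after the reboot:

for DISK in $DISK2 $DISK3; do
    grub-install $DISK
done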

Adjust the swap partition reference in the examples below as needed. Don't put swap onto ZFS, because this can lead to system lock-ups. Use mdadm if you need to stripe swap across disks.

MAKE SURE YOU ADJUST THE $DISK VARIABLE CORRECTLY.

cat /proc/swaps

should return empty.

For an encrypted single-disk install:

apt install --yes cryptsetup
echo swap ${DISK}-part2 /dev/urandom \
      swap,cipher=serpent-xts-plain64:sha256,size=512,noearly >> /etc/crypttab
echo /dev/mapper/swap none swap defaults,nofail,x-systemd.device-timeout=30 0 0 >> /etc/fstab

The nofail and x-systemd.device-timeout=30 options deal with systemd waiting indefinitely for swap at boot.

For an encrypted mirror or raidz topology:

apt install --yes cryptsetup mdadm

# Adjust the level (ZFS raidz = MD raid5, raidz2 = raid6) and
# raid-devices if necessary and specify the actual devices.
# Personally, I would stripe swap across disks but still avoid using it.
mdadm --create /dev/md0 --metadata=1.2 --level=mirror \
    --raid-devices=2 ${DISK1}-part2 ${DISK2}-part2
echo swap /dev/md0 /dev/urandom \
      swap,cipher=aes-xts-plain64:sha256,size=512 >> /etc/crypttab
echo /dev/mapper/swap none swap defaults 0 0 >> /etc/fstab

Optional: Disable log compression:

As /var/log is already compressed by ZFS, logrotate’s compression is going to burn CPU and disk I/O for (in most cases) very little gain. Also, if you are making snapshots of /var/log, logrotate’s compression will actually waste space, as the uncompressed data will live on in the snapshot. You can edit the files in /etc/logrotate.d by hand to comment out compress, or use this loop (copy-and-paste highly recommended):

for file in /etc/logrotate.d/* ; do
    if grep -Eq "(^|[^#y])compress" "$file" ; then
        sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file"
    fi
done

Then proceed through your standard OS configuration routine. At the end of it all, remember to take a snapshot of boot and root ;) and turn off root login in the sshd config.
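
For example (the snapshot names here are just placeholders, use whatever naming scheme you like):

zfs snapshot bpool/BOOT/ubuntu@configured
zfs snapshot rpool/ROOT/ubuntu@configured
# lock SSH back down:
vi /etc/ssh/sshd_config
# Set: PermitRootLogin no   (or prohibit-password if you log in with keys)
systemctl restart ssh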

That's it. Good luck.