ZFS performance tuning tips

recordsize / volblocksize

The size must be a power of 2.

This is the basic operation unit for ZFS. ZFS is a copy-on-write (COW) filesystem, so to modify even one byte of data stored inside a 128 kB record it must read the whole 128 kB record, modify it and write the 128 kB to a new place. This creates huge read and write amplification.

Small sizes: less read and write amplification for small random I/O, but more metadata overhead and worse compression ratios.

Big sizes: better sequential throughput and compression, but heavy amplification when only a small part of a record is modified.

Note: recordsize only defines an upper limit; files smaller than recordsize are stored in a correspondingly smaller single record. volblocksize, in contrast, is a fixed block size for the whole ZVOL.

Examples:
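A few examples of matching recordsize to the workload (the dataset names are placeholders, not from this page):

# large sequential files (backups, media): big records mean less metadata overhead
zfs set recordsize=1M tank/backups
# databases doing small random I/O: match the record to the database page size
zfs set recordsize=16K tank/mysql
# check the effective value
zfs get recordsize tank/backups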

Tune L2ARC for backups

When a huge portion of data is written (new backups) or read (backup verification), the L2ARC is constantly overwritten with the current data. To change this behaviour so that only the Most Frequently Used (MFU) data is cached:

/etc/modprobe.d/zfs.conf
options zfs l2arc_mfuonly=1 l2arc_noprefetch=0

Explanation:

l2arc_mfuonly=1 - only MFU (most frequently used) data and metadata are fed from ARC into L2ARC, so streaming backup data that is touched once does not flush the useful cache.
l2arc_noprefetch=0 - also allow prefetched buffers to be written to L2ARC (with the default of 1 they are skipped).
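The same parameters can also be applied at runtime, assuming the running OpenZFS module (2.0+) exposes them in /sys; the values revert on reboot unless the zfs.conf entry above stays in place:

echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
# verify
grep . /sys/module/zfs/parameters/l2arc_mfuonly /sys/module/zfs/parameters/l2arc_noprefetch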

I/O scheduler

If the whole device is managed by ZFS (not just a partition), ZFS sets the scheduler to none.

official recommendation

For rotational devices, it makes little sense to use the advanced schedulers cfq or bfq directly on the hard disk. Both depend on processes, process groups and applications, and in the ZFS case the I/O comes from a group of ZFS kernel processes.

The only scheduler worth considering is deadline / mq-deadline. The deadline scheduler groups reads into batches and writes into separate batches, ordered by increasing LBA address (so it should be good for HDDs).

There is a discussion in the OpenZFS project about not touching the scheduler anymore and leaving it to be configured by the admin:

my findings

There is a huge benefit to using bfq on rotational HDDs: no more huge lags during KVM backups (see the sketch below for switching the scheduler).

bfq honors ionice and:
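A minimal sketch of switching a rotational disk to bfq; sdX and the udev rule filename are placeholders:

# check and change the scheduler for one disk (resets on reboot)
cat /sys/block/sdX/queue/scheduler
echo bfq > /sys/block/sdX/queue/scheduler

# /etc/udev/rules.d/60-ioschedulers.rules - persist bfq for all rotational disks
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"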

HDD

ZFS Send & RaidZ - Poor performance on HDD #14916

# current value
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
# limit concurrent async writes per vdev to reduce latency spikes on HDDs
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
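To keep this value across reboots it can go into the same /etc/modprobe.d/zfs.conf used elsewhere on this page, followed by update-initramfs -u:

# /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_async_write_max_active=2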

Use a huge record size: it can help on SMR drives. Note: this only makes sense for ZFS file systems (datasets); it cannot be applied to a ZVOL.

zfs set recordsize=1M hddpool/data
zfs set recordsize=1M hddpool/vz

NOTE: SMR drives behave correctly for sequential writes, but a long-running ZFS pool or LVM thin pool spreads writes over lots of random locations, causing unusable IOPS. So never use SMR drives.

For ZVOLs: zvol volblocksize

Change ZFS volblock on a running Proxmox VM

Note: when no striping is used (a simple mirror), volblocksize should be 4k (or at least the same as the ashift block size).

Note: in recent Proxmox releases the default volblocksize was increased from 8k to 16k. When 8k is used, a warning is shown:

Warning: volblocksize (8192) is less than the default minimum block size (16384).
To reduce wasted space a volblocksize of 16384 is recommended.
zfs create -s -V 40G -o volblocksize=16k hddpool/data/vm-156-disk-0-16k
dd if=/dev/zvol/hddpool/data/vm-156-disk-0 of=/dev/zvol/hddpool/data/vm-156-disk-0-16k bs=1M status=progress conv=sparse
zfs rename hddpool/data/vm-156-disk-0 hddpool/data/vm-156-disk-0-backup
zfs rename hddpool/data/vm-156-disk-0-16k hddpool/data/vm-156-disk-0
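Before removing the backup volume it is worth verifying the result; a short check using the names from the example above:

# confirm the new block size and boot the VM from the copied volume
zfs get volblocksize hddpool/data/vm-156-disk-0
# only after the VM has been verified, reclaim the space
zfs destroy hddpool/data/vm-156-disk-0-backup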

Using bfq is mandatory here; see my findings above.

PostgreSQL

See Archlinux wiki: Databases

zfs set recordsize=8K <pool>/postgres
 
# ONLY for SSD/NVMe devices:
zfs set logbias=throughput <pool>/postgres
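The same properties can also be set when creating a dedicated dataset for the database (the pool name is a placeholder, as above):

# 8K records match the PostgreSQL page size
zfs create -o recordsize=8K <pool>/postgres
# logbias=throughput only on SSD/NVMe pools (see the note above)
zfs set logbias=throughput <pool>/postgres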

reduce ZFS ARC RAM usage

By default ZFS can use up to 50% of RAM for the ARC cache:

# apt install zfsutils-linux
 
# arcstat 
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
16:47:26     3     0      0     0    0     0    0     0    0   15G   15G   1.8G
# arc_summary 
 
ARC size (current):                                    98.9 %   15.5 GiB
        Target size (adaptive):                       100.0 %   15.6 GiB
        Min size (hard limit):                          6.2 %  999.6 MiB
        Max size (high water):                           16:1   15.6 GiB
        Most Frequently Used (MFU) cache size:         75.5 %   11.2 GiB
        Most Recently Used (MRU) cache size:           24.5 %    3.6 GiB
        Metadata cache size (hard limit):              75.0 %   11.7 GiB
        Metadata cache size (current):                  8.9 %    1.0 GiB
        Dnode cache size (hard limit):                 10.0 %    1.2 GiB
        Dnode cache size (current):                     5.3 %   63.7 MiB

The ARC size can be tuned by setting zfs kernel module parameters (Module Parameters):

Proxmox recommends the following rule:

As a general rule of thumb, allocate at least 2 GiB Base + 1 GiB/TiB-Storage
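For example, a hypothetical host with 8 TiB of ZFS storage would get 2 GiB + 8 GiB = 10 GiB of ARC:

# 2 GiB base + 1 GiB per TiB of storage = 10 GiB
echo "$[(2 + 8) * 1024*1024*1024]" > /sys/module/zfs/parameters/zfs_arc_max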

Examples

Set zfs_arc_max to 4 GiB and zfs_arc_min to 128 MiB:

echo "$[4 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
echo "$[128     *1024*1024]" >/sys/module/zfs/parameters/zfs_arc_min

Make the options persistent in /etc/modprobe.d/zfs.conf:

options zfs zfs_prefetch_disable=1
options zfs zfs_arc_max=4294967296
options zfs zfs_arc_min=134217728
options zfs zfs_arc_meta_limit_percent=75

and run update-initramfs -u so that the new module options are included in the initramfs.