ZFS performance tuning tips
recordsize / volblocksize
The size must be a power of 2.
- ZFS file system: recordsize (default 128kB)
- ZVOL block device: volblocksize (the old Solaris-based default was 8kB; with OpenZFS 2.2 the default was changed to 16k to "reduce wasted space")
This is the basic unit of operation for ZFS. ZFS is a copy-on-write (COW) filesystem, so to modify even one byte of data stored inside a 128kB record it must read the whole 128kB record, modify it and write the full 128kB to a new location. This creates significant read and write amplification.
Small sizes:
- good for dedicated workloads, such as databases
- 4kB makes no sense with compression: even if the data compresses, it still occupies a full 4kB allocation unit, so nothing is saved
- slower sequential reads - many more IOPS and checksum verifications
- more fragmentation over time
- metadata overhead (4kB of data also needs metadata, so another 4kB unit will be used)
Big sizes:
- benefit from compression, e.g. 128kB of data compressed to 16kB is stored as a 16kB record
- very good for storage of write-once data
- read/write amplification for small reads/writes
- good for sequential access
- good for HDDs (less fragmentation = fewer seeks)
- less metadata
- less fragmentation
Note: recordsize only defines an upper limit - files smaller than the record size are stored in a single, smaller record. For ZVOLs the volblocksize is a fixed block size for every block of the volume, so this does not apply.
Examples:
- 16kB for MySQL/InnoDB
- 128kB for rotational HDDs
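For example (a minimal sketch - the dataset names tank/mysql and tank/archive are just placeholders), the property is checked and set per dataset:
zfs get recordsize tank/mysql
zfs set recordsize=16k tank/mysql
zfs set recordsize=128k tank/archive
Changing recordsize affects only files created afterwards; existing files keep the record size they were written with.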
Tune L2ARC for backups
When a huge portion of data is written (new backups) or read (backup verify), the L2ARC is constantly overwritten with this one-shot data. To change this behaviour and cache only Most Frequently Used (MFU) data:
- /etc/modprobe.d/zfs.conf
options zfs l2arc_mfuonly=1 l2arc_noprefetch=0
Explanation:
- l2arc_mfuonly: Controls whether only MFU metadata and data are cached from ARC into L2ARC. This may be desirable to avoid wasting space on L2ARC when reading/writing large amounts of data that are not expected to be accessed more than once. By default both MRU and MFU data and metadata are cached in the L2ARC.
- l2arc_noprefetch: Disables writing prefetched, but unused, buffers to cache devices. Setting to 0 can increase L2ARC hit rates for workloads where the ARC is too small for a read workload that benefits from prefetching. Also, if the main pool devices are very slow, setting to 0 can improve some workloads such as backups.
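Both parameters can also be changed at runtime through the module parameter files under /sys (a quick sketch; the change takes effect immediately but is lost on reboot unless it is also put into /etc/modprobe.d/zfs.conf):
echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch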
I/O scheduler
If a whole device is managed by ZFS (not just a partition), ZFS sets the I/O scheduler to none.
official recommendation
For rotational devices there is no point in using the advanced schedulers cfq or bfq directly on the disk. Both base their decisions on processes, process groups and applications, and with ZFS all I/O is issued by a group of kernel processes. The only scheduler worth considering is deadline / mq-deadline.
The deadline scheduler groups reads into batches and writes into separate batches, ordered by increasing LBA (so it should be good for HDDs).
There is a discussion in the OpenZFS project about not touching the scheduler anymore and leaving it to be configured by the admin:
my findings
There is a huge benefit to using bfq on rotational HDDs: no more huge lags during KVM backups.
bfq honors ionice and:
- kernel zvol processes have prio be/0
- kvm processes have prio be/4
- kvm processes during vzdump have prio be/7
- NOTE: only with a patched version of kvm: pve-qemu-kvm >= 8.1.5-6.
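A minimal sketch of switching a rotational disk to bfq at runtime (assuming the disk is /dev/sda - adjust the device name; the setting does not survive a reboot, so make it persistent, e.g. with a udev rule):
cat /sys/block/sda/queue/scheduler        # the active scheduler is shown in [brackets]
echo bfq > /sys/block/sda/queue/scheduler
A possible udev rule for all rotational disks (hypothetical file name /etc/udev/rules.d/60-ioscheduler.rules):
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"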
HDD
ZFS Send & RaidZ - Poor performance on HDD #14916
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
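zfs_vdev_async_write_max_active caps the number of asynchronous write I/Os kept active on each vdev; lowering it can keep slow HDDs responsive. To keep the value above across reboots it can be added to /etc/modprobe.d/zfs.conf like the other module parameters on this page (a sketch, assuming 2 is the value you settled on):
options zfs zfs_vdev_async_write_max_active=2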
Use a huge record size - it can help on SMR drives. Note: this only makes sense for a ZFS file system; it cannot be applied to a ZVOL.
zfs set recordsize=1M hddpool/data
zfs set recordsize=1M hddpool/vz
NOTE: SMR drives behave correctly for sequential writes, but a long-running ZFS (or LVM thin) spreads writes over lots of random locations, causing unusably low IOPS. So never use SMR drives.
For ZVOLs: zvol volblocksize
Change ZFS volblock on a running Proxmox VM
Note: When no striping is used (a simple mirror), volblocksize should be 4kB (or at least equal to the sector size given by ashift).
Note: The latest Proxmox default volblocksize was increased from 8k to 16k. When 8k is used, a warning is shown:
Warning: volblocksize (8192) is less than the default minimum block size (16384). To reduce wasted space a volblocksize of 16384 is recommended.
zfs create -s -V 40G -o volblocksize=16k hddpool/data/vm-156-disk-0-16k
dd if=/dev/zvol/hddpool/data/vm-156-disk-0 of=/dev/zvol/hddpool/data/vm-156-disk-0-16k bs=1M status=progress conv=sparse
zfs rename hddpool/data/vm-156-disk-0 hddpool/data/vm-156-disk-0-backup
zfs rename hddpool/data/vm-156-disk-0-16k hddpool/data/vm-156-disk-0
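Before removing the backup volume it may be worth a quick sanity check that the new zvol really has the expected block size and volume size (just a verification sketch):
zfs get volblocksize,volsize hddpool/data/vm-156-disk-0
zfs get volblocksize,volsize hddpool/data/vm-156-disk-0-backup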
Using bfq is mandatory here. See my findings.
PostgreSQL
See Archlinux wiki: Databases
zfs set recordsize=8K <pool>/postgres
# ONLY for SSD/NVMe devices:
zfs set logbias=throughput <pool>/postgres
reduce ZFS ARC RAM usage
By default ZFS can use up to 50% of RAM for the ARC cache:
# apt install zfsutils-linux
# arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
16:47:26     3     0      0     0    0     0    0     0    0   15G   15G   1.8G
# arc_summary
ARC size (current):                     98.9 %   15.5 GiB
Target size (adaptive):                100.0 %   15.6 GiB
Min size (hard limit):                   6.2 %  999.6 MiB
Max size (high water):                    16:1   15.6 GiB
Most Frequently Used (MFU) cache size:  75.5 %   11.2 GiB
Most Recently Used (MRU) cache size:    24.5 %    3.6 GiB
Metadata cache size (hard limit):       75.0 %   11.7 GiB
Metadata cache size (current):           8.9 %    1.0 GiB
Dnode cache size (hard limit):          10.0 %    1.2 GiB
Dnode cache size (current):              5.3 %   63.7 MiB
ARC size can be tuned by setting zfs kernel module parameters (Module Parameters):
- zfs_arc_max: Maximum size of ARC in bytes. If set to 0, the maximum size of the ARC is determined by the amount of installed system memory (50% on Linux).
- zfs_arc_min: Minimum ARC size limit. When the ARC is asked to shrink, it will stop shrinking at c_min, as tuned by zfs_arc_min.
- zfs_arc_meta_limit_percent: Sets the limit on ARC metadata, arc_meta_limit, as a percentage of the maximum size target of the ARC, c_max. Default is 75.
Proxmox recommends the following rule:
As a general rule of thumb, allocate at least 2 GiB Base + 1 GiB/TiB-Storage
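For example, simply applying that rule of thumb, a host with 8 TiB of ZFS storage would get an ARC of at least 2 GiB + 8 x 1 GiB = 10 GiB.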
Examples
Set zfs_arc_max to 4GB and zfs_arc_min to 128MB:
echo "$[4 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max echo "$[128 *1024*1024]" >/sys/module/zfs/parameters/zfs_arc_min
Make the options persistent in /etc/modprobe.d/zfs.conf:
options zfs zfs_prefetch_disable=1
options zfs zfs_arc_max=4294967296
options zfs zfs_arc_min=134217728
options zfs zfs_arc_meta_limit_percent=75
and run update-initramfs -u so the new options are also included in the initramfs.