Differences

This shows you the differences between two versions of the page.

linux:fs:zfs:tuning [2025/01/04 18:14] niziak
linux:fs:zfs:tuning [2025/08/23 08:45] (current) niziak
Line 1: Line 1:
 ====== ZFS performance tuning tips ====== ====== ZFS performance tuning tips ======
 +
 +Copy-paste snippet:
 +<code bash>
 +zfs set recordsize=1M hddpool
 +zfs set recordsize=1M nvmpool
 +zfs set compression=zstd hddpool
 +zfs set compression=zstd nvmpool
 +</code>
 +
 +===== zil limit =====
 +
 +ZFS parameter [[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zil-slog-bulk|zil_slog_bulk]] is responsible for ''throttling'' the load on the LOG device. In older ZFS versions the value was 768kB; currently it is 64MB. Sync writes above this size per commit are issued with lower (asynchronous) priority, so that a single heavy ZIL writer cannot saturate the LOG device.
 +
 +<file ini /etc/modprobe.d/zfs.conf>
 +options zfs zil_slog_bulk=67108864
 +options zfs l2arc_write_max=67108864
 +</file>
 +
 +See the similar parameter for L2ARC: [[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-write-max|l2arc_write_max]]
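
Both parameters are given in bytes. The snippet below is a minimal sketch, not part of the original page: it converts MiB to bytes and prints the live values when the ''zfs'' module is loaded. The helper names ''mib_to_bytes'' and ''show_zfs_param'' are made up for illustration.

```shell
# Convert MiB to bytes, the unit used by zil_slog_bulk and l2arc_write_max.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print the live value of a ZFS module parameter, or a note when the
# zfs module is not loaded on this machine.
show_zfs_param() {
    local path="/sys/module/zfs/parameters/$1"
    if [ -r "$path" ]; then
        echo "$1 = $(cat "$path")"
    else
        echo "$1: not available (zfs module not loaded?)"
    fi
}

mib_to_bytes 64                 # prints 67108864
show_zfs_param zil_slog_bulk
show_zfs_param l2arc_write_max
```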
 +
 +===== recordsize / volblocksize =====
 +
 +Size must be a power of 2.
 +
 +  * ZFS file system: ''recordsize''  (default 128kB)
 +  * ZVOL block device: ''volblocksize''  (the Solaris-based default was 8kB). OpenZFS 2.2 changed the default to 16k to "reduce wasted space".
 +
 +This is the basic unit of operation for ZFS. ZFS is a COW filesystem, so to modify even one byte stored inside a 128kB record it must read the whole 128kB record, modify it and write 128kB to a new place. This creates huge read and write amplification.
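
The amplification can be estimated with simple arithmetic. This is a rough sketch under a simplified model (one small in-place update rewrites one full record; caching, compression and metadata are ignored); the function name ''write_amplification'' is made up for illustration.

```shell
# Bytes rewritten for one small in-place update divided by the bytes the
# application actually changed (simplified model, integer division).
write_amplification() {
    local record_bytes=$1
    local write_bytes=$2
    echo $(( record_bytes / write_bytes ))
}

write_amplification $(( 128 * 1024 )) $(( 4 * 1024 ))   # 128kB record, 4kB write: prints 32
write_amplification $(( 16 * 1024 )) $(( 4 * 1024 ))    # 16kB record, 4kB write: prints 4
```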
 +
 +Small sizes:
 +
 +  * are good for dedicated workloads, like databases, etc.
 +  * 4kB makes no sense with compression. Even if data compresses below 4kB it still occupies the smallest possible unit - 4kB.
 +  * slower sequential read - lots of IOPS and checksum checks
 +  * fragmentation over time
 +  * metadata overhead (4kB of data also needs metadata, so another 4kB unit will be used).
 +
 +Big size:
 +
 +  * benefits from compression. E.g. 128kB of data compressed to 16kB will create a 16kB record.
 +  * very good for storage (write once data)
 +  * read/write amplification for small read/writes
 +  * good for sequential access
 +  * good for HDDs (less fragmentation = less seeks)
 +  * less metadata
 +  * less fragmentation
 +  * zvol: huge overhead if the guest uses small block sizes - try to match the guest FS block size to volblocksize - do not set a 4kB volblocksize!
 +
 +Note: ''recordsize''  / ''volblocksize''  only define the upper limit. Smaller data can still create a smaller record (is it true for block?).
 +
 +Examples:
 +
 +  * 16kB for MySQL/InnoDB
 +  * 128kB for rotational HDDs
 +
 +Check real usage by histogram:
 +
 +<code bash>
 +zpool iostat -r
 +
 +
 +</code>
 +
 +===== zvol for guest =====
 +
 +  * match volblock size to guest block size
 +  * do not use guest CoW filesystem on CoW (ZFS)
 +  * do not use qcow2 files on ZFS
 +  * use 2 zvols per guest FS - one for storage and second one for journal
  
 ===== Tune L2ARC for backups ===== ===== Tune L2ARC for backups =====
  
-When huge portion of data are written (new backups) or read (backup verify) L2ARC is constantly written with current data. +When a huge amount of data is written (new backups) or read (backup verify), L2ARC is constantly overwritten with current data. To change this behaviour to cache only ''Most Frequently Used'' (MFU) data:
-To change this behaviour to cache only ''Most Frequent Use'':+
  
 <file conf /etc/modprobe.d/zfs.conf> <file conf /etc/modprobe.d/zfs.conf>
 options zfs l2arc_mfuonly=1 l2arc_noprefetch=0 options zfs l2arc_mfuonly=1 l2arc_noprefetch=0
 +
 +
 </file> </file>
  
 Explanation: Explanation:
 +
   * [[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-mfuonly|l2arc_mfuonly]] Controls whether only MFU metadata and data are cached from ARC into L2ARC. This may be desirable to avoid wasting space on L2ARC when reading/writing large amounts of data that are not expected to be accessed more than once. By default both MRU and MFU data and metadata are cached in the L2ARC.   * [[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-mfuonly|l2arc_mfuonly]] Controls whether only MFU metadata and data are cached from ARC into L2ARC. This may be desirable to avoid wasting space on L2ARC when reading/writing large amounts of data that are not expected to be accessed more than once. By default both MRU and MFU data and metadata are cached in the L2ARC.
   * [[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-noprefetch|l2arc_noprefetch]] Disables writing prefetched, but unused, buffers to cache devices. Setting to 0 can increase L2ARC hit rates for workloads where the ARC is too small for a read workload that benefits from prefetching. Also, if the main pool devices are **very slow**, setting to 0 can improve some workloads such as **backups**.   * [[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-noprefetch|l2arc_noprefetch]] Disables writing prefetched, but unused, buffers to cache devices. Setting to 0 can increase L2ARC hit rates for workloads where the ARC is too small for a read workload that benefits from prefetching. Also, if the main pool devices are **very slow**, setting to 0 can improve some workloads such as **backups**.
Line 16: Line 87:
 ===== I/O scheduler ===== ===== I/O scheduler =====
  
- +If a whole device is managed by ZFS (not a partition), ZFS sets the scheduler to ''none''.
-If whole device is managed by ZFS (not partition), ZFS sets scheduler to ''none''+
  
 ==== official recommendation ==== ==== official recommendation ====
  
-For rotational devices, there is no sense to use advanced schedulers ''cfq'' or ''bfq'' directly on hard disc. +For rotational devices, it makes no sense to use the advanced schedulers ''cfq''  or ''bfq''  directly on a hard disk. Both depend on processes, process groups and applications. In this case there is only a group of kernel processes for ZFS.
-Both depends on processes, processes groups and application. In this case there is group of kernel processess for ZFS.+
  
-Only possible scheduler to consider is ''deadline'' / ''mq-deadline'' +The only scheduler worth considering is ''deadline''  / ''mq-deadline''. The ''deadline''  scheduler groups reads into batches and writes into separate batches, ordered by increasing LBA address (so it should be good for HDDs).
-''Deadline'' scheduler group reads into batches and writed into separate batches ordering by increasing LBA address (so it should be good for HDDs).+
  
 There is a discussion in the OpenZFS project about no longer touching schedulers and letting the admin configure them: There is a discussion in the OpenZFS project about no longer touching schedulers and letting the admin configure them:
 +
   * [[https://github.com/openzfs/zfs/pull/9042|Set "none" scheduler if available (initramfs) #9042]]   * [[https://github.com/openzfs/zfs/pull/9042|Set "none" scheduler if available (initramfs) #9042]]
-  * [[https://github.com/openzfs/zfs/commit/42c24d90d112b6e9e1a304346a1335e058f1678b]]+  * [[https://github.com/openzfs/zfs/commit/42c24d90d112b6e9e1a304346a1335e058f1678b|https://github.com/openzfs/zfs/commit/42c24d90d112b6e9e1a304346a1335e058f1678b]]
  
 ==== my findings ==== ==== my findings ====
  
-There is huge benefit to use ''bfq'' on rotational HDD. +There is a huge benefit in using ''bfq''  on rotational HDDs. No more huge lags during KVM backups. 
-No more huge lags during KVM backups.+ 
 +''bfq''  honors ''ionice''  and:
  
-''bfq'' honor ''ionice'' and: +  * kernel ''zvol''  processes have prio ''be/0''
-  * kernel ''zvol'' processes have prio ''be/0''+
   * kvm processes have prio ''be/4''   * kvm processes have prio ''be/4''
-  * kvm process during vzdump have ''be/7'' - NOTE: only with patched version of kvm: ''pve-qemu-kvm'' >= ''8.1.5-6''.+  * kvm process during vzdump have ''be/7''  - NOTE: only with patched version of kvm: ''pve-qemu-kvm''>= ''8.1.5-6''.
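
To check which scheduler a disk is using: the kernel lists the available schedulers in ''/sys/block/<dev>/queue/scheduler'' and marks the active one with square brackets. The sketch below parses such a line; the helper name ''active_scheduler'' and the device name ''sdX'' are placeholders, not taken from the original page.

```shell
# Extract the active scheduler (the name in square brackets) from the
# contents of /sys/block/<dev>/queue/scheduler.
active_scheduler() {
    echo "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

active_scheduler "none mq-deadline [bfq]"    # prints bfq

# On a real host (sdX is a placeholder; needs root and bfq support in the kernel):
#   active_scheduler "$(cat /sys/block/sdX/queue/scheduler)"
#   echo bfq > /sys/block/sdX/queue/scheduler
```

A setting made through sysfs does not survive a reboot.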
  
 ===== HDD ===== ===== HDD =====
Line 46: Line 115:
  
 <code bash> <code bash>
-cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active  
-echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active  
 cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
 +echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
 +cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
 +
 +
 </code> </code>
  
 Use a huge record size - it can help on SMR drives. Note: it only makes sense for a ZFS file system and cannot be applied to a ZVOL. Use a huge record size - it can help on SMR drives. Note: it only makes sense for a ZFS file system and cannot be applied to a ZVOL.
 +
 <code bash> <code bash>
 zfs set recordsize=1M hddpool/data zfs set recordsize=1M hddpool/data
 zfs set recordsize=1M hddpool/vz zfs set recordsize=1M hddpool/vz
 +
 +
 </code> </code>
  
 NOTE: SMR drives behave correctly for sequential writes, but a long-running ZFS or LVM thin pool spreads writes across lots of random locations, causing unusable IOPS. So never use SMR. NOTE: SMR drives behave correctly for sequential writes, but a long-running ZFS or LVM thin pool spreads writes across lots of random locations, causing unusable IOPS. So never use SMR.
- 
  
 For ZVOLs: [[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#zvol-volblocksize|zvol volblocksize]] For ZVOLs: [[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#zvol-volblocksize|zvol volblocksize]]
Line 64: Line 137:
 [[https://blog.guillaumematheron.fr/2023/500/change-zfs-volblock-on-a-running-proxmox-vm/|Change ZFS volblock on a running Proxmox VM]] [[https://blog.guillaumematheron.fr/2023/500/change-zfs-volblock-on-a-running-proxmox-vm/|Change ZFS volblock on a running Proxmox VM]]
  
-Note: is no stripping is used (simple mirror) volblocksize should be 4kB (or at least the same as ashift)+**Note:**  When no striping is used (simple mirror) volblocksize should be 4kB (or at least the same as ashift)
 + 
 +**Note:**  The latest Proxmox default volblocksize was increased from 8k to 16k. When 8k is used, a warning is shown: 
 +<code> 
 + 
 +Warning: volblocksize (8192) is less than the default minimum block size (16384). 
 +To reduce wasted space a volblocksize of 16384 is recommended. 
 + 
 +</code>
  
 <code bash> <code bash>
Line 71: Line 152:
 zfs rename hddpool/data/vm-156-disk-0 hddpool/data/vm-156-disk-0-backup zfs rename hddpool/data/vm-156-disk-0 hddpool/data/vm-156-disk-0-backup
 zfs rename hddpool/data/vm-156-disk-0-16k hddpool/data/vm-156-disk-0 zfs rename hddpool/data/vm-156-disk-0-16k hddpool/data/vm-156-disk-0
-</code> 
  
  
 +</code>
  
-Use ''bfq'' is mandator. See [[#my findings]]. +Using ''bfq''  is mandatory. See [[#my_findings|my findings]].
- +
  
 ===== Postgresql ===== ===== Postgresql =====
Line 88: Line 167:
 # ONLY for SSD/NVM devices: # ONLY for SSD/NVM devices:
 zfs set logbias=throughput <pool>/postgres zfs set logbias=throughput <pool>/postgres
 +
 +
 </code> </code>
  
Line 93: Line 174:
  
 By default ZFS can use 50% of RAM for ARC cache: By default ZFS can use 50% of RAM for ARC cache:
 +
 <code bash> <code bash>
 # apt install zfsutils-linux # apt install zfsutils-linux
  
-# arcstat +# arcstat
     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size      avail     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size      avail
 16:47:26              0        0        0        0   15G   15G   1.8G 16:47:26              0        0        0        0   15G   15G   1.8G
 +
 +
 </code> </code>
  
 <code bash> <code bash>
-# arc_summary +# arc_summary
  
 ARC size (current):                                    98.9 %   15.5 GiB ARC size (current):                                    98.9 %   15.5 GiB
Line 114: Line 198:
         Dnode cache size (hard limit):                 10.0 %    1.2 GiB         Dnode cache size (hard limit):                 10.0 %    1.2 GiB
         Dnode cache size (current):                     5.3 %   63.7 MiB         Dnode cache size (current):                     5.3 %   63.7 MiB
 +
 +
 </code> </code>
  
-ARC size can be tuned by settings ''zfs'' kernel module parameters ([[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-arc-max|Module Parameters]]):+ARC size can be tuned by setting ''zfs''  kernel module parameters ([[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-arc-max|Module Parameters]]): 
   * ''zfs_arc_max'': Maximum size of ARC in bytes. If set to 0 then the maximum size of ARC is determined by the amount of system memory installed (50% on Linux)   * ''zfs_arc_max'': Maximum size of ARC in bytes. If set to 0 then the maximum size of ARC is determined by the amount of system memory installed (50% on Linux)
-  * ''zfs_arc_min'': Minimum ARC size limit. When the ARC is asked to shrink, it will stop shrinking at ''c_min'' as tuned by ''zfs_arc_min''.+  * ''zfs_arc_min'': Minimum ARC size limit. When the ARC is asked to shrink, it will stop shrinking at ''c_min''  as tuned by ''zfs_arc_min''.
   * ''zfs_arc_meta_limit_percent'': Sets the limit to ARC metadata, arc_meta_limit, as a percentage of the maximum size target of the ARC, ''c_max''. Default is 75.   * ''zfs_arc_meta_limit_percent'': Sets the limit to ARC metadata, arc_meta_limit, as a percentage of the maximum size target of the ARC, ''c_max''. Default is 75.
- 
 Proxmox recommends following [[https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_limit_memory_usage|rule]]: Proxmox recommends following [[https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_limit_memory_usage|rule]]:
 +<code>
  
-  As a general rule of thumb, allocate at least 2 GiB Base + 1 GiB/TiB-Storage+As a general rule of thumb, allocate at least 2 GiB Base + 1 GiB/TiB-Storage 
 + 
 +</code>
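
The rule of thumb can be turned into a byte value for ''zfs_arc_max''. This is a minimal sketch assuming the pool size is given in whole TiB; the function name ''arc_bytes_for_pool'' is made up for illustration.

```shell
# Proxmox rule of thumb: 2 GiB base + 1 GiB per TiB of storage.
# Argument: storage size in whole TiB. Prints the ARC size in bytes.
arc_bytes_for_pool() {
    local tib=$1
    echo $(( (2 + tib) * 1024 * 1024 * 1024 ))
}

arc_bytes_for_pool 8    # 8 TiB of storage -> 10 GiB: prints 10737418240

# Produce a line suitable for /etc/modprobe.d/zfs.conf:
echo "options zfs zfs_arc_max=$(arc_bytes_for_pool 8)"
```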
  
 ==== Examples ==== ==== Examples ====
-   + 
-Set ''zfs_arc_max'' to 4GB and ''zfs_arc_min'' to 128MB:+Set ''zfs_arc_max''  to 4GB and ''zfs_arc_min''  to 128MB: 
 <code bash> <code bash>
 echo "$((4 * 1024*1024*1024))" >/sys/module/zfs/parameters/zfs_arc_max echo "$((4 * 1024*1024*1024))" >/sys/module/zfs/parameters/zfs_arc_max
 echo "$((128     *1024*1024))" >/sys/module/zfs/parameters/zfs_arc_min echo "$((128     *1024*1024))" >/sys/module/zfs/parameters/zfs_arc_min
 +
 +
 </code> </code>
  
 Make options persistent: Make options persistent:
-<file /etc/modprobe.d/zfs.conf>+ 
 +<code conf /etc/modprobe.d/zfs.conf>
 options zfs zfs_prefetch_disable=1 options zfs zfs_prefetch_disable=1
 options zfs zfs_arc_max=4294967296 options zfs zfs_arc_max=4294967296
 options zfs zfs_arc_min=134217728 options zfs zfs_arc_min=134217728
 options zfs zfs_arc_meta_limit_percent=75 options zfs zfs_arc_meta_limit_percent=75
-</file>+ 
 + 
 +</code>
  
 and ''update-initramfs -u'' and ''update-initramfs -u''