linux:fs:zfs:tuning [2026/04/14 21:43] (current) niziak
Copy-paste snippet:
<code bash>
zfs set recordsize=1M rpool
zfs set recordsize=16M hddpool
zfs set recordsize=1M nvmpool
zfs set compression=zstd rpool
zfs set compression=zstd hddpool
zfs set compression=zstd nvmpool
</code>

**Note:** ''zstd'' means ''zstd-3''. It is still a CPU-hungry compression algorithm, and this is visible in ''top'' monitoring. For heavy workloads such as build nodes, use ''lz4''.
See more in [[linux:fs:zfs:compression]]
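
Whether the chosen algorithm pays off can be checked per dataset; a quick sketch (the pool names match the snippet above):

<code bash>
# Show achieved compression ratio and configured algorithm for each dataset
zfs get -r compressratio,compression hddpool
# A ratio close to 1.00x means the data is mostly incompressible;
# compression=lz4 (or off) may be a better fit there
</code>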

===== stripe size =====

ZFS uses a dynamic stripe size: one stripe is one write transaction (limited by ''recordsize''). The ZFS dataset ''recordsize'' therefore needs tuning for the given type of workload.

For example, on a pool composed of 3 x 2 HDD mirrors (''--bs'' and ''--numjobs'' were varied between runs):

<code bash>fio --name=rand-4k --ioengine=libaio --rw=randrw --rwmixread=70 --bs=1m --direct=1 --size=1G --numjobs=6 --iodepth=16 --runtime=60 --time_based --filename=fio_testfile --group_reporting</code>
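
The result lists below presumably come from varying the block size and job count; such a sweep can be scripted, for example:

<code bash>
# Sweep of fio block sizes at 6 jobs; adjust --numjobs for the 1-job runs
for bs in 4k 16k 128k 1m 16m; do
    fio --name=rand-${bs} --ioengine=libaio --rw=randrw --rwmixread=70 \
        --bs=${bs} --direct=1 --size=1G --numjobs=6 --iodepth=16 \
        --runtime=60 --time_based --filename=fio_testfile --group_reporting
done
</code>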

  * zfs dataset with recordsize 128k:
    * BS=4k jobs=1 IOPS RW 214/91
    * BS=4k jobs=6 IOPS RW 2107/909
    * BS=16k jobs=1 IOPS RW 137/59
    * BS=16k jobs=6 IOPS RW 1277/549
    * BS=128k jobs=1 IOPS RW 190/82
    * BS=128k jobs=6 IOPS RW 549/239
    * BS=1m jobs=1 IOPS RW 48/21
    * BS=1m jobs=6 IOPS RW 164/71
    * BS=16m jobs=1 IOPS RW 9/4
    * BS=16m jobs=6 IOPS RW 17/7
  * zfs dataset with recordsize 1M:
    * BS=4k jobs=6 IOPS RW 21.7k/9k - aggregated
    * BS=128k jobs=6 IOPS RW 1125/484
    * BS=1m jobs=6 IOPS RW 232/101
    * BS=16m jobs=6 IOPS RW
  * zfs dataset with recordsize 16M:
    * BS=4k jobs=1 IOPS RW 38/16
    * BS=4k jobs=6 IOPS RW 156k/67k
    * BS=16k jobs=1 IOPS RW 31/14
    * BS=16k jobs=6 IOPS RW 122k/52k
    * BS=128k jobs=1 IOPS RW 20/9
    * BS=128k jobs=6 IOPS RW 17.7k/7607 - small IOPS are aggregated into 16M records
    * BS=1m jobs=1 IOPS RW 30/13
    * BS=1m jobs=6 IOPS RW 2586/1117
    * BS=16m jobs=1 IOPS RW 5/2
    * BS=16m jobs=6 IOPS RW 20/8

For example, on a pool composed of 6x HDD raidz2:
  * zfs dataset with recordsize 16M:
    * BS=128k jobs=6 IOPS RW 16.4k/7026
    * BS=1m jobs=6 IOPS RW 2472/1068
    * BS=16m jobs=6 IOPS RW 27/11
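
To reproduce such a comparison, a throwaway dataset per ''recordsize'' keeps the runs isolated; a sketch (''hddpool'' and the dataset names are just examples):

<code bash>
# One test dataset per recordsize under comparison
zfs create -o recordsize=128k hddpool/fio-128k
zfs create -o recordsize=1M   hddpool/fio-1m
zfs create -o recordsize=16M  hddpool/fio-16m
# Run the fio command inside each mountpoint, then clean up, e.g.:
# zfs destroy hddpool/fio-128k
</code>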
  
===== zil limit =====
  * [[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-noprefetch|l2arc_noprefetch]] Disables writing prefetched, but unused, buffers to cache devices. Setting to 0 can increase L2ARC hit rates for workloads where the ARC is too small for a read workload that benefits from prefetching. Also, if the main pool devices are **very slow**, setting to 0 can improve some workloads such as **backups**.
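
A runtime toggle plus a persistent setting might look like this (a sketch; measure L2ARC hit rates before and after):

<code bash>
# Allow prefetched buffers into L2ARC (runtime only, lost on reboot)
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
# Persist across reboots
echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf
</code>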
  
===== I/O scheduler =====

If the whole device is managed by ZFS (not just a partition), ZFS sets the scheduler to ''none''.

==== official recommendation ====

For rotational devices there is no point in using the advanced schedulers ''cfq'' or ''bfq'' directly on the hard disk. Both depend on processes, process groups and applications, and in this case there is only a group of kernel processes for ZFS.

The only scheduler worth considering is ''deadline'' / ''mq-deadline''. The ''deadline'' scheduler groups reads into batches and writes into separate batches, ordered by increasing LBA address (so it should be good for HDDs).
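
Checking and switching the scheduler for a given disk (''sda'' is a placeholder):

<code bash>
# The active scheduler is shown in brackets
cat /sys/block/sda/queue/scheduler
# Switch to mq-deadline at runtime
echo mq-deadline > /sys/block/sda/queue/scheduler
</code>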

There is a discussion in the OpenZFS project about no longer touching schedulers and leaving them to be configured by the admin:

  * [[https://github.com/openzfs/zfs/pull/9042|Set "none" scheduler if available (initramfs) #9042]]
  * [[https://github.com/openzfs/zfs/commit/42c24d90d112b6e9e1a304346a1335e058f1678b|https://github.com/openzfs/zfs/commit/42c24d90d112b6e9e1a304346a1335e058f1678b]]

==== my findings ====

There is a huge benefit to using ''bfq'' on rotational HDDs: no more huge lags during KVM backups.

''bfq'' honors ''ionice'', and:

  * kernel ''zvol'' processes have priority ''be/0''
  * kvm processes have priority ''be/4''
  * kvm processes during vzdump have ''be/7'' - NOTE: only with a patched version of kvm: ''pve-qemu-kvm'' >= ''8.1.5-6''.
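
To make ''bfq'' stick across reboots, a udev rule is one option; a sketch that targets rotational disks only (the rule filename is arbitrary):

<code bash>
cat > /etc/udev/rules.d/60-ioscheduler.rules <<'EOF'
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
EOF
udevadm control --reload-rules && udevadm trigger
</code>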

===== HDD =====

[[https://github.com/openzfs/zfs/discussions/14916|ZFS Send & RaidZ - Poor performance on HDD #14916]]

<code bash>
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
</code>

Use a huge record size - it can help on SMR drives. Note: it only makes sense for ZFS filesystems; it cannot be applied to a ZVOL.

<code bash>
zfs set recordsize=1M hddpool/data
zfs set recordsize=1M hddpool/vz
</code>

NOTE: SMR drives behave correctly for sequential writes, but a long-running ZFS or LVM thin setup spreads writes across lots of random locations, causing unusable IOPS. So never use SMR drives.

For ZVOLs: [[https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#zvol-volblocksize|zvol volblocksize]]

[[https://blog.guillaumematheron.fr/2023/500/change-zfs-volblock-on-a-running-proxmox-vm/|Change ZFS volblock on a running Proxmox VM]]

**Note:** When no striping is used (simple mirror), ''volblocksize'' should be 4kB (or at least the same as ''ashift'').
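
Both values can be verified before creating new volumes (pool and zvol names are placeholders):

<code bash>
# ashift is a pool/vdev property, volblocksize is per-zvol
zpool get ashift hddpool
zfs get volblocksize hddpool/data/vm-156-disk-0
</code>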

**Note:** The latest Proxmox default volblocksize was increased from 8k to 16k. When 8k is used, a warning is shown:
<code>
Warning: volblocksize (8192) is less than the default minimum block size (16384).
To reduce wasted space a volblocksize of 16384 is recommended.
</code>

<code bash>
zfs create -s -V 40G hddpool/data/vm-156-disk-0-16k -o volblocksize=16k
dd if=/dev/zvol/hddpool/data/vm-156-disk-0 of=/dev/zvol/hddpool/data/vm-156-disk-0-16k bs=1M status=progress conv=sparse
zfs rename hddpool/data/vm-156-disk-0 hddpool/data/vm-156-disk-0-backup
zfs rename hddpool/data/vm-156-disk-0-16k hddpool/data/vm-156-disk-0
</code>

Using ''bfq'' is mandatory. See [[#my_findings|my findings]].
  
===== Postgresql =====
# apt install zfsutils-linux

zarcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size      avail
16:47:26              0        0        0        0   15G   15G   1.8G
</code>
  
<code bash>
zarcsummary -s arc

ARC size (current):                                    98.9 %   15.5 GiB
        Dnode cache size (hard limit):                 10.0 %    1.2 GiB
        Dnode cache size (current):                     5.3 %   63.7 MiB
</code>
  
  * ''zfs_arc_max'': Maximum size of the ARC in bytes. If set to 0, the maximum size of the ARC is determined by the amount of system memory installed (50% on Linux).
  * ''zfs_arc_min'': Minimum ARC size limit. When the ARC is asked to shrink, it will stop shrinking at ''c_min'' as tuned by ''zfs_arc_min''.
  * ''zfs_arc_meta_balance'': Balance between metadata and data on ghost hits. Values above 100 increase metadata caching by proportionally reducing the effect of ghost data hits on the target data/metadata rate. [[https://openzfs.github.io/openzfs-docs/man/master/4/zfs.4.html#zfs_arc_meta_balance]]
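
The parameter can be inspected and changed at runtime; a sketch (1000 is an arbitrary example value):

<code bash>
# Current balance (default is 500 in recent OpenZFS)
cat /sys/module/zfs/parameters/zfs_arc_meta_balance
# Favour metadata more aggressively (example value)
echo 1000 > /sys/module/zfs/parameters/zfs_arc_meta_balance
</code>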
Proxmox recommends the following [[https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_limit_memory_usage|rule]]:
<code>
echo "$[4 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
echo "$[128     *1024*1024]" >/sys/module/zfs/parameters/zfs_arc_min
</code>
  
options zfs zfs_arc_min=134217728
options zfs zfs_arc_meta_limit_percent=75
</code>