linux:fs:zfs:tuning, revision 2026/04/14 21:43 (current), by niziak
====== ZFS performance tuning tips ======
Copy-paste snippet:

<code bash>
zfs set recordsize=1M rpool
zfs set recordsize=16M hddpool
zfs set recordsize=1M nvmpool
zfs set compression=zstd rpool
zfs set compression=zstd hddpool
zfs set compression=zstd nvmpool
</code>

**Note:** ''…''
See more in [[linux:…]]
===== stripe size =====

ZFS uses a dynamic stripe size: one stripe is one write transaction (limited by ''recordsize'').
So the ZFS dataset ''recordsize'' needs tuning for the given type of workload.

For example, on a pool composed as 3 x 2-disk HDD mirrors (''--bs'' and ''--numjobs'' were varied as in the results below):

<code bash>fio --name=rand-4k --ioengine=libaio --rw=randrw --rwmixread=70 --bs=1m --direct=1 --size=1G --numjobs=6 --iodepth=16 --runtime=60 --time_based --filename=fio_testfile --group_reporting</code>
  * zfs dataset with recordsize 128k:
    * BS=4k jobs=1 IOPS RW 214/91
    * BS=4k jobs=6 IOPS RW 2107/909
    * BS=16k jobs=1 IOPS RW 137/59
    * BS=16k jobs=6 IOPS RW 1277/549
    * BS=128k jobs=1 IOPS RW 190/82
    * BS=128k jobs=6 IOPS RW 549/239
    * BS=1m jobs=1 IOPS RW 48/21
    * BS=1m jobs=6 IOPS RW 164/71
    * BS=16m jobs=1 IOPS RW 9/4
    * BS=16m jobs=6 IOPS RW 17/7
  * zfs dataset with recordsize 1M:
    * BS=4k jobs=6 IOPS RW 21.7k/9k - aggregated
    * BS=128k jobs=6 IOPS RW 1125/484
    * BS=1m jobs=6 IOPS RW 232/101
    * BS=16m jobs=6 IOPS RW
  * zfs dataset with recordsize 16M:
    * BS=4k jobs=1 IOPS RW 38/16
    * BS=4k jobs=6 IOPS RW 156k/67k
    * BS=16k jobs=1 IOPS RW 31/14
    * BS=16k jobs=6 IOPS RW 122k/52k
    * BS=128k jobs=1 IOPS RW 20/9
    * BS=128k jobs=6 IOPS RW 17.7k/7607 - small IOPS are aggregated into 16M records
    * BS=1m jobs=1 IOPS RW 30/13
    * BS=1m jobs=6 IOPS RW 2586/1117
    * BS=16m jobs=1 IOPS RW 5/2
    * BS=16m jobs=6 IOPS RW 20/8

For example, on a pool composed as 6x HDD raidz2:

  * zfs dataset with recordsize 16M:
    * BS=128k jobs=6 IOPS RW 16.4k/7026
    * BS=1m jobs=6 IOPS RW 2472/1068
    * BS=16m jobs=6 IOPS RW 27/11
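Two opposite effects explain these results. A toy model of both (my own simplification, not OpenZFS code): a random sub-record write pays read-modify-write amplification, while many small queued I/Os that ZFS can coalesce are aggregated into one full-record write.

```python
K, M = 1024, 1024 ** 2

def rmw_amplification(io_bytes: int, recordsize: int) -> float:
    """Toy model: physical bytes moved per logical byte for one random
    sub-record write that cannot be aggregated - ZFS must read the old
    record and copy-on-write a full new one."""
    if io_bytes >= recordsize:
        return 1.0
    return (2 * recordsize) / io_bytes

def ios_per_record(recordsize: int, io_bytes: int) -> int:
    """How many small I/Os fit into one full-record write when ZFS can
    aggregate them (the 'aggregated' rows in the results above)."""
    return recordsize // io_bytes

print(rmw_amplification(4 * K, 128 * K))  # 64.0
print(ios_per_record(16 * M, 4 * K))      # 4096
```

This is why 4k random I/O on a 16M recordsize either collapses (uncoalesced) or posts huge aggregated IOPS numbers, depending on queue depth and job count.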
===== zil limit =====

ZFS parameter ''zil_slog_bulk'' ([[https://…]]):

<file ini /…>
options zfs zil_slog_bulk=67108864
options zfs l2arc_write_max=67108864
</file>

See similar for L2ARC: [[https://…]]
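Both tunables above use the same value, which is simply 64 MiB written out in bytes; a quick check:

```python
MIB = 1024 ** 2  # bytes per MiB

# 67108864 is the value set for zil_slog_bulk and l2arc_write_max above
print(67108864 // MIB)  # 64
print(64 * MIB)         # 67108864
```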
  * less metadata
  * less fragmentation
  * zvol: huge overhead if the guest uses small block sizes - try to match the guest FS block size to ''volblocksize'' - do not set a 4kB ''volblocksize''!
Note: ''…''
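One reason a small ''volblocksize'' wastes space on raidz: parity and padding are paid per block. A simplified allocation model (my own sketch for blocks fitting in a single stripe row at ashift=12, not the full OpenZFS allocator):

```python
import math

SECTOR = 4096  # ashift=12

def raidz_sectors(block_bytes: int, parity: int = 2) -> int:
    """Sectors allocated for one volblock on raidz (single-row blocks only):
    data sectors plus parity sectors, padded to a multiple of (parity + 1)
    so freed segments remain usable by the allocator."""
    data = math.ceil(block_bytes / SECTOR)
    total = data + parity
    pad_to = parity + 1
    return math.ceil(total / pad_to) * pad_to

# raidz2: a 4k volblock occupies 3 sectors for 1 sector of data (200% overhead)
print(raidz_sectors(4096))   # 3
# a 16k volblock occupies 6 sectors for 4 sectors of data (50% overhead)
print(raidz_sectors(16384))  # 6
```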
===== zvol for guest =====
  * do not use qcow2 files on ZFS
  * use 2 zvols per guest FS: one for data and a second one for the journal
===== Tune L2ARC for backups =====
  * [[https://…]]
| - | ===== I/O scheduler ===== | ||
| - | If whole device is managed by ZFS (not partition), ZFS sets scheduler to '' | ||
| - | ==== official recommendation ==== | ||
| - | For rotational devices, there is no sense to use advanced schedulers '' | ||
| - | |||
| - | Only possible scheduler to consider is '' | ||
| - | |||
| - | There is a discussion on OpenZFS project to do not touch schedulers anymore and let it to be configured by admin: | ||
| - | |||
| - | * [[https:// | ||
| - | * [[https:// | ||
| - | |||
| - | ==== my findings ==== | ||
| - | |||
| - | There is huge benefit to use '' | ||
| - | |||
| - | '' | ||
| - | |||
| - | * kernel '' | ||
| - | * kvm processes have prio '' | ||
| - | * kvm process during vzdump have '' | ||
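Switching a disk's scheduler at runtime can be sketched as follows (disk name ''sda'' is an example; persist the choice via a udev rule if needed):

```shell
# show the schedulers the kernel offers; the active one is in brackets
cat /sys/block/sda/queue/scheduler
# switch this disk to bfq
echo bfq > /sys/block/sda/queue/scheduler
```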
| - | |||
| - | ===== HDD ===== | ||
| - | |||
| - | [[https:// | ||
| - | |||
| - | <code bash> | ||
| - | cat / | ||
| - | echo 2 > / | ||
| - | cat / | ||
| - | |||
| - | |||
| - | </ | ||
| - | |||
| - | Use huge record size - it can help on SMR drives. Note: it only make sense for ZFS file system. Cannot be applied on ZVOL. | ||
| - | |||
| - | <code bash> | ||
| - | zfs set recordsize=1M hddpool/ | ||
| - | zfs set recordsize=1M hddpool/vz | ||
| - | |||
| - | |||
| - | </ | ||
| - | |||
| - | NOTE: SMR drives behaves correctly for sequential writes, but long working ZFS or LVM thin spread writes into lots of random location causing unusable IOPS. So never use SMR. | ||
| - | |||
| - | For ZVOLs: [[https:// | ||
| - | |||
| - | [[https:// | ||
| - | |||
| - | **Note: | ||
| - | |||
| - | **Note: | ||
| - | < | ||
| - | |||
| - | Warning: volblocksize (8192) is less than the default minimum block size (16384). | ||
| - | To reduce wasted space a volblocksize of 16384 is recommended. | ||
| - | |||
| - | </ | ||
| - | |||
| - | <code bash> | ||
| - | zfs create -s -V 40G hddpool/ | ||
| - | dd if=/ | ||
| - | zfs rename hddpool/ | ||
| - | zfs rename hddpool/ | ||
| - | |||
| - | |||
| - | </ | ||
| - | |||
| - | Use '' | ||
===== Postgresql =====
<code bash>
# apt install zfsutils-linux
# zarcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size
16:…
</code>
<code bash>
# zarcsummary -s arc
ARC size (current): …
Dnode cache size (hard limit): …
Dnode cache size (current): …
</code>
  * ''…''
  * ''…''
  * ''…''
Proxmox recommends the following [[https://…]]:

<code bash>
echo "$[4 * 1024*1024*1024]" > …
echo "…"
</code>
<file ini /…>
options zfs zfs_arc_min=134217728
options zfs zfs_arc_meta_limit_percent=75
</file>