linux:fs:zfs:tuning (last modified 2025/08/23 08:45 by niziak)
====== ZFS performance tuning tips ======

Copy-paste snippet:

<code bash>
zfs set recordsize=1M hddpool
zfs set recordsize=1M nvmpool
zfs set compression=zstd hddpool
zfs set compression=zstd nvmpool
</code>
| + | |||
| + | ===== zil limit ===== | ||
| + | |||
| + | ZFS parameter [[https:// | ||
| + | |||
| + | <file ini / | ||
| + | options zfs zil_slog_bulk=67108864 | ||
| + | options zfs l2arc_write_max=67108864 | ||
| + | </ | ||
| + | |||
| + | See similar for L2ARC: [[https:// | ||
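Both values above are 64 MiB expressed in bytes; a quick sanity check of the arithmetic:

```shell
# 64 MiB in bytes - the value used for zil_slog_bulk and l2arc_write_max above.
bytes=$((64 * 1024 * 1024))
echo "$bytes"   # 67108864
```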
| + | |||
| + | ===== recordsize / volblocksize ===== | ||
| + | |||
| + | Size must be power of 2. | ||
| + | |||
| + | * ZFS file system: '' | ||
| + | * ZVOL block device: '' | ||
| + | |||
| + | This is basic operation unit for ZFS. ZFS is COW filesystem. So to modify even one byte of data stored inside 128kB record it must read 128kB record, modify it and store 128kB in new place. It creates huge read and write amplification. | ||
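To put a number on that amplification (a toy calculation, not from the original page): a 4 kB overwrite inside a 128 kB record turns into a full-record rewrite:

```shell
recordsize_kb=128
write_kb=4
# Write amplification factor for one sub-record overwrite:
echo $((recordsize_kb / write_kb))   # 32
```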
| + | |||
| + | Small sizes: | ||
| + | |||
| + | * are good for dedicated workloads, like databases, etc. | ||
| + | * 4kB has no sense with compression. Even if data is compressed below 4kB it still occupies smallest possible unit - 4kB. | ||
| + | * slower sequential read - lots of IOPS and checksum checks | ||
| + | * fragmentation over time | ||
| + | * metadata overhead (4kB data size needs also metada, so another 4kB unit will be used). | ||
| + | |||
| + | Big size: | ||
| + | |||
| + | * benefit from compression. I.e: 128kB data compressed to 16kB will create 16kB record size. | ||
| + | * very good for storage (write once data) | ||
| + | * read/write amplification for small read/writes | ||
| + | * good for sequential access | ||
| + | * good for HDDs (less fragmentation = less seeks) | ||
| + | * less metadata | ||
| + | * less fragmentation | ||
| + | * zvol: huge overhead if guest is using small block sizes - try to match guest FS block size to volblock - do not set 4kB volblock size ! | ||
| + | |||
| + | Note: '' | ||
| + | |||
| + | Examples: | ||
| + | |||
| + | * 16kB for MySQL/ | ||
| + | * 128kB for rotational HDDs | ||
| + | |||
| + | Check real usage by histogram: | ||
| + | |||
| + | <code bash> | ||
| + | zpool iostat -r | ||
| + | |||
| + | |||
| + | </ | ||
| + | |||
| + | ===== zvol for guest ===== | ||
| + | |||
| + | * match volblock size to guest block size | ||
| + | * do not use guest CoW filesystem on CoW (ZFS) | ||
| + | * do not use qcow2 files on ZFS | ||
| + | * use 2 zvols per guest FS - one for storage and second one for journal | ||
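A minimal sketch of the first point; the pool and volume names here are hypothetical, and ''volblocksize'' can only be set at creation time:

```shell
# Create a zvol whose volblocksize matches the guest workload
# (16k here, e.g. for a database guest). Names are examples only.
zfs create -o volblocksize=16k -V 32G hddpool/vm-100-disk-0

# Verify the property (read-only after creation):
zfs get volblocksize hddpool/vm-100-disk-0
```

These commands require a live pool and root privileges, so treat them as a template rather than something to run verbatim.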
===== Tune L2ARC for backups =====

When a huge portion of data is written (new backups) or read (backup verification), the L2ARC is constantly overwritten with the current data. To change this behaviour and cache only MFU (most frequently used) data:

<file conf /etc/modprobe.d/zfs.conf>
options zfs l2arc_mfuonly=1 l2arc_noprefetch=0
</file>

Explanation:

  * ''l2arc_mfuonly=1'' - cache only MFU buffers in L2ARC, skipping MRU data that is read just once
  * ''l2arc_noprefetch=0'' - also allow prefetched buffers to be written to L2ARC
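The same parameters can also be flipped at runtime via sysfs (requires root and a loaded zfs module):

```shell
# Runtime equivalent of the modprobe options above.
echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
```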
===== I/O scheduler =====

If a whole device is managed by ZFS (not a partition), ZFS sets its scheduler to ''none''.

==== official recommendation ====

For rotational devices there is no sense in using advanced schedulers such as ''bfq'' or ''mq-deadline'': they depend on processes, process groups and applications, and in this case there is only a group of kernel processes doing I/O for ZFS. The only scheduler to consider is ''none'', since ZFS schedules and prioritises its own I/O internally.

There is a discussion in the OpenZFS project about not touching schedulers anymore and leaving them to be configured by the admin:

  * [[https://
  * [[https://
==== my findings ====

There is a huge benefit to using ''bfq'' on rotational drives: no more huge lags during KVM backups.

''bfq'' takes per-process I/O priorities into account:

  * kernel ''
  * kvm processes have prio ''
  * kvm processes during vzdump have ''
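To try a scheduler out, it can be switched per device at runtime; ''sda'' below is a placeholder device name. For a persistent setup, the usual approach is a udev rule matching ''ATTR{queue/rotational}'':

```shell
# Non-persistent: select bfq for one disk at runtime (root required).
echo bfq > /sys/block/sda/queue/scheduler

# Verify - the active scheduler is shown in square brackets:
cat /sys/block/sda/queue/scheduler
```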
===== HDD =====

<code bash>
cat /
echo 2 > /
cat /
</code>

Use a huge record size - it can help on SMR drives. Note: it only makes sense for a ZFS file system; it cannot be applied to a ZVOL.

<code bash>
zfs set recordsize=1M hddpool/
zfs set recordsize=1M hddpool/vz
</code>

NOTE: SMR drives behave correctly for sequential writes, but a long-running ZFS (or LVM thin) spreads writes over lots of random locations, causing unusably low IOPS. So never use SMR drives.
For ZVOLs: [[https://
[[https://

**Note:** the default minimum ''volblocksize'' is now 16 kB; creating a zvol with a smaller ''volblocksize'' produces a warning:

<code>
Warning: volblocksize (8192) is less than the default minimum block size (16384).
To reduce wasted space a volblocksize of 16384 is recommended.
</code>
| + | |||
| + | <code bash> | ||
| + | zfs create -s -V 40G hddpool/ | ||
| + | dd if=/ | ||
| + | zfs rename hddpool/ | ||
| + | zfs rename hddpool/ | ||
| + | |||
| + | |||
| + | </ | ||
| + | Use '' | ||
===== Postgresql =====

<code bash>
# ONLY for SSD/NVM devices:
zfs set logbias=throughput <dataset>
</code>
By default ZFS can use 50% of RAM for the ARC cache:

<code bash>
# apt install zfsutils-linux
# arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size
16:
</code>
<code bash>
# arc_summary

ARC size (current):
Dnode cache size (hard limit):
Dnode cache size (current):
</code>
ARC size can be tuned by setting the following module parameters:

  * ''zfs_arc_max'' - maximum ARC size in bytes (0 means the default of 50% of RAM)
  * ''zfs_arc_min'' - minimum ARC size in bytes
  * ''zfs_arc_meta_limit_percent'' - percent of the ARC that may be used for metadata

Proxmox recommends the following [[https://

<code>
As a general rule of thumb, allocate at least 2 GiB Base + 1 GiB/TiB-Storage.
</code>
==== Examples ====

Set ''zfs_arc_max'' to 4 GiB and ''zfs_arc_min'' to 128 MiB at runtime:

<code bash>
echo "$((4 * 1024*1024*1024))" > /sys/module/zfs/parameters/zfs_arc_max
echo "$((128 * 1024*1024))" > /sys/module/zfs/parameters/zfs_arc_min
</code>
Make options persistent:

<file ini /etc/modprobe.d/zfs.conf>
options zfs zfs_prefetch_disable=1
options zfs zfs_arc_max=4294967296
options zfs zfs_arc_min=134217728
options zfs zfs_arc_meta_limit_percent=75
</file>

and run ''update-initramfs -u'' afterwards, so the options also take effect when the zfs module is loaded from the initramfs at early boot.
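After a reboot (or after writing the values via sysfs), the effective limits can be read back; a brief check, assuming the zfs module is loaded:

```shell
# Read back the effective ARC limits (in bytes).
cat /sys/module/zfs/parameters/zfs_arc_max
cat /sys/module/zfs/parameters/zfs_arc_min
```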