====== ZFS performance tuning tips ======
| + | |||
| + | Copy-paste snippet: | ||
| + | <code bash> | ||
| + | zfs set recordsize=1M hddpool | ||
| + | zfs set recordsize=1M nvmpool | ||
| + | zfs set compression=zstd hddpool | ||
| + | zfs set compression=zstd nvmpool | ||
| + | </ | ||
| + | |||
| + | ===== zil limit ===== | ||
| + | |||
| + | ZFS parameter [[https:// | ||
| + | |||
| + | <file ini / | ||
| + | options zfs zil_slog_bulk=67108864 | ||
| + | options zfs l2arc_write_max=67108864 | ||
| + | </ | ||
| + | |||
| + | See similar for L2ARC: [[https:// | ||
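
Both values can be inspected (and changed) at runtime via sysfs:

<code bash>
cat /sys/module/zfs/parameters/zil_slog_bulk
cat /sys/module/zfs/parameters/l2arc_write_max
</code>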

===== recordsize / volblocksize =====

Size must be a power of 2.

  * ZFS file system: ''recordsize'' (default 128kB)
  * ZVOL block device: ''volblocksize''

This is the basic operation unit for ZFS. ZFS is a COW filesystem, so to modify even one byte of data stored inside a 128kB record it must read the whole 128kB record, modify it and write 128kB to a new place. This creates huge read and write amplification.

Small sizes:

  * are good for dedicated workloads, like databases, etc.
  * 4kB makes no sense with compression: a record cannot occupy less than one physical (ashift) block, so even well-compressible data saves nothing
  * slower sequential read - lots of IOPS and checksum checks
  * fragmentation over time

Big size:

  * benefits from compression, e.g. 128kB of data compressed to 16kB will create a 16kB record
  * very good for storage (write-once data)
  * read/write amplification for small reads/writes
  * good for sequential access
  * good for HDDs (less fragmentation = fewer seeks)
  * less metadata
  * less fragmentation
  * zvol: huge overhead if the guest is using small block sizes - try to match the guest FS block size to ''volblocksize'' - do not set a 4kB volblocksize!

Note: ''recordsize'' only defines an upper limit - smaller files still create smaller records. ''volblocksize'' is fixed: every ZVOL block occupies exactly this logical size.
| + | |||
| + | Examples: | ||
| + | |||
| + | * 16kB for MySQL/ | ||
| + | * 128kB for rotational HDDs | ||
| + | |||
| + | Check real usage by histogram: | ||
| + | |||
| + | <code bash> | ||
| + | zpool iostat -r | ||
| + | |||
| + | |||
| + | </ | ||
| + | |||
| + | ===== zvol for guest ===== | ||
| + | |||
| + | * match volblock size to guest block size | ||
| + | * do not use guest CoW filesystem on CoW (ZFS) | ||
| + | * do not use qcow2 files on ZFS | ||
| + | * use 2 zvols per guest FS - one for storage and second one for journal | ||

===== Tune L2ARC for backups =====

When a huge portion of data is written (new backups) or read (backup verify), the L2ARC is constantly overwritten with the current data. To change this behaviour so it caches only MFU (most frequently used) data:

<file conf /etc/modprobe.d/zfs.conf>
options zfs l2arc_mfuonly=1 l2arc_noprefetch=0
</file>

Explanation:

  * ''l2arc_mfuonly'' - feed L2ARC only from the MFU part of ARC, so one-shot (MRU) backup data does not evict useful cache
  * ''l2arc_noprefetch=0'' - also allow prefetched (streaming) buffers to be cached in L2ARC
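
Both parameters can be flipped at runtime to test the effect before making them persistent:

<code bash>
echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
</code>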

===== I/O scheduler =====

If the whole device is managed by ZFS (not a partition), ZFS sets the scheduler to ''none''.
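
Check which scheduler is active (''sda'' as an example; the entry in brackets is the selected one):

<code bash>
cat /sys/block/sda/queue/scheduler
</code>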

==== official recommendation ====

For rotational devices there is no sense in using the advanced schedulers: both of them arbitrate between processes, process groups and applications, while in this case all I/O comes from a single group of ZFS kernel threads. The only scheduler worth considering is ''none''.

There is a discussion in the OpenZFS project to stop touching schedulers and leave the choice to the admin:

  * [[https://
  * [[https://

==== my findings ====

There is a huge benefit to using ''mq-deadline'' for KVM guests on an HDD pool: no more huge lags during KVM backups (see the udev rule sketch below).
| + | '' | ||
| - | '' | + | |
| - | | + | |
| * kvm processes have prio '' | * kvm processes have prio '' | ||
| - | * kvm process during vzdump have '' | + | * kvm process during vzdump have '' |

===== HDD =====

<code bash>
cat /
echo 2 > /
cat /
</code>

Use a huge record size - it can help on SMR drives. Note: it only makes sense for ZFS file systems; it cannot be applied to a ZVOL.

<code bash>
zfs set recordsize=1M hddpool/
zfs set recordsize=1M hddpool/vz
</code>

NOTE: SMR drives behave correctly for sequential writes, but a long-running ZFS (or LVM thin) spreads writes over lots of random locations, causing unusable IOPS. So never use SMR.

For ZVOLs: [[https://

[[https://

**Note:** When no striping is used (simple mirror), volblocksize should be 4kB (or at least the same as ashift).

**Note:** The latest Proxmox default volblocksize was increased from 8k to 16k. When 8k is used, a warning is shown:

<code>
Warning: volblocksize (8192) is less than the default minimum block size (16384).
To reduce wasted space a volblocksize of 16384 is recommended.
</code>
| - | |||
| <code bash> | <code bash> | ||
| Line 103: | Line 152: | ||
| zfs rename hddpool/ | zfs rename hddpool/ | ||
| zfs rename hddpool/ | zfs rename hddpool/ | ||
| - | </ | ||
| + | </ | ||

Use ''dd'' to copy the data from the old ZVOL to the new one, then swap the names (see the sketch below).

===== Postgresql =====

<code bash>
# ONLY for SSD/NVM devices:
zfs set logbias=throughput <dataset>
</code>

===== ARC =====

By default ZFS can use 50% of RAM for the ARC cache:

<code bash>
# apt install zfsutils-linux
# arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size
16:
</code>

<code bash>
# arc_summary
ARC size (current):
...
Dnode cache size (hard limit):
Dnode cache size (current):
</code>

ARC size can be tuned by setting the module parameters:

  * ''zfs_arc_max''
  * ''zfs_arc_min''
  * ''zfs_arc_meta_limit_percent''
| - | |||
| Proxmox recommends following [[https:// | Proxmox recommends following [[https:// | ||
| + | < | ||
| - | | + | As a general rule of thumb, allocate at least 2 GiB Base + 1 GiB/ |
| + | |||
| + | </ | ||

==== Examples ====

Set ''zfs_arc_max'' to 4GB and ''zfs_arc_min'' to 128MB at runtime:

<code bash>
echo "$[4 * 1024*1024*1024]" > /sys/module/zfs/parameters/zfs_arc_max
echo "$[128 * 1024*1024]" > /sys/module/zfs/parameters/zfs_arc_min
</code>

Make options persistent:

<file ini /etc/modprobe.d/zfs.conf>
options zfs zfs_prefetch_disable=1
options zfs zfs_arc_max=4294967296
options zfs zfs_arc_min=134217728
options zfs zfs_arc_meta_limit_percent=75
</file>

and run ''update-initramfs -u'' afterwards, so the options are also applied from the initramfs (needed when root is on ZFS).