Differences

This shows you the differences between two versions of the page.

vm:proxmox:ceph:performance [2025/10/29 17:42] – [performance on slow HDDs] niziak
vm:proxmox:ceph:performance [2026/06/07 21:13] (current) niziak
Line 6: Line 6:
   * [[https://www.reddit.com/r/ceph/comments/zpk0wo/new_to_ceph_hdd_pool_is_extremely_slow/|New to Ceph, HDD pool is extremely slow]]
   * [[https://forum.proxmox.com/threads/ceph-storage-performance.129408/#post-566971|Ceph Storage Performance]]
 +  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
 +  * [[https://www.boniface.me/posts/pvc-ceph-tuning-adventures/]]
  
 ===== Performance tips =====
Line 11: Line 13:
 Ceph is built for scale and works great in large clusters. In a small cluster every node will be heavily loaded.
  
 +  * Ceph ensures data safety: it waits for data to be written to the medium on all replicas. Use enterprise SSDs with PLP (Power Loss Protection) to reduce latency. Some people report an 8x speed increase.
   * adapt the PG count to the number of OSDs to spread traffic evenly
   * use ''krbd''
 +  * more OSDs = better parallelism
   * enable ''writeback'' on VMs (possible data loss on consumer SSDs)
 +  * MTU 9000 (jumbo frames) [[https://ceph.io/en/news/blog/2015/ceph-loves-jumbo-frames/|Ceph Loves Jumbo Frames]]
 +  * network latency < 200 µs (check with ''ping -s 1000 pve'')
 +  * C-States: [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
 +    * Ceph is highly sensitive to latency introduced by CPU C-state transitions. Set ''Max perf'' in the BIOS to disable C-states, or boot Linux with ''GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1"''
 +    * Disable IOMMU in kernel
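
The latency target from the list above can be checked roughly like this. A minimal sketch, assuming Linux iputils ''ping'' (summary line starting with ''rtt min/avg/max/mdev''); the helper names ''avg_rtt_us'' and ''latency_ok'' are made up for illustration, and the 200 µs limit is the figure from the list:

```shell
# Hypothetical helper: read ping output on stdin and print the average RTT
# in microseconds, taken from the iputils summary line, e.g.
#   rtt min/avg/max/mdev = 0.101/0.152/0.210/0.031 ms
avg_rtt_us() {
    awk -F'[/ ]+' '/^rtt / { printf "%.0f\n", $8 * 1000 }'
}

# Hypothetical helper: succeed if the average RTT is below the limit.
# $1: average RTT in µs, $2: limit in µs (defaults to 200)
latency_ok() {
    [ "$1" -lt "${2:-200}" ]
}

# Usage on a cluster node ("pve" is the host name used in the list above):
#   avg=$(ping -c 10 -s 1000 pve | avg_rtt_us)
#   latency_ok "$avg" && echo "OK: ${avg} us" || echo "too slow: ${avg} us"
```

Run it between every pair of nodes on the Ceph network, since a single slow link is enough to hold back replicated writes.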
  
 ==== performance on small cluster ====
Line 61: Line 70:
  
 Ceph has no tool to show primary PG balancing. A script is available at https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh
 +
 +==== fragmentation ====
 +
 +<code bash>
 +# ceph tell 'osd.*' bluestore allocator score block
 +osd.0: {
 +    "fragmentation_rating": 0.27187848765399758
 +}
 +osd.1: {
 +    "fragmentation_rating": 0.31147177012467503
 +}
 +osd.2: {
 +    "fragmentation_rating": 0.30870023661486262
 +}
 +osd.3: {
 +    "fragmentation_rating": 0.25266931194419928
 +}
 +osd.4: {
 +    "fragmentation_rating": 0.29409796398594706
 +}
 +osd.5: {
 +    "fragmentation_rating": 0.33731626650673441
 +}
 +osd.6: {
 +    "fragmentation_rating": 0.23903976339003158
 +}
 +</code>
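
With many OSDs the output above gets long. A small sketch that lists only the OSDs whose rating exceeds a limit, assuming the output format shown above; ''frag_above'' is a made-up helper name and the limit is an arbitrary value chosen by the operator, not an official Ceph threshold:

```shell
# Hypothetical helper: read the "bluestore allocator score" output on stdin
# and print "osd.N rating" for every OSD whose rating is above the limit.
# $1: limit, a number between 0 and 1
frag_above() {
    awk -v limit="$1" '
        # Remember which OSD the following rating belongs to.
        /^osd\.[0-9]+:/ { osd = $1; sub(/:$/, "", osd) }
        # Compare numerically and print only the entries above the limit.
        /"fragmentation_rating"/ { if ($2 + 0 > limit + 0) print osd, $2 }
    '
}

# Usage:
#   ceph tell 'osd.*' bluestore allocator score block | frag_above 0.3
```

Applied to the sample output above with a limit of 0.3, this would print osd.1, osd.2 and osd.5.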
 +
  
  
 ==== performance on slow HDDs ====
 +
 +Do not keep ''osd_memory_target'' below 2G:
 +<code bash>
 +ceph config set osd osd_memory_target 4294967296
 +ceph config get osd osd_memory_target
 +4294967296
 +</code>
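
''osd_memory_target'' is given in bytes, so 4294967296 above is 4 GiB. A tiny sketch for the conversion; ''gib_to_bytes'' is a made-up helper name:

```shell
# Hypothetical helper: convert GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Usage:
#   ceph config set osd osd_memory_target "$(gib_to_bytes 4)"
```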
 +
 +If the journal is on an SSD, raise ''journal_throttle_low_threshhold''. NOTE: check whether this is valid for BlueStore; it is probably a legacy FileStore parameter:
 +<code bash>
 +# internal parameter calculated from other parameters:
 +ceph config get osd journal_throttle_low_threshhold
 +0.600000
 +
 +# 5GB:
 +ceph config get osd osd_journal_size
 +5120
 +</code>
 +
 +==== bluestore_min_alloc_size ====
 +
 +  * Read: [[https://docs.ceph.com/en/reef/rados/configuration/bluestore-config-ref/#minimum-allocation-size]]
 +  * Restart of OSD needed
 +  * Impact: A smaller value reduces space waste (space amplification) but increases metadata overhead, while a larger value helps with large sequential writes but wastes space on small files.
 +  * The setting generally takes effect only on new or freshly deployed OSDs; existing OSDs keep the value they were created with
 +
 +<code bash>
 +# ceph tell 'osd.*' config show | grep bluestore_min_alloc
 +    "bluestore_min_alloc_size": "0",
 +    "bluestore_min_alloc_size_hdd": "4096",
 +    "bluestore_min_alloc_size_ssd": "4096",
 +
 +# ceph tell 'osd.*' config set bluestore_min_alloc_size_hdd 16384
 +</code>
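
The space trade-off described above can be estimated with simple rounding. A simplified sketch, assuming every object is rounded up to a whole number of allocation units (this ignores compression, metadata and other BlueStore details); ''allocated_bytes'' is a made-up helper name:

```shell
# Hypothetical helper: bytes occupied on disk when an object is rounded up
# to a multiple of the allocation unit.
# $1: object size in bytes, $2: allocation unit in bytes
allocated_bytes() {
    echo $(( ( ($1 + $2 - 1) / $2 ) * $2 ))
}

# Usage: compare a 1000-byte object under two allocation sizes
#   allocated_bytes 1000 4096
#   allocated_bytes 1000 16384
```

Under this model a 1000-byte object takes one full unit either way, so moving from 4096 to 16384 quadruples the space used by very small objects.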
 +
 +==== filestore_op_threads ====
 +
 +<code bash>
 +# ceph tell 'osd.*' config show | grep filestore_op_threads
 +    "filestore_op_threads": "2"
 +
 +# ceph tell 'osd.*' config set filestore_op_threads 4
 +</code>
  
 === mClock scheduler ===
  
-[[https://pve.proxmox.com/wiki/Ceph_mClock_Tuning]]
 +  * [[https://pve.proxmox.com/wiki/Ceph_mClock_Tuning]]
 + 
 +  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#osd-capacity-determination-automated]] 
 +  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#set-or-override-max-iops-capacity-of-an-osd]]
  
-[[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#osd-capacity-determination-automated]] 
 Upon startup, the Ceph mClock scheduler benchmarks the storage and configures IOPS according to the results:
  
Line 87: Line 170:
     "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
 </code> </code>
 +
 +Manual benchmark:
 +<code bash>
 +ceph tell 'osd.*' bench 12288000 4096 4194304 100
 +</code>
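
To read the numbers in the benchmark command: assuming the argument order given in the mClock reference linked above (total bytes, bytes per write, object size, number of objects), the run issues 12288000 / 4096 = 3000 writes of 4 KiB. A small sketch of that arithmetic; both helper names are made up, and the elapsed time in the usage line is a hypothetical value, not a measurement:

```shell
# Hypothetical helper: number of writes in a bench run.
# $1: total bytes, $2: bytes per write
bench_write_count() {
    echo $(( $1 / $2 ))
}

# Hypothetical helper: IOPS of a finished run, rounded to a whole number.
# $1: total bytes, $2: bytes per write, $3: elapsed seconds (may be fractional)
bench_iops() {
    awk -v t="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.0f\n", t / b / s }'
}

# Usage:
#   bench_write_count 12288000 4096     # writes issued by the command above
#   bench_iops 12288000 4096 1.5        # IOPS if the run took 1.5 s
```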
 +
 +Override settings:
  
 <code bash> <code bash>