Differences

This shows you the differences between two versions of the page.

vm:proxmox:ceph:performance [2024/05/17 19:33] niziakvm:proxmox:ceph:performance [2025/11/01 10:36] (current) niziak
Line 6: Line 6:
   * [[https://www.reddit.com/r/ceph/comments/zpk0wo/new_to_ceph_hdd_pool_is_extremely_slow/|New to Ceph, HDD pool is extremely slow]]   * [[https://www.reddit.com/r/ceph/comments/zpk0wo/new_to_ceph_hdd_pool_is_extremely_slow/|New to Ceph, HDD pool is extremely slow]]
   * [[https://forum.proxmox.com/threads/ceph-storage-performance.129408/#post-566971|Ceph Storage Performance]]   * [[https://forum.proxmox.com/threads/ceph-storage-performance.129408/#post-566971|Ceph Storage Performance]]
 +  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
 +  * [[https://www.boniface.me/posts/pvc-ceph-tuning-adventures/]]
  
 ===== Performance tips ===== ===== Performance tips =====
Line 13: Line 15:
   * adapt PG to number of OSDs to spread traffic evenly   * adapt PG to number of OSDs to spread traffic evenly
   * use ''krbd''   * use ''krbd''
 +  * more OSDs = better parallelism
   * enable ''writeback'' on VMs (possible data loss on consumer SSDs)   * enable ''writeback'' on VMs (possible data loss on consumer SSDs)
 +  * MTU 9000 (jumbo frames) [[https://ceph.io/en/news/blog/2015/ceph-loves-jumbo-frames/|Ceph Loves Jumbo Frames]]
 +  * net latency <200us (''ping -s 1000 pve'')
 +  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
 +    * Ceph is incredibly sensitive to latency introduced by CPU c-state transitions. Set ''Max perf'' in BIOS to disable C-States or boot Linux with ''GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1" ''
 +    * Disable IOMMU in kernel
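The latency target above can be checked with a small helper. This is a minimal sketch, assuming the summary line format printed by Linux iputils `ping` (`rtt min/avg/max/mdev = ... ms`); the function names `avg_rtt_us` and `latency_ok` are hypothetical and not part of Ceph or Proxmox.

```bash
#!/usr/bin/env bash
# Extract the average round-trip time in microseconds from the summary line
# that Linux iputils ping prints ("rtt min/avg/max/mdev = a/b/c/d ms").
avg_rtt_us() {
    # Splitting on "/" puts the average value into field 5.
    awk -F'/' '/^rtt / { printf "%.0f\n", $5 * 1000 }'
}

# Succeed when the average RTT read from stdin is below the limit (in us).
latency_ok() {
    local limit_us=$1 avg
    avg=$(avg_rtt_us)
    [ -n "$avg" ] && [ "$avg" -lt "$limit_us" ]
}

# Usage on a real node (host name "pve" as in the bullet above):
#   ping -c 20 -s 1000 pve | latency_ok 200 && echo "latency OK"
```

The helper only parses text from stdin, so it can be tried on saved ping output before running it against the cluster network.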
  
 ==== performance on small cluster ==== ==== performance on small cluster ====
Line 24: Line 32:
     * primary PG - original/first PG - others are replicas. Primary PG is used for read.     * primary PG - original/first PG - others are replicas. Primary PG is used for read.
   * use relatively more PG than for big cluster - better balance, but handling PGs consumes resources (RAM)   * use relatively more PG than for big cluster - better balance, but handling PGs consumes resources (RAM)
 +    * e.g. for 7 OSDs x 2 TB the PG autoscaler recommends 256 PGs. After changing to 384 PGs, IOPS increase drastically and latency drops.
 +      Setting 512 PGs wasn't possible because of the limit of 250 PGs per OSD.
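The per-OSD PG count behind that limit can be estimated with simple arithmetic: sum `pg_num x replica size` over all pools and divide by the number of OSDs. The sketch below assumes replicated pools and an even distribution; the function name `pgs_per_osd` is hypothetical, and since the pool layout of the cluster above is not stated, the pool values in the usage lines are only illustrative.

```bash
#!/usr/bin/env bash
# pgs_per_osd OSD_COUNT PG_NUM:SIZE [PG_NUM:SIZE ...]
# Prints the average number of PG replicas per OSD (integer division),
# counting every pool that shares the same OSDs.
pgs_per_osd() {
    local osds=$1 total=0 pool
    shift
    for pool in "$@"; do
        # ${pool%%:*} is pg_num, ${pool##*:} is the replica size
        total=$(( total + ${pool%%:*} * ${pool##*:} ))
    done
    echo $(( total / osds ))
}

# Illustrative usage (pool sizes are assumptions):
#   pgs_per_osd 7 384:3          # one pool, 3 replicas       -> 164
#   pgs_per_osd 7 384:3 128:3    # a second pool adds its PGs -> 219
```

Every additional pool on the same OSDs counts towards the same limit, so the total should be checked before raising `pg_num` on a single pool.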
  
 === balancer === === balancer ===
Line 53: Line 63:
 ==== check cluster balance ==== ==== check cluster balance ====
  
 +<code bash>
 ceph -s ceph -s
-ceph osd df shows standard deviation+ceph osd df  # shows standard deviation
 +</code>
  
 no tools to show primary PG balancing. Tool on https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh no tools to show primary PG balancing. Tool on https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh
 +
 +==== fragmentation ====
 +
 +<code bash>
 +# ceph tell 'osd.*' bluestore allocator score block
 +osd.0: {
 +    "fragmentation_rating": 0.27187848765399758
 +}
 +osd.1: {
 +    "fragmentation_rating": 0.31147177012467503
 +}
 +osd.2: {
 +    "fragmentation_rating": 0.30870023661486262
 +}
 +osd.3: {
 +    "fragmentation_rating": 0.25266931194419928
 +}
 +osd.4: {
 +    "fragmentation_rating": 0.29409796398594706
 +}
 +osd.5: {
 +    "fragmentation_rating": 0.33731626650673441
 +}
 +osd.6: {
 +    "fragmentation_rating": 0.23903976339003158
 +}
 +</code>
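With more OSDs the output above gets long, so a filter that lists only the OSDs above a chosen rating can help. This is a sketch that assumes the exact output layout shown above; the function name `frag_above` and the threshold value are assumptions, not Ceph recommendations.

```bash
#!/usr/bin/env bash
# frag_above THRESHOLD
# Reads "bluestore allocator score block" output from stdin and prints
# "osd.N rating" for every OSD whose fragmentation_rating exceeds THRESHOLD.
frag_above() {
    awk -v limit="$1" '
        /^osd\.[0-9]+:/          { osd = $1; sub(/:$/, "", osd) }
        /"fragmentation_rating"/ { if ($2 + 0 > limit + 0) print osd, $2 }
    '
}

# Usage (0.8 is an arbitrary example threshold):
#   ceph tell 'osd.*' bluestore allocator score block | frag_above 0.8
```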
 +
  
  
 ==== performance on slow HDDs ==== ==== performance on slow HDDs ====
  
 +Do not keep ''osd_memory_target'' below 2G:
 +<code bash>
 +ceph config set osd osd_memory_target 4294967296
 +ceph config get osd osd_memory_target
 +4294967296
 +</code>
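The value is given in bytes, so 4294967296 corresponds to 4 GiB. A small helper avoids typing the number by hand; the function name `gib_to_bytes` is hypothetical.

```bash
#!/usr/bin/env bash
# Convert a whole number of GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Usage:
#   ceph config set osd osd_memory_target "$(gib_to_bytes 4)"
```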
 +
 +If the journal is on an SSD, raise the low threshold. NOTE: check whether this is valid for BlueStore; it is probably a legacy FileStore parameter:
 +<code bash>
 +# internal parameter calculated from other parameters:
 +ceph config get osd journal_throttle_low_threshhold
 +0.600000
 +
 +# 5GB:
 +ceph config get osd osd_journal_size
 +5120
 +</code>
 +
 +=== mClock scheduler ===
 +
 +  * [[https://pve.proxmox.com/wiki/Ceph_mClock_Tuning]]
 +
 +  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#osd-capacity-determination-automated]]
 +  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#set-or-override-max-iops-capacity-of-an-osd]]
 +
 +On startup, the Ceph mClock scheduler benchmarks the storage and configures the IOPS capacity according to the results:
 +
 +
 +<code bash>
 +# ceph tell 'osd.*' config show | grep osd_mclock_max_capacity_iops_hdd
 +    "osd_mclock_max_capacity_iops_hdd": "269.194638",
 +    "osd_mclock_max_capacity_iops_hdd": "310.961086",
 +    "osd_mclock_max_capacity_iops_hdd": "299.505949",
 +    "osd_mclock_max_capacity_iops_hdd": "345.471699",
 +    "osd_mclock_max_capacity_iops_hdd": "356.290246",
 +    "osd_mclock_max_capacity_iops_hdd": "229.234009",
 +    "osd_mclock_max_capacity_iops_hdd": "266.478860",
 +
 +# ceph tell 'osd.*' config show | grep osd_mclock_max_sequential
 +    "osd_mclock_max_sequential_bandwidth_hdd": "157286400",
 +    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
 +</code>
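To spot an OSD whose measured capacity differs a lot from the others, the per-OSD values can be condensed into a count, minimum, maximum and average. The sketch below assumes the `config show` line format shown above; the function name `iops_summary` is hypothetical.

```bash
#!/usr/bin/env bash
# Summarise osd_mclock_max_capacity_iops_* values read from stdin.
# Expects lines such as:  "osd_mclock_max_capacity_iops_hdd": "269.194638",
iops_summary() {
    awk -F'"' '
        /osd_mclock_max_capacity_iops_/ {
            v = $4 + 0                      # field 4 is the quoted value
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            sum += v; n++
        }
        END {
            if (n) printf "n=%d min=%.0f max=%.0f avg=%.0f\n", n, min, max, sum / n
        }
    '
}

# Usage:
#   ceph tell 'osd.*' config show | grep osd_mclock_max_capacity_iops_hdd | iops_summary
```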
 +
 +Manual benchmark:
 +<code bash>
 +ceph tell 'osd.*' bench 12288000 4096 4194304 100
 +</code>
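As far as the mClock reference linked above describes it, the four arguments are total bytes, bytes per write, object size and number of objects; treat that reading as an assumption and verify it against the documentation for the installed release. Under that assumption, the sketch below derives how many writes the benchmark issues. The function name `bench_plan` is hypothetical.

```bash
#!/usr/bin/env bash
# bench_plan TOTAL_BYTES BYTES_PER_WRITE OBJ_SIZE NUM_OBJS
# Prints the number of writes and the total size of the objects written to,
# assuming the argument order described in the lead-in.
bench_plan() {
    local total=$1 block=$2 obj_size=$3 num_objs=$4
    echo "writes=$(( total / block )) object_span_bytes=$(( obj_size * num_objs ))"
}

# Usage with the values from the command above:
#   bench_plan 12288000 4096 4194304 100
#   -> writes=3000 object_span_bytes=419430400
```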
 +
 +Override settings:
 +
 +<code bash>
 +ceph config dump | grep osd_mclock_max_capacity_iops
 +
 +for i in $(seq 0 7); do ceph config rm osd.$i osd_mclock_max_capacity_iops_hdd; done
 +ceph config set global osd_mclock_max_capacity_iops_hdd 111
 +
 +ceph config dump | grep osd_mclock_max_capacity_iops
 +</code>
 +
 +== mClock profiles ==
 +
 +<code bash>
 +ceph tell 'osd.*' config show | grep osd_mclock_profile
 +</code>
 +
 +<code bash>
 +# choose one of: high_client_ops, high_recovery_ops, balanced
 +ceph tell 'osd.*' config set osd_mclock_profile high_client_ops
 +
 +ceph tell 'osd.*' config show | grep osd_mclock_profile
 +</code>
 +
 +== mClock custom profile ==
 +
 +<code bash>
 +ceph tell 'osd.*' config set osd_mclock_profile custom
 +</code>