
Differences

This shows you the differences between two versions of the page.

vm:proxmox:ceph:performance [2024/05/17 19:10] niziak
vm:proxmox:ceph:performance [2026/06/07 21:13] (current) niziak
Line 6: Line 6:
   * [[https://www.reddit.com/r/ceph/comments/zpk0wo/new_to_ceph_hdd_pool_is_extremely_slow/|New to Ceph, HDD pool is extremely slow]]
   * [[https://forum.proxmox.com/threads/ceph-storage-performance.129408/#post-566971|Ceph Storage Performance]]
 +  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
 +  * [[https://www.boniface.me/posts/pvc-ceph-tuning-adventures/]]
  
 ===== Performance tips =====
Line 11: Line 13:
 Ceph is built for scale and works great in large clusters. In a small cluster every node will be heavily loaded.
  
 +  * Ceph ensures data safety: it waits until data is written to the medium on all replicas. Use enterprise SSDs with PLP (Power Loss Protection) to reduce latency. Some people report an 8x speed increase.
   * adapt the number of PGs to the number of OSDs to spread traffic evenly
   * use ''krbd''
 +  * more OSDs = better parallelism
   * enable ''writeback'' on VMs (possible data loss on consumer SSDs)
 +  * MTU 9000 (jumbo frames) [[https://ceph.io/en/news/blog/2015/ceph-loves-jumbo-frames/|Ceph Loves Jumbo Frames]]
 +  * net latency <200us (''ping -s 1000 pve'')
 +  * C-States: [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
 +    * Ceph is incredibly sensitive to latency introduced by CPU c-state transitions. Set ''Max perf'' in BIOS to disable C-States or boot Linux with ''GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1" ''
 +    * Disable IOMMU in kernel
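
The network latency target from the list above can be checked with a small helper. This is a minimal sketch, assuming the summary line format printed by Linux iputils ''ping'' (values in milliseconds); the function names ''avg_rtt_us'' and ''latency_ok'' are made up for illustration, and the 200 us default is taken from the tip above.

```bash
#!/usr/bin/env bash
# Extract the average round-trip time in microseconds from the ping summary
# line, e.g.:  rtt min/avg/max/mdev = 0.101/0.142/0.198/0.031 ms
# Assumption: iputils ping format, values reported in milliseconds.
avg_rtt_us() {
    awk -F'[/ ]+' '/^rtt / { printf "%d\n", $8 * 1000 + 0.5 }'
}

# Succeed when the average RTT in the given summary line is below the limit.
# $1 = ping summary line, $2 = limit in microseconds (default 200)
latency_ok() {
    local limit_us=${2:-200}
    local avg
    avg=$(printf '%s\n' "$1" | avg_rtt_us)
    [ -n "$avg" ] && [ "$avg" -lt "$limit_us" ]
}

# Possible usage on a cluster node (host name "pve" as in the tip above):
#   latency_ok "$(ping -c 20 -s 1000 pve | tail -n 1)" && echo "latency OK"
```

The check works on the text of the summary line only, so it can be tried on saved ping output before running it against live nodes.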
  
 ==== performance on small cluster ====
Line 23: Line 32:
   * same number of primary PGs per OSD = read operations spread evenly
     * primary PG: the original/first copy of a PG; the others are replicas. The primary PG is used for reads.
 +  * use relatively more PGs than in a big cluster: better balance, but handling PGs consumes resources (RAM)
 +    * e.g. for 7 OSDs x 2 TB the PG autoscaler recommends 256 PGs. After changing to 384, IOPS increase drastically and latency drops.
 +      Setting 512 PGs wasn't possible because of the limit of 250 PGs per OSD.
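
The per-OSD PG load behind that limit can be estimated with simple arithmetic. The sketch below assumes a single replicated pool with replica size 3 (an assumption, the pool size is not stated here); the function name ''pgs_per_osd'' is made up. The limit most likely refers to the ''mon_max_pg_per_osd'' setting, which counts PGs from all pools on an OSD, so any other pools in the cluster add to these numbers.

```bash
#!/usr/bin/env bash
# Rough number of PG replicas each OSD has to host for ONE replicated pool:
#   pg_num * replica_size / number_of_osds   (rounded up)
# Assumption: PGs are spread evenly; real clusters deviate from this.
pgs_per_osd() {
    local pg_num=$1 size=$2 osds=$3
    echo $(( (pg_num * size + osds - 1) / osds ))
}

# The three PG counts mentioned above, for 7 OSDs and an assumed size of 3
for pg_num in 256 384 512; do
    echo "pg_num=$pg_num -> $(pgs_per_osd "$pg_num" 3 7) PGs per OSD"
done
```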
  
 === balancer ===
Line 49: Line 61:
  
 It is possible to set the desired/target size of a pool. This prevents the autoscaler from moving data every time new data is stored.
 +
 +==== check cluster balance ====
 +
 +<code bash>
 +ceph -s
 +ceph osd df # shows standard deviation
 +</code>
 +
 +There is no built-in tool to show primary PG balancing. A script is available at https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh
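
As a lighter alternative, primary PGs per OSD can be counted from the brief PG dump. This is a minimal sketch, assuming that ''ceph pg dump pgs_brief'' prints one PG per line with the acting primary OSD id in the last column; the function name ''count_primaries'' and the sample rows are made up for illustration.

```bash
#!/usr/bin/env bash
# Count how many PGs each OSD serves as acting primary.
# Input: output of  ceph pg dump pgs_brief
# Assumption: columns are PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY,
# so the last field of each PG row is the acting primary OSD id.
count_primaries() {
    awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { n[$NF]++ }
         END { for (o in n) printf "osd.%s %d\n", o, n[o] }' | sort -t. -k2 -n
}

# Demo with made-up sample rows (not taken from a real cluster)
count_primaries <<'EOF'
PG_STAT  STATE         UP       UP_PRIMARY  ACTING   ACTING_PRIMARY
2.0      active+clean  [0,3,5]  0           [0,3,5]  0
2.1      active+clean  [3,1,6]  3           [3,1,6]  3
2.2      active+clean  [0,2,4]  0           [0,2,4]  0
2.3      active+clean  [5,0,1]  5           [5,0,1]  5
EOF

# On a cluster:  ceph pg dump pgs_brief 2>/dev/null | count_primaries
```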
 +
 +==== fragmentation ====
 +
 +<code bash>
 +# ceph tell 'osd.*' bluestore allocator score block
 +osd.0: {
 +    "fragmentation_rating": 0.27187848765399758
 +}
 +osd.1: {
 +    "fragmentation_rating": 0.31147177012467503
 +}
 +osd.2: {
 +    "fragmentation_rating": 0.30870023661486262
 +}
 +osd.3: {
 +    "fragmentation_rating": 0.25266931194419928
 +}
 +osd.4: {
 +    "fragmentation_rating": 0.29409796398594706
 +}
 +osd.5: {
 +    "fragmentation_rating": 0.33731626650673441
 +}
 +osd.6: {
 +    "fragmentation_rating": 0.23903976339003158
 +}
 +</code>
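
To pick out the OSDs worth a closer look, the ratings can be filtered against a threshold. This is a minimal sketch, assuming the output format shown in the listing above; the function name ''fragmented_osds'' is made up, and the default threshold of 0.7 is an assumption based on the rating bands described in the Ceph BlueStore documentation, so adjust it to taste.

```bash
#!/usr/bin/env bash
# Print OSDs whose fragmentation_rating is at or above a threshold.
# Input: output of  ceph tell 'osd.*' bluestore allocator score block
# $1 = threshold (default 0.7, an assumed value)
fragmented_osds() {
    local threshold=${1:-0.7}
    awk -v t="$threshold" '
        # remember which OSD the following rating belongs to
        /^osd\.[0-9]+:/ { sub(/:.*/, "", $1); osd = $1 }
        /"fragmentation_rating"/ {
            gsub(/[",]/, "", $2)
            if ($2 + 0 >= t + 0) printf "%s %.2f\n", osd, $2
        }'
}

# On a cluster:
#   ceph tell 'osd.*' bluestore allocator score block | fragmented_osds 0.3
```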
 +
 +
  
 ==== performance on slow HDDs ====
  
 +Do not keep ''osd_memory_target'' below 2G:
 +<code bash>
 +ceph config set osd osd_memory_target 4294967296
 +ceph config get osd osd_memory_target
 +4294967296
 +</code>
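
''osd_memory_target'' is given in bytes, which is easy to mistype. A small conversion helper, sketched here with a made-up function name ''gib_to_bytes'', avoids hand-written byte counts:

```bash
#!/usr/bin/env bash
# Convert a whole number of GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 4

# On a cluster:
#   ceph config set osd osd_memory_target "$(gib_to_bytes 4)"
```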
 +
 +If the journal is on an SSD, raise the journal low threshold. NOTE: check whether this is valid for BlueStore; it is probably a legacy parameter for Filestore:
 +<code bash>
 +# internal parameter calculated from other parameters:
 +ceph config get osd journal_throttle_low_threshhold
 +0.600000
 +
 +# 5GB:
 +ceph config get osd osd_journal_size
 +5120
 +</code>
 +
 +==== bluestore_min_alloc_size ====
 +
 +  * Read: [[https://docs.ceph.com/en/reef/rados/configuration/bluestore-config-ref/#minimum-allocation-size]]
 +  * Restart of OSD needed
 +  * Impact: A smaller value reduces space waste (space amplification) but increases metadata overhead, while a larger value helps with large sequential writes but wastes space on small files.
 +  * The value is fixed when an OSD is created, so changes generally apply only to new or freshly deployed OSDs
 +
 +<code bash>
 +# ceph tell 'osd.*' config show | grep bluestore_min_alloc
 +    "bluestore_min_alloc_size": "0",
 +    "bluestore_min_alloc_size_hdd": "4096",
 +    "bluestore_min_alloc_size_ssd": "4096",
 +
 +# ceph config set global bluestore_min_alloc_size_hdd 16384
 +</code>
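
The space trade-off can be illustrated with a little arithmetic: each object occupies a whole number of allocation units. This is a simplified sketch that ignores compression, metadata and erasure coding; the function name ''allocated_bytes'' and the example object sizes are made up.

```bash
#!/usr/bin/env bash
# Bytes occupied on disk by an object of a given size when space is handed
# out in units of min_alloc_size (size rounded up to the next multiple).
# $1 = object size in bytes, $2 = allocation unit in bytes
allocated_bytes() {
    local size=$1 alloc=$2
    echo $(( (size + alloc - 1) / alloc * alloc ))
}

# A small and a medium object with 4 KiB and 16 KiB allocation units
for size in 1000 20000; do
    for alloc in 4096 16384; do
        echo "object=$size alloc=$alloc -> $(allocated_bytes "$size" "$alloc") bytes"
    done
done
```

With many small objects, the difference between the object size and the allocated size is the wasted space that a smaller allocation unit reduces.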
 +
 +==== filestore_op_threads ====
 +
 +<code bash>
 +# ceph tell 'osd.*' config show | grep filestore_op_threads
 +
 +"filestore_op_threads": "2"
 +# ceph tell 'osd.*' config set filestore_op_threads 4
 +
 +</code>
 +
 +=== mClock scheduler ===
 +
 +  * [[https://pve.proxmox.com/wiki/Ceph_mClock_Tuning]]
 +
 +  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#osd-capacity-determination-automated]]
 +  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#set-or-override-max-iops-capacity-of-an-osd]]
 +
 +On startup, the Ceph mClock scheduler benchmarks the storage and configures the IOPS capacity according to the results:
 +
 +
 +<code bash>
 +# ceph tell 'osd.*' config show | grep osd_mclock_max_capacity_iops_hdd
 +    "osd_mclock_max_capacity_iops_hdd": "269.194638",
 +    "osd_mclock_max_capacity_iops_hdd": "310.961086",
 +    "osd_mclock_max_capacity_iops_hdd": "299.505949",
 +    "osd_mclock_max_capacity_iops_hdd": "345.471699",
 +    "osd_mclock_max_capacity_iops_hdd": "356.290246",
 +    "osd_mclock_max_capacity_iops_hdd": "229.234009",
 +    "osd_mclock_max_capacity_iops_hdd": "266.478860",
 +
 +# ceph tell 'osd.*' config show | grep osd_mclock_max_sequential
 +    "osd_mclock_max_sequential_bandwidth_hdd": "157286400",
 +    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
 +</code>
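
The measured values differ noticeably between OSDs, so a quick summary helps when deciding on an override. This is a minimal sketch, assuming the quoted key/value format shown in the listing above; the function name ''iops_stats'' is made up.

```bash
#!/usr/bin/env bash
# Summarise measured mClock IOPS capacities as min / average / max.
# Input: lines such as
#     "osd_mclock_max_capacity_iops_hdd": "269.194638",
iops_stats() {
    awk -F'"' '/osd_mclock_max_capacity_iops/ {
            v = $4 + 0                      # 4th quote-separated field = value
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            sum += v; n++
        }
        END { if (n) printf "min=%.1f avg=%.1f max=%.1f\n", min, sum / n, max }'
}

# On a cluster:
#   ceph tell 'osd.*' config show | grep osd_mclock_max_capacity_iops_hdd | iops_stats
```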
 +
 +Manual benchmark:
 +<code bash>
 +# arguments: total bytes to write, bytes per write, object size, number of objects
 +ceph tell 'osd.*' bench 12288000 4096 4194304 100
 +</code>
 +
 +Override settings:
 +
 +<code bash>
 +ceph config dump | grep osd_mclock_max_capacity_iops
 +
 +# remove the per-OSD overrides for every OSD id known to the cluster
 +for i in $(ceph osd ls); do ceph config rm "osd.$i" osd_mclock_max_capacity_iops_hdd; done
 +ceph config set global osd_mclock_max_capacity_iops_hdd 111
 +
 +ceph config dump | grep osd_mclock_max_capacity_iops
 +</code>
 +
 +== mClock profiles ==
 +
 +<code bash>
 +ceph tell 'osd.*' config show | grep osd_mclock_profile
 +</code>
 +
 +<code bash>
 +# pick one of: high_client_ops, high_recovery_ops, balanced
 +PROFILE=high_client_ops
 +ceph tell 'osd.*' config set osd_mclock_profile "$PROFILE"
 +
 +ceph tell 'osd.*' config show | grep osd_mclock_profile
 +</code>
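
A mistyped profile name is easy to miss when switching profiles during recovery. The wrapper below is a sketch only: the function names ''valid_mclock_profile'' and ''set_mclock_profile'' are made up, and the accepted names are the built-in profiles listed above plus ''custom''.

```bash
#!/usr/bin/env bash
# Succeed only for the known mClock profile names.
valid_mclock_profile() {
    case "$1" in
        high_client_ops|high_recovery_ops|balanced|custom) return 0 ;;
        *) return 1 ;;
    esac
}

# Validate the name first, then apply the profile to all OSDs.
set_mclock_profile() {
    if ! valid_mclock_profile "$1"; then
        echo "unknown mClock profile: $1" >&2
        return 1
    fi
    ceph tell 'osd.*' config set osd_mclock_profile "$1"
}

# On a cluster:
#   set_mclock_profile high_recovery_ops
```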
 +
 +== mClock custom profile ==
 +
 +<code bash>
 +ceph tell 'osd.*' config set osd_mclock_profile custom
 +</code>