====== CEPH performance ======

  * [[https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing|BlueStore Config Reference: Sizing]]
  * [[https://yourcmc.ru/wiki/Ceph_performance]]
  * [[https://accelazh.github.io/ceph/Ceph-Performance-Tuning-Checklist|Ceph Performance Tuning Checklist]]
  * [[https://www.reddit.com/r/ceph/comments/zpk0wo/new_to_ceph_hdd_pool_is_extremely_slow/|New to Ceph, HDD pool is extremely slow]]
  * [[https://forum.proxmox.com/threads/ceph-storage-performance.129408/#post-566971|Ceph Storage Performance]]
  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
  * [[https://www.boniface.me/posts/pvc-ceph-tuning-adventures/]]

===== Performance tips =====

Ceph is built for scale and works great in large clusters. In a small cluster every node will be heavily loaded.

  * adapt the number of PGs to the number of OSDs to spread traffic evenly
  * use ''krbd''
  * more OSDs = better parallelism
  * enable ''writeback'' cache on VMs (possible data loss on consumer SSDs)
  * MTU 9000 (jumbo frames) [[https://ceph.io/en/news/blog/2015/ceph-loves-jumbo-frames/|Ceph Loves Jumbo Frames]]
  * network latency <200 µs (''ping -s 1000 pve'')
  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
    * Ceph is incredibly sensitive to latency introduced by CPU C-state transitions. Set ''Max perf'' in the BIOS to disable C-states, or boot Linux with ''GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1"''
    * disable IOMMU in the kernel

==== performance on small cluster ====

  * [[https://www.youtube.com/watch?v=LlLLJxNcVOY|Configuring Small Ceph Clusters for Optimal Performance - Josh Salomon, Red Hat]]
  * the number of PGs should be a power of 2 (or halfway between two powers of 2)
  * same utilization (% full) per device
  * same number of PGs per OSD = same number of requests per device
  * same number of primary PGs per OSD = read operations spread evenly
    * the primary PG is the original/first copy of a PG; the others are replicas. Reads are served by the primary PG.
  * use relatively more PGs than in a big cluster - better balance, but handling PGs consumes resources (RAM)
    * e.g. for 7 OSDs x 2 TB the PG autoscaler recommends 256 PGs. After changing to 384, IOPS increase drastically and latency drops. Setting 512 PGs was not possible because of the 250 PG/OSD limit.

=== balancer ===

  ceph mgr module enable balancer
  ceph balancer on
  ceph balancer mode upmap

=== CRUSH reweight ===

Overrides the default CRUSH weight assignment. If possible, use the ''balancer'' instead.

=== PG autoscaler ===

Better to use it in warn mode, so that changing the PG number does not put unexpected load on the cluster.

  ceph mgr module enable pg_autoscaler
  #ceph osd pool set <pool> pg_autoscale_mode <mode>
  ceph osd pool set rbd pg_autoscale_mode warn

It is possible to set the desired/target size of a pool. This prevents the autoscaler from moving data every time new data is stored.

==== check cluster balance ====

  ceph -s
  ceph osd df    # shows standard deviation

There is no built-in tool that shows primary PG balancing.
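As a rough check, primary PG counts per OSD can be derived from ''ceph pg dump''. A minimal sketch, assuming ''pgs_brief'' output keeps ACTING_PRIMARY as its last column (the column layout can differ between Ceph releases):

  # count primary PGs per OSD - ACTING_PRIMARY is the last column of pgs_brief
  ceph pg dump pgs_brief 2>/dev/null \
    | awk '$1 ~ /^[0-9]+\./ {print $NF}' \
    | sort -n | uniq -c | sort -rn
  # output lines are "<primary PG count> <osd id>" - counts should be roughly equal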
A ready-made tool is available at https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh

==== fragmentation ====

  # ceph tell 'osd.*' bluestore allocator score block
  osd.0: { "fragmentation_rating": 0.27187848765399758 }
  osd.1: { "fragmentation_rating": 0.31147177012467503 }
  osd.2: { "fragmentation_rating": 0.30870023661486262 }
  osd.3: { "fragmentation_rating": 0.25266931194419928 }
  osd.4: { "fragmentation_rating": 0.29409796398594706 }
  osd.5: { "fragmentation_rating": 0.33731626650673441 }
  osd.6: { "fragmentation_rating": 0.23903976339003158 }

==== performance on slow HDDs ====

Do not set ''osd_memory_target'' below 2 GB:

  ceph config set osd osd_memory_target 4294967296
  ceph config get osd osd_memory_target
  4294967296

If the journal is on an SSD, change the low threshold to something bigger - NOTE - check whether this is valid for BlueStore; it is probably a legacy parameter for Filestore:

  # internal parameter calculated from other parameters:
  ceph config get osd journal_throttle_low_threshhold
  0.600000
  # 5 GB:
  ceph config get osd osd_journal_size
  5120

=== mClock scheduler ===

  * [[https://pve.proxmox.com/wiki/Ceph_mClock_Tuning]]
  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#osd-capacity-determination-automated]]
  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#set-or-override-max-iops-capacity-of-an-osd]]

On startup the mClock scheduler benchmarks each OSD's storage and configures its IOPS capacity according to the results:

  # ceph tell 'osd.*' config show | grep osd_mclock_max_capacity_iops_hdd
  "osd_mclock_max_capacity_iops_hdd": "269.194638",
  "osd_mclock_max_capacity_iops_hdd": "310.961086",
  "osd_mclock_max_capacity_iops_hdd": "299.505949",
  "osd_mclock_max_capacity_iops_hdd": "345.471699",
  "osd_mclock_max_capacity_iops_hdd": "356.290246",
  "osd_mclock_max_capacity_iops_hdd": "229.234009",
  "osd_mclock_max_capacity_iops_hdd": "266.478860",

  # ceph tell 'osd.*' config show | grep osd_mclock_max_sequential
  "osd_mclock_max_sequential_bandwidth_hdd": "157286400",
  "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",

Manual benchmark:

  ceph tell 'osd.*' bench 12288000 4096 4194304 100

Override the settings:

  ceph config dump | grep osd_mclock_max_capacity_iops
  for i in $(seq 0 7); do ceph config rm osd.$i osd_mclock_max_capacity_iops_hdd; done
  ceph config set global osd_mclock_max_capacity_iops_hdd 111
  ceph config dump | grep osd_mclock_max_capacity_iops

== mClock profiles ==

  ceph tell 'osd.*' config show | grep osd_mclock_profile
  ceph tell 'osd.*' config set osd_mclock_profile [high_client_ops|high_recovery_ops|balanced]
  ceph tell 'osd.*' config show | grep osd_mclock_profile

== mClock custom profile ==

  ceph tell 'osd.*' config set osd_mclock_profile custom
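With the custom profile active, the individual mClock QoS knobs can be tuned. A minimal sketch favouring client I/O over background recovery - the values are placeholders, and how reservations/limits are interpreted (absolute IOPS vs. a fraction of OSD capacity) depends on the Ceph release, see the mClock config reference linked above:

  # inspect the current QoS parameters
  ceph tell 'osd.*' config show | grep osd_mclock_scheduler
  # give client ops a higher weight than background recovery (placeholder values)
  ceph config set osd osd_mclock_scheduler_client_wgt 4
  ceph config set osd osd_mclock_scheduler_background_recovery_wgt 1
  ceph config dump | grep osd_mclock_scheduler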