  * [[https://yourcmc.ru/wiki/Ceph_performance]]
  * [[https://accelazh.github.io/ceph/Ceph-Performance-Tuning-Checklist|Ceph Performance Tuning Checklist]]
  * [[https://www.reddit.com/r/ceph/comments/zpk0wo/new_to_ceph_hdd_pool_is_extremely_slow/|New to Ceph, HDD pool is extremely slow]]
  * [[https://forum.proxmox.com/threads/ceph-storage-performance.129408/#post-566971|Ceph Storage Performance]]
  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
  * [[https://www.boniface.me/posts/pvc-ceph-tuning-adventures/]]
  
===== Performance tips =====

Ceph is built for scale and works great in large clusters. In a small cluster, every node will be heavily loaded.

  * adapt the number of PGs to the number of OSDs to spread traffic evenly
  * use ''krbd'' (see the sketch after this list)
  * more OSDs = better parallelism
  * enable ''writeback'' cache on VMs (possible data loss on consumer SSDs; see the sketch after this list)
  * MTU 9000 (jumbo frames) [[https://ceph.io/en/news/blog/2015/ceph-loves-jumbo-frames/|Ceph Loves Jumbo Frames]]
  * net latency < 200 us (''ping -s 1000 pve'')
  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
    * Ceph is incredibly sensitive to latency introduced by CPU C-state transitions. Set ''Max perf'' in the BIOS to disable C-states, or boot Linux with ''GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1"''
    * disable IOMMU in the kernel
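
A minimal sketch of applying the ''krbd'' and ''writeback'' tips on a Proxmox node; the storage name ''ceph-rbd'', VM id ''100'' and disk slot ''scsi0'' are assumptions:

<code bash>
# map RBD images through the kernel client instead of librbd
# (assumed RBD storage name: ceph-rbd)
pvesm set ceph-rbd --krbd 1

# enable writeback caching on an existing VM disk
# (assumed VM id 100 with its disk on scsi0)
qm set 100 --scsi0 ceph-rbd:vm-100-disk-0,cache=writeback
</code>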
  
==== performance on small cluster ====
  
  * [[https://www.youtube.com/watch?v=LlLLJxNcVOY|Configuring Small Ceph Clusters for Optimal Performance - Josh Salomon, Red Hat]]
  * the number of PGs should be a power of 2 (or midway between two powers of 2)
  * same utilization (% full) per device
  * same number of PGs per OSD = same number of requests per device
  * same number of primary PGs per OSD = read operations spread evenly
    * the primary PG is the original/first PG; the others are replicas. The primary PG is used for reads.
  * use relatively more PGs than in a big cluster for better balance, but handling PGs consumes resources (RAM); see the sketch after this list
    * e.g. for 7 OSDs x 2 TB, the PG autoscaler recommends 256 PGs. After changing to 384 PGs, IOPS drastically increased and latency dropped.
      Setting 512 PGs wasn't possible because of the 250 PG/OSD limit.
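
A minimal sketch of raising the PG count as in the example above, assuming the pool is named ''rbd'':

<code bash>
# raise pg_num and pgp_num for the pool (assumed name: rbd)
ceph osd pool set rbd pg_num 384
ceph osd pool set rbd pgp_num 384
</code>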

=== balancer ===

<code bash>
ceph mgr module enable balancer
ceph balancer on
ceph balancer mode upmap
</code>
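
The module state and the active mode can be checked with the standard status command:

<code bash>
ceph balancer status
</code>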
  
=== CRUSH reweight ===

If possible, use the ''balancer'' instead.

CRUSH reweighting overrides the default, size-based CRUSH weight assignment.
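
A minimal sketch; the OSD id ''osd.3'' and the new weight ''1.5'' are placeholders:

<code bash>
# show current CRUSH weights
ceph osd tree

# lower the CRUSH weight of one OSD so it receives less data
ceph osd crush reweight osd.3 1.5
</code>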
  
=== PG autoscaler ===
  
It is better to run it in ''warn'' mode, so that it does not put unexpected load on the cluster whenever the PG number would change.
<code bash>
ceph mgr module enable pg_autoscaler
# ceph osd pool set <pool> pg_autoscale_mode <mode>
ceph osd pool set rbd pg_autoscale_mode warn
</code>
  
It is possible to set the desired/target size of a pool. This prevents the autoscaler from moving data every time new data is stored.
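
A minimal sketch, assuming the ''rbd'' pool is expected to grow to about 10 TiB:

<code bash>
# tell the autoscaler the expected final size of the pool (10 TiB in bytes)
ceph osd pool set rbd target_size_bytes 10995116277760

# alternatively, express it as a fraction of total cluster capacity
ceph osd pool set rbd target_size_ratio 0.5
</code>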
  
==== check cluster balance ====
  
<code bash>
ceph -s
ceph osd df # shows standard deviation
</code>
  
There is no built-in tool to show primary PG balancing; a helper script is available at [[https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh]].
  
==== fragmentation ====
  
<code bash>
# ceph tell 'osd.*' bluestore allocator score block
# (fragmentation rating ranges from 0 = none to 1 = heavily fragmented)
osd.0: {
    "fragmentation_rating": 0.27187848765399758
}
osd.1: {
    "fragmentation_rating": 0.31147177012467503
}
osd.2: {
    "fragmentation_rating": 0.30870023661486262
}
osd.3: {
    "fragmentation_rating": 0.25266931194419928
}
osd.4: {
    "fragmentation_rating": 0.29409796398594706
}
osd.5: {
    "fragmentation_rating": 0.33731626650673441
}
osd.6: {
    "fragmentation_rating": 0.23903976339003158
}
</code>
  
==== performance on slow HDDs ====

Do not set ''osd_memory_target'' below 2 GB:
<code bash>
# 4294967296 bytes = 4 GiB
ceph config set osd osd_memory_target 4294967296
ceph config get osd osd_memory_target
4294967296
</code>

If the journal is on an SSD, raise the low threshold. NOTE: check whether this is still valid for BlueStore; it is probably a legacy Filestore parameter:
<code bash>
# internal parameter calculated from other parameters:
ceph config get osd journal_throttle_low_threshhold
0.600000

# 5 GB:
ceph config get osd osd_journal_size
5120
</code>

=== mClock scheduler ===

  * [[https://pve.proxmox.com/wiki/Ceph_mClock_Tuning]]
  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#osd-capacity-determination-automated]]
  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#set-or-override-max-iops-capacity-of-an-osd]]

Upon startup, the mClock scheduler benchmarks each OSD's storage and configures the OSD's IOPS capacity from the results:

<code bash>
# ceph tell 'osd.*' config show | grep osd_mclock_max_capacity_iops_hdd
    "osd_mclock_max_capacity_iops_hdd": "269.194638",
    "osd_mclock_max_capacity_iops_hdd": "310.961086",
    "osd_mclock_max_capacity_iops_hdd": "299.505949",
    "osd_mclock_max_capacity_iops_hdd": "345.471699",
    "osd_mclock_max_capacity_iops_hdd": "356.290246",
    "osd_mclock_max_capacity_iops_hdd": "229.234009",
    "osd_mclock_max_capacity_iops_hdd": "266.478860",

# ceph tell 'osd.*' config show | grep osd_mclock_max_sequential
    "osd_mclock_max_sequential_bandwidth_hdd": "157286400",
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
</code>

Manual benchmark:
<code bash>
# arguments: total bytes to write, block size, object size, number of objects
ceph tell 'osd.*' bench 12288000 4096 4194304 100
</code>

Override settings:

<code bash>
ceph config dump | grep osd_mclock_max_capacity_iops

# remove any per-OSD overrides, then set a single global value
for i in $(seq 0 7); do ceph config rm osd.$i osd_mclock_max_capacity_iops_hdd; done
ceph config set global osd_mclock_max_capacity_iops_hdd 111

ceph config dump | grep osd_mclock_max_capacity_iops
</code>

== mClock profiles ==

<code bash>
ceph tell 'osd.*' config show | grep osd_mclock_profile
</code>

<code bash>
ceph tell 'osd.*' config set osd_mclock_profile [high_client_ops|high_recovery_ops|balanced]

ceph tell 'osd.*' config show | grep osd_mclock_profile
</code>

== mClock custom profile ==

<code bash>
ceph tell 'osd.*' config set osd_mclock_profile custom
</code>
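
With the ''custom'' profile, the individual mClock QoS knobs become tunable. A minimal sketch; the option names exist in Ceph, but the weight value below is an assumption, not a recommendation:

<code bash>
# inspect the tunables unlocked by the custom profile
ceph tell 'osd.*' config show | grep osd_mclock_scheduler

# example: give client ops a higher weight relative to background ops
# (the value 4 is a placeholder)
ceph config set osd osd_mclock_scheduler_client_wgt 4
</code>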