Differences

This shows you the differences between two versions of the page.

--- vm:proxmox:ceph:performance [2025/10/29 17:42] – [performance on slow HDDs] niziak
+++ vm:proxmox:ceph:performance [2026/06/07 21:13] (current) – niziak
@@ Line 6: / Line 6: @@
   * [[https://www.reddit.com/r/ceph/comments/zpk0wo/new_to_ceph_hdd_pool_is_extremely_slow/|New to Ceph, HDD pool is extremely slow]]
   * [[https://forum.proxmox.com/threads/ceph-storage-performance.129408/#post-566971|Ceph Storage Performance]]
+  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
+  * [[https://www.boniface.me/posts/pvc-ceph-tuning-adventures/]]
 ===== Performance tips =====
@@ Line 11: / Line 13: @@
 Ceph is built for scale and works great in large clusters. In a small cluster every node is heavily loaded.
+  * Ceph ensures data safety: it acknowledges a write only after the data has reached the medium on all replicas. Use enterprise SSDs with PLP (Power Loss Protection) to reduce latency; some users report an 8x speed increase.
   * adapt the PG count to the number of OSDs to spread traffic evenly
   * use ''krbd''
+  * more OSDs = better parallelism
   * enable ''writeback'' cache on VM disks (possible data loss on consumer SSDs)
+  * MTU 9000 (jumbo frames) [[https://ceph.io/en/news/blog/2015/ceph-loves-jumbo-frames/|Ceph Loves Jumbo Frames]]
+  * network latency < 200 µs (check with ''ping -s 1000 pve'')
+  * C-States: [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
+    * Ceph is incredibly sensitive to latency introduced by CPU C-state transitions. Set ''Max perf'' in BIOS to disable C-states, or boot Linux with ''GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1"''
+    * Disable IOMMU in the kernel
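A minimal sketch for checking the two network tips above, assuming Linux iputils ''ping'' and its summary line format (''rtt min/avg/max/mdev = ... ms''). The helper names are hypothetical, and the 28-byte overhead assumes IPv4 (20-byte IP header + 8-byte ICMP header).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the network tips above (not part of Ceph or Proxmox).

# Largest ICMP payload that fits in one frame without fragmentation:
# MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Read a ping summary on stdin and print the average RTT in microseconds.
# Expected line: "rtt min/avg/max/mdev = 0.112/0.145/0.201/0.020 ms"
avg_rtt_us() {
  awk -F'= ' '/min\/avg\/max/ { split($2, v, "/"); printf "%.0f\n", v[2] * 1000 }'
}

# Exit status 0 if the average RTT read from stdin is below LIMIT microseconds (default 200).
latency_ok() {
  local limit=${1:-200} us
  us=$(avg_rtt_us)
  [ -n "$us" ] && [ "$us" -lt "$limit" ]
}

# Usage on a real node (replace "pve" with a peer node):
#   ping -M do -s "$(icmp_payload_for_mtu 9000)" -c 3 pve     # do jumbo frames pass unfragmented?
#   ping -s 1000 -c 20 -q pve | latency_ok && echo "latency OK"
```

''-M do'' forbids fragmentation, so the jumbo-frame ping fails outright if any hop on the path has a smaller MTU.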
 ==== performance on small cluster ====
@@ Line 61: / Line 70: @@
 No built-in tool shows primary PG balancing; a script is available at https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh
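A rough alternative to that script, assuming the ''ceph pg dump pgs_brief'' layout in which the last column is ''ACTING_PRIMARY'' (check the header line on your release). It counts how many PGs each OSD is acting primary for; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count acting-primary PGs per OSD.
# Reads `ceph pg dump pgs_brief` output on stdin; assumes the last column is ACTING_PRIMARY.
primary_counts() {
  # Only PG rows (e.g. "1.0", "2.1f") are counted; the header and other lines are skipped.
  awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { n[$NF]++ }
       END { for (osd in n) printf "osd.%s %d\n", osd, n[osd] }' | sort -t. -k2,2n
}

# Usage on a cluster node:
#   ceph pg dump pgs_brief 2>/dev/null | primary_counts
```

A large spread between the highest and lowest counts means some OSDs serve far more primary reads than others.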
+==== fragmentation ====
+<code bash>
+# ceph tell 'osd.*' bluestore allocator score block
+osd.0: {
+    "fragmentation_rating": 0.27187848765399758
+}
+osd.1: {
+    "fragmentation_rating": 0.31147177012467503
+}
+osd.2: {
+    "fragmentation_rating": 0.30870023661486262
+}
+osd.3: {
+    "fragmentation_rating": 0.25266931194419928
+}
+osd.4: {
+    "fragmentation_rating": 0.29409796398594706
+}
+osd.5: {
+    "fragmentation_rating": 0.33731626650673441
+}
+osd.6: {
+    "fragmentation_rating": 0.23903976339003158
+}
+</code>
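To spot the worst OSD in a larger cluster, the output above can be reduced with ''awk''. A minimal sketch, assuming the format shown above (an ''osd.N: {'' line followed by a ''fragmentation_rating'' line); higher ratings mean more fragmentation. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the OSD with the highest fragmentation_rating.
# Reads the output of `ceph tell 'osd.*' bluestore allocator score block` on stdin.
most_fragmented() {
  awk '/^osd\.[0-9]+:/ { osd = $1; sub(/:$/, "", osd) }
       /"fragmentation_rating"/ {
         r = $2 + 0
         if (best == "" || r > max) { max = r; best = osd }
       }
       END { if (best != "") printf "%s %.3f\n", best, max }'
}

# Usage:
#   ceph tell 'osd.*' bluestore allocator score block | most_fragmented
```

For the output above this would report ''osd.5'' (rating about 0.337).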
 ==== performance on slow HDDs ====
+Do not keep ''osd_memory_target'' below 2 GiB (the value is in bytes; 4294967296 = 4 GiB):
+<code bash>
+ceph config set osd osd_memory_target 4294967296
+ceph config get osd osd_memory_target
+4294967296
+</code>
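Since ''osd_memory_target'' is given in bytes, a small guard avoids typos in the long number. A minimal sketch; both helper names are hypothetical and the 2 GiB floor is the rule of thumb stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert GiB to bytes (the unit osd_memory_target expects).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Exit status 0 only if the value (in bytes) is at least 2 GiB.
check_memory_target() {
  [ "$1" -ge "$(gib_to_bytes 2)" ]
}

# Usage:
#   t=$(gib_to_bytes 4)   # 4294967296
#   check_memory_target "$t" && ceph config set osd osd_memory_target "$t"
```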
+If the journal is on SSD, raise ''journal_throttle_low_threshhold''. NOTE: check whether this is valid for BlueStore; it is probably a legacy FileStore parameter:
+<code bash>
+# internal parameter calculated from other parameters:
+ceph config get osd journal_throttle_low_threshhold
+.600000
+# size in MB (5120 = 5 GB):
+ceph config get osd osd_journal_size
+</code>
+==== bluestore_min_alloc_size ====
+  * Read: [[https://docs.ceph.com/en/reef/rados/configuration/bluestore-config-ref/#minimum-allocation-size]]
+  * Takes effect only when an OSD is created: the value is fixed at mkfs time, so restarting an existing OSD does not change it; redeploy the OSD to apply a new value
+  * Impact: a smaller value reduces wasted space (space amplification) but increases metadata overhead; a larger value helps large sequential writes but wastes space on small objects.
+<code bash>
+# ceph tell 'osd.*' config show | grep bluestore_min_alloc
+    "bluestore_min_alloc_size": "0",
+    "bluestore_min_alloc_size_hdd": "4096",
+    "bluestore_min_alloc_size_ssd": "4096",
+# persist the value for OSDs created from now on (existing OSDs keep their value):
+# ceph config set osd bluestore_min_alloc_size_hdd 16384
+</code>
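A worked example of the space-amplification trade-off, as a simplification that rounds each write up to a whole number of allocation units. Real BlueStore behaviour also involves compression, blob reuse and deferred writes, so treat the numbers as illustrative only; the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: space used by SIZE bytes when allocated in UNIT-byte chunks
# (simplified model: ignores compression, metadata and deferred writes).
allocated_bytes() {
  local size=$1 unit=$2
  echo $(( (size + unit - 1) / unit * unit ))
}

# Compare 4 KiB and 16 KiB allocation units for a 5000-byte object.
for unit in 4096 16384; do
  used=$(allocated_bytes 5000 "$unit")
  echo "min_alloc=$unit used=$used wasted=$(( used - 5000 ))"
done
# min_alloc=4096 used=8192 wasted=3192
# min_alloc=16384 used=16384 wasted=11384
```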
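+==== filestore_op_threads ====
+Only relevant to legacy FileStore OSDs (FileStore was removed in Reef); BlueStore does not use it.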
+<code bash>
+# ceph tell 'osd.*' config show | grep filestore_op_threads
+"filestore_op_threads": "2"
+# ceph tell 'osd.*' config set filestore_op_threads 4
+</code>
 === mClock scheduler ===
-[[https://pve.proxmox.com/wiki/Ceph_mClock_Tuning]]
+  * [[https://pve.proxmox.com/wiki/Ceph_mClock_Tuning]]
+  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#osd-capacity-determination-automated]]
+  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#set-or-override-max-iops-capacity-of-an-osd]]
-[[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#osd-capacity-determination-automated]]
 On OSD startup, the mClock scheduler benchmarks the storage and sets the OSD IOPS capacity according to the results:
@@ Line 87: / Line 170: @@
     "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
 </code>
+Manual benchmark (arguments: total bytes, bytes per write, object size, number of objects):
+<code bash>
+ceph tell 'osd.*' bench 12288000 4096 4194304 100
+</code>
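The benchmark above writes 12288000 bytes in 4096-byte writes, i.e. 3000 write operations. The bench result already includes an ''iops'' field; the sketch below only shows the arithmetic behind it, assuming you take ''bytes_written'', ''blocksize'' and ''elapsed_sec'' from that result. IOPS is the unit the ''osd_mclock_max_capacity_iops_hdd''/''_ssd'' options are expressed in. The helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: IOPS = (total bytes / bytes per write) / elapsed seconds.
bench_iops() {
  local total=$1 block=$2 elapsed=$3
  awk -v t="$total" -v b="$block" -v e="$elapsed" 'BEGIN { printf "%.0f\n", t / b / e }'
}

# 12288000 bytes in 4096-byte writes = 3000 writes; if they took 1.5 s:
bench_iops 12288000 4096 1.5   # prints 2000
```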
+Override settings:
 <code bash>
