  * [[https://
  * [[https://
  * [[https://
  * [[https://
===== Performance tips =====

Ceph is built for scale and works great in large clusters. In a small cluster every node will be heavily loaded.

  * use ''
  * enable ''

==== performance on small cluster ====

  * [[https://
  * the number of PGs should be a power of 2 (or halfway between two powers of 2)
  * same utilization (% full) per device
  * same number of PGs per OSD = same number of requests per OSD
  * same number of primary PGs per OSD = read operations spread evenly
  * primary PG - the original/first PG; the others are replicas. The primary PG is used for reads.
  * use relatively more PGs than in a big cluster - better balance, but handling PGs consumes resources
  * e.g. for 7 OSDs x 2 TB the PG autoscaler recommends 256 PGs. After changing to 384, IOPS increase drastically and latency drops. Setting 512 PGs wasn't possible because of the 250 PG/OSD limit (see the sketch below).
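A minimal sketch of adjusting the PG count (assuming the ''rbd'' pool used elsewhere on this page; the 250 PG/OSD limit corresponds to the ''mon_max_pg_per_osd'' default):

<code bash>
# current PG count of the pool
ceph osd pool get rbd pg_num

# raise it (a power of 2, or halfway between two powers of 2)
ceph osd pool set rbd pg_num 384

# only if the default 250 PG/OSD limit really has to be exceeded:
# ceph config set global mon_max_pg_per_osd 500
</code>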
=== balancer ===

<code bash>
ceph mgr module enable balancer
ceph balancer on
ceph balancer mode upmap
</code>

=== CRUSH reweight ===

If possible use ''
Override default CRUSH assignment.

=== PG autoscaler ===

Better 

<code bash>
ceph mgr module enable pg_autoscaler
# ceph osd pool set <pool> pg_autoscale_mode <on|warn|off>
ceph osd pool set rbd pg_autoscale_mode warn
</code>

It is possible to set desired/

==== check cluster balance ====

<code bash>
ceph -s
ceph osd df    # shows standard deviation
</code>

There are no built-in tools to show primary PG balancing. A helper script is available at [[https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh]].

==== block.db and block.wal ====

The DB stores BlueStore's internal metadata 
Since Ceph has to write all data to the journal 

For hosts with multiple HDDs (multiple OSDs), it is possible to use one SSD for all OSDs' DB/WAL (one partition per OSD).
NOTE: The recommended scenario 
  * multiple HDDs (one OSD per HDD)
  * one fast SSD/NVMe drive for DB/

The Proxmox UI and CLI expect only a whole device as the DB device, not a partition! It will not destroy the existing drive: it expects an LVM volume with free space and will create a new LVM volume for the DB/WAL there.
The native Ceph CLI can work with a partition specified as the DB (it also works with a whole drive or LVM).

**MORE INFO:**
  * https://
  * https://
  * https://

==== DB/WAL sizes ====

  * If there is <1GB of fast storage, the best is to use it as WAL only (without DB).
  * If a DB device is specified but an explicit WAL device is not, the WAL will be implicitly colocated with the DB on the faster device.

DB size:
  * (still true for Octopus 15.2.6) DB should be 30GB. And this doesn'
  * all block.db sizes except **4, 30, 286 GB** are pointless,
  * see: [[https://
  * [[https://
  * should have as large as possible logical volumes
  * for RGW (Rados Gateway) workloads: min 4% of the block device size
  * for RBD (Rados Block Device) workloads: 1-2% is enough (2% of 2TB is 40GB)
  * according to ''

===== Adding journal DB/WAL partition =====

If an OSD needs to be shut down for maintenance (i.e. adding a new disc), please set ''

==== Create partition on NVMe drive ====

Reorganize the existing NVMe/SSD disc to make some free space and create an empty partition on it.
<code bash>
# remove cache partition from zpool
zpool list -v
zpool remove rpool /
...
# reorganize partitions
...
blkid
zpool add rpool cache 277455ae-1bfa-41f6-8b89-fd362d35515e
</code>

==== Replace OSD ====

<code bash>
ceph osd tree
ceph device ls-by-host pve5
DEVICE
TOSHIBA_HDWD120_30HN40HAS

### Switch OFF the OSD. Ceph should rebalance data from replicas when the OSD is switched off directly:
ceph osd out X
## or better use the lines below:
# this is optional, for safety on small clusters, instead of using: ceph osd out osd.2
ceph osd reweight osd.X 0
# wait for data migration away from osd.X
watch 'ceph -s; ceph osd df tree'

# Remove OSD
ceph osd out X
ceph osd safe-to-destroy osd.X
ceph osd down X
systemctl stop ceph-osd@X.service
ceph osd destroy X
#pveceph osd destroy X

# to remove partition table, boot sector and any OSD leftover:
ceph-volume lvm zap /dev/sdX --destroy

## it is not possible to specify a DB partition with the pveceph command (read the beginning of the page):
# pveceph osd create /dev/sdc --db_dev /
## it requires a whole device as db dev with LVM and will create a new LVM on free space, i.e.:
# pveceph osd create /dev/sdc --db_dev /
## so direct 

# Prevent backfilling when the new OSD is added
ceph osd set nobackfill

### Create OSD:
ceph-volume lvm create --osd-id X --bluestore --data /dev/sdc --block.db /
# or split the above into two steps:
ceph-volume lvm prepare --bluestore --data /dev/sdX --block.db /
ceph-volume lvm activate --bluestore X e56ecc53-826d-40b0-a647-xxxxxxxxxxxx
# also possible: ceph-volume lvm activate --all

## DRAFTS:
#
#
</code>

Verify:
<code bash>
ls -l /
lrwxrwxrwx 1 ceph ceph 93 Jan 28 17:59 block -> /
lrwxrwxrwx 1 ceph ceph 14 Jan 28 17:59 block.db -> /

ceph daemon osd.X perf dump | jq '
...
# OR
...

ceph device ls
</code>

And restore backfilling:
<code bash>
ceph osd unset nobackfill
</code>
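After backfilling is re-enabled, a quick sanity check (a sketch; ''X'' is the OSD id used above) that recovery is progressing and the recreated OSD is back up:

<code bash>
# watch recovery/backfill progress
watch -n 10 'ceph -s'
# confirm the recreated OSD is up and in
ceph osd tree | grep "osd.X"
</code>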
Check benefits:
  * observe better latency on OSDs with an NVMe/SSD DB/WAL (see the sketch below)
  * check ''
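One way to watch the latency improvement (a sketch; ''ceph osd perf'' reports per-OSD commit/apply latency in milliseconds):

<code bash>
# per-OSD commit/apply latency (ms)
ceph osd perf
# or watch it continuously
watch -n 5 'ceph osd perf'
</code>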
==== Issues ====

=== auth: unable ===

It is not possible to create a Ceph OSD, neither from the WebUI nor from the command line:

<code>
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /
2021-01-28T10:
</code>

<file init /
[client]

[mds]

</file>
ceph.conf variables:
  * **$cluster** - cluster name. For Proxmox it is ''
  * **$type** - daemon process ''
  * **$id** - daemon or client identifier. For ''
  * **$host** - hostname where the process is running
  * **$name** - expands to $type.$id, i.e. ''
  * **$pid** - expands to the daemon PID
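As an illustration (a sketch; the keyring path in the comment is only an example of how the variables expand, not necessarily what this ceph.conf contains), ''ceph-conf'' should print the value resolved for a given identity:

<code bash>
# e.g. a [client] entry "keyring = /etc/pve/priv/$cluster.$name.keyring"
# with $cluster = ceph and $name = client.bootstrap-osd resolves to:
#   /etc/pve/priv/ceph.client.bootstrap-osd.keyring
ceph-conf --name client.bootstrap-osd --lookup keyring
</code>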
**SOLUTION:**
<code bash>cp /
</code>
Alternative to try: change ceph.conf 
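To verify that the bootstrap-osd key used by ''ceph-volume'' matches the one registered in the cluster (a sketch; ''/var/lib/ceph/bootstrap-osd/ceph.keyring'' is the standard location and an assumption here):

<code bash>
# key that ceph-volume reads from disk (standard location - assumption)
cat /var/lib/ceph/bootstrap-osd/ceph.keyring
# key registered in the cluster (run with admin credentials) - both must match
ceph auth get client.bootstrap-osd
</code>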
=== Unit -.mount is masked. ===

<code>
Running command: /
-->
</code>

It was caused by ''
  * [[https://askubuntu.com/questions/1191596/
  * [[https://bugs.debian.org/cgi-bin/
  * [[https://
**Solution:**
<code bash>
</code>

To list runtime-masked units:
<code bash>ls -l /
</code>

To unescape systemd unit names:
<code bash>systemd-escape -u '
</code>
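For example (a sketch with a hypothetical OSD mount unit name):

<code bash>
# "-" becomes "/" and "\x2d" becomes a literal dash,
# so this should print: var/lib/ceph/osd/ceph-2
systemd-escape -u 'var-lib-ceph-osd-ceph\x2d2'
</code>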
==== performance on slow HDDs ====