====== Ceph performance ======

Page ''vm:proxmox:ceph:performance'', last updated 2025/11/01 by niziak.
  * [[https://yourcmc.ru/wiki/Ceph_performance]]
  * [[https://accelazh.github.io/ceph/Ceph-Performance-Tuning-Checklist|Ceph Performance Tuning Checklist]]
  * [[https://www.reddit.com/r/ceph/comments/zpk0wo/new_to_ceph_hdd_pool_is_extremely_slow/|New to Ceph, HDD pool is extremely slow]]
  * [[https://forum.proxmox.com/threads/ceph-storage-performance.129408/#post-566971|Ceph Storage Performance]]
  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
  * [[https://www.boniface.me/posts/pvc-ceph-tuning-adventures/]]
  
===== Performance tips =====

Ceph is built for scale and works great in large clusters. In a small cluster every node will be heavily loaded.

  * adapt the PG count to the number of OSDs to spread traffic evenly
  * use ''krbd''
  * more OSDs = better parallelism
  * enable ''writeback'' cache on VMs (possible data loss on consumer SSDs)
  * MTU 9000 (jumbo frames) [[https://ceph.io/en/news/blog/2015/ceph-loves-jumbo-frames/|Ceph Loves Jumbo Frames]]
  * network latency <200us (check with ''ping -s 1000 pve'')
  * [[https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/|Ceph: A Journey to 1 TiB/s]]
    * Ceph is incredibly sensitive to latency introduced by CPU C-state transitions. Set ''Max perf'' in the BIOS to disable C-states, or boot Linux with ''GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1"''
    * disable the IOMMU in the kernel
  
==== performance on small cluster ====

  * [[https://www.youtube.com/watch?v=LlLLJxNcVOY|Configuring Small Ceph Clusters for Optimal Performance - Josh Salomon, Red Hat]]
  * the number of PGs should be a power of 2 (or halfway between two powers of 2)
  * same utilization (% full) per device
  * same number of PGs per OSD = same number of requests per device
  * same number of primary PGs per OSD = read operations spread evenly
    * the primary PG is the original/first copy; the others are replicas. Reads are served by the primary PG.
  * use relatively more PGs than in a big cluster for better balance, but note that handling PGs consumes resources (RAM)
    * e.g. for 7 OSDs x 2TB the PG autoscaler recommends 256 PGs. After changing to 384 PGs, IOPS drastically increased and latency dropped. Setting 512 PGs wasn't possible because of the 250 PG/OSD limit.
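The "power of 2" sizing above can be sketched with the commonly cited rule of thumb (an assumption here, not stated on this page): target PGs ≈ (number of OSDs × 100) / replica count, rounded up to the next power of 2. The ''pg_count'' helper below is purely illustrative, not a Ceph command:

<code bash>
# Rule-of-thumb PG sizing sketch (hypothetical helper, runs without a cluster):
# target = (num_osds * 100) / replica_count, rounded up to a power of 2
pg_count() {
  local osds=$1 replicas=$2
  local target=$(( osds * 100 / replicas ))
  local pg=1
  while [ "$pg" -lt "$target" ]; do pg=$(( pg * 2 )); done
  echo "$pg"
}

pg_count 7 3   # 7 OSDs, 3 replicas: 233 rounds up to 256 -> prints 256
</code>

For 7 OSDs with 3 replicas this gives 256, which lines up with the autoscaler recommendation in the example above.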
  
=== balancer ===

<code bash>
ceph mgr module enable balancer
ceph balancer on
ceph balancer mode upmap
</code>
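To verify the balancer is active afterwards, the standard ''ceph balancer'' status subcommand can be used (shown here as a sketch; it needs a running cluster):

<code bash>
ceph balancer status
</code>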
  
=== CRUSH reweight ===

If possible use the ''balancer'' instead.

CRUSH reweight overrides the default CRUSH weight assignment.
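A sketch of the two reweight mechanisms (the OSD name and weights are illustrative; both need a running cluster):

<code bash>
# override weight (0.0-1.0); note it is not persistent across out/in cycles:
ceph osd reweight osd.2 0.9
# permanent CRUSH weight, conventionally roughly the device size in TiB:
ceph osd crush reweight osd.2 1.8
</code>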
  
  
=== PG autoscaler ===

Better to use it in warn mode, so it does not put unexpected load on the cluster when the PG number changes.

<code bash>
ceph mgr module enable pg_autoscaler
# ceph osd pool set <pool> pg_autoscale_mode <mode>
ceph osd pool set rbd pg_autoscale_mode warn
</code>
  
It is possible to set a desired/target size for a pool. This prevents the autoscaler from moving data every time new data is stored.
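A sketch of hinting the expected pool size to the autoscaler, using the standard ''target_size_bytes'' / ''target_size_ratio'' pool options (pool name and values are illustrative):

<code bash>
ceph osd pool set rbd target_size_bytes 2T
# or as a fraction of the total cluster capacity:
ceph osd pool set rbd target_size_ratio 0.6
</code>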
  
==== check cluster balance ====
  
<code bash>
ceph -s
ceph osd df   # shows per-OSD utilization and standard deviation
</code>
  
There is no built-in tool to show primary PG balancing. A helper script is available at [[https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh]].

==== fragmentation ====
  
<code bash>
# ceph tell 'osd.*' bluestore allocator score block
osd.0: {
    "fragmentation_rating": 0.27187848765399758
}
osd.1: {
    "fragmentation_rating": 0.31147177012467503
}
osd.2: {
    "fragmentation_rating": 0.30870023661486262
}
osd.3: {
    "fragmentation_rating": 0.25266931194419928
}
osd.4: {
    "fragmentation_rating": 0.29409796398594706
}
osd.5: {
    "fragmentation_rating": 0.33731626650673441
}
osd.6: {
    "fragmentation_rating": 0.23903976339003158
}
</code>
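The ratings above can be summarized with a small ''awk'' pipeline, e.g. to spot the worst-fragmented OSD. Sample data is inlined here so the snippet runs without a cluster; in practice you would pipe the ''ceph tell'' output into it:

<code bash>
# Print the worst (highest) fragmentation rating from saved output
frag_max() {
  awk -F': ' '/fragmentation_rating/ { if ($2 + 0 > max) max = $2 + 0 } END { printf "%.3f\n", max }'
}

frag_max <<'EOF'
osd.0: {
    "fragmentation_rating": 0.27187848765399758
}
osd.5: {
    "fragmentation_rating": 0.33731626650673441
}
EOF
# prints 0.337
</code>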
  
==== performance on slow HDDs ====

Do not keep ''osd_memory_target'' below 2G:

<code bash>
ceph config set osd osd_memory_target 4294967296
ceph config get osd osd_memory_target
4294967296
</code>
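The magic number is simply 4 GiB expressed in bytes; other sizes can be computed the same way with plain shell arithmetic (nothing Ceph-specific here):

<code bash>
# 4 GiB in bytes, as used for osd_memory_target above
echo $(( 4 * 1024 * 1024 * 1024 ))   # prints 4294967296
# the 2G lower bound, for comparison:
echo $(( 2 * 1024 * 1024 * 1024 ))   # prints 2147483648
</code>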
  
If the journal is on an SSD, increase ''journal_throttle_low_threshhold''. NOTE: check whether this is still valid for BlueStore; it is probably a legacy parameter for Filestore:

<code bash>
# internal parameter calculated from other parameters:
ceph config get osd journal_throttle_low_threshhold
0.600000

# 5GB:
ceph config get osd osd_journal_size
5120
</code>
  
=== mClock scheduler ===

  * [[https://pve.proxmox.com/wiki/Ceph_mClock_Tuning]]
  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#osd-capacity-determination-automated]]
  * [[https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#set-or-override-max-iops-capacity-of-an-osd]]

Upon startup the mClock scheduler benchmarks the storage and configures the IOPS limits according to the results:
  
<code bash>
ceph tell 'osd.*' config show | grep osd_mclock_max_capacity_iops_hdd
    "osd_mclock_max_capacity_iops_hdd": "269.194638",
    "osd_mclock_max_capacity_iops_hdd": "310.961086",
    "osd_mclock_max_capacity_iops_hdd": "299.505949",
    "osd_mclock_max_capacity_iops_hdd": "345.471699",
    "osd_mclock_max_capacity_iops_hdd": "356.290246",
    "osd_mclock_max_capacity_iops_hdd": "229.234009",
    "osd_mclock_max_capacity_iops_hdd": "266.478860",

# ceph tell 'osd.*' config show | grep osd_mclock_max_sequential
    "osd_mclock_max_sequential_bandwidth_hdd": "157286400",
    "osd_mclock_max_sequential_bandwidth_ssd": "1258291200",
</code>
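One way to pick a sane global override is to average the measured per-OSD values. The sample values are inlined here so the snippet runs standalone; in practice pipe the ''ceph tell'' output into the function:

<code bash>
# Average the measured osd_mclock_max_capacity_iops_hdd values
avg_iops() {
  awk -F'"' '/osd_mclock_max_capacity_iops_hdd/ { sum += $4; n++ } END { printf "%.0f\n", sum / n }'
}

avg_iops <<'EOF'
    "osd_mclock_max_capacity_iops_hdd": "269.194638",
    "osd_mclock_max_capacity_iops_hdd": "310.961086",
    "osd_mclock_max_capacity_iops_hdd": "299.505949",
    "osd_mclock_max_capacity_iops_hdd": "345.471699",
    "osd_mclock_max_capacity_iops_hdd": "356.290246",
    "osd_mclock_max_capacity_iops_hdd": "229.234009",
    "osd_mclock_max_capacity_iops_hdd": "266.478860",
EOF
# prints 297
</code>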
  
Manual benchmark:
<code bash>
ceph tell 'osd.*' bench 12288000 4096 4194304 100
</code>
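Assuming the usual ''ceph tell osd.N bench'' argument order (total bytes, block size, object size, object count), the run above issues 4 KiB writes; the number of write ops is just the division:

<code bash>
# 12288000 total bytes at 4096 bytes per write = number of write ops
echo $(( 12288000 / 4096 ))   # prints 3000
</code>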
  
Override settings:
  
<code bash>
ceph config dump | grep osd_mclock_max_capacity_iops

# remove the per-OSD measured values and set one global limit instead:
for i in $(seq 0 7); do ceph config rm osd.$i osd_mclock_max_capacity_iops_hdd; done
ceph config set global osd_mclock_max_capacity_iops_hdd 111

ceph config dump | grep osd_mclock_max_capacity_iops
</code>
  
== mClock profiles ==
  
<code bash>
ceph tell 'osd.*' config show | grep osd_mclock_profile
</code>
  
<code bash>
ceph tell 'osd.*' config set osd_mclock_profile [high_client_ops|high_recovery_ops|balanced]

ceph tell 'osd.*' config show | grep osd_mclock_profile
</code>
  
== mClock custom profile ==

<code bash>
ceph tell 'osd.*' config set osd_mclock_profile custom
</code>
-