  * [[https://
  * [[https://
  * [[https://
  * [[https://
  * [[https://
  * [[https://

===== Performance tips =====

Ceph is built for scale and works great in large clusters. In a small cluster every node will be heavily loaded.

  * adapt the number of PGs to the number of OSDs to spread traffic evenly
  * use ''
  * more OSDs = better parallelism
  * enable ''
  * MTU 9000 (jumbo frames) [[https://
  * net latency <200us (''
  * [[https://
  * Ceph is incredibly sensitive to latency introduced by CPU C-state transitions (a tuning sketch follows below this list). Set ''
  * disable IOMMU in the kernel
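
A minimal sketch of the jumbo-frame and C-state tuning from the list above; the interface name ''eth1'' and the peer address ''10.0.0.2'' are placeholders for your cluster network:

<code bash>
# enable jumbo frames on the cluster network interface (must match on every node and switch)
ip link set dev eth1 mtu 9000
# verify that 9000-byte frames pass unfragmented (8972 = 9000 - 28 bytes of IP/ICMP headers)
ping -M do -s 8972 -c 5 10.0.0.2

# limit deep C-states at runtime (cpupower tool); alternatively boot with
# intel_idle.max_cstate=1 processor.max_cstate=1 on the kernel command line
cpupower idle-set -D 0
</code>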

==== performance on small cluster ====

  * [[https://www.youtube.com/watch?
  * number of PGs should be a power of 2 (or halfway between powers of 2)
  * same utilization (% full) per device
  * same number of PGs per OSD := same number of requests per device
  * same number of primary PGs per OSD = read operations spread evenly
  * primary PG - original/
  * use relatively more PGs than in a big cluster
  * i.e. for 7 OSDs x 2TB the PG autoscaler recommends 256 PGs. After changing to 384, IOPS drastically increase and latency drops. Setting 512 PGs wasn't possible because of the 250 PGs/OSD limit (see the example below this list).
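
A sketch of raising the PG count as described above; the pool name ''rbd'' is taken from the autoscaler example further down, and ''mon_max_pg_per_osd'' is the option behind the 250 PGs/OSD limit:

<code bash>
# current PG count of the pool
ceph osd pool get rbd pg_num

# raise the PG count (power of 2, or halfway between two powers of 2)
ceph osd pool set rbd pg_num 384

# the ceiling that prevented going to 512 PGs; raise it only if you know what you are doing
ceph config get mon mon_max_pg_per_osd
</code>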
| + | |||
| + | === balancer === | ||
| + | |||
| + | <code bash> | ||
| + | ceph mgr module enable balancer | ||
| + | ceph balancer on | ||
| + | ceph balancer mode upmap | ||
| </ | </ | ||
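
The balancer module also provides commands to verify what it is doing, for example:

<code bash>
# current balancer state and mode
ceph balancer status

# how well the cluster is balanced right now (lower score is better)
ceph balancer eval
</code>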

=== CRUSH reweight ===

If possible use ''

Override default CRUSH assignment.
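
A sketch of overriding the default CRUSH weight of a single OSD; the OSD id and the weight values below are illustrative only:

<code bash>
# show the current CRUSH weights
ceph osd tree

# permanently change the CRUSH weight of osd.3 (roughly its capacity in TiB)
ceph osd crush reweight osd.3 1.8

# temporary override (0.0 - 1.0) kept outside the CRUSH map
ceph osd reweight 3 0.9
</code>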

=== PG autoscaler ===

Better to use it in warn mode, so it does not put unexpected load on the cluster when the PG number changes.
<code bash>
ceph mgr module enable pg_autoscaler
#ceph osd pool set <pool> pg_autoscale_mode <mode>
ceph osd pool set rbd pg_autoscale_mode warn
</code>

It is possible to set the desired/target size of a pool. This prevents the autoscaler from moving data every time new data is stored.
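
For example, the expected pool size can be given either as an absolute value or as a ratio of the cluster capacity (pool name and values are illustrative; set only one of the two):

<code bash>
# absolute expected size of the pool
ceph osd pool set rbd target_size_bytes 2T

# or: expected share of the total cluster capacity
ceph osd pool set rbd target_size_ratio 0.5
</code>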

==== check cluster balance ====

<code bash>
ceph -s
ceph osd df    # shows standard deviation
</code>

There is no built-in tool to show primary PG balancing. A helper script is available at https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh

==== fragmentation ====

<code bash>
# ceph tell '
osd.0: {
    "
}
osd.1: {
    "
}
osd.2: {
    "
}
osd.3: {
    "
}
osd.4: {
    "
}
osd.5: {
    "
}
osd.6: {
    "
}
</code>
| + | |||
| + | |||
| + | ==== performance on slow HDDs ==== | ||
| + | |||
| + | Do not keep '' | ||
<code bash>
ceph config set osd osd_memory_target 4294967296
ceph config get osd osd_memory_target
4294967296
</code>
| + | |||
| + | If journal is on SSD, change low_threshold to sth bigger - NOTE - check if is valid for BLuestore, probably this is legacy paramater for Filestore: | ||
| + | <code bash> | ||
| + | # internal parameter calculated from other parameters: | ||
| + | ceph config get osd journal_throttle_low_threshhold | ||
| + | 0.600000 | ||
| + | |||
| + | # 5GB: | ||
| + | ceph config get osd osd_journal_size | ||
| + | 5120 | ||
| + | </ | ||
| + | |||
| + | === mClock scheduler === | ||
| + | |||
| + | * [[https:// | ||
| + | |||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | |||
| + | Upon startup ceph mClock scheduler performs benchmarking of storage and configure IOPS according to results: | ||
| + | |||
| + | |||
<code bash>
# ceph tell '
"
"
"
"
"
"
"

# ceph tell '
"
"
</code>
| + | |||
| + | Manual benchmark: | ||
| + | <code bash> | ||
| + | ceph tell ' | ||
| + | </ | ||
| + | |||
| + | Override settings: | ||
| + | |||
| + | <code bash> | ||
| + | ceph config dump | grep osd_mclock_max_capacity_iops | ||
| + | |||
| + | for i in $(seq 0 7); do ceph config rm osd.$i osd_mclock_max_capacity_iops_hdd; | ||
| + | ceph config set global osd_mclock_max_capacity_iops_hdd 111 | ||
| + | |||
| + | ceph config dump | grep osd_mclock_max_capacity_iops | ||
| + | </ | ||
| + | |||
| + | == mClock profiles == | ||
| + | |||
| + | <code bash> | ||
| + | ceph tell ' | ||
| + | </ | ||
| + | |||
| + | <code bash> | ||
| + | ceph tell ' | ||
| + | |||
| + | ceph tell ' | ||
| + | </ | ||
| + | |||
| + | == mClock custom profile == | ||
| + | |||
| + | <code bash> | ||
| + | ceph tell ' | ||
| + | </ | ||