  * [[https://
  * [[https://
  * [[https://
  * [[https://
  * [[https://
  * [[https://

===== Performance tips =====

Ceph is built for scale and works great in large clusters. In a small cluster, every node will be heavily loaded.

  * more OSDs = better parallelism
  * enable ''
  * MTU 9000 (jumbo frames, see the sketch below) [[https://
  * net latency <200us (''
  * [[https://
  * Ceph is incredibly sensitive to latency introduced by CPU C-state transitions. Set ''
  * Disable IOMMU in the kernel
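A minimal sketch of how the network and C-state items above can be checked and applied (interface name ''eth0'', peer address ''10.0.0.2'' and the latency threshold are placeholders, not values from this setup):

<code bash>
# jumbo frames on the Ceph cluster network interface
ip link set dev eth0 mtu 9000
# verify that 9000-byte frames pass end-to-end without fragmentation (8972 = 9000 - 28 header bytes)
ping -M do -s 8972 -c 3 10.0.0.2

# rough round-trip latency check against another Ceph node
ping -c 100 10.0.0.2 | tail -2

# keep the CPU out of deep C-states: disable idle states with wakeup latency above 1 us
cpupower idle-set -D 1
# a persistent alternative is the kernel cmdline, e.g. intel_idle.max_cstate=1 processor.max_cstate=1
</code>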

==== performance on small cluster ====

  * [[https://
  * the number of PGs should be a power of 2 (or halfway between two powers of 2)
  * same utilization (% full) per device
  * same number of PGs per OSD = same number of requests per OSD
  * same number of primary PGs per OSD = read operations spread evenly
    * the primary PG is the original/first PG - the others are replicas; reads are served by the primary PG
  * use relatively more PGs than on a big cluster - better balance, but handling PGs consumes resources
    * e.g. for 7 OSDs x 2 TB the PG autoscaler recommends 256 PGs; after changing to 384, IOPS increase drastically and latency drops (see the sketch below). Setting 512 PGs wasn't possible because of the 250 PG/OSD limit.
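A minimal sketch of applying the example above by hand (the pool name ''rbd'' is an assumption; since Nautilus ''pgp_num'' follows ''pg_num'' automatically):

<code bash>
# optionally raise the per-OSD PG limit first (default is 250)
ceph config set global mon_max_pg_per_osd 300

# set the PG count explicitly; with the autoscaler in "warn" mode it will not be reverted
ceph osd pool set rbd pg_num 384

# watch the data distribution and per-OSD PG count
ceph osd df tree
</code>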

=== balancer ===

<code bash>
ceph mgr module enable balancer
ceph balancer on
ceph balancer mode upmap
</code>

=== CRUSH reweight ===

If possible use ''
Override the default CRUSH assignment.
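A sketch of manual reweighting (the OSD id and weight values are only examples):

<code bash>
# temporary override in the 0.0 - 1.0 range
ceph osd reweight osd.3 0.95

# or change the CRUSH weight itself (conventionally the device size in TiB)
ceph osd crush reweight osd.3 1.82

# automatic variant that lowers the weight of the most utilized OSDs
ceph osd reweight-by-utilization
</code>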

=== PG autoscaler ===

Better
<code bash>
ceph mgr module enable pg_autoscaler
#ceph osd pool set <
ceph osd pool set rbd pg_autoscale_mode warn
</code>
It is possible
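To see what the autoscaler would do for each pool (current PG count, suggested target and mode):

<code bash>
ceph osd pool autoscale-status
</code>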

==== check cluster balance ====

<code bash>
ceph -s
ceph osd df    # shows standard deviation
</code>

There are no built-in tools to show primary PG balancing (a rough workaround is sketched below). A tool is available at https://
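A rough sketch that counts acting-primary PGs per OSD from a PG dump (it assumes the last column of ''pgs_brief'' output is ACTING_PRIMARY, which may differ between Ceph releases):

<code bash>
ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print $NF}' | sort -n | uniq -c | sort -rn
</code>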
| + | |||
| + | ==== fragmentation ==== | ||
| - | Example how to cut some space from '' | ||
| - | * we have 1 spare HDD which will be new Ceph OSD in future | ||
| - | * zpool doesn' | ||
| - | * move '' | ||
| <code bash> | <code bash> | ||
| - | zpool replace nvmpool nvme0n1p4 sdc | + | # ceph tell ' |
| + | osd.0: { | ||
| + | " | ||
| + | } | ||
| + | osd.1: { | ||
| + | " | ||
| + | } | ||
| + | osd.2: { | ||
| + | " | ||
| + | } | ||
| + | osd.3: { | ||
| + | " | ||
| + | } | ||
| + | osd.4: { | ||
| + | " | ||
| + | } | ||
| + | osd.5: { | ||
| + | " | ||
| + | } | ||
| + | osd.6: { | ||
| + | " | ||
| + | } | ||
| </ | </ | ||

==== performance on slow HDDs ====

Do not keep ''

<code bash>
ceph config set osd osd_memory_target 4294967296
ceph config get osd osd_memory_target
4294967296
</code>

If the journal is on SSD, change low_threshold to something bigger. NOTE: check whether this is still valid for BlueStore; it is probably a legacy parameter for Filestore:

<code bash>
# internal parameter calculated from other parameters:
ceph config get osd journal_throttle_low_threshhold
0.600000

# 5GB:
ceph config get osd osd_journal_size
5120
</code>

=== mClock scheduler ===

  * [[https://
  * [[https://
  * [[https://

Upon startup, the Ceph mClock scheduler benchmarks the storage and configures the IOPS limits according to the results:

<code bash>
# ceph tell 'osd.*' config show | grep osd_mclock_max_capacity_iops_hdd
"
"
"
"
"
"
"

# ceph tell 'osd.*' config show | grep osd_mclock_max_sequential
"
"
</code>

Manual benchmark (arguments: total bytes written, block size, object size, number of objects):
<code bash>
ceph tell 'osd.*' bench 12288000 4096 4194304 100
</code>

Override settings:
<code bash>
ceph config dump | grep osd_mclock_max_capacity_iops

for i in $(seq 0 7); do ceph config rm osd.$i osd_mclock_max_capacity_iops_hdd; done
ceph config set global osd_mclock_max_capacity_iops_hdd 111

ceph config dump | grep osd_mclock_max_capacity_iops
</code>

== mClock profiles ==

<code bash>
ceph tell 'osd.*' config show | grep osd_mclock_profile
</code>
<code bash>
ceph tell 'osd.*'
ceph tell 'osd.*' config show | grep osd_mclock_profile
</code>

== mClock custom profile ==

<code bash>
ceph tell '
</code>
| - | |||
| - | It was caused by '' | ||
| - | * [[https:// | ||
| - | * [[https:// | ||
| - | * [[https:// | ||
| - | |||
| - | **Solution: | ||
| - | |||
| - | <code bash> | ||
| - | |||
| - | To list runtime masked units: | ||
| - | <code bash>ls -l / | ||
| - | |||
| - | To unescape systemd unit names: | ||
| - | <code bash> systemd-escape -u ' | ||
| - | |||
| - | |||