  * [[https://
  * [[https://
  * [[https://
  * [[https://
===== Performance tips =====

Ceph is built for scale and works great in large clusters. In a small cluster every node will be heavily loaded.

  * use ''
  * enable ''

==== performance on small cluster ====

  * [[https://
  * the number of PGs should be a power of 2 (or halfway between two powers of 2)
  * same utilization (% full) per device
  * same number of PGs per OSD = same number of requests per OSD
  * same number of primary PGs per OSD = read operations spread evenly
  * primary PG - the original/first PG; the others are replicas. The primary PG is used for reads.
  * use relatively more PGs than in a big cluster - better balance, but handling PGs consumes resources
  * e.g. for 7 OSDs x 2 TB the PG autoscaler recommends 256 PGs. After changing to 384, IOPS increase drastically and latency drops. Setting 512 PGs wasn't possible because of the 250 PG/OSD limit (see the sketch below).
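A minimal sketch of adjusting the PG count (assuming the ''rbd'' pool used elsewhere on this page; the 250 PG/OSD limit corresponds to the ''mon_max_pg_per_osd'' default):

<code bash>
# current PG count of the pool
ceph osd pool get rbd pg_num

# raise it (a power of 2, or halfway between two powers of 2)
ceph osd pool set rbd pg_num 384

# only if the default 250 PG/OSD limit really has to be exceeded:
# ceph config set global mon_max_pg_per_osd 500
</code>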
=== balancer ===

<code bash>
ceph mgr module enable balancer
ceph balancer on
ceph balancer mode upmap
</code>

=== CRUSH reweight ===

If possible use ''
Override default CRUSH assignment.

=== PG autoscaler ===

Better 

<code bash>
ceph mgr module enable pg_autoscaler
# ceph osd pool set <pool> pg_autoscale_mode <on|warn|off>
ceph osd pool set rbd pg_autoscale_mode warn
</code>

It is possible to set desired/

==== check cluster balance ====

<code bash>
ceph -s
ceph osd df    # shows standard deviation
</code>

There are no built-in tools to show primary PG balancing. A helper script is available at [[https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh]].

==== block.db and block.wal ====

The DB stores BlueStore's internal metadata 
Since Ceph has to write all data to the journal 

For hosts with multiple HDDs (multiple OSDs), it is possible to use one SSD for all OSDs' DB/WAL (one partition per OSD).
NOTE: The recommended scenario 
  * multiple HDDs (one OSD per HDD)
  * one fast SSD/NVMe drive for DB/

The Proxmox UI and CLI expect only a whole device as the DB device, not a partition! It will not destroy the existing drive: it expects an LVM volume with free space and will create a new LVM volume for the DB/WAL there.
The native Ceph CLI can work with a partition specified as the DB (it also works with a whole drive or LVM).

**MORE INFO:**
  * https://
  * https://
  * https://

==== DB/WAL sizes ====

  * If there is <1GB of fast storage, the best is to use it as WAL only (without DB).
  * If a DB device is specified but an explicit WAL device is not, the WAL will be implicitly colocated with the DB on the faster device.

DB size:
  * (still true for Octopus 15.2.6) DB should be 30GB. And this doesn'
  * all block.db sizes except **4, 30, 286 GB** are pointless,
  * see: [[https://
  * [[https://
  * should have as large as possible logical volumes
  * for RGW (Rados Gateway) workloads: min 4% of the block device size
  * for RBD (Rados Block Device) workloads: 1-2% is enough (2% of 2TB is 40GB)
  * according to ''

===== Adding journal DB/WAL partition =====

If an OSD needs to be shut down for maintenance (i.e. adding a new disc), please set ''

==== Create partition on NVMe drive ====

Reorganize the existing NVMe/SSD disc to make some free space and create an empty partition on it.
<code bash>
# remove cache partition from zpool
zpool list -v
zpool remove rpool /
...
# reorganize partitions
...
blkid
zpool add rpool cache 277455ae-1bfa-41f6-8b89-fd362d35515e
</code>

==== Replace OSD ====

<code bash>
ceph osd tree
ceph device ls-by-host pve5
DEVICE
TOSHIBA_HDWD120_30HN40HAS

### Switch OFF the OSD. Ceph should rebalance data from replicas when the OSD is switched off directly:
ceph osd out X
## or better use the lines below:
# this is optional, for safety on small clusters, instead of using: ceph osd out osd.2
ceph osd reweight osd.X 0
# wait for data migration away from osd.X
watch 'ceph -s; ceph osd df tree'

# Remove OSD
ceph osd out X
ceph osd safe-to-destroy osd.X
ceph osd down X
systemctl stop ceph-osd@X.service
ceph osd destroy X
#pveceph osd destroy X

# to remove partition table, boot sector and any OSD leftover:
ceph-volume lvm zap /dev/sdX --destroy

## it is not possible to specify a DB partition with the pveceph command (read the beginning of the page):
# pveceph osd create /dev/sdc --db_dev /
## it requires a whole device as db dev with LVM and will create a new LVM on free space, i.e.:
# pveceph osd create /dev/sdc --db_dev /
## so direct 

# Prevent backfilling when the new OSD is added
ceph osd set nobackfill

### Create OSD:
ceph-volume lvm create --osd-id X --bluestore --data /dev/sdc --block.db /
# or split the above into two steps:
ceph-volume lvm prepare --bluestore --data /dev/sdX --block.db /
ceph-volume lvm activate --bluestore X e56ecc53-826d-40b0-a647-xxxxxxxxxxxx
# also possible: ceph-volume lvm activate --all

## DRAFTS:
#
#
</code>

Verify:
<code bash>
ls -l /
lrwxrwxrwx 1 ceph ceph 93 Jan 28 17:59 block -> /
lrwxrwxrwx 1 ceph ceph 14 Jan 28 17:59 block.db -> /

ceph daemon osd.X perf dump | jq '
...
# OR
...

ceph device ls
</code>

And restore backfilling:
<code bash>
ceph osd unset nobackfill
</code>
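After backfilling is re-enabled, a quick sanity check (a sketch; ''X'' is the OSD id used above) that recovery is progressing and the recreated OSD is back up:

<code bash>
# watch recovery/backfill progress
watch -n 10 'ceph -s'
# confirm the recreated OSD is up and in
ceph osd tree | grep "osd.X"
</code>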
Check benefits:
  * observe better latency on OSDs with an NVMe/SSD DB/WAL (see the sketch below)
  * check ''
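One way to watch the latency improvement (a sketch; ''ceph osd perf'' reports per-OSD commit/apply latency in milliseconds):

<code bash>
# per-OSD commit/apply latency (ms)
ceph osd perf
# or watch it continuously
watch -n 5 'ceph osd perf'
</code>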
==== Issues ====

=== auth: unable ===

It is not possible to create a Ceph OSD, neither from the WebUI nor from the command line:

<code>
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /
2021-01-28T10:
</code>

<file init /
[client]

[mds]

</file>
ceph.conf variables:
  * **$cluster** - cluster name. For Proxmox it is ''
  * **$type** - daemon process ''
  * **$id** - daemon or client identifier. For ''
  * **$host** - hostname where the process is running
  * **$name** - expands to $type.$id, i.e. ''
  * **$pid** - expands to the daemon PID
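As an illustration (a sketch; the keyring path in the comment is only an example of how the variables expand, not necessarily what this ceph.conf contains), ''ceph-conf'' should print the value resolved for a given identity:

<code bash>
# e.g. a [client] entry "keyring = /etc/pve/priv/$cluster.$name.keyring"
# with $cluster = ceph and $name = client.bootstrap-osd resolves to:
#   /etc/pve/priv/ceph.client.bootstrap-osd.keyring
ceph-conf --name client.bootstrap-osd --lookup keyring
</code>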
**SOLUTION:**
<code bash>cp /
</code>
Alternative to try: change ceph.conf 
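To verify that the bootstrap-osd key used by ''ceph-volume'' matches the one registered in the cluster (a sketch; ''/var/lib/ceph/bootstrap-osd/ceph.keyring'' is the standard location and an assumption here):

<code bash>
# key that ceph-volume reads from disk (standard location - assumption)
cat /var/lib/ceph/bootstrap-osd/ceph.keyring
# key registered in the cluster (run with admin credentials) - both must match
ceph auth get client.bootstrap-osd
</code>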
=== Unit -.mount is masked. ===

<code>
Running command: /
-->
</code>

It was caused by ''
  * [[https://askubuntu.com/questions/1191596/
  * [[https://bugs.debian.org/cgi-bin/
  * [[https://
**Solution:**
<code bash>
</code>

To list runtime-masked units:
<code bash>ls -l /
</code>

To unescape systemd unit names:
<code bash>systemd-escape -u '
</code>
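For example (a sketch with a hypothetical OSD mount unit name):

<code bash>
# "-" becomes "/" and "\x2d" becomes a literal dash,
# so this should print: var/lib/ceph/osd/ceph-2
systemd-escape -u 'var-lib-ceph-osd-ceph\x2d2'
</code>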
==== performance on slow HDDs ====