====== Disaster recovery ======

===== replace NVM device =====

Only one NVM slot is available, so the idea is to copy the NVM pool contents to an HDD and then restore them onto the new NVM device.

Stop CEPH:
<code bash>
systemctl stop ceph.target
systemctl stop ceph-osd.target
systemctl stop ceph-mgr.target
systemctl stop ceph-mon.target
systemctl stop ceph-mds.target
systemctl stop ceph-crash.service
</code>
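
It is worth verifying that no CEPH daemon is still running before touching the disks; a minimal check (the unit name pattern is an assumption based on the stock CEPH packaging):
<code bash>
# should print no active ceph units
systemctl list-units 'ceph*' --state=active
</code>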

Backup the partition layout:
<code bash>
sgdisk -b nvm.sgdisk <NVM DEVICE>
sgdisk -p <NVM DEVICE>
</code>

Move the ZFS ''nvmpool'' to the HDDs:
<code bash>
# free space on hddpool and create a sparse zvol big enough to hold the nvmpool member
zfs destroy hddpool/<DATASET>
zfs create -s -b 8192 -V 387.8G hddpool/<ZVOL>

# check which zd device the new zvol received
ls -l /dev/zvol/hddpool/<ZVOL>
lrwxrwxrwx 1 root root 11 01-15 11:00 /dev/zvol/hddpool/<ZVOL> -> ../../zd192

# mirror the current nvmpool member onto the zvol
zpool attach nvmpool 7b375b69-3ef9-c94b-bab5-ef68f13df47c /dev/zd192
</code>
And wait until ''nvmpool'' finishes resilvering onto the zvol (check with ''zpool status'').
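
A minimal way to block until the resilver is done, assuming OpenZFS 2.0 or newer (on older versions poll ''zpool status'' instead):
<code bash>
# returns once the resilver of nvmpool has completed
zpool wait -t resilver nvmpool
</code>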

Remove the old NVM partition from ''nvmpool'':
<code bash>
# detach the old NVM member (GUID as shown by zpool status)
zpool detach nvmpool 7b375b69-3ef9-c94b-bab5-ef68f13df47c
</code>

Remove all ZILs, L2ARCs and swap:
<code bash>
swapoff -a
vi /etc/fstab    # remove or comment out the swap entry on the NVM device

zpool remove hddpool <ZIL DEVICE>
zpool remove hddpool <L2ARC DEVICE>
zpool remove rpool <L2ARC DEVICE>
</code>

The CEPH OSD will be created from scratch, to force a rebuild of the OSD DB (which can be too big due to a metadata bug in a previous CEPH version).
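
A minimal sketch of removing the old OSD so it can be recreated later; the OSD id ''0'' is hypothetical, use the id of the OSD whose DB lives on the NVM device (the monitors must still be reachable from the rest of the cluster):
<code bash>
# mark the OSD out and destroy it together with its volumes (hypothetical id 0)
ceph osd out 0
pveceph osd destroy 0 --cleanup
</code>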

Replace NVM.

Recreate the partitions listed below (a manual sketch follows after the list), or restore the layout from the backup: <code bash>sgdisk -l nvm.sgdisk <NVM DEVICE></code>
  * swap
  * rpool_zil
  * hddpool_zil
  * hddpool_l2arc
  * ceph_db (for a 4GB ceph OSD create 4096MB + 4MB)
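
A minimal sketch of creating the partitions manually; the device name, partition numbers and sizes are assumptions and have to be adapted to the previous layout (see the ''sgdisk -p'' output saved earlier):
<code bash>
# hypothetical device and sizes; -c sets the GPT partition name
sgdisk /dev/nvme0n1 --new 1:0:+32G   -c 1:swap
sgdisk /dev/nvme0n1 --new 2:0:+8G    -c 2:rpool_zil
sgdisk /dev/nvme0n1 --new 3:0:+8G    -c 3:hddpool_zil
sgdisk /dev/nvme0n1 --new 4:0:+64G   -c 4:hddpool_l2arc
sgdisk /dev/nvme0n1 --new 5:0:+4100M -c 5:ceph_db    # 4096MB + 4MB for a 4GB OSD DB
</code>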

Add ZILs and L2ARCs back.
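
A minimal sketch of re-adding them, assuming the new partitions carry the GPT names from the list above (the ''by-partlabel'' paths are an assumption); re-enable swap after restoring its ''/etc/fstab'' entry:
<code bash>
# recreate and enable swap (restore the /etc/fstab entry removed earlier)
mkswap /dev/disk/by-partlabel/swap
swapon -a

# re-attach log (ZIL) and cache (L2ARC) devices
zpool add rpool   log   /dev/disk/by-partlabel/rpool_zil
zpool add hddpool log   /dev/disk/by-partlabel/hddpool_zil
zpool add hddpool cache /dev/disk/by-partlabel/hddpool_l2arc
</code>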

Start ''ceph.target'' again and recreate the CEPH OSD.
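
A minimal sketch of recreating the OSD with its DB on the new ''ceph_db'' partition; the data HDD device and the ''by-partlabel'' path are assumptions:
<code bash>
# create the OSD on the data HDD with the RocksDB/WAL on the new NVM partition
pveceph osd create /dev/sdX --db_dev /dev/disk/by-partlabel/ceph_db
</code>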

Move ''nvmpool'' back from the temporary zvol to the new NVM device:
<code bash>
# attach the new NVM partition (referenced by its partition UUID) and let it resilver
zpool attach nvmpool zd16 426718f1-1b1e-40c0-a6e2-1332fe5c3f2c
# once resilvering has finished, drop the zvol from the mirror
zpool detach nvmpool zd16
</code>

===== Replace rpool device =====

Proxmox ''rpool'' ZFS is located on the 3rd partition (1st is GRUB boot, 2nd is EFI, 3rd is ZFS).
To replace a failed device, the partition layout has to be replicated on the new device first:

With a new device of equal or greater size, simply replicate the partitions:
<code bash>
# replicate layout from SDA to SDB
sgdisk /dev/sda -R /dev/sdb
# generate new UUIDs:
sgdisk -G /dev/sdb
</code>

To replicate the layout on a smaller device, the partitions have to be created manually:
<code bash>
sgdisk -p /dev/sda

Number  Start (sector)    End (sector)  Size       Code  Name
   1    ...               ...           ...        ...   (BIOS boot / GRUB)
   2    ...               ...           ...        ...   (EFI)
   3    ...               ...           ...        ...   (ZFS)

sgdisk --clear /dev/sdb
# recreate the three partitions with the same sizes and type codes as reported above
sgdisk /dev/sdb -a1 --new 1:<START>:<END> -t 1:<CODE>
sgdisk /dev/sdb --new 2:<START>:<END> -t 2:<CODE>
sgdisk /dev/sdb --new 3:<START>:<END> -t 3:<CODE>
</code>

Restore the bootloader:
<code bash>
proxmox-boot-tool format /dev/sdb2
proxmox-boot-tool init /dev/sdb2
proxmox-boot-tool clean
</code>
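
The result can be checked with ''proxmox-boot-tool status'', which lists the ESPs that are kept in sync and the boot mode:
<code bash>
proxmox-boot-tool status
</code>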

<code bash>
# attach the 3rd (ZFS) partition of the new disk prepared above
zpool attach rpool ata-SPCC_Solid_State_Disk_XXXXXXXXXXXX-part3 /dev/sdb3
# once resilvering has finished, remove the failed device
zpool offline rpool ata-SSDPR-CX400-128-G2_XXXXXXXXX-part3
zpool detach rpool ata-SSDPR-CX400-128-G2_XXXXXXXXX-part3
</code>

===== Migrate VM from dead node =====

===== reinstall node =====

Remember to clean any additional device partitions belonging to the previous installation.

Install fresh Proxmox.
Create common cluster-wide mountpoints to local storage.
Copy all ZFS datasets from the backup ZFS pool:
<code bash>
# send each dataset from the backup pool and receive it into the new pool
# (dataset and snapshot names are placeholders)
zfs send rpool2/<DATASET>@<SNAPSHOT> | zfs recv rpool/<DATASET>
...
</code>
For CT volumes it gets more complicated:
<code>
root@pve3:~# zfs send ...
warning: cannot send '...'
cannot receive: failed to read from stream
</code>
The reason for the problem is that the SOURCE dataset is mounted. Solution:
<code bash>
zfs set canmount=off rpool2/<DATASET>
</code>
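
After the datasets have been copied, the property can be reverted on the source if it should stay mountable; a minimal sketch with a placeholder dataset name:
<code bash>
# restore the default / inherited canmount value
zfs inherit canmount rpool2/<DATASET>
</code>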

Try to join the cluster. From the new (reinstalled) node run ''pvecm add <IP OF AN EXISTING CLUSTER NODE>''.