All of our Proxmox VMs went dead.
- Category: Computer-related
- Last Updated: Friday, 30 June 2017 22:02
- Published: Friday, 23 June 2017 16:51
- Written by sam
The architecture someone said would never die... died (Ceph full...).
A seven-node Proxmox Ceph cluster with HA.
Every VM on it froze and could not be backed up or exported.
That someone bravely rebooted the last remaining machine hoping things would go back to normal... of course they didn't.
After the reboot, Ceph just kept running recovery; once it finished, the result was the same.
The whole cluster stayed dead.
So let's try to work out how to restore the VMs onto other nodes.
This time the target is two hosts in a cluster plus a machine running FreeNAS as storage, connected through 10G NICs to a 10G switch.
First, try a normal backup. Everything below is my (messy) working notes.
vzdump --dumpdir ./ --compress 114
400 Parameter verification failed.
compress: value '114' does not have a value in the enumeration '0, 1, gzip, lzo'
vzdump {<vmid>} [OPTIONS]
The backup fails because --compress was given no value, so vzdump parsed '114' as the compression type.
vzdump --compress lzo 114
With the compression method specified explicitly, it works.
One more option that is often useful:
-mode <snapshot | stop | suspend> (default = snapshot)
To back up every guest VM:
-all <boolean> (default = 0)
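Putting those options together, a full backup of every guest could look like this (a sketch only; the dump directory is just an example):
vzdump --all 1 --mode snapshot --compress lzo --dumpdir /var/lib/vz/dump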
The backup failed:
root@uat163:/var/lib/ceph/osd/ceph-3/temp/dump# vzdump --compress lzo 114
INFO: starting new backup job: vzdump 114 --compress lzo
INFO: Starting Backup of VM 114 (qemu)
INFO: status = running
INFO: update VM 114: -lock backup
INFO: VM Name: s3
INFO: include disk 'scsi0' 'ceph-vm:vm-114-disk-1'
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-114-2017_06_22-19_59_25.vma.lzo'
ERROR: unable to connect to VM 114 qmp socket - timeout after 30 retries
INFO: aborting backup job
ERROR: VM 114 qmp command 'backup-cancel' failed - unable to connect to VM 114 qmp socket - timeout after 5680 retries
ERROR: Backup of VM 114 failed - unable to connect to VM 114 qmp socket - timeout after 30 retries
INFO: Backup job finished with errors
job errors
This tells you it has lost contact with QMP...
QMP is short for QEMU Machine Protocol.
I tried all night and still had no answer for this.
Take a look at the VM:
qm showcmd <vmid> show command line (debug info)
/usr/bin/kvm -id 114 -chardev 'socket,id=qmp,path=/var/run/qemu-server/114.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/114.pid -daemonize -smbios 'type=1,uuid=1ac4daa1-f07a-4589-acc8-5c7d4f574f22' -name s3 -smp '16,sockets=2,cores=8,maxcpus=16' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga qxl -vnc unix:/var/run/qemu-server/114.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 4096 -k en-us -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -spice 'tls-port=61001,addr=localhost,tls-ciphers=DES-CBC3-SHA,seamless-migration=on' -device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' -chardev 'spicevmc,id=vdagent,name=vdagent' -device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:cc2148bea23b' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:ceph-vm/vm-114-disk-1:mon_host=10.56.56.157;10.56.56.158;10.56.56.159;10.56.56.160;10.56.56.161;10.56.56.162;10.56.56.163:id=admin:auth_supported=cephx:keyring=/etc/pve/priv/ceph/ceph-vm.keyring,if=none,id=drive-scsi0,cache=writethrough,format=raw,aio=threads,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap114i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=0E:30:35:D0:9D:DE,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'
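The showcmd output above includes the QMP socket path (/var/run/qemu-server/114.qmp). One quick way to see whether that socket answers at all, assuming socat is installed, is something like:
echo '{"execute":"qmp_capabilities"}' | socat - UNIX-CONNECT:/var/run/qemu-server/114.qmp
On a healthy VM you would at least get the QMP greeting back; in this state it will most likely just hang, consistent with the timeouts in the backup log.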
Backup again
ERROR: Backup of VM 114 failed - command 'qm set 114 --lock backup' failed: exit code 255
root@uat163:/# qm stop 114
VM quit/powerdown failed - terminating now with SIGTERM
root@uat163:/# qm shutdown 114
VM is locked (backup)
Look at qm list
root@uat163:/# qm list
VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
100 Stratoscale-pfsense stopped 2048 10.00 0
101 Stratoscale-2012R2 stopped 4096 110.00 0
114 s3 stopped 4096 50.00 0
104012 10.0.104.12-UGSDEVSB02-Pete stopped 4096 60.00 0
104013 10.0.104.13-UGSDEVSB03-Pete stopped 4096 60.00 0
Look at path
root@uat163:/etc/pve/qemu-server# pvesm path ceph-vm:vm-114-disk-1
rbd:ceph-vm/vm-114-disk-1:mon_host=10.56.56.157;10.56.56.158;10.56.56.159;10.56.56.160;10.56.56.161;10.56.56.162;10.56.56.163:id=admin:auth_supported=cephx:keyring=/etc/pve/priv/ceph/ceph-vm.keyring
Try copy
root@uat163:/etc/pve/qemu-server# rbd -p rbd -m 10.0.252.163 -n client.admin --keyring /etc/pve/priv/ceph/myceph.keyring --auth_supported cephx mv vm-114-disk-1 vm-114-disk-1
Find the keyring location:
root@uat163:/etc/pve# cat ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.56.56.0/24
filestore xattr use omap = true
fsid = 25654002-7cca-4ad3-89c5-a837e99796a8
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120
osd pool default min size = 1
public network = 10.56.56.0/24
try again
root@uat163:/etc/pve# rbd -p rbd -m 10.0.252.163 -n client.admin --keyring /etc/pve/priv/ceph.client.admin.keyring --auth_supported cephx mv vm-114-disk-1 vm-114-disk-1
fail again
2017-06-22 20:51:39.435554 7f36ef685700 0 -- :/3114797443 >> 10.0.252.163:6789/0 pipe(0x55850514f410 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x55850514d140).fault
2017-06-22 20:51:42.437515 7f36d170c700 0 -- :/3114797443 >> 10.0.252.163:6789/0 pipe(0x7f36c4000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f36c4001f90).fault
2017-06-22 20:51:45.441301 7f36ef685700 0 -- :/3114797443 >> 10.0.252.163:6789/0 pipe(0x7f36c4005190 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f36c4006450).fault
Check Ceph
root@uat163:~# ceph -s
cluster 25654002-7cca-4ad3-89c5-a837e99796a8
health HEALTH_ERR
1 full osd(s)
3 near full osd(s)
nearfull,full,sortbitwise,require_jewel_osds flag(s) set
1 mons down, quorum 1,2,3,4,5,6 1,2,3,4,5,6
monmap e7: 7 mons at {0=10.56.56.157:6789/0,1=10.56.56.158:6789/0,2=10.56.56.159:6789/0,3=10.56.56.160:6789/0,4=10.56.56.161:6789/0,5=10.56.56.162:6789/0,6=10.56.56.163:6789/0}
election epoch 98, quorum 1,2,3,4,5,6 1,2,3,4,5,6
osdmap e74: 7 osds: 7 up, 7 in
flags nearfull,full,sortbitwise,require_jewel_osds
pgmap v8362693: 512 pgs, 2 pools, 1110 GB data, 278 kobjects
3341 GB used, 530 GB / 3871 GB avail
511 active+clean
1 active+clean+scrubbing+deep
The pool is full: with a full OSD, Ceph blocks all writes, which is exactly why the VMs froze. Let's mount an NFS share and try dumping to it.
root@uat163:/nfs# mount 10.0.252.231:/mnt/DATA /nfs -o rsize=65536,wsize=65536,async,rw,tcp
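To make that mount survive a reboot, a matching /etc/fstab entry (same server and options as above) would look roughly like:
10.0.252.231:/mnt/DATA  /nfs  nfs  rsize=65536,wsize=65536,async,rw,tcp  0  0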
Check whether a previously taken backup file is intact.
gunzip vzdump-qemu-111-2017_06_04-16_17_48.vma.gz   (it is compressed, so decompress it first)
vma verify vzdump-qemu-111-2017_06_04-16_17_48.vma -v   (then verify it)
CFG: size: 303 name: qemu-server.conf
DEV: dev_id=1 size: 53687091200 devname: drive-scsi0
CTIME: Sun Jun 4 16:17:49 2017
progress 1% (read 536870912 bytes, duration 1 sec)
progress 2% (read 1073741824 bytes, duration 4 sec)
progress 100% (read 53687091200 bytes, duration 43 sec)
total bytes read 53687091200, sparse bytes 46506516480 (86.6%)
space reduction due to 4K zero blocks 0.839%
Next, extract it directly:
vma extract vzdump-qemu-111-2017_06_04-16_17_48.vma abc
root@prox148:/var/lib/vz/dump/abc# ls
disk-drive-scsi0.raw qemu-server.conf
The extracted raw disk can then be used directly.
But our machines are the kind that "normally have no backups".
List the images in the pool currently in use:
root@uat163:/# rbd list --pool ceph-vm
base-8889-disk-1
vm-100-disk-1
vm-101-disk-1
vm-101250-disk-1
vm-102-disk-1
vm-103-disk-1
vm-104-disk-1
Dump it out:
root@uat163:/nfs# rbd export ceph-vm/vm-114-disk-1 ./114.ig
Exporting image: 100% complete...done.
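Since every image will eventually be needed, the list and export commands can be combined into a loop (a sketch; it assumes the NFS mount has enough free space):
for img in $(rbd list --pool ceph-vm); do
    rbd export ceph-vm/$img /nfs/$img.ig
done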
Check its format:
root@uat163:/nfs# qemu-img info 114.ig
image: 114.ig
file format: raw
virtual size: 50G (53687091200 bytes)
disk size: 4.0G
The new storage is LVM-based,
so first create an LV:
lvcreate -L60G -n 104012sam LVM
Check that it was created with lvdisplay:
--- Logical volume ---
LV Path /dev/LVM/104012sam
LV Name 104012sam
VG Name LVM
LV UUID I5v7DY-gSJA-jttP-Zayc-5Pqn-fVYR-ngu9Ty
LV Write Access read/write
LV Creation host, time prox149, 2017-06-23 10:39:22 +0800
LV Status available
# open 1
LV Size 60.00 GiB
Current LE 15360
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:8
Next, try dd...
root@prox149:/var/lib/vz/dump# dd if=104012.ig | pv -s 60G |dd of=/dev/LVM/104012sam
The pv in the middle is only there to show dd's progress.
I recommend installing it; otherwise dd gives you no progress bar at all.
-s takes the size of your image.
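If pv is not available, a rough alternative (assuming a GNU dd new enough to support status=progress; the block size is arbitrary) is to let dd report progress itself:
dd if=104012.ig of=/dev/LVM/104012sam bs=4M status=progress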
This time FreeNAS is used as the storage,
so after bringing the old conf files back they need a bit of editing:
root@prox149:/etc/pve/qemu-server# cat 104002.conf
balloon: 1024
bootdisk: virtio0
cores: 2
memory: 4096
name: 10.0.104.2-UGSDEVDC01-Pete
net0: virtio=52:8A:6D:EC:13:E3,bridge=vmbr0,tag=104
numa: 1
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=3efc9c90-04b8-493a-9690-deb8e7f41538
sockets: 2
virtio0: ceph-vm:vm-104002-disk-1,cache=writeback,size=60G
virtio1: ceph-vm:vm-104002-disk-2,cache=writeback,size=100G
root@prox149:/etc/pve/qemu-server# sed -i -- 's/ceph-vm/freenas-LVM/g' 104002.conf   (of course, use * so you don't have to do them one at a time; see the one-liner after the conf listing below)
root@prox149:/etc/pve/qemu-server# cat 104002.conf
balloon: 1024
bootdisk: virtio0
cores: 2
memory: 4096
name: 10.0.104.2-UGSDEVDC01-Pete
net0: virtio=52:8A:6D:EC:13:E3,bridge=vmbr0,tag=104
numa: 1
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=3efc9c90-04b8-493a-9690-deb8e7f41538
sockets: 2
virtio0: freenas-LVM:vm-104002-disk-1,cache=writeback,size=60G
virtio1: freenas-LVM:vm-104002-disk-2,cache=writeback,size=100G
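As noted above, the same substitution can be applied to every conf in one pass, for example:
sed -i 's/ceph-vm/freenas-LVM/g' /etc/pve/qemu-server/*.conf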
The above is confirmed to work... if you know a better way, please email me,
because we still have another six-node cluster whose Ceph shows no status at all...
If you do keep regular backups, you can also bring an image back with rbd import:
root@uat156:/etc/pve/qemu-server# rbd import /mnt/pve/freenas-NFS/115.ig rbd/vm-115-disk-1
##########################
On the two newly built hosts, the attached FreeNAS storage keeps disappearing: first the LVM vanished by itself, then we switched to NFS, and just now that disappeared too. The fun never stops~~~
root@prox148:/mnt/pve/freenas-NFS/images# ls 115 107
107:
115: