Ceph back to normal
- Category: Computers
- Last Updated: Friday, 30 June 2017 15:35
- Published: Thursday, 29 June 2017 10:52
- Written by sam
Following the previous post, in which the VM guests were moved to the new machine, it's time to check on Ceph.
My rough estimate is that we kept creating new guests with no limit and blew far past capacity: before any adjustment the pools were at 220% usage, and so far that has only come down to 148%.
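As an aside, the usage figures can be checked straight from the cluster; ceph df reports global and per-pool usage, with the per-pool %USED column being the one to watch. This is a standard command rather than output captured from this session:
root@uat163:~# ceph df detail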
Let's tackle the basic problem first (it also turns out to be the main one):
root@uat157:~# systemctl status ceph-mon@0.service
● ceph-mon@0.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled)
Drop-In: /lib/systemd/system/ceph-mon@.service.d
└─ceph-after-pve-cluster.conf
Active: failed (Result: start-limit) since Tue 2017-06-20 17:10:33 CST; 1 weeks 0 days ago
Process: 28325 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=28)
Main PID: 28325 (code=exited, status=28)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
root@uat157:~# systemctl status ceph-osd@0.service
● ceph-osd@0.service - Ceph object storage daemon
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled)
Drop-In: /lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: active (running) since Wed 2017-01-25 17:14:01 CST; 5 months 1 days ago
Main PID: 6634 (ceph-osd)
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
└─6634 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
From the above, the OSD side is fine, but the mon has failed and cannot be brought back up; it exited with status 28, i.e. ENOSPC, no space left on device.
Next, look at Ceph's own status. In the output below the quorum matches what the UI showed earlier: mon 0 is no longer in the list.
root@uat163:~# ceph
ceph> health
HEALTH_ERR 1 full osd(s); 3 near full osd(s); nearfull,full,sortbitwise,require_jewel_osds flag(s) set; 1 mons down, quorum 1,2,3,4,5,6 1,2,3,4,5,6; mon.4 low disk space
From inside the ceph shell the picture is the same: one mon down, one full OSD, and three near-full OSDs. Now the full status:
ceph> status
cluster 25654002-7cca-4ad3-89c5-a837e99796a8
health HEALTH_ERR
1 full osd(s)
3 near full osd(s)
nearfull,full,sortbitwise,require_jewel_osds flag(s) set
1 mons down, quorum 1,2,3,4,5,6 1,2,3,4,5,6
mon.4 low disk space
monmap e7: 7 mons at {0=10.56.56.157:6789/0,1=10.56.56.158:6789/0,2=10.56.56.159:6789/0,3=10.56.56.160:6789/0,4=10.56.56.161:6789/0,5=10.56.56.162:6789/0,6=10.56.56.163:6789/0}
election epoch 98, quorum 1,2,3,4,5,6 1,2,3,4,5,6
osdmap e99: 7 osds: 7 up, 7 in
flags nearfull,full,sortbitwise,require_jewel_osds
pgmap v8444950: 512 pgs, 2 pools, 1110 GB data, 278 kobjects
3341 GB used, 530 GB / 3871 GB avail
512 active+clean
While we are here, check the OSD latencies as well; in the output below, osd 6 is far higher than the rest:
root@uat163:~# ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
6 130 276
5 31 51
4 26 41
3 27 44
2 28 44
1 28 46
0 24 34
First, start mon.0 manually:
root@uat157:~# /usr/bin/ceph-mon -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
starting mon.0 rank 0 at 10.56.56.157:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 25654002-7cca-4ad3-89c5-a837e99796a8
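One systemd detail worth noting: the unit failed with Result: start-limit, so systemd may refuse a plain start until the failure counter is cleared. These are standard systemd commands, not part of the original session:
root@uat157:~# systemctl reset-failed ceph-mon@0.service
root@uat157:~# systemctl start ceph-mon@0.service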
If the mon refuses to start and complains about a full disk, you can temporarily lower the following thresholds (mon data avail crit defaults to 5, i.e. 5% free space; set these below the mon's actual free percentage, or free up space on the system disk instead):
mon data avail warn = 5
mon data avail crit = 5
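A minimal sketch of where these would live, in /etc/ceph/ceph.conf on the affected node; the values are my assumption, picked to sit below the 3% free space reported later (the post does not show the exact values used):
[mon]
mon data avail warn = 10
mon data avail crit = 2
Restart the mon after editing and the low-space refusal should be gone.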
With the mon up again, time to decide which image to kill:
root@uat163:~# rbd ls ceph-vm -l
NAME SIZE PARENT FMT PROT LOCK
base-8889-disk-1 61440M 2
base-8889-disk-1@__base__ 61440M 2 yes
vm-100-disk-1 10240M 2 excl
vm-101-disk-1 110G 2 excl
vm-101250-disk-1 32768M 2 excl
vm-102-disk-1 102400M 2 excl
vm-103-disk-1 102400M 2 excl
vm-104-disk-1 102400M 2 excl
vm-104002-disk-1 61440M 2 excl
vm-104002-disk-2 102400M 2 excl
vm-104003-disk-1 61440M 2 excl
vm-104004-disk-1 61440M 2 excl
vm-104005-disk-1 61440M 2 excl
vm-104006-disk-1 61440M 2 excl
vm-104007-disk-1 61440M 2 excl
vm-104008-disk-1 61440M 2 excl
vm-104009-disk-1 61440M 2 excl
vm-104010-disk-1 61440M 2 excl
vm-104011-disk-1 61440M 2 excl
vm-104012-disk-1 61440M 2 excl
vm-104013-disk-1 61440M 2 excl
vm-104014-disk-1 40960M 2 excl
vm-104014-disk-2 200G 2
vm-104014-disk-3 200G 2 excl
vm-104015-disk-1 40960M 2 excl
vm-104015-disk-2 200G 2 excl
vm-104016-disk-1 40960M 2 excl
vm-104017-disk-1 61440M 2 excl
vm-104018-disk-1 61440M 2 excl
vm-105-disk-1 102400M 2 excl
vm-106-disk-1 102400M 2 excl
vm-10601-disk-1 300G 2
vm-107-disk-1 102400M 2 excl
vm-108-disk-1 10240M 2 excl
vm-109-disk-1 51200M 2 excl
vm-110-disk-1 51200M 2 excl
vm-111-disk-1 51200M 2 excl
vm-112-disk-1 51200M 2 excl
vm-113-disk-1 51200M 2 excl
vm-114-disk-1 51200M 2 excl
vm-115-disk-1 200G 2 excl
vm-117882-disk-1 61440M 2 excl
vm-117888-disk-1 102400M 2 excl
vm-204200-disk-1 32768M 2 excl
vm-391-disk-1 20480M 2
vm-821-disk-1 32768M 2
Pick the least important test VM as the first victim. As expected, it errors out: the cluster is full, so the modify operation is simply paused:
root@uat163:~# rbd remove vm-112-disk-1 --pool ceph-vm
2017-06-28 11:23:50.359616 7fbbfdb2a700 0 client.19374307.objecter FULL, paused modify 0x7fbbe4007bd0 tid 6
Check how full each OSD currently is:
root@uat163:/var/lib/ceph/osd/ceph-3/temp/dump/dump# ceph health detail
HEALTH_ERR 1 full osd(s); 3 near full osd(s); nearfull,full,sortbitwise,require_jewel_osds flag(s) set; mon.0 low disk space, shutdown imminent; mon.4 low disk space
osd.6 is full at 95%
osd.2 is near full at 89%
osd.3 is near full at 91%
osd.5 is near full at 92%
nearfull,full,sortbitwise,require_jewel_osds flag(s) set
mon.0 low disk space, shutdown imminent -- 3% avail
mon.4 low disk space -- 21% avail
Raise the full ratio a little and run the command again, and this time it goes through (from this point on, the Proxmox VE web UI works for this as well):
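The ratio command itself is not shown above; on a Jewel cluster (as the require_jewel_osds flag indicates) the temporary bump would look roughly like this, with 0.97 being an assumed value:
root@uat163:~# ceph pg set_full_ratio 0.97
Remember to drop it back to the default 0.95 once space has been reclaimed.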
root@uat163:/var/lib/ceph/osd/ceph-3/temp/dump/dump# rbd remove vm-112-disk-1 --pool ceph-vm
Removing image: 39% complete
Check again once things settle: everything looks normal, and even the latencies have come down.
root@uat163:~# ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
6 9 17
5 7 8
4 6 7
3 4 5
2 4 4
1 5 6
0 8 9
root@uat163:~# ceph
ceph> status
cluster 25654002-7cca-4ad3-89c5-a837e99796a8
health HEALTH_ERR
1 pgs inconsistent
3 near full osd(s)
1 scrub errors
mon.0 low disk space, shutdown imminent
mon.4 low disk space
monmap e7: 7 mons at {0=10.56.56.157:6789/0,1=10.56.56.158:6789/0,2=10.56.56.159:6789/0,3=10.56.56.160:6789/0,4=10.56.56.161:6789/0,5=10.56.56.162:6789/0,6=10.56.56.163:6789/0}
election epoch 110, quorum 0,1,2,3,4,5,6 0,1,2,3,4,5,6
osdmap e118: 7 osds: 7 up, 7 in
flags nearfull,sortbitwise,require_jewel_osds
pgmap v8493948: 512 pgs, 2 pools, 1051 GB data, 263 kobjects
3165 GB used, 706 GB / 3871 GB avail
511 active+clean
1 active+clean+inconsistent
client io 191 kB/s wr, 0 op/s rd, 16 op/s wr
The status above still lists one scrub error; inspect it again and repair it.
root@uat163:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 3 near full osd(s); 1 scrub errors; mon.0 low disk space, shutdown imminent; mon.4 low disk space
pg 1.64 is active+clean+inconsistent, acting [5,2,4]
osd.3 is near full at 86%
osd.5 is near full at 87%
osd.6 is near full at 90%
1 scrub errors
mon.0 low disk space, shutdown imminent -- 3% avail
mon.4 low disk space -- 21% avail
Repair it with the pg repair command:
root@uat163:~# ceph pg repair 1.64
instructing pg 1.64 on osd.5 to repair
Back to normal, apart from the space problem:
root@uat163:~# ceph health detail
HEALTH_ERR 3 near full osd(s); mon.0 low disk space, shutdown imminent; mon.4 low disk space
osd.3 is near full at 86%
osd.5 is near full at 87%
osd.6 is near full at 90%
mon.0 low disk space, shutdown imminent -- 3% avail
mon.4 low disk space -- 21% avail
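To keep an eye on the remaining near-full OSDs while more images get cleaned up, the per-OSD view is handy (a standard command, not part of the original session):
root@uat163:~# ceph osd df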