The Kubernetes cluster that serves our internal systems runs in a self-hosted machine room. A power outage broke the cluster: the symptom was that kubectl commands got no response, which clearly meant the kube-apiserver had stopped serving. With the problem roughly located, the analysis had to continue. Before getting started, it is worth describing the Kubernetes environment:
Node OS: CentOS Linux release 7.9.2009 (Core)
Kubernetes cluster: the control plane consists of three nodes and was built with kubeadm; three etcd members run as containers and form their own cluster; the Kubernetes version is v1.22.3
Control-plane hostnames: master-1.example.xyz, master-2.example.xyz, master-3.example.xyz, with the IPs 192.168.9.128, 192.168.9.129 and 192.168.9.130 respectively
Ansible is installed on master-1 and can manage every node in the cluster.
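As a quick sanity check that ansible on master-1 can actually reach all three control-plane nodes, something like the following should work (a sketch, assuming the nodes are addressable by these hostnames in the ansible inventory):
# verify SSH/ansible connectivity to the three control-plane nodes
ansible master-1,master-2,master-3 -m ping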
The kube-apiserver is stateless, so if it stops unexpectedly, restarting its container is normally enough. The problem was not that simple, though: the command below showed that the etcd containers on all three control-plane hosts had stopped.
ansible master-1,master-2,master-3 -m shell -a 'docker ps -a|grep etcd'
The error log on master-1 was:
panic: freepages: failed to get all reachable pages (page 72: multiple references)
goroutine 138 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc00008a660)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1056 +0xe9
created by go.etcd.io/bbolt.(*DB).freepages
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1054 +0x1cd
The error log on the other two hosts was:
{"level":"warn","ts":"2022-08-28T04:51:59.218Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":730075,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000000b23db.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-08-28T04:51:59.218Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515
go.etcd.io/etcd/server/v3/embed.StartEtcd
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244
go.etcd.io/etcd/server/v3/etcdmain.startEtcd
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122
go.etcd.io/etcd/server/v3/etcdmain.Main
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40
main.main
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32
runtime.main
/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001140c0, 0xc00003d180, 0x1, 0x1)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000198820, 0x122e2fc, 0x2a, 0xc00003d180, 0x1, 0x1)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffc53c36e3d, 0xe, 0x0, 0x0, 0x0, 0x0, 0xc0004987e0, 0x1, 0x1, 0xc000498a20, ...)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000578000, 0xc000578600, 0x0, 0x0)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000578000, 0x1202a6f, 0x6, 0xc0000e0101, 0x2)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a120, 0x12, 0x12)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a120, 0x12, 0x12)
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40 +0x11f
main.main()
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32 +0x45
After reading several articles online, the gist was that etcd's data files had been corrupted and the data needed to be restored. Fortunately we had a backup.
Note: back up your etcd data regularly; without a backup this article would not exist.
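For reference, a minimal backup sketch using etcdctl snapshot save, assuming the same certificate paths used later in this article and the /opt/etcdbak directory that the restore below reads from (the endpoint is an assumption based on master-1's IP):
# take a dated snapshot of etcd; schedule this via cron or a systemd timer
ETCDCTL_API=3 etcdctl snapshot save /opt/etcdbak/snapshot-$(date +%F) \
  --endpoints=https://192.168.9.128:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt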
Looking at the etcd manifest, /etc/kubernetes/manifests/etcd.yaml by default, the etcd data directory is /var/lib/etcd.
(Screenshot: the etcd data directory in etcd.yaml)
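To confirm this on your own cluster, the data directory can be grepped straight out of the manifest (a sketch, assuming the kubeadm-generated layout):
# show the --data-dir flag and the hostPath volume backing it
grep -n -e 'data-dir' -e '/var/lib/etcd' /etc/kubernetes/manifests/etcd.yaml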
We leave the original etcd data directory untouched, create a new data directory /var/lib/etcdnew, and set its permissions (do this on all three control-plane nodes):
mkdir /var/lib/etcdnew
chmod 700 /var/lib/etcdnew
Then run the data restore on the master-1 node:
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcdbak/snapshot-2022-08-03 \
--name master-1.example.xyz \
--initial-cluster=master-1.example.xyz=https://192.168.9.128:2380,master-2.example.xyz=https://192.168.9.129:2380,master-3.example.xyz=https://192.168.9.130:2380 \
--cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--data-dir=/var/lib/etcdnew \
--initial-advertise-peer-urls=https://192.168.9.128:2380
Edit the etcd manifest and change the data directory to "/var/lib/etcdnew", then delete the etcd container; kubelet will automatically create a new one. After doing this, the etcd container on master-1 started normally.
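In practice the manifest edit and container removal on master-1 can look roughly like this (a sketch, assuming /var/lib/etcd appears in the manifest only as the data-dir flag, the hostPath and the mountPath, as in a kubeadm-generated file):
# point the static pod at the restored data directory
sed -i 's|/var/lib/etcd|/var/lib/etcdnew|g' /etc/kubernetes/manifests/etcd.yaml
# remove the old etcd container; kubelet will recreate the static pod from the updated manifest
docker rm -f $(docker ps -aq --filter name=etcd)
After the new container comes up, the member list below confirms the single restored member.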
[root@master-1 ~]# etcdctl --endpoints master-1.example.xyz:2379 \
--cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt member list
15eecdcb869aa54d, started, master-1.example.xyz, https://192.168.9.128:2380, https://192.168.9.128:2379, false
Delete the etcd containers on master-2 and master-3, then join their etcd instances to the cluster that now contains master-1. Add master-2 first (run the following command on master-1).
[root@master-1 ~]# ETCDCTL_API=3 etcdctl --endpoints master-1.example.xyz:2379 \
--cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt member add master-2.example.xyz --peer-urls=https://192.168.9.129:2380
Member c232e54655fae25b added to cluster f6092b7880c7274
master-3 is added in the same way.
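The corresponding command for master-3 follows the same pattern (a sketch built from the peer URL used during the restore; run it on master-1 as well):
ETCDCTL_API=3 etcdctl --endpoints master-1.example.xyz:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt member add master-3.example.xyz --peer-urls=https://192.168.9.130:2380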
After these operations, check the state of the cluster:
[root@master-1 ~]# etcdctl --endpoints master-1.example.xyz:2379 \
--cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt member list
15eecdcb869aa54d, started, master-1.example.xyz, https://192.168.9.128:2380, https://192.168.9.128:2379, false
4483f575a2bcbd42, started, master-2.example.xyz, https://192.168.9.129:2380, https://192.168.9.129:2379, false
c232e54655fae25b, started, master-3.example.xyz, https://192.168.9.130:2380, https://192.168.9.130:2379, false
The cluster is back to normal, and because kubelet keeps the static pods running, the kube-apiserver starts again automatically.
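As an extra sanity check, the health of all three members can be queried in one go (a sketch, with the client endpoints assumed from the node IPs above):
ETCDCTL_API=3 etcdctl --endpoints https://192.168.9.128:2379,https://192.168.9.129:2379,https://192.168.9.130:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt endpoint health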
Running kubectl confirms that the cluster has recovered.
[root@master-1 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master-1.example.xyz Ready 3h43m v1.22.3
master-2.example.xyz Ready 3h43m v1.22.3
master-3.example.xyz Ready 3h43m v1.22.3
......
Summary
Etcd is the core service of Kubernetes, and its data must be backed up regularly. In production it usually runs as a cluster of 3 or 5 members. Note that more members is not necessarily better: with too many members, writes become slower, because a write only returns after a majority of members have committed it. The cluster uses the Raft protocol for consistency and high availability. The data on the individual nodes is kept eventually consistent: at a given moment one node may differ from the others, but they end up holding the same data. That is why only one node had to be restored above; the other two then synchronized their data automatically from the first.