A keepalived failure that left pods stuck in Pending

乡下的树, July 9, 2021

Problem: a newly created pod stayed in the Pending state, and kubectl describe pods showed nothing abnormal.

[root@k8s-7-21 ~]# kubectl get pods -n app
NAME                                  READY   STATUS    RESTARTS   AGE
dubbo-demo-service-5db6ddc9c5-hvzkp   0/1     Pending   0          15s

Checking node status with kubectl get nodes showed both nodes NotReady:

[root@k8s-7-21 ~]# kubectl get nodes
NAME                STATUS     ROLES         AGE   VERSION
k8s-7-21.host.top   NotReady   master,node   20d   v1.15.4
k8s-7-22.host.top   NotReady   master,node   20d   v1.15.4

The kubelet logs reported dial tcp 10.4.7.10:7443: connect: no route to host.
Note: 10.4.7.10 is the keepalived virtual IP (VIP) fronting the apiservers.

[root@k8s-7-22 ~]# tail -f /data/logs/kubernetes/kube-kubelet/kubelet.stdout.log 
E0111 16:19:34.350714    6544 kubelet.go:2252] node "k8s-7-22.host.top" not found
E0111 16:19:34.450795    6544 kubelet.go:2252] node "k8s-7-22.host.top" not found
E0111 16:19:34.551401    6544 kubelet.go:2252] node "k8s-7-22.host.top" not found
E0111 16:19:34.616951    6544 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.4.7.10:7443/api/v1/pods?fieldSelector=spec.nodeName%3Dk8s-7-22.host.top&limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0111 16:19:34.617011    6544 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:454: Failed to list *v1.Node: Get https://10.4.7.10:7443/api/v1/nodes?fieldSelector=metadata.name%3Dk8s-7-22.host.top&limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0111 16:19:34.617060    6544 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https://10.4.7.10:7443/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0111 16:19:34.617100    6544 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:445: Failed to list *v1.Service: Get https://10.4.7.10:7443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0111 16:19:34.617146    6544 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://10.4.7.10:7443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 10.4.7.10:7443: connect: no route to host
E0111 16:19:34.652720    6544 kubelet.go:2252] node "k8s-7-22.host.top" not found

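For context, a VIP like 10.4.7.10 is normally floated between load-balancer hosts by a VRRP instance. A minimal keepalived sketch of that setup; the interface name, router id, priority, and instance name are assumptions, not the actual config from this cluster:

```
! /etc/keepalived/keepalived.conf (illustrative sketch only)
vrrp_instance VI_1 {
    state MASTER          ! the peer host would use BACKUP
    interface ens33       ! NIC carrying 10.4.7.0/24
    virtual_router_id 51
    priority 100          ! peer gets a lower priority
    advert_int 1
    virtual_ipaddress {
        10.4.7.10         ! the VIP the kubelets dial on :7443
    }
}
```

If keepalived stops, nothing re-announces this address, which is exactly the failure seen below.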

Checking the addresses on the load-balancer host confirmed that 10.4.7.10 was gone:

[root@k8s-7-11 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:86:56:33 brd ff:ff:ff:ff:ff:ff
    inet 10.4.7.11/24 brd 10.4.7.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::d137:b7e3:1bb8:2cad/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
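Rather than eyeballing the full ip a dump on every host, the check can be scripted. A small sketch; the function name is made up here, and the VIP 10.4.7.10 is the one from this cluster:

```shell
# has_vip: report whether an `ip addr` dump (read from stdin) carries a given VIP.
has_vip() {
    if grep -q "inet $1/"; then
        echo "VIP $1 present"
    else
        echo "VIP $1 missing"
    fi
}

# On a node, run: ip addr show ens33 | has_vip 10.4.7.10
```

Run on each load-balancer host, this immediately shows which one (if any) holds the VIP.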

Restart the keepalived service on all load-balancer nodes:

[root@k8s-7-11 ~]# systemctl restart keepalived

Checking the addresses again, the VIP was back:

[root@k8s-7-11 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:86:56:33 brd ff:ff:ff:ff:ff:ff
    inet 10.4.7.11/24 brd 10.4.7.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet 10.4.7.10/32 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::d137:b7e3:1bb8:2cad/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

The kubelet logs returned to normal:

[root@k8s-7-21 ~]# tail -f /data/logs/kubernetes/kube-kubelet/kubelet.stdout.log
I0111 16:21:43.325773    5600 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "docker" (UniqueName: "kubernetes.io/host-path/f798c6c7-7424-4166-91cd-9fc3f58ed9ec-docker") pod "jenkins-79f4d4496c-zh728" (UID: "f798c6c7-7424-4166-91cd-9fc3f58ed9ec") 
I0111 16:21:43.325803    5600 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-84jbv" (UniqueName: "kubernetes.io/secret/f798c6c7-7424-4166-91cd-9fc3f58ed9ec-default-token-84jbv") pod "jenkins-79f4d4496c-zh728" (UID: "f798c6c7-7424-4166-91cd-9fc3f58ed9ec") 
I0111 16:21:43.325816    5600 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "heapster-token-t6n7n" (UniqueName: "kubernetes.io/secret/c59e429f-a19f-4446-a51b-5d97e74f41f3-heapster-token-t6n7n") pod "heapster-976d8cd5-rml8z" (UID: "c59e429f-a19f-4446-a51b-5d97e74f41f3") 
I0111 16:21:43.325848    5600 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "data" (UniqueName: "kubernetes.io/nfs/f798c6c7-7424-4166-91cd-9fc3f58ed9ec-data") pod "jenkins-79f4d4496c-zh728" (UID: "f798c6c7-7424-4166-91cd-9fc3f58ed9ec") 
I0111 16:21:43.325857    5600 reconciler.go:150] Reconciler: start to sync state
W0111 16:21:43.507176    5600 kubelet_pods.go:849] Unable to retrieve pull secret kube-system/harbor for kube-system/kubernetes-dashboard-97c6f7f7c-m879z due to secret "harbor" not found.  The image pull may not succeed.
I0111 16:21:49.124452    5600 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
I0111 16:21:49.149778    5600 kubelet_node_status.go:72] Attempting to register node k8s-7-21.host.top
I0111 16:21:49.156513    5600 kubelet_node_status.go:114] Node k8s-7-21.host.top was previously registered
I0111 16:21:49.156589    5600 kubelet_node_status.go:75] Successfully registered node k8s-7-21.host.top

kubectl get nodes now showed both nodes Ready:

[root@k8s-7-21 ~]# kubectl get nodes
NAME                STATUS   ROLES         AGE   VERSION
k8s-7-21.host.top   Ready    master,node   20d   v1.15.4
k8s-7-22.host.top   Ready    master,node   20d   v1.15.4

And the pod was now Running:

[root@k8s-7-21 ~]# kubectl get pods -n app
NAME                                  READY   STATUS    RESTARTS   AGE
dubbo-demo-service-5db6ddc9c5-c2pv4   1/1     Running   0          17s
[root@k8s-7-21 ~]#

Postscript: the root cause was an earlier manual systemctl restart network. Restarting the network service stopped keepalived, so the VIP 10.4.7.10 was dropped and never re-added.
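One possible hardening, assuming keepalived is managed by systemd (a sketch, not the author's fix): a drop-in that orders keepalived after the network and restarts it automatically if it exits.

```
# Hypothetical drop-in: /etc/systemd/system/keepalived.service.d/restart.conf
[Unit]
After=network.target

[Service]
Restart=always
RestartSec=5
```

After creating the file, apply it with systemctl daemon-reload && systemctl restart keepalived. This does not prevent the brief VIP loss during a network restart, but it avoids keepalived staying down afterwards.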