That time my k8s control plane was throwing a ton of errors
Alertmanager started sending me a flood of alerts, and when I investigated, it turned out the master node had died.
kube-system kube-controller-manager-ubuntu1 0/1 Error 584 (10m ago) 91d
kube-system kube-proxy-gh67p 1/1 Running 2 (80d ago) 91d
kube-system kube-proxy-pwdk4 1/1 Running 2 (80d ago) 91d
kube-system kube-proxy-sbbrd 1/1 Running 2 (80d ago) 91d
kube-system kube-scheduler-ubuntu1 0/1 CrashLoopBackOff 594 (11s ago) 91d
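With the controller-manager and scheduler crash-looping like this, the usual first step is to read the logs of the last crashed container and the pod events; a sketch, using the pod names from the listing above:

```shell
# Logs from the previous (crashed) container of the scheduler pod
kubectl logs -n kube-system kube-scheduler-ubuntu1 --previous

# Events and restart reasons for the controller-manager pod
kubectl describe pod -n kube-system kube-controller-manager-ubuntu1
```

In this case both components were dying on lost connections to the apiserver, which pointed further down the stack at etcd.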
# etcd was returning timeout errors
4nm1tsu@ubuntu1:~/manifests$ kubectl apply -f nethack-server.yaml
namespace/nethack-server created
service/service-nethack-server created
persistentvolume/pv-nethack-server created
persistentvolumeclaim/pvc-nethack-server created
Error from server: error when creating "nethack-server.yaml": etcdserver: request timed out
4nm1tsu@ubuntu1:~/manifests$ kubectl exec -n kube-system etcd-ubuntu1 -- sh -c "ETCDCTL_API=3 etcdctl endpoint status --cluster -w table --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.201:2379 | 60c9a074182650b3 | 3.5.6 | 110 MB | true | false | 8 | 28533436 | 28533436 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
4nm1tsu@ubuntu1:~/manifests$ kubectl exec -n kube-system etcd-ubuntu1 -- sh -c "ETCDCTL_API=3 etcdctl --command-timeout=30s --endpoints=https://192.168.1.201:2379 defrag --cluster -w table --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
Finished defragmenting etcd member[https://192.168.1.201:2379]
4nm1tsu@ubuntu1:~/manifests$ kubectl exec -n kube-system etcd-ubuntu1 -- sh -c "ETCDCTL_API=3 etcdctl endpoint status --cluster -w table --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.201:2379 | 60c9a074182650b3 | 3.5.6 | 7.1 MB | true | false | 8 | 28538299 | 28538299 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
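Defragmentation shrank the DB from 110 MB to 7.1 MB. If the keyspace history has never been compacted, compacting up to the current revision before defragmenting reclaims even more space. A sketch using the same kubectl exec wrapper and cert paths as above (the jq dependency on the host is an assumption; without it, read the "revision" field out of the JSON by hand):

```shell
# Read the current revision from the endpoint status JSON.
REV=$(kubectl exec -n kube-system etcd-ubuntu1 -- sh -c \
  "ETCDCTL_API=3 etcdctl endpoint status -w json \
     --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/server.crt \
     --key /etc/kubernetes/pki/etcd/server.key" | jq -r '.[0].Status.header.revision')

# Compact history up to that revision, then defrag returns the freed space to the OS.
kubectl exec -n kube-system etcd-ubuntu1 -- sh -c \
  "ETCDCTL_API=3 etcdctl compact $REV \
     --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/server.crt \
     --key /etc/kubernetes/pki/etcd/server.key"
```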
4nm1tsu@ubuntu1:~$ sudo dd if=/dev/zero of=/tmp.data bs=1M count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 108.15 s, 9.7 MB/s
4nm1tsu@ubuntu2:/$ sudo dd if=/dev/zero of=/tmp.data bs=1M count=1000 oflag=direct
[sudo] password for 4nm1tsu:
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.121 s, 205 MB/s
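The figures dd reports can be sanity-checked by hand: throughput is just bytes copied divided by elapsed seconds. Redoing the arithmetic for the two runs above (values taken straight from the dd output):

```shell
# ubuntu1: 1048576000 bytes in 108.15 s
awk 'BEGIN { printf "%.1f MB/s\n", 1048576000 / 108.15 / 1000000 }'   # 9.7 MB/s
# ubuntu2: 1048576000 bytes in 5.121 s
awk 'BEGIN { printf "%.1f MB/s\n", 1048576000 / 5.121 / 1000000 }'    # 204.8 MB/s
```

ubuntu1's disk is writing at roughly 1/20th the speed of ubuntu2's, which lines up with etcd (whose writes are fsync-heavy) timing out.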
# SMART diagnostics
4nm1tsu@ubuntu1:~$ sudo smartctl -c /dev/sda
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.15.0-1041-raspi] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x71) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
Conveyance self-test routine
recommended polling time: ( 1) minutes.
# Run the self-test (in the background)
4nm1tsu@ubuntu1:~$ sudo smartctl -t long /dev/sda
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.15.0-1041-raspi] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 10 minutes for test to complete.
Test will complete after Sat Oct 21 16:09:10 2023 JST
Use smartctl -X to abort test.
# Results
4nm1tsu@ubuntu1:~$ sudo smartctl -l selftest /dev/sda
[sudo] password for 4nm1tsu:
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.15.0-1041-raspi] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 1580 -
# hdparm
4nm1tsu@ubuntu1:~/manifests/mail$ sudo hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 226 MB in 2.00 seconds = 112.96 MB/sec
Timing buffered disk reads: 30 MB in 5.81 seconds = 5.16 MB/sec
4nm1tsu@ubuntu2:~$ sudo hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 1598 MB in 2.00 seconds = 800.43 MB/sec
Timing buffered disk reads: 700 MB in 3.13 seconds = 223.61 MB/sec
# After resetting k8s
4nm1tsu@ubuntu1:~$ sudo hdparm -tT /dev/sda
[sudo] password for 4nm1tsu:
/dev/sda:
Timing cached reads: 1728 MB in 2.00 seconds = 865.82 MB/sec
Timing buffered disk reads: 764 MB in 3.05 seconds = 250.59 MB/sec
References:
etcdctl — Error: context deadline exceeded [Accessing ETCD Control in newer versions of Kubernetes]
Retrieving S.M.A.R.T. information from an HDD on Linux and running HDD self-tests
Update 2024/02/18
ubuntu2 may have died too
4nm1tsu@ubuntu1:~/manifests/minecraft-server-mod$ kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cert-manager cert-manager-7bfbbd5f46-94gln 1/1 Running 1 (87s ago) 4d10h 10.244.3.39 ubuntu4 <none> <none>
cert-manager cert-manager-7bfbbd5f46-kl9r5 1/1 Terminating 12 (67d ago) 96d 10.244.1.31 ubuntu2 <none> <none>
cert-manager cert-manager-cainjector-84cfdc869c-bq59f 1/1 Running 14 (67d ago) 96d 10.244.2.60 ubuntu3 <none> <none>
cert-manager cert-manager-webhook-649b4d699f-qckq8 1/1 Running 1 (87s ago) 4d10h 10.244.3.38 ubuntu4 <none> <none>
cert-manager cert-manager-webhook-649b4d699f-t87h8 1/1 Terminating 5 (67d ago) 96d 10.244.1.29 ubuntu2 <none> <none>
ingress-system ingress-nginx-controller-798796947c-2xvp2 1/1 Running 5 (67d ago) 96d 10.244.2.64 ubuntu3 <none> <none>
kube-flannel kube-flannel-ds-hgv4m 1/1 Running 8 (67d ago) 96d 192.168.1.201 ubuntu1 <none> <none>
kube-flannel kube-flannel-ds-n2kpk 1/1 Running 5 (67d ago) 96d 192.168.1.203 ubuntu3 <none> <none>
kube-flannel kube-flannel-ds-s9h9x 1/1 Running 5 (67d ago) 96d 192.168.1.202 ubuntu2 <none> <none>
kube-flannel kube-flannel-ds-z6z9m 1/1 Running 1 (87s ago) 67d 192.168.1.204 ubuntu4 <none> <none>
kube-system coredns-5dd5756b68-c6ntt 1/1 Running 5 (67d ago) 96d 10.244.2.59 ubuntu3 <none> <none>
kube-system coredns-5dd5756b68-hbgsn 1/1 Running 5 (67d ago) 96d 10.244.2.57 ubuntu3 <none> <none>
kube-system etcd-ubuntu1 1/1 Running 1 (67d ago) 88d 192.168.1.201 ubuntu1 <none> <none>
kube-system kube-apiserver-ubuntu1 1/1 Running 0 67d 192.168.1.201 ubuntu1 <none> <none>
kube-system kube-controller-manager-ubuntu1 1/1 Running 11 (67d ago) 95d 192.168.1.201 ubuntu1 <none> <none>
kube-system kube-proxy-6kvcd 1/1 Running 1 (87s ago) 67d 192.168.1.204 ubuntu4 <none> <none>
kube-system kube-proxy-m4g5m 1/1 Running 5 (67d ago) 95d 192.168.1.203 ubuntu3 <none> <none>
kube-system kube-proxy-m6x5t 1/1 Running 5 (67d ago) 95d 192.168.1.202 ubuntu2 <none> <none>
kube-system kube-proxy-nb2cz 1/1 Running 6 (67d ago) 95d 192.168.1.201 ubuntu1 <none> <none>
kube-system kube-scheduler-ubuntu1 1/1 Running 11 (67d ago) 95d 192.168.1.201 ubuntu1 <none> <none>
kube-system metrics-server-69647569bb-44d6b 1/1 Terminating 5 (67d ago) 94d 10.244.1.39 ubuntu2 <none> <none>
kube-system metrics-server-69647569bb-8jtcd 1/1 Running 0 4d10h 10.244.2.65 ubuntu3 <none> <none>
metallb-system controller-6db6d84dc7-l927k 1/1 Running 1 (87s ago) 4d10h 10.244.3.35 ubuntu4 <none> <none>
metallb-system speaker-4qvcn 1/1 Running 10 (67d ago) 96d 192.168.1.202 ubuntu2 <none> <none>
metallb-system speaker-6qxkw 1/1 Running 10 (67d ago) 96d 192.168.1.203 ubuntu3 <none> <none>
metallb-system speaker-gf6mz 1/1 Running 12 (67d ago) 96d 192.168.1.201 ubuntu1 <none> <none>
metallb-system speaker-tn8fz 1/1 Running 2 (45s ago) 67d 192.168.1.204 ubuntu4 <none> <none>
monitoring alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Terminating 2 (67d ago) 77d 10.244.1.33 ubuntu2 <none> <none>
monitoring kube-prometheus-stack-grafana-69478f5b9d-pcwrg 3/3 Terminating 3 (67d ago) 77d 10.244.1.27 ubuntu2 <none> <none>
monitoring kube-prometheus-stack-grafana-69478f5b9d-z8kbc 0/3 ContainerCreating 0 4d10h <none> ubuntu4 <none> <none>
monitoring kube-prometheus-stack-kube-state-metrics-7ccc7bb9c9-4cbhq 1/1 Running 1 (87s ago) 4d10h 10.244.3.36 ubuntu4 <none> <none>
monitoring kube-prometheus-stack-kube-state-metrics-7ccc7bb9c9-h9dnj 1/1 Terminating 1 (67d ago) 77d 10.244.1.37 ubuntu2 <none> <none>
monitoring kube-prometheus-stack-operator-865fdd5fd5-lvp5l 1/1 Terminating 1 (67d ago) 77d 10.244.1.38 ubuntu2 <none> <none>
monitoring kube-prometheus-stack-operator-865fdd5fd5-mp8sx 1/1 Running 1 (87s ago) 4d10h 10.244.3.37 ubuntu4 <none> <none>
monitoring kube-prometheus-stack-prometheus-node-exporter-4vbgk 1/1 Running 1 (67d ago) 77d 192.168.1.201 ubuntu1 <none> <none>
monitoring kube-prometheus-stack-prometheus-node-exporter-5c7j6 1/1 Running 1 (87s ago) 67d 192.168.1.204 ubuntu4 <none> <none>
monitoring kube-prometheus-stack-prometheus-node-exporter-66blp 1/1 Running 1 (67d ago) 77d 192.168.1.203 ubuntu3 <none> <none>
monitoring kube-prometheus-stack-prometheus-node-exporter-gq2xf 1/1 Running 1 (67d ago) 77d 192.168.1.202 ubuntu2 <none> <none>
monitoring prometheus-kube-prometheus-stack-prometheus-0 2/2 Terminating 2 (67d ago) 77d 10.244.1.28 ubuntu2 <none> <none>
nethack-server nethack-server-deployment-6fcffdb948-jrgs8 0/2 Init:0/1 0 4d10h <none> ubuntu4 <none> <none>
nethack-server nethack-server-deployment-6fcffdb948-qk84s 2/2 Terminating 4 (67d ago) 96d 10.244.1.32 ubuntu2 <none> <none>
nginx db 1/1 Running 2 (67d ago) 96d 10.244.2.61 ubuntu3 <none> <none>
nginx deployment-blog-86f74567d7-6fbs7 1/1 Running 2 (67d ago) 96d 10.244.2.58 ubuntu3 <none> <none>
nginx deployment-blog-86f74567d7-kjkjg 0/1 ContainerCreating 0 4d10h <none> ubuntu4 <none> <none>
nginx deployment-blog-86f74567d7-tbnlk 1/1 Terminating 2 (67d ago) 96d 10.244.1.30 ubuntu2 <none> <none>
nginx mailserver-6c9cf499dd-nfc7n 0/1 ContainerCreating 0 4d10h 192.168.1.204 ubuntu4 <none> <none>
nginx redis 1/1 Terminating 2 (67d ago) 96d 10.244.1.36 ubuntu2 <none> <none>
nginx web-deployment-f6b9465db-h4nc4 1/1 Running 2 (67d ago) 96d 10.244.2.62 ubuntu3 <none> <none>
nginx web-deployment-f6b9465db-q5csb 1/1 Terminating 2 (67d ago) 96d 10.244.1.35 ubuntu2 <none> <none>
nginx web-deployment-f6b9465db-t4p4j 0/1 ContainerCreating 0 4d10h <none> ubuntu4 <none> <none>
zangyo-analyzer zangyo-analyzer-deployment-57889d69c9-srcwh 1/1 Running 8 (67d ago) 96d 10.244.2.63 ubuntu3 <none> <none>
4nm1tsu@ubuntu1:~/manifests/minecraft-server-mod$ kubectl drain --ignore-daemonsets --force ubuntu2
node/ubuntu2 cordoned
error: unable to drain node "ubuntu2" due to error:cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-69647569bb-44d6b, monitoring/alertmanager-kube-prometheus-stack-alertmanager-0, monitoring/kube-prometheus-stack-grafana-69478f5b9d-pcwrg, monitoring/prometheus-kube-prometheus-stack-prometheus-0, continuing command...
There are pending nodes to be drained:
ubuntu2
cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-69647569bb-44d6b, monitoring/alertmanager-kube-prometheus-stack-alertmanager-0, monitoring/kube-prometheus-stack-grafana-69478f5b9d-pcwrg, monitoring/prometheus-kube-prometheus-stack-prometheus-0
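The drain stalls because Prometheus, Grafana, Alertmanager, and metrics-server keep data in emptyDir volumes. As the error message itself suggests, adding the override flag discards that local data and lets the drain proceed (acceptable here, since the node is already NotReady and the data is gone either way):

```shell
kubectl drain ubuntu2 --ignore-daemonsets --force --delete-emptydir-data
```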
4nm1tsu@ubuntu1:~/manifests/minecraft-server-mod$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ubuntu1 Ready control-plane 96d v1.28.2
ubuntu2 NotReady,SchedulingDisabled worker 96d v1.28.2
ubuntu3 Ready worker 96d v1.28.2
ubuntu4 Ready worker 68d v1.28.2