That time my k8s control plane was throwing a ton of errors
Alertmanager started sending me a flood of alerts, and when I investigated, it turned out the master node had died.
kube-system kube-controller-manager-ubuntu1 0/1 Error 584 (10m ago) 91d
kube-system kube-proxy-gh67p 1/1 Running 2 (80d ago) 91d
kube-system kube-proxy-pwdk4 1/1 Running 2 (80d ago) 91d
kube-system kube-proxy-sbbrd 1/1 Running 2 (80d ago) 91d
kube-system kube-scheduler-ubuntu1 0/1 CrashLoopBackOff 594 (11s ago) 91d
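With the controller-manager and scheduler crash-looping like this, the usual first step is to read the logs of the last crashed container and the pod events; a sketch, using the pod names from the listing above:

```shell
# Logs from the previous (crashed) container of the scheduler pod
kubectl logs -n kube-system kube-scheduler-ubuntu1 --previous

# Events and restart reasons for the controller-manager pod
kubectl describe pod -n kube-system kube-controller-manager-ubuntu1
```

In this case both components were dying on lost connections to the apiserver, which pointed further down the stack at etcd.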
# etcd was returning timeout errors
4nm1tsu@ubuntu1:~/manifests$ kubectl apply -f nethack-server.yaml
namespace/nethack-server created
service/service-nethack-server created
persistentvolume/pv-nethack-server created
persistentvolumeclaim/pvc-nethack-server created
Error from server: error when creating "nethack-server.yaml": etcdserver: request timed out
4nm1tsu@ubuntu1:~/manifests$ kubectl exec -n kube-system etcd-ubuntu1 -- sh -c "ETCDCTL_API=3 etcdctl endpoint status --cluster -w table --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.201:2379 | 60c9a074182650b3 | 3.5.6 | 110 MB | true | false | 8 | 28533436 | 28533436 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
4nm1tsu@ubuntu1:~/manifests$ kubectl exec -n kube-system etcd-ubuntu1 -- sh -c "ETCDCTL_API=3 etcdctl --command-timeout=30s --endpoints=https://192.168.1.201:2379 defrag --cluster -w table --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
Finished defragmenting etcd member[https://192.168.1.201:2379]
4nm1tsu@ubuntu1:~/manifests$ kubectl exec -n kube-system etcd-ubuntu1 -- sh -c "ETCDCTL_API=3 etcdctl endpoint status --cluster -w table --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.201:2379 | 60c9a074182650b3 | 3.5.6 | 7.1 MB | true | false | 8 | 28538299 | 28538299 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
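Defragmentation shrank the DB from 110 MB to 7.1 MB. If the keyspace history has never been compacted, compacting up to the current revision before defragmenting reclaims even more space. A sketch using the same kubectl exec wrapper and cert paths as above (the jq dependency on the host is an assumption; without it, read the "revision" field out of the JSON by hand):

```shell
# Read the current revision from the endpoint status JSON.
REV=$(kubectl exec -n kube-system etcd-ubuntu1 -- sh -c \
  "ETCDCTL_API=3 etcdctl endpoint status -w json \
     --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/server.crt \
     --key /etc/kubernetes/pki/etcd/server.key" | jq -r '.[0].Status.header.revision')

# Compact history up to that revision, then defrag returns the freed space to the OS.
kubectl exec -n kube-system etcd-ubuntu1 -- sh -c \
  "ETCDCTL_API=3 etcdctl compact $REV \
     --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/server.crt \
     --key /etc/kubernetes/pki/etcd/server.key"
```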
4nm1tsu@ubuntu1:~$ sudo dd if=/dev/zero of=/tmp.data bs=1M count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 108.15 s, 9.7 MB/s
4nm1tsu@ubuntu2:/$ sudo dd if=/dev/zero of=/tmp.data bs=1M count=1000 oflag=direct
[sudo] password for 4nm1tsu:
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.121 s, 205 MB/s
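The figures dd reports can be sanity-checked by hand: throughput is just bytes copied divided by elapsed seconds. Redoing the arithmetic for the two runs above (values taken straight from the dd output):

```shell
# ubuntu1: 1048576000 bytes in 108.15 s
awk 'BEGIN { printf "%.1f MB/s\n", 1048576000 / 108.15 / 1000000 }'   # 9.7 MB/s
# ubuntu2: 1048576000 bytes in 5.121 s
awk 'BEGIN { printf "%.1f MB/s\n", 1048576000 / 5.121 / 1000000 }'    # 204.8 MB/s
```

ubuntu1's disk is writing at roughly 1/20th the speed of ubuntu2's, which lines up with etcd (whose writes are fsync-heavy) timing out.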
# SMART diagnostics
4nm1tsu@ubuntu1:~$ sudo smartctl -c /dev/sda
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.15.0-1041-raspi] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x71) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
Conveyance self-test routine
recommended polling time: ( 1) minutes.
# Run the self-test (in the background)
4nm1tsu@ubuntu1:~$ sudo smartctl -t long /dev/sda
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.15.0-1041-raspi] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 10 minutes for test to complete.
Test will complete after Sat Oct 21 16:09:10 2023 JST
Use smartctl -X to abort test.
# Results
4nm1tsu@ubuntu1:~$ sudo smartctl -l selftest /dev/sda
[sudo] password for 4nm1tsu:
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.15.0-1041-raspi] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 1580 -
# hdparm
4nm1tsu@ubuntu1:~/manifests/mail$ sudo hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 226 MB in 2.00 seconds = 112.96 MB/sec
Timing buffered disk reads: 30 MB in 5.81 seconds = 5.16 MB/sec
4nm1tsu@ubuntu2:~$ sudo hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 1598 MB in 2.00 seconds = 800.43 MB/sec
Timing buffered disk reads: 700 MB in 3.13 seconds = 223.61 MB/sec
# After resetting k8s
4nm1tsu@ubuntu1:~$ sudo hdparm -tT /dev/sda
[sudo] password for 4nm1tsu:
/dev/sda:
Timing cached reads: 1728 MB in 2.00 seconds = 865.82 MB/sec
Timing buffered disk reads: 764 MB in 3.05 seconds = 250.59 MB/sec
References:
etcdctl — Error: context deadline exceeded [Accessing ETCD Control in newer versions of Kubernetes]
Retrieving S.M.A.R.T. information from an HDD on Linux and running HDD self-tests
Update 2024/02/18
ubuntu2 may have died too
4nm1tsu@ubuntu1:~/manifests/minecraft-server-mod$ kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cert-manager cert-manager-7bfbbd5f46-94gln 1/1 Running 1 (87s ago) 4d10h 10.244.3.39 ubuntu4 <none> <none>
cert-manager cert-manager-7bfbbd5f46-kl9r5 1/1 Terminating 12 (67d ago) 96d 10.244.1.31 ubuntu2 <none> <none>
cert-manager cert-manager-cainjector-84cfdc869c-bq59f 1/1 Running 14 (67d ago) 96d 10.244.2.60 ubuntu3 <none> <none>
cert-manager cert-manager-webhook-649b4d699f-qckq8 1/1 Running 1 (87s ago) 4d10h 10.244.3.38 ubuntu4 <none> <none>
cert-manager cert-manager-webhook-649b4d699f-t87h8 1/1 Terminating 5 (67d ago) 96d 10.244.1.29 ubuntu2 <none> <none>
ingress-system ingress-nginx-controller-798796947c-2xvp2 1/1 Running 5 (67d ago) 96d 10.244.2.64 ubuntu3 <none> <none>
kube-flannel kube-flannel-ds-hgv4m 1/1 Running 8 (67d ago) 96d 192.168.1.201 ubuntu1 <none> <none>
kube-flannel kube-flannel-ds-n2kpk 1/1 Running 5 (67d ago) 96d 192.168.1.203 ubuntu3 <none> <none>
kube-flannel kube-flannel-ds-s9h9x 1/1 Running 5 (67d ago) 96d 192.168.1.202 ubuntu2 <none> <none>
kube-flannel kube-flannel-ds-z6z9m 1/1 Running 1 (87s ago) 67d 192.168.1.204 ubuntu4 <none> <none>
kube-system coredns-5dd5756b68-c6ntt 1/1 Running 5 (67d ago) 96d 10.244.2.59 ubuntu3 <none> <none>
kube-system coredns-5dd5756b68-hbgsn 1/1 Running 5 (67d ago) 96d 10.244.2.57 ubuntu3 <none> <none>
kube-system etcd-ubuntu1 1/1 Running 1 (67d ago) 88d 192.168.1.201 ubuntu1 <none> <none>
kube-system kube-apiserver-ubuntu1 1/1 Running 0 67d 192.168.1.201 ubuntu1 <none> <none>
kube-system kube-controller-manager-ubuntu1 1/1 Running 11 (67d ago) 95d 192.168.1.201 ubuntu1 <none> <none>
kube-system kube-proxy-6kvcd 1/1 Running 1 (87s ago) 67d 192.168.1.204 ubuntu4 <none> <none>
kube-system kube-proxy-m4g5m 1/1 Running 5 (67d ago) 95d 192.168.1.203 ubuntu3 <none> <none>
kube-system kube-proxy-m6x5t 1/1 Running 5 (67d ago) 95d 192.168.1.202 ubuntu2 <none> <none>
kube-system kube-proxy-nb2cz 1/1 Running 6 (67d ago) 95d 192.168.1.201 ubuntu1 <none> <none>
kube-system kube-scheduler-ubuntu1 1/1 Running 11 (67d ago) 95d 192.168.1.201 ubuntu1 <none> <none>
kube-system metrics-server-69647569bb-44d6b 1/1 Terminating 5 (67d ago) 94d 10.244.1.39 ubuntu2 <none> <none>
kube-system metrics-server-69647569bb-8jtcd 1/1 Running 0 4d10h 10.244.2.65 ubuntu3 <none> <none>
metallb-system controller-6db6d84dc7-l927k 1/1 Running 1 (87s ago) 4d10h 10.244.3.35 ubuntu4 <none> <none>
metallb-system speaker-4qvcn 1/1 Running 10 (67d ago) 96d 192.168.1.202 ubuntu2 <none> <none>
metallb-system speaker-6qxkw 1/1 Running 10 (67d ago) 96d 192.168.1.203 ubuntu3 <none> <none>
metallb-system speaker-gf6mz 1/1 Running 12 (67d ago) 96d 192.168.1.201 ubuntu1 <none> <none>
metallb-system speaker-tn8fz 1/1 Running 2 (45s ago) 67d 192.168.1.204 ubuntu4 <none> <none>
monitoring alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Terminating 2 (67d ago) 77d 10.244.1.33 ubuntu2 <none> <none>
monitoring kube-prometheus-stack-grafana-69478f5b9d-pcwrg 3/3 Terminating 3 (67d ago) 77d 10.244.1.27 ubuntu2 <none> <none>
monitoring kube-prometheus-stack-grafana-69478f5b9d-z8kbc 0/3 ContainerCreating 0 4d10h <none> ubuntu4 <none> <none>
monitoring kube-prometheus-stack-kube-state-metrics-7ccc7bb9c9-4cbhq 1/1 Running 1 (87s ago) 4d10h 10.244.3.36 ubuntu4 <none> <none>
monitoring kube-prometheus-stack-kube-state-metrics-7ccc7bb9c9-h9dnj 1/1 Terminating 1 (67d ago) 77d 10.244.1.37 ubuntu2 <none> <none>
monitoring kube-prometheus-stack-operator-865fdd5fd5-lvp5l 1/1 Terminating 1 (67d ago) 77d 10.244.1.38 ubuntu2 <none> <none>
monitoring kube-prometheus-stack-operator-865fdd5fd5-mp8sx 1/1 Running 1 (87s ago) 4d10h 10.244.3.37 ubuntu4 <none> <none>
monitoring kube-prometheus-stack-prometheus-node-exporter-4vbgk 1/1 Running 1 (67d ago) 77d 192.168.1.201 ubuntu1 <none> <none>
monitoring kube-prometheus-stack-prometheus-node-exporter-5c7j6 1/1 Running 1 (87s ago) 67d 192.168.1.204 ubuntu4 <none> <none>
monitoring kube-prometheus-stack-prometheus-node-exporter-66blp 1/1 Running 1 (67d ago) 77d 192.168.1.203 ubuntu3 <none> <none>
monitoring kube-prometheus-stack-prometheus-node-exporter-gq2xf 1/1 Running 1 (67d ago) 77d 192.168.1.202 ubuntu2 <none> <none>
monitoring prometheus-kube-prometheus-stack-prometheus-0 2/2 Terminating 2 (67d ago) 77d 10.244.1.28 ubuntu2 <none> <none>
nethack-server nethack-server-deployment-6fcffdb948-jrgs8 0/2 Init:0/1 0 4d10h <none> ubuntu4 <none> <none>
nethack-server nethack-server-deployment-6fcffdb948-qk84s 2/2 Terminating 4 (67d ago) 96d 10.244.1.32 ubuntu2 <none> <none>
nginx db 1/1 Running 2 (67d ago) 96d 10.244.2.61 ubuntu3 <none> <none>
nginx deployment-blog-86f74567d7-6fbs7 1/1 Running 2 (67d ago) 96d 10.244.2.58 ubuntu3 <none> <none>
nginx deployment-blog-86f74567d7-kjkjg 0/1 ContainerCreating 0 4d10h <none> ubuntu4 <none> <none>
nginx deployment-blog-86f74567d7-tbnlk 1/1 Terminating 2 (67d ago) 96d 10.244.1.30 ubuntu2 <none> <none>
nginx mailserver-6c9cf499dd-nfc7n 0/1 ContainerCreating 0 4d10h 192.168.1.204 ubuntu4 <none> <none>
nginx redis 1/1 Terminating 2 (67d ago) 96d 10.244.1.36 ubuntu2 <none> <none>
nginx web-deployment-f6b9465db-h4nc4 1/1 Running 2 (67d ago) 96d 10.244.2.62 ubuntu3 <none> <none>
nginx web-deployment-f6b9465db-q5csb 1/1 Terminating 2 (67d ago) 96d 10.244.1.35 ubuntu2 <none> <none>
nginx web-deployment-f6b9465db-t4p4j 0/1 ContainerCreating 0 4d10h <none> ubuntu4 <none> <none>
zangyo-analyzer zangyo-analyzer-deployment-57889d69c9-srcwh 1/1 Running 8 (67d ago) 96d 10.244.2.63 ubuntu3 <none> <none>
4nm1tsu@ubuntu1:~/manifests/minecraft-server-mod$ kubectl drain --ignore-daemonsets --force ubuntu2
node/ubuntu2 cordoned
error: unable to drain node "ubuntu2" due to error:cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-69647569bb-44d6b, monitoring/alertmanager-kube-prometheus-stack-alertmanager-0, monitoring/kube-prometheus-stack-grafana-69478f5b9d-pcwrg, monitoring/prometheus-kube-prometheus-stack-prometheus-0, continuing command...
There are pending nodes to be drained:
ubuntu2
cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/metrics-server-69647569bb-44d6b, monitoring/alertmanager-kube-prometheus-stack-alertmanager-0, monitoring/kube-prometheus-stack-grafana-69478f5b9d-pcwrg, monitoring/prometheus-kube-prometheus-stack-prometheus-0
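The drain stalls because Prometheus, Grafana, Alertmanager, and metrics-server keep data in emptyDir volumes. As the error message itself suggests, adding the override flag discards that local data and lets the drain proceed (acceptable here, since the node is already NotReady and the data is gone either way):

```shell
kubectl drain ubuntu2 --ignore-daemonsets --force --delete-emptydir-data
```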
4nm1tsu@ubuntu1:~/manifests/minecraft-server-mod$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ubuntu1 Ready control-plane 96d v1.28.2
ubuntu2 NotReady,SchedulingDisabled worker 96d v1.28.2
ubuntu3 Ready worker 96d v1.28.2
ubuntu4 Ready worker 68d v1.28.2