A story about the K8s control plane throwing a flood of errors
kube-system kube-controller-manager-ubuntu1 0/1 Error 584 (10m ago) 91d
kube-system kube-proxy-gh67p 1/1 Running 2 (80d ago) 91d
kube-system kube-proxy-pwdk4 1/1 Running 2 (80d ago) 91d
kube-system kube-proxy-sbbrd 1/1 Running 2 (80d ago) 91d
kube-system kube-scheduler-ubuntu1 0/1 CrashLoopBackOff 594 (11s ago) 91d
# Timeout errors from etcd
4nm1tsu@ubuntu1:~/manifests$ kubectl apply -f nethack-server.yaml
namespace/nethack-server created
service/service-nethack-server created
persistentvolume/pv-nethack-server created
persistentvolumeclaim/pvc-nethack-server created
Error from server: error when creating "nethack-server.yaml": etcdserver: request timed out
4nm1tsu@ubuntu1:~/manifests$ kubectl exec -n kube-system etcd-ubuntu1 -- sh -c "ETCDCTL_API=3 etcdctl endpoint status --cluster -w table --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.201:2379 | 60c9a074182650b3 | 3.5.6 | 110 MB | true | false | 8 | 28533436 | 28533436 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
4nm1tsu@ubuntu1:~/manifests$ kubectl exec -n kube-system etcd-ubuntu1 -- sh -c "ETCDCTL_API=3 etcdctl --command-timeout=30s --endpoints=https://192.168.1.201:2379 defrag --cluster -w table --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
Finished defragmenting etcd member[https://192.168.1.201:2379]
4nm1tsu@ubuntu1:~/manifests$ kubectl exec -n kube-system etcd-ubuntu1 -- sh -c "ETCDCTL_API=3 etcdctl endpoint status --cluster -w table --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.201:2379 | 60c9a074182650b3 | 3.5.6 | 7.1 MB | true | false | 8 | 28538299 | 28538299 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
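The two `endpoint status` tables above show DB SIZE dropping from 110 MB to 7.1 MB after the defrag. As a rough sketch of how much space that reclaimed, the helper below (`pct_reduction` is a hypothetical name, not part of etcdctl) computes the percentage from the two values copied out of the tables.

```shell
#!/bin/sh
# pct_reduction is an illustrative helper, not an etcdctl feature.
# Arguments: size before and size after, in the same unit (MB here).
pct_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Values copied from the DB SIZE column: 110 MB before, 7.1 MB after.
pct_reduction 110 7.1   # prints 93.5
```

In other words, roughly 93.5% of the database file was reclaimable space at the time of the defrag.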
4nm1tsu@ubuntu1:~$ sudo dd if=/dev/zero of=/tmp.data bs=1M count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 108.15 s, 9.7 MB/s
4nm1tsu@ubuntu2:/$ sudo dd if=/dev/zero of=/tmp.data bs=1M count=1000 oflag=direct
[sudo] password for 4nm1tsu:
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.121 s, 205 MB/s
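To put the two `dd` runs side by side, the sketch below recomputes the throughput from the byte counts and elapsed times printed above, then the ratio between the two nodes. The function names are hypothetical, and MB is taken as 1,000,000 bytes, which matches how `dd` reports it here.

```shell
#!/bin/sh
# Illustrative helpers for comparing the dd results; names are made up.

# Throughput in MB/s (1 MB = 1,000,000 bytes) from bytes and seconds.
throughput_mb_s() {
  awk -v bytes="$1" -v secs="$2" \
    'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

# How many times longer the slow run took than the fast run.
slowdown() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

throughput_mb_s 1048576000 108.15   # ubuntu1
throughput_mb_s 1048576000 5.121    # ubuntu2
slowdown 108.15 5.121               # ubuntu1 relative to ubuntu2
```

With these numbers, ubuntu1 took about 21 times as long as ubuntu2 to write the same 1000 MiB.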
# SMART diagnostics
4nm1tsu@ubuntu1:~$ sudo smartctl -c /dev/sda
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.15.0-1041-raspi] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x71) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
Conveyance self-test routine
recommended polling time: ( 1) minutes.
# Run (in the background)
4nm1tsu@ubuntu1:~$ sudo smartctl -t long /dev/sda
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.15.0-1041-raspi] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 10 minutes for test to complete.
Test will complete after Sat Oct 21 16:09:10 2023 JST
Use smartctl -X to abort test.
# Results
4nm1tsu@ubuntu1:~$ sudo smartctl -l selftest /dev/sda
[sudo] password for 4nm1tsu:
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.15.0-1041-raspi] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 1580 -
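If this check were to be repeated, the log line could be tested in a script instead of read by eye. The following is a minimal sketch, assuming the `smartctl -l selftest` output format shown above, where entry `# 1` is the most recent test; `latest_selftest_passed` is a hypothetical helper name.

```shell
#!/bin/sh
# Reads `smartctl -l selftest` output on stdin and succeeds (exit 0)
# only if entry "# 1" reports "Completed without error".
# The helper name is illustrative, not part of smartmontools.
latest_selftest_passed() {
  awk '/^# *1 / { found = ($0 ~ /Completed without error/) }
       END { exit !found }'
}

# Example with the log line from the output above.
if printf '%s\n' '# 1 Extended offline Completed without error 00% 1580 -' \
    | latest_selftest_passed; then
  echo "latest self-test passed"
fi
```

On a real host the input would come from `sudo smartctl -l selftest /dev/sda | latest_selftest_passed`. A clean self-test, as here, only rules out errors the drive itself detected; it says nothing about throughput.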
# hdparm
4nm1tsu@ubuntu1:~/manifests/mail$ sudo hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 226 MB in 2.00 seconds = 112.96 MB/sec
Timing buffered disk reads: 30 MB in 5.81 seconds = 5.16 MB/sec
4nm1tsu@ubuntu2:~$ sudo hdparm -tT /dev/sda
/dev/sda:
Timing cached reads: 1598 MB in 2.00 seconds = 800.43 MB/sec
Timing buffered disk reads: 700 MB in 3.13 seconds = 223.61 MB/sec
# After k8s reset
4nm1tsu@ubuntu1:~$ sudo hdparm -tT /dev/sda
[sudo] password for 4nm1tsu:
/dev/sda:
Timing cached reads: 1728 MB in 2.00 seconds = 865.82 MB/sec
Timing buffered disk reads: 764 MB in 3.05 seconds = 250.59 MB/sec
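The before and after figures for ubuntu1 can also be pulled out of the `hdparm -tT` output mechanically. This is a small sketch assuming the output layout shown above; both function names are hypothetical.

```shell
#!/bin/sh
# Illustrative helpers; neither is part of hdparm.

# Extracts the MB/sec figure from the "buffered disk reads" line on stdin.
buffered_read_mb_s() {
  awk '/Timing buffered disk reads/ { print $(NF - 1) }'
}

# Ratio between a faster and a slower rate.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", fast / slow }'
}

# ubuntu1 before the reset (5.16 MB/sec) vs. after it (250.59 MB/sec).
speedup 5.16 250.59
```

On a live system the extraction would be used as `sudo hdparm -tT /dev/sda | buffered_read_mb_s`. By these measurements, buffered reads on ubuntu1 were roughly 49 times faster after the reset.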
etcdctl — Error: context deadline exceeded [Accessing ETCD Control in newer versions of Kubernetes]
Linuxでの HDDの S.M.A.R.T.情報の取得と、HDD自己診断の実施方法 (Retrieving HDD S.M.A.R.T. information and running HDD self-tests on Linux)