Notifying on k8s cluster failures with Alertmanager

Use the Alertmanager bundled with the previously installed kube-prometheus-stack to send email notifications when something goes wrong in k8s.

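Introduction

In a previous post, I set up system monitoring for k8s using kube-prometheus-stack, a helm chart that deploys a bundle of services useful for k8s monitoring in one go.

This time, I enable the Alertmanager included in kube-prometheus-stack and configure it to send email notifications for system failures.

While looking through the Alerts page in the Prometheus console I found some errors, so I start by dealing with those.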

Errors showing up in Prometheus

Checking the Alerts page in the Prometheus console, the following alerts were in the Firing state.

etcdMembersDown
etcdInsufficientMembers
KubeControllerManagerDown
KubeProxyDown
KubeSchedulerDown

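Incidentally, it is fine for the following alerts to be in the Firing state. Watchdog is designed to fire at all times to show that the alerting pipeline works, and InfoInhibitor exists only to inhibit info-level alerts.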

Watchdog
InfoInhibitor

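Going to "Status" → "Targets" in the Prometheus console showed errors on the corresponding targets; the cause appeared to be that Prometheus could not reach their endpoints.

Searching the GitHub issues, I found comments that addressed exactly this, so I follow them here.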

[prometheus-kube-stack] Target Kubelet 0/0 up and others are down · Issue #204 · prometheus-community/helm-charts (github.com)

[kube-prometheus-stack] KubeScheduler & Etcd & CubeController-Manager Targets Down - Prometheus Operator · Issue #1966 · prometheus-community/helm-charts (github.com)

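The cause seems to be that these services bind only to the local loopback address, so they cannot be reached from outside the node.

etcd/kube-controller-manager/kube-scheduler

SSH into the control plane and, in the files below, change bind-address / listen-metrics-urls and the host address of livenessProbe / startupProbe to 0.0.0.0.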

/etc/kubernetes/manifests/kube-controller-manager.yaml
/etc/kubernetes/manifests/kube-scheduler.yaml
/etc/kubernetes/manifests/etcd.yaml

Below are examples of the edited kube-controller-manager.yaml and etcd.yaml.

kube-controller-manager.yaml

spec:
  containers:
  - command:
  ...
    - --bind-address=0.0.0.0
    ...
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 0.0.0.0
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-controller-manager
    resources:
      requests:
        cpu: 200m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 0.0.0.0
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15

etcd.yaml

spec:
  containers:
  - command:
  ...
    - --listen-metrics-urls=http://0.0.0.0:2381
    ...
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 0.0.0.0
        path: /health?exclude=NOSPACE&serializable=true
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: etcd
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 0.0.0.0
        path: /health?serializable=false
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
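
The manual edits above could also be scripted. The following is only a sketch: it assumes the manifests currently contain 127.0.0.1 in the places shown (check your own files first), and the function name expose_metrics and the backup directory argument are made up for illustration. The backup is deliberately written outside the manifests directory, on the assumption that the kubelet may try to load any extra file it finds there.

```shell
#!/bin/sh
# Sketch: rewrite loopback addresses to 0.0.0.0 in one static Pod manifest.
# Assumption: the manifest currently uses 127.0.0.1 for these settings.
# $1 = manifest path, $2 = backup directory OUTSIDE the manifests directory.
expose_metrics() {
  manifest="$1"
  backup_dir="$2"
  name=$(basename "$manifest")

  # Keep an untouched copy so the change can be reverted.
  cp "$manifest" "$backup_dir/$name.orig" || return 1

  # Read from the backup and overwrite the manifest in place.
  sed \
    -e 's|--bind-address=127\.0\.0\.1|--bind-address=0.0.0.0|' \
    -e 's|--listen-metrics-urls=http://127\.0\.0\.1:|--listen-metrics-urls=http://0.0.0.0:|' \
    -e 's|host: 127\.0\.0\.1|host: 0.0.0.0|' \
    "$backup_dir/$name.orig" > "$manifest"
}

# Example (run as root on the control plane node; not executed here):
#   mkdir -p /root/manifest-backup
#   for f in kube-controller-manager kube-scheduler etcd; do
#     expose_metrics "/etc/kubernetes/manifests/$f.yaml" /root/manifest-backup
#   done
```

Once a manifest changes, the kubelet should recreate the corresponding static Pod on its own.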

kube-proxy

kubectl edit cm/kube-proxy -n kube-system

An editor opens; change metricsBindAddress to 0.0.0.0:10249, then run:

kubectl delete pod -l k8s-app=kube-proxy -n kube-system

kube-proxy will then start up again, so wait for it.
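
If you prefer not to use an interactive editor, the same change can be sketched as a stream edit. This assumes kubectl prints the embedded kube-proxy config as a multi-line block, so that metricsBindAddress sits on a line of its own. The function name set_metrics_bind is hypothetical.

```shell
#!/bin/sh
# Sketch: rewrite the metricsBindAddress line of a kube-proxy config read
# from stdin, keeping the original indentation, and print it to stdout.
set_metrics_bind() {
  sed -e 's|^\([[:space:]]*metricsBindAddress:\).*$|\1 0.0.0.0:10249|'
}

# Example against a live cluster (not executed here):
#   kubectl get cm kube-proxy -n kube-system -o yaml \
#     | set_metrics_bind \
#     | kubectl replace -f -
#   kubectl delete pod -l k8s-app=kube-proxy -n kube-system
```

Inspect the output of the first two pipeline stages before piping it into kubectl replace, since the sed expression touches every line that starts with metricsBindAddress.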

At this point, the errors seen earlier should be resolved.

Procedure

From here on, I configure Alertmanager. That said, it is just a matter of tweaking the values and applying the change.

helm values definition

values-monitoring.yaml

alertmanager:
  ## Deploy alertmanager
  ##
  enabled: true
...
  config:
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'sample.com:465'
      smtp_from: 'alertmanager@sample.com'
      smtp_auth_username: 'alertmanager@sample.com'
      smtp_auth_password: 'xxx'
      smtp_require_tls: false
...
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 24h
      receiver: alert-email
...
    receivers:
    - name: 'alert-email'
      email_configs:
        - to: 'xxx@sample.com'
...
    replicas: 2

Apply it. Adjust the release name and namespace to your environment.

helm upgrade -f values-monitoring.yaml kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring

The reason for setting the replica count to 2 is explained below.
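
To check the email path without waiting for a real failure, one option is to push a hand-made alert into Alertmanager through its v2 HTTP API. The sketch below only builds the JSON payload. The alert name ManualTestAlert, the label values, and the function name make_test_alert are made up for illustration, and the localhost:9093 address in the usage comment assumes a port-forward that is not part of the setup described here.

```shell
#!/bin/sh
# Sketch: print a minimal Alertmanager v2 API payload for a test alert.
# $1 = alert name (hypothetical default: ManualTestAlert).
make_test_alert() {
  alertname="${1:-ManualTestAlert}"
  printf '[{"labels":{"alertname":"%s","severity":"warning"},"annotations":{"summary":"manual test alert"}}]\n' "$alertname"
}

# Example (assumes Alertmanager is reachable on localhost:9093, for instance
# through kubectl port-forward; not executed here):
#   make_test_alert | curl -sS -X POST \
#     -H 'Content-Type: application/json' \
#     --data @- http://localhost:9093/api/v2/alerts
```

With the route shown above, an alert that matches no more specific route should fall through to the alert-email receiver once group_wait has elapsed.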

Alertmanager shows as disabled

After exposing Alertmanager (via NodePort or similar) and opening its console, Cluster Status showed disabled.

Alertmanager Cluster Status is Disabled · Issue #1452 · prometheus-community/helm-charts (github.com)

Apparently this shows disabled unless the replica count is 2 or more (a redundant setup). Since I was at it anyway, I decided to go redundant.

Notifications are duplicated

With the Alertmanager replica count set to 2, I created the following job as a test, and the same notification arrived twice.

always-fail-job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: always-fail-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: always-fail-container
        image: busybox
        command: ["sh", "-c", "exit 1"]
  backoffLimit: 0

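When running Alertmanager in a redundant setup, it seems you need to pass peer information via the --cluster.peer argument so that the instances cooperate with each other.

When deploying with helm, that setting can apparently be supplied through clusterDomain.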

values-monitoring.yaml

...
  ## The clusterDomain value will be added to the cluster.peer option of the alertmanager.
  ## Without this specified option cluster.peer will have value alertmanager-monitoring-alertmanager-0.alertmanager-operated:9094 (default value)
  ## With this specified option cluster.peer will have value alertmanager-monitoring-alertmanager-0.alertmanager-operated.namespace.svc.cluster-domain:9094

  clusterDomain: "cluster.local"
...
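
To make the address format in the chart comment concrete, the sketch below prints the fully qualified peer address for each replica. The function name peer_address is hypothetical, and the arguments used in the loop (Alertmanager name monitoring-alertmanager, namespace monitoring, 2 replicas) are assumptions based on the values in this post; substitute your own.

```shell
#!/bin/sh
# Sketch: build the --cluster.peer address following the pattern quoted in
# the chart comment:
#   alertmanager-<name>-<ordinal>.alertmanager-operated.<namespace>.svc.<cluster-domain>:9094
peer_address() {
  name="$1"
  ordinal="$2"
  namespace="$3"
  domain="$4"
  printf 'alertmanager-%s-%s.alertmanager-operated.%s.svc.%s:9094\n' \
    "$name" "$ordinal" "$namespace" "$domain"
}

# Print the peers for an assumed 2-replica setup.
i=0
while [ "$i" -lt 2 ]; do
  peer_address monitoring-alertmanager "$i" monitoring cluster.local
  i=$((i + 1))
done
```

Comparing these names with the --cluster.peer arguments on the running Alertmanager Pods is one way to confirm that the clusterDomain value was picked up.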

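Conclusion

Thanks to kube-prometheus-stack, setting up failure notifications was easy.

Of course, a setup like this cannot notice a failure severe enough to kill every Pod. For practical use, it is a good idea to combine it with something like a SaaS external monitoring service.

For my part, I use UptimeRobot, a SaaS uptime monitor, to run periodic HTTP checks against the endpoints of my blog and Misskey. The status can be checked at the link below, and I get notified when an endpoint becomes unreachable.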

4nm1tsu server status (stats.uptimerobot.com)
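
For illustration, the kind of external HTTP check such a service performs can be approximated with curl. This is a rough sketch run from a machine outside the cluster, not how UptimeRobot itself works. The function names is_up and check_url are hypothetical, and treating any 2xx or 3xx status as "up" is an assumption.

```shell
#!/bin/sh
# Sketch: classify an HTTP status code. 2xx and 3xx count as "up" here.
is_up() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Probe a URL with a 10 second timeout and report the result.
# curl prints 000 as the status code when no response was received.
check_url() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
  if is_up "$code"; then
    echo "UP $1 ($code)"
  else
    echo "DOWN $1 ($code)"
    return 1
  fi
}

# Example (not executed here; replace the URL with your own endpoint):
#   check_url https://example.com/ || echo "send a notification here"
```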

References

Prometheus の Alertmanager（と Postfix）でメール通知 - Humanity (tyru.hatenablog.com)
