
Alerts

List of alerts

  • job:down
  • pvc:high
  • pvc:fs:high_inode
  • pod:pending
  • certificate:expiration:critical
  • loadbalancer:svc:pending
  • pvc:resize:fail
  • vmagent:down
  • coredns:response-time:high
  • kubelet:eviction:high
  • haproxy:active-backends:none
  • haproxy:backend:down
  • Kubeadm:apiserver:certificate:expiration
  • kubeadm:etcd_ca:certificate:expiration
  • kubeadm:etcd_server:certificate:expiration
  • kubeadm:scheduler_client:certificate:expiration
  • pod:rawfile:CrashLoopBackOff
  • pod:control-plane:down
  • packet:loss:critical
  • frigate:hlb:down
  • apiserver:response-time:high
  • etcd:response-time:high
  • traefik:cpu-throttling:high
  • job:etcd-backup:failed
  • etcd:disk-space:low
  • etcd-c28-c31-kp1-c20-c44:disk:low
  • etcd:disk:low
  • loki:event:discarded
  • pod:unhealthy
  • svc:haproxy
  • heartbeat
  • loki:imagepull:error
  • loki:imagepull:504gatewaytimeout
  • loki:imagepull:502badgateway
  • imagepull:digest:error
  • torob:c25:imagecache2:reboot:high
  • pod:unhealthy:metrics-server
  • KernelOops
  • OOMKilling
  • OOMKilling-Extended
  • ReadOnlyFilesystem
  • DockerHung
  • CorruptDockerOverlay
  • TaskHung
  • IOError
  • NodeProblemGauge
  • node:disk:high-ng
  • node:clock:skew
  • node:nodefs_inodes:eviction
  • node:imagefs_inode:eviction
  • node:NotReady
  • node:not-healthy
  • node:data-disk:none
  • node:inode:low
  • node:taint:missing
  • node:host-id:label-mismatch
  • rawfile:overprovisioned
  • conntrack:limit:low
  • yektanet-gw:load:high
  • volume:metrics:missing
  • svc:pending
  • node:unhealthy
  • pod:imagepullbackoff:recent:crio:high
  • pod:imagepull:containerCreating
  • pod:oom
  • pod:down:critical
  • pvc:high:critical
  • traefik:c13
  • updamus:down
  • vm:disk:high

Notes / handling hints

vmagent:down

  • Might not be related to us at all
  • Might be caused by network issues or a DNS outage / high DNS latency
  • Check general cluster health and the network (see the sketch below)
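
A quick triage sketch for this one; the namespace and label selector are assumptions, so adjust them to wherever vmagent actually runs:

    # Are the vmagent pods actually running? (namespace/label are assumptions)
    kubectl -n monitoring get pods -l app.kubernetes.io/name=vmagent -o wide

    # Rule out a DNS outage / latency problem from inside the cluster
    kubectl run -it --rm dns-test --image=busybox --restart=Never -- \
      nslookup kubernetes.default.svc.cluster.local

    # General cluster health
    kubectl get nodes -o wide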

coredns:response-time:high

  • Check coredns and dnsmasq dashboards
  • Change dnsmasq’s upstream resolvers if needed (see the sketch below)
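
If the dashboards point at slow upstream resolution, measure it directly and then swap dnsmasq’s upstreams; the config path and resolver address below are assumptions:

    # Compare resolution time through dnsmasq/CoreDNS vs. a direct upstream
    dig @<dnsmasq-or-coredns-ip> example.com | grep "Query time"
    dig @1.1.1.1 example.com | grep "Query time"

    # Swap dnsmasq's upstream resolvers (config path is an assumption),
    # then restart the service
    sudo vi /etc/dnsmasq.d/upstream.conf    # change the server=... lines
    sudo systemctl restart dnsmasq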

Kubeadm:apiserver:certificate:expiration

  • Check the master’s resource usage during the update
  • kubeadm certs check-expiration
  • kubeadm certs renew all
  • Restart kube-apiserver, kube-controller-manager, kube-scheduler and etcd so they pick up the renewed certificates (see the sketch below)
  • crictl ps | grep kube-apiserver
  • crictl stop <container_id>
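
A minimal renewal sketch on a master node, assuming kubeadm-managed static control-plane pods and a CRI reachable via crictl:

    # Inspect and renew the certificates
    kubeadm certs check-expiration
    kubeadm certs renew all

    # Restart the control-plane components so they load the renewed certs;
    # stopping the containers lets kubelet recreate them from the static manifests
    for component in kube-apiserver kube-controller-manager kube-scheduler etcd; do
      crictl ps --name "$component" -q | xargs -r crictl stop
    done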

frigate:hlb:down

  • Check frigate’s logs (journalctl -xfu frigate)
  • If there’s an error in the logs, fix it.
  • If not, check the latest sync request in frigate’s logs
  • If the latest request is more than a few minutes old, check hlb’s logs and restart it if needed (see the loadbalancer’s known issues and the sketch below)
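
A triage sketch, assuming both frigate and hlb run as systemd units (the unit name "hlb" is an assumption; the loadbalancer’s known issues page has the authoritative restart procedure):

    # Follow frigate's logs: look for errors and note the timestamp of the
    # latest sync request
    journalctl -xfu frigate

    # If the latest sync is stale, inspect hlb and restart it if needed
    journalctl -u hlb -n 100 --no-pager
    systemctl restart hlb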

job:etcd-backup:failed

  • Either there is an old failed backup job; just delete it (kubectl get jobs -n kube-system)
  • Or the PVC does not have enough space (see the sketch below):
    • Check the etcd backup pod’s logs
    • Add disk space to the data disk on the master
    • Increase the PVC size
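
A sketch of both paths; the job, pod, and PVC names are placeholders, so use whatever kubectl actually reports:

    # Path 1: an old failed job is lingering -- find and delete it
    kubectl -n kube-system get jobs
    kubectl -n kube-system delete job <failed-etcd-backup-job>

    # Path 2: the backup PVC is full -- check the pod's logs, then grow the PVC
    kubectl -n kube-system logs <etcd-backup-pod>
    kubectl -n kube-system get pvc
    # Raise spec.resources.requests.storage (needs an expandable StorageClass
    # and free space on the master's data disk)
    kubectl -n kube-system edit pvc <etcd-backup-pvc>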

etcd:disk-space:low

  • Increase the etcd quota using autohan
  • Or, in case of emergency, add --quota-backend-bytes=8589934592 to the etcd manifest (see below)
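
For the emergency path, the flag goes into etcd’s static-pod manifest on each master; kubelet restarts etcd automatically once the file changes (the path assumes a kubeadm layout):

    # Edit etcd's static-pod manifest (kubeadm's default location)
    sudo vi /etc/kubernetes/manifests/etcd.yaml
    #   add under the etcd command:
    #     - --quota-backend-bytes=8589934592    # 8 GiB backend quota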

conntrack:limit:low

  • Check and increase the conntrack limit immediately (see the sketch below)
  • Might be a false alarm
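
A quick check-and-raise sketch on the affected node; the new limit is an example value, not a recommendation:

    # Current usage vs. limit
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

    # Raise the limit immediately (example value)
    sudo sysctl -w net.netfilter.nf_conntrack_max=1048576

    # Persist it so a reboot doesn't undo the fix
    echo 'net.netfilter.nf_conntrack_max = 1048576' | sudo tee /etc/sysctl.d/99-conntrack.conf
    sudo sysctl --system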

volume:metrics:missing

  • On Docker nodes, restart the kubelet (see below)
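
On the affected Docker node, restarting the kubelet usually brings the volume metrics back; the node name below is a placeholder:

    # On the node reported by the alert
    sudo systemctl restart kubelet
    systemctl status kubelet --no-pager

    # Confirm volume stats are being served again via the kubelet summary API
    kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | head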