List of alerts
- job:down
- pvc:high
- pvc:fs:high_inode
- pod:pending
- certificate:expiration:critical
- loadbalancer:svc:pending
- pvc:resize:fail
- vmagent:down
- coredns:response-time:high
- kubelet:eviction:high
- haproxy:active-backends:none
- haproxy:backend:down
- kubeadm:apiserver:certificate:expiration
- kubeadm:etcd_ca:certificate:expiration
- kubeadm:etcd_server:certificate:expiration
- kubeadm:scheduler_client:certificate:expiration
- pod:rawfile:CrashLoopBackOff
- pod:control-plane:down
- packet:loss:critical
- frigate:hlb:down
- apiserver:response-time:high
- etcd:response-time:high
- traefik:cpu-throttling:high
- job:etcd-backup:failed
- etcd:disk-space:low
- etcd-c28-c31-kp1-c20-c44:disk:low
- etcd:disk:low
- loki:event:discarded
- pod:unhealthy
- svc:haproxy
- heartbeat
- loki:imagepull:error
- loki:imagepull:504gatewaytimeout
- loki:imagepull:502badgateway
- imagepull:digest:error
- torob:c25:imagecache2:reboot:high
- pod:unhealthy:metrics-server
- KernelOops
- OOMKilling
- OOMKilling-Extended
- ReadOnlyFilesystem
- DockerHung
- CorruptDockerOverlay
- TaskHung
- IOError
- NodeProblemGauge
- node:disk:high-ng
- node:clock:skew
- node:nodefs_inodes:eviction
- node:imagefs_inode:eviction
- node:NotReady
- node:not-healthy
- node:data-disk:none
- node:inode:low
- node:taint:missing
- node:host-id:label-mismatch
- rawfile:overprovisioned
- conntrack:limit:low
- yektanet-gw:load:high
- volume:metrics:missing
- svc:pending
- node:unhealthy
- pod:imagepullbackoff:recent:crio:high
- pod:imagepull:containerCreating
- pod:oom
- pod:down:critical
- pvc:high:critical
- traefik:c13
- updamus:down
- vm:disk:high
Notes / handling hints
vmagent:down
- May not be an issue on our side
- May be caused by network problems or a DNS outage / high DNS latency
- Check overall cluster health and the network
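A quick triage sketch for the checks above; the `monitoring` namespace and the `app=vmagent` label are assumptions, adjust them to the actual deployment:

```shell
# Is vmagent itself running? (namespace/label are assumptions)
kubectl -n monitoring get pods -l app=vmagent -o wide

# Rule out DNS outage/latency from a node's point of view
time nslookup kubernetes.default.svc.cluster.local

# General cluster health: node state and recent events
kubectl get nodes
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20
```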
coredns:response-time:high
- Check coredns and dnsmasq dashboards
- Change dnsmasq’s upstream resolvers if needed
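To measure the response time directly, something like the following works; the kube-dns ClusterIP and the dnsmasq config paths are assumptions, verify them first:

```shell
# Find the actual CoreDNS service IP, then query it directly
kubectl -n kube-system get svc kube-dns
dig @10.96.0.10 kubernetes.default.svc.cluster.local | grep "Query time"

# Inspect dnsmasq's current upstream resolvers before changing them
grep -r "server=" /etc/dnsmasq.conf /etc/dnsmasq.d/ 2>/dev/null
```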
kubeadm:apiserver:certificate:expiration
- Check the master's resource usage during the renewal
- kubeadm certs check-expiration
- kubeadm certs renew all
- Restart kube-apiserver, kube-controller-manager, kube-scheduler, and etcd so they pick up the renewed certificates
- crictl ps | grep kube-apiserver
- crictl stop <container_id>
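The steps above can be sketched as one sequence on a control-plane node; stopping the static-pod containers makes kubelet recreate them with the new certificates:

```shell
# Verify expiry, then renew everything kubeadm manages
kubeadm certs check-expiration
kubeadm certs renew all

# Restart each static-pod component by stopping its container;
# kubelet recreates it from /etc/kubernetes/manifests
for c in kube-apiserver kube-controller-manager kube-scheduler etcd; do
  crictl ps --name "$c" -q | xargs -r crictl stop
done
```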
frigate:hlb:down
- Check frigate’s logs (journalctl -xfu frigate)
- If there’s an error in the logs, fix it.
- If not, check the latest sync request in frigate’s logs
- If the latest request is more than a few minutes old, check hlb’s logs and restart it if needed (see loadbalancer’s known issues)
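A sketch of the log checks; the `sync` grep pattern and the `hlb` systemd unit name are assumptions about this setup's log format and service naming:

```shell
# Follow frigate's logs live
journalctl -xfu frigate

# Timestamp of the last sync request frigate logged
journalctl -u frigate --no-pager | grep -i sync | tail -n 1

# If that entry is stale, inspect hlb and restart it if needed
journalctl -u hlb --no-pager | tail -n 50
systemctl restart hlb
```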
job:etcd-backup:failed
- Either an old failed backup job is lingering; just delete it (kubectl get jobs -n kube-system)
- Or the PVC does not have enough space:
  - Check the etcd backup pod's logs
  - Add disk to the data disk on the master
  - Increase the PVC size
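The checks above, as commands; list first to find the real job/pod/PVC names (the `<...>` placeholders are not actual names):

```shell
# Find and delete the stale failed job
kubectl -n kube-system get jobs
kubectl -n kube-system delete job <failed-etcd-backup-job>

# Check the backup pod's logs and the PVC's capacity
kubectl -n kube-system logs <etcd-backup-pod>
kubectl -n kube-system get pvc
```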
etcd:disk-space:low
- Increase the etcd quota using autohan
- Or, in an emergency, add --quota-backend-bytes=8589934592 (8 GiB) to the etcd manifest
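To check the DB size against the quota, and to clear the NOSPACE alarm etcd raises once the quota is hit; the cert paths follow kubeadm defaults, adjust if they differ here:

```shell
ETCD_FLAGS="--endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"

# DB size vs. quota
ETCDCTL_API=3 etcdctl $ETCD_FLAGS endpoint status -w table

# After raising --quota-backend-bytes in
# /etc/kubernetes/manifests/etcd.yaml and waiting for the pod restart,
# clear the NOSPACE alarm or etcd stays read-only:
ETCDCTL_API=3 etcdctl $ETCD_FLAGS alarm list
ETCDCTL_API=3 etcdctl $ETCD_FLAGS alarm disarm
```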
conntrack:limit:low
- Check conntrack table usage and raise the limit immediately
- Might be a false alarm
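A sketch of the check and the fix; the new limit value below is an example, not a recommendation, and note that kube-proxy may also manage `nf_conntrack_max` on kubelet nodes:

```shell
# Current usage vs. limit; a ratio near 1 means drops are imminent
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Raise the limit at runtime (example value); persist it in
# /etc/sysctl.d/ so it survives a reboot
sysctl -w net.netfilter.nf_conntrack_max=1048576
```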
volume:metrics:missing
- On Docker-runtime nodes, restart the kubelet
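The restart plus a verification that the volume stats are back, via kubelet's stats summary through the API server; `<node>` is a placeholder for the affected node's name:

```shell
# On the affected Docker-runtime node
systemctl restart kubelet
systemctl status kubelet --no-pager

# From anywhere with kubectl access: confirm volume stats reappear
kubectl get --raw "/api/v1/nodes/<node>/proxy/stats/summary" | head
```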