List of alerts
- job:down
- pvc:high
- pvc:fs:high_inode
- pod:pending
- certificate:expiration:critical
- loadbalancer:svc:pending
- pvc:resize:fail
- vmagent:down
- coredns:response-time:high
- kubelet:eviction:high
- haproxy:active-backends:none
- haproxy:backend:down
- kubeadm:apiserver:certificate:expiration
- kubeadm:etcd_ca:certificate:expiration
- kubeadm:etcd_server:certificate:expiration
- kubeadm:scheduler_client:certificate:expiration
- pod:rawfile:CrashLoopBackOff
- pod:control-plane:down
- packet:loss:critical
- frigate:hlb:down
- apiserver:response-time:high
- etcd:response-time:high
- traefik:cpu-throttling:high
- job:etcd-backup:failed
- etcd:disk-space:low
- etcd-c28-c31-kp1-c20-c44:disk:low
- etcd:disk:low
- loki:event:discarded
- pod:unhealthy
- svc:haproxy
- heartbeat
- loki:imagepull:error
- loki:imagepull:504gatewaytimeout
- loki:imagepull:502badgateway
- imagepull:digest:error
- torob:c25:imagecache2:reboot:high
- pod:unhealthy:metrics-server
- KernelOops
- OOMKilling
- OOMKilling-Extended
- ReadOnlyFilesystem
- DockerHung
- CorruptDockerOverlay
- TaskHung
- IOError
- NodeProblemGauge
- node:disk:high-ng
- node:clock:skew
- node:nodefs_inodes:eviction
- node:imagefs_inode:eviction
- node:NotReady
- node:not-healthy
- node:data-disk:none
- node:inode:low
- node:taint:missing
- node:host-id:label-mismatch
- rawfile:overprovisioned
- conntrack:limit:low
- yektanet-gw:load:high
- volume:metrics:missing
- svc:pending
- node:unhealthy
- pod:imagepullbackoff:recent:crio:high
- pod:imagepull:containerCreating
- pod:oom
- pod:down:critical
- pvc:high:critical
- traefik:c13
- updamus:down
- vm:disk:high
Notes / handling hints
vmagent:down
- May not be an issue on our side
- May be caused by network problems or a DNS outage / high DNS latency
- Check overall cluster health and the network
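A quick triage sketch for the checks above; the `monitoring` namespace and the `app=vmagent` label are assumptions, adjust them to the actual deployment:

```shell
# Is vmagent itself running? (namespace/label are assumptions)
kubectl -n monitoring get pods -l app=vmagent -o wide

# Rule out DNS outage/latency from a node's point of view
time nslookup kubernetes.default.svc.cluster.local

# General cluster health: node state and recent events
kubectl get nodes
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20
```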
coredns:response-time:high
- Check coredns and dnsmasq dashboards
- Change dnsmasq’s upstream resolvers if needed
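To measure the response time directly, something like the following works; the kube-dns ClusterIP and the dnsmasq config paths are assumptions, verify them first:

```shell
# Find the actual CoreDNS service IP, then query it directly
kubectl -n kube-system get svc kube-dns
dig @10.96.0.10 kubernetes.default.svc.cluster.local | grep "Query time"

# Inspect dnsmasq's current upstream resolvers before changing them
grep -r "server=" /etc/dnsmasq.conf /etc/dnsmasq.d/ 2>/dev/null
```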
kubeadm:apiserver:certificate:expiration
- Check the master's resource usage during the renewal
- kubeadm certs check-expiration
- kubeadm certs renew all
- Restart kube-apiserver, kube-controller-manager, kube-scheduler, and etcd so they pick up the renewed certificates
- crictl ps | grep kube-apiserver
- crictl stop <container_id>
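The steps above can be sketched as one sequence on a control-plane node; stopping the static-pod containers makes kubelet recreate them with the new certificates:

```shell
# Verify expiry, then renew everything kubeadm manages
kubeadm certs check-expiration
kubeadm certs renew all

# Restart each static-pod component by stopping its container;
# kubelet recreates it from /etc/kubernetes/manifests
for c in kube-apiserver kube-controller-manager kube-scheduler etcd; do
  crictl ps --name "$c" -q | xargs -r crictl stop
done
```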
frigate:hlb:down
- Check frigate’s logs (journalctl -xfu frigate)
- If there’s an error in the logs, fix it.
- If not, check the latest sync request in frigate’s logs
- If the latest request is more than a few minutes old, check hlb’s logs and restart it if needed (see loadbalancer’s known issues)
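A sketch of the log checks; the `sync` grep pattern and the `hlb` systemd unit name are assumptions about this setup's log format and service naming:

```shell
# Follow frigate's logs live
journalctl -xfu frigate

# Timestamp of the last sync request frigate logged
journalctl -u frigate --no-pager | grep -i sync | tail -n 1

# If that entry is stale, inspect hlb and restart it if needed
journalctl -u hlb --no-pager | tail -n 50
systemctl restart hlb
```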
job:etcd-backup:failed
- Either an old failed backup job is lingering; just delete it (kubectl get jobs -n kube-system)
- Or the PVC does not have enough space:
  - Check the etcd backup pod's logs
  - Add disk to the data disk on the master
  - Increase the PVC size
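The checks above, as commands; list first to find the real job/pod/PVC names (the `<...>` placeholders are not actual names):

```shell
# Find and delete the stale failed job
kubectl -n kube-system get jobs
kubectl -n kube-system delete job <failed-etcd-backup-job>

# Check the backup pod's logs and the PVC's capacity
kubectl -n kube-system logs <etcd-backup-pod>
kubectl -n kube-system get pvc
```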
etcd:disk-space:low
- Increase the etcd quota using autohan
- Or, in an emergency, add --quota-backend-bytes=8589934592 (8 GiB) to the etcd manifest
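To check the DB size against the quota, and to clear the NOSPACE alarm etcd raises once the quota is hit; the cert paths follow kubeadm defaults, adjust if they differ here:

```shell
ETCD_FLAGS="--endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"

# DB size vs. quota
ETCDCTL_API=3 etcdctl $ETCD_FLAGS endpoint status -w table

# After raising --quota-backend-bytes in
# /etc/kubernetes/manifests/etcd.yaml and waiting for the pod restart,
# clear the NOSPACE alarm or etcd stays read-only:
ETCDCTL_API=3 etcdctl $ETCD_FLAGS alarm list
ETCDCTL_API=3 etcdctl $ETCD_FLAGS alarm disarm
```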
conntrack:limit:low
- Check conntrack table usage and raise the limit immediately
- Might be a false alarm
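A sketch of the check and the fix; the new limit value below is an example, not a recommendation, and note that kube-proxy may also manage `nf_conntrack_max` on kubelet nodes:

```shell
# Current usage vs. limit; a ratio near 1 means drops are imminent
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Raise the limit at runtime (example value); persist it in
# /etc/sysctl.d/ so it survives a reboot
sysctl -w net.netfilter.nf_conntrack_max=1048576
```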
volume:metrics:missing
- On Docker-runtime nodes, restart the kubelet
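The restart plus a verification that the volume stats are back, via kubelet's stats summary through the API server; `<node>` is a placeholder for the affected node's name:

```shell
# On the affected Docker-runtime node
systemctl restart kubelet
systemctl status kubelet --no-pager

# From anywhere with kubectl access: confirm volume stats reappear
kubectl get --raw "/api/v1/nodes/<node>/proxy/stats/summary" | head
```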