8 Azar
- Alert: Some c25 nodes are NotReady. They have been NotReady for a long time, so the alert has been silenced for 2 weeks.
- Alert: c28-db4 (Yektanet) alerts: after a disaster (disk loss), the data was restored from backups elsewhere. The node is now NotReady, so the alert is silenced.
- Alert: Due to Asiatech router issues, this service was intentionally stopped on firewalls of some clusters.
- Alert: Certificate n1.jita.security.hsre.ir on c12 has issues; the HTTP challenge is not working.
- Alert: It seems Torob reduced some resources, and the hlb-controller pod is Pending. Needs follow-up.
- Alert: kp1 has some node-gc pods Pending; needs follow-up to allocate resources.
- Alert: node-gc on c12 is Pending; follow up with the public cluster manager about resources.
- Alert: kubenurse on c18 is Pending; follow up with the public cluster manager about resources.
- Alert: Certificate vmauth-router-tls on c7/hamravesh-monitoring is reported as broken, but the report seems incorrect.
- Alert: kp1 has had a Traefik pod Pending for 15 days; needs follow-up or investigation.
- Alert: For unknown reasons (possibly a system bug), frigate metrics show the job as down.
- Alert: kp1 alert node:inode:low for node kp1-g3; needs investigation.
- Alert: New clusters idk-c8, hamravesh-c12, hamravesh-c38, khodro45-c54 have low fw-vector job; needs investigation.
- Alert: The etcd metric is probably not enabled for hamravesh-c4 and airtour-c3.
- Alert: The helper-meterics job has issues in new clusters (bitpin-c101, dropp-c107, hamravesh-c105, hamravesh-c9, yektanet-c97); needs investigation.
- Alert: Rawfile masters c3-m1, c2-m1, ks3-m1, c44-m1 have slightly less disk than required; need to add disk.
- Alert: The etcd backup PVC size on hectora-c22 cannot be increased; must be solved.
- Alert: New clusters have issues calculating control-plane certificate expiration; must be solved.
9 Azar
- Alert: Need to check Torob p resources. Some external-dns and hecant jobs become Pending on Torob.
10 Azar
- Alert: Ingress slx.central.hamds.ir resolves from c38, and the HTTP challenge fails on c85.
- Ticket created for Maratus.
- They said their zone is failover and they can’t do anything; we probably need a solution.
- Alert: IP 45.129.39.172 belongs to pirates-windmill-worker-2. We should make our GoW NACs more distinguishable.
22 Azar
- Alert: Torob c25 nodes are all tainted, so cm-acme-http-solver pods have nowhere to schedule.
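
Why tainted nodes leave the solver pods unschedulable can be checked with a small sketch of the standard Kubernetes taint/toleration matching rule. This is an illustration only, not the scheduler's code: the node names, the taint key `dedicated`, and the value `torob` are hypothetical, and the real taints on c25 are not recorded in this log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str          # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""        # empty key + operator "Exists" tolerates every taint
    operator: str = "Equal"
    value: str = ""
    effect: str = ""     # empty effect matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def schedulable_nodes(nodes: dict, tolerations: list) -> list:
    """List the nodes whose blocking taints are all tolerated."""
    blocking = {"NoSchedule", "NoExecute"}  # PreferNoSchedule is only a soft preference
    result = []
    for name, taints in nodes.items():
        if all(
            t.effect not in blocking or any(tolerates(tol, t) for tol in tolerations)
            for t in taints
        ):
            result.append(name)
    return result


# Hypothetical cluster state: every node carries the same NoSchedule taint.
nodes = {
    "c25-w1": [Taint("dedicated", "NoSchedule", "torob")],
    "c25-w2": [Taint("dedicated", "NoSchedule", "torob")],
}

no_tolerations = schedulable_nodes(nodes, [])
with_toleration = schedulable_nodes(
    nodes, [Toleration("dedicated", "Equal", "torob", "NoSchedule")]
)
```

With no tolerations the list is empty, which corresponds to the Pending solver pods; a matching toleration on the pod (or an untainted node) would make scheduling possible again.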