Alerts Log

8 Azar

  • Alert: Some c25 nodes are NotReady. Because they have stayed NotReady for a long time, the alert has been silenced for 2 weeks.
  • Alert: c28-db4 (Yektanet): after a disaster (disk loss), the data was restored from backups elsewhere. The node is now NotReady, so the alert is silenced.
  • Alert: Due to Asiatech router issues, this service was intentionally stopped on firewalls of some clusters.
  • Alert: Certificate n1.jita.security.hsre.ir on c12 has issues; the HTTP challenge is not working.
  • Alert: Torob appears to have reduced some resources, and the hlb-controller pod is Pending. Needs follow-up.
  • Alert: kp1 has some node-gc pods pending; needs follow-up to allocate resources.
  • Alert: node-gc on c12 is pending; follow up with public cluster manager about resources.
  • Alert: kubenurse on c18 is pending; follow up with public cluster manager about resources.
  • Alert: Certificate vmauth-router-tls on c7/hamravesh-monitoring is reported as broken, but the report appears to be incorrect.
  • Alert: kp1 has had a Traefik pod Pending for 15 days; needs follow-up or investigation.
  • Alert: frigate metrics show the job as down for an unknown reason (possibly a system bug).
  • Alert: kp1 alert node:inode:low for node kp1-g3; needs investigation.
  • Alert: New clusters idk-c8, hamravesh-c12, hamravesh-c38, khodro45-c54 have low fw-vector job; needs investigation.
  • Alert: etcd metrics are probably not enabled for hamravesh-c4 and airtour-c3.
  • Alert: helper-meterics job has issues in new clusters (bitpin-c101, dropp-c107, hamravesh-c105, hamravesh-c9, yektanet-c97); needs investigation.
  • Alert: Rawfile masters c3-m1, c2-m1, ks3-m1, c44-m1 have slightly less disk than required; need to add disk.
  • Alert: etcd backup PVC size on hectora-c22 cannot be increased; must be solved.
  • Alert: New clusters have issues calculating control-plane certificate expiration; must be solved.
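
For the control-plane certificate expiration item above, the underlying arithmetic is just the difference between the certificate's notAfter timestamp and the current time. The following is a minimal sketch of that calculation, not the check actually used on these clusters. The function name `days_until_expiry` and the assumption that notAfter is available as an ISO 8601 string are hypothetical.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after_iso, now=None):
    """Return the number of days (as a float) until a certificate expires.

    not_after_iso: the certificate's notAfter time as an ISO 8601 string
                   (assumed input format, e.g. "2025-01-11T00:00:00+00:00").
    now:           optional timezone-aware datetime, mainly for testing.
    A negative result means the certificate has already expired.
    """
    not_after = datetime.fromisoformat(not_after_iso)
    # Treat timestamps without an explicit offset as UTC (an assumption).
    if not_after.tzinfo is None:
        not_after = not_after.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (not_after - now).total_seconds() / 86400
```

An alert rule could then compare the returned value against a threshold, for example firing when fewer than 30 days remain.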

9 Azar

  • Alert: Need to check Torob p resources. Some external-dns and hecant jobs become Pending on Torob.

10 Azar

  • Alert: Ingress slx.central.hamds.ir resolves from c38 and HTTP challenge fails on c85.
    • Ticket created for Maratus.
    • They said their zone is in failover and they cannot do anything about it; we probably need to find a solution ourselves.
  • Alert: IP 45.129.39.172 belongs to pirates-windmill-worker-2. We should make our GoW NACs more distinguishable.

22 Azar

  • Alert: Torob c25 nodes are all tainted, so cm-acme-http-solver pods have nowhere to schedule.
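
The scheduling failure in the entry above follows from how Kubernetes matches node taints against pod tolerations: a pod can only land on a node if it tolerates every NoSchedule and NoExecute taint on it. The sketch below models that rule in plain Python for illustration only. The taint key and value (`dedicated=torob`) are hypothetical and not taken from c25.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str      # "NoSchedule", "PreferNoSchedule", or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key with "Exists" matches any taint key
    operator: str = "Equal"   # "Equal" or "Exists"
    value: str = ""
    effect: str = ""          # empty effect matches any taint effect


def tolerates(tol, taint):
    """Return True if a single toleration matches a single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    return False


def schedulable(node_taints, pod_tolerations):
    """A node is usable only if every hard taint is tolerated by the pod."""
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in hard)


# Hypothetical example: a tainted node and a solver pod without tolerations.
node = [Taint(key="dedicated", effect="NoSchedule", value="torob")]
solver_without_toleration = []
solver_with_toleration = [Toleration(key="dedicated", operator="Exists")]
```

With these values, `schedulable(node, solver_without_toleration)` is False and `schedulable(node, solver_with_toleration)` is True. When every node in a cluster is tainted and the solver pods carry no matching toleration, no node passes the check, which matches the symptom recorded above.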