Known issues

Node creation

Mci-kp1 and mci-ks3

  • When adding the node:

    • Pods get created with CRI-O's default CNI IP (10.85.…) instead of a Cilium IP

      • Solution: Delete the pods that have the wrong IPs, or reboot the node (see the sketch after this list)
    • The VLAN might need to be specified when creating the node

    • Possible CRI-O GPG error from apt

      • Solution: Comment out CRI-O's outdated source file in apt, or update CRI-O's GPG key
    • Some nodes may still have a leftover target in Boundary from before

      • Solution: Delete the Boundary target
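
A minimal sketch for the wrong-IP case (assumes kubectl and jq are available and the kubeconfig points at the affected cluster); it lists pods whose pod IP falls in CRI-O's default 10.85. range and deletes them so they are recreated with Cilium-assigned IPs:

kubectl get pods -A -o json \
  | jq -r '.items[] | select(.status.podIP // "" | startswith("10.85.")) | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read -r ns pod; do
      kubectl delete pod -n "$ns" "$pod"
    done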

Yektanet 172.30.128.0/24 nodes need an additional route

ssh -J root@c28-w1.yektanet.local root@172.30.128.x

ip r add default via 172.30.28.1 dev ens160 onlink

Hamravesh-c11 172.30.10.0/24 routes need to be handled manually

Add to /etc/netplan/90-custom-routes.yaml:

network:
  version: 2
  renderer: networkd
  ethernets:
    ens160:
      routes:
        - to: 172.30.10.0/24
          scope: link

Then run:

netplan apply
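
To confirm the route was installed (a quick check, assuming the config above was applied on the node):

ip route show 172.30.10.0/24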

Runner nodes (c11-gr)

The custom runner configs might need to be double-checked after adding the node:

  • max_pid in CRI-O
  • custom egress IP in frigate's pre-rules

Example sysctls:

user.max_inotify_watches = 4200000
user.max_ipc_namespaces = 385874
user.max_mnt_namespaces = 385874
user.max_net_namespaces = 385874
user.max_pid_namespaces = 385874
user.max_time_namespaces = 385874
user.max_user_namespaces = 385874
user.max_uts_namespaces = 385874
kernel.unprivileged_userns_clone = 1
user.max_cgroup_namespaces = 385874
net.nf_conntrack_max = 262144
net.netfilter.nf_conntrack_count = 6778
kernel.dmesg_restrict = 1
net.ipv4.ip_forward = 1
vm.max_map_count = 262144
net.netfilter.nf_conntrack_max = 7864200
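
A minimal sketch for applying such values persistently (the drop-in file name 99-runner-sysctls.conf is an assumption; include whichever of the values above apply):

cat >/etc/sysctl.d/99-runner-sysctls.conf <<'EOF'
user.max_user_namespaces = 385874
kernel.unprivileged_userns_clone = 1
vm.max_map_count = 262144
net.netfilter.nf_conntrack_max = 7864200
EOF
sysctl --system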

Increase disk

  • On nodes with thin provisioning the resize will fail, and the new space needs to be added manually if a reboot is not possible (see the sketch below)
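
A minimal sketch of growing the disk online (assumptions: the root disk is /dev/sda, the root filesystem is ext4 on partition 3, and growpart from cloud-guest-utils is installed):

# Ask the kernel to re-read the now-larger virtual disk
echo 1 > /sys/class/block/sda/device/rescan
# Grow the partition, then the filesystem on it
growpart /dev/sda 3
resize2fs /dev/sda3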

Pods stuck in Terminating due to a hanging exec process for the container

Most often seen on c38-n6.
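
One way to clear the hang (a sketch; assumes CRI-O with runc on the node and that the stuck process is an exec session):

# On the affected node, look for lingering exec sessions
ps -ef | grep '[r]unc exec'
# Kill the hanging process by its PID so kubelet can finish terminating the pod
kill -9 <PID>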


PVs not being deleted despite PVC’s delete reclaim policy

  • Might be caused by the kubelet's unstage directory change (in Kubernetes 1.24); the check below can help spot affected PVs
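
A quick check (assumes kubectl access) for PVs that were released but never removed:

kubectl get pv | grep Released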

Loadbalancer IPs

hlb

  • After creating a loadbalancer with a new IP, the new IP must be added manually to the fw's IPs (then run netplan apply)
  • hlb might get disconnected from the apiserver and need a restart (frigate:hlb:down alert); see the sketch after this list
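
A minimal sketch of the restart (assumes hlb runs as a Deployment named hlb; adjust the namespace to wherever it is deployed):

kubectl rollout restart deployment hlb -n <namespace>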

gw-controller

  • In case of an apiserver outage, the loadbalancer IPs might need to be checked manually
  • In some combinations of Cilium and kernel versions, loadbalancers using externalTrafficPolicy: Cluster might not work (the sketch after this list finds affected Services)
    • Escalate to hosseinmohammadi if needed
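
A sketch for listing the Services that would be affected (assumes kubectl and jq):

kubectl get svc -A -o json \
  | jq -r '.items[] | select(.spec.type == "LoadBalancer" and .spec.externalTrafficPolicy == "Cluster") | "\(.metadata.namespace)/\(.metadata.name)"'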

Can’t connect to a new cluster’s nodes

(Boundary keeps getting killed, or SSH sessions last only a few seconds)

Cause:

  • frigate’s constant netplan apply

Steps to fix:

  1. Decrease hlb's replicas to zero (see the sketch after this list)
  2. Comment out the netplan apply call in frigate's code (/root/frigate/main.py)
  3. Increase hlb's replicas back to one
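
A minimal sketch for steps 1 and 3 (assumes hlb is a Deployment; adjust the namespace):

kubectl scale deployment hlb -n <namespace> --replicas=0
# ... comment out netplan apply in /root/frigate/main.py ...
kubectl scale deployment hlb -n <namespace> --replicas=1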

Hecant

root@hecant.hamravesh-nac.local

IP banned

SSH to the hecant VM and run:

ip route replace default via 172.31.1.1 dev eth0 proto dhcp src 116.203.11.219 metric 100

A full disk results in deb packages not getting downloaded

Error: (18) transfer closed with xxxx bytes remaining to read

  • Prune container logs (see the sketch below)
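
A minimal sketch of pruning (assumes pod logs live under /var/log/pods and that truncating them is acceptable); it also vacuums the systemd journal:

# Truncate container log files written by the kubelet/CRI-O
find /var/log/pods -type f -name '*.log' -exec truncate -s 0 {} +
# Shrink the systemd journal
journalctl --vacuum-size=500M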

Wrong client IP in Traefik logs

  • Might be caused by a bug in frigate
  • Upgrade or downgrade frigate to a version that does not include the bug

Kube-Event-Exporter

The kube-event-exporter sometimes gets stuck while generating logs from events; when this happens, the pod's logs repeat the same entry.

Current resolution:

  • Roll out (restart) its deployment; see the sketch below
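
A minimal sketch (assumes the exporter runs as a Deployment named kube-event-exporter; adjust the name and namespace to match the cluster):

kubectl rollout restart deployment kube-event-exporter -n <namespace>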