Known issues

Node creation

Mci-kp1 and mci-ks3

  • When adding the node:

    • Pods get created with CRI-O's default CNI IP (10.85.…) instead of a Cilium IP

      • Solution: Delete the pods that have the wrong IPs, or reboot the node (see the sketch after this list)
    • The VLAN might need to be specified when creating the node

    • Possible CRI-O GPG error from apt

      • Solution: Comment out CRI-O's outdated source file in apt, or update CRI-O's GPG key
    • Some nodes may still have a leftover target in Boundary from before

      • Solution: Delete the Boundary target
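
A minimal sketch for the wrong-IP case (assumes kubectl and jq are available and the kubeconfig points at the affected cluster); it lists pods whose pod IP falls in CRI-O's default 10.85. range and deletes them so they are recreated with Cilium-assigned IPs:

kubectl get pods -A -o json \
  | jq -r '.items[] | select(.status.podIP // "" | startswith("10.85.")) | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read -r ns pod; do
      kubectl delete pod -n "$ns" "$pod"
    done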

Yektanet 172.30.128.0/24 nodes need an additional route

ssh -J root@c28-w1.yektanet.local root@172.30.128.x

ip r add default via 172.30.28.1 dev ens160 onlink

Hamravesh-c11 172.30.10.0/24 routes need to be handled manually

Add to /etc/netplan/90-custom-routes.yaml:

network:
  version: 2
  renderer: networkd
  ethernets:
    ens160:
      routes:
        - to: 172.30.10.0/24
          scope: link

Then run:

netplan apply
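
To confirm the route was installed (a quick check, assuming the config above was applied on the node):

ip route show 172.30.10.0/24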

Runner nodes (c11-gr)

The custom runner configs might need to be double-checked after adding the node:

  • max_pid in CRI-O
  • custom egress IP in frigate's pre-rules

Example sysctls:

user.max_inotify_watches = 4200000
user.max_ipc_namespaces = 385874
user.max_mnt_namespaces = 385874
user.max_net_namespaces = 385874
user.max_pid_namespaces = 385874
user.max_time_namespaces = 385874
user.max_user_namespaces = 385874
user.max_uts_namespaces = 385874
kernel.unprivileged_userns_clone = 1
user.max_cgroup_namespaces = 385874
net.nf_conntrack_max = 262144
net.netfilter.nf_conntrack_count = 6778
kernel.dmesg_restrict = 1
net.ipv4.ip_forward = 1
vm.max_map_count = 262144
net.netfilter.nf_conntrack_max = 7864200
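
A minimal sketch for applying such values persistently (the drop-in file name 99-runner-sysctls.conf is an assumption; include whichever of the values above apply):

cat >/etc/sysctl.d/99-runner-sysctls.conf <<'EOF'
user.max_user_namespaces = 385874
kernel.unprivileged_userns_clone = 1
vm.max_map_count = 262144
net.netfilter.nf_conntrack_max = 7864200
EOF
sysctl --system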

Increase disk

  • On nodes with thin provisioning the resize will fail, and the new space needs to be added manually if a reboot is not possible (see the sketch below)
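
A minimal sketch of growing the disk online (assumptions: the root disk is /dev/sda, the root filesystem is ext4 on partition 3, and growpart from cloud-guest-utils is installed):

# Ask the kernel to re-read the now-larger virtual disk
echo 1 > /sys/class/block/sda/device/rescan
# Grow the partition, then the filesystem on it
growpart /dev/sda 3
resize2fs /dev/sda3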

Pods stuck in Terminating due to a hanging exec process for the container

Most often seen on c38-n6.
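
One way to clear the hang (a sketch; assumes CRI-O with runc on the node and that the stuck process is an exec session):

# On the affected node, look for lingering exec sessions
ps -ef | grep '[r]unc exec'
# Kill the hanging process by its PID so kubelet can finish terminating the pod
kill -9 <PID>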


PVs not being deleted despite PVC’s delete reclaim policy

  • Might be caused by the kubelet's unstage directory change (in Kubernetes 1.24); the check below can help spot affected PVs
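
A quick check (assumes kubectl access) for PVs that were released but never removed:

kubectl get pv | grep Released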

Loadbalancer IPs

hlb

  • After creating a loadbalancer with a new IP, the new IP must be added manually to the fw's IPs (then run netplan apply)
  • hlb might get disconnected from the apiserver and need a restart (frigate:hlb:down alert); see the sketch after this list
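
A minimal sketch of the restart (assumes hlb runs as a Deployment named hlb; adjust the namespace to wherever it is deployed):

kubectl rollout restart deployment hlb -n <namespace>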

gw-controller

  • In case of an apiserver outage, the loadbalancer IPs might need to be checked manually
  • In some combinations of Cilium and kernel versions, loadbalancers using externalTrafficPolicy: Cluster might not work (the sketch after this list finds affected Services)
    • Escalate to hosseinmohammadi if needed
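
A sketch for listing the Services that would be affected (assumes kubectl and jq):

kubectl get svc -A -o json \
  | jq -r '.items[] | select(.spec.type == "LoadBalancer" and .spec.externalTrafficPolicy == "Cluster") | "\(.metadata.namespace)/\(.metadata.name)"'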

Can’t connect to a new cluster’s nodes

(Boundary keeps getting killed, or SSH sessions last only a few seconds)

Cause:

  • frigate’s constant netplan apply

Steps to fix:

  1. Decrease hlb's replicas to zero (see the sketch after this list)
  2. Comment out the netplan apply call in frigate's code (/root/frigate/main.py)
  3. Increase hlb's replicas back to one
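
A minimal sketch for steps 1 and 3 (assumes hlb is a Deployment; adjust the namespace):

kubectl scale deployment hlb -n <namespace> --replicas=0
# ... comment out netplan apply in /root/frigate/main.py ...
kubectl scale deployment hlb -n <namespace> --replicas=1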

Hecant

root@hecant.hamravesh-nac.local

IP banned

SSH to the hecant VM and run:

ip route replace default via 172.31.1.1 dev eth0 proto dhcp src 116.203.11.219 metric 100

A full disk results in deb packages not getting downloaded

Error: (18) transfer closed with xxxx bytes remaining to read

  • Prune container logs (see the sketch below)
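
A minimal sketch of pruning (assumes pod logs live under /var/log/pods and that truncating them is acceptable); it also vacuums the systemd journal:

# Truncate container log files written by the kubelet/CRI-O
find /var/log/pods -type f -name '*.log' -exec truncate -s 0 {} +
# Shrink the systemd journal
journalctl --vacuum-size=500M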

Wrong client IP in Traefik logs

  • Might be caused by a bug in frigate
  • Upgrade or downgrade frigate to a version that does not include the bug

Kube-Event-Exporter

The kube-event-exporter sometimes gets stuck while generating logs from events; when this happens, the pod's logs repeat the same entry.

Current resolution:

  • Roll out (restart) its deployment; see the sketch below
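
A minimal sketch (assumes the exporter runs as a Deployment named kube-event-exporter; adjust the name and namespace to match the cluster):

kubectl rollout restart deployment kube-event-exporter -n <namespace>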