Known issues
Node creation
Mci-kp1 and mci-ks3
- If node is g add
- Pods getting created using crio's CNI IP instead of Cilium's (10.85.…)
  - Solution: Delete the pods with wrong IPs or reboot the node
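A minimal sketch for spotting the affected pods, assuming the wrong addresses all fall in crio's default 10.85.0.0/16 range (pod and namespace names are placeholders):

    # List pods that got an IP from crio's bridge CNI instead of Cilium
    kubectl get pods -A -o wide | grep ' 10\.85\.'

    # Delete an affected pod so it is recreated with a Cilium-managed IP
    kubectl delete pod <pod-name> -n <namespace>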
- Might need to specify the VLAN when creating the node
- Possible crio GPG error
  - Solution: Comment out crio's outdated source file in apt, or update crio's GPG key
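One hedged way to silence the GPG error until the repo is fixed, assuming the outdated source sits in a file like /etc/apt/sources.list.d/cri-o.list (the exact filename varies per node; check the directory first):

    # Comment out the stale crio repo line, then refresh apt
    sed -i 's/^deb /# deb /' /etc/apt/sources.list.d/cri-o.list
    apt-get update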
- Some nodes may still have a boundary target left over from before
  - Solution: Delete the boundary target
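A sketch using the Boundary CLI, assuming the stale target's ID still needs to be looked up (scope and target IDs are placeholders):

    # Find the leftover target for the node
    boundary targets list -scope-id=<project-scope-id>

    # Delete it by ID
    boundary targets delete -id=<target-id>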
- Yektanet 172.30.128.0/24 nodes need an additional route:

    ssh -J root@c28-w1.yektanet.local root@172.30.128.x
    ip r add default via 172.30.28.1 dev ens160 onlink
Hamravesh-c11 172.30.10.0/24 nodes need their routes handled manually.
Add to /etc/netplan/90-custom-routes.yaml:

    network:
      version: 2
      renderer: networkd
      ethernets:
        ens160:
          routes:
            - to: 172.30.10.0/24
              scope: link

Then run:

    netplan apply
Runner nodes (c11-gr)
Might need to double-check custom runner configs after adding the node:
- max_pid in crio (see the sketch below)
- custom egress IP in frigate's pre-rules
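For the crio pid limit, a sketch of the kind of drop-in to look for; the path, file name, and value here are assumptions, not the runners' actual config:

    # /etc/crio/crio.conf.d/10-pids-limit.conf  (hypothetical drop-in name)
    [crio.runtime]
    pids_limit = 16384

Then restart crio (systemctl restart crio) for the limit to take effect.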
Example sysctls:

    user.max_inotify_watches = 4200000
    user.max_ipc_namespaces = 385874
    user.max_mnt_namespaces = 385874
    user.max_net_namespaces = 385874
    user.max_pid_namespaces = 385874
    user.max_time_namespaces = 385874
    user.max_user_namespaces = 385874
    user.max_uts_namespaces = 385874
    kernel.unprivileged_userns_clone = 1
    user.max_cgroup_namespaces = 385874
    net.nf_conntrack_max = 262144
    net.netfilter.nf_conntrack_count = 6778
    kernel.dmesg_restrict = 1
    net.ipv4.ip_forward = 1
    vm.max_map_count = 262144
    net.netfilter.nf_conntrack_max = 7864200
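To apply and persist values like these, one common pattern (the drop-in file name is an assumption):

    # One-off change
    sysctl -w user.max_user_namespaces=385874

    # Persist: put the values in /etc/sysctl.d/99-runner.conf, then reload
    sysctl --system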
Increase disk
- Nodes with thin provisioning will fail, and the extra space needs to be added manually (if a reboot is not possible)
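If the virtual disk was grown but the node can't be rebooted, a rough sketch of picking up the new size online; device, partition, and filesystem names are assumptions and must be adjusted per node:

    # Tell the kernel the disk size changed
    echo 1 > /sys/class/block/sda/device/rescan

    # Grow the partition, then the filesystem (ext4 assumed)
    growpart /dev/sda 1
    resize2fs /dev/sda1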
Pods stuck in Terminating due to a hanging exec process for the container
Most often seen on c38-n6.
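A rough way to hunt for the hanging exec process on the node and kill it so the pod can finish terminating (this assumes the stuck process shows up as a runc exec; the PID is a placeholder):

    # Long-running exec sessions spawned by the runtime
    ps -eo pid,etime,args | grep 'runc exec' | grep -v grep

    # Kill the hanging one; the pod should then leave Terminating
    kill -9 <pid>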
PVs not being deleted despite PVC’s delete reclaim policy
- Might be caused by kubelet’s unstage directory change (in k8s version 1.24)
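When a PV has to be cleaned up by hand, a common workaround (generic Kubernetes practice, not specific to this cluster) is to drop its finalizers so deletion can proceed; make sure the backing volume is actually gone first:

    # Find stuck PVs
    kubectl get pv

    # Remove the finalizers from a stuck PV (name is a placeholder)
    kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'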
Loadbalancer ips
hlb
- After creating a loadbalancer with a new IP, the new IP must be manually added to the fw's IPs (netplan apply); a sketch follows below
- Might get disconnected from the apiserver and need to be restarted (frigate:hlb:downalert)
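A sketch of what adding the new IP on the fw could look like; the netplan file, interface name, and prefix length are assumptions about the fw's config:

    # /etc/netplan/<fw-config>.yaml  (hypothetical file name)
    network:
      version: 2
      ethernets:
        ens160:
          addresses:
            - <existing-ips>
            - <new-loadbalancer-ip>/32

Then run netplan apply on the fw.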
gw-controller
- In case of an apiserver outage, might need to manually check loadbalancer IPs
- In some combinations of Cilium and kernel versions, loadbalancers using externalTrafficPolicy: Cluster might not work
  - Escalate to hosseinmohammadi if needed
Can’t connect to a new cluster’s nodes
(boundary keeps getting killed, or ssh sessions only last a few seconds)
Cause:
- frigate's constant netplan apply
Steps to fix (a kubectl sketch follows the list):
- Decrease hlb's replicas to zero
- Comment out netplan apply in frigate's code (/root/frigate/main.py)
- Increase hlb's replicas back to one
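A sketch of the scale steps, assuming hlb runs as a deployment (namespace and resource names are assumptions):

    kubectl -n <hlb-namespace> scale deployment hlb --replicas=0
    # ... comment out netplan apply in /root/frigate/main.py ...
    kubectl -n <hlb-namespace> scale deployment hlb --replicas=1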
Hecant
root@hecant.hamravesh-nac.local
IP banned
SSH to the hecant VM and run:

    ip route replace default via 172.31.1.1 dev eth0 proto dhcp src 116.203.11.219 metric 100
A full disk results in deb packages not getting downloaded
Error: (18) transfer closed with xxxx bytes remaining to read
- Prune container logs
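One way to find and reclaim the space taken by container logs; the paths assume the default kubelet/crio log locations:

    # See what is eating the disk
    du -sh /var/log/pods /var/lib/containers 2>/dev/null

    # Truncate pod log files in place instead of deleting them
    find /var/log/pods -name '*.log' -exec truncate -s 0 {} +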
Wrong client ip in traefik logs
- Might be caused by a bug in frigate
- Need to upgrade or downgrade to a version that doesn’t include the bug
Kube-Event-Exporter
The kube-event-exporter sometimes gets stuck while generating logs from events—when this happens, the pod’s logs repeat the same entry.
Current resolution:
- Roll out its deployment
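A sketch of the rollout, assuming kube-event-exporter runs as a deployment (namespace and deployment name are assumptions):

    kubectl -n <monitoring-namespace> rollout restart deployment kube-event-exporter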