Clusters
airtour-c3
- On airtour’s own server(s)
- Has its own vcenter
https://172.30.64.100 (accessible via openvpn tunnel)
- Airtour’s vpn might randomly disconnect and need to be reconnected
- Uses private ips as ingress controller ips
- We don’t handle airtour’s public ips
- Has an haproxy that handles public ips and routes them to our haproxy or frigate
- Wildcard certificate might be handled using their haproxy and not using cert-manager (tls is also terminated there)
- Must use their dns resolvers in frigate
- Cluster’s apiserver is whitelisted
- Exposing any ip needs to be coordinated with them
- Has custom ip route rules to access other services in other vlans
- Has a custom deployed haproxy to handle traefik’s ips
Known issues
- Random node:NotReady alerts on c3-p8
alphatech-c60
- No customizations yet
- On our servers in Asiatech
- Uses gw-controller instead of hlb
asanbar-c41
- No customizations yet
- On our servers in Parsonline
- Has at least one vm (IaaS)
bitpin-c70
- On our servers in Asiatech-Miremad
- Bitpin’s production cluster
- Bitpin uses their own sso (managed by Samurai) to connect to the cluster
- Uses managed (by us) ingress-nginx as its ingress controller
- Has two ingress-nginx deployments for different use cases
- Some of their domains are whitelisted using ingress-nginx’s annotations
- Has dedicated servers though not specified in contract
- Has custom vlans configured
- Has a dedicated vcenter for servers
- Some of the managed postgresql databases are based on vms in bitpin's servers/vcenter
- Has a few reserved and static nodes (c70-d?)
- Has custom rules to connect to bitpin-c71 using private ips
- Uses hlb’s private ip feature (like a lot), managed by us
- Might request exposing new databases to bitpin-c71 using a private ip loadbalancer
Known issues
- ingress-nginx logs don’t get rotated
- Rotation isn’t enabled in loghandler
- Manually rotate if needed
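If a manual rotation is needed, a minimal sketch on the affected node (the /var/log/pods path and the container name are assumptions, not verified against this cluster):
# find the largest ingress-nginx log directories on the node
du -sh /var/log/pods/*ingress-nginx*/*/ | sort -rh | head
# truncate in place so the running controller keeps its open file handle
truncate -s 0 /var/log/pods/<namespace>_<pod>_<uid>/controller/0.log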
bitpin-c71
Same as above.
- Bitpin’s tools cluster
- Bitpin’s gitlab has custom iptables rules in frigate to limit access to port 22
Known issues
- Random job:down alert for c71-gr1’s node-exporter
- ingress-nginx logs don’t get rotated
- Rotation isn’t enabled in loghandler
- Manually rotate if needed
cafebazaar-c55
- On our servers in Asiatech
- Cafebazaar’s tools cluster
- Cafebazaar uses their own sso (managed by Samurai) to connect to the cluster
- Has re-deployed monitoring stack and other components
- Has a custom kubelet config to allow network related sysctls
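A minimal sketch of the kind of KubeletConfiguration change this implies (the exact sysctl patterns are an assumption; check the cluster’s real kubelet config):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# lets pods request net.* sysctls through securityContext.sysctls
allowedUnsafeSysctls:
  - "net.*"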
chapterone-c2
- On their servers in Asiatech-Base
- Servers are managed (by phoenix)
- Chapterone uses their own sso to connect to the cluster
- Has a history of automatically blocking ips in the middle of the night while performing maintenance or updates
daal-c51
- On our servers in Asiatech
- Has dedicated servers (G10 servers with enterprise SSD)
- Servers are billed instead of cluster resources
- All nodes must be created on their own servers
- Adding a new server must be communicated with them
- New server must match the previous spec
hamravesh-c11
- On our servers in Asiatech
- Hamravesh’s public cluster
- Support team may request new nodes to use as a reserve and uncordon if needed
- Has custom iptables rules in frigate
- Gitlab Runners are deployed here (gr nodes)
- c11-gr nodes use a separate egress ip (configured using frigate rules)
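Illustrative only: the kind of SNAT rule this implies (the real rules live in frigate’s config; the ips and interface below are placeholders):
iptables -t nat -A POSTROUTING -s <c11-gr-node-ip>/32 -o <wan-interface> -j SNAT --to-source <gr-egress-ip>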
hamravesh-c12
- On our servers in Asiatech
- Hamravesh’s semi-public cluster
- Hamravesh’s SaaS apps reside here (yektanet’s artifactory, etc)
- Might have custom iptables rules in frigate
- Does not have organized and specific nodepools
- c12-n5 data stores are complicated :))) there are many
hamravesh-c13
- On our servers in Parsonline
- Hamravesh’s public cluster
- Support team may request new nodes to use as a reserve and uncordon if needed
- Has custom iptables rules in frigate
- Has a custom config in both the apiserver and cilium to support an increased nodeport count
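A hedged way to see both halves of that config (the kubeadm-style manifest path and the cilium-config configmap name are assumptions):
grep -- '--service-node-port-range' /etc/kubernetes/manifests/kube-apiserver.yaml   # on a c13 master
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i node-port      # cilium side of the range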
hamravesh-c17
- On our servers in Parsonline
- Hamravesh’s storage and backup cluster
- Mostly uses HDD disks
- Might use servers with custom specs
- Max pods is raised to 200 on backup nodes
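A quick hedged check that the raised limit is actually in effect (the node name is illustrative):
kubectl get node c17-b1 -o jsonpath='{.status.capacity.pods}{"\n"}'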
Known issues
- Outage in c17-b nodes at night or when taking a lot of backups
- Random pod:control-plane:down or apiserver:response-time:high alerts caused by high disk latency in c17-m1, which affects etcd and raises control-plane response times
  - Usually resolves on its own in a few minutes
  - Check it if it doesn’t resolve on its own (see the check below)
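A hedged way to confirm the disk-latency theory before escalating (assumes ssh access to c17-m1; the fsync metric name is the standard upstream etcd one):
ssh c17-m1 'iostat -x 5 3'   # watch await/%util on the disk backing etcd
# in prometheus/grafana:
# histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))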
hamravesh-c18
- On our servers in Parsonline
- Hamravesh’s main services cluster
- Hamravesh’s artifactory resides here, so beware of potential cyclic dependencies during artifactory outages or planned maintenance
- Has custom iptables rules in frigate
Known issues
- After rebooting c18-s8, cri-o doesn’t start automatically and might need to be manually restarted: systemctl restart crio
hamravesh-c23
- On our servers in Hetzner
- Hamravesh’s public cluster
- Has custom iptables rules in frigate
- No dedicated loadbalancer available in this cluster
Known issues
- Traefik’s ip might get filtered
  - Check: https://c23.uptime.hamdc.ir via https://check-host.net/ in HTTP mode
Steps to add new ip
- Pick an ip from ipam.hamravesh.ir (current c23 range)
- Add iptables rules to frigate’s post-rules
- Check frigate logs
- Add the new ip to the netplan config manually and run netplan apply
- Change the c23.hamravesh.hamserver.ir and c23.hamravesh.onhamravesh.ir records to the new ip
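For the netplan step above, a minimal sketch (the netplan file name is an assumption; the real config may differ):
# append the new ip to the addresses list in e.g. /etc/netplan/50-cloud-init.yaml, then:
netplan try     # validates the change and rolls back automatically if connectivity is lost
netplan apply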
Curl to update hamserver record:
curl --request PATCH --url 'https://api.cloudflare.com/client/v4/zones/282df448e83cb9fb689c98f00efc09b1/dns_records/62a392e94dd606ae371d902cb63980e3?type=A' --header 'Authorization: TOKEN' --header 'Content-Type: application/json' --data '{"content": "x.x.x.x"}'
Curl to update onhamravesh record:
curl --request PATCH --url 'https://api.cloudflare.com/client/v4/zones/bafa98ec1702525b8b92eaf383b1ed65/dns_records/057d1855525837459f694f4102b9a65b?type=A' --header 'Authorization: TOKEN' --header 'Content-Type: application/json' --data '{ "content": "157.90.43.55"}'
Sometimes, we pre-assign an IP address, and you just need to apply it to the DNS records.
If a new ip is already added, just change the dns records to the new ip.
hamravesh-c27
- On our servers in Parsonline
- Camelus should determine the server to create the requested node on!
- Hamravesh’s semi-public and storage cluster
- Only customer is Torob
- Mostly has its own servers with custom disk
- 8 * 1 TB SSD disks
hamravesh-c38
- On our servers in Asiatech
- Hamravesh’s observability cluster
Known issues
- Pods stuck in Terminating due to a hanging exec process for the container (most often seen on c38-n6)
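A hedged sketch of clearing it on the node (pod/container names are placeholders; verify what the hanging process actually is before killing anything):
kubectl get pods -A | grep Terminating        # find the stuck pod
# on the affected node (e.g. c38-n6):
crictl ps -a | grep <pod-name>                # get the container id
ps aux | grep <container-id>                  # look for the hanging exec/conmon process
kill -9 <pid>                                 # kubelet can then finish terminating the pod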
hamravesh-c4
- On our servers in Fanava
- Hamravesh’s old cluster
- Is going to be deprecated and isn’t actively maintained
- Uses kubernetes version 1.21
hamravesh-c42
- On our servers in Asiatech
- Mostly uses HDD disks
- Might use servers with custom specs
- Max pods is raised to 200 on backup nodes
Known issues
- Outage in c42-b nodes at night or when taking a lot of backups
- Random pod:control-plane:down or apiserver:response-time:high alerts caused by high disk latency in c42-m1, which affects etcd and raises control-plane response times
  - Usually resolves on its own in a few minutes
  - Check it if it doesn’t resolve on its own
hamravesh-c5
- On our servers in Beeline/Armenia
- Hamravesh’s public cluster
- Is very limited and isn’t used much
- Servers are located in hectora’s vcenter
hamravesh-c7
- On our servers in Zirsakht
- Mainly used as an observability cluster
hamravesh-c78
- On our servers in Parsonline
- Hamravesh’s security cluster
hamravesh-c85
- On our servers in Parsonline
- Hamravesh’s observability cluster
hectora-c22
- The LAN interface of each node on servers other than hectora needs to be changed to VLAN-4002 in the middle of node creation (check c22-d8’s interfaces)
hectora-c26
(Not filled in yet.)
idk-c8
(Not filled in yet.)
karafsapp-c63
(Not filled in yet.)
khodro45-c54
(Not filled in yet.)
mci-kp1
For changing the ntp server of mci nodes:
for node in $(kubectl get nodes -o wide | awk 'NR>1 {print $1}'); do
  echo "===== $node ====="
  # replace <ntp-server-ips> with the space-separated ntp ips you want
  ssh root@"$node".mci.local "sed -i 's/^NTP=.*/NTP=<ntp-server-ips>/' /etc/systemd/timesyncd.conf && systemctl daemon-reload && systemctl restart systemd-timesyncd.service"
done
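To verify the change on a node afterwards (the node name is a placeholder; requires systemd-timesyncd):
ssh root@<node>.mci.local 'timedatectl timesync-status | head'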
mci-ks3
MCI Staging
ks3-m1’s physical location currently has a lot of high-latency disk io issues, which frequently cause control-plane outages during both day and night.
melligold-c66
- On our servers in Asiatech
- Uses gw-controller instead of hlb
mixin-c20
(Not filled in yet.)
nikandishan-c73
(Not filled in yet.)
nobitex-c75 / c76 / c81 / c84 / c86
- Nobitex uses their own sso to connect to the cluster
novinhub-c72
(Not filled in yet.)
raastin-c16
- Ask Phoenix where the enterprise SSDs are.
- On our servers in Asiatech
- Preferably use enterprise ssd disks for c16-d nodes, but currently d nodes can also be created on other servers.
royanegar-c43
- No customizations yet
- On our servers in Asiatech
saraf-c39
- It does not contain dedicated servers, so the directory in the vcenter is useless.
- Uses gw-controller instead of hlb
saraf-c40
- Uses gw-controller instead of hlb
sindad-c15
- On our servers in Fanava
- Hamravesh’s public cluster
- Sindad is the hardware provider
torob-c25
- On torob’s servers in Parsonline
- Torob’s production cluster
- Torob uses their own sso to connect to the cluster
- Has its own vcenter
- Torob uses their own ingress-controller
- Cluster’s apiserver is whitelisted
- Hardcore service-mesh users (istio is deployed and managed by us)
- Node c25-requests-clickhouse0 has a pid limit of 16384 in both the kubelet and CRI-O configs (see the check after this list).
- On the same network as mixin-c20 (public)
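A hedged check of the applied limits on that node (the config paths are the usual defaults, not verified here):
ssh c25-requests-clickhouse0 'grep -i podPidsLimit /var/lib/kubelet/config.yaml'
ssh c25-requests-clickhouse0 'grep -ri pids_limit /etc/crio/crio.conf /etc/crio/crio.conf.d/ 2>/dev/null'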
torob-c34
- On torob’s servers in Hetzner
- Torob’s international/foreign brand
- Torob uses their own sso to connect to the cluster
- Has its own vcenter (separate from torob-c25’s)
- Torob uses their own ingress-controller
- Hardcore service-mesh users (istio is deployed and managed by us)
triboonnet-c61
- No customizations yet
- On our servers in Asiatech
updamus-probers
- Uses k3s instead of k8s
- Not managed by us
- Managed by maratus
- Accessed only with admin token (not connected to oidc)
yektanet-c28
- On our servers in Asiatech
- Yektanet’s production cluster
- Yektanet uses their own sso to connect to the cluster
- Uses ingress-nginx (managed by them) as its ingress-controller
- Has custom haproxy gateways for inbound traffic
- Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
- All nodes must use pcores
- Some nodepools have reserved and static cpus: c28-n, c28-w, c28-db (see the kubelet config sketch after this list)
- Custom metrics and ingress-nginx metrics are in the yektanet-c38-prod datasource in grafana
- Uses host-id and rack-id labels
- Can tolerate downtime of one c28-n or c28-s node, or a few c28-w nodes
- All c28-db nodes must be created on enterprise ssd disks and on separate servers
- Has two ip ranges: 172.30.28.0/24 and 172.30.128.0/24
- Has higher system-reserved and kube-reserved values in its kubelet config (see the sketch after this list)
- Might have custom iptables rules in frigate
- Might have custom hardcoded dns records in frigate
- Uses hlb’s private loadbalancer ip
- Uptime is very important
- c28-helper has a separate public ip
- All four of yektanet’s asiatech clusters can be accessed from each other using private ips
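A hedged sketch of the kind of KubeletConfiguration the static-cpu nodepools and raised reservations imply (all values below are placeholders, not c28’s real numbers):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static   # pins Guaranteed pods with integer cpu requests to exclusive cores
systemReserved:
  cpu: "1"                 # placeholder; c28 reserves more than the defaults
  memory: 2Gi
kubeReserved:
  cpu: "1"
  memory: 2Gi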
Known issues
- Pods not getting created at all during or after a maintenance might be caused by Yektanet’s own mutating or validating webhooks
- Performance gets visibly affected by ilo’s power settings when in power-saving mode (must be set to high performance)
- Connectivity issues when connecting to private loadbalancer ips:
  - In c28-helper: nc -v 172.30.128.65 32489 -> connection failed
  - Escalate to h.marvi
- Pods getting stuck in ContainerCreating with cilium logging that the identity map is full
  - Caused by an increase in ciliumidentities due to a large number of pods getting created and deleted
  - Find the offending jobs/pods and communicate with yektanet (see the one-liner after this list)
  - Escalate to h.marvi or semekh if needed
- coredns pods getting recreated might cause outage in haproxy gateways
- haproxy pods need to be recreated to fix it
- Pods using the hsds-btrfs or hsds-rawfile storage class getting stuck in ContainerCreating
  - Usually in the dev namespace
  - Might see multi-attach is not possible errors in describe pod
  - Escalate to h.marvi
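A hedged one-liner to find where the ciliumidentity growth is coming from (the label key is the standard cilium one; verify against the objects in this cluster):
kubectl get ciliumidentities -o json | jq -r '.items[]."security-labels"."k8s:io.kubernetes.pod.namespace"' | sort | uniq -c | sort -rn | head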
yektanet-c29
- On our servers in Asiatech
- Yektanet’s staging cluster
- Yektanet uses their own sso to connect to the cluster
- Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
- Some nodepools have reserved and static cpus: c29-db
- Max pods is raised to 200
- Pods not getting created at all during or after a maintenance might be caused by Yektanet’s own mutating or validating webhooks
- All four of yektanet’s asiatech clusters can be accessed from each other using private ips
- Has custom haproxy gateways for inbound traffic
yektanet-c31
- On our servers in Asiatech
- Yektanet’s tools cluster
- Yektanet uses their own sso to connect to the cluster
- Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
- Some nodepools have reserved and static cpus: c31-db
- Might have custom iptables rules in frigate
- Might have custom hardcoded dns records in frigate
- Uses hlb’s private loadbalancer ip
- All four of yektanet’s asiatech clusters can be accessed from each other using private ips
- Has custom haproxy gateways for inbound traffic
yektanet-c32 (deleted)
- Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
- Yektanet uses their own sso to connect to the cluster
- All four of yektanet’s asiatech clusters can be accessed from each other using private ips
- Has custom haproxy gateways for inbound traffic
yektanet-c44
- Yektanet uses their own sso to connect to the cluster
yektanet-c53
- Yektanet uses their own sso to connect to the cluster
Joining a new node
- Log in through console.hetzner.com using the hectora credentials for the yektanet project (do not use mehran credentials)
- Don’t use an external volume for vm creation (it’s over the network and somewhat slow)
- Don’t allocate any public ip for the vm
- Select db placement group if it was a db node
- Provide a cloud-init; you can find one at /var/lib/cloud/instance/user-data.txt on another node.
Example cloud-init:
#cloud-config
runcmd:
  - |
    cat <<EOF > /etc/systemd/network/10-enp7s0.network
    [Match]
    Name=enp7s0
    [Network]
    DHCP=yes
    Gateway=172.30.53.1
    EOF
  - systemctl restart systemd-networkd
- Add Semekh and Setup keys
- Kubernetes sysctls (podolica does it automatically for c53 now!)
#!/usr/bin/env bash
set -e
modprobe br_netfilter
echo "br_netfilter" | tee /etc/modules-load.d/k8s.conf > /dev/null
tee /etc/sysctl.d/k8s.conf > /dev/null <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system
zarinpal-c68
- On their servers in Asiatech
- Servers are managed (by phoenix)
- Has a different ip range (172.16.67.0/24)
- The server might get randomly disconnected in vcenter (contact phoenix if needed)
- Has a history of hardware problems in the server that cause outage
- Uses gw-controller instead of hlb
Known issues
- If c68-fw1 gets rebooted, you need to manually run:
iptables-restore < /var/lib/frigate/pre-rules
- In case of server/nodes getting rebooted, loadbalancer ips might need to be manually checked
- Escalate to hosseinmohammadi if needed