Clusters
airtour-c3
- On airtour’s own server(s)
- Has its own vcenter
https://172.30.64.100 (accessible via openvpn tunnel)
- Airtour’s vpn might randomly disconnect and need to be reconnected
- Uses private ips as ingress controller ips
- We don’t handle airtour’s public ips
- Has an haproxy that handles public ips and routes them to our haproxy or frigate
- Wildcard certificate might be handled using their haproxy and not using cert-manager (tls is also terminated there)
- Must use their dns resolvers in frigate
- Cluster’s apiserver is whitelisted
- Exposing any ip needs to be coordinated with them
- Has custom ip route rules to access other services in other vlans
- Has a custom deployed haproxy to handle traefik’s ips
Known issues
- Random node:NotReady alerts on c3-p8
alphatech-c60
- No customizations yet
- On our servers in Asiatech
- Uses gw-controller instead of hlb
asanbar-c41
- No customizations yet
- On our servers in Parsonline
- Has at least one vm (IaaS)
bitpin-c70
- On our servers in Asiatech-Miremad
- Bitpin’s production cluster
- Bitpin uses their own sso (managed by Samurai) to connect to the cluster
- Uses managed (by us) ingress-nginx as its ingress controller
- Has two ingress-nginx deployments for different use cases
- Some of their domains are whitelisted using ingress-nginx’s annotations
- Has dedicated servers though not specified in contract
- Has custom vlans configured
- Has a dedicated vcenter for servers
- Some of the managed postgresql databases are based on vms in bitpin's servers/vcenter
- Has a few reserved and static nodes (c70-d?)
- Has custom rules to connect to bitpin-c71 using private ips
- Uses hlb’s private ip feature (like a lot), managed by us
- Might request exposing new databases to bitpin-c71 using a private ip loadbalancer
Known issues
- ingress-nginx logs don’t get rotated
- Rotation isn’t enabled in loghandler
- Manually rotate if needed
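If a manual rotation is needed, a minimal sketch on the affected node (the /var/log/pods path and the container name are assumptions, not verified against this cluster):
# find the largest ingress-nginx log directories on the node
du -sh /var/log/pods/*ingress-nginx*/*/ | sort -rh | head
# truncate in place so the running controller keeps its open file handle
truncate -s 0 /var/log/pods/<namespace>_<pod>_<uid>/controller/0.log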
bitpin-c71
Same as above.
- Bitpin’s tools cluster
- Bitpin’s gitlab has custom iptables rules in frigate to limit access to port 22
Known issues
- Random job:down alert for c71-gr1’s node-exporter
- ingress-nginx logs don’t get rotated
- Rotation isn’t enabled in loghandler
- Manually rotate if needed
cafebazaar-c55
- On our servers in Asiatech
- Cafebazaar’s tools cluster
- Cafebazaar uses their own sso (managed by Samurai) to connect to the cluster
- Has re-deployed monitoring stack and other components
- Has a custom kubelet config to allow network related sysctls
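A minimal sketch of the kind of KubeletConfiguration change this implies (the exact sysctl patterns are an assumption; check the cluster’s real kubelet config):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# lets pods request net.* sysctls through securityContext.sysctls
allowedUnsafeSysctls:
  - "net.*"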
chapterone-c2
- On their servers in Asiatech-Base
- Servers are managed (by phoenix)
- Chapterone uses their own sso to connect to the cluster
- Has a history of automatically blocking ips in the middle of the night while performing maintenance or updates
daal-c51
- On our servers in Asiatech
- Has dedicated servers (G10 servers with enterprise SSD)
- Servers are billed instead of cluster resources
- All nodes must be created on their own servers
- Adding a new server must be communicated with them
- New server must match the previous spec
hamravesh-c11
- On our servers in Asiatech
- Hamravesh’s public cluster
- Support team may request new nodes to use as a reserve and uncordon if needed
- Has custom iptables rules in frigate
- Gitlab Runners are deployed here (gr nodes)
- c11-gr nodes use a separate egress ip (configured using frigate rules)
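Illustrative only: the kind of SNAT rule this implies (the real rules live in frigate’s config; the ips and interface below are placeholders):
iptables -t nat -A POSTROUTING -s <c11-gr-node-ip>/32 -o <wan-interface> -j SNAT --to-source <gr-egress-ip>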
hamravesh-c12
- On our servers in Asiatech
- Hamravesh’s semi-public cluster
- Hamravesh’s SaaS apps reside here (yektanet’s artifactory, etc)
- Might have custom iptables rules in frigate
- Does not have organized and specific nodepools
- c12-n5 data stores are complicated :))) there are many
hamravesh-c13
- On our servers in Parsonline
- Hamravesh’s public cluster
- Support team may request new nodes to use as a reserve and uncordon if needed
- Has custom iptables rules in frigate
- Has a custom config in both the apiserver and cilium to support an increased nodeport count
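A hedged way to see both halves of that config (the kubeadm-style manifest path and the cilium-config configmap name are assumptions):
grep -- '--service-node-port-range' /etc/kubernetes/manifests/kube-apiserver.yaml   # on a c13 master
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i node-port      # cilium side of the range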
hamravesh-c17
- On our servers in Parsonline
- Hamravesh’s storage and backup cluster
- Mostly uses HDD disks
- Might use servers with custom specs
- Max pods is raised to 200 on backup nodes
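A quick hedged check that the raised limit is actually in effect (the node name is illustrative):
kubectl get node c17-b1 -o jsonpath='{.status.capacity.pods}{"\n"}'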
Known issues
- Outage in c17-b nodes at night or when taking a lot of backups
- Random pod:control-plane:down or apiserver:response-time:high alerts caused by high disk latency in c17-m1, which affects etcd and raises control-plane response times
  - Usually resolves on its own in a few minutes
  - Check it if it doesn’t resolve on its own (see the check below)
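A hedged way to confirm the disk-latency theory before escalating (assumes ssh access to c17-m1; the fsync metric name is the standard upstream etcd one):
ssh c17-m1 'iostat -x 5 3'   # watch await/%util on the disk backing etcd
# in prometheus/grafana:
# histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))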
hamravesh-c18
- On our servers in Parsonline
- Hamravesh’s main services cluster
- Hamravesh’s artifactory resides here, so beware of potential cyclic dependencies during artifactory outages or planned maintenance
- Has custom iptables rules in frigate
Known issues
- After rebooting c18-s8, cri-o doesn’t start automatically and might need to be manually restarted: systemctl restart crio
hamravesh-c23
- On our servers in Hetzner
- Hamravesh’s public cluster
- Has custom iptables rules in frigate
- No dedicated loadbalancer available in this cluster
Known issues
- Traefik’s ip might get filtered
  - Check: https://c23.uptime.hamdc.ir via https://check-host.net/ in HTTP mode
Steps to add new ip
- Pick an ip from ipam.hamravesh.ir (current c23 range)
- Add iptables rules to frigate’s post-rules
- Check frigate logs
- Add the new ip to the netplan config manually and run netplan apply
- Change the c23.hamravesh.hamserver.ir and c23.hamravesh.onhamravesh.ir records to the new ip
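For the netplan step above, a minimal sketch (the netplan file name is an assumption; the real config may differ):
# append the new ip to the addresses list in e.g. /etc/netplan/50-cloud-init.yaml, then:
netplan try     # validates the change and rolls back automatically if connectivity is lost
netplan apply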
Curl to update hamserver record:
curl --request PATCH --url 'https://api.cloudflare.com/client/v4/zones/282df448e83cb9fb689c98f00efc09b1/dns_records/62a392e94dd606ae371d902cb63980e3?type=A' --header 'Authorization: TOKEN' --header 'Content-Type: application/json' --data '{"content": "x.x.x.x"}'
Curl to update onhamravesh record:
curl --request PATCH --url 'https://api.cloudflare.com/client/v4/zones/bafa98ec1702525b8b92eaf383b1ed65/dns_records/057d1855525837459f694f4102b9a65b?type=A' --header 'Authorization: TOKEN' --header 'Content-Type: application/json' --data '{ "content": "157.90.43.55"}'
Sometimes, we pre-assign an IP address, and you just need to apply it to the DNS records.
If a new ip is already added, just change the dns records to the new ip.
hamravesh-c27
- On our servers in Parsonline
- Camelus should determine the server to create the requested node on!
- Hamravesh’s semi-public and storage cluster
- Only customer is Torob
- Mostly has its own servers with custom disk
- 8 * 1 TB SSD disks
hamravesh-c38
- On our servers in Asiatech
- Hamravesh’s observability cluster
Known issues
- Pods stuck in Terminating due to a hanging exec process for the container (most often seen on c38-n6)
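A hedged sketch of clearing it on the node (pod/container names are placeholders; verify what the hanging process actually is before killing anything):
kubectl get pods -A | grep Terminating        # find the stuck pod
# on the affected node (e.g. c38-n6):
crictl ps -a | grep <pod-name>                # get the container id
ps aux | grep <container-id>                  # look for the hanging exec/conmon process
kill -9 <pid>                                 # kubelet can then finish terminating the pod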
hamravesh-c4
- On our servers in Fanava
- Hamravesh’s old cluster
- Is going to be deprecated and isn’t actively maintained
- Uses kubernetes version 1.21
hamravesh-c42
- On our servers in Asiatech
- Mostly uses HDD disks
- Might use servers with custom specs
- Max pods is raised to 200 on backup nodes
Known issues
- Outage in c42-b nodes at night or when taking a lot of backups
- Random pod:control-plane:down or apiserver:response-time:high alerts caused by high disk latency in c42-m1, which affects etcd and raises control-plane response times
  - Usually resolves on its own in a few minutes
  - Check it if it doesn’t resolve on its own
hamravesh-c5
- On our servers in Beeline/Armenia
- Hamravesh’s public cluster
- Is very limited and isn’t used much
- Servers are located in hectora’s vcenter
hamravesh-c7
- On our servers in Zirsakht
- Mainly used as an observability cluster
hamravesh-c78
- On our servers in Parsonline
- Hamravesh’s security cluster
hamravesh-c85
- On our servers in Parsonline
- Hamravesh’s observability cluster
hectora-c22
- The LAN interface of each node on servers other than hectora needs to be changed to VLAN-4002 in the middle of node creation (check c22-d8’s interfaces)
hectora-c26
(Not filled in yet.)
idk-c8
(Not filled in yet.)
karafsapp-c63
(Not filled in yet.)
khodro45-c54
(Not filled in yet.)
mci-kp1
For changing the ntp server of mci nodes:
for node in $(kubectl get nodes -o wide | awk 'NR>1 {print $1}'); do
  echo "===== $node ====="
  # replace <ntp-server-ips> with the space-separated ntp ips you want
  ssh root@"$node".mci.local "sed -i 's/^NTP=.*/NTP=<ntp-server-ips>/' /etc/systemd/timesyncd.conf && systemctl daemon-reload && systemctl restart systemd-timesyncd.service"
done
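To verify the change on a node afterwards (the node name is a placeholder; requires systemd-timesyncd):
ssh root@<node>.mci.local 'timedatectl timesync-status | head'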
mci-ks3
MCI Staging
ks3-m1’s physical location currently has a lot of high-latency disk io issues, which frequently cause control-plane outages during both day and night.
melligold-c66
- On our servers in Asiatech
- Uses gw-controller instead of hlb
mixin-c20
(Not filled in yet.)
nikandishan-c73
(Not filled in yet.)
nobitex-c75 / c76 / c81 / c84 / c86
- Nobitex uses their own sso to connect to the cluster
novinhub-c72
(Not filled in yet.)
raastin-c16
- Ask Phoenix where the enterprise SSDs are.
- On our servers in Asiatech
- Preferably use enterprise ssd disks for c16-d nodes, but currently d nodes can also be created on other servers.
royanegar-c43
- No customizations yet
- On our servers in Asiatech
saraf-c39
- It does not contain dedicated servers, so the directory in the vcenter is useless.
- Uses gw-controller instead of hlb
saraf-c40
- Uses gw-controller instead of hlb
sindad-c15
- On our servers in Fanava
- Hamravesh’s public cluster
- Sindad is the hardware provider
torob-c25
- On torob’s servers in Parsonline
- Torob’s production cluster
- Torob uses their own sso to connect to the cluster
- Has its own vcenter
- Torob uses their own ingress-controller
- Cluster’s apiserver is whitelisted
- Hardcore service-mesh users (istio is deployed and managed by us)
- Node c25-requests-clickhouse0 has a pid limit of 16384 in both the kubelet and CRI-O configs (see the check after this list).
- On the same network as mixin-c20 (public)
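A hedged check of the applied limits on that node (the config paths are the usual defaults, not verified here):
ssh c25-requests-clickhouse0 'grep -i podPidsLimit /var/lib/kubelet/config.yaml'
ssh c25-requests-clickhouse0 'grep -ri pids_limit /etc/crio/crio.conf /etc/crio/crio.conf.d/ 2>/dev/null'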
torob-c34
- On torob’s servers in Hetzner
- Torob’s international/foreign brand
- Torob uses their own sso to connect to the cluster
- Has its own vcenter (separate from torob-c25’s)
- Torob uses their own ingress-controller
- Hardcore service-mesh users (istio is deployed and managed by us)
triboonnet-c61
- No customizations yet
- On our servers in Asiatech
updamus-probers
- Uses k3s instead of k8s
- Not managed by us
- Managed by maratus
- Accessed only with admin token (not connected to oidc)
yektanet-c28
- On our servers in Asiatech
- Yektanet’s production cluster
- Yektanet uses their own sso to connect to the cluster
- Uses ingress-nginx (managed by them) as its ingress-controller
- Has custom haproxy gateways for inbound traffic
- Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
- All nodes must use pcores
- Some nodepools have reserved and static cpus: c28-n, c28-w, c28-db (see the kubelet config sketch after this list)
- Custom metrics and ingress-nginx metrics are in the yektanet-c38-prod datasource in grafana
- Uses host-id and rack-id labels
- Can tolerate downtime of one c28-n or c28-s node, or a few c28-w nodes
- All c28-db nodes must be created on enterprise ssd disks and on separate servers
- Has two ip ranges: 172.30.28.0/24 and 172.30.128.0/24
- Has higher system-reserved and kube-reserved values in its kubelet config (see the sketch after this list)
- Might have custom iptables rules in frigate
- Might have custom hardcoded dns records in frigate
- Uses hlb’s private loadbalancer ip
- Uptime is very important
- c28-helper has a separate public ip
- All four of yektanet’s asiatech clusters can be accessed from each other using private ips
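A hedged sketch of the kind of KubeletConfiguration the static-cpu nodepools and raised reservations imply (all values below are placeholders, not c28’s real numbers):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static   # pins Guaranteed pods with integer cpu requests to exclusive cores
systemReserved:
  cpu: "1"                 # placeholder; c28 reserves more than the defaults
  memory: 2Gi
kubeReserved:
  cpu: "1"
  memory: 2Gi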
Known issues
- Pods not getting created at all during or after a maintenance might be caused by Yektanet’s own mutating or validating webhooks
- Performance gets visibly affected by ilo’s power settings when in power-saving mode (must be set to high performance)
- Connectivity issues when connecting to private loadbalancer ips:
  - In c28-helper: nc -v 172.30.128.65 32489 -> connection failed
  - Escalate to h.marvi
- Pods getting stuck in ContainerCreating with cilium logging that the identity map is full
  - Caused by an increase in ciliumidentities due to a large number of pods getting created and deleted
  - Find the offending jobs/pods and communicate with yektanet (see the one-liner after this list)
  - Escalate to h.marvi or semekh if needed
- coredns pods getting recreated might cause outage in haproxy gateways
- haproxy pods need to be recreated to fix it
- Pods using the hsds-btrfs or hsds-rawfile storage class getting stuck in ContainerCreating
  - Usually in the dev namespace
  - Might see multi-attach is not possible errors in describe pod
  - Escalate to h.marvi
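A hedged one-liner to find where the ciliumidentity growth is coming from (the label key is the standard cilium one; verify against the objects in this cluster):
kubectl get ciliumidentities -o json | jq -r '.items[]."security-labels"."k8s:io.kubernetes.pod.namespace"' | sort | uniq -c | sort -rn | head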
yektanet-c29
- On our servers in Asiatech
- Yektanet’s staging cluster
- Yektanet uses their own sso to connect to the cluster
- Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
- Some nodepools have reserved and static cpus: c29-db
- Max pods is raised to 200
- Pods not getting created at all during or after a maintenance might be caused by Yektanet’s own mutating or validating webhooks
- All four of yektanet’s asiatech clusters can be accessed from each other using private ips
- Has custom haproxy gateways for inbound traffic
yektanet-c31
- On our servers in Asiatech
- Yektanet’s tools cluster
- Yektanet uses their own sso to connect to the cluster
- Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
- Some nodepools have reserved and static cpus: c31-db
- Might have custom iptables rules in frigate
- Might have custom hardcoded dns records in frigate
- Uses hlb’s private loadbalancer ip
- All four of yektanet’s asiatech clusters can be accessed from each other using private ips
- Has custom haproxy gateways for inbound traffic
yektanet-c32 (deleted)
- Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
- Yektanet uses their own sso to connect to the cluster
- All four of yektanet’s asiatech clusters can be accessed from each other using private ips
- Has custom haproxy gateways for inbound traffic
yektanet-c44
- Yektanet uses their own sso to connect to the cluster
yektanet-c53
- Yektanet uses their own sso to connect to the cluster
Joining a new node
- Log in through console.hetzner.com using the hectora credentials for the yektanet project (do not use mehran credentials)
- Don’t use an external volume for vm creation (it’s over the network and somewhat slow)
- Don’t allocate any public ip for the vm
- Select db placement group if it was a db node
- Provide a cloud-init; you can find one at /var/lib/cloud/instance/user-data.txt on another node.
Example cloud-init:
#cloud-config
runcmd:
  - |
    cat <<EOF > /etc/systemd/network/10-enp7s0.network
    [Match]
    Name=enp7s0
    [Network]
    DHCP=yes
    Gateway=172.30.53.1
    EOF
  - systemctl restart systemd-networkd
- Add Semekh and Setup keys
- Kubernetes sysctls (podolica does it automatically for c53 now!)
#!/usr/bin/env bash
set -e
modprobe br_netfilter
echo "br_netfilter" | tee /etc/modules-load.d/k8s.conf > /dev/null
tee /etc/sysctl.d/k8s.conf > /dev/null <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system
zarinpal-c68
- On their servers in Asiatech
- Servers are managed (by phoenix)
- Has a different ip range (172.16.67.0/24)
- The server might get randomly disconnected in vcenter (contact phoenix if needed)
- Has a history of hardware problems in the server that cause outage
- Uses gw-controller instead of hlb
Known issues
- If c68-fw1 gets rebooted, you need to manually run:
iptables-restore < /var/lib/frigate/pre-rules
- In case of server/nodes getting rebooted, loadbalancer ips might need to be manually checked
- Escalate to hosseinmohammadi if needed