Clusters

airtour-c3

  • On airtour’s own server(s)
  • Has its own vcenter
    • https://172.30.64.100 (accessible via the openvpn tunnel)
  • Airtour’s vpn might randomly disconnect and need to be reconnected
  • Uses private ips as ingress controller ips
  • We don’t handle airtour’s public ips
  • Has an haproxy that handles public ips and routes them to our haproxy or frigate
  • The wildcard certificate might be handled by their haproxy rather than by cert-manager (tls is also terminated there)
  • Must use their dns resolvers in frigate
  • Cluster’s apiserver is whitelisted
  • Any ip exposing needs to be coordinated with them
  • Has custom ip route rules to access services in other vlans (see the sketch after this list)
  • Has a custom deployed haproxy to handle traefik’s ips
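
The kind of static route involved looks roughly like this (a sketch only; the cidr, gateway, and interface below are hypothetical placeholders, not the actual airtour values — check the node’s existing rules):

# example static route to reach services in another vlan via the local gateway
ip route add 10.20.0.0/16 via 172.30.64.1 dev ens192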

Known issues

  • Random node:NotReady alerts on c3-p8

alphatech-c60

  • No customizations yet
  • On our servers in Asiatech
  • Uses gw-controller instead of hlb

asanbar-c41

  • No customizations yet
  • On our servers in Parsonline
  • Has at least one vm (IaaS)

bitpin-c70

  • On our servers in Asiatech-Miremad
  • Bitpin’s production cluster
  • Bitpin uses their own sso (managed by Samurai) to connect to the cluster
  • Uses managed (by us) ingress-nginx as its ingress controller
  • Has two ingress-nginx deployments for different use cases
  • Some of their domains are whitelisted using ingress-nginx’s annotations
  • Has dedicated servers, though this is not specified in the contract
  • Has custom vlans configured
  • Has a dedicated vcenter for servers
  • Some of the managed postgresql databases are based on vms in bitpin's servers/vcenter
  • Has a few reserved and static nodes (c70-d?)
  • Has custom rules to connect to bitpin-c71 using private ips
  • Makes heavy use of hlb’s private ip feature (managed by us)
  • Might request exposing new databases using private ip loadbalancer to bitpin-c71

Known issues

  • ingress-nginx logs don’t get rotated
    • Rotation isn’t enabled in loghandler
    • Manually rotate if needed
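
If manual rotation is needed, a minimal sketch (assuming the files in question are the kubelet-managed container logs on the node — the path is an assumption, adjust it to wherever loghandler actually writes):

# truncate in place so the writing process keeps its open file handle
ssh root@<node> 'for f in /var/log/containers/ingress-nginx-*.log; do truncate -s 0 "$f"; done'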

bitpin-c71

Same as above.

  • Bitpin’s tools cluster
  • Bitpin’s gitlab has custom iptables rules in frigate to limit access to port 22
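
The rules are roughly of this shape (a sketch only; the gitlab ip and allowed cidr are placeholders — check frigate’s actual rule files for the real values):

# allow ssh/git access to the gitlab ip only from the allowed range, drop everything else
iptables -A FORWARD -p tcp -d <gitlab-ip> --dport 22 -s <allowed-cidr> -j ACCEPT
iptables -A FORWARD -p tcp -d <gitlab-ip> --dport 22 -j DROP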

Known issues

  • Random job:down alert for c71-gr1’s node-exporter
  • ingress-nginx logs don’t get rotated
    • Rotation isn’t enabled in loghandler
    • Manually rotate if needed

cafebazaar-c55

  • On our servers in Asiatech
  • Cafebazaar’s tools cluster
  • Cafebazaar uses their own sso (managed by Samurai) to connect to the cluster
  • Has re-deployed monitoring stack and other components
  • Has a custom kubelet config to allow network related sysctls

chapterone-c2

  • On their servers in Asiatech-Base
  • Servers are managed (by phoenix)
  • Chapterone uses their own sso to connect to the cluster
  • Has a history of automatically blocking ips in the middle of the night during maintenance or updates

daal-c51

  • On our servers in Asiatech
  • Has dedicated servers (G10 servers with enterprise SSD)
  • Servers are billed instead of cluster resources
  • All nodes must be created on their own servers
  • Adding a new server must be communicated with them
  • New server must match the previous spec

hamravesh-c11

  • On our servers in Asiatech
  • Hamravesh’s public cluster
  • Support team may request new nodes to use as a reserve and uncordon if needed
  • Has custom iptables rules in frigate
  • Gitlab Runners are deployed here (gr nodes)
  • c11-gr nodes use a separate egress ip (configured using frigate rules)
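
The separate egress ip boils down to an SNAT-style rule like this (a sketch; the cidr, interface, and ip are placeholders, not the real c11 values — check frigate’s rules):

# source-NAT traffic from the gr nodes to the dedicated egress ip
iptables -t nat -A POSTROUTING -s <c11-gr-node-cidr> -o <wan-interface> -j SNAT --to-source <gr-egress-ip>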

hamravesh-c12

  • On our servers in Asiatech
  • Hamravesh’s semi-public cluster
  • Hamravesh’s SaaS apps reside here (yektanet’s artifactory, etc)
  • Might have custom iptables rules in frigate
  • Does not have organized and specific nodepools
  • c12-n5 data stores are complicated :))) there are many

hamravesh-c13

  • On our servers in Parsonline
  • Hamravesh’s public cluster
  • Support team may request new nodes to use as a reserve and uncordon if needed
  • Has custom iptables rules in frigate
  • Has a custom config in both the apiserver and cilium to support an increased nodeport count
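
The settings involved look roughly like this (a sketch; the range below is a made-up example — check the cluster’s actual values):

# kube-apiserver flag (static pod manifest on the control-plane nodes):
#   --service-node-port-range=30000-34999
# cilium must agree with it; check the configured range with:
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i node-port-range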

hamravesh-c17

  • On our servers in Parsonline
  • Hamravesh’s storage and backup cluster
  • Mostly uses HDD disks
  • Might use servers with custom specs
  • Max pods is raised to 200 on backup nodes
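
The limit lives in the kubelet config on those nodes; a quick way to verify it (the path is the usual kubeadm default and may differ):

ssh root@<c17-b-node> 'grep -i maxPods /var/lib/kubelet/config.yaml'   # expect: maxPods: 200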

Known issues

  • Outage in c17-b nodes at night or when taking a lot of backups
  • Random pod:control-plane:down or apiserver:response-time:high alerts caused by high disk latency in c17-m1, which affects etcd and causes a higher response time in the control plane
    • Usually resolves on its own in a few minutes
    • If it doesn’t resolve on its own, investigate (see the checks below)
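
Some quick checks if it doesn’t resolve on its own (a sketch; iostat needs the sysstat package on the node):

kubectl -n kube-system get pods -l component=etcd -o wide   # is etcd itself healthy?
ssh root@c17-m1 'iostat -x 5 3'                             # look at disk await / %util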

hamravesh-c18

  • On our servers in Parsonline
  • Hamravesh’s main services cluster
  • Hamravesh’s artifactory resides here so beware of potential cyclic dependencies in case of artifactory outages or planned maintenances
  • Has custom iptables rules in frigate

Known issues

  • After rebooting c18-s8, cri-o doesn’t start automatically and might need to be manually restarted:
    • systemctl restart crio

hamravesh-c23

  • On our servers in Hetzner
  • Hamravesh’s public cluster
  • Has custom iptables rules in frigate
  • No dedicated loadbalancer available in this cluster

Known issues

  • Traefik’s ip might get filtered
    • Check: https://c23.uptime.hamdc.ir via https://check-host.net/ in HTTP mode.

Steps to add new ip

  1. Pick an ip from ipam.hamravesh.ir (current c23 range)
  2. Add iptables rules to frigate’s post-rules
  3. Check frigate’s logs
  4. Add the new ip to the netplan config manually and run netplan apply (see the sketch below)
  5. Change the c23.hamravesh.hamserver.ir and c23.hamravesh.onhamravesh.ir records to the new ip
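
For step 4, the netplan change looks roughly like this (a sketch; the file name and interface are assumptions — edit the existing config on the node):

# in /etc/netplan/<existing-file>.yaml, add the new ip under the public interface:
#   addresses:
#     - <existing-ip>/32
#     - <new-ip>/32
netplan apply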

Curl to update hamserver record:

curl --request PATCH \
  --url 'https://api.cloudflare.com/client/v4/zones/282df448e83cb9fb689c98f00efc09b1/dns_records/62a392e94dd606ae371d902cb63980e3?type=A' \
  --header 'Authorization: TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{"content": "x.x.x.x"}'

Curl to update onhamravesh record:

curl --request PATCH \
  --url 'https://api.cloudflare.com/client/v4/zones/bafa98ec1702525b8b92eaf383b1ed65/dns_records/057d1855525837459f694f4102b9a65b?type=A' \
  --header 'Authorization: TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{"content": "157.90.43.55"}'

Sometimes we pre-assign an IP address, and you only need to apply it to the DNS records. If the new ip has already been added to the server, just point the dns records at it.


hamravesh-c27

  • On our servers in Parsonline
  • Camelus should determine the server to create the requested node on!
  • Hamravesh’s semi-public and storage cluster
  • Only customer is Torob
  • Mostly has its own servers with custom disks
  • 8 * 1 TB SSD disks

hamravesh-c38

  • On our servers in Asiatech
  • Hamravesh’s observability cluster

Known issues


hamravesh-c4

  • On our servers in Fanava
  • Hamravesh’s old cluster
  • Is going to be deprecated and isn’t actively maintained
  • Uses kubernetes version 1.21

hamravesh-c42

  • On our servers in Asiatech
  • Mostly uses HDD disks
  • Might use servers with custom specs
  • Max pods is raised to 200 on backup nodes

Known issues

  • Outage in c42-b nodes at night or when taking a lot of backups
  • Random pod:control-plane:down or apiserver:response-time:high alerts caused by high disk latency in c42-m1, which affects etcd and causes a higher response time in the control plane
    • Usually resolves on its own in a few minutes
    • Check if it doesn’t get resolved on its own

hamravesh-c5

  • On our servers in Beeline/Armenia
  • Hamravesh’s public cluster
  • Is very limited and isn’t used much
  • Servers are located in hectora’s vcenter

hamravesh-c7

  • On our servers in Zirsakht
  • Mainly used as an observability cluster

hamravesh-c78

  • On our servers in Parsonline
  • Hamravesh’s security cluster

hamravesh-c85

  • On our servers in Parsonline
  • Hamravesh’s observability cluster

hectora-c22

  • The LAN interface of each node on servers other than hectora needs to be changed to VLAN-4002 in the middle of node creation
    • (check c22-d8 interfaces)

hectora-c26

(Not filled in yet.)


idk-c8

(Not filled in yet.)


karafsapp-c63

(Not filled in yet.)


khodro45-c54

(Not filled in yet.)


mci-kp1

For changing the ntp server of mci nodes:

# for each node, rewrite the NTP= line in timesyncd's config and restart the service
for node in $(kubectl get nodes -o wide | awk 'NR>1 {print $1}'); do
  echo "===== $node ====="
  ssh root@"$node".mci.local \
    "sed -i 's/^NTP=.*/NTP=<the ntp servers you want>/' /etc/systemd/timesyncd.conf && systemctl daemon-reload && systemctl restart systemd-timesyncd.service"
done

mci-ks3

MCI Staging

  • The physical location of ks3-m1 currently has a lot of high-latency disk io issues, which frequently cause control plane outages during both day and night.

melligold-c66

  • On our servers in Asiatech
  • Uses gw-controller instead of hlb

mixin-c20

(Not filled in yet.)


nikandishan-c73

(Not filled in yet.)


nobitex-c75 / c76 / c81 / c84 / c86

  • Nobitex uses their own sso to connect to the cluster

novinhub-c72

(Not filled in yet.)


raastin-c16

  • Ask Phoenix where the enterprise SSDs are.
  • On our servers in Asiatech
  • c16-d nodes should preferably use enterprise ssd disks, but currently d nodes can be created elsewhere as well.

royanegar-c43

  • No customizations yet
  • On our servers in Asiatech

saraf-c39

  • It does not contain dedicated servers, so the directory in the vcenter is useless.
  • Uses gw-controller instead of hlb

saraf-c40

  • Uses gw-controller instead of hlb

sindad-c15

  • On our servers in Fanava
  • Hamravesh’s public cluster
  • Sindad is the hardware provider

torob-c25

  • On torob’s servers in Parsonline
  • Torob’s production cluster
  • Torob uses their own sso to connect to the cluster
  • Has its own vcenter
  • Torob uses their own ingress-controller
  • Cluster’s apiserver is whitelisted
  • Hardcore service-mesh users (istio is deployed and managed by us)
  • Node c25-requests-clickhouse0 has a pid limit of 16384 in both the kubelet and CRI-O configs (see the check sketch after this list)
  • On the same network with mixin-c20 (public)
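
A quick way to verify the pid limit on that node (a sketch; the paths are the usual defaults and may differ):

ssh root@c25-requests-clickhouse0 'grep -i podPidsLimit /var/lib/kubelet/config.yaml; grep -ri pids_limit /etc/crio/'   # expect 16384 in both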

torob-c34

  • On torob’s servers in Hetzner
  • Torob’s international/foreign brand
  • Torob uses their own sso to connect to the cluster
  • Has its own vcenter (separate from torob-c25’s)
  • Torob uses their own ingress-controller
  • Hardcore service-mesh users (istio is deployed and managed by us)

triboonnet-c61

  • No customizations yet
  • On our servers in Asiatech

updamus-probers

  • Uses k3s instead of k8s
  • Not managed by us
  • Managed by maratus
  • Accessed only with admin token (not connected to oidc)

yektanet-c28

  • On our servers in Asiatech
  • Yektanet’s production cluster
  • Yektanet uses their own sso to connect to the cluster
  • Uses ingress-nginx (managed by them) as its ingress-controller
  • Has custom haproxy gateways for inbound traffic
  • Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
  • All nodes must use pcores
  • Some nodepools have reserved and static cpus:
    • c28-n, c28-w, c28-db
  • Custom metrics and ingress-nginx metrics are in yektanet-c38-prod datasource in grafana
  • Uses host-id and rack-id labels
  • Can tolerate downtime of one c28-n or c28-s or a few c28-w nodes
  • All c28-db nodes must be created on enterprise ssd disks and on separate servers
  • Has two ip ranges:
    • 172.30.28.0/24
    • 172.30.128.0/24
  • Has higher system reserved and kubelet reserved values in its kubelet config
  • Might have custom iptables rules in frigate
  • Might have custom hardcoded dns records in frigate
  • Uses hlb’s private loadbalancer ip
  • Uptime is very important
  • c28-helper has a separate public ip
  • All four of yektanet’s asiatech clusters can be accessed from each other using private ips

Known issues

  • Pods not getting created at all during or after a maintenance might be caused by Yektanet’s own mutating or validating webhooks
  • Performance gets visibly affected when ilo’s power settings are in power saving mode (must be set to high performance)
  • Connectivity issues in connecting to private loadbalancer ips:
    • In c28-helper: nc -v 172.30.128.65 32489 -> connection failed
    • Escalate to h.marvi
  • Pods getting stuck in ContainerCreating with cilium logging identity map is full
    • Caused by a spike in ciliumidentities when a large number of pods get created and deleted
    • Find the offending jobs/pods and communicate with yektanet (see the check sketch after this list)
    • Escalate to h.marvi or semekh if needed
  • coredns pods getting recreated might cause outage in haproxy gateways
    • haproxy pods need to be recreated to fix it
  • Pods using hsds-btrfs or hsds-rawfile storage class getting stuck in ContainerCreating
    • Usually in dev namespace
    • Might see errors about multi-attach not being possible in describe pod
    • Escalate to h.marvi
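
For the identity-map issue, a quick way to gauge the churn (a sketch):

kubectl get ciliumidentities | wc -l                                     # how many identities exist right now
kubectl get pods -A --sort-by=.metadata.creationTimestamp | tail -n 20   # the most recently created pods (likely offenders)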

yektanet-c29

  • On our servers in Asiatech
  • Yektanet’s staging cluster
  • Yektanet uses their own sso to connect to the cluster
  • Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
  • Some nodepools have reserved and static cpus:
    • c29-db
  • Max pods is raised to 200
  • Pods not getting created at all during or after a maintenance might be caused by Yektanet’s own mutating or validating webhooks
  • All four of yektanet’s asiatech clusters can be accessed from each other using private ips
  • Has custom haproxy gateways for inbound traffic

yektanet-c31

  • On our servers in Asiatech
  • Yektanet’s tools cluster
  • Yektanet uses their own sso to connect to the cluster
  • Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
  • Some nodepools have reserved and static cpus:
    • c31-db
  • Might have custom iptables rules in frigate
  • Might have custom hardcoded dns records in frigate
  • Uses hlb’s private loadbalancer ip
  • All four of yektanet’s asiatech clusters can be accessed from each other using private ips
  • Has custom haproxy gateways for inbound traffic

yektanet-c32 (deleted)

  • Yektanet uses their own dedicated public ip range (87.107.167.0/24) in their asiatech clusters
  • Yektanet uses their own sso to connect to the cluster
  • All four of yektanet’s asiatech clusters can be accessed from each other using private ips
  • Has custom haproxy gateways for inbound traffic

yektanet-c44

  • Yektanet uses their own sso to connect to the cluster

yektanet-c53

  • Yektanet uses their own sso to connect to the cluster

Joining a new node

  • Log in to console.hetzner.com with the hectora credentials into the yektanet project (do not use the mehran credentials)
  • Don’t use an external volume for vm creation (it’s over the network and slow for some reason)
  • Don’t allocate any public ip for the vm
  • Select the db placement group if it is a db node
  • Provide a cloud-init config. You can find one at /var/lib/cloud/instance/user-data.txt on another node.

Example cloud-init:

#cloud-config

runcmd:
  - |
    cat <<EOF > /etc/systemd/network/10-enp7s0.network
    [Match]
    Name=enp7s0

    [Network]
    DHCP=yes
    Gateway=172.30.53.1
    EOF
  - systemctl restart systemd-networkd
  • Add Semekh and Setup keys
  • Kubernetes sysctls (podolica does it automatically for c53 now!)
#!/usr/bin/env bash
set -e

# load the br_netfilter module now and make it persist across reboots
modprobe br_netfilter
echo "br_netfilter" | tee /etc/modules-load.d/k8s.conf > /dev/null

# sysctls required by kubernetes networking
tee /etc/sysctl.d/k8s.conf > /dev/null <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF

sysctl --system

zarinpal-c68

  • On their servers in Asiatech
  • Servers are managed (by phoenix)
  • Has a different ip range (172.16.67.0/24)
  • Server might get randomly disconnected in vcenter (contact phoenix if needed)
  • Has a history of hardware problems in the server that cause outage
  • Uses gw-controller instead of hlb

Known issues

  • If c68-fw1 gets rebooted, this needs to be run manually:
iptables-restore < /var/lib/frigate/pre-rules
  • If servers/nodes get rebooted, the loadbalancer ips might need to be checked manually (see the sketch below)
  • Escalate to hosseinmohammadi if needed
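
A quick way to list and probe the loadbalancer ips after a reboot (a sketch):

kubectl get svc -A | grep LoadBalancer   # list services with their external ips
nc -vz <external-ip> <port>              # probe each exposed port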