In a large ACK cluster running Terway, the Felix agent on every node connects directly to the API server to retrieve NetworkPolicy rules, which can overload the API server at scale. This topic explains how to eliminate that bottleneck, either by deploying Typha as a caching layer between the API server and Felix or by disabling the NetworkPolicy feature when network policies are no longer needed.
Background
Terway implements NetworkPolicy using the Felix agent from Calico. Every Felix instance independently watches the Kubernetes API server for policy updates. Because the number of watch connections scales with the number of nodes, API server load grows linearly with cluster size and becomes a concern in clusters with more than 100 nodes.
Typha sits between the API server and all Felix instances, acting as a repeater that reduces the number of direct watch connections to the API server.
Choose your approach:
| Approach | When to use |
|---|---|
| Deploy Typha | Network policies are still needed; cluster has more than 100 nodes |
| Disable NetworkPolicy | Network policies are no longer needed and you want to eliminate all related overhead |
After disabling the NetworkPolicy feature, you cannot use network policies to control communication among pods.
Prerequisites
Before you begin, ensure that you have:
-
An ACK cluster with the Terway plug-in installed and more than 100 nodes. For details, see Create an ACK managed cluster.
-
A kubeconfig file for the cluster with a kubectl client connected to it. For details, see Get kubeconfig and connect kubectl to the cluster.
Deploy Typha as a repeater
Typha acts as a repeater between the Kubernetes API server and Felix agents. Deploy one Typha replica for every 200 nodes, with a minimum of 3 replicas.
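The sizing rule can be sketched as a small shell function. This is a minimal sketch, not part of Terway or Typha: the name `typha_replicas` is hypothetical, and it assumes the rule means the node count divided by 200, rounded up, and raised to 3 when the result is smaller.

```shell
# Hypothetical helper: suggest a Typha replica count for a given node count.
# Assumption: "1 replica per 200 nodes, minimum 3" is read as
# ceil(nodes / 200), raised to 3 when the result is smaller.
typha_replicas() {
  nodes=$1
  # Integer ceiling division: (nodes + 199) / 200
  replicas=$(( (nodes + 199) / 200 ))
  if [ "$replicas" -lt 3 ]; then
    replicas=3
  fi
  echo "$replicas"
}

typha_replicas 150    # prints 3
typha_replicas 1000   # prints 5
```

On a live cluster, the node count could be taken from `kubectl get nodes --no-headers | wc -l` and the result used as the `replicas` value in the Typha Deployment.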
-
Log on to the ACK console.
-
Update Terway to the latest version. For details, see Manage components.
Components differ by Terway mode. For a comparison, see Compare Terway modes.
-
Create a file named `calico-typha.yaml` and add the following content. Replace `{REGION-ID}` with your cluster's region ID. Set `replicas` to 1 per 200 nodes, with a minimum of 3. If your cluster runs a Kubernetes version earlier than 1.21, change `policy/v1` to `policy/v1beta1` in the PodDisruptionBudget section.

    apiVersion: v1
    kind: Service
    metadata:
      name: calico-typha
      namespace: kube-system
      labels:
        k8s-app: calico-typha
    spec:
      ports:
        - port: 5473
          protocol: TCP
          targetPort: calico-typha
          name: calico-typha
      selector:
        k8s-app: calico-typha
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: calico-typha
      namespace: kube-system
      labels:
        k8s-app: calico-typha
    spec:
      replicas: 3 # 1 replica per 200 nodes; minimum 3
      revisionHistoryLimit: 2
      selector:
        matchLabels:
          k8s-app: calico-typha
      template:
        metadata:
          labels:
            k8s-app: calico-typha
          annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
        spec:
          nodeSelector:
            kubernetes.io/os: linux
          hostNetwork: true
          tolerations:
            - operator: Exists
          serviceAccountName: terway
          priorityClassName: system-cluster-critical
          containers:
            - image: registry-vpc.{REGION-ID}.aliyuncs.com/acs/typha:v3.20.2
              name: calico-typha
              ports:
                - containerPort: 5473
                  name: calico-typha
                  protocol: TCP
              env:
                - name: TYPHA_LOGSEVERITYSCREEN
                  value: "info"
                - name: TYPHA_LOGFILEPATH
                  value: "none" # Disable file logging (not needed in Kubernetes)
                - name: TYPHA_LOGSEVERITYSYS
                  value: "none" # Disable syslog (not needed in Kubernetes)
                - name: TYPHA_CONNECTIONREBALANCINGMODE
                  value: "kubernetes" # Monitor Kubernetes API to rebalance Felix connections
                - name: TYPHA_DATASTORETYPE
                  value: "kubernetes"
                - name: TYPHA_HEALTHENABLED
                  value: "true"
              livenessProbe:
                httpGet:
                  path: /liveness
                  port: 9098
                  host: localhost
                periodSeconds: 30
                initialDelaySeconds: 30
              readinessProbe:
                httpGet:
                  path: /readiness
                  port: 9098
                  host: localhost
                periodSeconds: 10
    ---
    apiVersion: policy/v1 # Use policy/v1beta1 for Kubernetes < 1.21
    kind: PodDisruptionBudget
    metadata:
      name: calico-typha
      namespace: kube-system
      labels:
        k8s-app: calico-typha
    spec:
      maxUnavailable: 1
      selector:
        matchLabels:
          k8s-app: calico-typha
    ---
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: bgppeers.crd.projectcalico.org
    spec:
      scope: Cluster
      group: crd.projectcalico.org
      versions:
        - name: v1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                apiVersion:
                  type: string
      names:
        kind: BGPPeer
        plural: bgppeers
        singular: bgppeer

-
Apply the manifest.

    kubectl apply -f calico-typha.yaml

-
Verify that all Typha pods are running.

    kubectl get pods -l k8s-app=calico-typha -n kube-system

All pods should show `1/1` in the READY column and `Running` in the STATUS column before you continue. The output is similar to:

    NAME                            READY   STATUS    RESTARTS   AGE
    calico-typha-66498ddfbd-2pzsr   1/1     Running   0          69s
    calico-typha-66498ddfbd-lrtzw   1/1     Running   0          50s
    calico-typha-66498ddfbd-scckd   1/1     Running   0          62s

-
Configure Terway to route Felix connections through Typha.

    kubectl edit cm eni-config -n kube-system

Inside the `eni_conf` block, add or update the following fields:

    felix_relay_service: calico-typha
    disable_network_policy: "false" # Omit this line if the key does not exist

-
Restart Terway to apply the changes.

    kubectl get pod -n kube-system | grep terway | awk '{print $1}' | xargs kubectl delete -n kube-system pod

The expected output is similar to:

    pod "terway-eniip-8hmz7" deleted
    pod "terway-eniip-dclfn" deleted
    pod "terway-eniip-rmctm" deleted
    ...
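The readiness check in the verification step above (every Typha pod at 1/1 and Running) can also be scripted instead of read by eye. The following is a minimal sketch: `typha_all_ready` is a hypothetical helper, not a kubectl or Typha feature, and it assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS, ...). The demo input is the sample output shown earlier in this topic.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and succeed
# only if at least one pod is listed and every pod row shows
# READY 1/1 and STATUS Running.
typha_all_ready() {
  awk '
    NF == 0 { next }                          # ignore blank lines
    NR == 1 && $1 == "NAME" { next }          # skip the header row
    { rows++ }
    $2 != "1/1" || $3 != "Running" { bad++ }  # count pods that are not ready
    END { if (rows > 0 && bad == 0) exit 0; else exit 1 }
  '
}

# Demo using the sample output from the verification step.
typha_all_ready <<'EOF' && echo "all Typha pods are ready"
NAME                            READY   STATUS    RESTARTS   AGE
calico-typha-66498ddfbd-2pzsr   1/1     Running   0          69s
calico-typha-66498ddfbd-lrtzw   1/1     Running   0          50s
calico-typha-66498ddfbd-scckd   1/1     Running   0          62s
EOF
```

Against a live cluster, the same function could be fed from `kubectl get pods -l k8s-app=calico-typha -n kube-system` before you edit the Terway ConfigMap.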
Disable the NetworkPolicy feature
If network policies are no longer needed, disable the NetworkPolicy feature to remove all Felix-related load from the API server.
After disabling the NetworkPolicy feature, you cannot use network policies to control communication among pods.
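Because existing policies stop taking effect once the feature is disabled, it may be worth checking whether any NetworkPolicy objects still exist first. The following is a minimal sketch of such a pre-check: `count_network_policies` is a hypothetical helper, and the two rows in the demo are invented sample data, not output from a real cluster.

```shell
# Hypothetical helper: count NetworkPolicy objects from the output of
# `kubectl get networkpolicy --all-namespaces --no-headers` read on stdin.
# Each non-empty line is assumed to describe one policy.
count_network_policies() {
  awk 'NF > 0 { n++ } END { print n + 0 }'
}

# Demo with invented sample rows (NAMESPACE NAME POD-SELECTOR AGE).
count_network_policies <<'EOF'
default   deny-all    <none>    12d
prod      allow-web   app=web   3d
EOF
# prints 2
```

Against a live cluster, the input could come from `kubectl get networkpolicy --all-namespaces --no-headers`; a result of 0 suggests that no policies would be affected.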
-
Edit the Terway ConfigMap and set `disable_network_policy` to `"true"`.

    kubectl edit cm -n kube-system eni-config

Add or update the following field:

    disable_network_policy: "true"

-
Restart Terway to apply the changes.

    kubectl get pod -n kube-system | grep terway | awk '{print $1}' | xargs kubectl delete -n kube-system pod

The expected output is similar to:

    pod "terway-eniip-8hmz7" deleted
    pod "terway-eniip-dclfn" deleted
    pod "terway-eniip-rmctm" deleted
    ...
Verify the result
After you deploy Typha, the Felix agents that enforce NetworkPolicy retrieve their data through Typha instead of connecting directly to the API server, which reduces the load on the API server. To check whether the load has dropped, monitor the traffic distributed to the Server Load Balancer (SLB) instances.