Optimize NetworkPolicy performance for large clusters in Terway mode - Container Service for Kubernetes

In a large ACK cluster running Terway, the Felix agent on each node connects directly to the API server to retrieve NetworkPolicy rules, which can overload the API server at scale. This topic explains how to remove that bottleneck by deploying Typha as a caching layer between the API server and Felix, or by disabling the NetworkPolicy feature when network policies are no longer needed.

Background

Terway implements NetworkPolicy using the Felix agent from Calico. Every Felix instance independently watches the Kubernetes API server for policy updates, so the number of watch connections, and with it the API server load, grows linearly with the number of nodes. The effect becomes significant in clusters with more than 100 nodes.

Typha sits between the API server and all Felix instances, acting as a repeater that reduces the number of direct watch connections to the API server.

Choose your approach:

Approach	When to use
Deploy Typha	Network policies are still needed; cluster has more than 100 nodes
Disable NetworkPolicy	Network policies are no longer needed and you want to eliminate all related overhead

Warning

After disabling the NetworkPolicy feature, you cannot use network policies to control communication among pods.

Prerequisites

Before you begin, ensure that you have:

An ACK cluster with the Terway plug-in installed and more than 100 nodes. For details, see Create an ACK managed cluster.
A kubectl client that is connected to the cluster by using the cluster's kubeconfig file. For details, see Get kubeconfig and connect kubectl to the cluster.

Deploy Typha as a repeater

Typha acts as a repeater between the Kubernetes API server and the Felix agents. Run one Typha replica for every 200 nodes, with a minimum of three replicas.
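The sizing rule above can be expressed as a small shell function. This is a minimal sketch: the function name typha_replicas is hypothetical, and rounding a partial group of 200 nodes up to a full replica is an assumption, because the rule does not say how to round.

```shell
#!/bin/sh
# typha_replicas NODES: print a suggested Typha replica count.
# Rule from this topic: 1 replica per 200 nodes, never fewer than 3.
# Assumption: a partial group of 200 nodes is rounded up.
typha_replicas() {
  nodes=$1
  # Integer ceiling of nodes / 200.
  replicas=$(( (nodes + 199) / 200 ))
  if [ "$replicas" -lt 3 ]; then
    replicas=3
  fi
  echo "$replicas"
}

typha_replicas 1000   # prints 5
```

In a live cluster, the node count could be taken from `kubectl get nodes --no-headers | wc -l`, and the result used as the replicas value in the Deployment.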

Log on to the ACK console.
Update Terway to the latest version. For details, see Manage components.

Components differ by Terway mode. For a comparison, see Compare Terway modes.

Create a file named calico-typha.yaml and add the following content. Replace {REGION-ID} with your cluster's region ID. Set replicas to 1 per 200 nodes with a minimum of 3. If your cluster runs Kubernetes earlier than 1.21, change policy/v1 to policy/v1beta1 in the PodDisruptionBudget section.

apiVersion: v1
kind: Service
metadata:
  name: calico-typha
  namespace: kube-system
  labels:
    k8s-app: calico-typha
spec:
  ports:
    - port: 5473
      protocol: TCP
      targetPort: calico-typha
      name: calico-typha
  selector:
    k8s-app: calico-typha

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: calico-typha
  namespace: kube-system
  labels:
    k8s-app: calico-typha
spec:
  replicas: 3  # 1 replica per 200 nodes; minimum 3
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      k8s-app: calico-typha
  template:
    metadata:
      labels:
        k8s-app: calico-typha
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      hostNetwork: true
      tolerations:
        - operator: Exists
      serviceAccountName: terway
      priorityClassName: system-cluster-critical
      containers:
      - image: registry-vpc.{REGION-ID}.aliyuncs.com/acs/typha:v3.20.2
        name: calico-typha
        ports:
        - containerPort: 5473
          name: calico-typha
          protocol: TCP
        env:
          - name: TYPHA_LOGSEVERITYSCREEN
            value: "info"
          - name: TYPHA_LOGFILEPATH
            value: "none"      # Disable file logging (not needed in Kubernetes)
          - name: TYPHA_LOGSEVERITYSYS
            value: "none"      # Disable syslog (not needed in Kubernetes)
          - name: TYPHA_CONNECTIONREBALANCINGMODE
            value: "kubernetes"  # Monitor Kubernetes API to rebalance Felix connections
          - name: TYPHA_DATASTORETYPE
            value: "kubernetes"
          - name: TYPHA_HEALTHENABLED
            value: "true"
        livenessProbe:
          httpGet:
            path: /liveness
            port: 9098
            host: localhost
          periodSeconds: 30
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /readiness
            port: 9098
            host: localhost
          periodSeconds: 10

---

apiVersion: policy/v1  # Use policy/v1beta1 for Kubernetes < 1.21
kind: PodDisruptionBudget
metadata:
  name: calico-typha
  namespace: kube-system
  labels:
    k8s-app: calico-typha
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: calico-typha

---

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: bgppeers.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          apiVersion:
            type: string
  names:
    kind: BGPPeer
    plural: bgppeers
    singular: bgppeer
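The two manual substitutions described before the manifest (the region ID and, on older clusters, the PodDisruptionBudget API version) can also be scripted. The following is a sketch, not part of the official procedure: render_manifest is a hypothetical helper, cn-hangzhou is only an example region ID, and the second argument is assumed to be the Kubernetes minor version as a plain integer.

```shell
#!/bin/sh
# render_manifest REGION_ID K8S_MINOR
# Reads the manifest on stdin and writes the adjusted manifest to stdout.
#   - replaces every {REGION-ID} placeholder with REGION_ID
#   - rewrites the PodDisruptionBudget apiVersion to policy/v1beta1 when
#     K8S_MINOR is lower than 21 (any trailing comment on that line is dropped)
render_manifest() {
  region=$1
  minor=$2
  if [ "$minor" -lt 21 ]; then
    pdb_api="policy/v1beta1"
  else
    pdb_api="policy/v1"
  fi
  sed -e "s/{REGION-ID}/${region}/g" \
      -e "s|^apiVersion: policy/v1\$|apiVersion: ${pdb_api}|" \
      -e "s|^apiVersion: policy/v1[[:space:]].*\$|apiVersion: ${pdb_api}|"
}
```

For example, `render_manifest cn-hangzhou 20 < calico-typha.yaml > calico-typha.rendered.yaml` would write an adjusted copy (the output file name is illustrative) that can be reviewed before it is applied.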

Apply the manifest.
```
kubectl apply -f calico-typha.yaml
```

Verify that all Typha pods are running.

kubectl get pods -l k8s-app=calico-typha -n kube-system

All pods should show 1/1 in the READY column and Running in the STATUS column before you continue. The output is similar to:

NAME                            READY   STATUS    RESTARTS   AGE
calico-typha-66498ddfbd-2pzsr   1/1     Running   0          69s
calico-typha-66498ddfbd-lrtzw   1/1     Running   0          50s
calico-typha-66498ddfbd-scckd   1/1     Running   0          62s
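The readiness condition described above can be checked mechanically instead of by eye. A minimal sketch, assuming the default table output of kubectl get pods; all_pods_ready is a hypothetical helper name:

```shell
#!/bin/sh
# all_pods_ready: read the table printed by `kubectl get pods` on stdin.
# Exit status 0 only if at least one pod is listed and every listed pod is
# in the Running state with all containers ready (READY column n/n).
all_pods_ready() {
  awk '
    NR == 1 && $1 == "NAME" { next }   # skip the header row
    {
      seen = 1
      split($2, ready, "/")            # READY column, for example 1/1
      if (ready[1] != ready[2] || $3 != "Running") bad = 1
    }
    END { exit !(seen && !bad) }
  '
}
```

A possible use is `kubectl get pods -l k8s-app=calico-typha -n kube-system | all_pods_ready && echo "Typha is ready"`. Empty input counts as not ready, so a label selector that matches nothing is not mistaken for success.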

Configure Terway to route Felix connections through Typha.

kubectl edit cm eni-config -n kube-system

In the data section, at the same indentation level as the eni_conf key (not inside its value), add or update the following fields:

  felix_relay_service: calico-typha
  disable_network_policy: "false"  # Omit this line if the key does not exist
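After saving the ConfigMap, it can be worth confirming that the relay setting was stored before restarting Terway. A minimal sketch, assuming the ConfigMap is dumped as YAML; relay_configured is a hypothetical helper name:

```shell
#!/bin/sh
# relay_configured SERVICE: read ConfigMap YAML on stdin and succeed only
# if it contains a felix_relay_service entry whose value is SERVICE
# (with or without surrounding double quotes).
relay_configured() {
  grep -Eq "^[[:space:]]*felix_relay_service:[[:space:]]*\"?$1\"?[[:space:]]*\$"
}
```

For example, `kubectl get cm eni-config -n kube-system -o yaml | relay_configured calico-typha || echo "relay not set"` prints a message when the entry is missing or points at a different Service.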

Restart Terway to apply the changes.

kubectl get pod -n kube-system | grep terway | awk '{print $1}' | xargs kubectl delete -n kube-system pod

The expected output is similar to:

pod "terway-eniip-8hmz7" deleted
pod "terway-eniip-dclfn" deleted
pod "terway-eniip-rmctm" deleted
...
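The restart command above selects pods by searching each output line for the string terway. A slightly stricter variant, shown here as a sketch with the hypothetical helper name terway_pod_names, matches only on the pod name column, so that a match elsewhere in the line does not select an unrelated pod:

```shell
#!/bin/sh
# terway_pod_names: read the table printed by `kubectl get pod` on stdin
# and print the name (first column) of every pod whose name contains
# the string "terway".
terway_pod_names() {
  awk '$1 ~ /terway/ { print $1 }'
}
```

It can replace the grep and awk stages of the pipeline: `kubectl get pod -n kube-system | terway_pod_names | xargs kubectl delete -n kube-system pod`. Running the first two stages alone is a way to preview which pods would be deleted.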

Disable the NetworkPolicy feature

If network policies are no longer needed, disable the NetworkPolicy feature to remove all Felix-related load from the API server.

Warning

After disabling the NetworkPolicy feature, you cannot use network policies to control communication among pods.

Edit the Terway ConfigMap and set disable_network_policy to "true".
```
kubectl edit cm -n kube-system eni-config
```
Add or update the following field:
```
disable_network_policy: "true"
```
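To double-check the setting before restarting Terway, the stored value can be read back and interpreted. The sketch below encodes only what this topic states: "true" disables the feature, while "false" or a missing key leaves it enabled. The helper name policy_state is hypothetical, and any other value is reported as unknown rather than guessed.

```shell
#!/bin/sh
# policy_state VALUE: interpret a disable_network_policy value.
# An empty VALUE stands for a key that is not present in the ConfigMap.
policy_state() {
  case "$1" in
    true)     echo "disabled" ;;
    false|"") echo "enabled" ;;
    *)        echo "unknown" ;;
  esac
}

policy_state "true"   # prints disabled
```

Assuming the key is a top-level entry under data in the ConfigMap, the live value could be passed in with `policy_state "$(kubectl get cm eni-config -n kube-system -o jsonpath='{.data.disable_network_policy}')"`.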

Restart Terway to apply the changes.

kubectl get pod -n kube-system | grep terway | awk '{print $1}' | xargs kubectl delete -n kube-system pod

The expected output is similar to:

pod "terway-eniip-8hmz7" deleted
pod "terway-eniip-dclfn" deleted
pod "terway-eniip-rmctm" deleted
...

Verify the result

After you deploy Typha, the Felix agents retrieve NetworkPolicy data through Typha instead of connecting directly to the API server, which reduces the load on the API server. To check whether the load has dropped, monitor the traffic on the Server Load Balancer (SLB) instances in front of the API server.