Container Service for Kubernetes: Optimize the scalability of NetworkPolicy in large-scale Terway clusters

Last Updated: Mar 26, 2026

In a large ACK cluster that runs Terway, the Felix agent on every node connects directly to the API server to retrieve NetworkPolicy rules, which can overload the API server at scale. This topic explains how to remove that bottleneck, either by deploying Typha as a caching layer between the API server and Felix or by disabling the NetworkPolicy feature when network policies are no longer needed.

Background

Terway implements NetworkPolicy using the Felix agent from Calico. Every Felix instance independently watches the Kubernetes API server for policy updates, so the number of watch connections, and with it the API server load, grows linearly with the number of nodes. The load becomes a concern in clusters with more than 100 nodes.

Typha sits between the API server and all Felix instances, acting as a repeater that reduces the number of direct watch connections to the API server.

Choose your approach:

Approach                 When to use
Deploy Typha             Network policies are still needed; cluster has more than 100 nodes
Disable NetworkPolicy    Network policies are no longer needed and you want to eliminate all related overhead

Warning

After disabling the NetworkPolicy feature, you cannot use network policies to control communication among pods.

Prerequisites

Before you begin, ensure that you have:

Deploy Typha as a repeater

Typha relays policy data from the Kubernetes API server to the Felix agents, so only the Typha replicas watch the API server. Run 1 Typha replica for every 200 nodes, with a minimum of 3 replicas.
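
The sizing rule can be written as a small calculation. The sketch below follows the reading in the manifest comment (1 replica per 200 nodes, minimum 3) and assumes that a partial block of 200 nodes rounds up, which this topic does not state explicitly. The helper name typha_replicas is hypothetical.

```shell
#!/bin/sh
# typha_replicas NODE_COUNT
# Prints a suggested Typha replica count for the given number of nodes.
typha_replicas() {
  nodes=$1
  # Ceiling division: one replica per 200 nodes.
  # Rounding up for a partial block is an assumption, not a documented rule.
  replicas=$(( (nodes + 199) / 200 ))
  # Enforce the documented minimum of 3 replicas.
  if [ "$replicas" -lt 3 ]; then
    replicas=3
  fi
  echo "$replicas"
}

typha_replicas 150    # prints 3
typha_replicas 1000   # prints 5
```

To size against a live cluster, the node count could be passed in directly, for example typha_replicas "$(kubectl get nodes --no-headers | wc -l)".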

  1. Log on to the ACK console.

  2. Update Terway to the latest version. For details, see Manage components.

    Components differ by Terway mode. For a comparison, see Compare Terway modes.
  3. Create a file named calico-typha.yaml and add the following content. Replace {REGION-ID} with the region ID of your cluster. Set replicas to 1 per 200 nodes, with a minimum of 3. If your cluster runs a Kubernetes version earlier than 1.21, change policy/v1 to policy/v1beta1 in the PodDisruptionBudget section.

    apiVersion: v1
    kind: Service
    metadata:
      name: calico-typha
      namespace: kube-system
      labels:
        k8s-app: calico-typha
    spec:
      ports:
        - port: 5473
          protocol: TCP
          targetPort: calico-typha
          name: calico-typha
      selector:
        k8s-app: calico-typha
    
    ---
    
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: calico-typha
      namespace: kube-system
      labels:
        k8s-app: calico-typha
    spec:
      replicas: 3  # 1 replica per 200 nodes; minimum 3
      revisionHistoryLimit: 2
      selector:
        matchLabels:
          k8s-app: calico-typha
      template:
        metadata:
          labels:
            k8s-app: calico-typha
          annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
        spec:
          nodeSelector:
            kubernetes.io/os: linux
          hostNetwork: true
          tolerations:
            - operator: Exists
          serviceAccountName: terway
          priorityClassName: system-cluster-critical
          containers:
          - image: registry-vpc.{REGION-ID}.aliyuncs.com/acs/typha:v3.20.2
            name: calico-typha
            ports:
            - containerPort: 5473
              name: calico-typha
              protocol: TCP
            env:
              - name: TYPHA_LOGSEVERITYSCREEN
                value: "info"
              - name: TYPHA_LOGFILEPATH
                value: "none"      # Disable file logging (not needed in Kubernetes)
              - name: TYPHA_LOGSEVERITYSYS
                value: "none"      # Disable syslog (not needed in Kubernetes)
              - name: TYPHA_CONNECTIONREBALANCINGMODE
                value: "kubernetes"  # Monitor Kubernetes API to rebalance Felix connections
              - name: TYPHA_DATASTORETYPE
                value: "kubernetes"
              - name: TYPHA_HEALTHENABLED
                value: "true"
            livenessProbe:
              httpGet:
                path: /liveness
                port: 9098
                host: localhost
              periodSeconds: 30
              initialDelaySeconds: 30
            readinessProbe:
              httpGet:
                path: /readiness
                port: 9098
                host: localhost
              periodSeconds: 10
    
    ---
    
    apiVersion: policy/v1  # Use policy/v1beta1 for Kubernetes < 1.21
    kind: PodDisruptionBudget
    metadata:
      name: calico-typha
      namespace: kube-system
      labels:
        k8s-app: calico-typha
    spec:
      maxUnavailable: 1
      selector:
        matchLabels:
          k8s-app: calico-typha
    
    ---
    
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: bgppeers.crd.projectcalico.org
    spec:
      scope: Cluster
      group: crd.projectcalico.org
      versions:
      - name: v1
        served: true
        storage: true
        schema:
          openAPIV3Schema:
            type: object
            properties:
              apiVersion:
                type: string
      names:
        kind: BGPPeer
        plural: bgppeers
        singular: bgppeer
  4. Apply the manifest.

    kubectl apply -f calico-typha.yaml
  5. Verify that all Typha pods are running.

    kubectl get pods -l k8s-app=calico-typha -n kube-system

    All pods should show 1/1 in the READY column and Running in the STATUS column before you continue. The output is similar to:

    NAME                            READY   STATUS    RESTARTS   AGE
    calico-typha-66498ddfbd-2pzsr   1/1     Running   0          69s
    calico-typha-66498ddfbd-lrtzw   1/1     Running   0          50s
    calico-typha-66498ddfbd-scckd   1/1     Running   0          62s
  6. Configure Terway to route Felix connections through Typha.

    kubectl edit cm eni-config -n kube-system

    Inside the eni_conf block, add or update the following fields:

      felix_relay_service: calico-typha
      disable_network_policy: "false"  # Omit this line if the key does not exist
  7. Restart Terway to apply the changes.

    kubectl get pod -n kube-system | grep terway | awk '{print $1}' | xargs kubectl delete -n kube-system pod

    The expected output is similar to:

    pod "terway-eniip-8hmz7" deleted
    pod "terway-eniip-dclfn" deleted
    pod "terway-eniip-rmctm" deleted
    ...
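
Because kubectl edit is interactive, it can help to confirm afterwards that the ConfigMap actually carries the intended values. The following is a minimal sketch, not an official check: check_typha_relay is a hypothetical helper that scans ConfigMap YAML line by line, and it assumes the two keys appear in the plain key: value form shown in step 6.

```shell
#!/bin/sh
# check_typha_relay
# Reads ConfigMap YAML on stdin. Succeeds only when felix_relay_service is
# set to calico-typha and disable_network_policy is not set to true.
check_typha_relay() {
  cm=$(cat)
  # The relay service must point at the calico-typha Service.
  if ! printf '%s\n' "$cm" |
      grep -Eq '^[[:space:]]*felix_relay_service:[[:space:]]*"?calico-typha"?[[:space:]]*$'; then
    echo "felix_relay_service is not set to calico-typha" >&2
    return 1
  fi
  # NetworkPolicy must stay enabled; a missing key is treated as enabled.
  if printf '%s\n' "$cm" |
      grep -Eq '^[[:space:]]*disable_network_policy:[[:space:]]*"?true"?'; then
    echo "disable_network_policy is set to true" >&2
    return 1
  fi
  echo "Typha relay settings look correct"
}
```

A possible invocation is kubectl get cm eni-config -n kube-system -o yaml | check_typha_relay. The line-based match is deliberately simple and would miss values that kubectl renders in an escaped, single-line form.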

Disable the NetworkPolicy feature

If network policies are no longer needed, disable the NetworkPolicy feature to remove all Felix-related load from the API server.

Warning

After disabling the NetworkPolicy feature, you cannot use network policies to control communication among pods.

  1. Edit the Terway ConfigMap and set disable_network_policy to "true".

    kubectl edit cm eni-config -n kube-system

    Add or update the following field:

    disable_network_policy: "true"
  2. Restart Terway to apply the changes.

    kubectl get pod -n kube-system | grep terway | awk '{print $1}' | xargs kubectl delete -n kube-system pod

    The expected output is similar to:

    pod "terway-eniip-8hmz7" deleted
    pod "terway-eniip-dclfn" deleted
    pod "terway-eniip-rmctm" deleted
    ...
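
Deleting the pods only triggers the restart; the replacement Terway pods still need to come up. As a hedged sketch, the hypothetical helper terway_not_ready below reads kubectl get pods --no-headers output and lists any Terway pod that is not fully ready. It assumes Terway pod names start with terway, as in the sample output above.

```shell
#!/bin/sh
# terway_not_ready
# Reads `kubectl get pods --no-headers` output on stdin. Prints each Terway
# pod that is not fully ready and returns non-zero if at least one is found.
terway_not_ready() {
  awk '
    $1 ~ /^terway/ {
      split($2, ready, "/")   # READY column, for example "2/2"
      if (ready[1] != ready[2] || $3 != "Running") {
        print $1
        bad = 1
      }
    }
    END { exit bad + 0 }
  '
}
```

One way to use it is kubectl get pods -n kube-system --no-headers | terway_not_ready, repeated until it prints nothing and returns 0.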

Verify the result

After you deploy Typha, the Felix agents retrieve NetworkPolicy data through Typha instead of connecting to the API server directly, which reduces the load on the API server. To check whether the load has dropped, monitor the traffic distributed to the Server Load Balancer (SLB) instances.
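
As a complementary check that is not part of the documented procedure, the Felix connections that reach Typha can be counted on a node that hosts a Typha pod. The sketch below assumes that Typha listens on port 5473 in the host network namespace, as configured in the manifest, and that the ss utility is available on the node. The helper name count_typha_clients is hypothetical.

```shell
#!/bin/sh
# count_typha_clients
# Reads `ss -Htn state established` output on stdin and counts connections
# whose local port is 5473, the Typha port used in the manifest.
count_typha_clients() {
  # The local address is the second-to-last field; the peer is the last one.
  awk 'NF >= 2 && $(NF-1) ~ /:5473$/ { n++ } END { print n + 0 }'
}
```

On a Typha node this could be run as ss -Htn state established | count_typha_clients. If the relay is in effect, the counts summed over all Typha nodes should be roughly the number of nodes that run Felix.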