This topic provides answers to some frequently asked questions about horizontal pod autoscaling.
Issue 1: What do I do if HPA fails to collect resource metrics?
If a FailedGetResourceMetric warning is returned in the Events section, as shown in the following Horizontal Pod Autoscaler (HPA) conditions, the HPA controller cannot collect resource metrics from the monitored resources.
Name:                                                  kubernetes-tutorial-deployment
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Mon, 10 Jun 2019 11:46:48 +0530
Reference:                                             Deployment/kubernetes-tutorial-deployment
Metrics:                                               ( current / target )
  resource cpu on pods (as a percentage of request):   <unknown> / 2%
Min replicas:                                          1
Max replicas:                                          4
Deployment pods:                                       1 current / 0 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Events:
  Type     Reason                   Age                      From                       Message
  ----     ------                   ----                     ----                       -------
  Warning  FailedGetResourceMetric  3m3s (x1009 over 4h18m)  horizontal-pod-autoscaler  unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

Possible causes:
Cause 1: The data sources from which resource metrics are collected are unavailable.
Run the kubectl top pod command to check whether metric data is returned for the monitored pods. If no metric data is returned, run the kubectl get apiservice command to check whether the metrics-server component is available. The following output shows an example of the returned data:

NAME                                   SERVICE                      AVAILABLE   AGE
v1.                                    Local                        True        29h
v1.admissionregistration.k8s.io        Local                        True        29h
v1.apiextensions.k8s.io                Local                        True        29h
v1.apps                                Local                        True        29h
v1.authentication.k8s.io               Local                        True        29h
v1.authorization.k8s.io                Local                        True        29h
v1.autoscaling                         Local                        True        29h
v1.batch                               Local                        True        29h
v1.coordination.k8s.io                 Local                        True        29h
v1.monitoring.coreos.com               Local                        True        29h
v1.networking.k8s.io                   Local                        True        29h
v1.rbac.authorization.k8s.io           Local                        True        29h
v1.scheduling.k8s.io                   Local                        True        29h
v1.storage.k8s.io                      Local                        True        29h
v1alpha1.argoproj.io                   Local                        True        29h
v1alpha1.fedlearner.k8s.io             Local                        True        5h11m
v1beta1.admissionregistration.k8s.io   Local                        True        29h
v1beta1.alicloud.com                   Local                        True        29h
v1beta1.apiextensions.k8s.io           Local                        True        29h
v1beta1.apps                           Local                        True        29h
v1beta1.authentication.k8s.io          Local                        True        29h
v1beta1.authorization.k8s.io           Local                        True        29h
v1beta1.batch                          Local                        True        29h
v1beta1.certificates.k8s.io            Local                        True        29h
v1beta1.coordination.k8s.io            Local                        True        29h
v1beta1.events.k8s.io                  Local                        True        29h
v1beta1.extensions                     Local                        True        29h
...
v1beta1.metrics.k8s.io                 kube-system/metrics-server   True        29h
...
v1beta1.networking.k8s.io              Local                        True        29h
v1beta1.node.k8s.io                    Local                        True        29h
v1beta1.policy                         Local                        True        29h
v1beta1.rbac.authorization.k8s.io      Local                        True        29h
v1beta1.scheduling.k8s.io              Local                        True        29h
v1beta1.storage.k8s.io                 Local                        True        29h
v1beta2.apps                           Local                        True        29h
v2beta1.autoscaling                    Local                        True        29h
v2beta2.autoscaling                    Local                        True        29h

If the SERVICE for v1beta1.metrics.k8s.io is not kube-system/metrics-server, check whether metrics-server is overwritten by Prometheus Operator. If metrics-server is overwritten by Prometheus Operator, use the following YAML template to redeploy metrics-server:

apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  service:
    name: metrics-server
    namespace: kube-system
  group: metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100

If no error is found after you perform the preceding checks, see the troubleshooting section in the related topic of metrics-server.
Cause 2: Metrics cannot be collected during a rolling update or scale-out activity.
By default, metrics-server collects metrics at 1-second intervals. However, after a rolling update or scale-out activity, metrics-server must wait a few seconds before it can collect metrics from the new pods. We recommend that you query metrics 2 seconds after a rolling update or scale-out activity.
Cause 3: The requests field is missing.
HPA calculates the CPU or memory usage of a pod as used resource/requested resource. If the requested resources are not specified in the pod configurations, HPA cannot calculate the resource usage. Therefore, you must make sure that the requests field is specified in the pod configurations.
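The following Deployment is a minimal sketch that shows where the requests field belongs. The workload name, labels, and image are placeholder assumptions; only the resources.requests section is required for HPA to calculate usage.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                    # placeholder workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25      # placeholder image
        resources:
          requests:            # required for HPA to calculate CPU and memory usage
            cpu: 250m
            memory: 256Mi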
Issue 2: An excessive number of pods are added by HPA during a rolling update
During a rolling update, kube-controller-manager performs zero filling on the metric values of pods whose monitoring data cannot be collected. This may cause HPA to add an excessive number of pods. Make sure that metrics-server is updated to the latest version and configure the following startup settings for the pod on which metrics-server is deployed:
# Add the following configuration to the startup settings.
--enable-hpa-rolling-update-skipped=true

Issue 3: What do I do if HPA does not scale pods when the scaling threshold is reached?
HPA does not trigger scaling based solely on whether a metric exceeds or falls below the threshold. It also takes other factors into consideration when it scales pods. For example, it checks whether the current scale-out activity would immediately trigger a scale-in activity, or whether the current scale-in activity would immediately trigger a scale-out activity. This avoids repetitive scaling and prevents unnecessary resource consumption.
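If you want HPA to wait longer before it scales in after a spike, the behavior field of the autoscaling/v2 API (available in Kubernetes 1.18 and later) lets you set a scale-down stabilization window. The following is a minimal sketch; the workload name web, the CPU target, and the window length are placeholder assumptions:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # placeholder workload
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # placeholder CPU target
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of stable metrics before scaling in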
Issue 4: How do I set the data collection interval for HPA?
For metrics-server versions later than 0.2.1-b46d98c-aliyun, specify the --metric-resolution parameter in the startup settings. Example: --metric-resolution=15s.
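The following is a minimal sketch of where the parameter fits, assuming metrics-server runs as a Deployment named metrics-server in the kube-system namespace, which is the default layout. Keep the existing args of your deployment and append the parameter:

spec:
  template:
    spec:
      containers:
      - name: metrics-server
        args:                        # keep your existing args and append this flag
        - --metric-resolution=15s    # collect metrics every 15 seconds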
Issue 5: I configured a CPU HPA and a memory HPA. Can the memory HPA trigger a scale-in activity after the CPU HPA triggers a scale-out activity?
Yes, it can. If you configure two separate HPAs, each one makes scaling decisions independently. As a result, the memory HPA can trigger a scale-in activity right after the CPU HPA triggers a scale-out activity. We recommend that you use the same HPA to monitor both CPU and memory usage.
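The following is a minimal sketch of a single HPA that monitors both CPU and memory usage. The workload name and the utilization targets are placeholder assumptions:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # placeholder workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # placeholder CPU target
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80   # placeholder memory target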
Issue 6: Must both the CPU usage and memory usage exceed the thresholds to trigger a scale-out activity if I use the same HPA to monitor CPU and memory usage?
No. A scale-out activity is triggered when either the CPU usage or the memory usage exceeds its threshold. The numbers of pods to be added based on the CPU usage and the memory usage may be different. To ensure the stability of the workloads, the system preferably performs the scale-out activity that adds more pods when both the CPU usage and memory usage trigger scale-out activities. For example, if the CPU usage requires 5 pods and the memory usage requires 3 pods, HPA scales the workload out to 5 pods. The same rule applies to scale-in activities: the system preferably performs the scale-in activity that retains more pods when both the CPU usage and memory usage trigger scale-in activities.