
Observability | Best Practices for Monitoring NGINX Ingress Gateways with Prometheus

Part 1 of this series introduces best practices for monitoring NGINX Ingress gateways with Prometheus and walks through the implementation process.

By Lingzhu

An Introduction to NGINX Ingress Gateway

In a Kubernetes cluster, NGINX Ingress proxies and forwards north-south traffic. It generates concrete routing rules from the Ingress resources configured in the cluster. Ingress resources manage externally exposed services, which are typically accessed over HTTP. With NGINX Ingress and Ingress resources, you can implement the following scenarios:

1.  Use NGINX Ingress to forward all client traffic to a single Service:

Figure: An Introduction to the Nginx Ingress Working Mode

2.  Use NGINX Ingress to generate more complex routing rules that forward traffic from a single bound IP address to different Services based on the URL path prefix.

Figure: Forward Based on the URL Request Path

3.  Distribute traffic from a single bound IP address to different backend Services according to the Host field in the HTTP request header, which is usually determined by the domain name being accessed. This realizes name-based virtual hosting; a sample Ingress manifest follows the figure below.

Figure: Forward Requests Based on the Host Header
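
For reference, the following is a minimal sketch of scenario 3: a single Ingress resource that routes traffic to different backend Services based on the Host header. The hostnames, Service names, and ports are hypothetical placeholders.

# Hypothetical name-based virtual hosting example; hostnames and Services are placeholders
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: name-based-virtual-host
spec:
  ingressClassName: nginx
  rules:
    - host: foo.example.com        # requests with Host: foo.example.com ...
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: foo-service   # ... go to the foo-service backend
                port:
                  number: 80
    - host: bar.example.com        # requests with Host: bar.example.com ...
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: bar-service   # ... go to the bar-service backend
                port:
                  number: 80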

When monitoring an NGINX Ingress gateway, we usually focus on two types of core metric data:

Workload Resources

This refers to the load of the NGINX Ingress Controller Pods themselves. When resource usage (such as CPU and memory) is saturated or overloaded, the services the cluster exposes externally become unstable. For workload monitoring, we recommend focusing on the USE metrics: Utilization, Saturation, and Errors. Alibaba Cloud Prometheus Service provides a preset performance monitoring dashboard; see Install and Configure Workload Performance Monitoring [1] for more information.

Ingress Request Traffic

This covers the analysis and statistics of ingress traffic: cluster-wide traffic, the traffic forwarded by each Ingress rule, the traffic of each Service, the success rate, error rate, latency, and the source IP address and device of requests. For ingress request traffic monitoring, we recommend focusing on the RED metrics: Rate, Errors, and Duration. You can follow the best practices in this article to implement it.

Implementation of NGINX Ingress Gateway Monitoring

Based on Exporter Metrics

A major feature of the Kubernetes community's NGINX Ingress release, which is built on open-source NGINX, is that each controller process also acts as an Exporter and exposes self-monitoring metrics in the Prometheus format, such as:

nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="200"} 2.401964e+06
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="304"} 111
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="308"} 553545
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="404"} 55
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="499"} 2
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="500"} 64
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontendproxy",status="200"} 59599
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontendproxy",status="304"} 15
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontendproxy",status="308"} 15709
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontendproxy",status="403"} 235
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="e-commerce.

You can use the open-source Prometheus Agent or the Alibaba Cloud Prometheus Agent, together with service discovery policies, to scrape and report these metrics, then analyze them and configure alerts with PromQL or visualize them in Grafana. However, this type of monitoring implementation runs into several problems in production practice.
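Before turning to those problems, the following is a minimal sketch of the PromQL-plus-alerting workflow mentioned above: a Prometheus alerting rule that computes the 5xx ratio per Ingress and Service from the nginx_ingress_controller_requests counter shown earlier. The rule name, threshold, and grouping labels are illustrative assumptions to adapt to your environment.

# Hypothetical alerting rule file; threshold and grouping labels are illustrative
groups:
  - name: nginx-ingress-exporter-red
    rules:
      - alert: NginxIngressHigh5xxRatio
        # 5xx responses divided by all responses over the last 5 minutes,
        # grouped by the ingress and service labels exposed by the controller
        expr: |
          sum by (ingress, service) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
            /
          sum by (ingress, service) (rate(nginx_ingress_controller_requests[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "5xx ratio above 5% for {{ $labels.service }} (ingress {{ $labels.ingress }})"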

Problem 1: Too Many Rarely Used Histogram Metrics Are Exposed

After you scrape NGINX Ingress in a production or test cluster, you will find a large number of histogram metrics in the exposed metric list. A histogram metric is usually exposed as a family of _bucket series together with _sum and _count, and some of them are rarely used for common analytics, such as:

  1. nginx_ingress_controller_request_size_bucket: bucket sampling for each request body size
  2. nginx_ingress_controller_bytes_sent_bucket: bucket sampling for each response body size

By default, if you do not drop them in the metric_relabel_configs section of the Prometheus scrape configuration, these metrics are scraped and reported, consuming a large amount of bandwidth and storage.
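For reference, the fragment below is a minimal sketch of such a drop rule; the job name and service discovery settings are placeholders and should match your own scrape configuration.

# Hypothetical scrape job fragment; only metric_relabel_configs is the point of interest
scrape_configs:
  - job_name: nginx-ingress-controller
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # Drop the rarely used, high-cardinality request/response size histograms
      - source_labels: [__name__]
        regex: nginx_ingress_controller_(request_size|bytes_sent)_bucket
        action: drop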

Problem 2: Pull Mode Collects Too Many Inactive Time Series

The situation becomes even worse when the first problem is combined with the Pull mode of the Prometheus Agent. Even if a rarely accessed microservice receives only one request, all time series related to it will appear in the metric list exposed by NGINX Ingress from then on and will be scraped and reported in every collection cycle, wasting even more resources.

The underlying problem is how to avoid reporting a counter metric that has not changed during the observation period. This is hard to solve well with the Pull mode; we introduce a different approach later in this article.

Problem 3: Ingress Paths Cannot Be Scaled or Drilled Down

The URL path is difficult to handle in HTTP traffic metrics. If the URL path of each request is added directly to a metric label for analysis, a severe dimension explosion occurs; if it is not added, fine-grained drill-down analysis is impossible.

In the metrics exposed by NGINX Ingress, the path label records the request path field of the matching Ingress rule, such as /(.+), /login, and /orders/(.+). This avoids the problem that URL path details cannot be enumerated. However, it does not scale when users want more fine-grained drill-down analysis, for example, separate statistics for a more specific pattern such as /users/(.+)/follower within the broader /users/(.+) rule, and the metric computation logic preset in the NGINX Ingress implementation is not programmable.

Problem 4: Lack of Geographical and Device Information

Generally, the O&M personnel of a website pay close attention to information about the request source. For example:

  • Which provinces and cities are the website users located in? What are the top ten cities with the most users?
• Do users visit the website from mobile devices or PCs? How many mobile devices run iOS, and how many run Android?

None of this information is reflected in the metrics exposed by NGINX Ingress.

Problem 5: The Official Kubernetes Community Grafana Dashboard Has an Unfocused Layout

Although this is not strictly a problem with the metrics exposed by NGINX Ingress, users generally visualize the data with the official Grafana dashboard provided by the Kubernetes community, so it is worth listing here.

Figure: Kubernetes Grafana Dashboard Based on Self-Monitoring Metrics Produced by NGINX Ingress

As mentioned earlier, in ingress traffic monitoring scenarios we generally focus on the RED metrics: Rate, Errors, and Duration. However, looking at the first screen of this dashboard from the perspective of a user analyzing request traffic, its layout and information structure are not well focused:

  1. I don't care about the number of connections to the Ingress Controller; this is a concept at a lower layer than HTTP requests.
  2. I don't care about the controller-level success rate; I pay more attention to the success rate across Ingress/Host, Service, and URI paths.
  3. I don't care how many times the Ingress Controller configuration has been reloaded.
  4. I don't care about the last configuration pull failure of the Ingress Controller.
  5. I don't care when the Ingress certificates expire.

Therefore, providing a focused and easy-to-use dashboard is essential to NGINX Ingress gateway monitoring.

Statistics Based on Access Logs

To sum up, the native self-monitoring metrics of NGINX Ingress have many problems in production practice. Therefore, the NGINX Ingress Gateway Monitor provided by Alibaba Cloud Prometheus Monitoring uses another method: statistics based on access logs.

Like the open-source version of NGINX, NGINX Ingress prints a log line for each request to the standard output of its Ingress Controller Pod, known as the access log:

172.16.0.20 - [172.16.0.20] - - [24/Mar/2023:17:58:26 +0800] "POST /api/cart HTTP/1.1" 500 32 "-" "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0)" 475 0.003 [default-my-otel-demo-frontend-8080] 172.16.0.17:8080 32 0.003 500 8f4dafe7280e421e9f6ca01efeacaf2d my.otel-demo.com []
172.16.0.20 - [172.16.0.20] - - [24/Mar/2023:17:58:26 +0800] "GET /api/products/HQTGWGPNH4 HTTP/1.1" 200 758 "-" "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0)" 334 0.001 [default-my-otel-demo-frontend-8080] 172.16.0.17:8080 758 0.002 200 e90aa6e5ffb7dfc03c0d576eb145fa29 my.otel-demo.com []
172.16.0.20 - [172.16.0.20] - - [24/Mar/2023:17:58:26 +0800] "POST /api/cart HTTP/1.1" 500 32 "-" "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0)" 475 0.003 [default-my-otel-demo-frontend-8080] 172.16.0.17:8080 32 0.002 500 dd7b9f42dbe53e72efe8768b1811525a my.otel-demo.com []
172.16.0.20 - [172.16.0.20] - - [24/Mar/2023:17:58:26 +0800] "GET /api/products/L9ECAV7KIM HTTP/1.1" 200 752 "-" "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0)" 334 0.002 [default-my-otel-demo-frontend-8080] 172.16.0.17:8080 752 0.001 200 883fec15467ed2e243a22345a0df9ed9 my.otel-demo.com []
172.16.0.20 - [172.16.0.20] - - [24/Mar/2023:17:58:26 +0800] "POST /api/cart HTTP/1.1" 500 32 "-" "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0)" 475 0.007 [default-my-otel-demo-frontend-8080] 172.16.0.17:8080 32 0.008 500 08ae27b3de3e112c47572255f3702af0 my.otel-demo.com []
172.16.0.20 - [172.16.0.20] - - [24/Mar/2023:17:58:26 +0800] "POST /api/checkout HTTP/1.1" 200 315 "-" "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0)" 765 0.194 [default-my-otel-demo-frontend-8080] 172.16.0.17:8080 315 0.194 200 4ed16b7f57394004d1d90383ce43a137 my.otel-demo.com []
172.16.0.20 - [172.16.0.20] - - [24/Mar/2023:17:58:26 +0800] "GET /api/products/6E92ZMYYFZ HTTP/1.1" 200 493 "-" "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0)" 334 0.002 [default-my-otel-demo-frontend-8080] 172.16.0.17:8080 493 0.002 200 674e2ae6c941f48a0bcaf0a7c57821c1 my.otel-demo.com []
172.16.0.20 - [172.16.0.20] - - [24/Mar/2023:17:58:26 +0800] "GET /api/products/66VCHSJNUP HTTP/1.1" 200 515 "-" "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0)" 334 0.001 [default-my-otel-demo-frontend-8080] 172.16.0.17:8080 515 0.002 200 245e689b406613eed45937d56c11339e my.otel-demo.com []
172.16.0.20 - [172.16.0.20] - - [24/Mar/2023:17:58:26 +0800] "GET /api/products/0PUK6V6EV0 HTTP/1.1" 200 438 "-" "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0)" 334 0.001 [default-my-otel-demo-frontend-8080] 172.16.0.17:8080 438 0.002 200 b6d2416865d34f601c460a2b382806b7 my.otel-demo.com []
172.16.0.20 - [172.16.0.20] - - [24/Mar/2023:17:58:26 +0800] "POST /api/checkout HTTP/1.1" 200 321 "-" "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0)" 772 0.214 [default-my-otel-demo-frontend-8080] 172.16.0.17:8080 321 0.214 200 63d8d6405b0d9a0ee65d6c1a13342f10 my.otel-demo.com []

By default, the access log printed by ACK Nginx Ingress contains the following information:

  • Request time
  • IP address of the request source
  • Request method (such as GET)
  • Request path (such as /api/cart)
  • Returned status code
  • Request body length
  • Response body length
  • Request duration
  • Upstream service name of the request (such as default-my-otel-demo-frontend-8080)
  • Request User-Agent (such as Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0))
  • The Host carried in the request header (such as my.otel-demo.com), which helps determine the Ingress routing rule that the traffic came in through

Based on this information, you only need to deploy a collector in the Kubernetes environment and pre-aggregate the data to obtain RED metric statistics for ingress traffic. At the same time, because the technical approach is fully under our control, the major problems of Exporter-based monitoring can be avoided:

  1. For problem 1, a lean set of metrics is designed and unnecessary ones are cut, which meets the needs of most statistical analysis scenarios.
  2. For problem 2, the counter model is abandoned: Gauge metrics are calculated over rolling windows, data in different windows is independent, and results are pushed with RemoteWrite, which avoids repeatedly reporting piles of historical time series.
  3. For problem 3, the pre-aggregation logic can be extended through CR configuration, and drill-down is realized by adding new matching rules.
  4. For problem 4, the pre-aggregation process enriches the data through GeoIP lookup, User-Agent parsing, and other means.
  5. For problem 5, a new ingress observability dashboard is designed with an optimized layout and information structure to improve its value and usability.

Metric Model of NGINX Ingress Gateway Monitor

General Requests Metrics (ingress_requests)

Metric name: ingress_requests

Metric type: Gauge

Aggregation period: 30s

Metric description: It is the number of requests counted in the dimension corresponding to the label within an aggregation period.

Metric label:

Label Description Value Example
ingress_cluster The deployment name of NGINX Ingress controller nginx-ingress-controller
ingress_cluster_instance The pod name of NGINX Ingress controller nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace The namespace where the NGINX Ingress controller is located kube-system
host The Host name carried in the request header. It can identify which Ingress routing rule the traffic came from. If it is a non-compliant request, the value is "_". my.otel-demo.com
service The name of the backend service to which the request is forwarded. If it is a non-compliant request, the value is null. default-my-otel-demo-frontend-8080
uri URL path after convergence /(.+)
method Request Method GET
status_code The status code returned 200
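
Because ingress_requests is a Gauge whose value is the request count observed within each 30-second window, aggregation over time uses sum_over_time rather than rate, the same pattern the latency expression later in this article uses. The following recording rules are a minimal sketch of per-service request volume and success ratio; the rule names are hypothetical.

# Hypothetical recording rules built on the ingress_requests gauge model
groups:
  - name: ingress-requests-red
    rules:
      # Requests per minute for each backend service
      - record: service:ingress_requests:sum1m
        expr: sum by (service) (sum_over_time(ingress_requests[1m]))
      # Success ratio: 1XX/2XX/3XX responses divided by all responses
      - record: service:ingress_requests:success_ratio1m
        expr: |
          sum by (service) (sum_over_time(ingress_requests{status_code=~"[123].."}[1m]))
            /
          sum by (service) (sum_over_time(ingress_requests[1m]))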

Geography-Based Request Volume Metrics (ingress_geoip_requests)

Metric name: ingress_geoip_requests

Metric type: Gauge

Aggregation period: 30s

Metric description: It is the number of requests counted in the dimension corresponding to the label within an aggregation period. The label is enriched with geographic information.

Metric label:

Label Description Value Example
ingress_cluster The deployment name of NGINX Ingress controller nginx-ingress-controller
ingress_cluster_instance The pod name of NGINX Ingress controller nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace The namespace where the NGINX Ingress controller is located kube-system
host The Host name carried in the request header. It can identify which Ingress routing rule the traffic came from. If it is a non-compliant request, the value is "_". my.otel-demo.com
service The name of the backend service to which the request is forwarded. If it is a non-compliant request, the value is null. default-my-otel-demo-frontend-8080
country_code Country code of the request source IP address CN
country_name Country name of the request source IP address China
region_name Region name of the request source IP address Zhejiang
city_name City name of the request source IP address Hangzhou
timezone Time zone of the request source IP address Asia/Shanghai

Note: Several dimensions (such as URI, method, and status code) are deliberately omitted from the labels of this metric. In its common scenarios, service-level granularity of the request path is sufficient; finer granularity would require more storage while providing little additional value.

Device-Based Request Volume Metrics (ingress_user_agent_requests)

Metric name: ingress_user_agent_requests

Metric type: Gauge

Aggregation period: 30s

Metric description: It is the number of requests counted in the dimension corresponding to the label in an aggregation period. The label is enriched with device information.

Metric label:

Label Description Value Example
ingress_cluster The deployment name of NGINX Ingress controller nginx-ingress-controller
ingress_cluster_instance The pod name of NGINX Ingress controller nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace The namespace where the NGINX Ingress controller is located kube-system
host The Host name carried in the request header. It can identify which Ingress routing rule the traffic came from. If it is a non-compliant request, the value is "_". my.otel-demo.com
service The name of the backend service to which the request is forwarded. If it is a non-compliant request, the value is null. default-my-otel-demo-frontend-8080
browser_family The browser type of the request source. If the browser type cannot be correctly identified, the value is "". Chrome
device_category The device type of the request source. If the device type cannot be correctly identified, the value is "". mobile
os_family The operating system type of the request source. If the operating system type cannot be correctly identified, the value is "". iPhone

Note: Several dimensions (such as URI, method, and status code) are deliberately omitted from the labels of this metric. In its common scenarios, service-level granularity of the request path is sufficient; finer granularity would require more storage while providing little additional value.

Request Latency Bucket Metric (ingress_request_time)

Metric name: ingress_request_time

Metric type: GaugeHistogram

Aggregation period: 30s

Metric description: The bucket value of the request latency that is counted in the dimension corresponding to the label within an aggregation period

Metric label:

Label Description Value Example
ingress_cluster The deployment name of NGINX Ingress controller nginx-ingress-controller
ingress_cluster_instance The pod name of NGINX Ingress controller nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace The namespace where the NGINX Ingress controller is located kube-system
host The Host name carried in the request header. It can identify which Ingress routing rule the traffic came from. If it is a non-compliant request, the value is "_". my.otel-demo.com
service The name of the backend service to which the request is forwarded. If it is a non-compliant request, the value is null. default-my-otel-demo-frontend-8080
uri URL path after convergence /(.+)
method Request Method GET
status_code The status code returned 200

Note: This metric is not of the common Histogram type, where each bucket value is a monotonically increasing counter. It is of the GaugeHistogram type, where each bucket value is the instantaneous count observed within the current aggregation period. Therefore, to compute quantiles for this metric, refer to the following expression:

histogram_quantile(0.95, sum(sum_over_time(ingress_request_time_bucket{...}[1m])) by (le))

Ingress Traffic Metrics (ingress_request_size)

Metric name: ingress_request_size

Metric type: Gauge

Aggregation period: 30s

Metric description: The total number of bytes in the request message that are counted in the dimension corresponding to the label within an aggregation period.

Metric label:

Label Description Value Example
ingress_cluster The deployment name of NGINX Ingress controller nginx-ingress-controller
ingress_cluster_instance The pod name of NGINX Ingress controller nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace The namespace where the NGINX Ingress controller is located kube-system
host The Host name carried in the request header. It can identify which Ingress routing rule the traffic came from. If it is a non-compliant request, the value is "_". my.otel-demo.com
service The name of the backend service to which the request is forwarded. If it is a non-compliant request, the value is null. default-my-otel-demo-frontend-8080

Note: Several dimensions (such as URI, method, and status code) are deliberately omitted from the labels of this metric. In its common scenarios, service-level granularity of the request path is sufficient; finer granularity would require more storage while providing little additional value.

Egress Traffic Metrics (ingress_response_size)

Metric name: ingress_response_size

Metric type: Gauge

Aggregation period: 30s

Metric description: The total number of bytes in the response message that are counted in the dimension corresponding to the label within an aggregation period. This metric is limited by the implementation of NGINX Ingress. Only the number of bytes in the response body can be counted, and the size of the response header cannot be counted.

Metric label:

Label Description Value Example
ingress_cluster The deployment name of NGINX Ingress controller nginx-ingress-controller
ingress_cluster_instance The pod name of NGINX Ingress controller nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace The namespace where the NGINX Ingress controller is located kube-system
host The Host name carried in the request header. It can identify which Ingress routing rule the traffic came from. If it is a non-compliant request, the value is "_". my.otel-demo.com
service The name of the backend service to which the request is forwarded. If it is a non-compliant request, the value is null. default-my-otel-demo-frontend-8080

Note: Several dimensions (such as URI, method, and status code) are deliberately omitted from the labels of this metric. In its common scenarios, service-level granularity of the request path is sufficient; finer granularity would require more storage while providing little additional value.

Nginx Ingress Gateway Monitor Access

Method 1: Monitor the Default NGINX Ingress Gateway of an ACK Cluster

If you select the option to install NGINX Ingress when you create an ACK cluster, a default Ingress Controller Pod is created in the kube-system namespace of the cluster to proxy gateway traffic. You can use the following method to monitor this default NGINX Ingress gateway:

Step 1: Enter the Prometheus Monitoring Integration Center of Alibaba Cloud

Log on to the Alibaba Cloud Prometheus Service page. Find the Prometheus instance that corresponds to your ACK cluster. On the Integration page, find Nginx Ingress Gateway Monitor:

Figure: Select Nginx Ingress Gateway Monitor

Step 2: Fill in the Installation Parameters

Figure: Installation Parameters

  • If you have not changed the default NGINX Ingress configuration after the ACK cluster was created (such as the namespace or the IngressClass name), click OK to submit the installation.
  • If you are accessing a self-managed or an additional NGINX Ingress gateway, see Method 2: Monitor Self-Managed or Multiple NGINX Ingress Gateways below.

Note: When you enable monitoring, a collector workload (a DaemonSet) is deployed in your Kubernetes cluster. The default resource limit is 0.5 CPU cores and 512 MB of memory, and you can adjust it based on the actual traffic volume of the gateway by running kubectl edit daemonset -n arms-prom arms-vector. A sketch of the kind of change is shown below.
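As a sketch of the change this command allows, you would raise the limits of the collector container in the DaemonSet pod template. The request values below are illustrative assumptions; only the limits correspond to the defaults mentioned above.

# Fragment of the arms-vector DaemonSet pod template; values other than the
# default limits (0.5 core / 512Mi) are illustrative
resources:
  limits:
    cpu: "1"        # raised from the default 500m
    memory: 1Gi     # raised from the default 512Mi
  requests:
    cpu: 200m       # illustrative
    memory: 256Mi   # illustrative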

Step 3: View the Dashboard of the NGINX Ingress Gateway Monitor

Open the side panel of the NGINX Ingress Gateway Monitor integration card, find the dashboard named Universal Ingress Observability Dashboard on the Dashboards tab, and click it to jump to Grafana and view the data.

Figure: The Dashboards Tab

If you have completed the installation in Step 2 and the NGINX Ingress gateway is receiving real traffic, the collected and reported metric data appears in the dashboard within 2 to 3 minutes.

Method 2: Monitor Self-Managed or Multiple NGINX Ingress Gateways

If you use a self-managed NGINX Ingress gateway or deploy multiple NGINX Ingress gateways in the Kubernetes cluster by referring to the ACK official document Deploy Multiple Ingress Controllers [2], you can refer to this section for monitoring access.

The rest of the access process is unchanged. On the installation page of NGINX Ingress Gateway Monitor, adjust the parameters based on the actual situation.

Figure: Custom Installation Parameters

The five parameters that require attention are described below:

  • Config Name: A unique ID for the current collection configuration (such as otel-demo-nginx-ingress)
  • Ingress Controller Label Selector Key: The collector finds the specified Ingress Controller Pods through a label selector; the key name of that label is provided here (such as app).
  • Ingress Controller Label Selector Value: The value of that label is provided here (such as otel-demo-nginx). Combined with the key above, it forms the selector expression app=otel-demo-nginx (see the manifest fragment after this list).
  • Ingress Controller Namespace: The namespace where the Ingress Controller is located (such as otel-demo)
  • Ingress Class Name: The Ingress class name that the Ingress Controller listens to (such as otel-demo-nginx-class)
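
To make the mapping concrete, the fragment below shows where these parameters point in an Ingress Controller workload. All names mirror the examples above and are hypothetical; unrelated fields are omitted.

# Hypothetical Ingress Controller Deployment fragment; unrelated fields omitted
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-demo-nginx-ingress-controller
  namespace: otel-demo            # "Ingress Controller Namespace"
spec:
  selector:
    matchLabels:
      app: otel-demo-nginx
  template:
    metadata:
      labels:
        app: otel-demo-nginx      # selector key "app" with value "otel-demo-nginx",
                                  # which the collector uses to find the controller Pods
    # ... container spec omitted; the controller watches the IngressClass
    # named in "Ingress Class Name" (otel-demo-nginx-class)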

Note: Monitoring multiple NGINX Ingress gateways reuses the same collector workload. The default resource limit is 0.5 CPU cores and 512 MB of memory, and you can adjust it based on the actual traffic volume of the gateways by running kubectl edit daemonset -n arms-prom arms-vector.

Visualization Dashboard of NGINX Ingress Gateway Monitor

The entire visualization dashboard of NGINX Ingress gateway monitor is divided into six parts:

  1. Overview: Displays important metric information on the first screen
  2. Service Statistics-TopN: Displays information (such as the PVs, duration, and success rate of Host, services, and URIs from the TopN perspective)
  3. Service Statistics-Trend Distribution: Displays trends (such as PVs, egress and ingress traffic, request success rate, and latency) and the distribution of status code, request methods, and the number of Ingress Pod requests
  4.  Service Statistics-Request Analysis: Displays the PVs, success rate, 4XX ratio, 5XX ratio, and latency across the Host, Service, and URI request paths in table form
  5. Geographic Statistics: Displays requests based on geographic information from the perspective of proportion and table details.
  6. Device Statistics: Displays the requests based on device information from the perspective of proportion and table details.

1. Overview

The overview section displays all the elements defined by the RED metrics (Rate, Errors, and Duration) through a dashboard design organized around traffic and service quality/experience.

① Traffic Dashboard

Figure: PV and Traffic

The traffic panels sit at the top of the NGINX Ingress gateway monitoring dashboard and display the most important traffic-related data:

1.  Minute-level PVs

2.  Hour-level PVs

  • YoY (a day earlier)
  • MoM (an hour earlier)

3.  Day-level PVs

  • YoY (a week earlier)
  • MoM (a day earlier)

4.  Week-level PVs

  • YoY (four weeks earlier)
  • MoM (a week earlier)

At the same time, thanks to the powerful visualization capability of Grafana, we can use different colors to distinguish whether metrics need attention. We can see that this practice has been applied in more than one scenario below:

  • Metric values are shown in red when YoY or MoM increases
  • Metric values are shown in green when YoY or MoM decreases

② Service Quality/Experience Dashboard

Figure: Success Rate, Errors, and Duration

The Overview section also displays important metrics (such as success rate, errors, and duration). Here, a successful request is defined as a request with a response code of 1XX, 2XX, or 3XX. If the response code is 4XX or 5XX, the request is a failed or error request.

We have selected a set of error response codes that require special attention:

  • 404: When this value rises abnormally, check whether an application configuration error is preventing pages (for example, those crawled by search engines) from loading correctly.
  • 429: When this value rises abnormally, check whether some client is accessing the backend service more frequently than normal and being throttled.
  • 499: When this value rises abnormally, check whether the backend service is taking so long to respond that clients close the connection early.
  • 500: When this value rises abnormally, check whether a backend service is returning internal errors caused by incorrect business logic.
  • 503: When this value rises abnormally, check whether any backend service is unavailable, for example because of an ongoing upgrade.
  • 504: When this value rises abnormally, check whether a backend service response is exceeding the NGINX Ingress timeout.

The same color-coding practice is applied here, with the following strategy:

1.  Request Success Rate:

  • Green when the request success rate is greater than 90%
  • Yellow when the request success rate is greater than 50% but less than 90%
  • Red when the request success rate is less than 50%

2.  5XX Ratio:

  • Red when the 5XX ratio is greater than 50%
  • Yellow when the 5XX ratio is less than 50% but greater than 10%
  • Green when the 5XX ratio is less than 10%

3.  Number of errors: Yellow when the number is greater than 0

4.  Duration metrics (an example alert rule follows this list):

  • Green when a duration is less than 200ms
  • Yellow when a duration is less than 500ms but greater than 200ms
  • Red when a duration is more than 500ms
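
As an example of turning the duration thresholds above into an alert, the rule below applies the quantile expression from the metric model section to the ingress_request_time buckets and fires when the p95 latency exceeds 0.5 seconds (assuming the buckets are recorded in seconds, as the access-log durations suggest). The rule name and grouping are hypothetical.

# Hypothetical latency alert built on the GaugeHistogram quantile expression above
groups:
  - name: ingress-latency
    rules:
      - alert: IngressP95LatencyHigh
        # p95 over the GaugeHistogram buckets, per backend service
        expr: |
          histogram_quantile(0.95,
            sum by (service, le) (sum_over_time(ingress_request_time_bucket[1m]))
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 request latency above 500ms for {{ $labels.service }}"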

In addition, note that the duration metrics of successful and failed requests differ significantly. Therefore, we recommend analyzing them separately by selecting a normal or an error response code in the drop-down filter at the top.

2. Service Statistics - TopN

The service statistics - TopN section displays the Hosts, Services, and URIs with the top 10 PVs, the top 10 request durations, and the top 10 5XX ratios.

① PV Access

Figure: PV Access Ranking

Here, you can use the drop-down filter on the top to specify the response status code to distinguish the ranking of normal request access and error request access.

② Request Duration

Figure: Request Duration Ranking

The color change strategy here is:

  • Green when a duration is less than 200ms
  • Yellow when a duration is less than 500ms but greater than 200ms
  • Red when a duration is more than 500ms

In addition, note that the duration metrics of successful and failed requests differ significantly. Therefore, we recommend analyzing them separately by selecting a normal or an error response code in the drop-down filter at the top.

③ 5XX Ratio

Figure: 5XX Ratio Ranking

The color change strategy here is:

  • Red when the 5XX ratio is greater than 50%
  • Yellow when the 5XX ratio is less than 50% but greater than 10%
  • Green when the 5XX ratio is less than 10%

3. Service Statistics-Trend Distribution

The service statistics-trend distribution section displays the trends of each RED metric in the Host dimension and the Service dimension, as well as the distribution of requests in terms of response status code, request methods, and Ingress Controller Pod.

① RED Metrics in the Host Dimension

Figure: RED Metrics in the Host Dimension

This section of the dashboard shows the RED metric elements of each Host:

  • Minute-level PV changes of each Host
  • Minute-level request success rate changes of each Host
  • Minute-level egress and ingress traffic changes of each Host
  • Minute-level duration changes for each Host

The PV and duration trends respond to the response status code selected in the drop-down filter at the top, so the PV and duration of normal requests can be distinguished from those of error requests.

② RED Metrics in the Service Dimension

Figure: RED Metrics in the Service Dimension

This section of the dashboard shows the RED metrics elements of each Service:

  • Minute-level PV changes of each Service
  • Minute-level request success rate changes of each Service
  • Minute-level egress and ingress traffic changes of each Service
  • Minute-level duration changes of each Service

The PV and duration trends respond to the response status code selected in the drop-down filter at the top, so the PV and duration of normal requests can be distinguished from those of error requests.

③ Request Distribution

Figure: Distribution of Response Status Code, Request Method, and Ingress Controller Pod

This section of the dashboard uses a pie chart to show the request traffic distribution in each dimension:

  • The proportion and specific value of the request traffic distribution in each response status code
  • The proportion and specific value of the request traffic distribution in each request method
  • The proportion and specific value of the request traffic distribution in each Ingress Controller Pod

Their statistical range is the current time period selected at the top.

4. Service Statistics-Request Analysis

Figure: Request Analysis Table

The last part of the service statistics presents the PVs, success rate, 4XX ratio, 5XX ratio, and latency along the Host, Service, and URI request paths in detailed table form. The statistical range is the time period selected at the top. If you want to drill down into more fine-grained URI statistics and extend the URI convergence rules, see Edit CR to Extend URI Convergence Rules in the advanced guide section.

5. Geographic Statistics

Figure: Statistics Based on Geographic Information

The geographic statistics section provides the proportion of each dimension and the corresponding table:

1.  Province Visited

  • The proportion of each province or region visited. The statistical range is the current time period selected at the top.
  • The table details of the province or region visited. The statistical range is the current time period selected at the top.

2.  City Visited

  • The proportion of the city visited. The statistical range is the current time period selected at the top.
  • The table details of the city visited. The statistical range is the current time period selected at the top.

3.  Time Zone Visited

  • The proportion of each time zone visited. The statistical range is the current time period selected at the top.
  • The table details of the time zone visited. The statistical range is the current time period selected at the top.

6. Device Statistics

Figure: Device Statistics

The device statistics section provides the proportion of each dimension and the corresponding table:

1.  Device Type

  • The proportion and specific value of each device type. The statistical range is the current time period selected at the top.
  • The table details of the device type. The statistical range is the current time period selected at the top.

2.  Operating System

  • The proportion and specific value of each operating system. The statistical range is the current time period selected at the top.
  • The table details of the operating system. The statistical range is the current time period selected at the top.

3.  Browser

  • The proportion and specific value of each browser. The statistical range is the current time period selected at the top.
  • The table details of the browser. The statistical range is the current time period selected at the top.

The Overall Effect Picture


Advanced Guide for NGINX Ingress Gateway Monitor

Edit CR to Extend URI Convergence Rules

Detailed data such as the request paths in access logs cannot be enumerated. If you add it directly to the labels of the ingress request metrics, the dimensions explode, storage costs increase sharply, and metric queries slow down. Therefore, the collector behind NGINX Ingress gateway monitoring converges request paths based on a set of URI convergence rules. Each rule consists of two parts:

  • Match Expression: a regular expression that, when it matches the current URI, causes the URI to be converged (such as ^/api/product/(.+)$)
  • Converged Text: Converge the URI into another readable string (such as ProductItem)

When the collector is enabled for the first time, it scans the Ingress resources of the current Kubernetes cluster and assembles convergence rules from the Path information in the existing routing rules. If this default configuration does not meet your analysis and statistical needs, follow the steps below to extend it.

First, run kubectl edit ingresslog -n arms-prom ingresslog-<Your Collection Configuration Name> to open the editing window of the custom resource (for example, kubectl edit ingresslog -n arms-prom ingresslog-default-ingress-nginx).

Please find the spec.logParser.reduceUri.allowList field and expand it. For example, it may have only two convergence rules by default:

    reduceUri:
      allowList:
        - pattern: ^/(.+)$
          reduced: /(.+)
        - pattern: ^/$
          reduced: /

The allowList field is an array; each element represents one convergence rule. In each rule, the pattern field is the match expression and the reduced field is the converged text.

You can use the following examples to change the fields based on your business scenario:


    reduceUri:
      allowList:
        - pattern: ^/api/cart$
          reduced: /api/cart
        - pattern: ^/api/checkout$
          reduced: /api/checkout
        - pattern: ^/api/data$
          reduced: /api/data
        - pattern: ^/api/data/\?contextKeys=(.+)$
          reduced: /api/data/?contextKeys=(.+)
        - pattern: ^/api/products/(.+)$
          reduced: /api/products/(.+)
        - pattern: ^/api/recommendations/\?productIds=(.+)$
          reduced: /api/recommendations/?productIds=(.+)
        - pattern: ^/(.+)$
          reduced: /(.+)
        - pattern: ^/$
          reduced: /

Note that the rules are matched in order, so put the shortest, most general patterns, such as ^/$, at the end of the list. Wait 2 to 3 minutes after saving the configuration; the refined metric data produced by the extended URI convergence rules then appears on the dashboard.

Extending the URI convergence rules makes your time series more fine-grained, which increases the number of generated metrics and may affect billing. Please monitor the change in the number of metrics promptly.

Note 1: We recommend backing up URI convergence rules locally in a timely manner because the corresponding IngressLog custom resources will be deleted by default after the current NGINX Ingress gateway monitor is uninstalled.

Note 2: Do not modify other configurations in the IngressLog custom resource. Otherwise, NGINX Ingress gateway monitor cannot work properly.

Related Links

[1] Install and configure Workload Performance Monitoring
https://www.alibabacloud.com/help/en/application-real-time-monitoring-service/latest/workload

[2] Deploy multiple Ingress controllers in a cluster
https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/deploy-multiple-ingress-controllers-in-a-cluster
