Use Managed Service for Prometheus to monitor an NGINX Ingress gateway - Managed Service for Prometheus

This topic describes how to use Managed Service for Prometheus to monitor an NGINX Ingress gateway.

Entry points

Entry point 1: Integration center of a Prometheus instance

Log on to the Managed Service for Prometheus console.
In the left-side navigation pane, click Monitoring List. The Prometheus Service page appears.
Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.

Entry point 2: Integration Center in the left-side navigation pane of the ARMS console

Log on to the Application Real-Time Monitoring Service (ARMS) console.
In the left-side navigation pane, click Integration Center. In the Application Components section, find the Nginx Ingress Gateway Monitor component and click Add. In the panel that appears, configure the component as prompted.

Method 1: Monitor the default NGINX Ingress gateway of an ACK cluster

This section describes how to configure the Nginx Ingress Gateway Monitor component in the integration center of a Prometheus instance.

Configure the Nginx Ingress Gateway Monitor component.
- If you install the Nginx Ingress Gateway Monitor component for the first time, perform the following operation:
  In the Not Installed section of the Integration Center page, find the Nginx Ingress Gateway Monitor component and click Install.
  Note
  You can click the component to view the common NGINX Ingress metrics and dashboard thumbnails in the panel that appears. The metrics listed are for reference only. For more information, see Monitoring metrics of an NGINX Ingress gateway. After you install the Nginx Ingress Gateway Monitor component, you can view the actual metrics that Managed Service for Prometheus provides.
- If you have installed the Nginx Ingress Gateway Monitor component, you must add the component again.
  In the Installed section of the Integration Center page, find the Nginx Ingress Gateway Monitor component and click Add.

On the Configurations tab in the STEP2 section, configure the parameters and click OK. The following table describes the parameters.

Parameter	Description
Config Name	The name of the current exporter. The name must meet the following requirements: The name can contain only lowercase letters, digits, and hyphens (-) and cannot start or end with a hyphen (-). It must be unique. Note If you do not specify this parameter, the system uses the default name, which consists of the exporter type and a numeric suffix.
Ingress Controller Label Selector Key	The key of the label used to select the Ingress Controller pods. Example: app.
Ingress Controller Label Selector Value	The value of the label used to select the Ingress Controller pods. Example: otel-demo-nginx. The key and the value are combined into a label selector expression. Example: app=otel-demo-nginx.
Ingress Controller Namespace	The namespace where the Ingress Controller resides. Example: otel-demo.
Ingress Class Name	The ID of the Ingress class that the Ingress Controller listens to. Example: otel-demo-nginx-class.
Log Parse Regex	Enter the log parsing rule.
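
The naming rule for Config Name and the selector expression built from the key and value can be checked locally before you fill in the form. The following Python sketch is only an illustration: the function names `is_valid_config_name` and `selector_expression` are hypothetical and do not belong to any Alibaba Cloud SDK, and the console may enforce additional rules (for example, a length limit) that are not listed in the table.

```python
import re

# Lowercase letters, digits, and hyphens; must not start or end with a hyphen.
_CONFIG_NAME = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def is_valid_config_name(name: str) -> bool:
    """Return True if the name satisfies the rules listed for Config Name."""
    return _CONFIG_NAME.fullmatch(name) is not None


def selector_expression(key: str, value: str) -> str:
    """Combine a label key and value into an equality selector expression."""
    return f"{key}={value}"


if __name__ == "__main__":
    print(is_valid_config_name("default-ingress-nginx"))
    print(selector_expression("app", "otel-demo-nginx"))
```

Uniqueness of the name cannot be checked locally, because it depends on the exporters that already exist in the Prometheus instance.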

Note

You can view the monitoring metrics on the Metrics tab in the STEP2 section.

You can also click the Nginx Ingress Gateway Monitor component in the Installed section of the Integration Center page. In the panel that appears, you can view information such as targets, metrics, dashboards, service discovery configurations, and exporters. For more information, see Integration center.

After you configure the Nginx Ingress Gateway Monitor component, a DaemonSet is deployed in your ACK cluster. The default resource limit is 0.5 CPU cores and 512 MB of memory. You can run the kubectl edit daemonset -narms-prom arms-vector command to adjust the resource limit based on the actual traffic volume of the gateway.

Method 2: Monitor a self-built NGINX Ingress gateway or multiple NGINX Ingress gateways

If you use a self-built NGINX Ingress gateway or deploy multiple Ingress Controllers in an ACK cluster as required by the ACK official documentation, perform the following steps to configure the Nginx Ingress Gateway Monitor component:

Configure the Nginx Ingress Gateway Monitor component.
- If you install the Nginx Ingress Gateway Monitor component for the first time, perform the following operation:
  In the Not Installed section of the Integration Center page, find the Nginx Ingress Gateway Monitor component and click Install.
  Note
  You can click the component to view the common NGINX Ingress metrics and dashboard thumbnails in the panel that appears. The metrics listed are for reference only. For more information, see Monitoring metrics of an NGINX Ingress gateway. After you install the Nginx Ingress Gateway Monitor component, you can view the actual metrics that Managed Service for Prometheus provides.
- If you have installed the Nginx Ingress Gateway Monitor component, you must add the component again.
  In the Installed section of the Integration Center page, find the Nginx Ingress Gateway Monitor component and click Add.

On the Configurations tab in the STEP2 section, configure the parameters and click OK. The following table describes the parameters.

Parameter	Description
Config Name	The name of the current exporter. The name must meet the following requirements: The name can contain only lowercase letters, digits, and hyphens (-) and cannot start or end with a hyphen (-). It must be unique. Note If you do not specify this parameter, the system uses the default name, which consists of the exporter type and a numeric suffix.
Ingress Controller Label Selector Key	The key of the label used to select the Ingress Controller pods. Example: app.
Ingress Controller Label Selector Value	The value of the label used to select the Ingress Controller pods. Example: otel-demo-nginx. The key and the value are combined into a label selector expression. Example: app=otel-demo-nginx.
Ingress Controller Namespace	The namespace where the Ingress Controller resides. Example: otel-demo.
Ingress Class Name	The ID of the Ingress class that the Ingress Controller listens to. Example: otel-demo-nginx-class.
Log Parse Regex	Enter the log parsing rule.

Note

You can view the monitoring metrics on the Metrics tab in the STEP2 section.

If you monitor multiple NGINX Ingress gateways, the same collector workload is reused. The default resource limit for each NGINX Ingress gateway is 0.5 CPU cores and 512 MB of memory. Monitor the actual traffic volume of the gateways and adjust the resource limit accordingly. You can run the kubectl edit daemonset -narms-prom arms-vector command to adjust the resource limit.

View the monitoring dashboard of an NGINX Ingress gateway

On the Integration Center page, click the Nginx Ingress Gateway Monitor component in the Installed section. In the panel that appears, click the Dashboards tab to view the thumbnails and hyperlinks of Ingress gateway dashboards. Click a hyperlink to go to the Grafana page and view the dashboard.

The dashboard of an NGINX Ingress gateway consists of six sections.

Overview

The Overview section visualizes the service traffic and quality by providing the key metrics defined by the RED Method: rate, errors, and duration.

Traffic metrics
Note
Different colors are used to distinguish the metrics and help you view the relevant data.
- A metric in red indicates that the value of the metric in the current time period is greater than the value of the metric in the same period of last year or the value of the metric in the previous period.
- A metric in green indicates that the value of the metric in the current time period is less than the value of the metric in the same period of last year or the value of the metric in the previous period.
Traffic-related data
- PVs in one minute
- PVs in one hour
  - Ratio of PVs in the current minute to PVs in the same minute of the previous day
  - Ratio of PVs in the current minute to PVs in the previous minute
- PVs in one day
  - Ratio of PVs on the current day to PVs on the same day of the previous week
  - Ratio of PVs on the current day to PVs on the previous day
- PVs in one week
  - Ratio of PVs in the current week to PVs four weeks or one month ago
  - Ratio of PVs in the current week to PVs one week ago
Service quality metrics
Service quality metrics include the success rate of requests, the number of error requests, and latency. Successful requests return the following status codes: 1XX, 2XX, and 3XX. Failed requests and error requests return the following status codes: 4XX and 5XX. The following 4XX and 5XX status codes are common.
- 404: If the value of a metric is excessively high, you must check whether the pages captured by the search engine cannot be loaded as expected due to application configuration errors.
- 429: If the value of a metric is excessively high, you must check whether a client accesses the backend services more frequently than expected and causes throttling.
- 499: If the value of a metric is excessively high, you must check whether the client closes the connection earlier than expected because the backend services take too long to respond.
- 500: If the value of a metric is excessively high, you must check whether an internal error occurred due to the invalid implementation of the business logic in the backend services.
- 503: If the value of a metric is excessively high, you must check whether backend services are unavailable due to reasons such as upgrade.
- 504: If the value of a metric is excessively high, you must check whether the response from a backend service exceeds the tolerance range of the NGINX Ingress gateway and times out.
Different colors have different meanings.
- Success rate of requests
  - If the value is greater than 90%, it is displayed in green.
  - If the value is greater than 50% and less than 90%, it is displayed in yellow.
  - If the value is less than 50%, it is displayed in red.
- Percentage of 5XX status codes
  - If the value is greater than 50%, it is displayed in red.
  - If the value is greater than 10% and less than 50%, it is displayed in yellow.
  - If the value is less than 10%, it is displayed in green.
- Number of each type of error request: If the value is greater than 0, it is displayed in yellow.
- Latency metrics:
  - If the value is less than 200 milliseconds, it is displayed in green.
  - If the value is greater than 200 milliseconds and less than 500 milliseconds, it is displayed in yellow.
  - If the value is greater than 500 milliseconds, it is displayed in red.
Note
The latency metrics of normal requests and error requests are different. We recommend that you use the Status Code drop-down list to query latency metrics by status code.
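
The success rate definition and the color thresholds described above can be expressed as a few comparisons. The following Python sketch is an illustration, not the dashboard's actual implementation. The function names are hypothetical, and the handling of values that fall exactly on a threshold (90%, 50%, 200 ms, 500 ms) is an assumption, because the text does not specify it.

```python
def success_rate(status_counts: dict) -> float:
    """Percentage of requests that returned a 1XX, 2XX, or 3XX status code."""
    total = sum(status_counts.values())
    if total == 0:
        raise ValueError("no requests in the selected time range")
    ok = sum(n for code, n in status_counts.items() if 100 <= code < 400)
    return 100.0 * ok / total


def success_rate_color(rate_percent: float) -> str:
    """Map a success rate to a color. Boundary handling is an assumption."""
    if rate_percent > 90:
        return "green"
    if rate_percent > 50:
        return "yellow"
    return "red"


def latency_color(latency_ms: float) -> str:
    """Map a latency value to a color. Boundary handling is an assumption."""
    if latency_ms < 200:
        return "green"
    if latency_ms < 500:
        return "yellow"
    return "red"


if __name__ == "__main__":
    # Hypothetical request counts per status code.
    counts = {200: 90, 304: 5, 404: 3, 500: 2}
    rate = success_rate(counts)
    print(rate, success_rate_color(rate), latency_color(320))
```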

Service Statistics-TopN

This section displays the top 10 hosts or domain names, services, and URIs by PVs, request duration, and percentage of 5XX status codes.

PVs
Note
You can use the Status Code drop-down list to query PVs by status code.
Request duration
- If the value is less than 200 milliseconds, it is displayed in green.
- If the value is greater than 200 milliseconds and less than 500 milliseconds, it is displayed in yellow.
- If the value is greater than 500 milliseconds, it is displayed in red.
Note
The request duration metrics of normal requests and error requests are different. We recommend that you use the Status Code drop-down list to query request duration metrics by status code.
Percentage of 5XX status codes
- If the value is greater than 50%, it is displayed in red.
- If the value is greater than 10% and less than 50%, it is displayed in yellow.
- If the value is less than 10%, it is displayed in green.

Service Statistics-Trend Distribution

This section shows the trends in each RED metric of various services and hosts or domain names, and the distribution of requests in different status codes, request methods, and Ingress Controller pods.

RED metrics of hosts or domain names
- PV changes of each host or domain name per minute
- Success rate changes of requests of each host or domain name per minute
- Inbound and outbound traffic changes of each host or domain name per minute
- Latency changes of each host or domain name per minute
You can use the Status Code drop-down list to query the PV trends and latency trends of normal requests and error requests by status code.
RED metrics of services
- PV changes of each service in one minute
- Success rate changes of requests of each service in one minute
- Inbound and outbound traffic changes of each service in one minute
- Latency changes of each service in one minute
You can use the Status Code drop-down list to query the PV trends and latency trends of normal requests and error requests by status code.
Distribution of requests
- Number of requests distributed in each status code and the percentage
- Number of requests distributed in each request method and the percentage
- Number of requests distributed in each Ingress Controller pod and the percentage
Note
The time range to query is selected at the top of the page.

Service Statistics-Request Analysis

This section displays the PVs, success rates, percentages of 4XX status codes, percentages of 5XX status codes, and latency of various URIs, services, and hosts or domain names in a table. The time range to query is selected at the top of the page. If you want to perform drill-down analysis to view more fine-grained URI request data, you need to extend URI convergence rules. For more information, see Monitoring guide of NGINX Ingress gateway.

Geographical Statistics

Province
- The percentages of provinces or municipalities from where requests are sent are displayed. The time range to query is selected at the top of the page.
- Details about provinces or municipalities from where requests are sent are displayed in a table. The time range to query is selected at the top of the page.
City
- The percentages of cities from where requests are sent are displayed. The time range to query is selected at the top of the page.
- Details about cities from where requests are sent are displayed in a table. The time range to query is selected at the top of the page.
Time zone
- The percentages of time zones from where requests are sent are displayed. The time range to query is selected at the top of the page.
- Details about time zones from where requests are sent are displayed in a table. The time range to query is selected at the top of the page.

Equipment Statistics

Device type
- The percentages of devices from where requests are sent and the PVs are displayed. The time range to query is selected at the top of the page.
- Details about the device types are displayed in a table. The time range to query is selected at the top of the page.
Operating system
- The percentages of operating systems from where requests are sent and the PVs are displayed. The time range to query is selected at the top of the page.
- Details about the operating systems are displayed in a table. The time range to query is selected at the top of the page.
Browser
- The percentages of browsers from where requests are sent and the PVs are displayed. The time range to query is selected at the top of the page.
- Details about the browsers are displayed in a table. The time range to query is selected at the top of the page.

Monitoring metrics of an NGINX Ingress gateway

ingress_requests

Metric name: ingress_requests
Metric type: Gauge
Aggregation period: 30 seconds
Description: the number of requests that are counted in an aggregation period and matched based on the specified tags.

Tags

Tag	Description	Example
ingress_cluster	The deployment of the NGINX Ingress Controller.	nginx-ingress-controller
ingress_cluster_instance	The pod of the NGINX Ingress Controller.	nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace	The namespace where the NGINX Ingress Controller resides.	kube-system
host	The hostname carried in the request header. The hostname can identify the routing rule based on which traffic is routed. If the request is non-compliant, the value is "_".	my.otel-demo.com
service	The backend service to which the request is forwarded. If the request is non-compliant, the value is empty.	default-my-otel-demo-frontend-8080
uri	The path of the converged URL.	/(.+)
method	The request method.	GET
status_code	The status code.	200

ingress_geoip_requests

Metric name: ingress_geoip_requests
Metric type: Gauge
Aggregation period: 30 seconds
Description: the number of requests that are counted in an aggregation period and matched based on the specified tags. The tags are enriched with geographic information.

Tags

Tag	Description	Example
ingress_cluster	The deployment of the NGINX Ingress Controller.	nginx-ingress-controller
ingress_cluster_instance	The pod of the NGINX Ingress Controller.	nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace	The namespace where the NGINX Ingress Controller resides.	kube-system
host	The hostname carried in the request header. The hostname can identify the routing rule based on which traffic is routed. If the request is non-compliant, the value is "_".	my.otel-demo.com
service	The backend service to which the request is forwarded. If the request is non-compliant, the value is empty.	default-my-otel-demo-frontend-8080
country_code	The code of the country where the IP address resides.	CN
country_name	The name of the country where the IP address resides.	China
region_name	The name of the province where the IP address resides.	Zhejiang
city_name	The name of the city where the IP address resides.	Hangzhou
timezone	The time zone where the IP address resides.	Asia/Shanghai

Note

The preceding table lists the service-level tags, which meet the requirements of common scenarios for this metric. In addition to the service-level tags, the metric has more fine-grained tags, such as URI, method, and status code. These tags require expensive storage and are less cost-effective.

ingress_user_agent_requests

Metric name: ingress_user_agent_requests
Metric type: Gauge
Aggregation period: 30 seconds
Description: the number of requests that are counted in an aggregation period and matched based on the specified tags. The tags are enriched with device information.

Tags

Tag	Description	Example
ingress_cluster	The deployment of the NGINX Ingress Controller.	nginx-ingress-controller
ingress_cluster_instance	The pod of the NGINX Ingress Controller.	nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace	The namespace where the NGINX Ingress Controller resides.	kube-system
host	The hostname carried in the request header. The hostname can identify the routing rule based on which traffic is routed. If the request is non-compliant, the value is "_".	my.otel-demo.com
service	The backend service to which the request is forwarded. If the request is non-compliant, the value is empty.	default-my-otel-demo-frontend-8080
browser_family	The type of the browser from which the request is sent. If the browser cannot be identified as expected, the value is "<null>".	Chrome
device_category	The type of the device from which the request is sent. If the device cannot be identified as expected, the value is "<null>".	mobile
os_family	The type of the operating system from which the request is sent. If the operating system cannot be identified as expected, the value is "<null>".	iPhone

ingress_request_time

Metric name: ingress_request_time
Metric type: GaugeHistogram
Aggregation period: 30 seconds
Description: the bucket value of the request latency that is counted in an aggregation period and matched based on the specified tags.

Tags

Tag	Description	Example
ingress_cluster	The deployment of the NGINX Ingress Controller.	nginx-ingress-controller
ingress_cluster_instance	The pod of the NGINX Ingress Controller.	nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace	The namespace where the NGINX Ingress Controller resides.	kube-system
host	The hostname carried in the request header. The hostname can identify the routing rule based on which traffic is routed. If the request is non-compliant, the value is "_".	my.otel-demo.com
service	The backend service to which the request is forwarded. If the request is non-compliant, the value is empty.	default-my-otel-demo-frontend-8080
uri	The path of the converged URL.	/(.+)
method	The request method.	GET
status_code	The status code.	200

Note

The type of the metric is GaugeHistogram. GaugeHistogram captures an instantaneous value of the current aggregation period as a bucket value whereas Histogram generates bucket values based on the counter model. You can use the following expression to calculate quantiles for the GaugeHistogram metric: histogram_quantile(0.95, sum(sum_over_time(ingress_request_time_bucket{...}[1m])) by (le)).
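
To make the expression above more concrete, the following Python sketch mimics its two steps: it first sums the per-window bucket values (the role of sum_over_time and sum by (le)), and then estimates a quantile by linear interpolation inside the matching bucket. This is a simplified illustration of how histogram_quantile behaves, not the Prometheus implementation. The function names and the sample bucket values are hypothetical.

```python
import math


def sum_buckets_over_time(windows):
    """Add up cumulative bucket counts from several aggregation windows.

    Each window is a dict that maps an upper bound (le) to a cumulative count.
    """
    totals = {}
    for window in windows:
        for le, count in window.items():
            totals[le] = totals.get(le, 0.0) + count
    return totals


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative buckets {le: count}."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)  # math.inf sorts last
    total = buckets[bounds[-1]]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if math.isinf(le):
                # The quantile falls into the +Inf bucket: report the
                # highest finite upper bound instead of interpolating.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count
    return lower_bound


if __name__ == "__main__":
    # Two hypothetical 30-second windows with bounds of 0.2 s, 0.5 s, and +Inf.
    windows = [
        {0.2: 6, 0.5: 9, math.inf: 10},
        {0.2: 4, 0.5: 9, math.inf: 10},
    ]
    summed = sum_buckets_over_time(windows)
    print(histogram_quantile(0.7, summed))
```

Because each window holds an independent snapshot rather than an ever-growing counter, the windows are added together directly. With a Counter-based Histogram, the per-window increase would have to be derived first.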

ingress_request_size

Metric name: ingress_request_size
Metric type: Gauge
Aggregation period: 30 seconds
Description: the total number of bytes of the request that are counted in an aggregation period and matched based on the specified tags.

Tags

Tag	Description	Example
ingress_cluster	The deployment of the NGINX Ingress Controller.	nginx-ingress-controller
ingress_cluster_instance	The pod of the NGINX Ingress Controller.	nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace	The namespace where the NGINX Ingress Controller resides.	kube-system
host	The hostname carried in the request header. The hostname can identify the routing rule based on which traffic is routed. If the request is non-compliant, the value is "_".	my.otel-demo.com
service	The backend service to which the request is forwarded. If the request is non-compliant, the value is empty.	default-my-otel-demo-frontend-8080

ingress_response_size

Metric name: ingress_response_size
Metric type: Gauge
Aggregation period: 30 seconds
Description: the total number of bytes of the response that are counted in an aggregation period and matched based on the specified tags. The number of bytes is limited by the implementation of NGINX Ingress. Only the number of bytes in the response body can be counted. The number of bytes in the response header is not counted.

Tags

Tag	Description	Example
ingress_cluster	The deployment of the NGINX Ingress Controller.	nginx-ingress-controller
ingress_cluster_instance	The pod of the NGINX Ingress Controller.	nginx-ingress-controller-6fdbbc5856-pcxkz
ingress_cluster_namespace	The namespace where the NGINX Ingress Controller resides.	kube-system
host	The hostname carried in the request header. The hostname can identify the routing rule based on which traffic is routed. If the request is non-compliant, the value is "_".	my.otel-demo.com
service	The backend service to which the request is forwarded. If the request is non-compliant, the value is empty.	default-my-otel-demo-frontend-8080

Monitoring guide of NGINX Ingress gateway

Modify custom resources to expand URI convergence rules

Detailed data such as request paths in access logs cannot be enumerated. If you configure the request path as a tag of an Ingress metric, storage cost is increased and metric query may be affected.

Therefore, the collector that collects the metric data of an NGINX Ingress gateway refines the request path based on a set of URI convergence rules. Each convergence rule consists of two parts:

Regular expression: If the current URI matches the specified expression, convergence is performed. Sample expression: ^/api/product/(.+)$.
Converged text: The URI is converged into another readable string. Sample string: ProductItem.

The first time that the collector is enabled, it scans the Ingress resources of the current ACK cluster and assembles convergence rules based on the path information provided by the existing routing rules. If the preceding convergence configurations cannot meet your analysis requirements, perform the following steps to expand the configurations.

Run the kubectl edit ingresslog -narms-prom ingresslog-<Collection configuration name> command to go to the editing window of the custom resource. Sample command: kubectl edit ingresslog -narms-prom ingresslog-default-ingress-nginx.
Find the spec.logParser.reduceUri.allowList field and expand it. In this example, the field has only two convergence rules by default.
```
    reduceUri:
      allowList:
        - pattern: ^/(.+)$
          reduced: /(.+)
        - pattern: ^/$
          reduced: /
```
Note
The allowList field is an array object, and each element indicates a convergence rule. The pattern field in each convergence rule indicates a regular expression, and the reduced field indicates the converged text.

The following example shows how to expand convergence rules. You can expand convergence based on your business requirements.

    reduceUri:
      allowList:
        - pattern: ^/api/cart$
          reduced: /api/cart
        - pattern: ^/api/checkout$
          reduced: /api/checkout
        - pattern: ^/api/data$
          reduced: /api/data
        - pattern: ^/api/data/\?contextKeys=(.+)$
          reduced: /api/data/?contextKeys=(.+)
        - pattern: ^/api/products/(.+)$
          reduced: /api/products/(.+)
        - pattern: ^/api/recommendations/\?productIds=(.+)$
          reduced: /api/recommendations/?productIds=(.+)
        - pattern: ^/(.+)$
          reduced: /(.+)
        - pattern: ^/$
          reduced: /
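
To illustrate how an ordered rule list like this one might be evaluated, the following Python sketch applies the rules from top to bottom and returns the converged text of the first rule that matches. The first-match behavior, the function name `reduce_uri`, and the use of Python's `re` module are assumptions made for illustration. The collector's actual matching logic and regular expression dialect are not described in this topic.

```python
import re

# Rules copied from the expanded allowList above, in the same order.
ALLOW_LIST = [
    (r"^/api/cart$", "/api/cart"),
    (r"^/api/checkout$", "/api/checkout"),
    (r"^/api/data$", "/api/data"),
    (r"^/api/data/\?contextKeys=(.+)$", "/api/data/?contextKeys=(.+)"),
    (r"^/api/products/(.+)$", "/api/products/(.+)"),
    (r"^/api/recommendations/\?productIds=(.+)$",
     "/api/recommendations/?productIds=(.+)"),
    (r"^/(.+)$", "/(.+)"),
    (r"^/$", "/"),
]


def reduce_uri(uri, rules=ALLOW_LIST):
    """Return the converged text of the first matching rule, or None."""
    for pattern, reduced in rules:
        if re.search(pattern, uri):
            return reduced
    return None


if __name__ == "__main__":
    for uri in ("/api/products/OLJCESPC7Z", "/api/cart", "/about", "/"):
        print(uri, "->", reduce_uri(uri))
```

Under this assumption, a catch-all rule such as ^/(.+)$ placed at the top of the list would absorb every more specific path, which is why the general rules are kept at the end.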

Note

We recommend that you order the rules so that the rules with the shortest paths, such as ^/$, are at the end of the list. Save the configurations and wait for 2 or 3 minutes. Then, you can view the metric data that is refined based on the URI convergence rules in the dashboard.

After you expand URI convergence rules, metrics are split into more fine-grained time series. This increases the number of generated metrics and affects billing. Check for changes in the number of metrics soon after you modify the rules.
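
The effect on the number of metrics can be estimated with simple arithmetic: the upper bound on the number of time series of a metric is the product of the numbers of distinct values of its tags. The following Python sketch uses hypothetical tag counts to show how adding URI rules raises that bound. The actual number is usually lower, because not every combination of tag values occurs.

```python
from math import prod


def max_series(tag_cardinalities: dict) -> int:
    """Upper bound on time series: product of distinct values per tag."""
    return prod(tag_cardinalities.values())


if __name__ == "__main__":
    # Hypothetical numbers of distinct values per tag.
    before = {"host": 2, "service": 5, "uri": 2, "method": 3, "status_code": 6}
    # Expanding the allowList from 2 to 8 rules yields up to 8 uri values.
    after = dict(before, uri=8)
    print(max_series(before), max_series(after))
```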

Important

We recommend that you back up URI convergence rules in a timely manner. If you remove the exporters of the NGINX Ingress gateway from your Prometheus instance, the IngressLog custom resources are deleted by default.
Do not modify other configurations in the IngressLog custom resources. Otherwise, the exporters of the NGINX Ingress gateway cannot work as expected.

References: Monitor an NGINX Ingress gateway

Use exporter metrics

Each NGINX Ingress Controller process, which is developed based on open source NGINX, acts as an exporter and exposes Prometheus metrics. Example:

nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="200"} 2.401964e+06
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="304"} 111
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="308"} 553545
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="404"} 55
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="499"} 2
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontend",status="500"} 64
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontendproxy",status="200"} 59599
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontendproxy",status="304"} 15
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontendproxy",status="308"} 15709
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="my.otel-demo.com",ingress="my-otel-demo",method="GET",namespace="default",path="/",service="my-otel-demo-frontendproxy",status="403"} 235
nginx_ingress_controller_requests{canary="",controller_class="k8s.io/ingress-nginx",controller_namespace="kube-system",controller_pod="nginx-ingress-controller-6fdbbc5856-pcxkz",host="e-commerce.

You can use an open source Prometheus agent or a Managed Service for Prometheus agent with service discovery configurations to capture and report metrics. You can use Prometheus Query Language to implement analysis and alerting, or use Managed Service for Grafana to visualize metric data. However, monitoring based on exporter metrics has the following pain points in production environments.

Pain point 1: Exposure of a large number of impractical Histogram metrics
If you capture the metrics of your NGINX Ingress gateway in a production or test cluster, a large number of Histogram metrics are captured. In most cases, a Histogram metric is named <metric_name>_bucket and used together with <metric_name>_sum and <metric_name>_count. Some metrics that are not used for common analytics may also be captured. Examples:
- nginx_ingress_controller_request_size_bucket: the bucket size of each request body.
- nginx_ingress_controller_bytes_sent_bucket: the bucket size of each response body.
By default, unless you configure a drop action in the metric_relabel_configs section of the Prometheus collection configuration, these metrics are captured and reported. This may consume a large amount of bandwidth and storage resources.
Pain point 2: A large number of inactive timelines pulled in pull mode
If you capture the metrics of your NGINX Ingress gateway in a cluster when the pull mode of a Prometheus agent is enabled, excessive Histogram metrics are captured. As long as a microservice receives a request, all timelines related to the request will be captured even if the microservice is seldom accessed. In each capturing cycle, the timelines are continuously collected and reported, causing a waste of resources.
To solve the problem, you need to prevent Counter metrics from being reported when no changes happen in a time period.
Pain point 3: Inability to expand or drill down Ingress paths
The URL path is a dimension that reflects HTTP traffic in drill-down analysis. However, if you record every URL path as a tag value, the metric data becomes excessive.
To limit the data volume, the path tag of the metrics exposed by NGINX Ingress records only the path field of the corresponding Ingress rule. Examples: /(.+), /login, and /orders/(.+). In some scenarios, you may want to perform more fine-grained drill-down analysis, for example, to view data for the URL patterns /users/(.+)/follower and /users/(.+)/followee separately. This is not possible, because the URL paths cannot be expanded and the metric calculation logic preset in the NGINX Ingress gateway cannot be programmed.
Pain point 4: Lack of geographical and equipment analysis
Generally, the O&M personnel of a website system pay close attention to the request source side information. Examples:
- Provinces and cities where users are from and ten provinces and cities with the most PVs.
- PV data of PCs and mobile devices, including iOS devices and Android devices.
Geographical and equipment data is not displayed in the metrics exposed by the NGINX Ingress gateway.
Pain point 5: Rough Grafana dashboards of Kubernetes
Generally, the Grafana dashboards of Kubernetes are not comprehensive. The Grafana dashboard provided by Kubernetes lists the metrics of an NGINX Ingress gateway.
As mentioned earlier, the RED Method defines the following key metrics: rate, errors, and duration. A detailed and comprehensive dashboard is essential to NGINX Ingress gateway monitoring. However, the layout or information structure of this dashboard is inappropriate from the perspective of users who analyze request traffic.

Use access logs

Using the preset NGINX Ingress metrics has various pain points in production environments. Alibaba Cloud Managed Service for Prometheus allows you to monitor NGINX Ingress gateways based on access logs.

Similar to open source NGINX, the NGINX Ingress Gateway Monitor component prints the access logs of each request to the standard output of the Ingress Controller pods.

By default, the access logs of the NGINX Ingress Gateway Monitor component installed in an ACK cluster contain the following information:

Time that the request is sent.
Source IP address of the request.
Request method. Example: GET.
Request path. Example: /api/cart.
Status code.
Request body length.
Response body length.
Request duration.
The upstream service of the request. Example: default-my-otel-demo-frontend-8080.
The User-Agent of the request. Example: Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/3.0).
The hostname or domain name carried in the request header. Example: my.otel-demo.com. The hostname or domain name can identify the routing rule based on which traffic is routed.

Based on the preceding information, you only need to deploy a collector in the Kubernetes environment to perform pre-aggregation and then monitor the RED metrics related to the inbound traffic. You can also use controllable technical means to prevent several problems in monitoring based on exporter metrics:

Exposure of a large number of impractical Histogram metrics: You can create a set of detailed and essential metrics to meet the common analytics requirements.
A large number of inactive timelines pulled in pull mode: You can discard the Counter model and use multiple rolling windows to monitor Gauge metrics. Data is independent between windows and pushed in the Remote Write mode. This prevents metrics in historical timelines from being repeatedly reported.
Inability to expand or drill down Ingress paths: You can extend the pre-aggregation logic with custom resource configurations. You can create new matching rules to implement drill-down analysis.
Lack of geographical and equipment analysis: You can use GeoIP and User-Agent analytics to enrich data in pre-aggregation.
Rough Grafana dashboards of Kubernetes: You can create new portal dashboards with optimized layout, information structure, and observability experience.
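
The rolling-window approach described above can be sketched as follows. Each parsed access log record is assigned to a fixed 30-second window, and requests are counted per window and tag combination. Only windows that actually received requests produce values, so idle series are not reported again. This is an illustration of the idea, not the collector's code: the record fields, the function name `aggregate_requests`, and the sample data are hypothetical, and real access logs must be parsed into records first.

```python
from collections import Counter

WINDOW_SECONDS = 30  # aggregation period listed for the ingress_* metrics


def aggregate_requests(records, window_seconds=WINDOW_SECONDS):
    """Count requests per (window_start, host, service, status_code)."""
    counts = Counter()
    for rec in records:
        # Align the timestamp to the start of its window.
        window_start = int(rec["timestamp"] // window_seconds) * window_seconds
        key = (window_start, rec["host"], rec["service"], rec["status_code"])
        counts[key] += 1
    return counts


if __name__ == "__main__":
    host = "my.otel-demo.com"
    service = "default-my-otel-demo-frontend-8080"
    # Hypothetical records: timestamps are seconds since an arbitrary origin.
    records = [
        {"timestamp": 0, "host": host, "service": service, "status_code": 200},
        {"timestamp": 10, "host": host, "service": service, "status_code": 200},
        {"timestamp": 29, "host": host, "service": service, "status_code": 200},
        {"timestamp": 31, "host": host, "service": service, "status_code": 500},
        {"timestamp": 95, "host": host, "service": service, "status_code": 200},
    ]
    for key, value in sorted(aggregate_requests(records).items()):
        print(key, value)
```

In this sample, the window that starts at second 60 receives no requests and therefore produces no value at all, whereas a Counter scraped in pull mode would report its unchanged total in every cycle.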