
Application Real-Time Monitoring Service: Use Managed Service for Prometheus to monitor Cassandra

Last Updated: Nov 20, 2023

This topic describes how to use Alibaba Cloud Managed Service for Prometheus to monitor Cassandra.

Prerequisites

A Prometheus instance for ECS is created. For more information, see Create a Prometheus instance to monitor an ECS instance.

Limits

You can install the component only on Prometheus instances for ECS.

Step 1: Deploy a Cassandra JMX agent

  1. Based on the version of Cassandra, download a Cassandra JMX agent to the Elastic Compute Service (ECS) instance where Cassandra resides.

  2. Decompress the package to a directory of your choice, referred to as MCAC_ROOT below. Then, add the following lines to the cassandra-env.sh file:

    MCAC_ROOT=/path/to/directory
    JVM_OPTS="$JVM_OPTS -javaagent:${MCAC_ROOT}/lib/datastax-mcac-agent.jar"
    Important

    The port number that the Cassandra JMX agent exposes to Managed Service for Prometheus is 9103. To change the port number, modify the port setting in the ${MCAC_ROOT}/config/collectd.conf.tmpl file.

  3. Restart Cassandra and run the curl localhost:{jmx port}/metrics command on the ECS instance. If metric data is returned, the Cassandra JMX agent is installed and running.
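The check in the last step can also be scripted. The following Python sketch is illustrative and not part of the agent: it parses Prometheus text-format output, such as what `curl localhost:9103/metrics` returns, and lists the `mcac_*` metric names. The sample payload is hypothetical.

```python
def mcac_metric_names(exposition_text):
    """Return the set of mcac_* metric names found in Prometheus text format."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        # Metric name ends at the first "{" (labels) or space (value).
        metric = line.split("{", 1)[0].split(" ", 1)[0]
        if metric.startswith("mcac_"):
            names.add(metric)
    return names

# Hypothetical sample of what the agent's /metrics endpoint might return.
sample = """\
# HELP mcac_client_connected_native_clients Number of CQL connections
mcac_client_connected_native_clients{instance="node1"} 12
collectd_uptime 86400
"""
print(sorted(mcac_metric_names(sample)))  # ['mcac_client_connected_native_clients']
```

If the set is empty after a restart, the agent is not exposing metrics and the javaagent line in cassandra-env.sh should be rechecked.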

Step 2: Integrate Cassandra into Managed Service for Prometheus

Procedure

Entry point 1: Integration center of the Prometheus instance

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
  3. Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.

Entry point 2: Integration center in the ARMS console

  1. Log on to the Application Real-Time Monitoring Service (ARMS) console.

  2. In the left-side navigation pane, click Integration Center. In the Components section, find Cassandra and click Add. In the panel that appears, integrate Cassandra as prompted.

Integrate Cassandra

This section describes how to integrate the Cassandra component in the integration center of the Prometheus instance.

  1. Install or add the Cassandra component.

    • If this is the first time that you install the Cassandra component, perform the following operation.

      In the Not Installed section of the Integration Center page, find Cassandra and click Install.

      Note

      You can click the card to view the common Cassandra metrics and dashboard thumbnails in the panel that appears. The metrics listed are for reference only. After you install the Cassandra component, you can view the actual metrics collected by Managed Service for Prometheus. For more information, see Key metrics.

    • If you have already installed the Cassandra component, you can add it again.

      In the Installed section of the Integration Center page, find Cassandra and click Add.

  2. On the Settings tab in the STEP2 section, configure the parameters and click OK. The following list describes the parameters.

    • Instance name: The name of the exporter. The name can contain only lowercase letters, digits, and hyphens (-), cannot start or end with a hyphen (-), and must be unique.

    • ECS Label Key (service discovery): The ECS tag that is used to deploy the exporter. Managed Service for Prometheus uses this tag for service discovery. Valid values: acs:emr:nodeGroupType and acs:emr:hostGroupType.

    • ECS Label value: The values of the ECS tag. Default values: CORE,MASTER. Separate multiple values with commas (,).

    • JMX Agent listening port: The port on which metrics are exposed. Managed Service for Prometheus accesses this port to obtain metric data. Default value: 9103.

    • Metrics path: The HTTP path used by Managed Service for Prometheus to collect metric data from the exporter. Default value: /metrics.

    • Metrics scrape interval (seconds): The interval at which Managed Service for Prometheus collects monitoring data. Default value: 30.

    Note

    You can view the monitoring metrics on the Metrics tab in the STEP2 section.

    The installed components are displayed in the Installed section of the Integration Center page. Click the component. In the panel that appears, you can view information such as targets, metrics, dashboard, alerts, service discovery configurations, and exporters. For more information, see Integration center.

    You can also view the status of the exporter on the Targets tab.
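The instance-name rule described above can be expressed as a quick local check. The following Python helper is hypothetical, not an Alibaba Cloud API; it mirrors the stated constraint that names contain only lowercase letters, digits, and hyphens, and cannot start or end with a hyphen.

```python
import re

# One leading and one trailing alphanumeric character, with lowercase
# letters, digits, and hyphens allowed in between (per the parameter table).
_NAME_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")

def is_valid_instance_name(name):
    """Check a candidate exporter instance name against the console's rule."""
    return bool(_NAME_RE.fullmatch(name))

print(is_valid_instance_name("cassandra-exporter-1"))  # True
print(is_valid_instance_name("-bad-name"))             # False
```

Uniqueness cannot be checked locally; the console rejects duplicate names when you click OK.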

Step 3: View the dashboards of Cassandra

On the Dashboards tab, you can view monitoring data such as the availability, client read and write latency, and client throughput. You can also view the CPU utilization, memory usage, and disk usage of nodes.

On the Integration Center page, click the Cassandra component in the Installed section. In the panel that appears, click the Dashboards tab to view the thumbnails and hyperlinks of Cassandra dashboards. Click a hyperlink to go to the Grafana page and view the dashboard. This section describes the monitoring metrics of common dashboards.

  • Cluster/Node Information section

  • Client Read Latency, Write Delay, and Throughput section

  • Exceptions and Errors section

  • Caching and Bloom Filters section

  • Hardware resource usage section

  • Storage occupancy details section

  • Thread Pool Status section

  • JVM and Garbage Collection section

Step 4: Configure alerting

On the Integration Center page, click the Cassandra component in the Installed section. In the panel that appears, click the Alerts tab to view all Cassandra alert rules configured in Managed Service for Prometheus.

Managed Service for Prometheus allows you to enable Cassandra exporters with simple configurations and provides out-of-the-box dedicated dashboards and alerting. You can manage the exporters in the ARMS console, which reduces O&M costs.

Managed Service for Prometheus provides multiple default alert rules for the key metrics of Cassandra. Common Cassandra alert rules are preset as templates to help O&M personnel build dashboards and alert systems. The following list describes the default alert rules by category.

Node status

  • Proportion of inactive nodes in the cluster: If the value is greater than 10, one or more nodes in the cluster are down.

Resource usage

  • CPU utilization: If the CPU utilization of a node exceeds 85% in the last 5 minutes, the CPU utilization reaches the upper limit.

  • Memory usage: If the memory usage of a node exceeds 85%, the memory usage reaches the upper limit.

  • Hard disk usage: If the hard disk usage of a node exceeds 85%, the hard disk usage reaches the upper limit.

Read and write latency and throughput

  • Read latency: If the read latency of a node exceeds 200 ms in the last minute, the read latency is high.

  • Write latency: If the write latency of a node exceeds 200 ms in the last minute, the write latency is high.

  • Read throughput: If the number of read operations on a node exceeds 1,000 in the last minute, the read throughput is high.

  • Write throughput: If the number of write operations on a node exceeds 1,000 in the last minute, the write throughput is high.

Exceptions and errors

  • Timed-out requests: If the number of timed-out requests on a node exceeds 10 in the last minute, the node is overloaded.

  • Failed requests: If the number of failed requests on a node exceeds 10 in the last minute, the node is overloaded.

  • Dropped messages: If the number of dropped messages on a node exceeds 10 in the last minute, the node is overloaded.

JVM

  • GC time ratio: If GC accounts for more than 1% of a node's time in the last 5 minutes, garbage collection is too frequent.

You can also create alert rules based on your business requirements. For more information, see Create an alert rule for a Prometheus instance.
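The threshold logic behind a few of the default rules above can be sketched as plain checks. Only the thresholds (85% for resource usage, 200 ms for latency) come from the table; the rule keys, function, and sample data below are illustrative, not the actual rule definitions used by Managed Service for Prometheus.

```python
# A subset of the default thresholds from the table above, keyed by
# hypothetical names. Values are the upper limits that trigger an alert.
DEFAULT_RULES = {
    "cpu_utilization_pct": 85,
    "memory_usage_pct": 85,
    "disk_usage_pct": 85,
    "read_latency_ms": 200,
    "write_latency_ms": 200,
}

def fired_alerts(samples):
    """Return the names of default rules whose threshold is exceeded."""
    return sorted(
        rule for rule, limit in DEFAULT_RULES.items()
        if samples.get(rule, 0) > limit
    )

node = {"cpu_utilization_pct": 91, "read_latency_ms": 120}
print(fired_alerts(node))  # ['cpu_utilization_pct']
```

In the real service, these conditions are evaluated over time windows (for example, the last 5 minutes for CPU utilization) rather than on single samples.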

Key metrics

Cluster and node information

• mcac_client_connected_native_clients (Major): Number of CQL connections. If the value is too large, a large amount of system resources is occupied, which prolongs client latency.

• mcac_table_live_disk_space_used_total (Major): Disk space occupied by Cassandra. If the value is too large, storage space may be insufficient, which prolongs access latency.

• mcac_table_snapshots_size (Recommended): Size of Cassandra snapshot files. Snapshots are used to restore data. If the value is too large, storage space may be insufficient to store complete snapshots.

• collectd_uptime (Major): Node uptime. If the value is too large, the system has not been restarted for a long time and may be vulnerable to security risks.

Key performance metrics

• mcac_table_read_latency (Critical): Client read latency. If the value is too large, the read speed of the application is slow, which affects user experience.

• mcac_table_write_latency (Critical): Client write latency. If the value is too large, the write speed of the application is slow, which affects user experience.

Exceptions and errors

• mcac_client_request_timeouts_total (Critical): Number of timed-out client requests. If the value is too large, the system is overloaded, which severely affects user experience.

• mcac_client_request_failures_total (Critical): Number of failed client requests. If the value is too large, the system is overloaded, which severely affects user experience.

• mcac_dropped_message_dropped_total (Critical): Number of dropped messages. If the value is too large, the system is overloaded, which severely affects user experience.

Caching and Bloom filters

• mcac_table_key_cache_hit_rate (Major): Hit rate of the key cache. If the value is too small, the read speed of the application may be slow, which affects user experience.

• mcac_table_row_cache_hit_total (Major): Number of row cache hits. If the value is too small, the read speed of the application may be slow, which affects user experience.

• mcac_table_row_cache_miss_total (Recommended): Number of row cache misses. If the value is too large, the read speed of the application may be slow, which affects user experience.

• mcac_table_row_cache_hit_out_of_range_total (Recommended): Number of times the row cache is hit but the disk is still accessed. If the value is too large, the read speed of the application may be slow, which affects user experience.

• mcac_table_bloom_filter_false_ratio (Major): False-positive rate of the Bloom filter. If the value is too large, queries for non-existent elements are misjudged as hits, which wastes query time and resources, degrades query performance, and increases query costs.
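The Bloom filter false-positive rate mentioned above follows the standard textbook approximation p ≈ (1 − e^(−kn/m))^k for a filter with m bits, k hash functions, and n inserted elements. The following Python sketch shows that formula; it is an illustration of why the rate grows as a table fills, not code from Cassandra itself.

```python
import math

def bloom_false_positive_rate(m_bits, k_hashes, n_items):
    """Standard Bloom filter false-positive approximation: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# More bits per element lowers the false-positive rate; the numbers
# here are arbitrary examples.
print(bloom_false_positive_rate(10_000, 7, 1_000))  # roughly 0.008
print(bloom_false_positive_rate(20_000, 7, 1_000))  # much smaller
```

In Cassandra, the per-table `bloom_filter_fp_chance` setting trades memory for a lower target false-positive rate.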

Usage trends in CPU, memory, and disks

• collectd_cpu_total (Critical): CPU utilization. If the value is too large, the system is overloaded, which prolongs client request latency and severely affects user experience.

• collectd_memory (Critical): Memory usage. If the value is too large, the system is overloaded, which prolongs client request latency and severely affects user experience.

• collectd_df_df_complex (Critical): Hard disk usage. If the value is too large, hard disk space is insufficient, data cannot be stored persistently, and the system may crash.

SSTable compaction and compression

• mcac_table_pending_compactions (Major): Number of pending SSTable compaction tasks. If the value is too large, the system is overloaded, which prolongs client request latency. We recommend that you tune the SSTable compaction settings.

• mcac_table_compaction_bytes_written_total (Major): SSTable compaction throughput. If the value is too small, compaction is slow and tasks accumulate. We recommend that you increase the hardware configuration of the node.

• mcac_table_compression_ratio (Major): SSTable compression ratio. If the value is too large, the compressed files are still large, and compression does not achieve the expected results.

Disk files

• mcac_table_live_ss_table_count (Major): Number of SSTables. If the value is too large, hard disk usage is high and read/write latency is prolonged. We recommend that you configure the SSTable compaction policy.

• mcac_table_live_disk_space_used_total (Major): Hard disk space occupied by SSTables. If the value is too large, hard disk usage is high and read/write latency is prolonged. We recommend that you configure the SSTable compaction policy.

• mcac_table_ss_tables_per_read_histogram (Major): Number of SSTables accessed per read operation. If the value is too large, client read latency is high.

• mcac_commit_log_total_commit_log_size (Major): Hard disk space occupied by the commit log. If the value is too large, hard disk space is insufficient, read/write performance is degraded, and data recovery time increases.

• mcac_table_memtable_live_data_size (Major): Space occupied by the MemTable. If the value is too large, data write performance and node stability are degraded.

• mcac_table_waiting_on_free_memtable_space (Major): Time spent waiting for MemTable space to be freed. If the value is too large, data write performance and node stability are degraded.

Thread pool status

• mcac_thread_pools_active_tasks (Critical): Number of active tasks in the thread pool. If the value is too large, system resources are occupied, which may reduce response speed or even crash the system.

• mcac_thread_pools_total_blocked_tasks_total (Critical): Number of blocked tasks in the thread pool. If the value is too large, system resources are occupied, which may reduce response speed or even crash the system.

• mcac_thread_pools_pending_tasks (Critical): Number of pending tasks in the thread pool. If the value is too large, a large amount of system resources is occupied. If the requests that correspond to pending tasks time out, the system may crash.

• mcac_thread_pools_completed_tasks (Major): Number of completed tasks in the thread pool. This metric indicates system throughput. The higher the value, the better the system performs.

JVM

• mcac_jvm_memory_used (Critical): Size of the used JVM heap memory. If the value is too large, memory may be insufficient, which triggers frequent garbage collection and reduces application throughput.

• mcac_jvm_gc_time (Critical): Time the application spends in GC. If the value is too large, GC is too frequent and the system has less time to execute user tasks, which may lead to client request timeouts or even a system crash.