Managed Service for Prometheus collects Cassandra metrics -- read/write latency, compaction activity, thread pool status, JVM health, and more -- through a JMX agent deployed on your Elastic Compute Service (ECS) instances. The integration includes pre-built Grafana dashboards and 12 default alert rules across node status, resource usage, read/write performance, exceptions, and JVM health.
This guide walks you through four steps: deploy the JMX agent, integrate Cassandra with your Prometheus instance, explore dashboards, and configure alerting.
This integration is available only for Prometheus instances for ECS.
Prerequisites
Before you begin, make sure you have:
A Prometheus instance for ECS. For details, see Create a Prometheus instance to monitor an ECS instance.
Step 1: Deploy the Cassandra JMX agent
The JMX agent exposes Cassandra metrics on an HTTP endpoint that Managed Service for Prometheus scrapes at a configurable interval.
Download the JMX agent package to the ECS instance where Cassandra runs. Select the version that matches your Cassandra release:
Extract the package and configure the JVM agent. Add the following lines to your `cassandra-env.sh` file. Replace `/path/to/directory` with the actual path where you extracted the package.

Important: The JMX agent listens on port `9103` by default. To change the port, edit the `${MCAC_ROOT}/config/collectd.conf.tmpl` file.

```shell
MCAC_ROOT=/path/to/directory
JVM_OPTS="$JVM_OPTS -javaagent:${MCAC_ROOT}/lib/datastax-mcac-agent.jar"
```

Restart Cassandra.
Verify that the agent is running:

```shell
curl localhost:9103/metrics
```

If metric data is returned, the JMX agent is working correctly.
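Beyond eyeballing the output, you can sanity-check that the endpoint returns actual metric samples by filtering out the exposition-format comment lines. The payload below is a fabricated two-sample excerpt for illustration; against a live node you would pipe `curl localhost:9103/metrics` into the same filter.

```shell
# Fabricated excerpt of Prometheus exposition output (illustration only).
sample='# HELP mcac_client_connected_native_clients Active CQL connections
# TYPE mcac_client_connected_native_clients gauge
mcac_client_connected_native_clients{cluster="demo"} 12
mcac_jvm_memory_used{area="heap"} 104857600'

# Keep only sample lines (comment lines start with "#"), then count them.
count=$(printf '%s\n' "$sample" | grep -c '^mcac_')
echo "$count"   # prints 2
```

If the same filter on real output returns 0, the agent is up but exporting nothing, which usually points to a misconfigured `MCAC_ROOT` path.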
Step 2: Integrate Cassandra with your Prometheus instance
Open the Integration Center through either of the following entry points.
From your Prometheus instance:
Log on to the ARMS console.
In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
Click the name of your Prometheus instance to go to the Integration Center page.
From the ARMS console:
Log on to the Application Real-Time Monitoring Service (ARMS) console.
In the left-side navigation pane, click Integration Center.
In the Components section, find Cassandra and click Add. Complete the integration as prompted.
Install or add the Cassandra component
First-time installation: In the Not Installed section of the Integration Center page, find Cassandra and click Install.
Click the card to preview common Cassandra metrics and dashboard thumbnails. These metrics are for reference only. After installation, the actual metrics collected from your environment are available. For details, see Key metrics.
Adding another instance: If you already have the Cassandra component installed, go to the Installed section, find Cassandra, and click Add.
Configure integration parameters
On the Settings tab in the STEP2 section, configure the following parameters and click OK.
| Parameter | Description | Default |
|---|---|---|
| Instance name | A unique name for the exporter. Only lowercase letters, digits, and hyphens (-) are allowed. Cannot start or end with a hyphen. | -- |
| ECS Label Key (service discovery) | The ECS tag key used for service discovery. Valid values: acs:emr:nodeGroupType and acs:emr:hostGroupType. | -- |
| ECS Label value | The ECS tag values. Separate multiple values with commas. | CORE,MASTER |
| JMX Agent listening port | The port where the JMX agent exposes metrics. | 9103 |
| Metrics path | The HTTP path for metric collection. | /metrics |
| Metrics scrape interval (seconds) | How often Managed Service for Prometheus collects metrics. | 30 |
Review the collected metrics on the Metrics tab in the STEP2 section.
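Conceptually, the parameters above map onto a standard Prometheus scrape job. The sketch below is only an illustration of that mapping, assuming the usual `scrape_configs` shape; the managed service generates and manages the real configuration for you, including ECS tag-based service discovery.

```yaml
scrape_configs:
  - job_name: cassandra              # illustrative name; not the generated one
    metrics_path: /metrics           # "Metrics path" parameter
    scrape_interval: 30s             # "Metrics scrape interval" parameter
    static_configs:
      - targets: ['<ecs-ip>:9103']   # "JMX Agent listening port" parameter
```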
Verify the integration
After installation, the Cassandra component appears in the Installed section of the Integration Center page. Click the component to view targets, metrics, dashboards, alerts, service discovery configurations, and exporters. For more information, see Integration center.
Check the exporter status on the Targets tab. All targets should show an UP state.
Step 3: View Cassandra dashboards
The pre-built dashboards visualize cluster availability, client latency, throughput, and resource utilization.
To access the dashboards:
On the Integration Center page, click Cassandra in the Installed section.
In the panel that appears, click the Dashboards tab.
Click a dashboard hyperlink to open it in Grafana.
The dashboards cover the following areas:
| Dashboard section | What it shows |
|---|---|
| Cluster/Node information | Cluster size, node uptime, active CQL connections |
| Client read latency, write latency, and throughput | End-to-end request performance |
| Exceptions and errors | Timed-out requests, failures, dropped messages |
| Caching and Bloom filters | Cache hit rates, false-positive ratios |
| Hardware resource usage | CPU, memory, and disk utilization per node |
| Storage occupancy details | SSTable count, disk space, commit log size |
| Thread pool status | Active, pending, and blocked tasks |
| JVM and garbage collection | Heap memory usage, GC duration |
Step 4: Configure alerting
Managed Service for Prometheus includes 12 default alert rules for Cassandra. To view them:
On the Integration Center page, click Cassandra in the Installed section.
In the panel that appears, click the Alerts tab.
Default alert rules
| Category | Alert | Trigger condition |
|---|---|---|
| Node status | Inactive node ratio | Exceeds 10, indicating one or more nodes are down |
| Resource usage | CPU utilization | Exceeds 85% for 5 minutes |
| | Memory usage | Exceeds 85% |
| | Disk usage | Exceeds 85% |
| Read/write performance | Read latency | Exceeds 200 ms for 1 minute |
| | Write latency | Exceeds 200 ms for 1 minute |
| | Read throughput | Exceeds 1,000 operations in 1 minute |
| | Write throughput | Exceeds 1,000 operations in 1 minute |
| Exceptions and errors | Timed-out requests | More than 10 in 1 minute |
| | Failed requests | More than 10 in 1 minute |
| | Dropped messages | More than 10 in 1 minute |
| JVM | GC time ratio | Exceeds 1% over 5 minutes |
To create custom alert rules based on your requirements, see Create an alert rule for a Prometheus instance.
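Custom rules are written as PromQL expressions over the metrics listed in the Key metrics section. The fragment below is a hypothetical sketch in standard Prometheus alerting-rule form, not one of the 12 defaults; the metric's unit and the 0.2 s threshold are assumptions for illustration.

```yaml
groups:
  - name: cassandra-custom                   # illustrative group name
    rules:
      - alert: CassandraHighReadLatency
        expr: mcac_table_read_latency > 0.2  # threshold is an example only
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Client read latency above 200 ms for 1 minute"
```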
Key metrics
Cluster and node information
| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| `mcac_client_connected_native_clients` | Major | Active CQL connections | High connection counts consume system resources and increase client latency. |
| `mcac_table_live_disk_space_used_total` | Major | Disk space used by Cassandra data | High disk usage increases access latency and risks storage exhaustion. |
| `mcac_table_snapshots_size` | Recommended | Snapshot file size | Large snapshots consume storage space and may prevent complete backups. |
| `collectd_uptime` | Major | Node uptime | Extended uptime without restarts may indicate missed security patches. |
Performance
| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| `mcac_table_read_latency` | Critical | Client read latency | High values directly degrade application read performance and user experience. |
| `mcac_table_write_latency` | Critical | Client write latency | High values directly degrade application write performance and user experience. |
Exceptions and errors
| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| `mcac_client_request_timeouts_total` | Critical | Timed-out client requests | Increasing timeouts indicate system overload. |
| `mcac_client_request_failures_total` | Critical | Failed client requests | Increasing failures indicate system overload. |
| `mcac_dropped_message_dropped_total` | Critical | Dropped messages | Dropped messages indicate that nodes cannot keep up with incoming requests. |
Caching and Bloom filters
| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| `mcac_table_key_cache_hit_rate` | Major | Key cache hit rate | Low hit rates cause more disk reads, increasing read latency. |
| `mcac_table_row_cache_hit_total` | Major | Row cache hits | Low hit counts may indicate ineffective caching for your workload. |
| `mcac_table_row_cache_miss_total` | Recommended | Row cache misses | High miss counts increase read latency due to disk lookups. |
| `mcac_table_row_cache_hit_out_of_range_total` | Recommended | Row cache hits that still require disk access | High values reduce the effectiveness of row caching. |
| `mcac_table_bloom_filter_false_ratio` | Major | Bloom filter false-positive rate | High false-positive rates waste query resources by checking SSTables that do not contain the requested data. |
CPU, memory, and disk
| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| `collectd_cpu_total` | Critical | CPU utilization | Sustained high CPU usage increases request latency and may cause timeouts. |
| `collectd_memory` | Critical | Memory usage | High memory usage can trigger frequent garbage collection, reducing throughput. |
| `collectd_df_df_complex` | Critical | Disk usage | When disk space runs out, Cassandra cannot persist data and the node may crash. |
SSTable compaction
| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| `mcac_table_pending_compactions` | Major | Pending compaction tasks | A growing backlog increases read latency. Consider tuning compaction settings. |
| `mcac_table_compaction_bytes_written_total` | Major | Compaction throughput (bytes written) | Low throughput causes task backlog. Consider upgrading node hardware. |
| `mcac_table_compression_ratio` | Major | SSTable compression ratio | High ratios mean compressed files are still large, limiting the benefit of compression. |
Disk storage
| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| `mcac_table_live_ss_table_count` | Major | SSTable count | Excessive SSTables increase disk usage and read latency. Review your compaction strategy. |
| `mcac_table_live_disk_space_used_total` | Major | Disk space used by SSTables | High disk usage increases read/write latency. Review your compaction strategy. |
| `mcac_table_ss_tables_per_read_histogram` | Major | SSTables accessed per read | High values increase read latency. |
| `mcac_commit_log_total_commit_log_size` | Major | Commit log disk usage | Large commit logs reduce available disk space and extend recovery time. |
| `mcac_table_memtable_live_data_size` | Major | MemTable size | Oversized MemTables can degrade write performance and node stability. |
| `mcac_table_waiting_on_free_memtable_space` | Major | Time waiting for MemTable space | Extended waits degrade write performance and node stability. |
Thread pool status
| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| `mcac_thread_pools_active_tasks` | Critical | Active tasks | High counts indicate resource contention that can slow responses or crash the node. |
| `mcac_thread_pools_total_blocked_tasks_total` | Critical | Blocked tasks | Blocked tasks indicate thread pool saturation. Investigate resource bottlenecks. |
| `mcac_thread_pools_pending_tasks` | Critical | Pending tasks | A growing queue of pending tasks can lead to request timeouts and system instability. |
| `mcac_thread_pools_completed_tasks` | Major | Completed tasks | Reflects system throughput. Higher values indicate better performance. |
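A growing pending-task queue is often easier to catch with a query over a time window than with an instantaneous value. A hypothetical PromQL expression for this (the metric name is from the table above; the 100-task threshold is an arbitrary example, not a product default):

```
max_over_time(mcac_thread_pools_pending_tasks[5m]) > 100
```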
JVM
| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| `mcac_jvm_memory_used` | Critical | JVM heap memory used | Approaching the heap limit triggers frequent garbage collection, reducing application throughput. |
| `mcac_jvm_gc_time` | Critical | Time spent in garbage collection | High GC time reduces the time available for request processing, leading to timeouts or crashes. |