
Application Real-Time Monitoring Service:Monitor Cassandra with Managed Service for Prometheus

Last Updated: Mar 11, 2026

Managed Service for Prometheus collects Cassandra metrics (read/write latency, compaction activity, thread pool status, JVM health, and more) through a JMX agent deployed on your Elastic Compute Service (ECS) instances. The integration includes pre-built Grafana dashboards and 12 default alert rules across node status, resource usage, read/write performance, exceptions, and JVM health.

This guide walks you through four steps: deploy the JMX agent, integrate Cassandra with your Prometheus instance, explore dashboards, and configure alerting.

This integration is available only for instances of the Prometheus for ECS type.

Prerequisites

Before you begin, make sure you have:

  • A Managed Service for Prometheus instance of the Prometheus for ECS type.

  • Cassandra deployed on an ECS instance that your Prometheus instance can access.

Step 1: Deploy the Cassandra JMX agent

The JMX agent exposes Cassandra metrics on an HTTP endpoint that Managed Service for Prometheus scrapes at a configurable interval.

  1. Download the JMX agent package to the ECS instance where Cassandra runs. Select the version that matches your Cassandra release.

  2. Extract the package and configure the JVM agent by adding the following lines to your cassandra-env.sh file. Replace /path/to/directory with the actual path where you extracted the package:

       MCAC_ROOT=/path/to/directory
       JVM_OPTS="$JVM_OPTS -javaagent:${MCAC_ROOT}/lib/datastax-mcac-agent.jar"

    Important

    The JMX agent listens on port 9103 by default. To change the port, edit the ${MCAC_ROOT}/config/collectd.conf.tmpl file.
  3. Restart Cassandra.

  4. Verify that the agent is running:

       curl localhost:9103/metrics

     If metric data is returned, the JMX agent is working correctly.
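
The agent returns metrics in the Prometheus text exposition format. As a rough sketch (not part of the official tooling), a saved response can be checked for a specific series like this; the sample body and labels below are illustrative:

```python
import re

def parse_samples(exposition_text, metric_name):
    """Extract (labels, value) pairs for one metric from the Prometheus
    text exposition format returned by the agent's /metrics endpoint."""
    pattern = re.compile(
        rf'^{re.escape(metric_name)}(\{{[^}}]*\}})?\s+(\S+)$'
    )
    samples = []
    for line in exposition_text.splitlines():
        m = pattern.match(line.strip())
        if m:
            samples.append((m.group(1) or "", float(m.group(2))))
    return samples

# Illustrative response body; real output lists many mcac_* and collectd_* series.
body = """# HELP collectd_uptime Node uptime in seconds
# TYPE collectd_uptime gauge
collectd_uptime{instance="node1"} 86400
"""

print(parse_samples(body, "collectd_uptime"))
```

If the list comes back empty for a metric you expect, confirm the agent port and that Cassandra was restarted after the agent was configured.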

Step 2: Integrate Cassandra with your Prometheus instance

Open the Integration Center through either of the following entry points.

From your Prometheus instance:

  1. Log on to the Application Real-Time Monitoring Service (ARMS) console.

  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.

  3. Click the name of your Prometheus instance to go to the Integration Center page.

From the ARMS console:

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, click Integration Center.

  3. In the Components section, find Cassandra and click Add. Complete the integration as prompted.

Install or add the Cassandra component

  • First-time installation: In the Not Installed section of the Integration Center page, find Cassandra and click Install.

    Click the card to preview common Cassandra metrics and dashboard thumbnails. The preview metrics are for reference only; after installation, you can view the actual metrics collected from your environment. For details, see Key metrics.
  • Adding another instance: If you already have the Cassandra component installed, go to the Installed section, find Cassandra, and click Add.

Configure integration parameters

On the Settings tab in the STEP2 section, configure the following parameters and click OK.

| Parameter | Description | Default |
|---|---|---|
| Instance name | A unique name for the exporter. Only lowercase letters, digits, and hyphens (-) are allowed. Cannot start or end with a hyphen. | -- |
| ECS Label Key (service discovery) | The ECS tag key used for service discovery. Valid values: acs:emr:nodeGroupType and acs:emr:hostGroupType. | -- |
| ECS Label value | The ECS tag values. Separate multiple values with commas. | CORE,MASTER |
| JMX Agent listening port | The port where the JMX agent exposes metrics. | 9103 |
| Metrics path | The HTTP path for metric collection. | /metrics |
| Metrics scrape interval (seconds) | How often Managed Service for Prometheus collects metrics. | 30 |
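
The instance-name constraint above can be expressed as a single pattern. A minimal sketch for pre-checking a name, assuming the rule exactly as stated in the table (the helper is illustrative, not a console API):

```python
import re

# Lowercase letters, digits, and hyphens only; no leading or trailing hyphen.
# The console performs its own validation; this only mirrors the stated rule.
NAME_RE = re.compile(r'^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')

def is_valid_instance_name(name: str) -> bool:
    return bool(NAME_RE.match(name))

print(is_valid_instance_name("cassandra-prod-1"))  # True
print(is_valid_instance_name("-cassandra"))        # False: leading hyphen
print(is_valid_instance_name("Cassandra"))         # False: uppercase letter
```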
Review the collected metrics on the Metrics tab in the STEP2 section.

Verify the integration

After installation, the Cassandra component appears in the Installed section of the Integration Center page. Click the component to view targets, metrics, dashboards, alerts, service discovery configurations, and exporters. For more information, see Integration center.

Check the exporter status on the Targets tab. All targets should show an UP state.

Step 3: View Cassandra dashboards

The pre-built dashboards visualize cluster availability, client latency, throughput, and resource utilization.

To access the dashboards:

  1. On the Integration Center page, click Cassandra in the Installed section.

  2. In the panel that appears, click the Dashboards tab.

  3. Click a dashboard hyperlink to open it in Grafana.

The dashboards cover the following areas:

| Dashboard section | What it shows |
|---|---|
| Cluster/Node information | Cluster size, node uptime, active CQL connections |
| Client read latency, write latency, and throughput | End-to-end request performance |
| Exceptions and errors | Timed-out requests, failures, dropped messages |
| Caching and Bloom filters | Cache hit rates, false-positive ratios |
| Hardware resource usage | CPU, memory, and disk utilization per node |
| Storage occupancy details | SSTable count, disk space, commit log size |
| Thread pool status | Active, pending, and blocked tasks |
| JVM and garbage collection | Heap memory usage, GC duration |

Step 4: Configure alerting

Managed Service for Prometheus includes 12 default alert rules for Cassandra. To view them:

  1. On the Integration Center page, click Cassandra in the Installed section.

  2. In the panel that appears, click the Alerts tab.

Default alert rules

| Category | Alert | Trigger condition |
|---|---|---|
| Node status | Inactive node ratio | Value exceeds 10, indicating one or more nodes are down |
| Resource usage | CPU utilization | Exceeds 85% for 5 minutes |
| Resource usage | Memory usage | Exceeds 85% |
| Resource usage | Disk usage | Exceeds 85% |
| Read/write performance | Read latency | Exceeds 200 ms for 1 minute |
| Read/write performance | Write latency | Exceeds 200 ms for 1 minute |
| Read/write performance | Read throughput | Exceeds 1,000 operations in 1 minute |
| Read/write performance | Write throughput | Exceeds 1,000 operations in 1 minute |
| Exceptions and errors | Timed-out requests | More than 10 in 1 minute |
| Exceptions and errors | Failed requests | More than 10 in 1 minute |
| Exceptions and errors | Dropped messages | More than 10 in 1 minute |
| JVM | GC time ratio | Exceeds 1% over 5 minutes |
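
The JVM rule, for example, compares time spent in garbage collection against wall-clock time. A minimal sketch of that calculation, assuming the underlying GC metric is a cumulative millisecond counter sampled at the start and end of the window:

```python
# Illustrative only: the assumed unit (milliseconds) and sampling scheme
# may differ from how the managed alert rule evaluates mcac_jvm_gc_time.
def gc_time_ratio(gc_ms_start: float, gc_ms_end: float, window_seconds: float) -> float:
    """Fraction of wall-clock time the JVM spent in garbage collection."""
    return (gc_ms_end - gc_ms_start) / (window_seconds * 1000.0)

# 4 seconds of GC inside a 5-minute window is ~1.3%, above the 1% default.
print(f"{gc_time_ratio(10_000, 14_000, 300):.1%}")  # 1.3%
```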

To create custom alert rules based on your requirements, see Create an alert rule for a Prometheus instance.

Key metrics

Cluster and node information

| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| mcac_client_connected_native_clients | Major | Active CQL connections | High connection counts consume system resources and increase client latency. |
| mcac_table_live_disk_space_used_total | Major | Disk space used by Cassandra data | High disk usage increases access latency and risks storage exhaustion. |
| mcac_table_snapshots_size | Recommended | Snapshot file size | Large snapshots consume storage space and may prevent complete backups. |
| collectd_uptime | Major | Node uptime | Extended uptime without restarts may indicate missed security patches. |

Performance

| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| mcac_table_read_latency | Critical | Client read latency | High values directly degrade application read performance and user experience. |
| mcac_table_write_latency | Critical | Client write latency | High values directly degrade application write performance and user experience. |

Exceptions and errors

| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| mcac_client_request_timeouts_total | Critical | Timed-out client requests | Increasing timeouts indicate system overload. |
| mcac_client_request_failures_total | Critical | Failed client requests | Increasing failures indicate system overload. |
| mcac_dropped_message_dropped_total | Critical | Dropped messages | Dropped messages indicate that nodes cannot keep up with incoming requests. |

Caching and Bloom filters

| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| mcac_table_key_cache_hit_rate | Major | Key cache hit rate | Low hit rates cause more disk reads, increasing read latency. |
| mcac_table_row_cache_hit_total | Major | Row cache hits | Low hit counts may indicate ineffective caching for your workload. |
| mcac_table_row_cache_miss_total | Recommended | Row cache misses | High miss counts increase read latency due to disk lookups. |
| mcac_table_row_cache_hit_out_of_range_total | Recommended | Row cache hits that still require disk access | High values reduce the effectiveness of row caching. |
| mcac_table_bloom_filter_false_ratio | Major | Bloom filter false-positive rate | High false-positive rates waste query resources by checking SSTables that do not contain the requested data. |
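
Unlike the key cache, the row cache exposes hit and miss counters rather than a precomputed rate. A hedged sketch of deriving a hit rate from deltas of those counters over your evaluation window:

```python
# Derived from mcac_table_row_cache_hit_total and mcac_table_row_cache_miss_total;
# pass counter deltas over a window, not raw cumulative values, for a current rate.
def row_cache_hit_rate(hits: int, misses: int):
    """Return the hit fraction, or None if the cache saw no traffic."""
    total = hits + misses
    return hits / total if total else None

print(row_cache_hit_rate(900, 100))  # 0.9
```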

CPU, memory, and disk

| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| collectd_cpu_total | Critical | CPU utilization | Sustained high CPU usage increases request latency and may cause timeouts. |
| collectd_memory | Critical | Memory usage | High memory usage can trigger frequent garbage collection, reducing throughput. |
| collectd_df_df_complex | Critical | Disk usage | When disk space runs out, Cassandra cannot persist data and the node may crash. |

SSTable compaction

| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| mcac_table_pending_compactions | Major | Pending compaction tasks | A growing backlog increases read latency. Consider tuning compaction settings. |
| mcac_table_compaction_bytes_written_total | Major | Compaction throughput (bytes written) | Low throughput causes task backlog. Consider upgrading node hardware. |
| mcac_table_compression_ratio | Major | SSTable compression ratio | High ratios mean compressed files are still large, limiting the benefit of compression. |

Disk storage

| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| mcac_table_live_ss_table_count | Major | SSTable count | Excessive SSTables increase disk usage and read latency. Review your compaction strategy. |
| mcac_table_live_disk_space_used_total | Major | Disk space used by SSTables | High disk usage increases read/write latency. Review your compaction strategy. |
| mcac_table_ss_tables_per_read_histogram | Major | SSTables accessed per read | High values increase read latency. |
| mcac_commit_log_total_commit_log_size | Major | Commit log disk usage | Large commit logs reduce available disk space and extend recovery time. |
| mcac_table_memtable_live_data_size | Major | MemTable size | Oversized MemTables can degrade write performance and node stability. |
| mcac_table_waiting_on_free_memtable_space | Major | Time waiting for MemTable space | Extended waits degrade write performance and node stability. |

Thread pool status

| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| mcac_thread_pools_active_tasks | Critical | Active tasks | High counts indicate resource contention that can slow responses or crash the node. |
| mcac_thread_pools_total_blocked_tasks_total | Critical | Blocked tasks | Blocked tasks indicate thread pool saturation. Investigate resource bottlenecks. |
| mcac_thread_pools_pending_tasks | Critical | Pending tasks | A growing queue of pending tasks can lead to request timeouts and system instability. |
| mcac_thread_pools_completed_tasks | Major | Completed tasks | Reflects system throughput. Higher values indicate better performance. |

JVM

| Metric | Severity | Description | Operational guidance |
|---|---|---|---|
| mcac_jvm_memory_used | Critical | JVM heap memory used | Approaching the heap limit triggers frequent garbage collection, reducing application throughput. |
| mcac_jvm_gc_time | Critical | Time spent in garbage collection | High GC time reduces the time available for request processing, leading to timeouts or crashes. |