To help you monitor cluster operation, CloudMonitor provides multiple monitoring metrics for E-MapReduce clusters, including CPU idleness, memory capacity, and disk capacity. You can also set alarm rules for these metrics. After you purchase the E-MapReduce service, CloudMonitor automatically collects data for these metrics.

Monitoring service

  • Metrics
    Metric Dimension Unit Minimum monitoring granularity
    Inbound traffic rate User, cluster, and role bits/s 30s
    Outbound traffic rate User, cluster, and role bits/s 30s
    CPU idleness User, cluster, and role % 1 minute
    User-mode CPU usage User, cluster, and role % 30s
    System-mode CPU usage User, cluster, and role % 30s
    Idle disk capacity User, cluster, and role Bytes 30s
    Total disk capacity User, cluster, and role Bytes 30s
    Average load within 15 minutes User, cluster, and role - 30s
    Average load within 5 minutes User, cluster, and role - 30s
    Average load within 1 minute User, cluster, and role - 30s
    Idle memory capacity User, cluster, and role Bytes 30s
    Total memory capacity User, cluster, and role Bytes 30s
    Inbound data packet rate User, cluster, and role Packets/s 30s
    Outbound data packet rate User, cluster, and role Packets/s 30s
    Number of running processes User, cluster, and role Processes 30s
    Total number of processes User, cluster, and role Processes 30s
    Number of blocked processes User, cluster, and role Processes 30s
    Number of created processes/threads User, cluster, and role Processes/threads 30s
    MemNonHeapUsedM User, cluster, and role Bytes 30s
    MemNonHeapCommittedM User, cluster, and role Bytes 30s
    MemNonHeapMaxM User, cluster, and role Bytes 30s
    MemHeapUsedM User, cluster, and role Bytes 30s
    MemHeapCommittedM User, cluster, and role Bytes 30s
    MemHeapMaxM User, cluster, and role Bytes 30s
    MemMaxM User, cluster, and role Bytes 30s
    ThreadsNew User, cluster, and role - 30s
    ThreadsRunnable User, cluster, and role - 30s
    ThreadsBlocked User, cluster, and role - 30s
    ThreadsWaiting User, cluster, and role - 30s
    ThreadsTimedWaiting User, cluster, and role - 30s
    ThreadsTerminated User, cluster, and role - 30s
    GcCount User, cluster, and role - 30s
    GcTimeMillis User, cluster, and role - 30s
    CallQueueLength User, cluster, and role - 30s
    NumOpenConnections User, cluster, and role - 30s
    ReceivedBytes User, cluster, and role - 30s
    SentBytes User, cluster, and role - 30s
    BlockCapacity User, cluster, and role - 30s
    BlocksTotal User, cluster, and role - 30s
    CapacityRemaining User, cluster, and role - 30s
    CapacityTotal User, cluster, and role - 30s
    CapacityUsed User, cluster, and role - 30s
    CapacityUsedNonDFS User, cluster, and role - 30s
    CorruptBlocks User, cluster, and role - 30s
    ExcessBlocks User, cluster, and role - 30s
    ExpiredHeartbeats User, cluster, and role - 30s
    MissingBlocks User, cluster, and role - 30s
    PendingDataNodeMessageCount User, cluster, and role - 30s
    PendingDeletionBlocks User, cluster, and role - 30s
    PendingReplicationBlocks User, cluster, and role - 30s
    PostponedMisreplicatedBlocks User, cluster, and role - 30s
    ScheduledReplicationBlocks User, cluster, and role - 30s
    TotalFiles User, cluster, and role - 30s
    TotalLoad User, cluster, and role - 30s
    UnderReplicatedBlocks User, cluster, and role - 30s
    BlocksRead User, cluster, and role - 30s
    BlocksRemoved User, cluster, and role - 30s
    BlocksReplicated User, cluster, and role - 30s
    BlocksUncached User, cluster, and role - 30s
    BlocksVerified User, cluster, and role - 30s
    BlockVerificationFailures User, cluster, and role - 30s
    BlocksWritten User, cluster, and role - 30s
    BytesRead User, cluster, and role - 30s
    BytesWritten User, cluster, and role - 30s
    FlushNanosAvgTime User, cluster, and role - 30s
    FlushNanosNumOps User, cluster, and role - 30s
    FsyncCount User, cluster, and role - 30s
    VolumeFailures User, cluster, and role - 30s
    ReadBlockOpNumOps User, cluster, and role - 30s
    ReadBlockOpAvgTime User, cluster, and role ms 30s
    WriteBlockOpNumOps User, cluster, and role - 30s
    WriteBlockOpAvgTime User, cluster, and role ms 30s
    BlockChecksumOpNumOps User, cluster, and role - 30s
    BlockChecksumOpAvgTime User, cluster, and role ms 30s
    CopyBlockOpNumOps User, cluster, and role - 30s
    CopyBlockOpAvgTime User, cluster, and role ms 30s
    ReplaceBlockOpNumOps User, cluster, and role - 30s
    ReplaceBlockOpAvgTime User, cluster, and role ms 30s
    BlockReportsNumOps User, cluster, and role - 30s
    BlockReportsAvgTime User, cluster, and role ms 30s
    NodeManager_AllocatedContainers User, cluster, and role - 30s
    ContainersCompleted User, cluster, and role - 30s
    ContainersFailed User, cluster, and role - 30s
    ContainersIniting User, cluster, and role - 30s
    ContainersKilled User, cluster, and role - 30s
    ContainersLaunched User, cluster, and role - 30s
    ContainersRunning User, cluster, and role - 30s
    ActiveApplications User, cluster, and role - 30s
    ActiveUsers User, cluster, and role - 30s
    AggregateContainersAllocated User, cluster, and role - 30s
    AggregateContainersReleased User, cluster, and role - 30s
    AllocatedContainers User, cluster, and role - 30s
    AppsCompleted User, cluster, and role - 30s
    AppsFailed User, cluster, and role - 30s
    AppsKilled User, cluster, and role - 30s
    AppsPending User, cluster, and role - 30s
    AppsRunning User, cluster, and role - 30s
    AppsSubmitted User, cluster, and role - 30s
    AvailableMB User, cluster, and role - 30s
    AvailableVCores User, cluster, and role - 30s
    PendingContainers User, cluster, and role - 30s
    ReservedContainers User, cluster, and role - 30s
    Note
    • Monitoring data is preserved for at most 31 days.

    • You can view monitoring data for a maximum of 14 consecutive days.
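The alarm examples on this page refer to CPU usage and memory usage, while the metric list reports idleness and idle/total capacity. The following is a minimal sketch of how usage percentages could be derived from those metrics. The function names are hypothetical, and treating usage as the complement of the idle value is an assumption, not a formula documented by CloudMonitor.

```python
def cpu_usage_percent(cpu_idle_percent: float) -> float:
    """Derive overall CPU usage (%) from the CPU idleness metric (%).

    Assumption: usage is simply everything that is not idle.
    """
    if not 0.0 <= cpu_idle_percent <= 100.0:
        raise ValueError("CPU idleness must be between 0 and 100")
    return 100.0 - cpu_idle_percent


def used_percent(idle_bytes: float, total_bytes: float) -> float:
    """Derive the share of capacity in use (%) from an idle/total pair,
    for example idle disk capacity and total disk capacity (both in bytes).
    """
    if total_bytes <= 0:
        raise ValueError("total capacity must be positive")
    if not 0 <= idle_bytes <= total_bytes:
        raise ValueError("idle capacity must be between 0 and the total")
    return (total_bytes - idle_bytes) / total_bytes * 100.0


if __name__ == "__main__":
    print(cpu_usage_percent(25.0))   # CPU idleness of 25%
    print(used_percent(25, 100))     # 25 of 100 bytes idle
```

The same `used_percent` helper applies to the idle/total memory capacity pair.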

  • Viewing Monitoring Data
    1. Log in to the CloudMonitor console.
    2. Go to the E-MapReduce instance list under Cloud Service Monitoring.
    3. Click an instance name or click Monitoring Chart in the Action column to access the instance monitoring details page and view various metrics.
    4. Click a Time Range quick selection button at the top of the page, or select a custom time range. You can view monitoring data for up to 14 consecutive days.
    5. Click the Zoom In button in the upper-right corner of the monitoring chart to enlarge the chart.
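The two limits in this section (data is preserved for at most 31 days, and at most 14 consecutive days can be viewed at once) can be checked before requesting a time range. The sketch below is illustrative only: `check_time_range` and the constant names are hypothetical and are not part of any CloudMonitor API.

```python
from datetime import datetime, timedelta

MAX_VIEW_SPAN = timedelta(days=14)   # longest range viewable at once
RETENTION = timedelta(days=31)       # how long monitoring data is preserved


def check_time_range(start: datetime, end: datetime, now: datetime) -> timedelta:
    """Validate a requested time range and return its length.

    Raises ValueError if the range is empty, longer than 14 days,
    or starts before the 31-day retention window.
    """
    if end <= start:
        raise ValueError("end must be later than start")
    if end - start > MAX_VIEW_SPAN:
        raise ValueError("range exceeds 14 consecutive days")
    if start < now - RETENTION:
        raise ValueError("start is older than the 31-day retention period")
    return end - start


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    # A 10-day range that lies inside the retention window is accepted.
    print(check_time_range(datetime(2024, 1, 20), datetime(2024, 1, 30), now))
```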

Alarm service

  • Parameter descriptions 
    • Monitoring metrics: the monitoring metrics provided by the E-MapReduce service.

    • Statistical cycle: the interval at which the alarm system checks whether your monitoring data exceeds the alarm threshold. For example, if the statistical cycle of the alarm rule for memory usage is set to one minute, the system checks once every minute whether the memory usage exceeds the threshold.

    • Statistical method: the method used to determine whether the data exceeds the threshold. You can set the statistical method to the average, maximum, minimum, or sum.

      1. Average: the average value of the metric data within a statistical cycle. For example, with a 15-minute statistical cycle and an 80% threshold, the threshold is exceeded when the average of all monitoring data collected within the 15 minutes is over 80%.
      2. Maximum: the maximum value of the metric data within a statistical cycle. For example, with a 15-minute statistical cycle and an 80% threshold, the threshold is exceeded when the maximum of all monitoring data collected within the 15 minutes is over 80%.
      3. Minimum: the minimum value of the metric data within a statistical cycle. For example, with a 15-minute statistical cycle and an 80% threshold, the threshold is exceeded when the minimum of all monitoring data collected within the 15 minutes is over 80%.
      4. Sum: the sum of the metric data within a statistical cycle. For example, with a 15-minute statistical cycle, the threshold is exceeded when the sum of all monitoring data collected within the 15 minutes is over the threshold value. This statistical method is needed for traffic-based metrics.
    • Consecutive times: an alarm is triggered only when the value of the monitoring metric exceeds the threshold for the set number of consecutive statistical cycles.

      Example: An alarm rule is set to trigger when CPU usage exceeds 80%, with a statistical cycle of 5 minutes and 3 consecutive times. The first time CPU usage is detected above 80%, no alarm notification is sent. Five minutes later, the second detection also finds CPU usage above 80%, and still no alarm notification is sent. The alarm notification is sent only when the third detection also finds CPU usage above 80%. Therefore, the minimum time from when the data first exceeds the threshold to when the alarm rule is triggered is the statistical cycle × (the number of consecutive detections - 1), which is 5 × (3 - 1) = 10 minutes in this case.
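The parameters above can be illustrated with a small simulation. This is a minimal sketch of the described behavior, not CloudMonitor's actual implementation: the function names and the sample data are hypothetical, and each inner list is assumed to hold the samples collected in one statistical cycle.

```python
from statistics import mean

# The four statistical methods described above, keyed by name.
AGGREGATORS = {"average": mean, "maximum": max, "minimum": min, "sum": sum}


def aggregate(samples, method):
    """Reduce the samples of one statistical cycle to a single value."""
    return AGGREGATORS[method](samples)


def first_alarm_cycle(cycles, method, threshold, consecutive):
    """Return the 1-based index of the cycle in which the alarm fires,
    or None if the threshold is never exceeded often enough in a row."""
    streak = 0
    for index, samples in enumerate(cycles, start=1):
        if aggregate(samples, method) > threshold:
            streak += 1
        else:
            streak = 0  # one cycle below the threshold resets the count
        if streak >= consecutive:
            return index
    return None


def min_alarm_delay(cycle_minutes, consecutive):
    """Minimum minutes between the first exceeding detection and the alarm:
    statistical cycle * (consecutive detections - 1)."""
    return cycle_minutes * (consecutive - 1)


if __name__ == "__main__":
    # Hypothetical CPU usage samples (%), one inner list per 5-minute cycle.
    cpu_cycles = [[70, 75], [85, 90], [82, 88], [81, 95], [60, 65]]
    print(first_alarm_cycle(cpu_cycles, "average", 80, 3))
    print(min_alarm_delay(5, 3))
```

In this sample data, the average first exceeds 80% in the second cycle and stays above it through the fourth cycle, so the rule with 3 consecutive times fires in the fourth cycle, two cycles (10 minutes) after the first exceeding detection.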

  • Set an alarm rule
    1. Log in to the CloudMonitor console.
    2. Go to the E-MapReduce instance list under Cloud Service Monitoring.
    3. Click an instance name or click Monitoring Chart in the Action column to access the instance monitoring details page.
    4. Click the Bell button in the upper-right corner of the monitoring chart, or click New Alarm Rule in the upper-right corner of the page, to set an alarm rule for the corresponding monitoring metrics of this instance.