
Best practices for troubleshooting monitoring jobs

Last Updated: Jul 03, 2018

During daily operation and maintenance, a monitoring job may fail due to various causes. This topic describes the error scenarios, causes, and solutions to help you quickly fix the issues.

ARMS job handling procedure

In ARMS, a monitoring job has three phases:

  1. Data pulling: The ARMS computing cluster pulls data from data sources, such as Loghub and ECS Log.

  2. Data cleansing: After pulling the data, ARMS cleanses (or splits) it. The number of data records that are successfully cleansed is displayed.

  3. Data aggregation: After cleansing the data, ARMS aggregates it and persists the results in memory. The number of such operations is displayed.
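The three phases can be pictured with a minimal sketch. This is an illustration only, not ARMS code: the function names, the comma-separated log format, and the field names `code` and `latency_ms` are all assumptions made for this example.

```python
from collections import Counter


def pull(source):
    """Phase 1 (hypothetical): pull raw log lines from a data source."""
    return list(source)


def cleanse(lines):
    """Phase 2 (hypothetical): split each line into fields.

    Returns the successfully split records and the number of failed lines.
    """
    records, failed = [], 0
    for line in lines:
        parts = line.split(",")
        if len(parts) != 2:
            failed += 1  # this line could not be split into two fields
            continue
        records.append({"code": parts[0], "latency_ms": parts[1]})
    return records, failed


def aggregate(records):
    """Phase 3 (hypothetical): count records per code, in memory."""
    return Counter(record["code"] for record in records)


# Invented sample input: three well-formed lines and one malformed line.
raw = pull(["200,12", "500,30", "malformed", "200,8"])
records, failed = cleanse(raw)
totals = aggregate(records)
```

In this sketch, each phase yields a count (lines pulled, records cleansed or failed, aggregation results), which loosely mirrors the per-phase charts on the job running details page.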

After a job is created and started, ARMS displays the running details for each of the preceding phases. To view them, click the target job on the Jobs page under Custom Monitoring. The running details of the job are displayed, as shown in the following figure:

Data flow details

Each chart on this page displays the number of data lines processed by ARMS at the corresponding system time.

When an error occurs, you can perform diagnosis based on the information displayed on this page. Here are some typical cases.

Common process of error diagnosis

“No data” is displayed in all charts on the job running details page

Check the following items:

  1. If the monitoring job has just been started, wait for one or two minutes until data pulling is completed.
  2. No data may have been generated during the selected time range. In this case, click the clock icon on the panel, extend the time range, and check whether any historical data is available.

A red line appears in the chart

Normally, each of the three charts has only one blue line. If an error occurs, an extra red line appears in the chart. An error may occur in any of the three phases of a job. Click the exclamation point to view the details of a sample and identify the cause. The following two examples show how to diagnose an exception in data cleansing and in data aggregation.

Example 1: Diagnose a data cleansing exception

Symptom

Data splitting exception

When a red line appears, click the exclamation point to check the exception type. As shown in the following figure, a type conversion exception is thrown for the field with the code SG10100.

Error detail

Analysis

This typically happens with intelligent splitting: ARMS infers from the provided sample that a field is of the Long type, but the data generated later contains String values that cannot be converted to Long.
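The failure mode can be reproduced in a few lines. This is a minimal sketch and not the ARMS splitter: `split_as_long` and `try_split` are hypothetical helpers, the sample values are invented, and Python's `int()` stands in for a Long-typed key.

```python
def split_as_long(raw_value):
    """Convert a field value to an integer, as a Long-typed key would require."""
    return int(raw_value)


def try_split(raw_value):
    """Return ("ok", value) on success or an error tuple on failure."""
    try:
        return ("ok", split_as_long(raw_value))
    except ValueError as exc:
        # int() raises ValueError when the text is not a valid integer
        return ("type conversion exception", str(exc))


numeric = try_split("200")     # a value that looks like the sample used for inference
mixed = try_split("ERR200")    # a later value that contains letters
```

Here `numeric` converts cleanly, while `mixed` ends up in the error branch. Treating the field as a string, as the solution in this example does, avoids the conversion altogether.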

Solution

Pause the monitoring job and edit it. On the Log Cleansing page (step 2), select Custom Splitting and change the type of the code field from LongKey to StringKey. Then, save the settings and resume the monitoring job.

Splitter

After the monitoring job resumes, the subsequent values on the red line change to 0, which indicates that the splitting exception is fixed.

Example 2: Diagnose a data aggregation exception

Symptom

Continuing from the preceding example, the splitting exception has been fixed by adjusting the splitting model, but a data aggregation exception is now thrown.

Data aggregation exception

Click the exclamation point. The exception details indicate that a String cannot be converted into a Number, as shown in the following figure:

Aggregation exception

Analysis

Check whether any arithmetic operations were performed on the code field, which was previously a LongKey, when the datasets were created. In this example, one of the datasets performs a SUM operation on the “code” field, which had been added only to test whether the SUM function works. Because the field is now a String, it can no longer be summed.

Edit dataset
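The following sketch shows why the aggregation breaks once the field holds strings. It is an illustration under stated assumptions, not ARMS behavior: `sum_field`, the record layout, and the error message are hypothetical.

```python
def sum_field(records, field):
    """Sum a numeric field across records; reject non-numeric values."""
    total = 0
    for record in records:
        value = record[field]
        if not isinstance(value, (int, float)):
            # Arithmetic on a string-typed field is not meaningful
            raise TypeError(f"cannot convert {value!r} to a number")
        total += value
    return total


# Invented sample records: "code" is a string, "latency_ms" is numeric.
records = [
    {"code": "ERR200", "latency_ms": 12},
    {"code": "ERR500", "latency_ms": 30},
]

latency_total = sum_field(records, "latency_ms")

try:
    sum_field(records, "code")
    code_sum_error = None
except TypeError as exc:
    code_sum_error = str(exc)
```

Summing the numeric field works, while summing the string-typed field fails. Operations that do not require arithmetic, such as counting records, remain valid on a string field.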

Solution

Remove the SUM operation on the “code” field from the dataset. The exception is then fixed.
