During daily operation and maintenance, a monitoring job may fail for various reasons. This topic describes common error scenarios, their causes, and solutions to help you quickly fix the issues.
In ARMS, a monitoring job has three phases:
Data pulling: The ARMS computing cluster pulls data from data sources, such as Loghub and ECS Log.
Data cleansing: After pulling the data, ARMS cleanses (or splits) it. The number of data records that are successfully cleansed is displayed.
Data aggregation: After cleansing the data, ARMS aggregates and persists it in memory. The number of such operations is displayed.
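The three phases can be pictured with a minimal Python sketch. This is an illustration only, not the ARMS implementation: the function names, the `key=value` log format, and the `host` and `latency` fields are all hypothetical. It shows how each phase can report its own success and error counts, which is the kind of information the charts on the job details page present.

```python
from collections import defaultdict


def pull(source):
    """Phase 1 (hypothetical): pull raw lines from a data source (here, a list)."""
    return list(source)


def cleanse(lines, schema):
    """Phase 2 (hypothetical): split each line into typed fields.

    Returns the records that were cleansed successfully and the lines that failed.
    """
    records, errors = [], []
    for line in lines:
        try:
            raw = dict(pair.split("=", 1) for pair in line.split())
            records.append({key: cast(raw[key]) for key, cast in schema.items()})
        except (KeyError, ValueError) as exc:
            errors.append((line, exc))
    return records, errors


def aggregate(records, group_key, value_key):
    """Phase 3 (hypothetical): sum value_key per group_key in memory."""
    totals = defaultdict(int)
    for record in records:
        totals[record[group_key]] += record[value_key]
    return dict(totals)


lines = pull(["host=a latency=12", "host=b latency=7", "host=a latency=oops"])
records, errors = cleanse(lines, {"host": str, "latency": int})
totals = aggregate(records, "host", "latency")

# Per-phase counts: lines pulled, records cleansed, cleansing errors.
print(len(lines), len(records), len(errors), totals)
```

In this sketch, the third line fails during cleansing because `oops` is not an integer, so it is counted as an error and never reaches aggregation.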
For each of the preceding phases, ARMS displays the running details after a job is created and started. On the Jobs page under Custom Monitoring, click the target job to view its running details, as shown in the following figure:
Each chart on this page displays the number of data lines processed by ARMS at the corresponding system time.
When an error occurs, you can diagnose it based on the information displayed on this page. The following are some typical cases.
If the charts show no data, check the following items:
- If the monitoring job has just been started, wait one or two minutes for data pulling to complete.
- It’s possible that no data was generated during the selected time range. In this case, click the clock icon on the panel, extend the time range, and check whether any historical data is available.
Normally, each of the three charts has only one blue line. If an error occurs, an extra red line appears in the chart. An error may occur in any of the three phases of a job. You can click the exclamation point to view the sample details and identify the cause. The following examples show how to diagnose exceptions in data cleansing and data aggregation.
When a red line appears, click the exclamation point to check the exception type. As shown in the following figure, a type conversion exception is thrown for the field with the code SG10100.
A typical cause is that intelligent splitting infers the field to be of the Long type, while the generated data contains String values that cannot be converted to Long.
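The following Python sketch reproduces this kind of failure in miniature. It is not how ARMS infers types: `infer_type` is a hypothetical stand-in for intelligent splitting, and the sample and incoming values are chosen purely for illustration. The point is that a type inferred from a numeric-looking sample breaks as soon as a non-numeric value arrives in the same field.

```python
def infer_type(sample_value):
    """Guess a field type from one sample value (hypothetical stand-in)."""
    try:
        int(sample_value)
        return int
    except ValueError:
        return str


# The sample looks numeric, so the field is inferred as an integer type.
field_type = infer_type("10100")

# Later data contains a value that is not numeric (illustrative values).
incoming = ["10100", "SG10100"]
failures = []
for value in incoming:
    try:
        field_type(value)
    except ValueError:
        # The inferred integer type cannot represent this value.
        failures.append(value)

print(field_type.__name__, failures)
```

Declaring the field as a string type from the start avoids the conversion altogether, which is the approach taken in the fix that follows.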
Pause and edit the monitoring job. On the Log Cleansing page of step 2, select Custom Splitting and change the code field from LongKey to StringKey. Then, save the settings and resume the monitoring job.
After the monitoring job resumes, the subsequent values on the red line drop to 0, which indicates that the splitting exception is fixed.
Continuing from the preceding example, the splitting exception was fixed by adjusting the splitting model, but a data aggregation exception is now thrown.
Click the exclamation point. The exception indicates that a String cannot be converted to a Number, as shown in the following figure:
Check whether any arithmetic operations were performed on the code field (previously of the LongKey type) when the datasets were created. In this example, a SUM operation had been performed on the “code” field in one of the datasets to test whether the SUM function works.
Remove the SUM operation, and the exception is fixed.
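The aggregation failure can be illustrated with a short Python sketch. The record layout and the `sum_field` helper are hypothetical and do not reflect ARMS internals. The sketch shows that summing a numeric field works, while summing a field that now holds strings raises an error.

```python
# Illustrative records: "code" is a string field, "latency" is numeric.
records = [
    {"code": "SG10100", "latency": 12},
    {"code": "SG10200", "latency": 7},
]


def sum_field(rows, field):
    """Sum one field across all rows (hypothetical SUM aggregation)."""
    total = 0
    for row in rows:
        total += row[field]  # raises TypeError if the field holds strings
    return total


latency_total = sum_field(records, "latency")

try:
    sum_field(records, "code")
    code_error = None
except TypeError as exc:
    # Arithmetic on a string field fails, as in the aggregation exception.
    code_error = exc

print(latency_total, type(code_error).__name__)
```

After a field is changed to a string type, any dataset that still applies arithmetic to it needs to be updated as well.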