This topic describes best practices for log cleansing.
Purpose and features of data cleansing in ARMS
ARMS provides easy-to-use real-time monitoring solutions. Users' logs differ in format, and different logs contain different fields. Before ARMS can consume these logs, it must cleanse the data in them.
For example, consider a log row that contains a date and time, an item ID, and a transaction amount.
ARMS must learn that, in this row, the date and time is 2016-08-24-13:32:33, the item ID is abc, and the transaction amount is 100. After these attributes are cleansed, ARMS can use them for the corresponding scenarios in the subsequent aggregate computing orchestration.
ARMS models data cleansing as splitting log data into typed fields, such as time, string, and number fields.
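The idea of splitting a row into typed fields can be sketched in a few lines of Python. This is only an illustration, not how ARMS is implemented: the sample row is rebuilt from the values named above, and the vertical-bar separator, the field order, and the cleanse function are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical log row built from the values named above; the "|" separator
# and the field order are assumptions, not a documented ARMS format.
SAMPLE_ROW = "2016-08-24-13:32:33|abc|100"


def cleanse(row: str) -> dict:
    """Split one log row into a time field, a string field, and a number field."""
    raw_time, item_id, raw_amount = row.split("|")
    return {
        "time": datetime.strptime(raw_time, "%Y-%m-%d-%H:%M:%S"),
        "item_id": item_id,
        "amount": int(raw_amount),
    }


fields = cleanse(SAMPLE_ROW)
print(fields)
```

Once every row is reduced to fields like these, later aggregation steps (for example, summing the amount per item ID) no longer need to know the raw log layout.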
ARMS provides the following data cleansing modes:
Intelligent splitting: fast, convenient, and intelligent
This mode analyzes log samples that you provide and recommends a data cleansing and splitting solution. You can then adjust the recommended solution on the drag-and-drop manual splitting interface, which greatly reduces the time spent configuring manual splitting.
Manual splitting: manual, comprehensive, and controllable
You can configure the data cleansing and splitting logic on the drag-and-drop interface. A fully visualized configuration process allows you to complete most monitoring configuration tasks without any coding.
Suggestions on the log format
Keep the format consistent wherever possible
Make sure that logs from the same log source are in the same format wherever possible. If one log source contains multiple types of logs, use the filter function to select the required log content.
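The effect of filtering a mixed log source can be pictured with the following Python sketch. It does not show the ARMS filter function itself. The sample rows, the ORDER and HEARTBEAT markers, and the keep_order_rows helper are all hypothetical and exist only to illustrate the idea.

```python
# Hypothetical mixed log source: order rows and heartbeat rows share one file.
# The "ORDER" marker in the second field is an assumption for this illustration.
MIXED_LINES = [
    "2016-08-24 13:32:33|ORDER|abc|100",
    "2016-08-24 13:32:34|HEARTBEAT|ok",
    "2016-08-24 13:32:35|ORDER|def|250",
]


def keep_order_rows(lines):
    """Keep only the rows whose second field marks them as order logs."""
    return [line for line in lines if line.split("|")[1] == "ORDER"]


order_rows = keep_order_rows(MIXED_LINES)
print(order_rows)
```

After filtering, every remaining row has the same number of fields, so a single splitting rule applies to all of them.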
Use a recognizable timestamp
ARMS supports multiple common time formats. A timestamp must include the year, month, day, hour, minute, and second. The millisecond is optional.
In manual splitting, the timestamp supports all time formats that can be expressed as a SimpleDateFormat pattern, such as yyyy-MM-dd HH:mm:ss. For more information about the pattern syntax, see the official SimpleDateFormat documentation.
Intelligent splitting recognizes the following time formats:
- Common time formats that can be expressed as SimpleDateFormat patterns, for example, 2016-8-21 23:12:53.
- Nginx time formats, for example, 28/Nov/2014:11:56:09 +0800.
- Other common Java SimpleDateFormat formats. For more information, see Date formats in intelligent splitting.
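To check that a timestamp is machine-readable before you send logs, you can try parsing the two example timestamps above. The sketch below uses Python's strptime directives as a stand-in for SimpleDateFormat patterns. The two format strings are this example's own choices, not ARMS settings, and the month abbreviation is assumed to be parsed in an English (C) locale.

```python
from datetime import datetime

# Python strptime directives stand in for SimpleDateFormat patterns here.
# Both format strings are assumptions made for this sketch.
COMMON_FORMAT = "%Y-%m-%d %H:%M:%S"    # roughly yyyy-MM-dd HH:mm:ss
NGINX_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # roughly dd/MMM/yyyy:HH:mm:ss Z

# %m also accepts a month without a leading zero, such as "8".
common = datetime.strptime("2016-8-21 23:12:53", COMMON_FORMAT)
# %b assumes English month abbreviations; %z reads the +0800 offset.
nginx = datetime.strptime("28/Nov/2014:11:56:09 +0800", NGINX_FORMAT)

print(common.isoformat())
print(nginx.isoformat())
```

Both values carry the year, month, day, hour, minute, and second, which is the minimum the text above asks for. The Nginx value also keeps its time zone offset.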
Use unambiguous separators
Common separators include the space, vertical bar (|), semicolon (;), and comma (,).
Use a single type of separator so that the splitter can split the data more easily.
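The following Python sketch shows why the choice of separator matters. Both rows are hypothetical and carry the same three fields (time, item ID, and amount). The only difference is the separator, and plain string splitting stands in for the ARMS splitter.

```python
# Hypothetical rows carrying the same three fields: time, item ID, amount.
SPACE_ROW = "2016-08-24 13:32:33 abc 100"
PIPE_ROW = "2016-08-24 13:32:33|abc|100"

# The space also occurs inside the timestamp, so the time is cut in two.
space_fields = SPACE_ROW.split(" ")
# The vertical bar occurs only between fields, so each field stays whole.
pipe_fields = PIPE_ROW.split("|")

print(len(space_fields), space_fields)
print(len(pipe_fields), pipe_fields)
```

A separator that never appears inside a field value yields exactly one piece per field, so no extra step is needed to reassemble the timestamp.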
JSON format is recommended for strings and substrings
For log strings that contain JSON strings, it is easier to cleanse the data with a JSON splitter. The JSON strings in the log must be in standard JSON format: they must use quotation marks and must not contain line breaks.
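As a rough illustration of these requirements, the sketch below parses a hypothetical log row whose second field is a JSON string. The row layout, the itemId and amount keys, and the use of Python's json module are assumptions for the example and do not describe the ARMS JSON splitter.

```python
import json

# Hypothetical row: a timestamp, then a JSON payload on the same line.
ROW = '2016-08-24 13:32:33|{"itemId": "abc", "amount": 100}'

# Split only on the first vertical bar so the JSON payload stays intact.
raw_time, raw_json = ROW.split("|", 1)
payload = json.loads(raw_json)
print(raw_time, payload)

# Single-quoted pseudo-JSON is not standard JSON, so the parser rejects it.
try:
    json.loads("{'itemId': 'abc'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
print(single_quotes_ok)
```

Keeping the whole JSON string on one line also matters for any reader that treats each line as one log row. A payload spread over several lines would be seen as several incomplete rows.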
How to use custom splitting
For more information, see Use custom splitting.