The most comprehensive data: AppLogs are written by programmers and cover key locations, variables, and exceptions. In practice, over 90% of online bugs are located through AppLogs.
Arbitrary formats: One piece of code is often developed by more than one programmer. Every programmer has their own preferred formats, which are difficult to unify. Style inconsistency is also seen in logs introduced from third-party databases.
Common characteristics: Despite the arbitrary formats, different logs have things in common. For example, the following fields are required for Log4J logs:
- File or class
- Line number
Generally, AppLogs are larger than access logs by an order of magnitude. For example, suppose a website receives one million independent accesses every day, each access passes through 20 logic modules, and 10 main logic points in each module must be logged.
The total number of logs per day is then:
1,000,000 * 20 * 10 = 2 * 10^8
If each log is 200 bytes long, the daily storage size is:
2 * 10^8 * 200 = 4 * 10^10 bytes = 40 GB
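The same estimate can be written as a small calculation, which makes it easy to plug in your own figures. This is only a sketch: the function name `estimate_daily_volume` and its parameter names are illustrative, and the inputs are the example values from the text rather than measurements.

```python
def estimate_daily_volume(daily_accesses: int,
                          modules_per_access: int,
                          log_points_per_module: int,
                          bytes_per_log: int) -> tuple[int, float]:
    """Return (logs per day, storage per day in decimal GB)."""
    logs_per_day = daily_accesses * modules_per_access * log_points_per_module
    bytes_per_day = logs_per_day * bytes_per_log
    # Decimal gigabytes: 1 GB = 10^9 bytes, matching the arithmetic in the text.
    return logs_per_day, bytes_per_day / 10**9


# Example figures from the text.
logs_per_day, gb_per_day = estimate_daily_volume(
    daily_accesses=1_000_000,
    modules_per_access=20,
    log_points_per_module=10,
    bytes_per_log=200,
)
print(f"{logs_per_day:,} logs per day, about {gb_per_day:.0f} GB per day")
```

Doubling any single input doubles the result, which is why log volume grows quickly as modules and log points are added.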
The data grows as the business system becomes increasingly complex. It is common for a medium-sized website to have 100 - 200 GB of log data every day.
Most applications run in stateless mode under different frameworks, including:
- Docker (container)
- Function Compute (Container Service)
The number of instances varies from a few to thousands, which requires a cross-server log collection solution.
Programs are running in different environments, so logs are stored in various places, for example:
- Application logs are stored in Docker.
- API logs are stored in Function Compute.
- Old system logs are stored in traditional IDCs.
- Mobile-side logs are stored on users’ mobile devices.
- Mobile web logs are stored in browsers.
To have a complete picture of this data, we must unify and store data in a single place.
Purpose: to collect data from different sources into a central place to facilitate subsequent operations.
You can create a project in Log Service to store AppLogs. Log Service supports over 30 collection methods, such as tracking on physical servers, JS on the mobile web side, and outputting logs on servers. For more information, see Collection methods.
Apart from writing logs with methods such as the SDK, Log Service provides Logtail, a convenient, stable, and high-performance agent for server logs. Logtail is available for Windows and Linux. Once you have defined the machine group and set the log collection configuration, server logs can be collected in real time.
After the log collection configuration is complete, you can operate on logs in the project.
Compared with other log collection agents, such as Logstash, Flume, FluentD, and Beats, Logtail has the following advantages:
- Easy to use: Provides API access, remote management, and monitoring capabilities. Logtail builds on Alibaba Group’s extensive experience in collecting and managing logs from millions of servers, allowing you to configure a collection point for hundreds of thousands of devices in seconds.
- Adaptive to different environments: Logtail supports public networks, VPCs, and user-defined IDCs. The HTTPS and resumable data transfer functions make it possible to integrate with public network data.
- High performance with low resource consumption: After years of refinement, Logtail outperforms its open-source competitors in both performance and resource consumption. For more information, see Comparison tests.
Purpose: regardless of how much the data volume grows or how servers are deployed, guarantee that the time it takes to locate a problem stays constant.
For example, how do we locate an order error or a long-latency issue in the terabytes of data generated every week? The process also involves filtering and investigating based on various criteria.
For example, for AppLogs with latency details, we can search for requests with a latency of more than one second (the threshold in the query suggests that Latency is recorded in microseconds) and a method starting with Post:
Latency > 1000000 and Method=Post*
We can also search for logs that contain the keyword “error” but do not contain the keyword “merge”.
Results of one day, one week, or a longer timespan can be returned in less than one second.
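To make the two search conditions concrete, the following sketch applies the same criteria to a handful of in-memory records. It only illustrates what the conditions select; it is not how Log Service evaluates queries, which relies on indexes. The sample records, the `Message` field, the helper names, and the case-insensitive keyword match are all assumptions made for this example.

```python
def matches_latency_query(log: dict) -> bool:
    """Mirror of: Latency > 1000000 and Method=Post*"""
    latency = int(log.get("Latency", 0))
    method = log.get("Method", "")
    return latency > 1_000_000 and method.startswith("Post")


def matches_keyword_query(log: dict) -> bool:
    """Mirror of: contains "error" and does not contain "merge"."""
    # Assumption: keywords are matched case-insensitively against all field values.
    text = " ".join(str(value) for value in log.values()).lower()
    return "error" in text and "merge" not in text


# Hypothetical sample logs.
logs = [
    {"Latency": "1500000", "Method": "PostLogStoreLogs", "Message": "ok"},
    {"Latency": "800000", "Method": "PostLogStoreLogs", "Message": "ok"},
    {"Latency": "2000000", "Method": "GetLogs", "Message": "error while merge shards"},
    {"Latency": "300", "Method": "GetLogs", "Message": "error: timeout"},
]

slow_posts = [log for log in logs if matches_latency_query(log)]
errors_without_merge = [log for log in logs if matches_keyword_query(log)]

print(len(slow_posts), "slow Post request(s)")
print(len(errors_without_merge), "error log(s) without 'merge'")
```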
There are two types of association: intra-process association and cross-process association. The difference between the two types is as follows:
- Intra-process association: This is the simple type, because the earlier and later logs of a function are stored in the same file. In multi-thread cases, we can filter logs by ThreadID.
- Cross-process association: Normally, it is hard to find clear clues when dealing with logs from different processes. The association is generally performed by passing a TracerID in RPC calls.
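The two kinds of association can be sketched with Python's standard `logging` module. In this minimal example, every log line carries the thread ID (for intra-process filtering) and a trace ID that the caller passes to the callee (for cross-process association). The names `trace_id_var`, `frontend_handle`, and `backend_handle` are hypothetical, and the "RPC" is a plain function call standing in for a real remote call.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


def make_logger(name: str, stream: io.StringIO) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(name)s thread=%(thread)d trace=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


output = io.StringIO()  # stands in for the collected log files
frontend_log = make_logger("frontend", output)
backend_log = make_logger("backend", output)


def backend_handle(request: dict) -> None:
    # The callee restores the trace ID carried in the request payload.
    trace_id_var.set(request["trace_id"])
    backend_log.info("charging order %s", request["order_id"])


def frontend_handle(order_id: str) -> str:
    trace_id = uuid.uuid4().hex  # new trace ID for this request
    trace_id_var.set(trace_id)
    frontend_log.info("received order %s", order_id)
    # Stand-in for an RPC call: the trace ID travels with the request.
    backend_handle({"order_id": order_id, "trace_id": trace_id})
    return trace_id


trace_id = frontend_handle("order-42")
related = [line for line in output.getvalue().splitlines()
           if f"trace={trace_id}" in line]
print("\n".join(related))
```

Searching the collected output for one trace ID returns the lines from both components, which is the same lookup a keyword query would perform on the stored logs.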
Locate an error log with the keyword query in the Log Service console.
Click Context View, and see the preceding and following results.
- You can click OLD and NEW for more results.
- You can also filter the results by ThreadID to improve the filtering accuracy.
For more information, see Context query.
The concept of cross-process association, or Tracing, dates back to the well-known article Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, published by Google in 2010. Inspired by the article, developers in the open source community created many affordable versions of Tracer, for example:
- Dapper (Google): basis of different Tracers
- StackDriver Trace (Google): currently compatible with Zipkin
- Zipkin: an open source Tracing system by Twitter
- Appdash: Golang version
- Hawkeye: developed by Alibaba’s Middleware Technology Department
- X-Ray: introduced at AWS re:Invent 2016
Adopting a Tracer in a new system is easier than retrofitting one into an existing system, because adapting existing code is costly and challenging.
Based on Log Service, we can now implement a basic tracing feature: each module outputs associative fields such as Request_id and OrderID in its logs, and we search for those fields across the various log stores.
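The association step can be sketched as follows. To stay independent of any particular SDK, the log stores here are plain in-memory lists; in a real setup each list would be the result of a query against one log store. The store names, the sample records, the `time_ms` field, and the `trace_request` helper are assumptions for illustration only.

```python
# Hypothetical query results, one list per log store.
log_stores = {
    "frontend": [
        {"time_ms": 1000, "Request_id": "req-1", "msg": "received checkout request"},
        {"time_ms": 1005, "Request_id": "req-2", "msg": "received search request"},
    ],
    "payment": [
        {"time_ms": 1090, "Request_id": "req-1", "msg": "payment timeout"},
    ],
    "order": [
        {"time_ms": 1040, "Request_id": "req-1", "msg": "order created"},
    ],
}


def trace_request(stores: dict, request_id: str) -> list[dict]:
    """Collect the logs that share one Request_id and order them by time."""
    hits = []
    for store_name, logs in stores.items():
        for log in logs:
            if log.get("Request_id") == request_id:
                # Remember which store each log came from.
                hits.append({"store": store_name, **log})
    return sorted(hits, key=lambda hit: hit["time_ms"])


timeline = trace_request(log_stores, "req-1")
for hit in timeline:
    print(hit["time_ms"], hit["store"], hit["msg"])
```

Sorting by timestamp turns the separate query results into a single call timeline, which is the data a frontend page would render.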
For example, we can use SDKs to query the logs of frontend servers, backend servers, payment systems, and order systems. After the results are obtained, we can create a frontend page that associates the cross-process calls. The tracing system built on Log Service is as follows:
After the specific log is located, we can analyze the logs further, for example by counting the types of online error logs.
Querying logs by __level__ finds 2,720 errors within one day.
To identify the distinct error types, we can aggregate the data by the file and line fields:
__level__:error | select __file__, __line__, count(*) as c group by __file__, __line__ order by c desc
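For readers less familiar with the SQL part of the query, the aggregation is equivalent to counting error logs per (file, line) pair and sorting by count in descending order. The sketch below does this over a few in-memory records; the sample records and the function name `count_errors_by_location` are hypothetical, and the exact string comparison on `__level__` is a simplification of the keyword match.

```python
from collections import Counter


def count_errors_by_location(logs: list[dict]) -> list[tuple[tuple[str, str], int]]:
    """Group error logs by (__file__, __line__), most frequent first."""
    counts = Counter(
        (log["__file__"], log["__line__"])
        for log in logs
        if log.get("__level__") == "error"
    )
    return counts.most_common()


# Hypothetical sample logs.
logs = [
    {"__level__": "error", "__file__": "order.cpp", "__line__": "120"},
    {"__level__": "error", "__file__": "pay.cpp", "__line__": "45"},
    {"__level__": "info", "__file__": "order.cpp", "__line__": "88"},
    {"__level__": "error", "__file__": "order.cpp", "__line__": "120"},
]

distribution = count_errors_by_location(logs)
for (file_name, line_no), count in distribution:
    print(f"{file_name}:{line_no} -> {count}")
```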
Then, the distribution of all types and locations of errors can be found:
In addition, we can locate IPs and analyze logs by fields such as error codes and high latency. For more information, see Best practices of Log analysis.
You can back up the logs to OSS, IA (with a lower storage cost), or MaxCompute. For more information, see Log Shipper.
Alarms can be configured in the following ways:
- Saving a log query as a scheduled task in Log Service and triggering alarms based on the results. For more information, see Set alarms.
- Implementing the CloudMonitor Log Alarm feature. For more information, see Use CloudMonitor to consume LogHub logs.
You can grant different permissions to your team members by setting sub-accounts or groups. For more information, see Grant RAM sub-accounts permissions to access Log Service.
Price and cost: The AppLog solution mainly uses the LogHub and LogSearch features of Log Service. Compared with an open source solution, it is easier to use, costs only 25% as much, and improves development efficiency.