An online service generates a large number of operational logs and access logs, which contain information about errors that occurred, triggered alerts, and user behaviors. Generally, such logs are stored in text files, which are readable and can be used to quickly locate issues in routine O&M. However, after a service generates a large number of logs, it is necessary to store and analyze the logs in a more advanced way to explore the value of the log data.

This topic takes the access logs of a web service as an example to describe how to use MongoDB to store and analyze logs to make full use of the log data. The methods and operations described in this topic also apply to other types of log storage services.

An example of the access logs of a web server is provided as follows. Typically, an access log contains information about the IP address used for the access, the user who accessed the target resource, the operating system and browser that the user used for the access, the endpoint of the target resource, and the result of the access.

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
		

You can use MongoDB to store each access log in a single document. The format of a document in MongoDB is as follows:

{
    _id: ObjectId('4f442120eb03305789000000'),
    line: '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
}
		

The preceding method is easy to configure. However, it may bring inconvenience to the analysis of log data. As MongoDB is not a service targeted at text analysis, we recommend that you convert the format of a log to extract each field and its value from the log before you store it in MongoDB as a document. As shown in the following document, the preceding log is converted to separate fields and values:

{
    _id: ObjectId('4f442120eb03305789000000'),
    host: "127.0.0.1",
    logname: null,
    user: 'frank',
    time: ISODate("2000-10-10T20:55:36Z"),
    path: "/apache_pb.gif",
    request: "GET /apache_pb.gif HTTP/1.0",
    status: 200,
    response_size: 2326,
    referer: "http://www.example.com/start.html",
    user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}
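
The conversion described above can be sketched in plain JavaScript. This is only an illustration that assumes the logs use the Apache combined log format shown earlier. The names parseAccessLog and parseLogTime are hypothetical helpers, not part of MongoDB.

```javascript
// One line in the Apache "combined" log format:
// host logname user [time] "request" status size "referer" "user agent"
const COMBINED_LOG =
  /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$/;

const MONTHS = {
  Jan: 0, Feb: 1, Mar: 2, Apr: 3, May: 4, Jun: 5,
  Jul: 6, Aug: 7, Sep: 8, Oct: 9, Nov: 10, Dec: 11,
};

// Convert a time such as "10/Oct/2000:13:55:36 -0700" to a Date (an instant in UTC).
function parseLogTime(text) {
  const m = /^(\d{2})\/(\w{3})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})$/.exec(text);
  if (!m || !(m[2] in MONTHS)) return null;
  const [, day, mon, year, hh, mm, ss, sign, offH, offM] = m;
  const localMs = Date.UTC(+year, MONTHS[mon], +day, +hh, +mm, +ss);
  const offsetMs = (+offH * 60 + +offM) * 60000 * (sign === "-" ? -1 : 1);
  // Local time = UTC + offset, so UTC = local time - offset.
  return new Date(localMs - offsetMs);
}

// Convert one log line to a document with separate fields.
// Returns null for lines that do not match the expected format.
function parseAccessLog(line) {
  const m = COMBINED_LOG.exec(line);
  if (!m) return null;
  const [, host, logname, user, time, request, status, size, referer, userAgent] = m;
  const dash = (value) => (value === "-" ? null : value); // "-" means "not available"
  return {
    host,
    logname: dash(logname),
    user: dash(user),
    time: parseLogTime(time),
    path: request.split(" ")[1] || null,
    request,
    status: Number(status),
    response_size: size === "-" ? null : Number(size),
    referer: dash(referer),
    user_agent: userAgent,
  };
}

const sampleLine =
  '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 ' +
  '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"';
const parsed = parseAccessLog(sampleLine);
console.log(parsed);
```

In the mongo shell, a JavaScript Date is stored as a date value, so the parsed document could be passed to db.events.insert() as is.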
		

When you convert the format of a log, you can also drop the fields that you do not need for data analysis to save storage space. In the preceding document, for example, the logname, user, request, status, and response_size fields can be removed if your analysis does not use them. You could also remove the time field, because the ObjectId value in the _id field embeds the time when the document was created, which is close to the access time if logs are written as they are generated. However, we recommend that you retain the time field: it states the access time explicitly, query statements that use it are easier to write, and a date value (8 bytes) requires less storage space than an ObjectId value (12 bytes). Based on the preceding reasons, the updated content of the document may be as follows:

{
    _id: ObjectId('4f442120eb03305789000000'),
    host: "127.0.0.1",
    time: ISODate("2000-10-10T20:55:36Z"),
    path: "/apache_pb.gif",
    referer: "http://www.example.com/start.html",
    user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}
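
To see what time information the _id field actually carries, the following sketch decodes the timestamp embedded in an ObjectId. It assumes the ObjectId is available as a 24-character hexadecimal string. The function name objectIdTime is made up for this example.

```javascript
// The first 4 bytes (8 hex digits) of an ObjectId hold the creation time
// of the document as seconds since the Unix epoch.
function objectIdTime(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const created = objectIdTime("4f442120eb03305789000000");
console.log(created.toISOString());
```

For the sample document, the embedded time is the moment the ObjectId was generated, not the access time recorded in the log (10 October 2000). This is one more reason to keep an explicit time field. In the mongo shell, ObjectId values also provide a getTimestamp() method for the same purpose.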
		

Write logs to MongoDB

A log storage service must be able to ingest a large number of logs in a short time. To balance write throughput against durability, you can specify a write concern for each write operation. For example, the following insert operation specifies the {w: 0} write concern:

db.events.insert({
    host: "127.0.0.1",
    time: ISODate("2000-10-10T20:55:36Z"),
    path: "/apache_pb.gif",
    referer: "http://www.example.com/start.html",
    user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}, {
    writeConcern: {w: 0}
})

Note
  • If you require the highest write throughput, you can set the write concern to {w: 0}. In this case, MongoDB does not acknowledge the write operation, so a failed write may go unnoticed.
  • If the log data is of great importance, for example, the log data is used as the basis for service billing, you can set the write concern to {w: 1} or {w: "majority"}, which is safer than {w: 0} at the cost of higher write latency.

To improve the efficiency of the write operation, you can write multiple logs to MongoDB with a single request. The format of the request is as follows:

db.events.insert([doc1, doc2, ...])
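
One way to prepare such batched requests is to buffer logs in the application and split the buffer into fixed-size batches. The following sketch shows only the batching step. The helper name toBatches, the batch size of 1000, and the sample documents are assumptions for illustration.

```javascript
// Split an array of log documents into batches of at most `size` documents.
function toBatches(docs, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size));
  }
  return batches;
}

// Hypothetical buffer of 2500 parsed log documents.
const buffered = Array.from({ length: 2500 }, (_, i) => ({
  host: "127.0.0.1",
  path: `/page/${i}`,
}));

const batches = toBatches(buffered, 1000);
console.log(batches.map((batch) => batch.length)); // [ 1000, 1000, 500 ]

// In the mongo shell, each batch could then be written with a single request:
//   batches.forEach((batch) => db.events.insert(batch, { writeConcern: { w: 1 } }));
```

A larger batch size reduces the number of round trips, but a single failed request then affects more logs, so the size is a trade-off to tune for your workload.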

Query logs in MongoDB

After logs are stored in MongoDB by using the preceding method, you can query logs in MongoDB based on different query requirements.

Query the logs of all requests to access /apache_pb.gif

q_events = db.events.find({'path': '/apache_pb.gif'})

If you need to frequently query such access logs, you can create an index on the path field to improve query efficiency.

db.events.createIndex({path: 1})

Query the logs of all requests within a day

q_events = db.events.find({'time': { '$gte': ISODate("2016-12-19T00:00:00.00Z"),'$lt': ISODate("2016-12-20T00:00:00.00Z")}})
			

You can create an index on the time field to improve query efficiency.

db.events.createIndex({time: 1})
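
If you run this kind of query for many different days, you can generate the filter instead of typing the boundaries by hand. The following sketch assumes that days are counted in UTC. The function name dayFilter is hypothetical.

```javascript
// Build a filter that matches every log whose time falls within one UTC day.
// `day` is a string in the YYYY-MM-DD format.
function dayFilter(day) {
  const start = new Date(`${day}T00:00:00.000Z`);
  if (Number.isNaN(start.getTime())) {
    throw new RangeError(`invalid day: ${day}`);
  }
  const end = new Date(start.getTime() + 24 * 60 * 60 * 1000);
  // Half-open range: the start is included and the end is excluded.
  return { time: { $gte: start, $lt: end } };
}

const filter = dayFilter("2016-12-19");
console.log(filter.time.$gte.toISOString(), filter.time.$lt.toISOString());

// In the mongo shell: db.events.find(filter)
```

The half-open range ($gte with $lt) ensures that a log recorded exactly at midnight is counted in only one day.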

Query the logs of all requests from a host over a period of time

q_events = db.events.find({
    'host': '127.0.0.1',
    'time': {'$gte': ISODate("2016-12-19T00:00:00.00Z"), '$lt': ISODate("2016-12-20T00:00:00.00Z")}
})
			

Similarly, you can use the aggregation pipeline or the map-reduce operations provided by MongoDB to run more complex queries for data analysis. Note that map-reduce is deprecated as of MongoDB 5.0 in favor of the aggregation pipeline. We recommend that you create indexes on the fields that your queries filter or sort on to improve query efficiency.
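
As an illustration of a more complex query, the following sketch builds an aggregation pipeline that counts the requests for each path within a time range and returns the most requested paths first. The function name topPathsPipeline and the output field name hits are assumptions made for this example.

```javascript
// Build a pipeline that counts requests per path within [start, end)
// and keeps the `limit` most requested paths.
function topPathsPipeline(start, end, limit) {
  return [
    // Filter first so that an index on the time field can be used.
    { $match: { time: { $gte: start, $lt: end } } },
    // One output document per path, with the number of matching logs.
    { $group: { _id: "$path", hits: { $sum: 1 } } },
    { $sort: { hits: -1 } },
    { $limit: limit },
  ];
}

const pipeline = topPathsPipeline(
  new Date("2016-12-19T00:00:00Z"),
  new Date("2016-12-20T00:00:00Z"),
  10
);
console.log(JSON.stringify(pipeline, null, 2));

// In the mongo shell: db.events.aggregate(pipeline)
```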

Data sharding

As the number of service nodes that generate logs increases, the write and storage capabilities of a log storage service are challenged and a higher level of capabilities is required. In this case, you can use the sharding method provided by MongoDB to distribute the log data across multiple shards. When you use the sharding method, you need to focus on choosing shard keys.

Use a field indicating the timestamp as the shard key

You can use a field that indicates the timestamp, such as the _id field that contains ObjectId values or the time field, as the shard key. The following issues may occur in this type of sharding:

  • As the timestamp increases monotonically, newly collected log data is always written to the same shard. Thus, adding shards does not enhance the write capability of MongoDB.
  • Many log queries target the latest log data, which is stored on only a few shards. As a result, these queries are served by only those shards, and the query load is not distributed.

Hashed sharding

Hashed sharding typically uses the _id field as the shard key. This sharding method evenly distributes log data across the shards, so the write capability of MongoDB grows almost linearly with the number of shards. However, hashed sharding distributes data randomly. As a result, MongoDB cannot efficiently process range queries, which are often used in data analysis. To process such a query, MongoDB must send it to all the shards and merge the returned data into the final result.
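
The difference between the two methods can be illustrated with a small simulation. This is a simplified model, not MongoDB's implementation: the four shards, the split points, and the FNV-1a hash function are assumptions chosen for the example, and MongoDB uses its own hash function for hashed shard keys.

```javascript
// FNV-1a: a small, deterministic 32-bit string hash (stand-in only).
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

const SHARD_COUNT = 4;
const SPLIT_POINTS = [1000, 2000, 3000]; // 3 split points define 4 contiguous ranges

// Ranged sharding on a timestamp: each shard owns one contiguous range.
// The last range is open-ended, so ever-growing keys always land there.
function rangedShard(timestamp) {
  let shard = 0;
  while (shard < SPLIT_POINTS.length && timestamp >= SPLIT_POINTS[shard]) shard++;
  return shard;
}

// Hashed sharding: the shard is chosen from the hash of the key.
function hashedShard(key) {
  return fnv1a(String(key)) % SHARD_COUNT;
}

const rangedCounts = new Array(SHARD_COUNT).fill(0);
const hashedCounts = new Array(SHARD_COUNT).fill(0);
for (let t = 3000; t < 4000; t++) { // 1000 new logs with increasing timestamps
  rangedCounts[rangedShard(t)]++;
  hashedCounts[hashedShard(t)]++;
}

console.log("ranged:", rangedCounts); // every write lands on the last shard
console.log("hashed:", hashedCounts); // writes are spread over all shards
```

The same model also shows the cost of hashing: the logs of one contiguous time range end up on every shard, so a range query on that field has to contact all of them.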

Ranged sharding

Assume that values of the path field in the preceding example are evenly distributed and many queries are based on the path field. Then, you can specify the path field as the shard key to divide data into contiguous ranges. This method has the following benefits:

  • Write requests are evenly distributed to each shard.
  • Query requests based on the path field are routed to only one or a few shards, which improves the query efficiency.

In addition, the following issues may occur:

  • If a value of the path field occurs very frequently, all logs with that shard key value are stored in the same chunk on the same shard. The chunk may grow very large because documents with the same shard key value cannot be split into separate chunks.
  • If the path field has only a few distinct values, access logs cannot be evenly distributed across the shards.

To fix these issues, you can add a field to the shard key. For example, if the original shard key is {path: 1}, you can add the ssk field as follows:

{path: 1, ssk: 1}

You can assign a random value to the ssk field, such as the hash value of the _id field. You can also assign a timestamp to the ssk field so that documents with the same path value are sorted by time.

In this way, the shard key has many distinct values, and no single shard key value occurs at an extremely high frequency. Each of the preceding sharding methods has its own advantages and disadvantages. You can select a method based on your business requirements.
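
The effect of the additional field can be checked with a short sketch. It assumes that ssk is derived from a hash of the _id value and limited to 64 buckets. The hash function, the bucket count, and the sample _id strings are made up for illustration.

```javascript
// FNV-1a string hash, used here only as a stand-in for "the hash value of _id".
function fnv1aHash(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Return a copy of the document with a hypothetical ssk field added.
function withSsk(doc, buckets) {
  return { ...doc, ssk: fnv1aHash(doc._id) % buckets };
}

// 1000 logs that all request the same hot path.
const hotLogs = Array.from({ length: 1000 }, (_, i) => ({
  _id: `id-${i}`,
  path: "/index.html",
}));

// Distinct shard key values with {path: 1} alone.
const pathOnlyKeys = new Set(hotLogs.map((d) => d.path));

// Distinct shard key values with {path: 1, ssk: 1}.
const compoundKeys = new Set(
  hotLogs.map((d) => withSsk(d, 64)).map((d) => `${d.path}|${d.ssk}`)
);

console.log(pathOnlyKeys.size); // 1: every log shares one shard key value
console.log(compoundKeys.size); // more than 1: the logs can be split across chunks
```

With the path field alone, all 1000 logs share a single shard key value and must stay in one chunk. With the compound key, the same logs have many distinct key values that can be split and balanced across shards.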

Solutions to data growth

MongoDB provides the sharding feature for you to store massive amounts of data. However, the storage costs increase as the data volume grows. Typically, the value of log data decreases over time. Data generated one year ago, or even three months ago, may no longer be useful for analysis and can be cleared to reduce storage costs. MongoDB provides the following solutions to meet such a requirement.

TTL indexes

Time to live (TTL) indexes are special single-field indexes that MongoDB can use to automatically remove documents from a collection after a certain amount of time. In the preceding example, the time field indicates the time when the request was sent. You can create a TTL index on the time field and specify that MongoDB removes the document after 30 hours as follows:

db.events.createIndex( { time: 1 }, { expireAfterSeconds: 108000 } )

Note After you create a TTL index, a single-threaded background task removes expired documents. By default, this task runs every 60 seconds. If a large amount of log data is written to MongoDB, many documents expire within a short period of time and the task may not be able to remove them in time. Expired documents that have not yet been removed continue to occupy storage space.
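
The expiration rule can be written out as a small calculation. The following sketch only models the rule in plain JavaScript and does not reproduce how MongoDB schedules the removal. The function names expiresAt and isExpired are hypothetical.

```javascript
const EXPIRE_AFTER_SECONDS = 30 * 60 * 60; // 30 hours = 108000 seconds

// A document becomes eligible for removal once the value of the indexed
// date field plus expireAfterSeconds has passed.
function expiresAt(doc, expireAfterSeconds) {
  return new Date(doc.time.getTime() + expireAfterSeconds * 1000);
}

function isExpired(doc, now, expireAfterSeconds) {
  return now.getTime() >= expiresAt(doc, expireAfterSeconds).getTime();
}

const logDoc = { time: new Date("2000-10-10T20:55:36Z") };
console.log(EXPIRE_AFTER_SECONDS); // 108000
console.log(expiresAt(logDoc, EXPIRE_AFTER_SECONDS).toISOString()); // 2000-10-12T02:55:36.000Z
```

Because the background task runs periodically, a document may remain in the collection for some time after the calculated expiration time.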

Capped collections

If you do not need to retain logs for a strict period of time but want to limit the storage space, you can use capped collections to store log data. Capped collections work in the following way: You specify a maximum storage size and, optionally, a maximum number of stored documents for a capped collection. After one of the limits is reached, MongoDB automatically removes the oldest documents in the collection to make room for new ones.

db.createCollection("events", {capped: true, size: 104857600000})
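
The removal behavior can be pictured with a tiny in-memory model. The model limits the collection by document count only, which is a simplification, because a real capped collection is limited primarily by its size in bytes. The class name CappedModel is made up for this example.

```javascript
// A minimal model of a capped collection that is limited by document count.
class CappedModel {
  constructor(maxDocs) {
    this.maxDocs = maxDocs;
    this.docs = []; // kept in insertion order, oldest first
  }

  insert(doc) {
    this.docs.push(doc);
    // Once the limit is exceeded, the oldest document is removed first.
    if (this.docs.length > this.maxDocs) {
      this.docs.shift();
    }
  }
}

const model = new CappedModel(3);
for (let seq = 1; seq <= 5; seq++) {
  model.insert({ seq });
}
console.log(model.docs.map((d) => d.seq)); // [ 3, 4, 5 ]
```

In MongoDB, a document-count limit corresponds to the max option of db.createCollection(), which is specified together with the size option.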

Archive documents by collection or database periodically

At the end of a month, you can rename the collection that stores the documents of that month and create a new collection to store documents of the next month. We recommend that you append the information about the year and month to the collection name. For example, the 12 collections that store the documents written in each month of 2016 are named as follows:

events-201601
events-201602
events-201603
events-201604
...
events-201612

If you want to clear the documents of a specific month, you can directly delete the corresponding collection.

db["events-201601"].drop()
db["events-201602"].drop()

The query statements may be complex if you want to query the documents of multiple months, because each monthly collection must be queried and the queried data must be merged to return the final result.
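
The naming scheme and the multi-month lookup can be sketched as follows. The example assumes that months are counted in UTC and that collections follow the events-YYYYMM pattern shown above. The function names collectionFor and collectionsBetween are hypothetical.

```javascript
// Name of the monthly collection that stores a log with the given time.
function collectionFor(date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `events-${year}${month}`;
}

// Names of all monthly collections that overlap the range [start, end).
function collectionsBetween(start, end) {
  const names = [];
  // Begin at the first day of the month that contains `start`.
  const cursor = new Date(Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), 1));
  while (cursor < end) {
    names.push(collectionFor(cursor));
    cursor.setUTCMonth(cursor.getUTCMonth() + 1);
  }
  return names;
}

const monthly = collectionsBetween(
  new Date("2016-11-15T00:00:00Z"),
  new Date("2017-01-10T00:00:00Z")
);
console.log(monthly); // [ 'events-201611', 'events-201612', 'events-201701' ]

// In the mongo shell, the results of each collection could then be merged:
//   monthly.flatMap((name) => db.getCollection(name).find(filter).toArray())
```

Here, filter stands for whatever query condition you would otherwise pass to a single collection.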