
Store log data with MongoDB

Last Updated: Jan 17, 2018

Online services usually generate large volumes of run logs and access logs that contain information such as errors, alarms, and user behavior. Services record log information as readable text to facilitate troubleshooting. However, to find the information you need in the massive volume of logs generated, you must store and analyze the log data.

This document uses web server access logs as an example to describe how to make the most of log data by storing and analyzing it with ApsaraDB for MongoDB. The approach also applies to other types of log storage applications.

Log storage method design

The following is a typical web server access log entry, which records the access source, the user, the requested resource, the outcome of the request, and the user's operating system and browser type.

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

The simplest way to store such logs is to store each log line in a separate document. In ApsaraDB for MongoDB, a row-based log is stored as follows:

{
    _id: ObjectId('4f442120eb03305789000000'),
    line: '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
}

A disadvantage of this storage method is that it makes data analysis more complex, because ApsaraDB for MongoDB is not designed for text analysis. A better method is to extract the individual fields from each log line before storing it as a MongoDB document. The preceding log can be converted to a document with separate fields:

{
    _id: ObjectId('4f442120eb03305789000000'),
    host: "127.0.0.1",
    logname: null,
    user: 'frank',
    time: ISODate("2000-10-10T20:55:36Z"),
    path: "/apache_pb.gif",
    request: "GET /apache_pb.gif HTTP/1.0",
    status: 200,
    response_size: 2326,
    referrer: "http://www.example.com/start.html",
    user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}

You can filter out unneeded fields to save storage space. For example, if the user, request, and status fields are not relevant to your analysis, they do not need to be stored. You also do not have to store the time field separately, because the ObjectId already contains time information. (However, keeping the time field records the exact time at which the request was generated and makes query statements easier to construct; store it as a date type rather than a string to minimize storage space.) Based on these considerations, the log is finally stored as follows:

{
    _id: ObjectId('4f442120eb03305789000000'),
    host: "127.0.0.1",
    time: ISODate("2000-10-10T20:55:36Z"),
    path: "/apache_pb.gif",
    referer: "http://www.example.com/start.html",
    user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}

Write logs

The log storage service must support highly concurrent writes of massive numbers of logs. You can customize writeConcern to control the log write behavior. A single log entry is written as follows:

db.events.insert({
    host: "127.0.0.1",
    time: ISODate("2000-10-10T20:55:36Z"),
    path: "/apache_pb.gif",
    referer: "http://www.example.com/start.html",
    user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
})

Note:

  • To achieve the maximum write throughput, set writeConcern to {w: 0}.

  • For important logs (such as logs used as billing evidence), you can set writeConcern to {w: 1} or {w: "majority"} for greater durability; see the sketch after this note.
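For example, the following is a minimal sketch that passes the per-operation write concern in the options document of the insert; the field values are taken from the earlier example:

// Write one log entry with an explicit write concern
// ({w: 0} maximizes throughput, {w: "majority"} maximizes durability)
db.events.insert(
    {
        host: "127.0.0.1",
        time: ISODate("2000-10-10T20:55:36Z"),
        path: "/apache_pb.gif",
        referer: "http://www.example.com/start.html",
        user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
    },
    { writeConcern: { w: 1 } }
)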

For better write efficiency, you can write multiple logs in one network request by using a batch write. The format of a batch write is as follows:

db.events.insert([doc1, doc2, ...])
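A minimal sketch of such a batch write (the second log entry is an invented example, and the ordered option is a suggestion):

// Insert two log entries in a single request; ordered: false lets the
// remaining documents be written even if one of them fails
db.events.insert([
    { host: "127.0.0.1", time: ISODate("2000-10-10T20:55:36Z"), path: "/apache_pb.gif" },
    { host: "10.0.0.2",  time: ISODate("2000-10-10T20:55:37Z"), path: "/index.html" }
], { ordered: false })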

Query logs

You can query logs stored in ApsaraDB for MongoDB according to your needs.

Query all /apache_pb.gif access requests

q_events = db.events.find({'path': '/apache_pb.gif'})

If the query is performed frequently, you can index the path field to improve query efficiency.

db.events.createIndex({path: 1})

Query all requests from a specific day

q_events = db.events.find({'time': {'$gte': ISODate("2016-12-19T00:00:00.00Z"), '$lt': ISODate("2016-12-20T00:00:00.00Z")}})

You can index the time field to improve query efficiency.

db.events.createIndex({time: 1})

Query all requests on a host within a time period

q_events = db.events.find({
    'host': '127.0.0.1',
    'time': {'$gte': ISODate("2016-12-19T00:00:00.00Z"), '$lt': ISODate("2016-12-20T00:00:00.00Z")}
})
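If this combined condition is queried frequently, a compound index on host and time can serve it. The field order below follows the usual equality-before-range guideline and is a suggestion rather than part of the original example:

// Compound index: equality match on host, range match on time
db.events.createIndex({host: 1, time: 1})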

For complex query analysis, ApsaraDB for MongoDB provides the aggregation framework and MapReduce. You can create proper indexes to improve query efficiency.
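For example, the following aggregation is a minimal sketch that counts the requests per path for one day (the date range is reused from the queries above):

// Count the requests per path on 2016-12-19, most visited paths first
db.events.aggregate([
    { $match: { time: { $gte: ISODate("2016-12-19T00:00:00Z"), $lt: ISODate("2016-12-20T00:00:00Z") } } },
    { $group: { _id: "$path", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])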

Data partition

As the number of service nodes that write logs increases, the log storage service must scale both its write capability and its storage capacity. ApsaraDB for MongoDB provides the sharding feature to distribute log data across multiple shards; the key point is selecting a proper shard key.

Timestamp-based sharding

Sharding on a timestamp (for example, an _id field of the ObjectId type, or the time field) has the following problems:

  • Because timestamps increase sequentially, newly written logs are always routed to the same shard; as a result, the write capability does not scale as shards are added.
  • Many log queries target the latest data, which resides on only a few shards, so query load is also concentrated on those shards.

Sharding based on random fields

Hashed sharding based on the _id field distributes data and writes evenly across shards, so the write capability grows linearly with the number of shards. The problem with hashed sharding is that data is distributed randomly: range queries (which are common in data analysis) must retrieve data from all shards and then combine the results, which lowers query efficiency.
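A minimal sketch of enabling hashed sharding on the _id field (the database name logdb is an assumption):

// Enable sharding on the database, then shard the collection on a hashed _id
sh.enableSharding("logdb")
sh.shardCollection("logdb.events", { _id: "hashed" })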

Sharding based on evenly distributed keys

If the values of the path field are evenly distributed and many queries are grouped by this field, you can shard based on the path field. The advantages are:

  • Write requests are evenly routed to all shards.
  • Path-based query requests are routed to one or more shards, improving query efficiency.

The disadvantages are:

  • If one path is accessed far more often than the others, all of its documents fall into the same chunk. Such an oversized chunk cannot be split across shards and can easily become an access hotspot.
  • If the path field has only a few distinct values, data cannot be evenly distributed across the shards.

To solve these problems, you can include an additional factor in the shard key. For example, the shard key {path: 1} becomes {path: 1, ssk: 1}, where ssk may be a random value (for example, a hash of _id) or a timestamp; with a timestamp, documents that share the same path are still sorted by time.

The additional factor makes the shard key values more diverse, so that they do not contain a large number of duplicates and data can be spread across shards. You can select a sharding method according to your needs.
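A minimal sketch of sharding on such a compound key (the database name logdb is an assumption; the ssk field must be set by the application when each log is written, for example to a random value or a timestamp):

// Shard on path plus the additional ssk field
sh.shardCollection("logdb.events", { path: 1, ssk: 1 })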

Address the data growth challenge

Sharding supports the storage of massive data, but storage costs keep increasing as more data accumulates. Log data typically loses value over time; for example, historical data from three months or even one year ago may no longer have analytic value and does not need to be kept. ApsaraDB for MongoDB offers several ways to handle this.

TTL indexing

The TTL index feature of ApsaraDB for MongoDB automatically removes documents after a specified period. For example, if a TTL index is created on the time field (the time at which the request was generated), documents are automatically removed 30 hours after that time.

db.events.createIndex( { time: 1 }, { expireAfterSeconds: 108000 } )

NOTE: Expired documents are removed by a single background thread that runs periodically (every 60 seconds by default). If logs are written at a very high rate, expired documents may accumulate faster than the background task can remove them, so storage space may not be freed in time.

Capped collection

Capped collections are suitable for scenarios that have no strict requirement on how long logs are kept but do have a storage space limit. When a capped collection reaches its maximum size or maximum document count, ApsaraDB for MongoDB removes the oldest documents.

db.createCollection("event", {capped: true, size: 104857600000})
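A capped collection can also limit the document count with the max option. A sketch with illustrative numbers:

// Cap the collection at roughly 100 GB or 1 billion documents, whichever limit is reached first
db.createCollection("event", {capped: true, size: 104857600000, max: 1000000000})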

Periodic archiving by collection or database

ApsaraDB for MongoDB supports renaming collections, so you can archive the event collection periodically. For example, at the end of every month you can rename the current collection to include that month in its name and create a new collection for new events (see the sketch after the following list). Logs from 2016 then end up in the following 12 collections:

events-201601
events-201602
events-201603
events-201604
....
events-201612
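A minimal sketch of the monthly archiving step (the target name is an example for December 2016; a subsequent insert into db.events implicitly creates a new, empty collection):

// At the end of the month, archive the current collection under a month-specific name
db.events.renameCollection("events-201612")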

To clear historical data, remove the collections directly.

  1. db["events-201601"].drop()
  2. db["events-201602"].drop()

A query that spans several months is relatively complex, because the application has to query multiple collections and then merge the results.
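A minimal sketch of such a cross-collection query (the collection names and the path filter are illustrative):

// Collect matching documents from the November and December collections into one array
var results = [];
["events-201611", "events-201612"].forEach(function (name) {
    db[name].find({ path: "/apache_pb.gif" }).forEach(function (doc) {
        results.push(doc);
    });
});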
