The Log Service LogShipper function can ship log data to OSS, Table Store, MaxCompute, and other storage services, which in turn enables offline computing with E-MapReduce (Spark and Hive) or MaxCompute.
A data warehouse with offline computing complements real-time computing, but the two serve different purposes:
|Mode|Pros|Cons|Scope of application|
|----|----|----|---------------------|
|Real-time computing|Fast|Supports only simple computations|Mainly used for incremental computation in monitoring and real-time analysis.|
|Offline computing (data warehouse)|Accurate and powerful|Relatively slow|Mainly used for full computation in BI, data statistics, and comparison.|
To meet today's data analysis requirements, the same data set often must be processed by both real-time computing and the data warehouse (offline computing). Take access logs as an example:
- Display a real-time dashboard using StreamCompute: current PV, UV, and carrier information.
- Run a detailed analysis over the full data set every night to compute growth, year-on-year/month-on-month comparisons, and Top-N rankings.
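The difference between the two modes can be illustrated with a small sketch (the log records and field names below are made up for illustration): incremental computation updates counters as each record arrives, while full computation recomputes over the complete data set.

```python
# Simulated access-log records: (timestamp, user_id, url)
logs = [
    (1, "u1", "/index"),
    (2, "u2", "/index"),
    (3, "u1", "/post"),
    (4, "u3", "/index"),
]

# Real-time (incremental) mode: update PV/UV state as each record arrives.
pv = 0
uv_users = set()
for ts, user, url in logs:
    pv += 1
    uv_users.add(user)

# Offline (full) mode: recompute the same metrics over the whole data set.
full_pv = len(logs)
full_uv = len({user for _, user, _ in logs})

assert pv == full_pv        # 4 page views
assert len(uv_users) == full_uv  # 3 unique visitors
```

The incremental version answers "what is the PV right now" cheaply, but anything beyond simple aggregates (for example, year-on-year comparisons) is far easier to express as a full recomputation in the warehouse.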
In the Internet industry, there are two classic models:
- Lambda Architecture: As data arrives, it is both processed as a stream and saved into the data warehouse. When a query is issued, results are served by real-time computing or offline computing depending on the query's conditions and complexity.
- Kappa Architecture: A Kafka-based architecture. Offline computing is de-emphasized: all data is stored in Kafka, and all queries are served by real-time computing.
The Log Service provides a model closer to the Lambda Architecture.
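At query time, the Lambda Architecture merges the batch (offline) view with the speed (real-time) view. A minimal sketch, assuming a batch view computed by last night's warehouse job and a speed view covering the increment since then (all names and numbers are illustrative):

```python
def lambda_query(batch_view, speed_view, key):
    """Merge the batch view with the speed view for one key.

    The batch view covers data up to the last offline run; the speed
    view covers everything that arrived after it.
    """
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Batch layer: PV totals from last night's offline job.
batch_view = {"/index": 10_000}
# Speed layer: PV accumulated since that run.
speed_view = {"/index": 42}

assert lambda_query(batch_view, speed_view, "/index") == 10_042
```

The merge step is what lets the architecture stay both fresh (speed layer) and accurate (batch layer), at the cost of maintaining the computation in two places.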
Create a LogStore first, then configure LogShipper in the console to enable the data warehouse connection. The following services are currently supported:
- OSS (Massive Object Storage Service)
- TableStore (NoSQL Data Storage Service)
LogShipper provides the following features:
- Quasi-real-time: Data reaches the warehouse within minutes.
- Enormous data volume: No need to worry about concurrency.
- Retry on error: Automatic retry, or API-based manual retry, in case of faults.
- Task API: Query the log shipping status for different time frames through the API.
- Auto compression: Data is compressed to reduce storage and bandwidth usage.
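The retry-on-error behavior can be sketched as follows. This is not the Log Service API itself; `ship_once` is a hypothetical callable standing in for one shipping attempt, used only to illustrate the automatic-retry semantics:

```python
def ship_with_retry(ship_once, max_retries=3):
    """Retry a shipping attempt until it succeeds or retries run out.

    `ship_once` is a hypothetical stand-in for one LogShipper task run;
    the real service retries automatically and also lets you re-trigger
    failed tasks manually through the task API.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return ship_once()
        except IOError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the fault

# Simulated task that fails twice with a transient error, then succeeds.
attempts = {"n": 0}
def flaky_ship():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient shipping failure")
    return "success"

assert ship_with_retry(flaky_ship) == "success"
assert attempts["n"] == 3
```

In practice you do not write this loop yourself; the point is that transient faults are absorbed by the service, and only persistent failures need manual intervention via the task API.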
Suppose A maintains a forum, and part of his job is to audit and run offline analysis on all of the forum's access logs.
- Department G wants A to retain user visit data for the past 180 days, and to provide the access logs for any given period of time on demand.
- The operation team must prepare an access log report on a quarterly basis.
Using Log Service (LOG) to collect log data from the servers, A turns on the log shipping (LogShipper) function so that the Log Service automatically collects, ships, and compresses the logs. When an audit is required, the logs within the relevant time frame can be authorized to a third party. For offline analysis, a 30-minute E-MapReduce job suffices, getting both jobs done at minimal cost.
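Serving "the access logs within a given period of time" is straightforward when shipped objects are partitioned by date. A minimal sketch, assuming a hypothetical `bucket/year/month/day/` key layout (the key names below are made up, not the actual LogShipper path format):

```python
from datetime import date

# Hypothetical OSS object keys, partitioned by shipping date.
objects = [
    "forum-logs/2017/01/01/access_0.gz",
    "forum-logs/2017/01/02/access_0.gz",
    "forum-logs/2017/03/15/access_0.gz",
]

def objects_in_range(keys, start, end):
    """Select the log objects whose date partition falls in [start, end]."""
    selected = []
    for key in keys:
        y, m, d = key.split("/")[1:4]
        when = date(int(y), int(m), int(d))
        if start <= when <= end:
            selected.append(key)
    return selected

january = objects_in_range(objects, date(2017, 1, 1), date(2017, 1, 31))
assert january == objects[:2]
```

Because the time frame maps directly to object prefixes, A can grant a third-party auditor read access to exactly those objects and nothing else.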
As an open source software enthusiast, B prefers to use Spark for data analysis. His requirements are as follows:
- Collect logs from the mobile client using API.
- Conduct real-time log analysis using Spark Streaming and collect statistics on online user visits.
- Use Hive to conduct T+1 offline analysis.
- Grant downstream agencies access to the log data for analysis in other dimensions.
A combination of LOG+OSS+EMR+RAM fulfills all of these requirements.