The data shipping feature of Log Service can ship log data to storage services, such as Object Storage Service (OSS), Tablestore, or MaxCompute. After the log data is shipped, you can use E-MapReduce (Spark or Hive) or MaxCompute to perform offline computing on the log data.

Offline computing (data warehousing)

Offline computing (data warehousing) is a supplement to real-time computing. The two modes are used for different scenarios.

Mode | Advantage | Disadvantage | Scenario
--- | --- | --- | ---
Real-time computing | Fast | Simple | Mainly used for incremental data computation in monitoring and real-time analysis
Offline computing (data warehousing) | Accurate and powerful | Relatively slow | Mainly used for full data computation in business intelligence (BI), data measurement, and data comparison

To meet data analysis requirements, we recommend that you perform both real-time computing and offline computing (data warehousing) on the same set of data. For example, you can perform the following operations on access logs:

  • Use stream computing to display real-time data, including the current page view (PV), unique visitor (UV), and operator information, on dashboards.
  • Conduct detailed analysis of the full data every night to obtain information, such as the data growth, year-on-year or month-on-month growth, and top ranking data.
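The nightly full-data pass described above boils down to simple aggregations over the complete log set. A minimal Python sketch of such a report (the `url` and `user_id` field names are illustrative, not a Log Service schema):

```python
from collections import Counter

def daily_report(logs):
    """Aggregate a full day's access logs into PV, UV, and top pages.

    `logs` is an iterable of dicts with illustrative fields `url` and
    `user_id`; a real pipeline would read these from shipped log files
    instead of an in-memory list.
    """
    pv = 0
    visitors = set()
    page_hits = Counter()
    for entry in logs:
        pv += 1                          # every request counts toward PV
        visitors.add(entry["user_id"])   # distinct users count toward UV
        page_hits[entry["url"]] += 1
    return {
        "pv": pv,
        "uv": len(visitors),
        "top_pages": page_hits.most_common(3),
    }

logs = [
    {"user_id": "u1", "url": "/index"},
    {"user_id": "u1", "url": "/item/42"},
    {"user_id": "u2", "url": "/index"},
]
report = daily_report(logs)
print(report["pv"], report["uv"])  # 3 2
```

The same aggregation, run over the full shipped data set each night, yields the growth and top-ranking figures mentioned above.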

The following typical data computing architectures are available:

  • Lambda architecture: After data is collected, the system dispatches the data to both a stream computing layer and a storage layer, where it is kept in data warehouses. When you run a query, the results of real-time computing or offline computing are returned based on the query conditions and complexity.
  • Kappa architecture: A Kafka-based architecture that removes the offline computing layer. All data is stored in Apache Kafka and processed by stream computing.

Log Service provides an architecture that is similar to the Lambda architecture.

LogHub and LogShipper for both real-time and offline computing

After a Logstore is created, you must configure LogShipper in the Log Service console to support connection to data warehouses. The following data warehouse services are supported:

  • OSS: large-scale object storage service.
  • Tablestore: NoSQL data storage service that stores data collected by Log Service. For more information, see Log data shipping.
  • MaxCompute: big data computing service.

Offline computing scenarios

LogShipper provides the following features:

  • Near real-time shipping: ships data to data warehouses within minutes.
  • Robust processing capability: supports high concurrency.
  • Automatic retry mechanism: automatically retries the data shipping task when faults occur. You can also call the API to manually retry the shipping task.
  • Support for API operations: allows you to obtain the status of log shipping tasks in different periods of time.
  • Automatic compression: compresses data to reduce storage usage.
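The automatic retry behavior can be pictured as a bounded retry loop with exponential backoff. A generic sketch, assuming a hypothetical `ship_batch` callable rather than the actual Log Service API:

```python
import time

def ship_with_retry(ship_batch, batch, max_attempts=5, base_delay=1.0):
    """Call the hypothetical ship_batch(batch) until it succeeds,
    backing off exponentially between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return ship_batch(batch)
        except IOError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

# Simulated shipper that fails twice, then succeeds.
calls = {"n": 0}
def flaky_ship(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary shipping failure")
    return "ok"

result = ship_with_retry(flaky_ship, [], base_delay=0.01)
print(result)  # ok
```

Log Service applies this kind of retry automatically; the API operations mentioned above let you trigger a manual retry when the automatic attempts are exhausted.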

Typical scenario 1: log audit

Jame is responsible for maintaining a forum and needs to conduct audits and offline analysis of access logs on the forum.

  • Department G requires Jame to retain records of user visits for the last 180 days and to provide the access logs generated within any specified period of time.
  • The operations team is required to prepare an access log report on a quarterly basis.

Jame uses Log Service to collect log data from the servers and enables the data shipping feature. This way, Log Service automatically collects, ships, and compresses logs. When an audit is required, a third party can be authorized to access the logs in the specified period of time. To conduct offline analysis, Jame uses E-MapReduce to run a 30-minute offline task. This way, the log audit and offline analysis are performed at a low cost. Jame can also use Data Lake Analytics (DLA) to analyze the log data.
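An offline pass of this kind amounts to scanning the compressed shipped files for a period and aggregating them. A simplified Python sketch, assuming a gzip-compressed JSON-lines layout (the actual file format and directory layout depend on your OSS shipping configuration):

```python
import gzip
import json
import os
import tempfile

def count_requests(log_dir):
    """Scan gzip-compressed JSON-lines log files under log_dir and
    count requests per status code (an illustrative field name)."""
    counts = {}
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".gz"):
            continue
        with gzip.open(os.path.join(log_dir, name), "rt") as f:
            for line in f:
                entry = json.loads(line)
                code = entry.get("status", "unknown")
                counts[code] = counts.get(code, 0) + 1
    return counts

# Build a sample shipped file, then run the scan over it.
tmp = tempfile.mkdtemp()
with gzip.open(os.path.join(tmp, "part-0000.gz"), "wt") as f:
    f.write(json.dumps({"status": 200}) + "\n")
    f.write(json.dumps({"status": 404}) + "\n")
    f.write(json.dumps({"status": 200}) + "\n")
counts = count_requests(tmp)
print(counts)  # {200: 2, 404: 1}
```

In practice, E-MapReduce (Hive or Spark) or DLA runs the equivalent scan in parallel directly against the OSS bucket.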

Typical scenario 2: real-time and offline analysis of log data

Alice is an open source software enthusiast and prefers to use Spark for data analysis. Alice has the following requirements:

  • Collect logs from mobile devices by using APIs.
  • Conduct real-time log analysis by using Spark Streaming and collect statistics from online user visits.
  • Conduct offline analysis in T+1 mode by using Hive.
  • Grant downstream agencies access permissions on the logs for analysis in a variety of dimensions.

The combination of Log Service, OSS, E-MapReduce or DLA, and Resource Access Management (RAM) can meet these requirements.
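Alice's real-time requirement, online visitor statistics, reduces to windowed unique-visitor counting. A minimal tumbling-window sketch in plain Python (a stream engine such as Spark Streaming applies the same grouping continuously over a live stream; the event shape is illustrative):

```python
from collections import defaultdict

def window_uv(events, window_seconds=60):
    """Group (timestamp, user_id) events into tumbling windows and
    count the unique visitors in each window."""
    windows = defaultdict(set)
    for ts, user in events:
        # Align each event to the start of its window.
        window_start = ts // window_seconds * window_seconds
        windows[window_start].add(user)
    return {start: len(users) for start, users in sorted(windows.items())}

events = [(0, "u1"), (10, "u2"), (10, "u1"), (65, "u1")]
uv_by_window = window_uv(events)
print(uv_by_window)  # {0: 2, 60: 1}
```

The T+1 Hive job covers the same data exhaustively once per day, while the windowed counts serve the real-time dashboards.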