This topic describes typical scenarios for Message Queue for Apache Kafka.
Website activity tracking
Operating a website successfully requires close tracking and analysis of user behavior. By using Message Queue for Apache Kafka, you can collect website activity data in real time, including user behaviors such as webpage browsing and searches, and implement the following functions based on the publish/subscribe pattern:
- Publish messages to different topics based on the type of the business data.
- Deliver subscribed messages in real time and use the message streams for real-time monitoring and service analysis, or load them into offline data warehouse systems such as Hadoop and MaxCompute for offline processing and service reporting.
Message Queue for Apache Kafka suits this scenario for the following reasons:
- High throughput: High throughput is required to handle the large volume of behavior data generated by all users of the website.
- Elastic scaling: Website activity can cause sharp increases in behavior data, and the cloud platform can be scaled out quickly on demand.
- Big data analysis: Message Queue for Apache Kafka can connect to real-time stream computing engines such as Storm and Spark, as well as offline data warehouse systems such as Hadoop and MaxCompute.
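The idea of publishing messages to different topics based on the type of business data can be sketched as follows. This is a minimal, in-memory illustration rather than a real Kafka client: the `InMemoryBroker` class, the `track` helper, and the topic names are all hypothetical and exist only to show the routing step.

```python
from collections import defaultdict

# Hypothetical mapping from event type to topic name (illustrative names only).
TOPIC_BY_EVENT_TYPE = {
    "page_view": "web-page-views",
    "search": "web-searches",
}
DEFAULT_TOPIC = "web-other-activity"


def topic_for(event):
    """Pick a topic based on the type of the business data."""
    return TOPIC_BY_EVENT_TYPE.get(event.get("type"), DEFAULT_TOPIC)


class InMemoryBroker:
    """Stand-in for a broker: an append-only list of messages per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)


def track(broker, events):
    """Publish each activity event to the topic that matches its type."""
    for event in events:
        broker.publish(topic_for(event), event)
```

With one topic per data type, a monitoring consumer can subscribe only to page views while a reporting job reads searches, without either one filtering out unrelated events.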
Log aggregation
Many platforms, such as Taobao and Tmall, generate a large number of logs every day. These logs are usually streaming data, such as page views (PVs) and search engine query records. Compared with log-oriented systems such as Scribe and Flume, Message Queue for Apache Kafka offers higher efficiency, longer data persistence, and shorter end-to-end response times (RT). These benefits make Message Queue for Apache Kafka well suited to log collection:
- Message Queue for Apache Kafka abstracts away file details. Logs from multiple hosts or applications are treated as streams of log or event messages and sent asynchronously to the Message Queue for Apache Kafka cluster, which greatly reduces the RT.
- The Message Queue for Apache Kafka client batches and compresses messages before submitting them, without noticeably increasing the performance overhead of the producer.
- Consumers can use offline warehouse systems such as Hadoop and MaxCompute and real-time online analysis systems such as Storm and Spark to perform statistical analysis of logs.
Using Message Queue for Apache Kafka for log collection has the following benefits:
- Application and analysis decoupling: Message Queue for Apache Kafka serves as a bridge between an application system and an analysis system and decouples the two systems.
- High scalability: Message Queue for Apache Kafka is scalable. When the data size increases, you can add nodes to quickly scale out your application.
- Online and offline analysis systems: Message Queue for Apache Kafka supports real-time online analysis systems and offline analysis systems such as Hadoop.
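The batching and compression described above can be illustrated with a small sketch. It is an assumption-laden stand-in, not the client's actual implementation: `BatchingProducer`, `decode_batch`, and the `send` callback are hypothetical names, and JSON plus `zlib` are used only to keep the example self-contained. In a real Apache Kafka producer, this behavior is governed by settings such as `batch.size`, `linger.ms`, and `compression.type`.

```python
import json
import zlib


class BatchingProducer:
    """Buffers log lines and hands them to `send` as one compressed batch."""

    def __init__(self, send, batch_size=100):
        self._send = send              # hypothetical callback that ships bytes
        self._batch_size = batch_size  # number of records per batch
        self._buffer = []

    def log(self, host, line):
        """Queue one log line; the caller does not wait for a network send."""
        self._buffer.append({"host": host, "line": line})
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Compress the buffered records and send them as a single payload."""
        if not self._buffer:
            return
        payload = zlib.compress(json.dumps(self._buffer).encode("utf-8"))
        self._send(payload)
        self._buffer = []


def decode_batch(payload):
    """Consumer side: decompress a batch back into individual records."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))
```

Because records from many hosts share one payload, the per-message cost of a send is amortized over the whole batch, and repetitive log text tends to compress well.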
Stream computing
In many fields, such as stock market trend analysis, meteorological data monitoring and control, and website user behavior analysis, data is generated in real time at such a volume that it is difficult to collect and store all of it in a database before processing it. Traditional data processing architectures therefore cannot meet user needs.
Unlike traditional architectures, Message Queue for Apache Kafka combined with stream computing engines such as Storm, Samza, and Spark can solve these problems efficiently. The stream computing model captures and processes data in real time as it flows, computes and analyzes the data based on service requirements, and then saves or distributes the results to the relevant components.
Using Message Queue for Apache Kafka for stream computing has the following benefits:
- Data flow: Message Queue for Apache Kafka serves as a bridge between an application system and an analysis system and decouples the two systems.
- High scalability: Message Queue for Apache Kafka is highly scalable to cope with the huge amount of data generated in real time.
- Stream computing engine: Message Queue for Apache Kafka can connect to open-source Storm, Samza, and Spark, and Alibaba Cloud products such as E-MapReduce, Blink, and Realtime Compute.
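To make the stream computing model more concrete, the sketch below counts events per key in fixed time windows and emits each result as soon as its window closes, instead of storing everything first. It is a simplified illustration, not how Storm, Samza, or Spark work internally: the function name `windowed_counts` is hypothetical, and the code assumes that events arrive in non-decreasing timestamp order.

```python
from collections import Counter


def windowed_counts(events, window_seconds=60):
    """Yield (window_start, counts) each time a tumbling window closes.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Assumption: timestamps arrive in non-decreasing order.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        # Align the timestamp to the start of its window.
        start = int(timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete: emit its result downstream.
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Emit the last, possibly partial, window when the input ends.
        yield current_start, dict(counts)
```

Because the function is a generator, a downstream component can save or forward each window's result while later events are still arriving.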
Data transfer hub
Over the past 10 years, dedicated systems such as key-value stores (HBase), search engines (Elasticsearch), stream processing engines (Storm, Spark Streaming, and Samza), and time series databases (OpenTSDB) have emerged. Each of these systems is designed to solve a single problem, which makes it easy and cost-effective to build distributed systems on commodity hardware.
Generally, the same dataset needs to be fed into multiple dedicated systems. For example, application logs that are used for offline analysis may also need to be searchable one log at a time. However, it is impractical to build an independent workflow that collects each type of data and imports it into its own dedicated system. In this case, you can use Message Queue for Apache Kafka as a data transfer hub to import the same data record into different dedicated systems.
Using Message Queue for Apache Kafka as a data transfer hub has the following benefits:
- High-capacity storage: Message Queue for Apache Kafka can store a large amount of data on commodity hardware and implement a horizontally scalable distributed system.
- One-to-many consumption model: Based on the publish/subscribe pattern, the same dataset can be consumed multiple times.
- Real-time and batch processing: Message Queue for Apache Kafka persists data locally and uses the page cache, so it can deliver messages to consumers for real-time and batch processing at the same time without performance loss.
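The one-to-many consumption model can be sketched as an append-only log in which each consumer group keeps its own read position, so every dedicated system sees the full dataset at its own pace. This is a conceptual, in-memory model only: `TopicLog`, `poll`, and the group names are hypothetical and do not reflect the actual client API.

```python
class TopicLog:
    """Append-only log; each consumer group tracks its own read offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer group name -> next offset to read

    def append(self, record):
        """Producer side: records are stored once, regardless of readers."""
        self._records.append(record)

    def poll(self, group, max_records=10):
        """Return the next records for `group` and advance only its offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch
```

For example, a hypothetical `search-indexer` group can read records one small batch at a time while an `offline-warehouse` group reads the same records later in bulk, and neither affects what the other receives.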