
On Kafka Stream: A lightweight stream computing mode

Posted: Sep 14, 2016 9:47 AM
Confluent Inc. (a company founded by Kafka's original authors from LinkedIn) announced in June that Kafka Stream would launch as part of Kafka 0.10.
Given that Storm, Spark, Samza, and the recently popular Flink already cover stream computing, why does Kafka invest the effort to build a stream computing system of its own? What advantages does Kafka Stream have over these frameworks? And since the Consumer Group already encapsulates lightweight Kafka consumption, isn't that enough?
What is stream computing? Stream computing is a type of continuous computing. Compare the three common computing modes:
1. Single: such as HTTP, where one request is sent and one response is returned.

2. Batch: a group of jobs is submitted to the computer and a group of results is returned; the advantage is reduced I/O wait time.

3. Stream: asynchronous batch processing, with no explicit boundaries between tasks.

What are the general stream computing methods?
A simple DIY implementation
Take wordcount as an example. We can start a server and keep a HashMap in memory: split each input line into words, then update the HashMap per word. Easy, isn't it? But what issues does this bring up?
• What if the process dies and all in-memory data is wiped away? What about duplicated data?
• What if the data volume is too large to fit in memory?
• When deploying on multiple machines, how do we guarantee the allocation strategy and the ordering?
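The naive approach above can be sketched in a few lines. This is a toy, not the author's code: all state lives in one process's HashMap, with none of the guarantees discussed below.

```java
import java.util.HashMap;
import java.util.Map;

public class DiyWordCount {
    private final Map<String, Long> counts = new HashMap<>();

    // Split the input into words and update the in-memory view.
    // If this process dies, all counts are lost; if input is replayed,
    // words are counted twice -- exactly the problems listed above.
    public void process(String line) {
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
    }

    public long countOf(String word) {
        return counts.getOrDefault(word, 0L);
    }

    public static void main(String[] args) {
        DiyWordCount wc = new DiyWordCount();
        wc.process("to be or not to be");
        System.out.println(wc.countOf("to")); // prints 2
    }
}
```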
We can classify these problems into the following categories:
• Ordering
• Scale and sharding
• Failure recovery
• Stateful computing (such as TopK and UV)
• Recomputing
• Time- and Window-related issues
Use of the existing framework
The relatively mature frameworks include Apache Spark, Storm (our company has open-sourced JStorm), Flink, and Samza. Hosted third-party options include Google's Dataflow and AWS Lambda.
What are the benefits of existing frameworks?
Powerful computing capability. For example, Spark Streaming can draw on graph computing and MLlib, libraries suited to iterative computation that are very handy in certain scenarios.
What are the issues?
• Complicated to use. You need to migrate your business logic into the framework's own model, such as Spark RDDs or Storm Spouts. Some projects try to provide SQL and other easy-to-use modes to lower the development threshold, but for personalized ETL (and most ETL does not require a heavyweight stream computing framework) you end up writing UDFs inside SQL, and the stream computing framework degenerates into a mere container or sandbox.
• In the author's view, deploying Storm or Spark requires reserving cluster resources, which is also a burden for developers.

Kafka Stream positions itself as a lightweight stream computing library. Why call it simple?
• All functionality is implemented in a library, and programs built on it do not rely on a separate execution environment.
o The binary can be run by Mesos, K8S, YARN, Lambda, or any independent scheduler. Just think: wouldn't it be cool to build a pay-as-you-go, elastically resizable stream computing system with Lambda + Kafka?
o Supports single-instance, single-thread, and multi-thread execution.
• Supports both stateless and stateful computing in one programming model.
• The programming model is concise. Built on the Kafka Consumer Lib and its key-affinity feature, your code only needs to express the execution logic; failover and scaling are handled by Kafka itself.
In my personal opinion, Kafka Stream is an enhanced version of Samza (Samza is also a stream computing framework, born from LinkedIn's deep integration with Kafka) and may replace it in the future. But it cannot shake the position of Spark or Flink, which enjoy richer semantics; it can only handle lightweight stream processing scenarios (such as ETL, data integration, and cleansing).
Example of Kafka Stream
First, let's look at an example developed with Kafka Stream:
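As the original code figure did not survive, here is a hedged reconstruction of the classic wordcount in the 0.10-era Kafka Streams DSL; topic names and properties are placeholders, and exact method names (such as countByKey) changed in later versions.

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class WordCountExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount");          // app id / consumer group
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker

        // 1. Serialization/deserialization modes for the data in Kafka.
        Serde<String> stringSerde = Serdes.String();
        Serde<Long> longSerde = Serdes.Long();

        KStreamBuilder builder = new KStreamBuilder();
        KStream<String, String> source = builder.stream(stringSerde, stringSerde, "input-topic");

        // 2. Two computing nodes: word segmentation, then a count by key.
        KTable<String, Long> counts = source
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split(" ")))
            .map((key, word) -> new KeyValue<>(word, word))
            .countByKey("Counts");

        // 3. Write the (incremental) results into a result topic.
        counts.to(stringSerde, longSerde, "output-topic");

        new KafkaStreams(builder, props).start();
    }
}
```

Running it requires a Kafka broker and the kafka-streams dependency on the classpath.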

There are a few things done here:
1. The serialization/deserialization modes for the data in Kafka are constructed.
2. Two computing nodes are constructed:
o Word segmentation (flatMapValues), with the results re-keyed by word.
o Reduce (counting the results by key).
3. The results are written into a result topic in Kafka (incrementally).
Between the two computing nodes, a Kafka topic is used to serialize/deserialize the intermediate results, playing a role similar to the shuffle stage in MapReduce.
This program can run in a single thread or across multiple machines, mainly because the Kafka Consumer Lib helps decouple the data from the computation.
Basic concepts
Processor: a Processor is a basic computing node.
public interface Processor<K, V> {
    void process(K key, V value);
    void punctuate(long timestamp);
}

Stream: the output produced after processing by a Processor.
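To make the calling pattern concrete, here is a self-contained sketch: the interface is re-declared locally (the real one lives in org.apache.kafka.streams.processor and also has init/close methods) and implemented as a tiny wordcount node.

```java
import java.util.HashMap;
import java.util.Map;

public class ProcessorSketch {
    // Local stand-in for the Kafka Streams Processor interface.
    interface Processor<K, V> {
        void process(K key, V value);
        void punctuate(long timestamp); // called periodically, e.g. to emit results
    }

    static class WordCountProcessor implements Processor<String, String> {
        final Map<String, Long> state = new HashMap<>();

        @Override
        public void process(String key, String value) {
            for (String word : value.split(" ")) {
                state.merge(word, 1L, Long::sum);
            }
        }

        @Override
        public void punctuate(long timestamp) {
            // In a real topology this would forward the state downstream.
            System.out.println(timestamp + " -> " + state);
        }
    }

    public static void main(String[] args) {
        WordCountProcessor p = new WordCountProcessor();
        p.process("k1", "hello kafka");
        p.process("k2", "hello stream");
        p.punctuate(System.currentTimeMillis());
    }
}
```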
The relationship between them is shown in the figure:

How does Kafka Stream solve the six problems in stream computing?

Ordering
For Kafka, within a partition (shard) data strictly follows first-in-first-out order, so ordering is not a problem.

Partition & Scalability
How far a stream computing system scales depends on two factors: whether the data can be scaled out linearly, and whether the computation can be scaled out linearly.
Data in Kafka is divided into partitions, each strictly ordered and elastic (in fact, scalability in the current version is not complete; Kafka 0.10 provides the full scalability capability).

Kafka provides the Consumer Group feature on the client side, which lets consumption instances scale out up to the number of partitions. Within a group, each partition is consumed by exactly one instance.
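The key-affinity that makes this scaling work can be sketched as follows. Note the real default partitioner murmur2-hashes the serialized key bytes; plain hashCode is used here purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyAffinity {
    // Simplified stand-in for Kafka's default partitioner:
    // the same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 4;
        Map<Integer, List<String>> assignment = new HashMap<>();
        for (String word : List.of("kafka", "stream", "kafka", "table")) {
            assignment.computeIfAbsent(partitionFor(word, partitions),
                                       p -> new ArrayList<>()).add(word);
        }
        // All occurrences of "kafka" land in one partition, so the
        // per-key state can live on a single consumer instance.
        System.out.println(assignment);
    }
}
```

Because every record for a key lands in one partition, and each partition is owned by one consumer instance, stateful computation per key never needs cross-instance coordination.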

Fault tolerance
Kafka Consumer Group implements load balancing, so when a consumption instance crashes, its unfinished work can be taken over quickly without data loss, though duplicated data may occur (depending on how consumption checkpoints are coordinated).

State processing
This problem is relatively complex. In stream computing scenarios, there are two categories of computation:
• Stateless: such as Filter and Map, where streaming each record through once is enough, with no dependency on prior or subsequent state.
• Stateful: mainly time-based aggregation, such as TopK or UV within a period. When data reaches the computing node, a value is computed against state held in memory.
Kafka Stream provides the abstractions KTable and KStream. For how KStream solves state storage and data changes, see the following sections.
Reprocessing is not difficult to understand once you understand the redo log and the state.

Time & Windowing
Time is an important attribute of stream computing, because real data acquisition is rarely perfect: the late arrival of historical data breaks our assumptions about the computation. There are two notions of time:
• Event Time: the objective, physical time at which the event occurred.
• Processing Time: the time at which the event is actually processed (when it arrives at the server).
Processing Time makes processing easier, but Event Time is more accurate because of late historical data. A typical retail scenario is counting the sales volume of each product every 10 minutes (or a website's UV/PV at each point in time). Sales data may flow in from different channels in real time, so the window must be based on the time at which the sales record was generated, not on when the data arrives at the computing node.

Kafka Stream adopts a simple and direct solution: every window gets a state that represents the window's value as of the current moment. When new data reaches the window, its state is updated. For window-based aggregation, Kafka Stream's formula is:
Table (state data) + Library = Stateful Service
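A minimal sketch of that idea (a hypothetical helper, not the Kafka Stream API): every 10-minute window keeps one mutable state value, and a late record with an old event time still updates its own window rather than the current one.

```java
import java.util.Map;
import java.util.TreeMap;

public class EventTimeWindowCount {
    static final long WINDOW_MS = 10 * 60 * 1000L;

    // window start time -> sales count so far; each window's state only
    // represents the value at the current time, and is updated whenever
    // data for that window arrives, however late.
    final Map<Long, Long> windows = new TreeMap<>();

    void onSale(long eventTimeMs) {
        long windowStart = eventTimeMs - (eventTimeMs % WINDOW_MS);
        windows.merge(windowStart, 1L, Long::sum);
    }

    public static void main(String[] args) {
        EventTimeWindowCount counter = new EventTimeWindowCount();
        counter.onSale(0);              // window [0, 10min)
        counter.onSale(11 * 60 * 1000); // window [10min, 20min)
        counter.onSale(5 * 60 * 1000);  // late data: still lands in [0, 10min)
        System.out.println(counter.windows); // prints {0=2, 600000=1}
    }
}
```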
Stream & Table
To realize the concept of state, Kafka abstracts two entities: KStream and KTable.
• A Stream is equivalent to the change log of a database.
• A Table is equivalent to a snapshot of the database at a point in time; two different snapshots differ by one or more changelog entries.
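This duality can be sketched directly with plain maps (not the KTable API): replaying the changelog from the start reproduces the table, and any prefix of the changelog yields an earlier snapshot.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamTableDuality {
    // Replaying a changelog (the stream) yields the table:
    // the latest value per key wins.
    static Map<String, Integer> snapshot(List<Map.Entry<String, Integer>> changelog) {
        Map<String, Integer> table = new HashMap<>();
        for (Map.Entry<String, Integer> change : changelog) {
            table.put(change.getKey(), change.getValue());
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> changelog = List.of(
            new SimpleEntry<>("itemA", 10),  // a shipment sets stock to 10
            new SimpleEntry<>("itemA", 7),   // a sale brings it down to 7
            new SimpleEntry<>("itemB", 5));

        System.out.println(snapshot(changelog.subList(0, 1))); // earlier snapshot
        System.out.println(snapshot(changelog));               // current snapshot
    }
}
```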

Let's assume there are two streams: one for shipments (stock arriving) and one for sales. Joining the two streams yields the current inventory status:
shipment stream:

sale stream:

As the records in the two streams arrive one after another, the inventory status is affected; the variation of the whole inventory is as follows:

When we feed the two streams into Kafka Stream, the state variation of one processor node looks as follows:

Based on the state data, we can define the processing logic at the node:
if (state.inventory[item] < 10)
    notify the manager;
else if (state.inventory[item] > 100)
    put it on sale;
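The shipment/sale join above can be sketched as one stateful node; the item names and the 10/100 thresholds follow the pseudocode, everything else is illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class InventoryJoin {
    // item -> current stock: the joined state of the two streams.
    final Map<String, Integer> inventory = new HashMap<>();

    // Shipment stream: stock arrives.
    void onShipment(String item, int qty) {
        inventory.merge(item, qty, Integer::sum);
        check(item);
    }

    // Sale stream: stock leaves.
    void onSale(String item, int qty) {
        inventory.merge(item, -qty, Integer::sum);
        check(item);
    }

    void check(String item) {
        int stock = inventory.get(item);
        if (stock < 10) {
            System.out.println(item + ": low stock (" + stock + "), notify the manager");
        } else if (stock > 100) {
            System.out.println(item + ": overstock (" + stock + "), put it on sale");
        }
    }

    public static void main(String[] args) {
        InventoryJoin state = new InventoryJoin();
        state.onShipment("itemA", 120); // overstock branch fires
        state.onSale("itemA", 115);     // low-stock branch fires
    }
}
```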

KTable and KStream may seem abstract, so Kafka Stream packages a high-level DSL that directly provides filter, map, join, and other operators. Of course, for personalized needs you can drop down to the lower-level abstract APIs.
Personal views
In stream computing scenarios, will two extremes emerge: complex in-memory operations plus iterative computation on one side, and lightweight data processing plus ETL on the other? And in what proportions? In common ETL scenarios, the majority of the work is actually lightweight operations such as Filter, Lookup, and writing to storage. Yet sometimes, just to get an execution container, we have to adopt a full stream computing framework to process the data. Lambda and Docker can solve the container problem, but a certain amount of stream computing development work is still required.
I think that for lightweight ETL scenarios, the ideal architecture is a lightweight computing library such as Kafka Stream plus Lambda, achieving an on-demand stream computing mode.
Kafka Stream still fails to solve some key issues. For example, in join scenarios the two source topics must have matching numbers of data shards, because no MapJoin is offered. And in the previous version, EventTime and other meta fields were not provided either.