The Practices and Exploration of Alibaba Cloud ApsaraMQ for Kafka Ecosystem Integration

By Chenhui

An Introduction to ApsaraMQ for Kafka

Apache Kafka is a distributed streaming platform and a widely used message component in Internet architectures. Kafka generally serves as the core hub for message forwarding: upstream and downstream systems use it to decouple from each other asynchronously and to smooth out traffic peaks. Kafka also plays a key role in big data processing and real-time data processing.

Kafka is widely used in fields such as log collection, big data processing, and databases. For common upstream and downstream components, standardized integration modules already exist. For example, Flume, Filebeat, and Logstash cover log collection, and Spark and Flink cover big data processing. However, some niche fields (such as a niche database or a user's customized system) have no ready-made integration tools. In these cases, the usual approach is to develop custom Kafka producer and consumer programs.

The following issues are commonly encountered when different systems are connected:

Different teams in a company need to integrate with the same system. Each team develops its own integration, the implementations differ, and the cost of upgrades and operations and maintenance (O&M) is high.
Each subsystem is developed by a different team, so the content and format of the data naturally differ between systems. Format processing is needed to eliminate these differences.

Based on the extensive use of Kafka and the diversity of upstream and downstream systems, Kafka provides a built-in framework for connecting upstream and downstream systems: Kafka Connect.

An Introduction to Kafka Connect

Kafka Connect is a framework for transferring data streams into and out of Kafka. The following describes some of the main concepts of Kafka Connect:

Connectors: A high-level abstraction that orchestrates data flow by managing tasks
Tasks: The implementation of how data is copied to or from Kafka
Workers: The running processes that execute connectors and tasks
Converters: The code used to convert data between Connect and the system that sends or receives the data
Transforms: Simple logic that changes each message produced by or sent to a connector

Connectors

A connector in Kafka Connect defines where data should be copied to or from. A connector instance is a logical job that manages data replication between Kafka and another system.
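As a minimal sketch of what a connector instance looks like in practice, the example below builds the request that would create one through the Connect REST interface (POST /connectors). The connector class and configuration keys follow the FileStreamSource example that ships with Apache Kafka. The connector name, file path, topic, and worker address are assumptions chosen for illustration.

```python
import json

# Hypothetical worker address; 8083 is the default Connect REST port.
CONNECT_URL = "http://localhost:8083"


def build_connector_request(name, config):
    """Return (url, body) for creating a connector via POST /connectors."""
    body = json.dumps({"name": name, "config": config}, sort_keys=True)
    return f"{CONNECT_URL}/connectors", body


# Illustrative values: the file path and topic are placeholders.
file_source_config = {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/input.txt",
    "topic": "demo-topic",
}

url, body = build_connector_request("demo-file-source", file_source_config)
print(url)
print(body)
```

The request is only constructed here, not sent, so the sketch runs without a Connect cluster.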

Many open-source connector implementations are available. You can also write a new connector plug-in from scratch. The writing process generally works like this:

[Figure: workflow for writing a connector plug-in]

Tasks

Tasks are the main actors in the Connect data model that handle data. Each connector instance coordinates a set of tasks that replicate data. By allowing a connector to break a single job into multiple tasks, Kafka Connect provides built-in support for parallelism and scalable data replication with minimal configuration. The tasks themselves store no state. Task state is stored in special Kafka topics, which are set through config.storage.topic and status.storage.topic. Tasks can therefore be started, stopped, or restarted at any time, which makes the data pipeline flexible and scalable.
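The decomposition of one job into several tasks can be sketched as follows. This is an illustration only: it loosely mirrors the idea behind the taskConfigs(maxTasks) method of the Java Connector API, but the function name, the "tables" key, and the table names are hypothetical.

```python
def split_into_task_configs(tables, max_tasks):
    """Spread tables over at most max_tasks task configs, round-robin."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    # Never create more tasks than there are units of work.
    num_groups = min(max_tasks, len(tables))
    groups = [[] for _ in range(num_groups)]
    for index, table in enumerate(tables):
        groups[index % num_groups].append(table)
    # Each task receives its own small configuration and keeps no state itself.
    return [{"tables": ",".join(group)} for group in groups]


task_configs = split_into_task_configs(["orders", "users", "payments"], max_tasks=2)
print(task_configs)
```

Because each task is described entirely by its configuration, any worker can start or restart it.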

Task Rebalancing

When a connector is first submitted to the cluster, the workers rebalance all connectors in the cluster and their tasks so that each worker has roughly the same amount of work. The same rebalancing process is used when a connector increases or decreases the number of tasks it requires or when the configuration of a connector changes. When a worker fails, its tasks are rebalanced among the active workers. When a task fails, no rebalancing is triggered, because a task failure is treated as an exceptional case. Failed tasks are therefore not restarted automatically by the framework and should be restarted through the REST API.
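Restarting failed tasks by hand can be sketched as below. The function reads a status document shaped like the response of GET /connectors/{name}/status and returns the paths to call with POST /connectors/{name}/tasks/{id}/restart. The sample status, connector name, and worker addresses are made up for illustration, and the HTTP calls themselves are left out.

```python
def failed_task_restart_paths(status):
    """Return the REST paths that restart every FAILED task of a connector."""
    name = status["name"]
    return [
        f"/connectors/{name}/tasks/{task['id']}/restart"
        for task in status.get("tasks", [])
        if task.get("state") == "FAILED"
    ]


# Illustrative document in the shape of a connector status response.
sample_status = {
    "name": "demo-file-source",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
        {"id": 1, "state": "FAILED", "worker_id": "10.0.0.2:8083", "trace": "..."},
    ],
}

restart_paths = failed_task_restart_paths(sample_status)
print(restart_paths)
```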

Converters

Converters are required for Kafka Connect to support a specific data format when writing to or reading from Kafka. Tasks use converters to change the data format from bytes to the Connect internal data format and vice versa.

The following converters are provided by default:

AvroConverter: Used together with Schema Registry
JsonConverter: Suitable for structured data
StringConverter: Simple string format
ByteArrayConverter: Provides a pass-through option that performs no conversion

Converters are decoupled from the connectors themselves, so the same converter can naturally be reused across connectors.
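The decoupling can be illustrated with a toy model in which the same sink-side step works with whichever converter is plugged in. The classes below are simplified stand-ins written for this example, not the real Kafka Connect converters. Only the method names loosely follow the toConnectData and fromConnectData methods of the Java Converter interface.

```python
import json


class JsonLikeConverter:
    """Toy converter: JSON bytes in Kafka <-> Python objects in the pipeline."""

    def to_connect_data(self, raw):
        return json.loads(raw.decode("utf-8"))

    def from_connect_data(self, value):
        return json.dumps(value, sort_keys=True).encode("utf-8")


class StringLikeConverter:
    """Toy converter: UTF-8 bytes in Kafka <-> plain strings in the pipeline."""

    def to_connect_data(self, raw):
        return raw.decode("utf-8")

    def from_connect_data(self, value):
        return str(value).encode("utf-8")


def sink_task_deliver(raw_message, converter):
    """Sink-side step: deserialize with whichever converter is configured."""
    return converter.to_connect_data(raw_message)


raw = b'{"id": 1}'
as_object = sink_task_deliver(raw, JsonLikeConverter())
as_text = sink_task_deliver(raw, StringLikeConverter())
print(as_object, as_text)
```

The task code does not change when the converter is swapped, which is the property that lets converters be shared between connectors.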

Transforms

The connector can configure transformations to make simple and lightweight modifications to individual messages. This is convenient for small data adjustments and event routing, and multiple transformations can be chained together in a connector configuration.
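A chain of lightweight, per-message transformations can be sketched as follows. The two functions are loosely modeled on the idea of the built-in InsertField and RegexRouter transforms, but they are simplified Python stand-ins. The record layout, field names, and topic names are assumptions for illustration.

```python
import re


def insert_static_field(field, value):
    """Build a transform that adds a fixed field to each record value."""
    def apply(record):
        new_value = dict(record["value"])
        new_value[field] = value
        return {**record, "value": new_value}
    return apply


def route_by_regex(pattern, replacement):
    """Build a transform that renames the topic when the whole name matches."""
    compiled = re.compile(pattern)

    def apply(record):
        if compiled.fullmatch(record["topic"]):
            new_topic = compiled.sub(replacement, record["topic"], count=1)
            return {**record, "topic": new_topic}
        return record
    return apply


def apply_chain(record, transforms):
    """Apply the transforms in order, each one seeing the previous output."""
    for transform in transforms:
        record = transform(record)
    return record


chain = [insert_static_field("source", "mysql"), route_by_regex(r"(.*)-raw", r"\1-clean")]
record = {"topic": "orders-raw", "value": {"id": 7}}
result = apply_chain(record, chain)
print(result)
```

Each transform returns a new record instead of modifying its input, so the order of the chain is the only thing that determines the result.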

Open-Source Issues

Kafka Connect is well designed for standalone, self-managed deployments. When it is offered as a cloud service, however, many problems remain, mainly in the following areas:

Poor Integration with Cloud Services: Cloud vendors have many closed-source products, and cloud-managed versions of open-source products bring problems of their own, such as access control.
Occupies Kafka Cluster Resources: Each connector task requires three built-in metadata topics, which occupy cloud product resources. Accidental operations on the metadata topics can cause task exceptions.
Simple O&M Control Interfaces and Monitoring: The control interfaces cannot manage running resources at a fine granularity, and monitoring lacks metrics at the connector task level.
Poor Integration with Cloud-Native Architectures: The original architecture was not designed to be cloud-native. The isolation between tasks is insufficient, the load-balancing algorithm is simple, and there is no dynamic self-balancing capability.

To address these problems of deploying Kafka Connect on the cloud, the Message Queue for Apache Kafka team re-implemented the Kafka Connect module in a cloud-native manner while remaining compatible with the native Kafka Connect framework.

Alibaba Cloud MQ for Kafka Connect Solution

An Introduction to the Alibaba Cloud Message Queue for Apache Kafka Connect Framework

The architecture separates the control plane from the running plane and uses a database and Etcd for task distribution and communication between modules. The underlying runtime environment uses Kubernetes clusters to better control the granularity and isolation of resources. The overall architecture diagram is listed below:

[Figure: overall architecture diagram]

This architecture solves the problems that the Apache Kafka Connect module encounters on the cloud:

Integration with Cloud Services: The runtime environment is deployed with network connectivity by default, and the runtime interfaces are integrated with the access control modules.
Kafka Cluster Resource Usage: The metadata is stored in databases and Etcd and does not occupy Kafka topic resources.
Enhanced O&M and Control Interfaces: Resource-level control APIs are added to manage the running resources of each task in a fine-grained way.
Enhanced Monitoring Metrics: Metrics are collected across the whole task dimension, so you can monitor data at each stage from inflow to outflow and locate problems promptly when they occur.
Cloud-Native Architecture Design: The control plane coordinates global resources, monitors cluster load in real time, and can automatically complete O&M operations such as load balancing, restart on failure, and migration of abnormal tasks.

An Introduction to Alibaba Cloud Kafka Connect

Alibaba Cloud Message Queue for Apache Kafka Connect supports integration in the following two ways:

Connect external systems directly to Kafka by extending the Kafka Connect framework.
For tasks that require data processing, use the Kafka → Function Compute (FC) → external system path, where the processing logic can be flexibly customized in FC.

The sections below describe how Connect is used in typical scenarios:

Database

Backups between databases generally do not go through Kafka. MySQL → Kafka is generally used to distribute data to downstream subscribers, which raise alerts or respond in other ways when MySQL data changes. The link is MySQL → Kafka → subscribers → alerts or changes in other systems.

Data Warehouse

MaxCompute is the data warehouse commonly used on Alibaba Cloud. Data warehouse tasks are characterized by high throughput and also require data cleaning. The general process is Kafka → MaxCompute, after which internal MaxCompute tasks convert the data. You can also clean the data before it is transferred to MaxCompute, in which case the link is generally Kafka → Flink → MaxCompute. For tasks with simple data conversion or small data volumes, you can use Function Compute instead of Flink. The link is then Kafka → FC → MaxCompute.
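For the Kafka → FC → MaxCompute link, a simple cleaning step might look like the sketch below. It uses the handler(event, context) entry point of the Function Compute Python runtime. The event layout (a JSON array of records, each carrying a JSON string under "value") and the field names user_id and action are assumptions made for this example, and the write to MaxCompute is omitted.

```python
import json


def handler(event, context):
    """Drop malformed records and normalize fields before loading."""
    records = json.loads(event)  # json.loads accepts bytes or str
    cleaned = []
    for record in records:
        try:
            value = json.loads(record["value"])
        except (KeyError, TypeError, ValueError):
            continue  # skip records without a parseable JSON value
        if not isinstance(value, dict) or "user_id" not in value:
            continue  # skip records that lack the assumed key field
        cleaned.append({
            "user_id": str(value["user_id"]),
            "action": str(value.get("action", "")).strip().lower(),
        })
    # A real function would write the cleaned rows to the warehouse here.
    return json.dumps(cleaned)


sample_event = json.dumps([
    {"value": '{"user_id": 1, "action": " Click "}'},
    {"value": "not json"},
    {"value": '{"action": "view"}'},
]).encode("utf-8")

cleaned_rows = json.loads(handler(sample_event, None))
print(cleaned_rows)
```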

Data Retrieval and Reporting

Data retrieval and reporting are generally built on Elasticsearch (ES), and data needs to be cleaned before it is written to ES. The suitable paths are Kafka → Flink → ES or Kafka → FC → ES.

Alert System

In an alert system, Kafka generally sits between the upstream module and the alert module: pre-module → Kafka → subscription → alert module. The recommended path is pre-module → Kafka → FC → alert.
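The alert step in such a link can be as small as a threshold rule applied to each message. The sketch below shows one possible rule. The message layout, the service and error_rate fields, and the 5% threshold are hypothetical, and the delivery of the alert (for example, to a notification service) is omitted.

```python
import json

# Hypothetical alert rule: flag services whose error rate exceeds 5%.
ERROR_RATE_THRESHOLD = 0.05


def build_alerts(messages, threshold=ERROR_RATE_THRESHOLD):
    """Return one alert line for every metric message above the threshold."""
    alerts = []
    for raw in messages:
        metric = json.loads(raw)
        if metric["error_rate"] > threshold:
            alerts.append(
                f"{metric['service']}: error rate {metric['error_rate']:.1%} "
                f"exceeds {threshold:.1%}"
            )
    return alerts


sample_messages = [
    '{"service": "checkout", "error_rate": 0.12}',
    '{"service": "search", "error_rate": 0.01}',
]

alerts = build_alerts(sample_messages)
print(alerts)
```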

Backup Requirements

Some data needs to be archived regularly for long-term storage, and OSS is a good medium for this. In this scenario, only the original data needs to be saved, so Kafka → OSS is a good choice. If the data needs to be processed first, you can use the Kafka → FC → OSS link.

Alibaba Cloud MQ for Kafka Ecosystem Planning

The connectors currently supported by Message Queue for Apache Kafka were all developed in-house on the new self-developed architecture, which covers mainstream usage scenarios well. However, the Kafka ecosystem is developing rapidly, Kafka usage scenarios keep growing, and open-source Kafka Connect continues to evolve. As a next step, Message Queue for Apache Kafka will integrate with open-source Kafka Connect so that open-source connectors can run on the self-developed architecture seamlessly and without modification.

Summary

Kafka occupies an important position in Internet architecture and is actively expanding its upstream and downstream integrations. In addition to Kafka Connect, modules such as Kafka Streams, KSQL, and Kafka REST Proxy continue to improve and mature. I believe Kafka will play an even more important role in software architecture going forward.
