This topic describes how to use DataWorks Data Integration to migrate data from a Kafka cluster to MaxCompute.
Prerequisites
- MaxCompute is activated. For more information, see Activate MaxCompute.
- DataWorks is activated.
- A workflow is created in DataWorks. In this example, a DataWorks workspace in basic mode is used. For more information, see Create a workflow.
- A Kafka cluster is created.
Before data migration, make sure that your Kafka cluster works as expected. In this topic, Alibaba Cloud E-MapReduce is used to automatically create a Kafka cluster. For more information, see Kafka Quick Start.
This topic uses the following E-MapReduce Kafka environment:
- E-MapReduce version: EMR-3.12.1
- Cluster type: Kafka
- Software: Ganglia 3.7.2, ZooKeeper 3.4.12, Kafka 2.11-1.0.1, and Kafka Manager 1.3.3.16
The Kafka cluster is deployed in a virtual private cloud (VPC) in the China (Hangzhou) region. The Elastic Compute Service (ECS) instances in the master instance group are configured with public and private IP addresses.
Background information
Kafka is distributed publish-subscribe messaging middleware. It is widely used because of its high performance and high throughput, and it can process millions of messages per second. Kafka is well suited to streaming data processing and is used in scenarios such as user behavior tracking and log collection.
A typical Kafka cluster contains several producers, brokers, consumers, and a ZooKeeper cluster. A Kafka cluster uses ZooKeeper to manage configurations and coordinate services in the cluster.
A topic is the most commonly used unit for organizing messages in a Kafka cluster and is a logical concept for message storage: a topic is not stored on a physical disk as a whole. Instead, its messages are stored, partition by partition, on the disks of the cluster nodes. Multiple producers can publish messages to a topic, and multiple consumers can subscribe to messages in a topic.
When a message is stored in a partition, it is assigned an offset, which is the unique ID of the message within that partition. The offsets of messages in each partition start from 0.
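To make the partition and offset concepts concrete, the following minimal sketch uses the open-source kafka-python client to read messages from a topic and print the partition and offset of each one. The broker address and topic name (emr-header-1:9092 and testkafka) are assumptions based on the cluster described in this topic.

```python
from kafka import KafkaConsumer

# Minimal sketch, assuming the kafka-python client and the broker and
# topic names used elsewhere in this topic.
consumer = KafkaConsumer(
    "testkafka",
    bootstrap_servers="emr-header-1:9092",  # header node address (assumption)
    auto_offset_reset="earliest",           # start from offset 0 in each partition
    consumer_timeout_ms=10000,              # stop iterating if no new messages arrive
)

for message in consumer:
    # Each message records the partition it was written to and its offset,
    # which uniquely identifies the message within that partition.
    print(message.partition, message.offset, message.value)
```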
Step 1: Prepare Kafka test data
You must prepare test data in the Kafka cluster. To make sure that you can log on to the header node of the E-MapReduce cluster and that MaxCompute and DataWorks can communicate with the header node, configure a security group rule for the header node to allow requests on TCP ports 22 and 9092.
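If you prefer to generate the test data programmatically, the following sketch uses the kafka-python client to publish a few JSON records to the testkafka topic. The broker address, topic name, and record fields are assumptions; adjust them to your environment.

```python
import json
from kafka import KafkaProducer

# Minimal sketch, assuming the testkafka topic already exists on the cluster.
producer = KafkaProducer(
    bootstrap_servers="emr-header-1:9092",  # header node address (assumption)
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish ten simple JSON test records.
for i in range(10):
    producer.send("testkafka", {"event_id": i, "tag": "test"})

producer.flush()  # block until all buffered records have been sent
producer.close()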
Step 2: Create a destination table in a DataWorks workspace
Create a destination table in a DataWorks workspace to receive data from Kafka.
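As an illustrative alternative to creating the table in the DataWorks console, the sketch below uses the PyODPS SDK to run a CREATE TABLE statement. The credentials, endpoint, and column list are assumptions chosen to hold typical Kafka message fields, not a required schema.

```python
from odps import ODPS

# Minimal sketch using the PyODPS SDK. The credentials, project,
# endpoint, and table schema below are illustrative assumptions.
o = ODPS(
    "<your-access-key-id>",
    "<your-access-key-secret>",
    project="<your-project>",
    endpoint="http://service.odps.aliyun.com/api",
)

# Column names such as `offset` can collide with reserved words,
# so they are escaped with backquotes.
o.execute_sql("""
CREATE TABLE IF NOT EXISTS testkafka (
    `key`        STRING,
    `value`      STRING,
    `partition1` STRING,
    `timestamp1` STRING,
    `offset`     STRING
)
""")
```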
Step 3: Synchronize the data
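You configure the synchronization as a Data Integration batch sync task that reads from the Kafka topic and writes to the testkafka table. As a rough, non-authoritative sketch of what such a task can look like in script mode, the JSON below pairs a Kafka reader with a MaxCompute (ODPS) writer. Every parameter value, including the server address, offset settings, and data source name, is an assumption, and the exact configuration schema depends on your DataWorks version.

```json
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "category": "reader",
      "stepType": "kafka",
      "parameter": {
        "server": "emr-header-1:9092",
        "topic": "testkafka",
        "keyType": "STRING",
        "valueType": "STRING",
        "beginOffset": "seekToBeginning",
        "endOffset": "seekToEnd",
        "column": ["__key__", "__value__", "__partition__", "__timestamp__", "__offset__"]
      }
    },
    {
      "category": "writer",
      "stepType": "odps",
      "parameter": {
        "datasource": "odps_first",
        "table": "testkafka",
        "column": ["key", "value", "partition1", "timestamp1", "offset"],
        "truncate": true
      }
    }
  ]
}
```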
What to do next
You can create a data development node and run SQL statements to check whether the data has been synchronized from the Kafka cluster to the destination table. This topic uses the select * from testkafka statement as an example. Perform the following steps:
- In the left-side navigation pane, choose .
- Right-click and choose .
- In the Create Node dialog box, enter the node name, and then click Submit.
- On the page of the created node, enter select * from testkafka, and then click the Run icon.
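Alternatively, if you want to verify the result outside the DataWorks console, the following sketch runs the same query through the PyODPS SDK. The credentials, project name, and endpoint are illustrative assumptions.

```python
from odps import ODPS

# Minimal sketch: run the verification query through PyODPS.
# Credentials, project, and endpoint are illustrative assumptions.
o = ODPS(
    "<your-access-key-id>",
    "<your-access-key-secret>",
    project="<your-project>",
    endpoint="http://service.odps.aliyun.com/api",
)

# execute_sql returns an instance; open_reader iterates the result rows.
with o.execute_sql("select * from testkafka").open_reader() as reader:
    for record in reader:
        print(record)
```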
