
Building a Cloud-Native Feed Streaming System with Apache Kafka and Spark on Alibaba Cloud – Part A: Service Setup

In this 3-part blog series, we'll show you how to build a simple, intelligent, cloud-native feed streaming system with Apache Kafka and Spark on Alibaba Cloud.

By Chi Wai Chan, Product Development Solution Architect at Alibaba Cloud

Figure 1. Feed Stream System Solution Overview

Consider a fast-growing FinTech company that faces technical challenges in evolving the financial news feed stream system behind its social network platform. Demand for financial news is highly volatile and hard to address: the company overloads its relational database (e.g., MySQL) when persisting newsfeed publications and subscriptions, and the database stalls during load spikes.

This article series illustrates a simple, loosely coupled, scalable, and intelligent feed stream system architecture built on the open-source technologies Apache Kafka, Apache Spark, Apache HBase, and MongoDB, deployed in a cloud-native environment on Alibaba Cloud. This approach keeps integration simple, shortens time to market, and yields a more reliable platform with built-in high availability, reliability, and elasticity.

Part A focuses on the service setup.

There are three parts to this series:

  • Part A: Step-by-Step Guide on Feed Stream System Service Setup on Alibaba Cloud
  • Part B: Demo Solution on Building the Spark Streaming Project for Streaming Processing
  • Part C: Demo Solution on Building the Spark MLlib for Intelligent Profile Tagging

Feed Stream System Solution Overview

As shown in Figure 1, the system consists of a few major components built on the aforementioned open-source technologies. The arrows represent the data flows described below:

  1. The feed consumer (e.g., a mobile app) posts its subscription to the financial newsfeed channels of interest, which could be news updates related to a stock, an investment advisor, a local market, or an industry sector, to Kafka (a reliable distributed message queue) via API Gateway with Function Compute attached as the backend service (1a). The data feed is then transformed and enriched with Spark Streaming (a stream processing engine) (1b) and finally persisted to HBase (a wide columnar store that supports highly concurrent writes and reads).
  2. The feed producer (e.g., a mobile app or web application) posts a financial newsfeed for various channels along the same data processing pathway as feed consumers: API Gateway with Function Compute attached as the backend service (2a), followed by publication to the Kafka MQ (2b); a minimal producer sketch follows this list. The subscriber profile mapping in step 3 adds intelligent profile tagging for preferred users, channels, and industries.
  3. The published message is fetched into the stream processing engine, Apache Spark on an Alibaba Cloud hosted serverless compute cluster (3a). The same Spark session fetches the consumer subscription relationships from HBase (3b) and performs market basket analysis with MLlib, Spark's machine learning (ML) library. The newsfeed, tagged with user/channel/industry profiles, is persisted as a schema-free document in MongoDB for a better search experience (3c); the enriched financial newsfeed with profile tags is written to HBase for long-term storage (3d); and it is finally pushed to other topics on the Kafka MQ, ready for distribution (3e).
  4. The feed consumer (e.g., a mobile app) fetches the financial newsfeed channels of interest along the same data processing pathway used for making a subscription, via API Gateway with Function Compute attached as the backend service (4b), subscribing to the Kafka MQ (4a).
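To make the publish path in step 2 concrete, here is a minimal sketch of a feed producer posting a newsfeed message to Kafka with the kafka-python client. The endpoint placeholder, topic name, and message fields are assumptions for illustration only; the actual demo code is linked below.

```python
# pip install kafka-python
import json
from kafka import KafkaProducer

# Assumed endpoint; use the default endpoint of your
# AlibabaMQ for Apache Kafka instance inside the VPC.
producer = KafkaProducer(
    bootstrap_servers="<kafka-endpoint>:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical newsfeed payload for illustration.
newsfeed = {
    "channel": "stock:BABA",
    "headline": "Quarterly results released",
    "ts": 1600000000,
}
producer.send("default", newsfeed)  # the "default" topic is created later in this guide
producer.flush()
```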

The following section is a step-by-step guide to building the Feed Streams Processor component, highlighted in blue in Figure 1, on Alibaba Cloud. The demo source code is available on GitHub.

Create a Free Trial Account on Alibaba Cloud

First, go to Alibaba Cloud and register for a free trial account.

Figure 2. Alibaba Cloud Landing Page for Free Trial Account Creation

Set up AlibabaMQ for Apache Kafka

You can access the workspace console and search for various products (e.g., Kafka, HBase, and Data Lake Analytics) in the search bar under the "Products and Services" dropdown menu by clicking the top-left menu icon. Let's set up the Kafka MQ first by typing "ka" in the search bar, as shown below, and clicking "AlibabaMQ for Apache Kafka" in the filtered results. You will be redirected to the AlibabaMQ for Apache Kafka product landing page.

Figure 3. Alibaba Workspace Console for Product Selection

Make sure that you have selected your preferred region for your feed stream system deployment. For security and latency reasons, all components of the feed stream system should be deployed in the same region and the same virtual private cloud (VPC). For international markets, we can deploy all required components in the Hong Kong or Indonesia (Jakarta) regions; I will use the Hong Kong region for this demo. Now, click the "purchase instance" button and complete the purchase by following the instructions for the Kafka service subscription.

Figure 4. AlibabaMQ for Apache Kafka Product Landing Page – Subscription

After that, go back to the Kafka landing page, and you will see a new Kafka subscription that is ready for deployment. Click the deploy icon for the selected instance. If you have an existing VPC, select the preferred VPC and availability zone to complete the deployment. If you don't have an existing VPC, you can create a new one by following these instructions.

Figure 5. AlibabaMQ for Apache Kafka Product Landing Page – Deployment

After being redirected to the selected Kafka instance's management page, make sure that the instance status is "Running." Then, set up the security group and access whitelist by clicking the "Security Configuration" icon at the bottom of the page. For this demo, I will add "0.0.0.0/0" to the VPC whitelist for universal data access, but in a production environment you should restrict it to the IP ranges of the connecting source systems in the VPC. Afterward, create your topics and consumer groups accordingly; I will create a "default" topic and a "default" consumer group. Now, your Kafka instance is ready for message delivery.
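Once the topic and consumer group exist, you can verify connectivity from a machine inside the same VPC. Here is a minimal smoke-test sketch with the kafka-python client, assuming the default endpoint shown on the instance detail page:

```python
# pip install kafka-python
from kafka import KafkaProducer, KafkaConsumer

ENDPOINT = "<kafka-default-endpoint>:9092"  # assumption: copy from the instance detail page

# Produce one test message to the "default" topic.
producer = KafkaProducer(bootstrap_servers=ENDPOINT)
producer.send("default", b"hello feed stream")
producer.flush()

# Read it back with the "default" consumer group.
consumer = KafkaConsumer(
    "default",
    bootstrap_servers=ENDPOINT,
    group_id="default",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # give up after 10 s of silence
)
for record in consumer:
    print(record.value)
    break
```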

Figure 6. AlibabaMQ for Apache Kafka Product Landing Page – Configuration

Set up an Apache HBase Cluster

Click the top-left icon for the dropdown menu, type "hbase" in the search bar under the "Products and Services" dropdown menu, and select HBase in the filtered results; you will be redirected to the HBase product homepage. Follow the instructions to create an Apache HBase cluster with version 2.0 in the Hong Kong region and the same VPC where the Kafka cluster is located.

Figure 7. Apache HBase Product Landing Page – Subscription

Go to the access control page to set up the whitelist and security group. HBase enforces stricter security settings, so "0.0.0.0/0" is not allowed. Instead, I will add "192.168.0.0/16,139.224.4.0/24,10.152.163.0/24,10.152.69.0/24" to the access whitelist for the data management service and data lake service, and open ports 2181, 9099, and 8765 for ZooKeeper, Thrift server, and query server access, respectively.

Figure 8. Apache HBase Access Control Page – Whitelist

Afterward, click "SQL Service" on the menu bar to enable Apache Phoenix on top of HBase for supporting SQL syntax and building a secondary index for simple searches. Apache Phoenix enables OLTP and operational analytics on HBase for low latency applications by combining the best of both worlds: standard SQL and JDBC APIs with full ACID transaction capabilities and flexibility of late-bound, schema-on-read capabilities from the NoSQL world on HBase.

The SQL Service on HBase at Alibaba Cloud supports global indexes only. Global indexing targets read-heavy use cases: with global indexes, all the performance penalties for indexes are paid at write time. Phoenix intercepts data table updates on write (DELETE, UPSERT VALUES, and UPSERT SELECT), builds the index updates, and sends any necessary updates to all interested index tables. At read time, Phoenix selects the index table that will produce the fastest query time and scans it directly, like any other HBase table. By default, unless hinted, an index will not be used for a query that references a column that isn't part of the index. Please check this link for more details about global indexes on Apache Phoenix.
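To see this trade-off in practice, here is a minimal sketch using the phoenixdb Python client against the Phoenix Query Server (port 8765, opened above). The table and index names are hypothetical; substitute the client access URL shown on the "SQL Service" tab (Figure 9).

```python
# pip install phoenixdb
import phoenixdb

# Assumed query server URL; copy the real one from the "SQL Service" tab.
conn = phoenixdb.connect("http://<hbase-query-server>:8765/", autocommit=True)
cur = conn.cursor()

# Hypothetical newsfeed table keyed by feed id.
cur.execute("""
    CREATE TABLE IF NOT EXISTS newsfeed (
        id BIGINT PRIMARY KEY,
        channel VARCHAR,
        headline VARCHAR
    )
""")

# Global index: maintained at write time, scanned directly at read time.
# INCLUDE makes headline a covered column, so the query below never
# needs to touch the data table.
cur.execute(
    "CREATE INDEX IF NOT EXISTS idx_channel ON newsfeed (channel) INCLUDE (headline)"
)

# The index update is built and shipped as part of this upsert.
cur.execute("UPSERT INTO newsfeed VALUES (1, 'stock:BABA', 'Quarterly results released')")

# Served entirely from idx_channel, since it references only indexed/covered columns.
cur.execute("SELECT headline FROM newsfeed WHERE channel = ?", ["stock:BABA"])
print(cur.fetchall())
```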

Figure 9. Apache HBase Management Page – Enabling the SQL Service

Next, configure the service parameters for HBase and Phoenix. There are two pairs of entries (four entries in total) whose values must be consistent. Search twice, once for phoenix.schema.isNamespaceMappingEnabled and once for phoenix.schema.mapSystemTablesToNamespace, and set every occurrence to "false" if it is not already set.

Remark: Of the four entries, the first pair (two entries) applies to the HBase server side, and the second pair (two entries) applies to the Phoenix client side. All four entries must carry the same value; mismatched server-side and client-side settings will cause an inconsistent namespace mapping error on connection.
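For reference, the two properties look like this in hbase-site.xml form (a sketch only; on Alibaba Cloud you edit them through the console's service parameter page rather than the file, and the same values must appear on both the server and client sides):

```xml
<!-- Must be identical on the HBase server side and the Phoenix client side. -->
<property>
    <name>phoenix.schema.isNamespaceMappingEnabled</name>
    <value>false</value>
</property>
<property>
    <name>phoenix.schema.mapSystemTablesToNamespace</name>
    <value>false</value>
</property>
```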

Figure 10. Apache HBase Management Page – Configuring Service Parameters

Lastly, go back to "SQL Service" and click "Restart HBase SQL Service." Now, your HBase cluster is accessible via the client access URL on the "SQL Service" tab, as shown in Figure 9.

Set up an Apache Spark Serverless Virtual Cluster in Alibaba Data Lake Analytics (DLA)

Serverless describes services, practices, and strategies that enable you to build more agile applications, innovate, and respond to change faster. With serverless computing, infrastructure management tasks like capacity provisioning and patching are handled by the cloud provider, so you can focus on writing code that serves your customers. Serverless services like Alibaba Data Lake Analytics (DLA) come with automatic scaling, built-in high availability, and a pay-for-value billing model. DLA provides twin data processing engines, built on Apache Spark and Presto, as a compute service that runs your code in response to events from various natively integrated Alibaba database services and many other SaaS sources without managing any servers.

Click the top left icon for the dropdown menu. Type "data lake" in the search bar under the "Products and Services" dropdown menu, select "Data Lake Analytics" from the filtered results, and will be redirected to the DLA product homepage. Click "virtual cluster" on the left menu bar, click the "new virtual cluster" on the top right corner, and then follow the instructions to create a virtual cluster using the Hong Kong region and the same VPC as the Kafka cluster located (e.g., name: "test-hk").

Figure 11. Data Lake Analytics Management Page

Now, your Data Lake Analytics Spark virtual cluster is ready.
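As a preview of Part B, the Spark jobs submitted to this virtual cluster can read from Kafka with Structured Streaming. Here is a minimal PySpark sketch, assuming the Kafka endpoint placeholder and the "default" topic created earlier (the cluster also needs the spark-sql-kafka connector on its classpath):

```python
# Minimal Structured Streaming sketch; Part B builds the full pipeline.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feed-stream-preview").getOrCreate()

# Assumed endpoint; use your instance's default endpoint inside the VPC.
feed = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<kafka-endpoint>:9092")
    .option("subscribe", "default")
    .load()
)

# Kafka records arrive as binary key/value pairs; cast the payload to a string.
query = (
    feed.selectExpr("CAST(value AS STRING) AS newsfeed")
    .writeStream
    .format("console")
    .start()
)
query.awaitTermination()
```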

Set up MongoDB

Go to the search bar under the "Products and Services" dropdown menu by clicking the top-left menu icon. Let's set up MongoDB by typing "mongo" in the search bar, as shown below, and clicking "ApsaraDB for MongoDB" in the filtered results. You will be redirected to the MongoDB product landing page.

Select "Sharded Cluster Instance" on the menu bar on the left hand side, click "Create Instance," and follow the instruction to complete the provisioning. After MongoDB is ready, go to the Whitelist Group for "0.0.0.0/0" and security group for opening the ports (default: 3417).

Figure 12. MongoDB Connection Information

The connection information ("ConnectionStringURI") is provided on the "Database Connection" tab of the MongoDB management page, as shown in Figure 12.
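You can verify the connection with the pymongo client. Here is a minimal sketch, assuming the ConnectionStringURI copied from Figure 12 and a hypothetical database and collection for the tagged newsfeed documents:

```python
# pip install pymongo
from pymongo import MongoClient

# Paste the ConnectionStringURI from the "Database Connection" tab.
client = MongoClient(
    "mongodb://<user>:<password>@<mongos-host-1>:3717,<mongos-host-2>:3717/admin"
)

# Hypothetical database/collection for profile-tagged newsfeed documents.
doc = {
    "channel": "stock:BABA",
    "headline": "Quarterly results released",
    "tags": ["finance", "earnings"],
}
result = client["feedstream"]["newsfeed"].insert_one(doc)
print(result.inserted_id)
```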

Set up OSS (Object Storage Service)

Go to the search bar under the "Products and Services" dropdown menu by clicking the top-left menu icon. Let's set up OSS by typing "oss" in the search bar, as shown below, and clicking "Object Storage Service" in the filtered results. You will be redirected to the OSS product landing page.

Figure 13. OSS Folder Structure

Click "Bucket" on the left hand side menu bar, click "Create Bucket," and follow the instructions to create your OSS Buckets (e.g., "dla-spark-hk"). Then, go to the "Files" tab on the menu bar, and create the corresponding folders "dla-jars" and "dla-logs," as shown in Figure 13. OSS Folder Structure.

Feed Stream System Service Setup Summary

Congratulations, you have set up all the required components (VPC, Kafka, HBase, Spark virtual cluster, MongoDB, and OSS)! Now, you are ready to build the stream processing solution with Spark Streaming on the Alibaba Cloud Data Lake Analytics service.

Please click this link to follow Part B of this series.

Source:
https://www.linkedin.com/pulse/building-simple-intelligence-cloud-native-feed-streaming-chi-wai-chan-1c/
