By Chi Wai Chan, Product Development Solution Architect at Alibaba Cloud
Figure 1. Feed Stream System Solution Overview
A fast-growing FinTech company faces technical challenges in evolving the financial news feed stream system for its social network platform. Market demand for financial news is highly volatile, which makes it difficult to serve. Technically, the company is overloading its relational database system (e.g., MySQL) when persisting news feed publications and subscriptions, and the database stalls during load spikes.
This article series illustrates a simple, loosely coupled, scalable, and intelligent feed stream system architecture built on open-source technologies (Apache Kafka, Apache Spark, Apache HBase, and MongoDB), with reference to a cloud-native environment on Alibaba Cloud. This approach simplifies integration, shortens time to market, and yields a more reliable platform with built-in high availability, scalability, and elasticity.
In this 3-part blog series, we'll show you how to build a simple, intelligent, cloud-native feed streaming system with Apache Kafka and Spark on Alibaba Cloud. Part A focuses on the service setup.
There are three parts to this series:
As shown in Figure 1, there are a few major components built on the aforementioned open-source technologies. The arrows represent the following data flow:
The following section is a step-by-step guide to building, on Alibaba Cloud, the Feed Streams Processor component highlighted in blue in Figure 1. The demo source code is available on GitHub.
First, go to Alibaba Cloud and register for a free trial account.
Figure 2. Alibaba Cloud Landing Page for Free Trial Account Creation
You can access the workspace console and search for various products (e.g., Kafka, HBase, and Data Lake Analytics) in the search bar under the "Products and Services" dropdown menu, which opens when you click the top left menu icon. Let's set up the Kafka MQ first by typing "ka" in the search bar as shown below, and then clicking "AlibabaMQ for Apache Kafka" in the filtered results. You will be redirected to the AlibabaMQ for Apache Kafka product landing page.
Figure 3. Alibaba Workspace Console for Product Selection
Make sure that you have selected your preferred region for your feed stream system deployment. For security and latency reasons, all of the components of the feed stream system should be deployed in the same region and the same virtual private cloud (VPC). For international markets, we can deploy all required components in the Hong Kong and Indonesia (Jakarta) regions. I will use the Hong Kong region for this demo. Now, click the "purchase instance" button and complete the purchase action by following the instructions for the Kafka service subscription.
Figure 4. AlibabaMQ for Apache Kafka Product Landing Page – Subscription
After that, go back to the Kafka landing page, and you will see a new Kafka subscription that is ready for deployment. Click the deploy icon for the selected instance. If you have an existing VPC, select the preferred VPC and availability zone to complete the deployment. If you don't have an existing VPC, you can create a new VPC by following these instructions.
Figure 5. AlibabaMQ for Apache Kafka Product Landing Page – Deployment
After you are redirected to the management page of the selected Kafka instance, ensure that the instance status is running. Then, you can set up the security group and the access whitelist by clicking the "Security Configuration" icon at the bottom of the page. For this demo, I will enter "0.0.0.0/0" in the VPC whitelist to allow access from any address, but in a production environment you should restrict it to the IP range of the source systems that connect from within the VPC. Afterward, create your topics and consumer groups accordingly. I will create the "default" topic and the "default" consumer group. Now, your Kafka instance is ready for message delivery.
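To make the difference between the demo whitelist and a restricted one concrete, here is a minimal sketch that uses Python's standard `ipaddress` module. The subnet `192.168.10.0/24` and the client addresses are hypothetical values chosen for illustration, and the function only mimics the idea of a CIDR whitelist. It is not how the Alibaba Cloud console evaluates its rules.

```python
import ipaddress
from typing import Iterable


def is_whitelisted(client_ip: str, whitelist: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any CIDR block in whitelist."""
    address = ipaddress.ip_address(client_ip)
    # strict=False tolerates entries whose host bits are set, e.g. "10.0.0.5/24".
    networks = [ipaddress.ip_network(cidr, strict=False) for cidr in whitelist]
    return any(address in network for network in networks)


# The demo setting from the text: every IPv4 address matches.
demo_whitelist = ["0.0.0.0/0"]
# Hypothetical production setting: a single subnet inside the VPC.
prod_whitelist = ["192.168.10.0/24"]

print(is_whitelisted("203.0.113.9", demo_whitelist))   # True
print(is_whitelisted("192.168.10.7", prod_whitelist))  # True
print(is_whitelisted("192.168.20.7", prod_whitelist))  # False
```

The last line shows the effect you want in production: an address outside the listed subnet is rejected, whereas "0.0.0.0/0" accepts any source.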
Figure 6. AlibabaMQ for Apache Kafka Product Landing Page – Configuration
Click the top left icon for the dropdown menu, type "hbase" in the search bar under the "Products and Services" dropdown menu, and select HBase in the filtered results. You will be redirected to the HBase product homepage. Follow the instructions to create an Apache HBase cluster with version 2.0 in the Hong Kong region and in the same VPC as the Kafka cluster.
Figure 7. Apache HBase Product Landing Page – Subscription
Go to the access control page to set up the whitelist and security group. Since HBase has stricter security settings, "0.0.0.0/0" is not allowed. Instead, I will add "192.168.0.0/16,18.104.22.168/24,10.152.163.0/24,10.152.69.0/24" to the access whitelist to cover data management service and data lake service access, and open ports 2181, 9099, and 8765 for ZooKeeper, Thrift server, and query server access, respectively.
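Once the whitelist and security group are saved, a quick way to confirm that the three ports are reachable from a machine inside the VPC is a plain TCP connection test. The sketch below uses only the standard `socket` module. The function names are my own, and the host you pass in is whatever endpoint your HBase console shows. A successful TCP handshake only tells you the port is open, not that the service behind it is healthy.

```python
import socket

# Ports named in the text, with the service each one exposes.
HBASE_PORTS = {2181: "ZooKeeper", 9099: "Thrift server", 8765: "query server"}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_hbase_ports(host: str, timeout: float = 3.0) -> dict:
    """Map each service name to whether its port accepts TCP connections."""
    return {
        name: port_is_open(host, port, timeout)
        for port, name in HBASE_PORTS.items()
    }
```

Calling `check_hbase_ports` with your cluster endpoint returns a dictionary such as `{"ZooKeeper": True, ...}`. A `False` entry usually points to a missing whitelist entry or a closed security group port.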
Figure 8. Apache HBase Access Control Page – Whitelist
Afterward, click "SQL Service" on the menu bar to enable Apache Phoenix on top of HBase for supporting SQL syntax and building a secondary index for simple searches. Apache Phoenix enables OLTP and operational analytics on HBase for low latency applications by combining the best of both worlds: standard SQL and JDBC APIs with full ACID transaction capabilities and flexibility of late-bound, schema-on-read capabilities from the NoSQL world on HBase.
SQL Service on HBase at Alibaba Cloud supports global indexes only. Global indexing targets read-heavy use cases. With global indexes, all the performance penalties for indexes occur at write time. Phoenix intercepts the data table updates on write (DELETE, UPSERT VALUES, and UPSERT SELECT), builds the index update, and sends any necessary updates to all interested index tables. At read time, Phoenix will select the index table that will produce the fastest query time and directly scan it like any other HBase table. By default, unless hinted at, an index will not be used for a query that references a column that isn't part of the index. Please check this link for more details about global indexes on Apache Phoenix.
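The write-time cost and the "covered columns only" rule can be illustrated with a toy in-memory model. This is a simplified sketch, not Phoenix code: the class `ToyGlobalIndex`, the column names (`author`, `title`, `body`), and the plan labels are all hypothetical, and the real engine stores index rows in a separate HBase table rather than a Python dictionary.

```python
class ToyGlobalIndex:
    """Toy model of a data table plus one global secondary index."""

    def __init__(self, indexed_column, covered_columns=()):
        self.indexed_column = indexed_column
        # Columns that can be answered from the index alone.
        self.index_columns = {indexed_column, *covered_columns}
        self.data_table = {}   # primary key -> full row
        self.index_table = {}  # indexed value -> {primary key -> covered columns}

    def upsert(self, key, row):
        # Every write also maintains the index: this is the write-time penalty.
        self.delete(key)
        self.data_table[key] = dict(row)
        entry = {col: row.get(col) for col in self.index_columns}
        self.index_table.setdefault(row.get(self.indexed_column), {})[key] = entry

    def delete(self, key):
        old_row = self.data_table.pop(key, None)
        if old_row is None:
            return
        bucket = self.index_table[old_row.get(self.indexed_column)]
        del bucket[key]
        if not bucket:
            del self.index_table[old_row.get(self.indexed_column)]

    def select(self, columns, value):
        """Answer SELECT columns WHERE indexed_column = value; return (plan, rows)."""
        if set(columns) <= self.index_columns:
            hits = self.index_table.get(value, {}).values()
            return "index", [{c: hit[c] for c in columns} for hit in hits]
        # A referenced column is not in the index, so fall back to the data table.
        hits = (r for r in self.data_table.values()
                if r.get(self.indexed_column) == value)
        return "data table scan", [{c: r.get(c) for c in columns} for r in hits]


feeds = ToyGlobalIndex("author", covered_columns=("title",))
feeds.upsert(1, {"author": "alice", "title": "Rates rise", "body": "Full text A"})
feeds.upsert(2, {"author": "bob", "title": "FX update", "body": "Full text B"})
print(feeds.select(["title"], "alice"))  # ('index', [{'title': 'Rates rise'}])
print(feeds.select(["body"], "alice"))   # ('data table scan', [{'body': 'Full text A'}])
```

The second query asks for `body`, which the toy index does not cover, so it falls back to scanning the data table. That mirrors the default behavior described above for queries that reference a column outside the index.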
Figure 9. Apache HBase Management Page – Enabling the SQL Service
Next, configure the service parameters for HBase and Phoenix. There are two pairs of entries (four entries in total) that must all be set to "false." The first pair (two entries) is for the setup on the HBase server side, and the second pair (two entries) is for the setup on the Phoenix client side. Search twice for phoenix.schema.mapSystemTablesToNamespace, and set the value to "false" if it is not already set.
Figure 10. Apache HBase Management Page – Configuring Service Parameters
Lastly, go back to "SQL Service," and click "Restart HBase SQL Service." Now, your HBase is accessible via the client access URL on the "SQL Service" tab as shown in Figure 9.
Serverless is a way to describe the services, practices, and strategies that enable you to build more agile applications for innovation and faster response time. With serverless computing, infrastructure management tasks like capacity provisioning and patching are handled by the cloud provider, so you can focus on writing code that serves your customers. Serverless services like Alibaba Data Lake Analytics (DLA) come with automatic scaling, built-in high availability, and a pay-for-value billing model. DLA provides two data processing engines, built on Apache Spark and Presto, on a compute service that lets you run code in response to events from various natively integrated Alibaba Cloud database services and many other SaaS sources without managing any servers.
Click the top left icon for the dropdown menu. Type "data lake" in the search bar under the "Products and Services" dropdown menu and select "Data Lake Analytics" from the filtered results. You will be redirected to the DLA product homepage. Click "virtual cluster" on the left menu bar, click "new virtual cluster" in the top right corner, and then follow the instructions to create a virtual cluster (e.g., name: "test-hk") in the Hong Kong region and in the same VPC as the Kafka cluster.
Figure 11. Data Lake Analytics Management Page
Now, your Data Lake Analytics Spark virtual cluster is ready.
Go to the search bar under the "Products and Services" dropdown menu by clicking on the top left menu icon. Let's set up MongoDB by typing "mongo" in the search bar as shown below. Click on ApsaraDB for MongoDB under the filtered results. After that, you will be redirected to the MongoDB product landing page.
Select "Sharded Cluster Instance" on the left-hand menu bar, click "Create Instance," and follow the instructions to complete the provisioning. After MongoDB is ready, add "0.0.0.0/0" to the Whitelist Group and configure the security group to open the port (default: 3717).
Figure 12. MongoDB Connection Information
The connection information "ConnectionStringURI" is provided on the "database connection" tab of the MongoDB management landing page as shown in Figure 12.
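Before handing the ConnectionStringURI to a driver, it can help to check which hosts and ports it actually names, since those are the endpoints the whitelist and security group must admit. The sketch below parses a `mongodb://` URI with the standard library only. The hostnames, user, and password are placeholders, port 3717 is assumed here as the usual ApsaraDB for MongoDB port, and the parser deliberately ignores IPv6 literals and `mongodb+srv://` URIs.

```python
from urllib.parse import unquote, urlsplit

DEFAULT_MONGODB_PORT = 27017  # Standard MongoDB port when a host omits one.


def parse_mongo_uri(uri: str) -> dict:
    """Split a mongodb:// URI into user, (host, port) pairs, and database."""
    parts = urlsplit(uri)
    if parts.scheme != "mongodb":
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # Credentials, if present, sit before the last "@" in the network location.
    credentials, _, host_list = parts.netloc.rpartition("@")
    user = unquote(credentials.partition(":")[0]) if credentials else None
    hosts = []
    for item in host_list.split(","):
        host, separator, port = item.rpartition(":")
        if separator:
            hosts.append((host, int(port)))
        else:
            hosts.append((item, DEFAULT_MONGODB_PORT))
    return {
        "user": user,
        "hosts": hosts,
        "database": parts.path.lstrip("/") or None,
    }


# Placeholder URI in the shape of a sharded cluster connection string.
example_uri = "mongodb://root:secret@s-a.example.com:3717,s-b.example.com:3717/admin"
print(parse_mongo_uri(example_uri))
```

Each `(host, port)` pair in the result is an endpoint that clients in the VPC need to reach, which makes the output a convenient checklist when reviewing the security group.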
Go to the search bar under the "Products and Services" dropdown menu by clicking on the top left menu icon. Let's set up the OSS by typing "oss" in the search bar as shown below. Click on Object Storage Service under the filtered results. After that, you will be redirected to the OSS product landing page.
Figure 13. OSS Folder Structure
Click "Bucket" on the left-hand menu bar, click "Create Bucket," and follow the instructions to create your OSS bucket (e.g., "dla-spark-hk"). Then, go to the "Files" tab on the menu bar, and create the corresponding folders "dla-jars" and "dla-logs," as shown in Figure 13.
Congratulations, you have set up all the required components (VPC, Kafka, HBase, Spark virtual cluster, and OSS)! Now, you are ready to build the stream processing solution with Spark Streaming on the Alibaba Cloud Data Lake Analytics service.
Please click this link to follow Part B of this series.