How to Build a Cloud-Native Open-Source Big Data Platform | Best Practices of InMobi

By Murray Zhu, Head of Technical Operations and Maintenance of InMobi

By tightly integrating open-source technologies with cloud-native architecture, the open-source big data platform of Alibaba Cloud has accumulated rich practical experience in functionality, usability, and security. It serves thousands of enterprises and helps them focus on their core business, shorten development cycles, reduce the difficulty of operations and maintenance, and pursue more business innovation.

This article mainly shares the best practices of InMobi based on the open-source big data service of Alibaba Cloud.

1. Enterprise Introduction

InMobi is an AI-driven and effect-driven global platform for mobile advertising and marketing technology. Building on the large number of applications and users it connects worldwide, InMobi provides mobile advertising and marketing technology services for domestic brands and applications, and offers advertising commercialization and monetization services for application developers. The platform was established in 2007 and entered the Chinese market in 2011. It is driven by technology research and development and holds an important position in the mobile advertising platform industry, both in China and globally. InMobi reaches over one billion monthly active unique users through localized service teams in 23 countries and regions. It provides tens of thousands of fine-grained audience classifications, thousands of dimensional tags, data from customized sample libraries of tens of millions of users, and accurate mobile advertising using location-based services (LBS).

As a leading technology enterprise, InMobi was named one of the CNBC Disruptor 50 in 2019 and one of the Most Innovative Companies by Fast Company magazine in 2018.

2. Big Data Solution of InMobi in China

The preceding figure shows the original big data cluster architecture for InMobi in China, which is divided into a data ingestion layer, a storage layer, a compute layer, and a reporting layer. First, advertising data from the advertising front end (especially RR and other data) is ingested through the data ingestion layer. Then, the data is stored in the offline HDFS big data cluster, and data tasks are processed by the compute cluster. Finally, the processed results are presented to end users through reports.

Several problems gradually emerged during the operations and maintenance of the big data cluster:

The big data cluster is built in the IDC, which is not conducive to resource scaling and expansion.

When compute resources are insufficient, some tasks must be rescheduled (or even suspended) so that important tasks can run first, which delays report generation.

Data reports lack real-time capability.

Reports are not fresh enough to meet the business teams' need for minute-level reporting.

The Vertica database, which is used to process real-time report data, is relatively expensive.
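The resource contention described in the first problem can be illustrated with a small sketch. The Python example below is not InMobi's scheduler. The task names, priorities, and slot counts are hypothetical, and the logic only shows the general idea: with a fixed compute capacity, high-priority tasks run first and whatever does not fit is deferred.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                      # lower number = more important (assumed convention)
    name: str = field(compare=False)
    slots: int = field(compare=False)  # compute slots the task needs (hypothetical unit)


def schedule(tasks, capacity):
    """Admit tasks in priority order; a task that does not fit in the
    remaining capacity is deferred."""
    heap = list(tasks)
    heapq.heapify(heap)
    running, deferred = [], []
    while heap:
        task = heapq.heappop(heap)
        if task.slots <= capacity:
            capacity -= task.slots
            running.append(task.name)
        else:
            deferred.append(task.name)
    return running, deferred


# Hypothetical workload on a fixed-size cluster with 8 free slots.
tasks = [
    Task(2, "daily_report", 6),
    Task(1, "billing_rollup", 5),
    Task(3, "adhoc_analysis", 4),
]
running, deferred = schedule(tasks, capacity=8)
print(running)   # tasks admitted now
print(deferred)  # tasks that must wait for capacity
```

On a cluster that cannot grow, the deferred list is exactly the set of reports that arrive late. Elastic capacity shortens that list by raising `capacity` on demand.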

3. Big Data Cluster Optimization Scheme of InMobi in China

How InMobi Thinks about Big Data Cluster Optimization

Based on the three typical problems above, InMobi considered the following optimization solutions:

Build a hybrid cloud architecture and introduce the big data service of Alibaba Cloud to make storage and compute resources scalable.

Add more big data service nodes on the cloud and use the elasticity of the big data service to cover shortages in compute and storage capacity, especially in temporary scenarios where resources are tight, such as the 618 and Double 11 shopping festivals.

Replace the Vertica database with EMR ClickHouse to improve the efficiency of real-time report data queries and save costs.

As an open-source product, ClickHouse has been deployed at scale in the business scenarios of Internet companies in China.

Build a real-time data warehouse system based on Flink and EMR ClickHouse to completely solve the problem of the real-time capability of data reports.

The real-time data warehouse system brings report latency down to at least the minute level, and to the second level for reports with special requirements.
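To make "minute-level" reporting concrete, the following Python sketch shows the kind of aggregation a streaming job performs: events are grouped into fixed one-minute buckets (a tumbling window) and counted per key. This is plain Python rather than Flink code, and the event fields `ts` and `ad_id` are assumptions made for illustration. It also ignores late or out-of-order events, which a real streaming engine has to handle.

```python
from collections import defaultdict


def minute_bucket(ts_seconds):
    """Truncate a timestamp in seconds to the start of its minute."""
    return ts_seconds - (ts_seconds % 60)


def aggregate_by_minute(events):
    """Count events per (minute, ad_id); a stand-in for a one-minute tumbling window."""
    counts = defaultdict(int)
    for event in events:
        key = (minute_bucket(event["ts"]), event["ad_id"])
        counts[key] += 1
    return dict(counts)


# Hypothetical events; timestamps are seconds since an arbitrary start.
events = [
    {"ts": 5, "ad_id": "a1"},
    {"ts": 30, "ad_id": "a1"},
    {"ts": 65, "ad_id": "a1"},
    {"ts": 70, "ad_id": "b2"},
]
report = aggregate_by_minute(events)
for (minute, ad_id), count in sorted(report.items()):
    print(minute, ad_id, count)
```

Each finished bucket becomes one row per key in the report table, so report freshness is bounded by the window size.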

Specific Optimization Solutions to Big Data Cluster

Decouple real-time from offline data warehouses:
  In the IDC big data cluster, offline data reporting resources and real-time reporting resources are completely decoupled.
  In the IDC big data cluster, offline data reporting tasks and real-time reporting tasks are completely decoupled.
Reconstruct a real-time data warehouse:
  Migrate the Kafka log cluster to Alibaba Cloud.
  On Alibaba Cloud, reconstruct a real-time data warehouse cluster based on Flink and EMR ClickHouse.
  Migrate the original Storm tasks from the IDC to the new real-time data warehouse cluster.
Optimize the offline data warehouse:
  Optimize and recycle HDP big data cluster resources in the IDC to save costs.
  Establish an offline data warehouse, Hive.
  Open new data nodes on Alibaba Cloud and add them to the offline big data cluster to expand storage and compute resources.
  Build a new Flume cluster on Alibaba Cloud and store the raw data from Kafka in the Hadoop Distributed File System (HDFS).

An Optimized Big Data Cluster Architecture

As shown in the preceding figure, the optimized big data cluster architecture is divided into two parts:

AliCloud (Real Time): Alibaba Cloud is mainly responsible for real-time data processing.

Read RR logs from Kafka and write them to ClickHouse for real-time reports. Read useful data from Kafka and store it in MySQL and PostgreSQL based on business requirements.

IDC (Offline): The IDC is mainly responsible for offline data processing and reporting.

Use Flume to store all raw data from Kafka in the HDFS cluster, and then perform data analysis and data regulation. In the offline big data cluster, all the business requirements of offline reports run as Spark tasks. Finally, the results are written back to ClickHouse to display offline data reports.
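Both paths above end in writes to ClickHouse, which generally performs better with fewer, larger inserts than with many single-row inserts. The sketch below shows one common way to handle this: buffer rows and flush them in batches. It is an illustration under assumptions, not InMobi's implementation. The `sink` callable is a hypothetical placeholder for whatever client actually performs the insert, and `max_rows=3` is only chosen to keep the example short.

```python
class BatchWriter:
    """Buffer rows and hand them to a sink in batches instead of one at a time."""

    def __init__(self, sink, max_rows):
        self.sink = sink          # callable that receives a list of rows (hypothetical)
        self.max_rows = max_rows
        self.buffer = []

    def write(self, row):
        """Add one row and flush automatically when the buffer is full."""
        self.buffer.append(row)
        if len(self.buffer) >= self.max_rows:
            self.flush()

    def flush(self):
        """Send any buffered rows to the sink and start a new buffer."""
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []


batches = []                      # stands in for the target table
writer = BatchWriter(sink=batches.append, max_rows=3)
for i in range(7):
    writer.write({"request_id": i})
writer.flush()                    # push the final partial batch

print([len(batch) for batch in batches])
```

A production writer would usually also flush on a timer so that a quiet stream does not hold rows back indefinitely.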

4. More Technical Explorations and Implementation in the Future

Build a Real-Time Data Warehouse Integrating Stream and Batch Processing Based on Flink and Hologres

The architecture of Hologres separates storage and compute. The compute layer is fully deployed on Kubernetes, and the storage layer can use shared storage. You can select HDFS or OSS on the cloud based on business requirements to scale resources elastically and solve concurrency problems caused by insufficient resources. It is very suitable for the advertising business scenarios of InMobi.

Flink performs ETL processing for stream and batch data, writes the processed data into Hologres for unified storage and queries, and enables the business end to connect directly to Hologres to provide online services, improving production efficiency.
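The idea of integrating stream and batch processing can be sketched in a few lines: one transformation is shared by both paths, and both write into the same keyed table with upsert semantics, so a replay or a later update overwrites a row instead of duplicating it. The Python example below is a conceptual sketch only. It uses neither the Flink nor the Hologres API, and the record fields and the in-memory `table` are assumptions made for illustration.

```python
def transform(record):
    """One ETL function shared by the batch and the streaming path."""
    return {
        "key": (record["day"], record["ad_id"]),
        "clicks": int(record["clicks"]),
    }


def upsert(table, row):
    """Insert or overwrite by primary key, so replays do not create duplicates."""
    table[row["key"]] = row["clicks"]


table = {}  # stands in for a single serving table

# Hypothetical inputs: a historical backfill and a live stream.
batch_records = [{"day": "d1", "ad_id": "a1", "clicks": "10"}]
stream_records = [
    {"day": "d2", "ad_id": "a1", "clicks": "3"},
    {"day": "d2", "ad_id": "a1", "clicks": "4"},  # later update for the same key
]

for record in batch_records + stream_records:
    upsert(table, transform(record))

print(table)
```

Because both paths share `transform` and the same primary key, the serving layer sees one consistent table regardless of which path produced a row.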

These are the best practices of InMobi based on the open-source big data service of Alibaba Cloud.
