What is DataHub?

DataHub is a real-time data distribution platform designed to process streaming data. You can publish streaming data to DataHub, subscribe to that data, and distribute it to other platforms. This lets you analyze streaming data and build applications on top of it.

DataHub collects, stores, and processes streaming data from mobile devices, applications, website services, and sensors. You can write your own applications or use Realtime Compute to process streaming data in DataHub, such as real-time website access logs, application logs, and events. The processing results, such as alerts and statistics presented in graphs and tables, are updated in real time.

Introduction to DataHub

DataHub is built on the Apsara system of Alibaba Cloud. It offers high availability, scalability, and throughput together with low latency. DataHub is seamlessly integrated with Realtime Compute, so you can use SQL to analyze streaming data.

DataHub can also distribute streaming data to Alibaba Cloud services such as MaxCompute and Object Storage Service (OSS).

The following figure shows the architecture of DataHub.

Benefits

High throughput

DataHub allows you to write up to 160 million records per day to a single shard.
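
To make the per-shard figure easier to reason about, the following minimal sketch converts the daily limit quoted above into an average per-second rate and estimates how many shards a given daily volume would need. The helper names are hypothetical, and the estimate assumes that records are spread evenly across shards and across the day, which real traffic rarely is.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60               # 86,400 seconds
MAX_RECORDS_PER_SHARD_PER_DAY = 160_000_000  # per-shard figure quoted above


def avg_records_per_second(records_per_day: int) -> float:
    """Average write rate implied by a daily record count."""
    return records_per_day / SECONDS_PER_DAY


def shards_needed(records_per_day: int) -> int:
    """Hypothetical helper: minimum shard count, assuming perfectly even load."""
    return max(1, math.ceil(records_per_day / MAX_RECORDS_PER_SHARD_PER_DAY))


print(round(avg_records_per_second(MAX_RECORDS_PER_SHARD_PER_DAY)))  # 1852
print(shards_needed(500_000_000))                                    # 4
```

In practice you would likely add headroom for traffic peaks rather than size a topic at the exact minimum.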

Timeliness

DataHub makes it easy to collect and process various types of streaming data in real time, so you can respond quickly to new business data.

Ease of use

DataHub provides a variety of SDKs for programming languages, such as C++, Java, Python, and Go.
DataHub provides a RESTful API that you can call directly.
DataHub provides common plug-ins, such as Fluentd, Logstash, and Flume. You can use these plug-ins to write streaming data to DataHub.
DataHub supports both structured and unstructured data. To write untyped, unstructured data, create a topic of the BLOB type. To write structured data, create a topic of the TUPLE type and define a schema for the data before it is written to DataHub.
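
The difference between the two topic types can be illustrated with a small, self-contained sketch. It does not use the DataHub SDK: the `Field` class, the type-name mapping, and both validation functions are hypothetical stand-ins that only model the idea that a TUPLE record must match a predeclared schema, while a BLOB record is an opaque byte payload.

```python
from dataclasses import dataclass
from typing import Any, Sequence

# Hypothetical mapping from schema type names to Python types (illustration only).
FIELD_TYPES = {"STRING": str, "BIGINT": int, "DOUBLE": float, "BOOLEAN": bool}


@dataclass(frozen=True)
class Field:
    name: str
    type_name: str


def validate_tuple_record(schema: Sequence[Field], values: Sequence[Any]) -> bool:
    """Return True if the values match the schema in both count and type."""
    if len(values) != len(schema):
        return False
    for field, value in zip(schema, values):
        expected = FIELD_TYPES[field.type_name]
        # bool is a subclass of int in Python, so reject it for BIGINT explicitly.
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True


def validate_blob_record(payload: Any) -> bool:
    """A BLOB-style record carries no schema: any byte payload is accepted."""
    return isinstance(payload, (bytes, bytearray))


access_log_schema = [
    Field("url", "STRING"),
    Field("status", "BIGINT"),
    Field("latency_ms", "DOUBLE"),
]

print(validate_tuple_record(access_log_schema, ["/index.html", 200, 12.5]))    # True
print(validate_tuple_record(access_log_schema, ["/index.html", "200", 12.5]))  # False
print(validate_blob_record(b"\x00\x01raw bytes"))                              # True
```

A schema catches malformed records at write time, whereas a schemaless payload leaves interpretation entirely to the consumer.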

High availability

DataHub provides service availability of at least 99.9%.
The processing capacity of DataHub is automatically expanded without affecting your services.
DataHub provides data durability of at least 99.999%.
DataHub automatically stores multiple copies of data for backup.

Scalability

You can dynamically increase or decrease the throughput of each topic. The maximum throughput of a topic is 256,000 records per second.

High security

DataHub provides enterprise-level security measures and isolates resources between users.
DataHub provides several methods of authentication and authorization, such as whitelist configuration and RAM user management.

Scenarios

As a streaming data processing platform, DataHub can be used together with various Alibaba Cloud services to provide one-stop data processing services.


Realtime Compute

Realtime Compute is a stream computing engine of Alibaba Cloud that lets you analyze streaming data by using an SQL-like language. Data can be transferred from DataHub to Realtime Compute. For more information, see Create a DataHub source table.

Data utilization

You can build an application that consumes the data in DataHub, processes it in real time, and outputs the results. You can also have another application process the streaming output of the first one. Chaining applications in this way forms a data processing pipeline structured as a directed acyclic graph (DAG).
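
The chained-application idea can be sketched as two stages connected by Python generators. This is only an illustration under stated assumptions: the stage names, the two-field log format, and the in-memory sample data are all hypothetical, and a real deployment would read from and write to DataHub topics instead of passing records in memory.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def parse_stage(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Stage 1: turn raw '<url> <status>' lines into records, skipping bad input."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # drop malformed lines instead of failing the pipeline
        yield {"url": parts[0], "status": int(parts[1])}


def error_count_stage(records: Iterable[Dict[str, object]]) -> Dict[str, int]:
    """Stage 2: consume stage 1 output and count server errors (5xx) per URL."""
    counts: Counter = Counter()
    for record in records:
        if record["status"] >= 500:
            counts[record["url"]] += 1
    return dict(counts)


sample_lines = ["/home 200", "/api 500", "malformed", "/api 503", "/home 502"]

# Stage 2 reads directly from stage 1, forming a two-node DAG.
print(error_count_stage(parse_stage(sample_lines)))  # {'/api': 2, '/home': 1}
```

Because each stage depends only on the output of the stage before it, further stages (for example, alerting on the counts) can be appended without changing the earlier ones.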

Data archiving

Streaming data can be archived to MaxCompute. To periodically archive the streaming data in DataHub to MaxCompute, you only need to create and configure a DataConnector.