Use cases - DataWorks - Alibaba Cloud Documentation Center

DataWorks, combined with MaxCompute and Hologres, delivers an integrated data platform on the Data Lakehouse architecture. By consolidating batch and real-time workloads into a single environment, you can reduce data analysis cycles from days to minutes or seconds.

What is a Data Lakehouse?

Traditional data architectures force a choice between two models:

Architecture	Strengths	Limitations
Data warehouse	Structured data, fast SQL queries, strong governance	Expensive at scale, rigid schema, no streaming support
Data lake	Low-cost storage, flexible formats, supports ML workloads	Poor query performance, weak governance, no ACID guarantees

A Data Lakehouse combines the strengths of both: the structured query performance and governance of a data warehouse, with the cost efficiency and flexibility of a data lake. A single storage layer serves both batch and real-time workloads, eliminating the need for separate systems.

Challenges of a dual-stack approach

Most enterprises run two separate technology stacks -- one for batch processing (Hive, Spark) and another for real-time streams (Flink, Kafka). This dual-stack approach creates four problems:

Challenge	Description
Architecture fragmentation	Maintaining two separate stacks increases development and operational costs. Keeping data consistent across both systems is difficult.
Delayed insights	Offline warehouse data is not immediately available for ad hoc queries. Business users often wait hours or a full day before they can explore new data. Correlating real-time events with large historical datasets is particularly difficult.
Low resource efficiency	Reserving capacity for peak batch workloads and real-time traffic spikes results in low utilization and high Total Cost of Ownership (TCO).
Staffing overhead	Operating two separate big data systems requires a large, highly skilled team.

Architecture

The platform follows a four-stage data flow, from ingestion through unified analytics.

Architecture diagram

Stage 1: Unified data ingestion and layering

Data Integration ingests data from multiple source types into a unified cloud data lake or data warehouse:

Source type	Examples
Structured databases	MySQL, PostgreSQL, Oracle
Log files	Application logs, access logs
Real-time message queues	Kafka, other streaming sources

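Ingested data follows a standard layering model -- Operational Data Store (ODS), Data Warehouse Detail (DWD), Data Warehouse Summary (DWS), and Application Data Service (ADS) -- so a single copy of the data serves both batch and real-time computing. This eliminates data silos at the source.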
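
The layering model can be illustrated with a minimal Python sketch. It is not DataWorks code: the records, field names, and the three `build_*` functions are hypothetical, and the responsibility assigned to each layer (raw data in ODS, cleaned detail in DWD, summaries in DWS, application-ready results in ADS) is a common reading of the model rather than something this page defines.

```python
from collections import defaultdict

# Hypothetical raw records as they might land in the ODS layer (source data, untouched).
ods_orders = [
    {"order_id": "1", "user": " alice ", "amount": "30.5", "day": "2024-01-01"},
    {"order_id": "2", "user": "bob", "amount": "12.0", "day": "2024-01-01"},
    {"order_id": "3", "user": "alice", "amount": "bad", "day": "2024-01-02"},
    {"order_id": "4", "user": "alice", "amount": "7.5", "day": "2024-01-02"},
]


def build_dwd(ods_rows):
    """DWD: cleaned, typed detail rows. Rows whose amount cannot be parsed are dropped."""
    cleaned = []
    for row in ods_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "user": row["user"].strip(),
            "amount": amount,
            "day": row["day"],
        })
    return cleaned


def build_dws(dwd_rows):
    """DWS: light aggregation, here the total amount per (day, user)."""
    totals = defaultdict(float)
    for row in dwd_rows:
        totals[(row["day"], row["user"])] += row["amount"]
    return dict(totals)


def build_ads(dws_totals):
    """ADS: application-facing result, here the top spender for each day."""
    best = {}
    for (day, user), total in dws_totals.items():
        if day not in best or total > best[day][1]:
            best[day] = (user, total)
    return best


dwd = build_dwd(ods_orders)
dws = build_dws(dwd)
ads = build_ads(dws)
print(ads)
```

Each layer reads only from the layer below it, so a consumer that needs detail rows and a consumer that needs daily summaries both trace back to the same ODS copy.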

Stage 2: Batch processing

MaxCompute SQL nodes in Data Studio handle large-scale data processing. The scheduling system automatically runs Extract, Transform, Load (ETL) tasks daily after midnight, processing terabytes to petabytes of historical data for:

Decision analysis
User profiling
Machine learning
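
A job that runs shortly after midnight typically processes the previous day's data. The sketch below shows that date arithmetic in Python, assuming a `${bizdate}` placeholder that is replaced with the previous calendar day in `yyyymmdd` form. The placeholder handling, the table names `ods_orders` and `dwd_orders`, and the partition column `ds` are illustrative assumptions, not taken from this page.

```python
from datetime import datetime, timedelta


def business_date(run_time: datetime) -> str:
    """Return the calendar day before run_time, formatted as yyyymmdd."""
    return (run_time - timedelta(days=1)).strftime("%Y%m%d")


def render_sql(template: str, run_time: datetime) -> str:
    """Substitute the assumed ${bizdate} placeholder into a SQL template."""
    return template.replace("${bizdate}", business_date(run_time))


# Hypothetical daily ETL statement: rebuild one partition from the raw layer.
TEMPLATE = (
    "INSERT OVERWRITE TABLE dwd_orders PARTITION (ds='${bizdate}') "
    "SELECT order_id, user_id, amount FROM ods_orders WHERE ds='${bizdate}';"
)

# A run at 00:30 on 1 March 2024 processes the data for 29 February 2024.
print(render_sql(TEMPLATE, datetime(2024, 3, 1, 0, 30)))
```

Overwriting a single date partition keeps the job idempotent: rerunning it for the same day replaces that day's output instead of appending duplicates.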

Stage 3: Real-time and near-real-time computing

The platform supports two latency tiers:

Processing mode	Engine	Latency	Use cases
Real-time	Realtime Compute for Apache Flink (Flink SQL nodes)	Milliseconds	Real-time risk control, live dashboards, real-time recommendations
Near-real-time (ad hoc)	Hologres	Seconds	Interactive drill-downs, self-service exploration via BI tools

Hologres runs interactive queries on massive datasets in the data lake or data warehouse. Business analysts and operations staff can perform multi-dimensional drill-downs directly on the latest data, without waiting for scheduled reports.
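
As a rough illustration of how a team might choose between the two tiers, the sketch below maps a latency budget to a processing mode. The one-second cut-off is an assumption; the table above only distinguishes "milliseconds" from "seconds", and a real decision would also weigh whether the workload is a continuous stream or an interactive query.

```python
# Assumed boundary between the tiers; the source gives no exact figure.
REALTIME_THRESHOLD_MS = 1_000


def pick_tier(latency_budget_ms: int) -> dict:
    """Map a latency budget (in milliseconds) to a processing mode and engine."""
    if latency_budget_ms <= 0:
        raise ValueError("latency budget must be positive")
    if latency_budget_ms < REALTIME_THRESHOLD_MS:
        # Sub-second budgets need continuous stream processing.
        return {"mode": "real-time", "engine": "Realtime Compute for Apache Flink"}
    # Budgets of a second or more can be met by interactive queries.
    return {"mode": "near-real-time", "engine": "Hologres"}


print(pick_tier(50))     # for example, a risk-control check
print(pick_tier(3_000))  # for example, an interactive drill-down
```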

Stage 4: Integrated analytics and unified services

Hologres directly accelerates queries on MaxCompute data, enabling federated analysis across real-time and historical datasets without duplicating data between systems.
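
The idea of querying real-time and historical data together without copying either can be sketched with Python's built-in `sqlite3` module. SQLite stands in for both engines here, so this shows only the shape of a federated query, not Hologres syntax or behavior; all table, view, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for historical data held in the batch warehouse.
    CREATE TABLE historical_orders (day TEXT, user_name TEXT, amount REAL);
    INSERT INTO historical_orders VALUES
        ('2024-01-01', 'alice', 30.5),
        ('2024-01-01', 'bob', 12.0);

    -- Stand-in for rows that have just arrived from a stream.
    CREATE TABLE realtime_orders (day TEXT, user_name TEXT, amount REAL);
    INSERT INTO realtime_orders VALUES ('2024-01-02', 'alice', 7.5);

    -- The view reads both tables in place; no rows are copied.
    CREATE VIEW all_orders AS
        SELECT day, user_name, amount, 'historical' AS source FROM historical_orders
        UNION ALL
        SELECT day, user_name, amount, 'realtime' AS source FROM realtime_orders;
    """
)


def total_by_user(connection):
    """Aggregate across historical and real-time rows in a single query."""
    rows = connection.execute(
        "SELECT user_name, SUM(amount) FROM all_orders "
        "GROUP BY user_name ORDER BY user_name"
    ).fetchall()
    return dict(rows)


print(total_by_user(conn))
```

Because the view is only a query definition, a row inserted into `realtime_orders` shows up in the next call to `total_by_user` without any synchronization step.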

DataService Studio packages analysis results into standard APIs, providing a single data service endpoint for:

Business applications
BI reports and dashboards
Downstream systems
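
To make "packaging analysis results into standard APIs" concrete, the sketch below wraps precomputed results in a uniform JSON envelope. It is a hypothetical illustration in plain Python: the path `/daily_revenue`, the `code`/`message`/`data` envelope, and the `handle_request` function are assumptions and do not describe the actual DataService Studio interface.

```python
import json

# Hypothetical precomputed analysis results, keyed by API path.
RESULTS = {
    "/daily_revenue": [
        {"day": "2024-01-01", "revenue": 42.5},
        {"day": "2024-01-02", "revenue": 7.5},
    ],
}


def handle_request(path: str, params: dict) -> str:
    """Return a JSON response with the same envelope for every consumer."""
    rows = RESULTS.get(path)
    if rows is None:
        return json.dumps({"code": 404, "message": "unknown API", "data": []})
    day = params.get("day")
    if day is not None:
        # Optional filter so callers can request a single day.
        rows = [row for row in rows if row["day"] == day]
    return json.dumps({"code": 0, "message": "ok", "data": rows})


print(handle_request("/daily_revenue", {"day": "2024-01-02"}))
```

A business application, a dashboard, and a downstream system can all call the same path and parse the same envelope, which is the point of a single service endpoint.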

Component summary

Component	Role	Connects to
Data Integration	Ingests batch and streaming data from external sources	MaxCompute, Hologres
MaxCompute	Stores and batch-processes historical data (TB/PB scale)	Hologres (for accelerated queries)
Hologres	Runs interactive queries with second-level latency on both live and historical data	MaxCompute, DataService Studio
Flink SQL	Processes data streams with millisecond latency	Hologres, MaxCompute
Data Studio	Development environment for authoring and scheduling SQL nodes	MaxCompute, Flink SQL
DataService Studio	Exposes query results as standard APIs	Business applications, BI tools
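
The "Connects to" column can be read as a small directed graph. The Python sketch below encodes the table as an adjacency map and uses breadth-first search to trace how data could travel from ingestion to a consumer. Treating each connection as a one-way edge is an assumption made for illustration, and `shortest_path` is a hypothetical helper.

```python
from collections import deque

# Adjacency map transcribed from the component summary table.
CONNECTS_TO = {
    "Data Integration": ["MaxCompute", "Hologres"],
    "MaxCompute": ["Hologres"],
    "Hologres": ["MaxCompute", "DataService Studio"],
    "Flink SQL": ["Hologres", "MaxCompute"],
    "Data Studio": ["MaxCompute", "Flink SQL"],
    "DataService Studio": ["Business applications", "BI tools"],
}


def shortest_path(start: str, goal: str):
    """Breadth-first search; returns the shortest list of hops, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in CONNECTS_TO.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_path("Data Integration", "Business applications"))
```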

Benefits

Benefit	Details
Lower TCO	A single storage layer, one development platform, and multiple compute engines reduce development and operational complexity, which can lower TCO by over 50%.
Faster time to insight	Data analysis cycles drop from days to minutes or seconds, shifting decisions from periodic reviews to real-time insights.
Self-service analytics	High-performance interactive queries let business users explore data independently, reducing the number of manual ad hoc data requests that analysts must handle.
Data-driven innovation	A unified, real-time data foundation supports user behavior analysis, precision marketing, financial risk control, and intelligent supply chains.

Customer case study

Financial services: A data lakehouse implementation at an Internet finance company