
E-MapReduce: Scenarios

Last Updated: Mar 26, 2026

Alibaba Cloud E-MapReduce (EMR) supports scalable computing clusters, multi-source heterogeneous data integration and governance, and unified batch and stream processing. EMR is well-suited for data-intensive workloads including financial risk control, e-commerce precision marketing, and IoT time-series data processing. This document describes how EMR applies to four scenarios: data lake, data analytics, real-time data streaming, and data serving.

Data lake scenario

EMR DataLake clusters are designed for teams that need to centralize storage, governance, and analysis for enterprise data lakes built on OSS-HDFS. Three core capabilities work together to cover the full lifecycle from raw data ingestion to downstream application.

  • Unified storage layer (OSS-HDFS): An object storage layer compatible with the Hadoop Distributed File System (HDFS) protocol. OSS-HDFS replaces on-premises HDFS, decouples compute from storage, and scales compute nodes independently.

  • Lake metadata governance (Data Lake Formation, DLF): A unified metadata catalog across Object Storage Service (OSS), databases, and file systems. DLF provides automatic metadata discovery, fine-grained permission management, and data lineage tracking.

  • Full-stack analysis engine (Spark, Hive, and Presto/Trino): Offline extract, transform, load (ETL) via Spark or Hive, and interactive queries via Presto or Trino. Both DataWorks and Quick BI integrations are supported.

The following diagram shows the end-to-end data flow for the EMR DataLake scenario.

End-to-end data flow for the EMR DataLake scenario

Step 1: Ingest data from multiple sources

Data from different source systems lands in OSS-HDFS in its original format.

  • Relational databases (MySQL, Oracle): Use Sqoop or DataX to extract full or incremental data on a regular schedule and write it to OSS-HDFS based on the business table schema.

  • Non-relational databases (MongoDB, Redis): Use a custom script or a Spark connector to export JSON or binary data to OSS-HDFS.

  • Log data: Use Logstash or Flume to collect incremental logs — user behavior logs, system logs — and write them to OSS-HDFS with minute-level latency.

  • File data: Use JindoSDK with the HDFS API to upload CSV, Parquet, and other files to OSS-HDFS in bulk, or upload files directly from the OSS console.
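The scheduled extractions above all follow the same incremental pattern: pull only rows modified since the last run, then advance a watermark. The sketch below shows that logic in plain Python under hypothetical table and column names; in practice Sqoop or DataX handles this against the source database.

```python
from datetime import datetime

def extract_incremental(rows, last_watermark):
    """Return rows changed after last_watermark and the advanced watermark."""
    changed = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in changed),
                        default=last_watermark)
    return changed, new_watermark

# Hypothetical source rows standing in for a MySQL business table.
orders = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 10)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 8)},
]

batch, wm = extract_incremental(orders, datetime(2024, 1, 1, 12))
# batch holds orders 2 and 3; wm advances to the newest change seen
```

Each scheduled run persists `wm` alongside the extracted batch so the next run resumes from where this one stopped.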

Step 2: Process and analyze data

Raw data in OSS-HDFS is refined into queryable business metrics.

  • Batch processing: Run Spark and Hive jobs to cleanse, associate, and aggregate raw logs and business data. Typical outputs include daily active users, 30-day user retention rates, and new-order counts by Stock Keeping Unit (SKU).

  • Interactive queries: Use Trino or Presto with standard SQL to query large datasets with sub-second response times, supporting ad hoc analysis for operations teams.
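The batch metrics named above reduce to distinct-user counts and cohort intersections. This plain-Python sketch shows the logic on hypothetical event records; the production versions would be Spark or Hive jobs over data in OSS-HDFS.

```python
from datetime import date

# Hypothetical cleansed user-event records.
events = [
    {"user": "u1", "day": date(2024, 1, 1)},
    {"user": "u2", "day": date(2024, 1, 1)},
    {"user": "u1", "day": date(2024, 1, 31)},
]

def daily_active_users(events, day):
    """Count distinct users active on a given day."""
    return len({e["user"] for e in events if e["day"] == day})

def retention(events, cohort_day, later_day):
    """Fraction of the cohort_day users who returned on later_day."""
    cohort = {e["user"] for e in events if e["day"] == cohort_day}
    retained = {e["user"] for e in events if e["day"] == later_day} & cohort
    return len(retained) / len(cohort) if cohort else 0.0

daily_active_users(events, date(2024, 1, 1))             # 2 distinct users
retention(events, date(2024, 1, 1), date(2024, 1, 31))   # 0.5, only u1 returned
```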

Step 3: Apply data to downstream systems

Processed data powers a range of business applications.

  • Data services: Expose processed data through API services to downstream systems such as risk control engines and recommendation systems.

  • Business intelligence: Connect via the Java Database Connectivity (JDBC) API to tools such as Quick BI to build interactive reports.

  • Predictive analysis: Push processing results and feature data to a machine learning platform to train models — for example, an SKU sales prediction model — and store the results back in the data lake.

  • Data visualization: Connect via the JDBC API to visualization tools such as DataV to display complex data on dashboards.

Data analytics scenario

EMR Online Analytical Processing (OLAP) clusters are built for teams running high-performance big data analytics. They integrate with OLAP engines — StarRocks, Doris, and ClickHouse — that share three performance characteristics: efficient data compression, column-oriented storage, and parallel query execution. These characteristics make OLAP clusters well-suited for user profiling, recipient selection, and business intelligence workloads.
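To see why column-oriented storage and compression reinforce each other, consider a toy example: once a column's values sit together, repeated values collapse under even a naive run-length encoding. Real engines such as StarRocks, Doris, and ClickHouse use far more sophisticated codecs, so this is only an illustration.

```python
# Hypothetical row-oriented records.
rows = [
    {"city": "Hangzhou", "amount": 10},
    {"city": "Hangzhou", "amount": 12},
    {"city": "Beijing",  "amount": 7},
]

def to_column(rows, name):
    """Extract one column, as a columnar store would lay it out."""
    return [r[name] for r in rows]

def rle(values):
    """Naive run-length encoding: adjacent repeats become (value, count)."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

rle(to_column(rows, "city"))  # [("Hangzhou", 2), ("Beijing", 1)]
```

A row-oriented layout interleaves `city` and `amount`, which breaks up the runs and defeats this kind of compression.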

The following diagram uses the StarRocks analytics engine to show the end-to-end data flow for the EMR OLAP scenario.

End-to-end data flow for the EMR OLAP scenario using StarRocks

Step 1: Collect data

Data arrives in StarRocks from both real-time and offline sources.

  • Real-time: Flume captures log data; Message Queue for Apache Kafka buffers data streams with high throughput and low latency to stabilize real-time ingestion.

  • Offline: Sqoop or DataX extracts data from relational databases such as MySQL and Oracle on a regular schedule and loads it into StarRocks.

Step 2: Layer data in StarRocks

StarRocks organizes data into four hierarchical layers — a pattern similar to the medallion architecture used in modern data lakehouses — so each layer builds on the previous and can be reused independently.

  • DIM layer (dimension layer): Stores dimension data such as user attributes and commodity categories, and supports multi-granularity analysis.

  • ODS layer (operational data store layer): Holds raw data in its original state, enabling backtracking analysis.

  • DWD layer (data warehouse detail layer): Applies data cleansing, format standardization, and basic association to produce detailed datasets.

  • DWS layer (data warehouse summary layer): Pre-aggregates metrics by business subject — user behavior, order conversion — to reduce query latency.
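The four-layer flow can be sketched end to end in a few lines: raw ODS records are cleansed into DWD, aggregated into DWS per user, then joined with DIM attributes. Plain dicts stand in for StarRocks tables, and the field names are hypothetical.

```python
# ODS: raw events in their original state, including a dirty record.
ods = [
    {"user_id": "u1", "event": "click", "ts": 1},
    {"user_id": "u1", "event": "",      "ts": 2},   # dirty: empty event
    {"user_id": "u2", "event": "order", "ts": 3},
]
# DIM: dimension attributes per user.
dim_users = {"u1": {"segment": "new"}, "u2": {"segment": "vip"}}

# DWD: cleanse by dropping records with no event type.
dwd = [e for e in ods if e["event"]]

# DWS: pre-aggregate event counts per user.
dws = {}
for e in dwd:
    dws.setdefault(e["user_id"], {"events": 0})["events"] += 1

# Join DWS metrics with DIM attributes for downstream use.
profile = {u: {**m, **dim_users[u]} for u, m in dws.items()}
# profile["u2"] -> {"events": 1, "segment": "vip"}
```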

Step 3: Apply data

Layered data drives three main business applications.

  • User profiling: Combine DIM-layer attribute tags with DWS-layer behavioral data to build user profiles for precision marketing.

  • Recipient selection: Filter users by multiple conditions, such as users who were highly active but made no purchases in the past 30 days.

  • Business intelligence: Connect via the JDBC API to Quick BI to generate daily reports, weekly reports, and real-time dashboards.
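The recipient-selection filter described above would normally be a single SQL query against StarRocks; here is the same predicate in plain Python, with a hypothetical activity threshold and field names.

```python
# Hypothetical per-user metrics over the past 30 days.
users = [
    {"id": "u1", "visits_30d": 25, "orders_30d": 0},
    {"id": "u2", "visits_30d": 30, "orders_30d": 2},
    {"id": "u3", "visits_30d": 3,  "orders_30d": 0},
]

ACTIVE_VISITS = 20  # cutoff for "highly active" (assumption)

# Highly active users with no purchases in the window.
selected = [u["id"] for u in users
            if u["visits_30d"] >= ACTIVE_VISITS and u["orders_30d"] == 0]
# selected == ["u1"]
```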

Real-time data streaming scenario

EMR Dataflow clusters are designed for workloads that require continuous, low-latency data ingestion and analysis — such as real-time risk control and real-time dashboards. These clusters integrate three core components that cover storage, stream processing, and incremental data management.

  • OSS-HDFS: A scalable storage layer compatible with the HDFS protocol. Supports persistent storage of petabytes of real-time data, millisecond-level writes, and low-cost cold and hot data tiering.

  • Flink: Performs ETL on data streams (log parsing, dimension association), window aggregation (minute-level Gross Merchandise Volume (GMV) measurement), and complex event processing (risk control rule evaluation).

  • Paimon: Manages real-time incremental data and historical snapshots in a streaming data lake. Supports change data capture (CDC) synchronization, ACID (atomicity, consistency, isolation, and durability) transactions, and time-travel queries.
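The minute-level GMV measurement mentioned for Flink is a tumbling-window sum. The sketch below reproduces that logic in plain Python on hypothetical order events with epoch-second timestamps; Flink would maintain the same windows as keyed, fault-tolerant state.

```python
from collections import defaultdict

def minute_gmv(orders):
    """Sum order amounts per one-minute tumbling window."""
    windows = defaultdict(float)
    for o in orders:
        window_start = o["ts"] - o["ts"] % 60  # align to minute boundary
        windows[window_start] += o["amount"]
    return dict(windows)

orders = [{"ts": 5, "amount": 10.0}, {"ts": 59, "amount": 5.0},
          {"ts": 61, "amount": 7.5}]
minute_gmv(orders)  # {0: 15.0, 60: 7.5}
```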

The following diagram shows how Flink, Paimon, and OSS-HDFS work together to build a streaming data lakehouse that powers a real-time dashboard.

End-to-end data flow for the EMR Dataflow streaming data lakehouse scenario

Step 1: Ingest data in real time from multiple sources

Use Flink connectors to collect database changes, logs, and event tracking data from multiple upstream systems simultaneously.

Step 2: Build the streaming data lakehouse

Flink and Paimon process and store data in a structured, queryable lakehouse.

  • Flink: As an integrated batch and stream data computing engine, Flink consumes data streams in real time and performs cleansing, transformation (log parsing, event tracking point standardization), and dimension association.

  • Paimon: Stores processing results in a streaming data lake with two key mechanisms:

    • Changelog: Records data insertions, updates, and deletions to guarantee ACID transaction integrity and real-time incremental synchronization.

    • Hierarchical modeling: Combines Paimon with ODS, DWD, and DWS layers to build a layered data architecture for incremental data accumulation and reuse.

  • OSS-HDFS: Persistently stores raw logs, Paimon incremental snapshots, and historical archived data.
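The changelog mechanism can be pictured as replaying a stream of inserts, updates, and deletes onto a table snapshot. This is a minimal sketch of that replay; the operation codes and record shape are simplified assumptions, and real Paimon changelogs carry considerably more detail.

```python
def apply_changelog(snapshot, changelog):
    """Replay insert/update/delete changes onto a snapshot of the table."""
    table = dict(snapshot)
    for change in changelog:
        op, key = change["op"], change["key"]
        if op in ("+I", "+U"):            # insert, or update to a new value
            table[key] = change["value"]
        elif op == "-D":                  # delete
            table.pop(key, None)
    return table

snapshot = {"order1": "created"}
changelog = [
    {"op": "+U", "key": "order1", "value": "paid"},
    {"op": "+I", "key": "order2", "value": "created"},
    {"op": "-D", "key": "order1"},
]
apply_changelog(snapshot, changelog)  # {"order2": "created"}
```

Because each replayed batch produces a complete new snapshot, earlier snapshots remain addressable, which is the basis for time-travel queries.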

Step 3: Deliver business insights

Processed lakehouse data feeds real-time reporting and decision-making tools.

  • Use StarRocks to generate real-time business reports such as GMV monitoring and user retention analysis.

  • Connect to Quick BI to build dashboards that enable T+0 decision making.

Data serving scenario

EMR DataServing clusters are built for workloads that require storing and querying massive datasets at millisecond-level latency. The cluster integrates OSS-HDFS, HBase, and Phoenix to cover the full path from raw data storage to complex query execution — supporting use cases such as user behavior analysis and precision marketing.

  • HBase: A distributed, column-oriented database that provides high-throughput real-time read and write for large-scale datasets, including millisecond-level point queries for order status and user behavior records. HBase persists HFiles to OSS-HDFS, decoupling storage from compute and enabling fast cluster recreation.

  • Phoenix: The SQL query engine for HBase. Phoenix maps NoSQL data to standard relational tables and supports complex SQL analysis — multi-table association and aggregate computing — with sub-second response times at hundreds of billions of records. Secondary indexes and query pushdown accelerate tag-based selection and user grouping, lowering the development effort for business teams.
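Millisecond point queries in HBase hinge on row-key design: the key must locate a record directly while spreading writes across regions. The sketch below illustrates one common pattern, reversing the user id to avoid hot-spotting on monotonically growing ids, with a plain dict standing in for the HBase table. The key layout and names are illustrative assumptions, not a prescribed schema.

```python
def row_key(user_id, order_id):
    """Compose a row key; reversing user_id spreads sequential ids."""
    return f"{user_id[::-1]}#{order_id}"

table = {}  # stands in for an HBase table keyed by row key

def put(user_id, order_id, status):
    table[row_key(user_id, order_id)] = {"status": status}

def get_status(user_id, order_id):
    """Point query: one key lookup, no scan."""
    return table[row_key(user_id, order_id)]["status"]

put("user1001", "o42", "shipped")
get_status("user1001", "o42")  # "shipped"
```

Phoenix layers SQL on top of exactly this kind of keyed access, adding secondary indexes for predicates that do not align with the row key.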

The following diagram shows how EMR DataServing clusters use the HBase + OSS-HDFS storage architecture and the Phoenix query engine to support user behavior analysis.

End-to-end data flow for the EMR DataServing user behavior analysis scenario

Step 1: Process data

Two processing paths — stream and batch — feed data into the HBase cluster.

  • Stream processing (Flink): Consumes log data streams in real time, applies data cleansing (denoising, format standardization), window aggregation (real-time unique visitor (UV) counts), and event alerting (abnormal traffic detection), then writes results to HBase via the HBase API.

  • Batch processing (Spark): Runs periodic batch jobs on relational database data, performs ETL operations such as user tag calculation and data deduplication, and writes the output to HBase.
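The real-time UV count in the Flink path is a per-window deduplication of user ids. Here is that logic in plain Python on hypothetical events; Flink would keep the per-window user sets as windowed state rather than in memory like this.

```python
from collections import defaultdict

def window_uv(events, window_seconds=60):
    """Count distinct users per tumbling window."""
    seen = defaultdict(set)
    for e in events:
        window_start = e["ts"] - e["ts"] % window_seconds
        seen[window_start].add(e["user"])
    return {w: len(users) for w, users in seen.items()}

events = [{"ts": 1, "user": "u1"}, {"ts": 2, "user": "u1"},
          {"ts": 3, "user": "u2"}, {"ts": 61, "user": "u1"}]
window_uv(events)  # {0: 2, 60: 1}
```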

Step 2: Store data at scale

Two storage layers handle different access patterns.

  • OSS-HDFS: Stores raw logs and HBase HFiles persistently. JindoCache reduces OSS-HDFS read and write latency.

  • HBase cluster: Handles real-time writes (user behavior records) and high-frequency point queries (order status lookups).

Step 3: Query user behavior

Business teams run Phoenix SQL queries against HBase to extract behavioral insights for precision marketing — for example, finding users who purchased a product category and clicked related ads in the past seven days.
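The selection just described is a multi-table intersection that Phoenix would express as a single SQL join. As a rough plain-Python equivalent, with hypothetical data and field names:

```python
# Hypothetical behavior records; "day" is days since a reference date.
purchases = [{"user": "u1", "category": "shoes", "day": 3},
             {"user": "u2", "category": "books", "day": 5}]
ad_clicks = [{"user": "u1", "category": "shoes", "day": 6},
             {"user": "u3", "category": "shoes", "day": 2}]

def target_users(purchases, clicks, category, since_day):
    """Users who both purchased in a category and clicked its ads recently."""
    bought = {p["user"] for p in purchases
              if p["category"] == category and p["day"] >= since_day}
    clicked = {c["user"] for c in clicks
               if c["category"] == category and c["day"] >= since_day}
    return bought & clicked

target_users(purchases, ad_clicks, "shoes", since_day=1)  # {"u1"}
```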