
E-MapReduce:Component overview

Last Updated:Apr 27, 2025

E-MapReduce (EMR) provides open source components and self-developed components at the data development, compute engine, data service, resource management, data storage, and data integration layers. You can select and configure the components based on your business requirements.

Note

If a component that you want to use is not available when you create a cluster, or an open source component that you want to use is available only to existing users, you can manually install and manage the component based on your business requirements.

EMR integrates Alibaba Cloud services and open source components, provides self-developed components, and offers a cluster management feature. The following sections describe the use scenarios of EMR and the big data components that EMR provides at each layer of its architecture.

Data development

The services and components at the data development layer provide visualization tools to manage code, collect data, cleanse data, build data models, analyze data, and schedule jobs. This helps enterprises improve the efficiency of managing and utilizing data assets.

We recommend that you use Alibaba Cloud DataWorks for data development in EMR. The following table describes the service.

| Service name | Description | References |
| --- | --- | --- |
| DataWorks | DataWorks provides the data integration, data development, data governance, data quality management, data O&M, and security control features. You can use DataWorks in scenarios where complex data integration and data governance are required. | |

You can also use the open source components Hue and Superset at the data development layer. The following table describes these components.

| Type | Component name | Description | References |
| --- | --- | --- | --- |
| Open source | Hue | Hue is available only to existing users. Hue provides a web UI that allows you to interact with the Apache Hadoop ecosystem. | Hue |
| Open source | Superset | Superset is available only to existing users. Superset is a data visualization platform that provides features for you to visualize data and configure dashboards. | Superset |

Compute engine

EMR supports mainstream compute engines for batch processing, interactive analysis, stream computing, and machine learning. You can use these engines to transform the structure and logic of data to meet the requirements of different big data scenarios.

| Type | Component name | Description | References |
| --- | --- | --- | --- |
| Open source | Spark | Spark is a fast and general-purpose big data processing engine that provides in-memory data processing capabilities and supports multiple data processing modes, such as batch processing, real-time processing, machine learning, and graph computing. | |
| Open source | Hive | Hive is a Hadoop-based data warehouse tool that allows you to use an SQL-like language (HiveQL) to store, query, and analyze large-scale data in Hadoop. | |
| Open source | StarRocks | StarRocks is a next-generation, high-speed data analytics engine built on the massively parallel processing (MPP) framework. StarRocks is suitable for scenarios such as online analytical processing (OLAP), high-concurrency queries, and real-time data analysis. | |
| Open source | Doris | Apache Doris is a high-performance, real-time analytical database that you can use in scenarios such as report analysis, ad hoc queries, and acceleration of federated queries across data lakes. | |
| Open source | ClickHouse | ClickHouse is an open source column-oriented database management system (DBMS) that is used for efficient OLAP analysis and fast queries of large amounts of data. | |
| Open source | Trino | Trino, formerly called PrestoSQL, is an open source distributed SQL query engine that is suitable for interactive analytic queries. | |
| Open source | Flink | Flink is a stream processing engine that is used to process large-scale, real-time data streams. | |
| Open source | Presto | Presto, formerly called PrestoDB, is a flexible and scalable distributed SQL query engine. You can use Presto to perform interactive analytic queries. | |
| Open source | Tez | Apache Tez is a distributed big data processing framework based on directed acyclic graphs (DAGs). You can replace MapReduce with Tez to improve the performance and efficiency of queries and batch tasks. | Tez |
| Open source | Phoenix | Phoenix is an SQL intermediate layer built on top of HBase. Phoenix allows you to execute standard SQL statements to query and manage HBase data. | Phoenix |
| Open source | Impala | Impala is available only to existing users. Impala provides high-performance, low-latency SQL queries for data stored in Apache Hadoop. | |
| Open source | Kudu | Kudu is available only to existing users. Kudu is a distributed, scalable, column-oriented data storage system that supports low-latency random data reads and writes and delivers efficient analytics on data. | |
| Open source | Druid | Druid is available only to existing users. Druid is a distributed, real-time, in-memory analytics system that delivers fast, interactive queries and analysis on large-scale datasets. | Druid |
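Engines such as Spark and Hive scale the same transform-and-aggregate pattern out across a cluster. As a rough illustration of that batch-processing model, here is a word count in plain Python; this is not the API of any engine listed above, only a single-machine sketch of the map-and-reduce idea:

```python
from collections import Counter

# Hypothetical input records. In an engine such as Spark, these would be
# partitioned across the cluster instead of held in one in-memory list.
lines = [
    "to be or not to be",
    "that is the question",
]

# "Map" step: split each record into words.
words = [word for line in lines for word in line.split()]

# "Reduce" step: aggregate counts per key.
counts = Counter(words)

print(counts["to"])  # the word "to" appears twice
```

A distributed engine performs the same two steps, but shuffles the mapped records between nodes so that all occurrences of a key land on the same reducer.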

Data serving

The components at the data serving layer provide features such as data encryption, access control, data query, data access, and API operations to improve data security and the efficiency of data operations and analysis in big data environments.

| Type | Component name | Description | References |
| --- | --- | --- | --- |
| Open source | Ranger | Ranger is a centralized security management framework that is mainly used for permission management and auditing within the Hadoop ecosystem. | |
| Open source | Kerberos | Kerberos is an identity authentication protocol based on symmetric-key cryptography. Kerberos provides identity authentication for other services and supports single sign-on (SSO). | |
| Open source | OpenLDAP | OpenLDAP is the open source implementation of the Lightweight Directory Access Protocol (LDAP). OpenLDAP is used to manage and store information about users and resources and to authenticate identities. | OpenLDAP |
| Open source | Kyuubi | Kyuubi is a distributed, multi-tenant gateway that simplifies data analysis and provides query services, such as SQL queries, for data lake query engines. | |
| Open source | ZooKeeper | ZooKeeper is an efficient distributed coordination service. ZooKeeper provides features such as distributed configuration, synchronization, and naming for distributed applications, and offers consistent, high-performance, and reliable cluster management. | |
| Open source | Knox | Knox is a REST API gateway that simplifies secure access to services within the Hadoop ecosystem and provides centralized identity authentication and access control. | Knox |
| Open source | Livy | Livy is a service that interacts with Spark by using a RESTful API or a remote procedure call (RPC) client library. | Livy |
| Open source | Kafka Manager | Kafka Manager is available only to existing users. Kafka Manager is a cluster management tool designed for Kafka. It provides a web UI that allows you to manage and monitor Kafka clusters. | Kafka Manager |
| Self-developed | DLF-Auth | DLF-Auth is provided by Data Lake Formation (DLF). You can use DLF-Auth to implement fine-grained permission management on databases, tables, columns, and functions, which enables centralized permission management on data lakes. | DLF-Auth |
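To make the Livy entry concrete: Livy exposes Spark job submission over REST. The sketch below builds the JSON body for Livy's `POST /batches` endpoint, where `file` is the required field; the host, port, JAR path, and class name are placeholders, not values from this topic:

```python
import json

# Hypothetical endpoint: replace the host with your cluster's Livy address.
# Port 8998 is Livy's default.
livy_url = "http://emr-header-1:8998/batches"

# Livy batch submissions take a JSON body; "file" points at the application
# to run, and "className"/"args" are optional.
payload = {
    "file": "oss://my-bucket/jobs/spark-example.jar",  # placeholder path
    "className": "org.example.SparkPi",                # placeholder class
    "args": ["100"],
}
body = json.dumps(payload)
print(body)
```

You would send this body with an HTTP POST (for example, via `urllib.request`) and then poll the batch ID that Livy returns to track the job's state.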

Resource management

The components at the resource management layer provide efficient resource scheduling and management capabilities. You can use the components to implement automated task scheduling, intelligent resource allocation, and elastic cluster scaling. This improves the efficiency and reliability of big data processing.

| Type | Component name | Description | References |
| --- | --- | --- | --- |
| Open source | YARN | YARN is a resource management system of Hadoop. You can use YARN to schedule and manage the resources of a cluster to allow different types of distributed computing tasks to efficiently run on the cluster. | |
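YARN's scheduling behavior is driven by configuration. As an illustrative `capacity-scheduler.xml` fragment (not an EMR-specific recommendation; the queue names `prod` and `dev` are hypothetical), two queues could split cluster capacity like this:

```xml
<!-- Hypothetical queues "prod" and "dev" splitting capacity 70/30. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
```

Jobs submitted to each queue are then scheduled against that queue's share of cluster resources.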

Data storage

The components at the data storage layer support distributed storage of structured and unstructured data. You can select an appropriate storage method based on the requirements of specific compute engines.

| Type | Component name | Description | References |
| --- | --- | --- | --- |
| Self-developed | OSS-HDFS | OSS-HDFS is an object storage solution that is compatible with Hadoop Distributed File System (HDFS) APIs. OSS-HDFS enables big data computing tasks to directly access data stored in Alibaba Cloud Object Storage Service (OSS) based on a standard HDFS protocol. | |
| Self-developed | JindoCache | JindoCache is a distributed cache solution that is used to accelerate large-scale data access. JindoCache caches data blocks in memory to improve data read performance and reduce pressure on the underlying storage system. | |
| Self-developed | ESS | Remote Shuffle Service (ESS) is available only to existing users. The first time you use EMR, we recommend that you use Celeborn. ESS is used to optimize the read and write performance of shuffle operations. | ESS |
| Self-developed | JindoData | JindoData is available only to existing users. The first time you use EMR, we recommend that you use JindoCache. JindoData is a self-developed storage acceleration suite for data lake systems. JindoData provides end-to-end solutions for the data lake systems of Alibaba Cloud and other vendors in the big data and AI industries. | JindoData |
| Self-developed | SmartData | SmartData is available only to existing users. The first time you use EMR, we recommend that you use OSS-HDFS. SmartData is a self-developed component of EMR that optimizes storage, caching, and computing for various compute engines in EMR in a centralized manner and extends multiple storage features in the compute engines. SmartData is used in data access, data governance, and data security scenarios. | SmartData overview |
| Open source | Paimon | Paimon is a data lake platform that allows you to process data in streaming and batch modes. Paimon supports high-throughput data writing and low-latency data queries. | |
| Open source | Hudi | Hudi is a data lake framework that allows you to update and delete data in Hadoop-compatible file systems. Hudi also allows you to consume changed data. | |
| Open source | Iceberg | Iceberg is an open data lake table format that delivers high-performance reads and writes and provides a metadata management feature. | |
| Open source | Delta Lake | Delta Lake is an open source data storage layer that supports atomicity, consistency, isolation, durability (ACID) transactions, scalable metadata processing, and centralized streaming and batch processing. | |
| Open source | HDFS | HDFS is a distributed file system used to store large datasets. HDFS features high fault tolerance and high throughput, and redundantly stores data across multiple nodes in a cluster. | |
| Open source | HBase | HBase is a distributed, column-oriented, open source database that is built on top of the Hadoop file system. HBase delivers low-latency random reads and writes and supports high-reliability storage for large datasets. | |
| Open source | Celeborn | Celeborn is a service that processes intermediate data. Celeborn can improve the stability, flexibility, and performance of big data compute engines. | Celeborn |
| Open source | HBase-HDFS | HBase-HDFS is built on top of HDFS and is used to store the write-ahead log (WAL) files of HBase in scenarios in which storage and computing are decoupled. | HBASE-HDFS |
| Open source | Alluxio | Alluxio is available only to existing users. Alluxio is a cloud-oriented, open source data orchestration technology for data analytics and AI. Alluxio provides unified data access across different underlying storage systems. | Alluxio |
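HDFS's redundancy comes from block replication, which is configurable per cluster. A minimal `hdfs-site.xml` sketch follows; the value shown is the common upstream default, not an EMR-specific recommendation:

```xml
<!-- Each HDFS block is stored on this many DataNodes; 3 is the usual default.
     A node failure then costs at most one of three copies of any block. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```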

Data integration

The components at the data integration layer provide capabilities that allow you to transmit data in batches, process messages in streaming mode, and collect distributed logs. This helps you improve data transmission efficiency and data collection reliability.

| Type | Component name | Description | References |
| --- | --- | --- | --- |
| Open source | Flume | Flume is a distributed, reliable, and highly available system. You can use Flume to collect, aggregate, and move large amounts of log data and store the data in a centralized manner. | |
| Open source | Sqoop | Sqoop is a tool used to efficiently transmit data between Hadoop and relational databases. You can use Sqoop to import and export large amounts of data. | |
| Open source | Kafka | Kafka is available only to existing users. Kafka is an open source distributed event streaming platform that features high throughput, low latency, and persistence. Kafka is widely used to process real-time data streams and build data pipelines. | |
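To illustrate how Flume moves log data: a Flume agent is defined in a properties file as a pipeline of sources, channels, and sinks. The sketch below tails a local log file into HDFS; the agent name `a1`, the component names, and the paths are hypothetical placeholders:

```properties
# Hypothetical agent "a1": tail a local log file and write events to HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: run "tail -F" on a placeholder log file.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory

# Sink: write events to a placeholder HDFS path.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs:///flume/events
a1.sinks.k1.channel = c1
```

The channel decouples collection from delivery, which is what gives Flume its reliability under bursty log traffic.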

References

  • For information about the architecture of EMR, see the Architecture section in the "What is EMR on ECS" topic.

  • For information about the services and service versions supported by EMR clusters of different versions, see Services supported by EMR clusters of different versions.

  • For information about the use scenarios and services of different types of EMR clusters, see the Big data scenarios section in the "Select configurations" topic.