This topic describes the architecture of E-MapReduce (EMR).

The following figure shows the architecture of EMR.

[Figure: EMR architecture]

EMR consists of four types of services:
  • Open source services
    Apache big data services, such as Hadoop, Hive, and HBase, are integrated into EMR. The versions of these open source services are updated together with EMR versions. For more information, see release notes in Overview.
    Notice: You cannot update the version of a service in an existing EMR cluster.
  • Open source services enhanced by EMR
    EMR enhances the performance and features of some open source services. Examples:
    • Spark Streaming SQL is added to Spark. This significantly improves the performance of Spark. For more information, see Common keywords.
    • The Z-ordering and Data Skipping features are added to Delta Lake. For more information, see Overview.
  • Self-developed services of EMR
    EMR provides the following self-developed services, which ensure that open source components and services can better run on the Alibaba Cloud infrastructure:
    • Shuffle Service is an extended component of EMR. It is used to optimize the shuffle operations of computing engines. For more information, see ESS overview.
    • SmartData optimizes storage, caching, and computing for various EMR computing engines in a centralized manner and extends storage features. For more information, see SmartData.
  • Alibaba Cloud services
    EMR connects to both open source big data ecosystems and the Alibaba Cloud ecosystem:
    • You can deploy EMR clusters on Alibaba Cloud Elastic Compute Service (ECS) instances or Container Service for Kubernetes (ACK) clusters.
    • You can store data in Alibaba Cloud Object Storage Service (OSS).
    • You can use Machine Learning Platform for AI (PAI) in an EMR Data Science cluster.
    • EMR is integrated into DataWorks. You can use EMR as a job computing engine or data storage engine in DataWorks.
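
The Z-ordering and Data Skipping features mentioned above for Delta Lake can be easier to picture with a small example. The following Python code is a minimal sketch of the general technique only, not of the EMR or Delta Lake implementation. All function names, the two-column integer rows, and the in-memory "files" are assumptions made for illustration. Rows are sorted by a Z-order value that interleaves the bits of two columns, split into files, and each file records per-column minimum and maximum values. A query then reads only the files whose value ranges overlap the filter.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z


def build_files(rows, rows_per_file):
    """Sort (x, y) rows by Z-order value, split them into hypothetical
    files, and record min/max statistics for each column per file."""
    ordered = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))
    files = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        xs = [r[0] for r in chunk]
        ys = [r[1] for r in chunk]
        files.append({
            "rows": chunk,
            "min_x": min(xs), "max_x": max(xs),
            "min_y": min(ys), "max_y": max(ys),
        })
    return files


def files_to_scan(files, x_range, y_range):
    """Data skipping: keep only files whose min/max statistics overlap
    the inclusive query ranges on x and y."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [
        f for f in files
        if f["max_x"] >= x_lo and f["min_x"] <= x_hi
        and f["max_y"] >= y_lo and f["min_y"] <= y_hi
    ]


if __name__ == "__main__":
    # A 4 x 4 grid of (x, y) rows stored in files of 4 rows each.
    rows = [(x, y) for x in range(4) for y in range(4)]
    files = build_files(rows, rows_per_file=4)
    hits = files_to_scan(files, x_range=(0, 1), y_range=(0, 1))
    print(f"files scanned: {len(hits)} of {len(files)}")
```

Because rows that are close in both columns receive nearby Z-order values, each file covers a compact range in both columns, so a filter on either column can skip files. If the rows were sorted by one column only, a filter on the other column would typically have to read every file.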