Alibaba Cloud E-MapReduce (EMR) provides four predefined business scenarios (cluster types): Data Lake, Data Analytics, Real-time Data Streaming, and Data Service. If you want to deploy services flexibly based on your business requirements, you can instead create a custom cluster and build a big data platform that fits your workloads. This topic describes the differences among the business scenarios so that you can select the one that matches your requirements.

Business scenario selection

Business scenario (cluster type)	Supported services	Core capabilities	Scenarios
Data Lake (DataLake cluster)	Computing: Spark, Hive, Tez, Trino, Kyuubi, and Presto; Data storage: Hadoop Distributed File System (HDFS), OSS-HDFS, Celeborn, and JindoCache; Data integration: Flume and Sqoop; Data lake formats: Hudi, Iceberg, and Paimon; Resource management: YARN; Distributed coordination: ZooKeeper; Security and permissions: OpenLDAP, Ranger, DLF-Auth, and Knox	Unified storage; Multiple compatible compute engines; Support for various data lake formats	Offline extract, transform, and load (ETL), such as ETL of data warehouses, and interactive queries, such as ad hoc analysis
Data Analytics (OLAP cluster)	Online analytical processing (OLAP) analysis: StarRocks, ClickHouse, and Doris; Distributed coordination: ZooKeeper	Subsecond-level query response; Column-oriented storage optimization; Federated queries	Complex aggregation analysis, such as user profile analysis, user group identification, and business intelligence (BI)
Real-time Data Streaming (Dataflow cluster)	Stream computing: Flink; Data storage: HDFS and OSS-HDFS; Data lake format: Paimon; Resource management: YARN; Distributed coordination: ZooKeeper; Security and permissions: OpenLDAP and Knox	Unified batch and stream processing; Low latency; State consistency guarantee	Real-time ETL, such as ETL of streaming warehouses
Data Service (DataServing cluster)	Computing: Phoenix; Column-oriented storage: HBase; Data storage: HDFS, OSS-HDFS, and JindoCache; Distributed coordination: ZooKeeper; Security and permissions: OpenLDAP, Ranger, and Knox	Millisecond-level point queries; SQL interface optimization; Read/write splitting	High-concurrency queries, such as behavior analysis and precision marketing
Custom Cluster	Computing: Spark, Hive, Tez, Trino, Kyuubi, Presto, Flink, and Phoenix; OLAP analysis: StarRocks; Column-oriented storage: HBase; Data storage: HDFS, OSS-HDFS, Celeborn, and JindoCache; Data integration: Flume and Sqoop; Data lake formats: Hudi, Iceberg, and Paimon; Resource management: YARN; Distributed coordination: ZooKeeper; Security and permissions: OpenLDAP, Ranger, DLF-Auth, and Knox	Flexible deployment of services, such as Spark, Flink, and HBase; Support for mixed workloads, such as real-time processing, offline processing, and data analysis. Note: In mixed workload scenarios, offline and real-time workloads may affect each other. In this case, we recommend that you create different types of clusters based on your business requirements.	Offline ETL, real-time ETL, complex aggregation analysis, and high-concurrency queries
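
The selection logic in the table can be sketched as a small lookup: given the services a workload needs, list the cluster types whose supported services cover all of them. The following Python sketch is illustrative only. The service sets are transcribed from the table, but the dictionary keys and the function name candidate_cluster_types are hypothetical labels chosen for this example, not EMR API parameter values, and the actual services available depend on the EMR version.

```python
# Service sets transcribed from the table above. The keys are illustrative
# labels (assumption), not necessarily the values that the EMR API expects.
SUPPORTED_SERVICES = {
    "DataLake": {
        "Spark", "Hive", "Tez", "Trino", "Kyuubi", "Presto",
        "HDFS", "OSS-HDFS", "Celeborn", "JindoCache",
        "Flume", "Sqoop", "Hudi", "Iceberg", "Paimon",
        "YARN", "ZooKeeper", "OpenLDAP", "Ranger", "DLF-Auth", "Knox",
    },
    "OLAP": {"StarRocks", "ClickHouse", "Doris", "ZooKeeper"},
    "Dataflow": {
        "Flink", "HDFS", "OSS-HDFS", "Paimon",
        "YARN", "ZooKeeper", "OpenLDAP", "Knox",
    },
    "DataServing": {
        "Phoenix", "HBase", "HDFS", "OSS-HDFS", "JindoCache",
        "ZooKeeper", "OpenLDAP", "Ranger", "Knox",
    },
    "Custom": {
        "Spark", "Hive", "Tez", "Trino", "Kyuubi", "Presto", "Flink", "Phoenix",
        "StarRocks", "HBase", "HDFS", "OSS-HDFS", "Celeborn", "JindoCache",
        "Flume", "Sqoop", "Hudi", "Iceberg", "Paimon",
        "YARN", "ZooKeeper", "OpenLDAP", "Ranger", "DLF-Auth", "Knox",
    },
}


def candidate_cluster_types(required):
    """Return the cluster types (in table order) that support every required service."""
    needed = set(required)
    return [
        name
        for name, services in SUPPORTED_SERVICES.items()
        if needed <= services  # subset test: all required services are available
    ]
```

For example, a workload that needs both Spark and Flink matches only the custom cluster in this sketch, which is the mixed workload case that the note in the table cautions about. A workload that needs only Spark and Hive matches both the DataLake cluster and the custom cluster, in which case the predefined scenario is usually the simpler choice.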

Note

The versions of services that can be deployed in an EMR cluster vary based on the EMR version. For more information, see Release versions. We recommend that you use the latest EMR version to experience more features, improve performance, and ensure security.
If a custom cluster cannot fully meet your business requirements, you can deploy the required services on your own after you evaluate the compatibility and security of the services.

Subsequent cluster planning

After you select a business scenario for your cluster, you can continue to plan the storage architecture, metadata service, hardware specifications, and network specifications. For more information, see Select a region and plan storage configurations, Select a metadata service, and Plan hardware and network configurations.