E-MapReduce (EMR) provides four predefined cluster types — Data Lake, Data Analytics, Real-time Data Streaming, and Data Service — each pre-configured for a specific workload. If none of these fits your requirements, use a Custom Cluster to deploy any combination of services.
Choose a cluster type
Match your workload to the cluster type using the table below.
| Cluster type | Included services | Core capabilities | Typical workloads |
|---|---|---|---|
| Data Lake (DataLake cluster) | Computing: Spark, Hive, Tez, Trino, Kyuubi, Presto; Storage: Hadoop Distributed File System (HDFS), OSS-HDFS, Celeborn, JindoCache; Data integration: Flume, Sqoop; Data lake formats: Hudi, Iceberg, Paimon; Resource management: YARN; Coordination: ZooKeeper; Security: OpenLDAP, Ranger, DLF-Auth, Knox | Unified storage, multiple compatible compute engines, support for Hudi/Iceberg/Paimon formats | Offline extract, transform, and load (ETL): data warehouse ETL, ad hoc analysis |
| Data Analytics (OLAP cluster) | Online Analytical Processing (OLAP): StarRocks, ClickHouse, Doris; Coordination: ZooKeeper | Subsecond-level query response, column-oriented storage optimization, federated queries | Complex aggregation analysis: user profile analysis, user group identification, business intelligence (BI) |
| Real-time Data Streaming (Dataflow cluster) | Stream computing: Flink; Storage: HDFS, OSS-HDFS; Data lake format: Paimon; Resource management: YARN; Coordination: ZooKeeper; Security: OpenLDAP, Knox | Unified batch and stream processing, low latency, state consistency guarantee | Real-time ETL: streaming warehouse ETL |
| Data Service (DataServing cluster) | Computing: Phoenix; Column-oriented storage: HBase; Storage: HDFS, OSS-HDFS, JindoCache; Coordination: ZooKeeper; Security: OpenLDAP, Ranger, Knox | Millisecond-level point queries, SQL interface optimization, read/write splitting | High-concurrency queries: behavior analysis, precision marketing |
| Custom Cluster | Computing: Spark, Hive, Tez, Trino, Kyuubi, Presto, Flink, Phoenix; OLAP: StarRocks; Column-oriented storage: HBase; Storage: HDFS, OSS-HDFS, Celeborn, JindoCache; Data integration: Flume, Sqoop; Data lake formats: Hudi, Iceberg, Paimon; Resource management: YARN; Coordination: ZooKeeper; Security: OpenLDAP, Ranger, DLF-Auth, Knox | Flexible service deployment, mixed workloads (real-time, offline, and analytical) | Offline ETL, real-time ETL, complex aggregation analysis, and high-concurrency queries |
The service versions available in a cluster depend on the EMR version. Use the latest EMR version to get the broadest feature set, better performance, and security improvements. For a full list of available versions, see Release versions.
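The table lookup can also be expressed as a set-coverage check. The following is a minimal sketch, not an EMR API: the names `PREDEFINED`, `CUSTOM`, and `candidate_cluster_types` are hypothetical, and the service sets are transcribed from the table above, so they may not match every EMR version.

```python
# Service sets transcribed from the table above (illustrative only; the
# actual services depend on the EMR version).
DATA_LAKE = {
    "Spark", "Hive", "Tez", "Trino", "Kyuubi", "Presto",
    "HDFS", "OSS-HDFS", "Celeborn", "JindoCache",
    "Flume", "Sqoop", "Hudi", "Iceberg", "Paimon",
    "YARN", "ZooKeeper", "OpenLDAP", "Ranger", "DLF-Auth", "Knox",
}

# Hypothetical lookup table: predefined cluster type -> included services.
PREDEFINED = {
    "Data Lake": DATA_LAKE,
    "Data Analytics": {"StarRocks", "ClickHouse", "Doris", "ZooKeeper"},
    "Real-time Data Streaming": {
        "Flink", "HDFS", "OSS-HDFS", "Paimon",
        "YARN", "ZooKeeper", "OpenLDAP", "Knox",
    },
    "Data Service": {
        "Phoenix", "HBase", "HDFS", "OSS-HDFS", "JindoCache",
        "ZooKeeper", "OpenLDAP", "Ranger", "Knox",
    },
}

# Per the table, a Custom Cluster offers the Data Lake services plus these four.
CUSTOM = DATA_LAKE | {"Flink", "Phoenix", "StarRocks", "HBase"}


def candidate_cluster_types(required):
    """Return the cluster types whose listed services cover `required`.

    Predefined types are preferred. "Custom Cluster" is returned only when
    no predefined type covers every required service. An empty list means
    that even a Custom Cluster does not list all of the services.
    """
    required = set(required)
    matches = [name for name, services in PREDEFINED.items()
               if required <= services]
    if matches:
        return matches
    if required <= CUSTOM:
        return ["Custom Cluster"]
    return []


print(candidate_cluster_types({"Spark", "Hive"}))
print(candidate_cluster_types({"Spark", "Flink", "HBase"}))
```

Under these assumptions, the first call matches only the Data Lake type, while the second falls through to a Custom Cluster because no predefined type lists Spark, Flink, and HBase together.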
When to use a Custom Cluster
A Custom Cluster gives you full control over which services to deploy. Use it when your workload spans multiple cluster types — for example, running Spark, Flink, and HBase together on a single cluster.
Use a Custom Cluster if:
- Your workload combines offline ETL, real-time processing, and analytical queries
- No predefined cluster type covers all the services you need
Use separate dedicated clusters instead if:
- Your offline and real-time workloads have different latency or resource requirements; mixing them on one cluster can cause interference
If a Custom Cluster still cannot fully meet your requirements, deploy additional services manually after evaluating their compatibility and security.
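The choice between one Custom Cluster and separate dedicated clusters can be sketched as a small decision function. This is an illustration under stated assumptions: `WORKLOAD_TO_TYPE` and `recommend` are hypothetical names, the workload labels come from the "Typical workloads" column of the table, and the `conflicting_requirements` flag generalizes the offline versus real-time case described above to any pair of workloads.

```python
# Hypothetical mapping from a workload label to the predefined cluster type
# that the table lists for it.
WORKLOAD_TO_TYPE = {
    "offline ETL": "Data Lake",
    "real-time ETL": "Real-time Data Streaming",
    "complex aggregation analysis": "Data Analytics",
    "high-concurrency queries": "Data Service",
}


def recommend(workloads, conflicting_requirements=False):
    """Suggest cluster types for a set of workloads.

    - One workload type: use the matching predefined cluster.
    - Several types with conflicting latency or resource requirements:
      use one dedicated cluster per type to avoid interference.
    - Several types without such conflicts: use a single Custom Cluster.

    Unknown workload labels raise KeyError.
    """
    types = sorted({WORKLOAD_TO_TYPE[w] for w in workloads})
    if not types:
        raise ValueError("at least one workload is required")
    if len(types) == 1:
        return types
    if conflicting_requirements:
        return types
    return ["Custom Cluster"]


print(recommend(["offline ETL", "real-time ETL"]))
print(recommend(["offline ETL", "real-time ETL"], conflicting_requirements=True))
```

Whether two workloads actually conflict is a judgment the function cannot make; the flag only records the outcome of that evaluation.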
What's next
After selecting a cluster type, plan the remaining cluster configuration: