E-MapReduce (EMR) Serverless StarRocks is a fully managed StarRocks service. Its architecture spans three layers — storage, StarRocks instances, and product capabilities — and supports two storage architectures: compute-storage integration for the lowest query latency, and compute-storage separation for elastic scaling and lower storage costs.
EMR Serverless StarRocks architecture
The architecture of EMR Serverless StarRocks consists of three layers from the bottom up:
Storage layer — stores your data in one of three modes:
Compute-storage integration: internal tables stored in StarRocks Table format on cloud disks or local disks.
Compute-storage separation: internal tables stored in StarRocks Table format in Object Storage Service (OSS) or Hadoop Distributed File System (HDFS).
Data lake analysis: external tables read Hive data or other data stored in Data Lake Table formats from OSS or HDFS.
StarRocks instances — all instances are fully hosted in the cloud, so you do not need to handle operations and maintenance (O&M). Each instance consists of frontend (FE) nodes and backend (BE) nodes or compute nodes (CNs). Warehouses provide flexible resource configuration and isolation, and elastic scaling keeps resource usage cost-efficient. A built-in cache mechanism accelerates queries in compute-storage separation and data lake analysis scenarios.
Product capabilities — built-in tooling so you can focus on your business applications:
Instance management: resource and configuration management, alerting, automatic health reports, and automatic upgrades are all handled for you — no manual O&M required.
Data management: an SQL editor for visualized code development, task import, slow query analysis, data audit, metadata management, and permission configuration — available out of the box.
These capabilities support use cases such as operations analysis, user personas, self-service reports, order analysis, and user report generation.
StarRocks system architecture
StarRocks consists of only FE nodes and BE nodes or CNs, which simplifies deployment and maintenance. Nodes can be added online. Metadata and business data are replicated to prevent single points of failure (SPOFs). StarRocks supports the MySQL protocol and standard SQL syntax, so you can connect with any MySQL client.
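Because StarRocks speaks the MySQL protocol, any MySQL-compatible client or driver can issue standard SQL against an FE node. The sketch below is illustrative only; the `orders` table and its columns are hypothetical.

```sql
-- Illustrative standard SQL issued through any MySQL-compatible client
-- connected to an FE node; the `orders` table and its columns are hypothetical.
SELECT customer_id,
       COUNT(*)    AS order_cnt,
       SUM(amount) AS total_amount
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 10;
```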
Architecture choices
StarRocks supports two storage architectures. The right choice depends on your workload:
| | Compute-storage integration | Compute-storage separation |
|---|---|---|
| Best for | Real-time queries requiring the lowest latency | Workloads with variable compute demand or large data volumes |
| Data storage | Local disks on BE nodes (multi-replica) | Remote object storage (OSS or HDFS) |
| Scaling | Requires data rebalancing | Add or remove CNs in seconds, no rebalancing |
| Storage cost | Higher (triplicate mechanism required) | Lower (cost-effective remote storage) |
| StarRocks version | All versions (the only option earlier than 3.0) | 3.0 and later |
The following diagram shows the evolution from compute-storage integration to compute-storage separation.

The diagram and specific information in this section are sourced from Architecture in the open-source StarRocks documentation.
Compute-storage integration
Local storage provides the lowest query latency, making compute-storage integration the right choice for real-time analytics that demand peak query performance.
In this architecture, StarRocks is a Massively Parallel Processing (MPP) database. BE nodes handle both data storage and SQL execution. During a query, data is read directly from local storage on the same node, eliminating network transmission. Multi-replica storage enhances high-concurrency query throughput and data reliability.
The architecture consists of two node types: FE and BE.
FE nodes
FE nodes manage metadata, handle client connections, and coordinate query planning and scheduling. Each FE node holds a complete replica of metadata in memory to ensure consistency.
| Role | Metadata management | Leader election |
|---|---|---|
| Leader | Read and write | Automatically elected from Follower nodes |
| Follower | Read | Participates in leader election |
| Observer | Read | Does not participate in leader election |
Leader election requires more than half of the Follower FE nodes in a cluster to be active. The Leader uses Berkeley DB Java Edition (BDB JE) to synchronize metadata changes to Follower and Observer FE nodes, which asynchronously replay the metadata logs of the Leader. If the Leader fails, a new Leader is automatically elected from the remaining Followers. Observer nodes do not participate in elections; they only increase query concurrency without adding election overhead.
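To see which role each FE node currently holds, you can run `SHOW FRONTENDS` from any connected client; the `Role` column reports LEADER, FOLLOWER, or OBSERVER. Exact output columns may vary by StarRocks version.

```sql
-- List FE nodes and their roles (LEADER / FOLLOWER / OBSERVER);
-- output columns may vary by StarRocks version.
SHOW FRONTENDS;
```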
BE nodes
BE nodes handle data storage and SQL execution using local storage with a multi-replica mechanism.
Data storage: FE nodes distribute data to BE nodes based on predefined policies. BE nodes convert incoming data into a storable format and generate indexes.
SQL execution: FE nodes parse SQL queries into logical execution plans, then transform them into physical plans distributed to the relevant BE nodes. Because the data is local, no cross-network data transfer is needed.
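The table definition below is a minimal sketch of how these distribution policies are declared for an internal table in compute-storage integration. The table name, columns, partition, and bucket count are illustrative; `replication_num = 3` reflects the triplicate mechanism used when data is distributed across BE nodes.

```sql
-- A minimal, illustrative internal table for compute-storage integration.
-- FE nodes distribute buckets across BE nodes according to DISTRIBUTED BY,
-- and each tablet is stored in three replicas (replication_num = 3).
CREATE TABLE IF NOT EXISTS user_events (
    event_date DATE        NOT NULL,
    user_id    BIGINT      NOT NULL,
    event_type VARCHAR(64),
    cost       DECIMAL(10, 2)
)
DUPLICATE KEY (event_date, user_id)
PARTITION BY RANGE (event_date) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01")
)
DISTRIBUTED BY HASH (user_id) BUCKETS 8
PROPERTIES (
    "replication_num" = "3"
);
```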
Limitations of compute-storage integration:
High storage costs: data reliability relies on a triplicate mechanism, which multiplies storage usage; because storage and compute share the same BE nodes, the two must also be scaled together even when only one is needed.
Maintenance overhead: multi-replica consistency requirements make the architecture harder to manage at scale.
Limited scalability: scaling operations trigger data rebalancing, which can degrade performance during the process.
Compute-storage separation
Object storage and HDFS provide virtually unlimited capacity, cost-effective pricing, and the ability to scale compute independently — making compute-storage separation the right choice for large data volumes or variable workloads.
In this architecture, data is persistently stored in OSS or HDFS. CN nodes are stateless: they cache frequently accessed (hot) data on local disks and handle data import, query processing, and cache management. Because storage and compute are fully decoupled, you can add or remove CNs dynamically within seconds without rebalancing data.
The system consists of FE nodes and CN nodes: FE nodes retain the same responsibilities as in compute-storage integration, while stateless CNs take the place of BE nodes. The key difference is that you must specify a remote backend storage service when you create the instance.
Storage backends
Two backend storage options are supported:
Alibaba Cloud OSS
HDFS, deployed in a self-managed Hadoop cluster or an Alibaba Cloud EMR DataLake cluster
The data file format is the same as in compute-storage integration, and all indexing technologies are supported. Tablet metadata is redesigned to work efficiently with object storage.
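In EMR Serverless StarRocks you select the backend storage when you create the instance. For reference, open-source StarRocks shared-data clusters express the same binding as a storage volume; the sketch below follows that syntax, with a placeholder bucket, endpoint, and credentials (OSS is accessed through its S3-compatible interface).

```sql
-- Open-source StarRocks shared-data syntax, shown for reference only; in
-- EMR Serverless StarRocks the backend storage is configured at instance creation.
-- Bucket, endpoint, and credentials are placeholders.
CREATE STORAGE VOLUME IF NOT EXISTS oss_volume
TYPE = S3
LOCATIONS = ("s3://example-bucket/starrocks")
PROPERTIES (
    "aws.s3.endpoint"   = "oss-cn-hangzhou.aliyuncs.com",
    "aws.s3.region"     = "cn-hangzhou",
    "aws.s3.access_key" = "<access_key_id>",
    "aws.s3.secret_key" = "<access_key_secret>"
);

SET oss_volume AS DEFAULT STORAGE VOLUME;
```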
Cache
To maintain query performance on remote data, StarRocks uses a three-tier cache:
| Tier | Storage location | Data type |
|---|---|---|
| Hot | Memory | Frequently accessed data |
| Warm | Local disk on CN | Moderately accessed data |
| Cold | Remote object storage service | Infrequently accessed data |
When a query hits cached data, performance is comparable to compute-storage integration. For cold data, StarRocks fetches it from the backend object storage service, caches it locally, and applies read-ahead and parallel scanning to reduce subsequent remote access. Define rules for hot and cold data classification to match your access patterns and performance requirements.
To enable caching, turn on the cache feature when you create a table. With caching enabled, data is written simultaneously to local disk and the backend object storage service. During queries, CNs read from local disk first; on a cache miss, they fetch from the backend and cache the result locally.
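As a sketch, the statement below enables the cache on an illustrative table in a compute-storage separation instance. Property names follow the open-source StarRocks shared-data documentation and may vary by version; the table name and columns are hypothetical.

```sql
-- Illustrative table for a compute-storage separation instance.
-- Cache-related property names follow open-source StarRocks shared-data
-- documentation and may vary by version.
CREATE TABLE IF NOT EXISTS user_events (
    event_date DATE        NOT NULL,
    user_id    BIGINT      NOT NULL,
    event_type VARCHAR(64)
)
DUPLICATE KEY (event_date, user_id)
DISTRIBUTED BY HASH (user_id) BUCKETS 8
PROPERTIES (
    "datacache.enable"        = "true",   -- cache hot data on CN local disks
    "enable_async_write_back" = "false"   -- write to the local cache and backend storage synchronously
);
```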