StarRocks Fully Managed OLAP on EMR Overview

E-MapReduce (EMR) Serverless StarRocks is a fully managed StarRocks service on Alibaba Cloud. You can create StarRocks instances and manage both the instances and their data from the EMR console, without configuring, operating, or scaling clusters yourself.

What is StarRocks

StarRocks is an analytic database built for fast, real-time multi-dimensional data analysis. It uses a Massively Parallel Processing (MPP) architecture with a vectorized execution engine, a cost-based optimizer (CBO), intelligent materialized views, and a columnar storage engine that supports real-time updates. StarRocks is compatible with the MySQL protocol, so MySQL clients and common BI tools can connect to it directly. It scales horizontally and is designed for high availability and high reliability.

StarRocks fits these analytical scenarios:

Real-time data warehouses — synchronize changes from transactional databases in seconds and query up-to-date data
Online analytical processing (OLAP) — run multi-dimensional reports, self-service dashboards, and ad hoc queries
Data lake analysis — query data in Apache Hive, Apache Iceberg, Apache Hudi, and Delta Lake without migrating it

Core capabilities

MPP framework

StarRocks splits every query into physical computing units that run in parallel on multiple machines, and each machine has its own dedicated CPU and memory. When you scale out the cluster, the performance of a single query scales with it.
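The split-and-merge pattern described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `partial_sum` and `parallel_sum` are hypothetical, threads stand in for separate machines, and the round-robin split stands in for the way a real engine distributes data.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(fragment):
    """One computing unit: aggregate only the rows in its own fragment."""
    return sum(fragment)


def parallel_sum(rows, workers):
    """Split rows into fragments, aggregate each one concurrently, then merge."""
    # Round-robin split (an assumption for this sketch, not how StarRocks
    # actually distributes data).
    fragments = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, fragments))
    # The final merge step combines the partial result of every unit.
    return sum(partials)


rows = list(range(1, 1001))
print(parallel_sum(rows, workers=4))  # 500500
```

Adding workers shrinks each fragment, which is the intuition behind single-query performance improving as the cluster scales out.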

Vectorized execution engine

The vectorized execution engine optimizes all operators, functions, scanning and filtering modules, and import and export modules at the CPU level. It uses single instruction, multiple data (SIMD) instructions to process more data per clock cycle — benchmarks on standard datasets show a 3–10x improvement in overall operator performance.

The engine also includes Operation on Encoded Data, which runs join, aggregation, and expression operators directly on encoded strings without decoding them first. This reduces SQL execution complexity and more than doubles query speed.
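To make the idea of operating on encoded data concrete, the following Python sketch dictionary-encodes a string column, counts rows per group on the integer codes, and decodes only the final result. It is a toy model, not the StarRocks implementation, and all function names are hypothetical.

```python
from collections import Counter


def dictionary_encode(values):
    """Map each distinct string to a small integer code."""
    dictionary = {}
    codes = []
    for value in values:
        # len(dictionary) is evaluated before insertion, so codes are 0, 1, 2, ...
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def count_by_code(codes):
    """Aggregate on the integer codes; no string comparison is needed."""
    return Counter(codes)


def decode_counts(dictionary, counts):
    """Translate codes back to strings once, on the small final result."""
    reverse = {code: value for value, code in dictionary.items()}
    return {reverse[code]: n for code, n in counts.items()}


cities = ["Hangzhou", "Beijing", "Hangzhou", "Shanghai", "Beijing", "Hangzhou"]
dictionary, codes = dictionary_encode(cities)
result = decode_counts(dictionary, count_by_code(codes))
print(result)
```

The aggregation touches only integers, and the number of decode operations equals the number of distinct groups rather than the number of rows.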

Compute-storage separation

Introduced in StarRocks 3.0, the compute-storage separation architecture decouples computing from storage so each can scale independently. Compute nodes scale within seconds, eliminating the over-provisioning required when compute and storage must grow together.

The storage layer can use a range of object storage services with nearly unlimited capacity, and it is also compatible with Hadoop Distributed File System (HDFS). The compute-storage separation architecture retains full feature parity with the compute-storage integration architecture: data updates, data lake analysis, and materialized view acceleration all work the same way. Data write performance and hot-data query performance are nearly identical in the two architectures.

Cost-based optimizer

In complex multi-table join queries, the number of valid execution plans grows exponentially with the number of tables, making optimal plan selection NP-hard. The StarRocks CBO uses a Cascades-like architecture customized for the vectorized execution engine. It supports:

Common sub-expression reuse and subquery rewriting
Lateral Join and Join Reorder
Distributed join execution policy selection
Low-cardinality dictionary encoding optimization

The CBO supports all 99 TPC-DS queries.
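A quick calculation shows why exhaustive plan enumeration does not scale and a cost-based search is needed. The sketch below counts only join orders for n tables, ignoring physical operator choices and the pruning a real optimizer applies. The function names are hypothetical.

```python
from math import factorial


def left_deep_orders(n):
    """Join orders when each join adds one table to a running result: n!"""
    return factorial(n)


def bushy_orders(n):
    """Ordered binary join trees over n tables: (2n - 2)! / (n - 1)!"""
    return factorial(2 * n - 2) // factorial(n - 1)


for n in (2, 3, 5, 10):
    print(n, left_deep_orders(n), bushy_orders(n))
```

At 10 tables the search space already holds billions of candidate join trees, which is why the optimizer relies on cost estimates and pruning rather than trying every plan.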

Real-time columnar storage engine

StarRocks stores data in a columnar format, improving compression ratios, reducing disk I/O, and accelerating queries that read only a subset of columns — the common pattern in OLAP workloads. StarRocks allows you to load data within seconds and provides near-real-time data processing capabilities.
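The benefit of a columnar layout can be illustrated with a small Python sketch. It is a simplified model with hypothetical names and made-up sample rows: each column is stored as its own list, so an aggregate reads one list only, and repeated values in a column compress well with a scheme such as run-length encoding.

```python
from itertools import groupby

# Sample rows invented for this sketch.
rows = [
    {"order_id": 1, "region": "east", "amount": 30},
    {"order_id": 2, "region": "east", "amount": 12},
    {"order_id": 3, "region": "west", "amount": 7},
    {"order_id": 4, "region": "west", "amount": 51},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_column(columns, name):
    """Read a single column; the other columns are never touched."""
    return sum(columns[name])


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
print(sum_column(columns, "amount"))         # 100
print(run_length_encode(columns["region"]))  # [('east', 2), ('west', 2)]
```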

The storage engine guarantees ACID (atomicity, consistency, isolation, and durability) for data imports. Each batch import either fully succeeds or fully fails, and concurrent transactions benefit from snapshot isolation. The engine also supports partial update and upsert operations. It uses primary key indexes with a Delete-and-Insert mode to filter out stale rows at write time, which avoids sort-and-merge overhead during reads and lets queries make full use of secondary indexes. As a result, query performance stays high even under heavy update load.
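The Delete-and-Insert idea can be sketched as follows. This is a toy in-memory model, not the actual storage format: the class `PrimaryKeyTable` and its fields are hypothetical. An upsert looks up the old row through the primary key index, marks it deleted, and appends the new version, so a scan only has to skip marked rows instead of merging versions.

```python
class PrimaryKeyTable:
    """Toy model of a primary key table that uses delete-and-insert."""

    def __init__(self):
        self.rows = []        # append-only row storage
        self.pk_index = {}    # primary key -> position of the live row
        self.deleted = set()  # positions marked deleted (a stand-in for a delete bitmap)

    def upsert(self, key, **values):
        old_pos = self.pk_index.get(key)
        if old_pos is None:
            row = {"key": key}
        else:
            # Partial update: start from the old row, then overwrite given columns.
            row = dict(self.rows[old_pos])
            self.deleted.add(old_pos)
        row.update(values)
        self.rows.append(row)
        self.pk_index[key] = len(self.rows) - 1

    def scan(self):
        """Return live rows; no sorting or merging of versions is required."""
        return [r for i, r in enumerate(self.rows) if i not in self.deleted]


table = PrimaryKeyTable()
table.upsert(1, name="a", qty=1)
table.upsert(2, name="b", qty=5)
table.upsert(1, qty=9)  # partial update of key 1
print(table.scan())
```

The cost of resolving versions is paid once at write time, which is what keeps reads cheap.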

Intelligent materialized views

Materialized views in StarRocks work automatically:

Automatic synchronization — when data in a source table changes, the corresponding materialized view detects and applies the update in real time, keeping data consistent
Transparent query rewriting — during query planning, StarRocks detects when a materialized view can accelerate a query and rewrites the query automatically; no application changes required
Background lifecycle management — create and delete materialized views without manual intervention; the system handles the operation in the background
ETL replacement — use materialized views to transform and process data in place, replacing traditional extract, transform, and load (ETL) pipelines and upstream pre-processing
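Transparent query rewriting from the list above can be illustrated with a minimal Python sketch. It assumes a single materialized view that stores SUM(amount) grouped by (region, day). A SUM query whose GROUP BY columns are a subset of the view's keys is answered by rolling up the view, and any other query falls back to the base table. All names and sample rows are hypothetical, and real rewrite rules cover far more cases.

```python
from collections import defaultdict

BASE_COLUMNS = ("region", "day", "product")  # grouping columns of the base table
VIEW_KEYS = ("region", "day")                # GROUP BY keys of the materialized view

# Sample rows invented for this sketch: (region, day, product, amount).
sales = [
    ("east", "2024-01-01", "pen", 10),
    ("east", "2024-01-02", "ink", 20),
    ("west", "2024-01-01", "pen", 5),
]


def build_view(rows):
    """Pre-aggregate SUM(amount) grouped by (region, day)."""
    view = defaultdict(int)
    for region, day, _product, amount in rows:
        view[(region, day)] += amount
    return dict(view)


def sum_amount(group_by, rows, view):
    """Return (totals, source) for SUM(amount) GROUP BY group_by."""
    if set(group_by) <= set(VIEW_KEYS):
        # Rewrite: roll up the pre-aggregated subtotals instead of scanning rows.
        positions = [VIEW_KEYS.index(c) for c in group_by]
        source, items = "view", view.items()
    else:
        # No rewrite possible: aggregate the base rows directly.
        positions = [BASE_COLUMNS.index(c) for c in group_by]
        source, items = "base", ((row[:3], row[3]) for row in rows)
    totals = defaultdict(int)
    for key, amount in items:
        totals[tuple(key[p] for p in positions)] += amount
    return dict(totals), source


view = build_view(sales)
print(sum_amount(("region",), sales, view))
print(sum_amount(("product",), sales, view))
```

The caller issues the same query in both cases; only the planner's choice of source differs, which is why no application changes are needed.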

Data lake analysis

Use external catalogs to query data lakes directly — no data migration required. StarRocks supports:

Table formats: Apache Hive, Apache Iceberg, Apache Hudi, Delta Lake
File formats: Parquet, ORC, CSV
Storage services: HDFS, Amazon Simple Storage Service (S3), Alibaba Cloud Object Storage Service (OSS)

In this model, data lakes serve as the single source of truth (SSOT) for BI, AI, ad hoc queries, and reporting workloads, while StarRocks handles compute and analysis using its vectorized engine and CBO.

What Serverless StarRocks adds

Running StarRocks yourself means provisioning clusters, planning version upgrades, configuring security, and monitoring the system. EMR Serverless StarRocks eliminates this operational overhead:

No cluster management: the service is fully managed and requires no operations and maintenance (O&M) work from you, so you can skip cluster sizing, setup, and ongoing tuning
Visualized instance management — manage instances and run O&M tasks from the EMR console
Visualized monitoring — built-in monitoring and O&M dashboards
Automatic version upgrades — major and minor StarRocks versions upgrade automatically
Enterprise-grade management with EMR StarRocks Manager:
- *Security*: manage users and permissions
- *Diagnostic analysis*: identify slow SQL queries and analyze SQL execution with visual tools
- *Data management*: browse databases, tables, partitions, shards, and tasks to streamline O&M