Alibaba Cloud E-MapReduce: Serverless Open-Source Big Data Platform

Fully managed cloud-native big data platform for elastic Data+AI lakehouse solutions.

Alibaba Cloud EMR provides leading technology and elastic resources, and is committed to being a reliable partner for global enterprises in their digital transformation. Whether you want to build a modern data lake or a high-performance real-time data warehouse, EMR provides a one-stop solution. Leveraging our global infrastructure footprint, we help Chinese enterprises expand overseas efficiently and provide overseas enterprises with stable, secure, and compliant big data services.

What is Alibaba Cloud EMR?

Alibaba Cloud E-MapReduce (EMR) is a fully managed, cloud-native big data platform built on the open-source ecosystem. It is compatible with Hadoop, Spark, StarRocks, Presto, Hudi, Iceberg, and Paimon, and provides three flexible deployment models: Serverless, on ECS, and on ACK.

As one of the core compute and analytics engines of the OpenLake solution, EMR not only supports traditional offline batch processing and real-time stream processing, but also seamlessly integrates AI capabilities, empowering enterprises to build a unified data lake foundation and accelerate the large-scale adoption of Data + AI.

Core Products

EMR supports three deployment models: EMR Serverless, EMR on ECS, and EMR on ACK. Resources scale elastically within seconds to follow business workloads. The Serverless model is billed on usage and requires no reserved resources. Combined with a tiered storage strategy for hot and cold data, it significantly reduces the total cost of ownership (TCO) for enterprises.
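To make the cost argument concrete, here is a minimal Python sketch that compares usage-based billing with a cluster sized for peak load and left running all day. Every number in it (the unit price, the hourly workload profile, and the idea of a generic "compute unit") is a hypothetical assumption for illustration, not Alibaba Cloud pricing.

```python
# Illustrative only: every price and workload figure below is a made-up
# assumption, not an actual Alibaba Cloud EMR price.

HOURS_PER_DAY = 24


def reserved_cost(peak_units: int, price_per_unit_hour: float) -> float:
    """Daily cost of a cluster sized for peak load and kept running all day."""
    return peak_units * price_per_unit_hour * HOURS_PER_DAY


def usage_cost(hourly_units: list[int], price_per_unit_hour: float) -> float:
    """Daily cost when only the compute units used in each hour are billed."""
    return sum(hourly_units) * price_per_unit_hour


# Hypothetical workload: a 4-hour nightly batch peak, light usage otherwise.
hourly_units = [100] * 4 + [10] * 20
price = 0.05  # hypothetical price per compute unit per hour

reserved = reserved_cost(max(hourly_units), price)
on_demand = usage_cost(hourly_units, price)
savings = 1 - on_demand / reserved

print(f"reserved: {reserved:.2f}, usage-based: {on_demand:.2f}, savings: {savings:.0%}")
```

The sketch uses the same unit price for both models to isolate the effect of idle capacity. A real comparison would plug in the actual prices of each model, and the gap narrows as the workload gets closer to a constant load.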

EMR Serverless Series

EMR Serverless Spark

A high-performance Lakehouse product for Data + AI, with a built-in Fusion Engine delivering up to 4x the performance of open-source Spark. It requires no cluster management, scales elastically within seconds, and is billed on a pay-as-you-go basis, making it ideal for complex ETL tasks, machine learning training, and interactive analytics. It also natively supports GPU scheduling and the Ray distributed framework, covering needs that range from traditional data processing to multimodal AI scenarios.

EMR Serverless StarRocks

A cloud-native, fully managed Lakehouse analytics service that is 100% compatible with open-source StarRocks and delivers up to 10x better lake table query performance than the open-source version. Enterprises can use StarRocks to build AI-enabled, efficient OLAP analytics and accelerate the implementation of real-time data warehouses, lakehouse analytics, and lightweight data warehouse solutions.

EMR on ECS

This deployment model runs EMR clusters on Elastic Compute Service (ECS) instances, combining EMR’s big data processing capabilities with flexible configuration and management of the underlying cluster. It is suitable for batch processing, stream processing, data lakes, and other scenarios. It gives users full control over cluster configuration and resources, making it ideal for long-running offline jobs, customized environment requirements, and enterprises that need complete control over infrastructure.

EMR on ACK

This model deploys EMR on Alibaba Cloud Container Service for Kubernetes (ACK). It leverages ACK’s strengths in service deployment and container application management to reduce operational effort on the underlying cluster resources, so you can focus on the big data tasks themselves. Big data workloads can share resources with online applications, and tidal scheduling maximizes resource utilization. For enterprises that have already adopted a containerized environment and want unified operations and maintenance, this is the best choice, because it reuses existing container resources and supports mixed online and offline deployments.
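The idea behind tidal scheduling can be sketched in a few lines of Python: online applications follow a daily tide, and offline big data jobs backfill whatever capacity they leave free. The cluster size and the demand curve below are invented for illustration, and the sketch assumes the best case in which offline jobs can absorb all spare cores. It does not describe how ACK actually schedules pods.

```python
# Hypothetical sketch of tidal scheduling on a shared cluster.
# All capacities and demand curves are invented for illustration.

CLUSTER_CORES = 100

# Cores needed by online applications per hour: busy by day, quiet at night.
online_demand = [20] * 8 + [80] * 12 + [20] * 4  # hours 0-23


def offline_capacity(online: list[int], total: int) -> list[int]:
    """Cores left over for offline big data jobs in each hour."""
    return [max(total - used, 0) for used in online]


def utilization(used_per_hour: list[int], total: int) -> float:
    """Average fraction of the cluster in use across the day."""
    return sum(used_per_hour) / (total * len(used_per_hour))


spare = offline_capacity(online_demand, CLUSTER_CORES)
online_only = utilization(online_demand, CLUSTER_CORES)
# Assumption: offline jobs can absorb all spare cores (best case).
shared = utilization(
    [o + s for o, s in zip(online_demand, spare)], CLUSTER_CORES
)

print(f"online only: {online_only:.0%}, with offline backfill: {shared:.0%}")
```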

Product Advantages

High Performance and High Reliability

EMR is 100% compatible with open-source community components. By optimizing those components and enhancing the Alibaba Cloud deployment environment, it delivers 3–5x higher performance than the open-source versions. EMR products evolve in step with open-source releases, avoiding version compatibility issues between open-source components.

In the TPC-H benchmark for “data analytics,” Alibaba Cloud EMR Serverless StarRocks (Stella 1.2.0 engine) took first place worldwide with a QphH score of over 7.54 million, outperforming the runner-up by 111%.

In the TPC-DS benchmark for “decision support,” Alibaba Cloud EMR Serverless Spark (Fusion 2.0 engine), combined with DLF, took first place worldwide with a QphDS score of over 65.68 million, leading the runner-up by 100% in performance and by 500% in cost-performance.

EMR services achieve 99.9% availability and industry-leading data reliability, while also providing comprehensive data encryption, access control, and audit logging capabilities to fully protect enterprise data security.

Deep Integration of Data + AI

EMR introduces EMR Agent, which allows users to query resource information, trigger related operations, diagnose component anomalies, and obtain technical support using natural language.

EMR also integrates AI Function capabilities, directly encapsulating LLM capabilities into standard SQL or PySpark functions. Without building model services or writing API calls, users can perform sentiment analysis, text translation, information extraction, and intelligent reasoning through simple function calls. It is widely used in log analysis, customer profiling, IoT data processing, and financial risk control, effectively reducing IT costs and simplifying operations, enabling enterprises to focus on core business innovation.
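The pattern of exposing a model as an ordinary SQL function can be tried locally. The Python sketch below is only an analogy, built on assumptions: SQLite stands in for the query engine, a keyword rule stands in for the LLM, and the function name `ai_sentiment` is made up for illustration and is not a documented EMR function.

```python
import sqlite3

# Hypothetical stand-in for an LLM call: a trivial keyword rule.
# A real AI Function would invoke a hosted model instead.
POSITIVE = {"great", "fast", "love"}
NEGATIVE = {"slow", "broken", "hate"}


def fake_sentiment(text: str) -> str:
    """Classify text as positive, negative, or neutral by keyword match."""
    words = set(text.lower().split())
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "neutral"


# SQLite is used only so the sketch runs locally; it is not an EMR engine.
conn = sqlite3.connect(":memory:")
# "ai_sentiment" is an illustrative name, not a documented EMR function.
conn.create_function("ai_sentiment", 1, fake_sentiment)

conn.execute("CREATE TABLE reviews (id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [
        (1, "Queries are fast now"),
        (2, "The dashboard is broken"),
        (3, "Upgraded on Monday"),
    ],
)

# The classifier is reachable from plain SQL, with no API client in the query.
rows = conn.execute(
    "SELECT id, ai_sentiment(body) FROM reviews ORDER BY id"
).fetchall()
print(rows)
```

The point of the design is that analysts keep writing SQL: the model call is hidden behind a function, so the query text does not change if the underlying model is swapped.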

Typical Application Scenarios

Data Lakes and Lakehouse Construction

Based on OSS and Hudi/Iceberg/Paimon, EMR builds a compute-storage decoupled architecture to support unified streaming and batch processing, real-time data ingestion into the lake, and unified metadata management. It empowers enterprises to build standardized data platforms and meet the needs of multi-source data integration, self-service analytics, and long-term data governance.

Real-Time Data Warehousing and Risk Control Decisions

With Flink and StarRocks, EMR builds a real-time data processing system in which data ingestion, computation, and query response all complete within seconds. It is widely used in financial risk control, e-commerce real-time marketing, user behavior analysis, IoT real-time monitoring, and other scenarios.

Large-Scale Offline ETL and Data Warehousing

Based on Spark and Hive, EMR supports massive data cleansing, transformation, aggregation, and modeling, enabling enterprise T+1 reporting, BI analytics, user profiling, and business decision-making systems.

Unified AI and Data Science

EMR provides Serverless Spark and a managed Notebook environment, supporting the full workflow of feature engineering, data exploration, and model training. It integrates seamlessly with deep learning frameworks, empowering enterprises to quickly implement AI applications such as recommendation systems, anomaly detection, and predictive analytics.

Customer Cases

Case 1: Hypergryph Built a Cloud-Native Big Data Architecture Based on EMR Serverless Spark

Hypergryph is a young and innovative game company dedicated to developing game products that are both challenging and artistically valuable. The company’s business currently spans the entire lifecycle of game development, operation, and publishing. As its business expanded, Hypergryph evolved from a single hit title to a multi-track, multi-platform, and global strategic layout, and it carried out a comprehensive optimization and upgrade of its data operations.

Hypergryph built a cloud-native big data architecture on Alibaba Cloud EMR Serverless Spark and applied it to games such as Arknights, addressing the data surges, resource elasticity, and stability requirements that come with frequent in-game events. EMR Serverless Spark supports Hive/Paimon metadata, integrates with Airflow/DolphinScheduler, and improves engine performance through the built-in Fusion engine and Celeborn service, backed by high community compatibility and professional technical support. Together, these enabled optimizations across data ingestion (self-developed tools + Flink CDC), offline scheduling (dual-engine integration), and online computing (StarRocks + BI). After migration, development efficiency improved significantly while stability was maintained: computation in metric calculation scenarios was accelerated by 50%, and the core SLA chain was shortened by 1.5 hours.

Case 2: Xiaohongshu Achieved the Industry’s Largest Zero-Failure Data Lake Migration with Alibaba Cloud EMR

Xiaohongshu is a social platform for young people and one of China’s leading internet companies. In recent years, as its business has grown rapidly, demand for online data processing has continued to increase, while offline processing has accumulated a large number of historical issues. Xiaohongshu decided to move its data lake to Alibaba Cloud, which meant overcoming several challenges: a large volume of existing data, disorganized historical data (unowned tasks and many non-standard operations), difficult dual-run validation for business systems, and complex cross-team collaboration. Based on Alibaba Cloud EMR + DLF, it migrated a hundred-PB-scale data lake to the cloud. Using DLF’s standardized product capabilities for unified incremental synchronization, along with a bidirectional dual-run strategy and data validation and repair, it achieved a core data deviation rate of less than 0.1%. JindoSDK’s automatic routing ensured seamless migration of non-standard tasks during the cutover phase. Ultimately, it completed the industry’s largest zero-failure data lake migration, involving 500 PB of data, 110,000 tasks, 1,500 participants, and more than 40 departments.
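A dual-run check of this kind can be sketched in Python as follows. The source does not say how the deviation rate was computed, so the definition here (the share of keys whose values are missing or differ between the two runs) is an assumption, and the metric names and values are made up.

```python
# Hypothetical sketch of dual-run validation: the same task runs on the old
# and new platforms, and the two outputs are compared key by key.


def deviation_rate(
    source: dict[str, float], target: dict[str, float], rel_tol: float = 1e-9
) -> float:
    """Fraction of keys whose values are missing or differ between two runs."""
    keys = source.keys() | target.keys()
    if not keys:
        return 0.0
    mismatched = 0
    for key in keys:
        if key not in source or key not in target:
            mismatched += 1  # present on one side only
            continue
        a, b = source[key], target[key]
        if abs(a - b) > rel_tol * max(abs(a), abs(b), 1.0):
            mismatched += 1  # values disagree beyond the tolerance
    return mismatched / len(keys)


# Made-up daily metrics from the legacy run and the migrated run.
legacy = {f"metric_{i}": float(i) for i in range(2000)}
migrated = dict(legacy)
migrated["metric_7"] = 7.5  # one value drifted during migration

rate = deviation_rate(legacy, migrated)
print(f"deviation rate: {rate:.4%}")
assert rate < 0.001  # the bar reported in the case: below 0.1%
```

Keys that fail the check are the natural input to a repair step, which would re-sync or recompute only the mismatched records instead of the whole table.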

Contact Us

Want to see how EMR powers elastic, fully managed big data analytics with Spark, StarRocks, and lakehouse-native AI? 👉 Try EMR on Alibaba Cloud or talk to our solution architect to explore how you can build data lakes, real-time warehouses, and Data + AI workflows with seamless scalability.

More Resources:

E-MapReduce Documentation

E-MapReduce Serverless Spark Free Trial: 1000 CU*H, 3 months
