E-MapReduce:Spark

Last Updated: Jun 03, 2026

Spark is a distributed analytics engine for big data workloads. It combines in-memory caching with a directed acyclic graph (DAG) execution engine to deliver high performance across batch processing, interactive queries, and real-time streaming.
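
To make the combination of DAG execution and in-memory caching more concrete, the following is a minimal pure-Python sketch of the idea. It does not use the Spark API: `LazyDataset`, `load_numbers`, and `stats` are hypothetical names invented for this illustration, and the "plan" here is only a linear chain of steps, the simplest kind of DAG. Transformations are recorded rather than executed, an action (`collect`) runs the recorded plan, and a cached result is reused instead of rereading the source.

```python
from typing import Any, Callable, Iterable, List, Optional


# Hypothetical toy class for illustration only; it is not part of the Spark API.
class LazyDataset:
    def __init__(self, source: Callable[[], Iterable[Any]],
                 steps: Optional[List[Callable]] = None) -> None:
        self._source = source             # how to read the raw data
        self._steps = list(steps or [])   # recorded transformations (the plan)
        self._cached: Optional[List[Any]] = None
        self._keep_in_memory = False

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Record the step; nothing is computed yet.
        step = lambda rows: (fn(r) for r in rows)
        return LazyDataset(self._source, self._steps + [step])

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        step = lambda rows: (r for r in rows if pred(r))
        return LazyDataset(self._source, self._steps + [step])

    def cache(self) -> "LazyDataset":
        # Ask for the result to be kept in memory after the first action.
        self._keep_in_memory = True
        return self

    def collect(self) -> List[Any]:
        # Action: reuse the cached result if present, otherwise run the plan.
        if self._cached is not None:
            return list(self._cached)
        rows: Iterable[Any] = self._source()
        for step in self._steps:
            rows = step(rows)
        result = list(rows)
        if self._keep_in_memory:
            self._cached = result
        return list(result)


stats = {"source_reads": 0}


def load_numbers() -> Iterable[int]:
    # Stand-in for an expensive read from storage.
    stats["source_reads"] += 1
    return range(10)


plan = (LazyDataset(load_numbers)
        .filter(lambda x: x % 2 == 0)
        .map(lambda x: x * x)
        .cache())
reads_before_action = stats["source_reads"]  # still 0: no work has been done

first = plan.collect()   # runs the plan and reads the source once
second = plan.collect()  # served from memory, no second read
```

In this sketch the source is read exactly once even though `collect` is called twice, which is the effect that caching an intermediate result is meant to achieve for iterative or interactive workloads.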

Architecture

Spark Core is the foundation of the Spark platform. Four specialized libraries build on top of it:

  • Spark SQL: supports offline extract, transform, and load (ETL) and online analytical processing (OLAP)

  • Spark Streaming: supports real-time stream processing

  • MLlib: supports machine learning workloads

  • GraphX: supports graph computing

For the full library reference, visit the Apache Spark official website.

Scenarios

  • Offline ETL

    Offline ETL covers data warehousing scenarios where large volumes of data are extracted, transformed, and loaded in bulk. These jobs typically run on a schedule rather than on demand. Spark's in-memory processing and DAG-based optimization make it well suited for large-scale batch ETL pipelines.

  • OLAP

    OLAP covers business intelligence (BI) scenarios where analysts submit interactive queries and expect fast results. Common OLAP engines include Presto, Impala, and Spark. Choose Spark when your queries involve complex transformations or when you need a single engine that handles ETL and ML workloads on the same cluster. E-MapReduce (EMR) Spark 2.4 already supports the main features of Spark 3.0. For more information, see the Spark SQL guide.

  • Stream processing

    Stream processing covers real-time data scenarios such as dashboard updates, risk management, recommendations, monitoring, and alerting. Two common engines run on EMR: Spark Streaming and Flink. Spark Streaming offers two APIs: DStream and Structured Streaming. Structured Streaming uses the same DataFrame model as batch jobs, so teams that already write batch jobs can adopt it with little additional learning. Use Flink when your workload requires the lowest possible end-to-end latency. Use Spark Streaming when you need higher throughput. For more information, see the Structured Streaming programming guide.

  • Machine learning

    MLlib is a scalable machine learning library built into Spark. It includes algorithms for classification, regression, collaborative filtering, and clustering, as well as tools for model selection, automatic hyperparameter tuning, and cross-validation. MLlib focuses on non-deep-learning algorithms and integrates directly with Spark's distributed data model, so no separate training cluster is required. For more information, see the Machine Learning Library (MLlib) guide.

  • Graph computing

    GraphX is Spark's graph computing library. It supports property operators, structural operators, join operators, and neighborhood aggregation operators, covering most graph analytics workloads without requiring a dedicated graph database. For more information, see the GraphX programming guide.
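
As an illustration of what a neighborhood aggregation operator does, the following pure-Python sketch computes, for each user in a small follower graph, the average age of that user's followers. It is loosely modeled on the send-then-merge pattern of GraphX's `aggregateMessages` operator but does not use the GraphX API: the function `aggregate_messages`, its signature, and the sample data are all assumptions made for this example, and everything runs on a single machine.

```python
from typing import Any, Callable, Dict, Hashable, Iterable, List, Tuple

Message = Tuple[int, int]  # (follower count, sum of follower ages)


def aggregate_messages(
    vertices: Dict[Hashable, Any],
    edges: Iterable[Tuple[Hashable, Hashable]],
    send: Callable[[Hashable, Any, Hashable, Any], List[Tuple[Hashable, Any]]],
    merge: Callable[[Any, Any], Any],
) -> Dict[Hashable, Any]:
    """Hypothetical helper: send messages along edges, then merge them per vertex."""
    inbox: Dict[Hashable, Any] = {}
    for src, dst in edges:
        # Each edge may emit zero or more (target vertex, message) pairs.
        for target, msg in send(src, vertices[src], dst, vertices[dst]):
            inbox[target] = merge(inbox[target], msg) if target in inbox else msg
    return inbox


# Sample data (invented): vertex attribute is the user's age,
# and an edge (a, b) means "a follows b".
ages = {"alice": 30, "bob": 25, "carol": 41}
follows = [("alice", "bob"), ("carol", "bob"), ("bob", "carol")]


def send_age(src, src_age, dst, dst_age) -> List[Tuple[Hashable, Message]]:
    # Every follower sends (1, own age) to the user it follows.
    return [(dst, (1, src_age))]


def add_pairs(a: Message, b: Message) -> Message:
    # Combine two messages by summing counts and ages.
    return (a[0] + b[0], a[1] + b[1])


totals = aggregate_messages(ages, follows, send_age, add_pairs)
average_follower_age = {user: total / count for user, (count, total) in totals.items()}
```

Vertices that receive no messages (here, `alice`, who has no followers) do not appear in the result. Keeping the merge function associative and commutative, as `add_pairs` is, is what lets this kind of aggregation be combined in any order across partitions in a distributed setting.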