Spark is a distributed analytics engine for big data workloads. It combines in-memory caching with a directed acyclic graph (DAG) execution engine to deliver high performance across batch processing, interactive queries, and real-time streaming.
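To make the combination of DAG execution and in-memory caching concrete, here is a minimal pure-Python sketch. It is not Spark's API: the `LazyDataset` class and its `evaluations` counter are hypothetical names invented for illustration. Transformations only record a step in the graph. Work happens when an action such as `count` or `collect` is called, and a cached node is computed once and then reused.

```python
class LazyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, not a Spark class)."""

    def __init__(self, compute, name):
        self._compute = compute        # function that produces this node's records
        self.name = name
        self._cache_enabled = False
        self._cached = None
        self.evaluations = 0           # how many times this node was actually computed

    # Transformations: build a new node in the graph, run nothing yet.
    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self._materialize()],
                           f"map({self.name})")

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)],
                           f"filter({self.name})")

    def cache(self):
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached        # reuse the in-memory result
        self.evaluations += 1
        result = list(self._compute())
        if self._cache_enabled:
            self._cached = result
        return result

    # Actions: trigger evaluation of the whole upstream graph.
    def collect(self):
        return self._materialize()

    def count(self):
        return len(self._materialize())


source = LazyDataset(lambda: range(10), "source")
squares = source.map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)

evaluations_before_action = squares.evaluations   # still 0: nothing has run

total = evens.count()      # first action: computes source, squares, evens
listed = evens.collect()   # second action: squares comes from the cache
```

After both actions, `squares.evaluations` and `source.evaluations` are each 1, because the second action reads the cached squares instead of recomputing them from the source.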
Architecture
Spark Core is the foundation of the Spark platform. Four specialized libraries build on top of it:
Spark SQL: supports offline extract, transform, and load (ETL) and online analytical processing (OLAP)
Spark Streaming: supports real-time stream processing
MLlib: supports machine learning workloads
GraphX: supports graph computing
For the full library reference, visit the Apache Spark official website.
Scenarios
- Offline ETL
Offline ETL covers data warehousing scenarios where large volumes of data are extracted, transformed, and loaded in bulk. These jobs typically run on a schedule rather than on demand. Spark's in-memory processing and DAG-based optimization make it well suited for large-scale batch ETL pipelines.
- OLAP
OLAP covers business intelligence (BI) scenarios where analysts submit interactive queries and expect fast results. Common OLAP engines include Presto, Impala, and Spark. Choose Spark when your queries involve complex transformations or when you need a single engine that handles ETL and ML workloads on the same cluster. E-MapReduce (EMR) Spark 2.4 supports the main features of Spark 3.0. For more information, see the Spark SQL guide.
- Stream processing
Stream processing covers real-time data scenarios such as dashboard updates, risk management, recommendations, monitoring, and alerting. Two common engines run on EMR: Spark Streaming and Flink. Spark Streaming offers two APIs: DStream and Structured Streaming. Structured Streaming uses the same DataFrame model as batch jobs, so it is easy to learn if you already write batch jobs. Use Flink when your workload requires the lowest possible end-to-end latency. Use Spark Streaming when throughput matters more than latency. For more information, see the Structured Streaming programming guide.
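The idea that one query definition can serve both batch and streaming input can be sketched in plain Python. This is a conceptual model only, not the Structured Streaming API: `count_events` and `micro_batches` are hypothetical helper names, and the event list is made-up sample data. The same counting function runs once over the full dataset and then incrementally over small batches, and both paths reach the same result.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_events(records: Iterable[str]) -> Counter:
    """The shared 'query': count occurrences of each event type."""
    return Counter(records)


def micro_batches(stream: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split an incoming stream into consecutive batches of at most `size` records."""
    batch: List[str] = []
    for record in stream:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:                      # emit the final, possibly shorter, batch
        yield batch


# Made-up sample events for illustration.
events = ["click", "view", "click", "buy", "view", "click", "view"]

# Batch mode: run the query once over all the data.
batch_result = count_events(events)

# Streaming mode: run the same query per micro-batch and fold into running state.
running = Counter()
batches_seen = 0
for chunk in micro_batches(events, size=3):
    running.update(count_events(chunk))
    batches_seen += 1
```

Once the stream has been fully consumed, `running` equals `batch_result`. A real engine additionally has to handle state storage, late data, and fault recovery, none of which this sketch models.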
- Machine learning
MLlib is a scalable machine learning library built into Spark. It includes algorithms for classification, regression, clustering, and collaborative filtering, as well as tools for model selection, automatic hyperparameter tuning, and cross-validation. MLlib focuses on non-deep-learning algorithms and integrates directly with Spark's distributed data model, so no separate training cluster is required. For more information, see the Machine Learning Library (MLlib) guide.
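Model selection by cross-validated hyperparameter tuning can be illustrated with a small stdlib-only sketch. It does not use MLlib, where `CrossValidator` and a parameter grid play this role. The one-dimensional ridge model, the helper names, and the synthetic data `y = 2x` are all assumptions chosen to keep the example short and deterministic.

```python
from statistics import mean


def fit_ridge(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return mean((w * x - y) ** 2 for x, y in zip(xs, ys))


def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold i holds out every k-th row starting at i."""
    for i in range(k):
        test = [j for j in range(n) if j % k == i]
        train = [j for j in range(n) if j % k != i]
        yield train, test


def cross_validate(xs, ys, lam, k=3):
    """Average held-out error of the model across k folds for one hyperparameter value."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        w = fit_ridge([xs[j] for j in train], [ys[j] for j in train], lam)
        scores.append(mse(w, [xs[j] for j in test], [ys[j] for j in test]))
    return mean(scores)


# Synthetic, noise-free data (assumption for illustration): y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]

# Hyperparameter grid: candidate regularization strengths.
grid = [0.0, 1.0, 10.0]
cv_error = {lam: cross_validate(xs, ys, lam) for lam in grid}
best_lam = min(cv_error, key=cv_error.get)
```

Because the data is noise-free, any regularization only shrinks the weight away from the true value, so the search selects `lam = 0.0`. On real data, the grid search would trade off fit against overfitting.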
- Graph computing
GraphX is Spark's graph computing library. It supports property operators, structural operators, join operators, and neighborhood aggregation operators, covering most graph analytics workloads without requiring a dedicated graph database. For more information, see the GraphX programming guide.
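As an illustration of what a neighborhood aggregation operator does, the sketch below computes the average age of each user's followers in plain Python. It is a simplified model in the spirit of GraphX's `aggregateMessages` (a send function per edge, a merge function per vertex), not the GraphX API itself. The `aggregate_messages` helper and the small graph are hypothetical, and unlike GraphX this version only delivers messages to the destination vertex.

```python
# Property graph (made-up data): vertex id -> age, edges as (follower, followed) pairs.
ages = {1: 30, 2: 25, 3: 40, 4: 35}
follows = [(1, 2), (3, 2), (4, 2), (1, 3), (2, 3)]


def aggregate_messages(edges, send, merge):
    """For each edge, send(src, dst) produces a message for dst;
    merge combines all messages that arrive at the same vertex."""
    inbox = {}
    for src, dst in edges:
        msg = send(src, dst)
        inbox[dst] = merge(inbox[dst], msg) if dst in inbox else msg
    return inbox


# Each follower sends (1, own_age); messages are summed component-wise.
totals = aggregate_messages(
    follows,
    send=lambda src, dst: (1, ages[src]),
    merge=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)

# Turn (follower_count, total_age) into an average per followed vertex.
avg_follower_age = {v: total / count for v, (count, total) in totals.items()}
```

Vertices with no incoming edges (1 and 4 here) receive no messages and therefore do not appear in the result.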