Spark on MaxCompute is a computing service that is provided by MaxCompute and is compatible with open source Spark. It provides a Spark computing framework built on the same computing resources and dataset permission system as the rest of MaxCompute, and it lets you submit and run Spark jobs with your preferred development method. Spark on MaxCompute can meet a wide range of data processing and analysis requirements.

Limits

  • Spark on MaxCompute can be used to perform the following operations:
    • Run offline jobs written in Java, Scala, or Python (PySpark), including jobs that use GraphX, MLlib, RDDs, and Spark SQL.
    • Read data from and write data to MaxCompute tables.
    • Process the unstructured data that is stored in Object Storage Service (OSS).
    • Read data from and write data to services deployed in virtual private clouds (VPCs), such as ApsaraDB RDS, ApsaraDB for Redis, and services deployed on Elastic Compute Service (ECS) instances.
  • Spark on MaxCompute cannot be used to perform the following operations:
    • Perform stream processing.
    • Run interactive jobs, such as Spark-Shell, Spark-SQL-Shell, and PySpark-Shell jobs.
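
The table read and write capability listed above can be illustrated with a minimal PySpark batch job. This is only a sketch under stated assumptions: the table name `spark_sql_demo` and the helper names are hypothetical, and the job assumes that the submitted Spark session is already configured so that Spark SQL resolves table names against the MaxCompute project. It uses only the standard `SparkSession.sql` call and runs as an offline job, not in an interactive shell.

```python
"""Batch job sketch: write to and read from a table through Spark SQL."""

TABLE = "spark_sql_demo"  # hypothetical table name, used for illustration only


def build_statements(table):
    """Return the SQL statements that prepare the table, in execution order."""
    return [
        f"DROP TABLE IF EXISTS {table}",
        f"CREATE TABLE {table} (name STRING, num BIGINT)",
        f"INSERT INTO {table} VALUES ('abc', 100000)",
    ]


def run_job(spark, table=TABLE):
    """Create the table, insert one row, and return the rows read back."""
    for statement in build_statements(table):
        spark.sql(statement)
    return spark.sql(f"SELECT name, num FROM {table}").collect()


def create_session():
    """Build the SparkSession; PySpark is imported lazily, only when needed."""
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName("maxcompute-batch-demo").getOrCreate()
```

In a submitted job, the entry point would call `run_job(create_session())` and then stop the session. Keeping the SQL in `build_statements` makes the statements easy to review before the job touches any project data.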

Features

  • Supports different versions of native Spark jobs.

    MaxCompute supports native community Spark and is fully compatible with the APIs of all native Spark versions. Different versions of Spark can run in MaxCompute at the same time. Spark on MaxCompute provides native Spark Web UIs.

  • Runs based on unified computing resources.

    Similar to MaxCompute SQL and MapReduce, Spark on MaxCompute runs based on the unified computing resources that are purchased for MaxCompute projects.

  • Supports unified data and permission management.

    Spark on MaxCompute enforces the permissions that you configured for MaxCompute projects, so you can query data without making additional permission changes.

  • Provides the same user experience as open source systems.

    Spark on MaxCompute provides the same user experience as open source systems, including open source application UIs and online interaction. It supports the native, real-time UIs that are essential for debugging open source applications, and it lets you query historical logs. For some open source applications, it also supports real-time interaction in the background, which gives you an interactive experience.

Architecture

Spark on MaxCompute is an Alibaba Cloud solution that enables native Spark to run in MaxCompute.

The left part of the preceding figure shows the architecture of native Spark. The right part shows the architecture of Spark on MaxCompute, which runs on the Cupid platform developed by Alibaba Cloud. The Cupid platform is fully compatible with the computing frameworks that open source YARN supports.
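
Because the Cupid platform is described as YARN-compatible, a job submission can be sketched with the standard `spark-submit` flags for YARN cluster mode. The helper below is an illustration, not the service's documented submission procedure: the function names, the file name `job.py`, and the choice of `--master yarn --deploy-mode cluster` are assumptions to verify against the client you actually use. The code only assembles the command and does not run it.

```python
import shlex


def build_submit_command(app, main_class=None, conf=None, app_args=()):
    """Build a spark-submit argument list for YARN cluster mode (assumed mode)."""
    argv = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    if main_class:
        # Only JVM applications (Java or Scala) need an entry-point class.
        argv += ["--class", main_class]
    # Sort the settings so the generated command is deterministic.
    for key, value in sorted((conf or {}).items()):
        argv += ["--conf", f"{key}={value}"]
    argv.append(app)
    argv += list(app_args)
    return argv


def render(argv):
    """Render the argument list as a shell-safe, copy-pasteable command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


# Example with a hypothetical PySpark script and one standard Spark setting.
command = build_submit_command(
    "job.py",
    conf={"spark.executor.instances": "2"},
    app_args=["2024-01-01"],
)
print(render(command))
```

Building the command as a list rather than a single string keeps quoting correct if you later pass it to `subprocess.run`.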