Spark on MaxCompute is a MaxCompute computing service that is compatible with open source Spark. It provides the Spark computing framework on top of the unified computing resources and dataset permission system of MaxCompute, so you can submit and run Spark jobs by using your preferred development method. Spark on MaxCompute can meet a wide range of data processing and analysis requirements.

Limits

  • You can use Spark on MaxCompute to perform the following operations:
    • Run offline jobs that are written in Java, Scala, or Python (PySpark) and that use components such as GraphX, MLlib, RDD, and Spark SQL.
    • Read data from and write data to MaxCompute tables.
    • Process the unstructured data that is stored in Object Storage Service (OSS).
    • Read data from and write data to services deployed in virtual private clouds (VPCs), such as ApsaraDB RDS, ApsaraDB for Redis, and services deployed on Elastic Compute Service (ECS) instances.
    • Perform stream processing. To enable this feature, submit a ticket to MaxCompute technical support.
  • Spark on MaxCompute does not support interactive jobs, such as Spark-Shell, Spark-SQL-Shell, and PySpark-Shell jobs.
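
The following PySpark script is a minimal sketch of the kind of offline job described above: it reads one table, aggregates it, and writes the result to another table through Spark SQL. It is written as a standalone script because interactive shells are not supported. The sketch assumes that the tables of your MaxCompute project are visible through the Spark SQL catalog and that the connection settings are supplied when the job is submitted. The table names src_events and category_counts, the column category, and the helper build_insert_sql are hypothetical and used only for illustration.

```python
"""Offline job sketch: read one table, aggregate it, and write another table."""


def build_insert_sql(source: str, target: str) -> str:
    """Return an INSERT OVERWRITE statement that aggregates source into target.

    The column name `category` is a hypothetical example column.
    """
    return (
        f"INSERT OVERWRITE TABLE {target} "
        f"SELECT category, COUNT(*) AS cnt FROM {source} GROUP BY category"
    )


def main() -> None:
    # Imported here so that the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    # Assumption: project, endpoint, and credential settings are passed in at
    # submit time, so the script itself carries no connection details.
    spark = SparkSession.builder.appName("table_aggregation_sketch").getOrCreate()
    try:
        # Read from the source table and overwrite the target table in one statement.
        spark.sql(build_insert_sql("src_events", "category_counts"))
        # Read the result back to confirm that the write succeeded.
        spark.sql("SELECT * FROM category_counts").show()
    finally:
        spark.stop()


if __name__ == "__main__":
    main()
```

Keeping the SQL in a small helper makes the statement easy to check locally before the job is submitted to the cluster.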

Features

  • Supports different versions of native Spark jobs.

    Spark on MaxCompute supports native community Spark and is fully compatible with the APIs of all native Spark versions. Jobs that use different Spark versions can run in MaxCompute at the same time. Spark on MaxCompute also provides the native Spark web UIs.

  • Runs based on centralized computing resources.

    Like MaxCompute SQL and MapReduce, Spark on MaxCompute runs on the centralized computing resources that are purchased for MaxCompute projects.

  • Supports centralized data and permission management.

    Spark on MaxCompute enforces the permissions that are configured for MaxCompute projects, so you can query data without modifying the permissions on those projects.

  • Provides the same user experience as open source systems.

    Spark on MaxCompute provides the native open source application UIs and online interactions. You can use the native real-time UIs to debug open source applications, and you can also query historical logs. For some open source applications, Spark on MaxCompute can run real-time interactions in the backend, which provides an interactive experience.

Architecture

Spark on MaxCompute is an Alibaba Cloud solution that allows native Spark to run in MaxCompute.

The left part of the preceding figure shows the architecture of native Spark. The right part shows the architecture of Spark on MaxCompute, which runs on the Cupid platform developed by Alibaba Cloud. The Cupid platform is fully compatible with the computing framework supported by open source YARN.